  • riccardo

Here is how an R-Learning agent beats humans

Human intuition is usually good at dealing with concepts like averages or mean values, whereas it usually performs poorly when handling concepts like volatility and parameter uncertainty. Managing the inventory of a company, redirecting flows within an energy grid, rebalancing an investment portfolio, or even landing a rocket re-entering from orbit are all activities we would probably approach through average parameters if not provided with better solutions. Unfortunately, mean values work well only in specific settings, such as those covered by the Central Limit Theorem. The difference between good and bad performance, or between gains and losses, often lies in how well volatility is handled rather than averages.

In that regard, this article wants to show something arguably remarkable. The reinforcement-learning algorithm used in another recent post, focused on managing the inventory of a car dealership, is here boosted through a hidden neural network. A slightly different application is performed and its performance is compared to human performance. The result clearly shows a consistent gap between the two. Because of the nature of the reinforcement learning math deployed here, the algorithm is trained through simple interaction with its environment, without the need to reference past examples or data. As a result, the algorithm can find optimal behaviors possibly beyond the limitations of its trainer. That is what happens in this article and what will be shown below.

Brief recap: our problem is to handle the inventory of a US car dealer that places monthly orders to replenish its inventory and fulfill possible sales for the next month. The dealer then experiences a reward in terms of profit, made up as follows: sales of available cars, minus inventory & shipping costs, minus the cost of lost sales. If the inventory is too high we incur high storage costs; if it is too low we lose possible sales. Monthly sales are determined by a random draw from a normal distribution based on the statistics of the specific brand, which is the only information we assume a dealer would have when deciding on monthly orders. Here are the statistics of the three brands of cars we will consider in this post:

Parameters and statistics of the vehicles we will consider in this article

The reinforcement learning algorithm is boosted by trying to replicate the technique that, as far as we know, was developed and successfully deployed by DeepMind / Google for the game “Go”. That is a reinforcement learning algorithm that experiences the reward from the actions it takes at specific states, and is provided with a neural network to boost its ability to assign correct values to the action-state pairs. The network is almost a requirement considering the possibly infinite action-state space the algorithm must handle. In the previous post, we did not deploy that hidden neural network, and we gave the algorithm “visibility” of the environment through a matrix of 10 rows and 3 columns, one matrix for each brand of vehicle. We were therefore limiting its vision to 10 possible levels of inventory per brand (1 to 10 vehicles stored) and to 3 actions (order either 0, 1, or 2 vehicles per brand in any given month). By replacing those matrices with neural networks (NN), we no longer have that limitation. In particular, the level of inventory can now be any number (the number fed to the NN), and the possible order of new vehicles, while still limited, has a wider range, 1 to 10. While we could make the order size open-ended as well, we limited it to speed up the convergence of the solution, in part because of the limited computational capacity of our tools. We want to stress that, even if we are now leveraging a neural network, we still do not need past examples or data during training. The training is still executed by the reinforcement algorithm leveraging the reward it experiences while interacting with the environment. As per DeepMind’s implementation, we deploy two neural networks rather than one. That allows for better convergence during training by having the main NN gradually adjust its weights by referencing the second NN, which updates its parameters more rapidly.
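To make the setup above more concrete, here is a minimal pure-Python sketch of a value network with a single input (the inventory level) and one output per possible order size, plus a learning target computed from a second network. It is only an illustration under stated assumptions, not the code used for this post: the layer sizes, the exploration rate `epsilon`, the discount factor `gamma`, and all function names are hypothetical, and the weight-update step (including how the two networks are kept in sync) is left out.

```python
import random

rng = random.Random(0)
ORDERS = list(range(1, 11))  # order sizes 1..10, as stated in the text

def init_layer(n_in, n_out):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return {"w": weights, "b": [0.0] * n_out}

def dense(layer, x, relu=True):
    """Linear combination per neuron, optionally followed by ReLU."""
    out = []
    for j, bias in enumerate(layer["b"]):
        z = bias + sum(x[i] * layer["w"][i][j] for i in range(len(x)))
        out.append(max(0.0, z) if relu else z)
    return out

def init_network(hidden=8):
    # Assumed sizes: 1 input, two hidden layers, one output per order size
    return [init_layer(1, hidden), init_layer(hidden, hidden),
            init_layer(hidden, len(ORDERS))]

def q_values(network, inventory):
    """Estimated value of each order size at the given inventory level."""
    x = [float(inventory)]
    for layer in network[:-1]:
        x = dense(layer, x)
    return dense(network[-1], x, relu=False)

def choose_order(network, inventory, epsilon=0.1):
    """Explore with probability epsilon (assumed), otherwise pick the best order."""
    if rng.random() < epsilon:
        return rng.choice(ORDERS)
    values = q_values(network, inventory)
    return ORDERS[values.index(max(values))]

def td_target(second_network, reward, next_inventory, gamma=0.95):
    """Learning target built from the second network (gamma is an assumption)."""
    return reward + gamma * max(q_values(second_network, next_inventory))

main_net = init_network()
# The second network starts as an independent copy of the main one
second_net = [{"w": [row[:] for row in layer["w"]], "b": layer["b"][:]}
              for layer in main_net]

print(choose_order(main_net, inventory=3))
print(td_target(second_net, reward=1.0, next_inventory=2))
```

Because the network takes the inventory as a plain number, nothing in `q_values` restricts it to the 10 inventory levels of the earlier matrix-based version.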

Important note: at the end of the post, the interested reader can find a brief mathematical digression on our neural network. Having multiple neurons per layer while using only one parameter as input to the NN (the number of vehicles in inventory) is of little use, since the linear combination across a single layer collapses as if it were a single neuron. Conversely, having subsequent layers remains important because they capture the non-linearity. That is because, at the exit of each layer, whatever linear combination is obtained is then passed through a non-linear function, here the ReLU function. We preferred to develop a more complete model, where each layer still carries multiple neurons, because in general everything we do has real applications as a reference. We therefore wanted to develop something that could be applied to more complex real cases. More on this in the mathematical digression at the end.


Let us anticipate some results of our application:

  • In general, the R-algorithm out-performs human behavior

  • The extent to which R-learning outperforms human behavior is directly related to the volatility of parameters. Very limited volatility is the only case where results are comparable

  • The "rules" adopted by the R-learning algorithm are not that intuitive, showing the potential of reinforcement learning that does not reference examples or past data, and just strives to find the best behavior for the statistics it is provided

We should now briefly explain what we mean by "human behavior". Rather than asking a supply-chain professional what action they would take at specific states, we built two simple algorithms.

  1. The first one would experience monthly sales (random draw from a normal distribution based on the statistics of the specific brand) and would order the number of cars needed to bring the inventory back to a level equal to the average expected monthly sales for that specific brand. Say a specific brand sells on average 3 cars per month: the algorithm would always try to bring the inventory back to 3 cars each month

  2. The second algorithm would always order the average monthly sales for that specific brand, regardless of the level of inventory or the actual sales
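The two human-like behaviors described above can be sketched as two small functions. The function names are ours and only illustrative; `mean_sales` stands for the average monthly sales of the brand.

```python
def human_like_1(inventory: int, mean_sales: int) -> int:
    """Order just enough to bring inventory back up to the average monthly sales."""
    return max(0, mean_sales - inventory)

def human_like_2(inventory: int, mean_sales: int) -> int:
    """Always order the average monthly sales, whatever the inventory."""
    return mean_sales

# A brand selling 3 cars per month on average, with 1 car left in inventory
print(human_like_1(1, 3))  # 2
print(human_like_2(1, 3))  # 3
```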

We will see immediately below that, while being simpler, the latter algorithm performs better than the former. That may be unexpected but, again, we are not good at thinking in terms of volatility.

For the sake of training, we ran 2000 simulations of a process managing the inventory for 640 consecutive months (it was convenient to train the algorithm with 32 batches repeated 20 times: 32 x 20 = 640). During those 2000 runs, the algorithm would adjust its parameters while learning. The maximum cumulative profit reported for each algorithm is the average of about 200 simulations run with the trained parameters identified through the previous 2000 training iterations. That profit is determined by the monthly sales reduced by the total costs, including storage & shipping costs (detailed right below) and lost sales whenever monthly sales are greater than the carried inventory. Here are the details of the costs:

  • Total costs: [5% of the selling price x inventory units] + [10% of the selling price x the order of the month] + lost sales. We set the 5% and 10% rates based on some experience and common-sense considerations

  • Lost sales = maximum between 0 and the difference between random sales and available inventory
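As a rough illustration of the cost structure above, the monthly profit could be computed as in the sketch below. It makes two assumptions not stated in the text: each lost sale is valued at the full selling price (`lost_sale_cost`, adjustable), and storage is charged on the inventory carried at the start of the month. The function name and the price in the example are hypothetical.

```python
def monthly_profit(inventory, order, demand, price, lost_sale_cost=None):
    """Profit for one month: revenue from sold cars minus total costs."""
    if lost_sale_cost is None:
        lost_sale_cost = price  # assumption: a lost sale costs the full price

    sold = min(inventory, demand)      # only available cars can be sold
    lost = max(0, demand - inventory)  # lost sales, as defined above

    revenue = sold * price
    storage_cost = 0.05 * price * inventory  # 5% of price per unit in inventory
    shipping_cost = 0.10 * price * order     # 10% of price per unit ordered
    return revenue - (storage_cost + shipping_cost + lost * lost_sale_cost)

# Hypothetical month: 3 cars in stock, 2 ordered, 4 requested, price $30,000
print(monthly_profit(inventory=3, order=2, demand=4, price=30_000))  # 49500.0
```

Raising `lost_sale_cost` is one way to give more importance to lost sales when modeling the reward.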

We can now show the cumulative profits per each brand obtained by, respectively, the reinforcement learning algorithm, the first human-like behavior always replenishing the average inventory, and the second human-like behavior always referencing the average monthly sales by ordering the same number of vehicles every month:

Cumulative profits across 640 months of the three algorithms

The R-Learning algorithm beats the two human-like behaviors or, for the lowest-volatility brand, matches them. Moreover, the margin by which it beats them is directly related to the volatility of the sales for the specific brand (as a percentage of the average expected sales). It is interesting to note that always ordering the same number of vehicles, equal to the average monthly sales (human-like_2), does not perform too badly, and it performs better than the approach based on ordering just the number of cars needed to replenish the average inventory (human-like_1). Human-like_1 is indeed often caught by surprise, without enough inventory to cover spikes in sales. We can look at the two pictures immediately below, representing two simulations of the inventory in time for the two human-like approaches (based on brand b, with the times when the carried inventory is not enough shown in orange):

Human-like_1 inventory in time (aiming at replenishing average inventory)

Human-like_2 inventory in time (aiming at always placing orders equaling the average sales for the specific brand)

Algorithm “human-like 1” is more often left with no vehicles in inventory (therefore often losing sales), with an average inventory below about 0.5 cars. The algorithm “human-like 2” performs better because it maintains an average inventory above about 1 vehicle. Because of the volatility, even though human-like_2 focuses on average sales rather than average inventory, it is the one that manages to guarantee a better level of inventory.

It is now interesting to understand how the R-Learning algorithm manages to beat human-like_2. Having only one input to the neural network, R-learning can be easily reverse-engineered, and we can find the rules the R-algorithm is following for the specific brand (b). Here they are:

  1. If the inventory is below or equal to 2 vehicles, order 2 more vehicles

  2. If the inventory is greater than 2 vehicles, order 0 vehicles
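With a single input, this kind of reverse-engineering boils down to sweeping the inventory level and grouping consecutive levels that lead to the same order. The sketch below shows the idea; `stand_in_policy` is a hypothetical placeholder that simply encodes the two rules above, where in practice one would query the trained network instead.

```python
from itertools import groupby

def stand_in_policy(inventory: int) -> int:
    # Hypothetical stand-in for the trained network's chosen order for brand (b)
    return 2 if inventory <= 2 else 0

def extract_rules(policy, levels):
    """Group consecutive inventory levels sharing the same order into rules."""
    pairs = [(level, policy(level)) for level in levels]
    rules = []
    for order, group in groupby(pairs, key=lambda pair: pair[1]):
        group_levels = [level for level, _ in group]
        rules.append((group_levels[0], group_levels[-1], order))
    return rules

# Each tuple reads: (lowest inventory, highest inventory, vehicles to order)
print(extract_rules(stand_in_policy, range(0, 11)))
# [(0, 2, 2), (3, 10, 0)]
```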

Here is how the level of inventory for the same brand (b) would appear in time when handled by the machine according to those rules:

R-learning inventory in time (obtained by following the rules outlined above)

The R-algorithm maintains an average inventory of about 2 vehicles, and it is seldom left without cars. Moreover, when inventory goes into the orange zone, it does so to a lesser extent, meaning lost sales are lower. Because lost sales penalize profit, the algorithm makes sure inventory can cover possible sales and their volatility. At the same time, it is careful to position inventory just at the needed level, without increasing it too much and wasting money on storage & shipping costs for unsold vehicles.

According to the normal distribution we draw sales from, and with the standard deviation of brand (b) equal to 2 vehicles, volatility pushes sales up to 2 units above the average with a probability of about 68%/2 = 34%, up to 2x2 units above it with about 95%/2 = 47.5%, and up to 2x3 units above it with about 99.7%/2 ≈ 49.9% (positive half of the normal distribution with the usual probability ranges at 1, 2, and 3 standard deviations).
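Those probabilities can be checked with the error function from Python's standard library. The helper name below is ours; it returns the probability that a normally distributed value falls between the mean and k standard deviations above it.

```python
import math

def prob_above_mean_within(k: float) -> float:
    """P(mean < X < mean + k * sigma) for a normal distribution."""
    return 0.5 * math.erf(k / math.sqrt(2))

sigma = 2  # standard deviation of brand (b), in vehicles
for k in (1, 2, 3):
    print(f"up to {k * sigma} vehicles above average: {prob_above_mean_within(k):.1%}")
# up to 2 vehicles above average: 34.1%
# up to 4 vehicles above average: 47.7%
# up to 6 vehicles above average: 49.9%
```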

Let us now look at the brand (h), having average sales equal to 3 and a standard deviation equal to 3. Here is the inventory that human-like_2 would maintain in time:

R-learning, instead, would determine the following inventory:

While it may seem that in this case R-learning would outperform human-like_2 even more, the opposite is true, consistent with the lower volatility as a percentage of the mean value. R-learning outperforms human-like_2 by a smaller margin, $62 M vs $50 M. While R-learning manages to lose only a few sales, it is forced to carry a higher inventory and higher storage & shipping costs. It is almost fascinating to note, however, that the algorithm works hard to get those extra millions: believe it or not, to find the sweet spot where it still manages to make a difference compared to human-like_2, it follows these rules:

  1. If the inventory is below or equal to 3, order 6 vehicles

  2. If the inventory is greater than 3, order 1 vehicle

It may seem strange that those rules would maintain the constant inventory shown above, but this can be easily replicated in Excel: starting from an inventory equal to average sales (3), subtracting numbers randomly drawn from a normal distribution with parameters (3,3), and adding vehicles every month according to the rules above, we would obtain a profile similar to the one shown above.
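The same spreadsheet exercise can be sketched in a few lines of Python. The sketch makes some assumptions the text does not spell out: random draws are rounded to whole vehicles and floored at zero, the order is decided on the inventory at the start of the month, and a negative inventory is kept as-is to represent unmet demand. Function names and the random seed are ours.

```python
import random

def order_rule_h(inventory: int) -> int:
    """Rules found for brand (h): order 6 if inventory <= 3, otherwise order 1."""
    return 6 if inventory <= 3 else 1

def simulate_inventory(months=640, mean=3, std=3, start=3, seed=0):
    """Monthly inventory profile obtained by following the rules above."""
    rng = random.Random(seed)
    inventory = start
    history = []
    for _ in range(months):
        order = order_rule_h(inventory)
        # Assumption: sales are whole vehicles and cannot be negative
        sales = max(0, round(rng.gauss(mean, std)))
        inventory = inventory - sales + order
        history.append(inventory)
    return history

history = simulate_inventory()
print("average inventory:", sum(history) / len(history))
print("months with unmet demand:", sum(1 for level in history if level < 0))
```

Plotting `history` against the month index should give a profile to compare with the one described above; changing `seed` produces a different random path.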

Finally, for the brand (f), which has the lowest volatility, the R-algorithm finds exactly the same rule adopted all the time by “human-like 2”, which is to constantly order the brand’s average sales (4). That is consistent with the low volatility, which makes it convenient to stick to the mean value. Inventories would look pretty much the same for all three algorithms, as confirmed by the final profit equaling $55 M in all three cases and shown in the initial table. Note that, in this case, human-like_1 performs pretty well too, since even referencing the average inventory all the time leads to the same outcome.


Since the performances obtained by the R-learning algorithm should be evident by now, we can highlight a couple of key points:

  • Especially when it is not possible to live-train an algorithm and simulations must be conducted to train the model, it is critical to model the reward as per the final actual application. As shown above, R-learning critically positions itself by finding the right balance between inventory & shipping costs and lost sales. Had we given more importance to lost sales, the algorithm would have carried on average a higher inventory

  • If possible, it would probably be better not to immediately deploy the neural network adopted here, but rather to obtain a first rough result through a simpler model, like the one leveraging an easier-to-investigate matrix, as per the previous article that we referenced above and which can be found again here

  • At least according to our experience, when working with reinforcement learning, it may be even more important to work with people familiar with the specific domain. Since R-learning references the reward to find optimal behaviors, it is critical to model that reward correctly and stress-test the result even beyond the limits of the particular application. That is to ensure a “safe” final deployment

To conclude, please feel free to get in touch with comments, proposals, and anything else that would allow us to connect and even develop collaborations.




--- Here is a brief mathematical digression on our multi-layered neural network deploying multiple neurons per layer despite having only one input (the level of the inventory)

In our case, we wanted to build a multi-layered NN for the completeness of the project. We can indeed now apply this algorithm to more complex cases (e.g. a multi-input state, where the state is represented not only by the number of vehicles in inventory but also by the month we are considering, to account for possible cyclicality in sales). In our specific case, only the subsequent non-linear layers matter, since the number of neurons per layer loses importance. Here is the mathematical demonstration:

(ReLU is the common non-linear function in NN layers, $\alpha_0$ is the bias of each layer, the remaining $\alpha$ are the actual coefficients of the neurons, which are to be found through fitting/training)

  • Intermediate feature at the first hidden layer when we feed only 1 input (inventory level / State) to the NN:

o $Z_1 = \mathrm{ReLU}(\alpha_0 + \alpha_1 \cdot \mathrm{State} + \alpha_2 \cdot \mathrm{State} + \alpha_3 \cdot \mathrm{State})$

  • Factoring out State, the above can be written as:

o $Z_1 = \mathrm{ReLU}(\alpha_0 + (\alpha_1 + \alpha_2 + \alpha_3) \cdot \mathrm{State})$, where the sum in parentheses can be replaced by a single coefficient, $\alpha_{sum} = \alpha_1 + \alpha_2 + \alpha_3$

  • Therefore, the above is equal to:

o $Z_1 = \mathrm{ReLU}(\alpha_0 + \alpha_{sum} \cdot \mathrm{State})$, which is the same as saying that the three neurons behave like only one within a single layer (this relies on the layer sharing the single bias $\alpha_0$ and on the three terms being summed before ReLU is applied, as written above)

--- End of the mathematical digression


