By riccardo

R-Learning AI self-taking over processes

Nowadays, algorithms that apply pattern recognition to historical data are at the core of many commercial solutions. Less common are algorithms of the kind that could let a humanoid robot learn, by simple trial and error, to perform tasks such as walking or holding a glass. The former type of learning could be considered supervised and backward-looking because it leverages the statistics of past examples. Conversely, the latter type could be summed up as unsupervised and forward-looking because it leverages its own ability to interact with the environment and assign “values” to specific decisions. Technically, that forward-looking learning could still be considered “supervised” because it still receives some feedback; however, we will discuss its very different nature below. That self-discovering process, which in practice resembles an action-reward exchange between the algorithm and the environment, is usually referred to as Reinforcement Learning - we will simply say “self-learning” or “R-learning”, and we may refer to the algorithm as “the agent”. By going through a real project and the main steps involved in self-learning, this article aims to stimulate the imagination of the reader by hinting at the opportunities and the possible critical aspects.

Though we are saying “self-learning”, we are not implying we just place an agent within a specific environment and let it do whatever it wants – at least not yet. R-learning still requires providing the agent with feedback on its actions and an objective to be maximized.

Say we wanted to have a spaceship land without human intervention. We could provide the robo-pilot with the following:

  1. Reference to the “state” through dedicated sensors (e.g. speed, orientation, altitude, etc.)

  2. Ability to perform “actions” through the control of the main rear thruster and the control of small lateral wings balancing the spaceship

  3. Feedback on the actions taken in the form of a “reward” somehow related to the speed and position at landing (i.e. we need to land with a controlled speed and orientation)

We would then let the agent experience several landings and learn from its actions. Over time, the actions taken in any configuration (e.g. altitude, speed, orientation) should become less and less random because they are based on “values” assigned while learning - all this would probably be simulated at first and then tested in a few real-world runs.


Some may think R-learning is better suited to “physical” or “symmetrical” scenarios where the rules of the game are set (e.g. landing a rocket follows physical laws which are unlikely to change mid-air). By contrast, domains like sales & marketing, which depend on human unpredictability, may appear poorly suited to self-learning. However, that is probably not true. Any learning is at its core statistical analysis: in general, it is about making the decision most likely to provide a good outcome in a specific situation. Before focusing on the technology, we should probably focus on the statistical distribution of events in that domain. Moreover, distinctions like the one we are making between pattern recognition and R-learning probably reflect off-the-shelf solutions best. High-end tools developed internally at tech companies are likely to deploy customized mixtures of self-learning agents leveraging pattern recognition at different stages of their computations - and some other techniques we may not know at all. We will also comment below on the cases where that type of mix is almost the only way to go. However, in general, the distinction among the different approaches should still be valid, and that is the reason for this article.

The project

A couple of years ago, a European private equity firm acquired a company in the USA and asked me to take care of the post-acquisition restructuring. Among the many business initiatives, we had the opportunity to introduce a new product. That product would be manufactured in Europe and sold in the USA; it therefore needed to be shipped to and stored in the USA, with the related shipping, handling, and storage costs. The planning of that activity had to be driven by the sales forecast, the constraints of the European manufacturing, and the characteristics of the US logistics. The overall objective was to minimize costs without negatively affecting sales (e.g. by not having enough inventory to cover them). Below, we will build an R-learning solution able to teach itself how to manage that supply chain optimally. To avoid disclosing any confidential information, we will assume a similar scenario in which we are a US multi-brand car dealer selling EU-manufactured cars. We will then leverage an algorithm able to look at the inventory currently available in our US warehouse and place orders detailing the number of cars per brand that it wants shipped to the US the following month – everything will follow monthly steps. As anticipated, we will not leverage any past examples of how the inventory should be handled. Rather, the following is all we will provide the agent with:

  • STATE -> Monthly inventory: zero in the first month, then equal to the previous month’s level, plus the vehicles shipped to the USA, less the month’s sales. For the sake of simplicity, the inventory will be limited to the range of zero to ten vehicles per brand (i.e. no more than ten cars per brand can be stored in our US warehouse).

  • ACTION -> Monthly order: the agent will place its order detailing the number of cars per brand it wants shipped to the US by the end of the month. For the sake of simplicity, the actions will be limited to the range of zero to two cars (i.e. for each brand we can have zero, one, or two cars shipped to the USA every month).

  • REWARD -> Monthly sales: sales will be determined by a random draw from a normal distribution based on the statistics of each brand – we assume we know the average monthly sales per brand and their standard deviation. Sales in USD terms will correspond to the selling price times the minimum of that random draw (i.e. cars ordered by customers) and the number of cars available in our US warehouse - this implies that we lose the sales corresponding to cars ordered by customers but not covered by inventory on hand in the USA. The reward will be maximized by losing the fewest sales, but also by wasting the least money on shipping and inventory costs. We will assign costs to shipping, handling, and storage, as well as to the actual cost of physically holding the vehicle in our inventory. All of that will be detailed per brand, since different brands have different costs.
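The three ingredients above can be sketched as a single monthly step for one brand. This is only a minimal illustration, not the code used in the project: the names `draw_demand`, `step`, and `monthly_cost` are hypothetical, rounding the normal draw to whole cars and flooring it at zero is an assumption, and so is letting the cars ordered this month count as available for this month's sales.

```python
import random

MAX_INVENTORY = 10  # at most ten cars per brand in the US warehouse
MAX_ORDER = 2       # the agent may order zero, one, or two cars per month


def draw_demand(mean, std, rng):
    """Cars requested by customers: a normal draw, rounded and floored at zero (assumption)."""
    return max(0, round(rng.gauss(mean, std)))


def step(inventory, order, demand, price, monthly_cost):
    """Advance one month for one brand and return (next_inventory, reward).

    monthly_cost is a hypothetical callable (inventory, order) -> USD cost.
    """
    if not 0 <= order <= MAX_ORDER:
        raise ValueError("order must be between 0 and MAX_ORDER")
    # Assumption: shipped cars are available for sale in the same month.
    available = min(inventory + order, MAX_INVENTORY)
    sold = min(demand, available)  # demand not covered by inventory is lost
    reward = price * sold - monthly_cost(inventory, order)
    return available - sold, reward
```

For instance, with a made-up flat cost of $1,000 per car handled, `step(3, 2, 4, 30000.0, lambda inv, order: 1000.0 * (inv + order))` leaves one car in inventory and returns a reward equal to four sales minus $5,000 of costs.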

The sales parameters will be the following:

The statistics of the sales per brand are expressed in units

We will group all costs into an almost exponential total monthly cost based on both the number of cars in inventory and the additional ones shipped. In truth, pure exponential costs would rapidly reach very high levels; we will therefore adopt something closer to a power-law profile, where the costs for a few units are a bit higher than the exponential ones and the costs for a higher number of stored vehicles are lower. Moreover, we will scale everything down to a few percentage points of that, so that the final monthly cost per stored vehicle comes to about 15% of its selling value - a heavy impact if vehicles were to be stored for more than a couple of months without selling. Here is an example of a pure exponential cost based on a $1,000 base cost multiplied by EXP(number of vehicles):

Pure exponential cost profile rapidly reaching very high levels because of the compounding (we will use a tamer power-law profile probably better describing real costs)
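To make the comparison between the two profiles concrete, here is a small sketch that tabulates both. The exponential profile uses the $1,000 base cost quoted above; the power-law coefficient and exponent are illustrative assumptions chosen only so that the curve sits above the exponential for a few units and below it for many. They are not the values used in the project, and the further scaling down to a few percentage points is left out.

```python
import math

BASE_COST = 1_000.0  # USD, the base cost quoted in the text


def exponential_cost(n_vehicles):
    """Pure exponential profile: base cost times EXP(number of vehicles)."""
    return BASE_COST * math.exp(n_vehicles)


def power_law_cost(n_vehicles, coeff=3_000.0, exponent=2.0):
    """Tamer power-law profile; coeff and exponent are illustrative assumptions."""
    return coeff * n_vehicles ** exponent


# Print the two profiles side by side for 0 to 10 vehicles.
for n in range(11):
    print(f"{n:2d}  {exponential_cost(n):>14,.0f}  {power_law_cost(n):>10,.0f}")
```

At ten vehicles the pure exponential already exceeds $22 million per month, while the assumed power law stays at $300,000, which is the kind of gap that motivates the tamer profile.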

With the few inputs above, we can program the agent to simulate thousands of episodes in a few seconds. Each episode represents three years of supply chain management (36 months, or steps). At the end of each episode, the agent starts over from zero inventory and zero revenues, while it keeps updating the “values” of matrices whose cells represent STATE-ACTION pairs - one matrix per brand. At each step (i.e. each month), it looks at the inventory on hand and decides how many cars to ship from Europe to the US warehouse. The following month, it experiences a reward or a loss related to its previous decision, based on the sales of the month and the total costs. Each month, the agent accumulates that reward or loss in the matrices mentioned above, whose rows represent the number of cars on hand in the US inventory (the STATE) and whose columns represent the additional cars the agent orders to be shipped (the ACTION). The value of each cell represents the accumulated reward corresponding to ordering Y cars at the specific inventory level X – the rewards do not exactly correspond to accumulated revenues, but are values properly adjusted according to mathematical considerations specific to R-learning. Poorly selling cars will probably have higher values in the cells corresponding to ordering fewer cars for the same level of inventory. However, different selling prices, different logistics costs, and different uncertainty on the average sales (i.e. the standard deviation of sales) will all play critical differentiating roles, possibly making the adoption of R-learning almost necessary in a real situation.
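As a rough illustration of how such a state-action matrix could be filled in, the sketch below trains one brand with plain one-step tabular Q-learning and epsilon-greedy exploration. This is a simpler relative of the algorithm actually used for the article, and every number in it (cost function, learning rate `alpha`, discount `gamma`, exploration rate `epsilon`) is an assumption made for the example.

```python
import random

N_STATES, N_ACTIONS, MONTHS = 11, 3, 36  # inventory 0..10, orders 0..2, three years


def monthly_cost(inventory, order):
    # Hypothetical power-law cost in USD; not the article's actual profile.
    return 300.0 * (inventory + order) ** 2


def train(price, mean, std, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Return a value matrix q[inventory][order] learned by one-step Q-learning."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # rows: STATE, columns: ACTION
    for _ in range(episodes):
        inventory = 0  # every episode starts from an empty warehouse
        for _ in range(MONTHS):
            if rng.random() < epsilon:  # explore: pick a random order
                order = rng.randrange(N_ACTIONS)
            else:  # exploit: pick the best-valued order for this inventory level
                order = max(range(N_ACTIONS), key=lambda a: q[inventory][a])
            demand = max(0, round(rng.gauss(mean, std)))
            available = min(inventory + order, N_STATES - 1)
            sold = min(demand, available)  # uncovered demand is lost
            reward = price * sold - monthly_cost(inventory, order)
            nxt = available - sold
            # Move the cell toward reward plus discounted best value of the next state.
            target = reward + gamma * max(q[nxt])
            q[inventory][order] += alpha * (target - q[inventory][order])
            inventory = nxt
    return q
```

Calling `train(30000.0, 2.0, 1.0)` with made-up brand statistics returns an 11 by 3 matrix whose row-wise maxima indicate the preferred order at each inventory level. The cell values are discounted estimates of future reward, which is why they do not coincide with accumulated revenues.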

After letting the agent do its job, here is an example of the matrices built by the algorithm for two different brands, the first corresponding to higher average sales and a lower selling price and the second corresponding to lower average sales and a higher selling price:

Examples of state-action value-matrices obtained after training. Rows represent the number of cars already in our US inventory and columns represent the number of additional cars the algorithm wants to be shipped to the USA

As expected, in general, the agent recognizes the value of ordering more cars (green cell) when the inventory is low, while preferring not to order additional units as the number of stored vehicles increases (i.e. for each row the biggest value moves to the left as the level of inventory increases). Moreover, the value-matrix on the right more clearly prefers not to order additional units as the inventory reaches 3 cars - resulting from the specific balance between higher selling price, higher costs per unit, etc.

It is interesting to note that the 80.5k value assigned to a level of 5 vehicles on hand in the scenario on the left may suggest the agent prefers to order 1 vehicle when 5 are already stored. However, that is probably because the training was interrupted too soon, before the algorithm could experience the situation of 5 units in inventory and an order of 0 vehicles. That is why the cell corresponding to that scenario is still at 0.0. Clearer examples of insufficient training are shown below, resulting respectively in an agent that is too conservative and one that is too aggressive:

The training at the top is a bit conservative, while the one at the bottom is a bit aggressive (both confirmed in our simulation by a disappointing overall profit). If trained for longer, both scenarios would converge to matrices similar to the correct ones shown above

We must remember that all the decisions above are optimal for the type of reward we chose. For example, increasing the selling price of the expensive vehicle (the matrix on the right in the picture shown previously) would probably increase the average number of cars ordered at any level of inventory: the additional cost of storing unsold vehicles each month would be easily covered by one additional sale or by occasional spikes on the positive side of the distribution – note that sales are strongly bounded on the negative side by 0, which lies on average about one standard deviation below the mean, while they are loosely bounded on the positive side and can rise more than three standard deviations above it. Therefore, it is extremely important to program the agent properly on the basis of our specific situation and needs; it can then help us manage complex scenarios that could easily get out of control if handled manually or deterministically. That specific programming also leads to a specific final result, and here we can show the one from our implementation. The picture below shows the total profit (revenues minus costs) reached at the end of the third year of simulation as the agent goes through 1,000 episodes, simulating the 3 years of sales over and over and continuously updating its knowledge of the process (i.e. updating the state-action matrices). As the picture shows, given the way we programmed the reward and the final profit, the major threats to be avoided are big losses. That is mostly due to the quick rise of the total cost as the number of stored vehicles increases: even ordering just one additional car above the optimal level would strongly penalize the profit.

Accumulated profit (revenues – costs) obtained at the end of the third year repeated for 1000 consecutive episodes. The trend of that total profit represents the learning process

In the picture above, the agent quickly reaches the $20M mark in about 200 episodes, then spends time stabilizing slightly above that level. The total profit still occasionally drops below $0 around the 800th episode because of the high randomness we intentionally left in the training process. Randomness is essential to keep the agent from being too confident in the actions it initially identifies as “good ones” and to force it to explore additional behaviors. Limited randomness also helps the agent stay vigilant against changes in the statistics of the domain and adapt accordingly. In the implementation above, we limit randomness only in the final hundred runs, where the agent indeed stays in the positive and high range of profit.
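One common way to implement this kind of controlled randomness is an epsilon-greedy rule with a schedule that lowers the exploration rate near the end of training. The sketch below follows the description above (high randomness throughout, limited only in the final hundred episodes), but the function names and the two rates, 0.3 and 0.01, are assumptions rather than the values behind the picture.

```python
import random


def epsilon_for(episode, total_episodes=1000, explore_eps=0.3,
                final_eps=0.01, final_window=100):
    """Exploration rate: high for most of training, low in the last final_window episodes."""
    if episode >= total_episodes - final_window:
        return final_eps
    return explore_eps


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice: random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


# Usage: pick an order for one row of a state-action matrix late in training.
rng = random.Random(42)
row = [1.0, 5.0, 2.0]  # hypothetical values for ordering 0, 1, or 2 cars
action = choose_action(row, epsilon_for(950), rng)
```

Keeping `final_eps` slightly above zero, rather than exactly zero, is one way to retain the limited randomness that lets the agent notice changes in the statistics of the domain.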

For the interested reader, the core of the algorithm above was of Watkins’s type (Q-learning), with eligibility traces to emphasize the update of the states (i.e. levels of inventory) visited more frequently.
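For readers who want to see what such an update might look like, here is a minimal sketch of a single step of Watkins's Q(λ) as it is commonly presented in textbooks, with accumulating traces. It is not the code behind the results above: the function name and the default values of `alpha`, `gamma`, and `lam` are assumptions.

```python
def watkins_update(q, traces, s, a, reward, s_next, a_next,
                   alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one Watkins's Q(lambda) update in place.

    q and traces are lists of lists indexed as [state][action];
    a_next is the action the agent will actually take in s_next.
    """
    best_next = max(q[s_next])
    # TD error measured against the greedy value of the next state.
    delta = reward + gamma * best_next - q[s][a]
    # Accumulating trace: the pair just visited becomes more eligible.
    traces[s][a] += 1.0
    # Traces survive only if the next action is greedy (checked before q changes).
    next_is_greedy = q[s_next][a_next] == best_next
    for i, row in enumerate(q):
        for j in range(len(row)):
            row[j] += alpha * delta * traces[i][j]
            # Decay traces after a greedy action, cut them after an exploratory one.
            traces[i][j] = gamma * lam * traces[i][j] if next_is_greedy else 0.0
```

Because a trace grows each time its state-action pair is visited, pairs visited often and recently receive a larger share of each new reward signal, which matches the emphasis on frequently visited inventory levels described above.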

Where can we go from here?

Because it does not leverage past examples during the learning process, R-learning can return solutions that vary a lot with the specific programming and the representation of the state, action, and reward - a case in point is the strong effect of the representation we chose for the total costs in the example above. However, that possible limitation could also be its biggest advantage. Beyond cases where human-made examples are not available at all, self-learning could in general be better positioned to find optimal solutions never before investigated by a human operator. That is because R-learning simply aims at identifying the best actions for the specific statistics of the domain it is dealing with.

Solutions based on this type of technology can apply to many scenarios, from financial markets to energy management. To comment further on that last example, the possible rise of interconnected energy storage systems is likely to require increasing capabilities in the management of energy flows. New business models may arise where different providers act on the grid by pulling and pushing energy from different storage facilities and loads – systems that usually act as loads could also function as storage during idle times. Even though it is not easy to predict actual implementations, the need for fast and complex decision-making beyond human capability is likely to arise. In those scenarios, an R-learning approach could be particularly effective at identifying the best action in specific configurations, adapting to rapid changes, and leveraging different parameters interacting non-linearly.

We anticipated in the notes above that the boundaries among different technical solutions are often not that clear, especially when it comes to high-end, proprietary, and customized ones. That is often not even a choice: different mathematical tools are needed at different stages of the algorithm to take care of different tasks. In our example, the state-action matrix was bounded, and therefore “finite” – the number of possible inventory levels was limited, as was the possible size of the shipments. That is not always the case, and in those instances the state-action space is better approximated and handled by a neural network performing pattern recognition than by a matrix strictly identifying the value of each state-action pair - the characteristic of not needing past examples to learn from would remain, since the core reference would still be the action-reward interaction.

To conclude, we hope to be able to update you soon on some additional specific projects and experiences. On your end, please feel free to reach out with comments, questions, and discussions. Also, please feel free to share this with anyone who may be interested.


