# Model-Free Risk-Sensitive Reinforcement Learning

---

*By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Markus Kunesch, Tim Genewein, Rob Brekelmans, Shane Legg, and Pedro A. Ortega*

Read our paper here: https://arxiv.org/abs/2111.02907

We’re all familiar with risk-sensitive choices. As you look out of the window, you see a few gray clouds and no rain, but you take an umbrella along anyway. You’re convinced your application will be successful, but you apply for other positions nevertheless. You hurry to get to an important appointment on time, but avoid the highway just in case there is a traffic jam. Or you buy a lottery ticket even though you know the chances of winning are unreasonably slim. All of these are instances of risk-sensitive behavior: mostly risk-averse, but occasionally risk-seeking too. That is, we tend to value uncertain events at less (or more) than their expected value, guarding against outcomes that go against our expectations.

# Why risk-sensitivity?

Most reinforcement learning algorithms are *risk-neutral*. They collect data from the environment and adapt their policy in order to maximize the *expected return* (the sum of future rewards). This works well when the environment is small, stable, and controlled (a “closed world”), such as when agents are trained long enough on a very accurate simulator to familiarize themselves entirely with its details and intricacies. Because they trust their knowledge, risk-neutral policies can afford to confidently “put all their eggs into the same basket”.

These assumptions often do not hold in real-world applications. The reasons abound: the training simulator could be inaccurate, the assumptions wrong, the collected data incomplete, the problem misspecified, the computation limited in its resources, and so on. In addition, there might be competing agents reacting in ways that can’t be anticipated. In such situations, risk-neutral agents are too brittle: it could take a single mistake to destabilize their policy and lead to a catastrophic breakdown. This poses serious AI safety problems.

# How do agents learn risk-sensitive policies?

Just as we go about in our daily lives, we can address the previous shortcomings using risk-sensitive policies. Such policies differ from their risk-neutral counterparts in that they value their options differently: not only according to their expected return, but also to the higher-order moments, like the variance of the return, the skewness, etc. Simply stated, risk-sensitive agents care about the *shape* of the distribution of their return and adjust their expectations accordingly. This approach is standard outside of reinforcement learning: in finance for instance, good portfolio managers carefully balance the risks and returns of their investments (see modern portfolio theory).

There are many ways of building risk-sensitive policies [1]. One can formulate a robust control problem consisting of a two-player game between an agent and an adversary, who chooses the environmental parameters maliciously, and then solve for the maximin policy [2, 3]. Alternatively, one can change the objective function to reflect the sensitivity to the higher-order moments of the return, for instance by penalizing the expected return with a correction term that increases monotonically with the variance [4, 5].
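As an illustration of the second approach, a commonly used mean-variance objective penalizes the expected return with its variance (here λ ≥ 0 is an illustrative risk-aversion coefficient, not notation from our paper):

```latex
J(\pi) \;=\; \mathbb{E}_\pi[G] \;-\; \lambda\,\mathrm{Var}_\pi[G], \qquad \lambda \ge 0,
```

where *G* denotes the return under policy π.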

# Model-Free Risk-Sensitive RL

In our paper, we introduce a simple model-free update rule for risk-sensitive RL. It is an asymmetric modification of temporal-difference (TD) learning that weights observations overshooting the current value estimate (that is, gains) differently from those that fall below it (losses). More precisely, let *s* and *s’* be two subsequent states the agent experiences, *V(s)* and *V(s’)* their current value estimates, and *R(s)* the reward observed in state *s*. Then, to obtain risk-sensitive value estimates, the agent can use the following model-free update of *V(s)*:

$$V(s) \leftarrow V(s) + \alpha\,\sigma_\beta(\delta)\,\delta,$$

where δ is the standard temporal-difference error

$$\delta = R(s) + \gamma V(s') - V(s),$$

and the real function σβ is the scaled logistic sigmoid

$$\sigma_\beta(x) = \frac{2}{1 + e^{-\beta x}}.$$

Furthermore, α is the learning rate, and 0 ≤ γ ≤ 1 is the discount rate. The parameter β controls the risk attitude of the agent. The rationale for this rule is simple: if β < 0, *V(s)* will converge to a risk-averse (or pessimistic) value below the expected return, because losses carry more weight in the updates than gains. Similarly, β > 0 leads to a risk-seeking (optimistic) estimate. Risk-neutrality is obtained with β = 0, since then σβ ≡ 1 and the rule recovers standard TD-learning.
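In code, the rule amounts to a one-line change to the standard TD update. Here is a minimal Python sketch, assuming the scaled sigmoid takes the form σβ(x) = 2 / (1 + e^(−βx)), normalized so that β = 0 gives σβ ≡ 1 and recovers standard TD-learning:

```python
import math

def sigma_beta(x: float, beta: float) -> float:
    # Scaled logistic sigmoid; identically 1 when beta = 0.
    return 2.0 / (1.0 + math.exp(-beta * x))

def risk_sensitive_td_update(V, s, r, s_next, alpha=0.1, gamma=0.99, beta=-0.5):
    """One risk-sensitive TD update of V[s], where V maps states to value
    estimates. With beta < 0, losses (delta < 0) are weighted more than gains."""
    delta = r + gamma * V[s_next] - V[s]             # standard TD error
    V[s] += alpha * sigma_beta(delta, beta) * delta  # asymmetric weighting
    return delta
```

With β = 0 the weighting factor equals 1 and this is ordinary TD-learning; with β < 0, negative errors receive a weight above 1 and positive errors a weight below 1.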

In short, the risk parameter β selects the quantile of the target distribution the value will converge to (although the exact quantile as a function of β depends on the distribution) as shown in the simulation below.
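This quantile-selection effect is easy to reproduce. Below is a small sketch (not code from the paper) on a one-step problem where the discounted next-state value is zero, again assuming the scaled sigmoid σβ(x) = 2 / (1 + e^(−βx)):

```python
import math
import random

def sigma_beta(x, beta):
    # Scaled logistic sigmoid: identically 1 at beta = 0 (standard TD-learning).
    return 2.0 / (1.0 + math.exp(-beta * x))

def estimate(beta, rewards, alpha=0.01):
    """Run the risk-sensitive update on a one-step problem (gamma * V(s') = 0),
    so v converges to a beta-dependent statistic of the reward distribution."""
    v = 0.0
    for r in rewards:
        delta = r - v  # TD error with zero bootstrap term
        v += alpha * sigma_beta(delta, beta) * delta
    return v

random.seed(0)
rewards = [random.choice([0.0, 1.0]) for _ in range(20_000)]  # Bernoulli(0.5)

pessimist = estimate(beta=-2.0, rewards=rewards)
neutral = estimate(beta=0.0, rewards=rewards)
optimist = estimate(beta=+2.0, rewards=rewards)
# Expect pessimist < neutral (close to the mean, 0.5) < optimist.
```

The three estimates settle at different points of the same reward distribution, ordered by β.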

The following grid-world example illustrates how risk-sensitive estimates affect the resulting policies. The task of the agent is to navigate to a goal location containing a reward pill while avoiding stepping into the river. This is made more challenging by the presence of a strong wind that pushes the agent in a random direction 50% of the time. The agent was trained using five risk-sensitivity parameter settings ranging from risk-averse (β = -0.8) to risk-seeking (β = +0.8).

The bar plots in (b) show the average return (blue) and the percentage of time spent in the river (red) for the resulting policies. The best average return is attained by the risk-neutral policy. However, the risk-averse policies (low β) are more effective at avoiding the river than the risk-seeking policies (high β).

The different choices of the risk-sensitivity parameter β also lead to three qualitatively diverse types of behavior. In panel (c), which illustrates the various paths taken by the agent when there is no wind, we observe three classes of policies: a *cautious policy* (β = -0.8) that takes the long route away from the water to reach the goal; a risk-neutral policy (β ∊ {-0.4, 0.0, +0.4}) taking the middle route, only a single step away from the water; and finally, an *optimistic policy* (β = +0.8) which attempts to reach the goal by a straight route.

# Dopamine Signals, Free Energy, and Imaginary Foes

There are a few other interesting properties about the risk-sensitive update rule:

- The risk-sensitive update rule can be linked to findings in computational neuroscience [6, 7]. Dopamine neurons appear to signal a reward prediction error similar to the one in temporal-difference learning. Further studies also suggest that humans learn differently in response to positive and negative reward prediction errors, with higher learning rates for negative errors. This is consistent with the risk-sensitive learning rule.
- In the special case where the distribution of the target value is Gaussian, the estimate converges precisely to the free energy with inverse temperature β. Using the free energy as an optimization objective (or equivalently, using exponentially-transformed rewards) has a long tradition in control theory as an approach to risk-sensitive control [8].
- One can show that optimizing the free energy is equivalent to playing a game against an imaginary adversary who attempts to change the environmental rewards against the agent’s expectations. Thus, a risk-averse agent can be thought of as choosing its policy by playing out imaginary pessimistic scenarios.
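The Gaussian special case mentioned above admits a short derivation via the Gaussian moment-generating function: for a return *G* ~ N(μ, σ²), the free energy with inverse temperature β is

```latex
F_\beta \;=\; \frac{1}{\beta}\log \mathbb{E}\!\left[e^{\beta G}\right]
        \;=\; \frac{1}{\beta}\left(\beta\mu + \tfrac{1}{2}\beta^{2}\sigma^{2}\right)
        \;=\; \mu + \frac{\beta}{2}\,\sigma^{2},
```

so β < 0 subtracts a variance penalty (risk aversion) and β > 0 adds a variance bonus (risk seeking).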

# Final thoughts

To deploy agents that react robustly to unforeseen situations we need to make them risk-sensitive. Unlike risk-neutral policies, risk-sensitive policies implicitly admit that their assumptions about the environment could be mistaken and adjust their actions accordingly. We can train risk-sensitive agents in a simulator and have some confidence about their performance under unforeseen events.

Through our work we show that incorporating risk-sensitivity into model-free agents is straightforward: all it takes is a small modification of the temporal-difference error which assigns asymmetric weights to the positive and negative updates.

# References

[1] Coraluppi, S. P. (1997). Optimal control of Markov decision processes for performance and robustness. University of Maryland, College Park.

[2] Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798.

[3] Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189. PMLR.

[4] Galichet, N., Sebag, M., and Teytaud, O. (2013). Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260. PMLR.

[5] Cassel, A., Mannor, S., and Zeevi, A. (2018). A general approach to multi-armed bandits under risk criteria. In Conference On Learning Theory, pages 1295–1306. PMLR.

[6] Niv, Y., Edlund, J. A., Dayan, P., and O’Doherty, J. P. (2012). Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. Journal of Neuroscience, 32(2):551–562.

[7] Gershman, S. J. (2015). Do learning rates adapt to the distribution of rewards? Psychonomic Bulletin & Review, 22(5):1320–1327.

[8] Howard, R. A. and Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Management Science, 18(7):356–369.