Model-Free Risk-Sensitive Reinforcement Learning

Why risk-sensitivity?

Risk-neutral policies, because of the trust they have in their knowledge, can afford to confidently “put all the eggs into the same basket”. XVI Century painting “Girl with a basket of eggs” by Joachim Beuckelaer.

How do agents learn risk-sensitive policies?

Model-Free Risk-Sensitive RL

Estimation of the value for Gaussian- (left) and uniformly-distributed (right) observed target values (grey dots). Each plot shows 10 estimation processes (9 in pink, 1 in red) per choice of the risk parameter β ∊ {-4, -2, 0, +2, +4}. Notice how the estimate settles on different quantiles.

Dopamine Signals, Free Energy, and Imaginary Foes

  • The risk-sensitive update rule can be linked to findings in computational neuroscience [6, 7]. Dopamine neurons appear to signal a reward prediction error similar to the one used in temporal-difference learning. Further studies suggest that humans learn differently in response to positive and negative reward prediction errors, with higher learning rates for negative errors. This is consistent with the risk-sensitive learning rule.
  • In the special case where the distribution of the target value is Gaussian, the estimate converges precisely to the free energy with inverse temperature β. Using the free energy as an optimization objective (or equivalently, using exponentially-transformed rewards) has a long tradition in control theory as an approach to risk-sensitive control [8].
  • One can show that optimizing the free energy is equivalent to playing a game against an imaginary adversary who attempts to change the environmental rewards against the agent’s expectations. Thus, a risk-averse agent can be thought of as choosing its policy by playing out imaginary pessimistic scenarios.
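
The Gaussian special case in the second bullet can be checked numerically. The sketch below is not taken from this section and rests on stated assumptions: it assumes the update rule has the form v ← v + 2α·σ(βδ)·δ, where δ = x - v is the prediction error and σ is the logistic function, and it takes the free energy to be (1/β)·log E[exp(βX)], which for a Gaussian with mean μ and standard deviation s has the closed form μ + β·s²/2. The function names, the learning rate, the sample size, and the sign convention for β are all illustrative choices.

```python
import math
import numpy as np


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def risk_sensitive_estimate(samples, beta, lr=0.01, v0=0.0):
    """Run the ASSUMED update v <- v + 2*lr*logistic(beta*delta)*delta.

    Returns the estimate after every sample.
    """
    v = v0
    trace = np.empty(len(samples))
    for t, x in enumerate(samples):
        delta = x - v  # prediction error for this sample
        v += 2.0 * lr * logistic(beta * delta) * delta
        trace[t] = v
    return trace


def gaussian_free_energy(mu, sigma, beta):
    """Closed form of (1/beta) * log E[exp(beta*X)] for X ~ N(mu, sigma^2)."""
    return mu + 0.5 * beta * sigma ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 1.0  # hypothetical target distribution
    samples = rng.normal(mu, sigma, size=40_000)
    for beta in (-4.0, -2.0, 0.0, 2.0, 4.0):
        trace = risk_sensitive_estimate(samples, beta)
        # Average the second half to smooth out constant-step-size noise.
        estimate = trace[len(trace) // 2:].mean()
        target = gaussian_free_energy(mu, sigma, beta)
        print(f"beta={beta:+.0f}  estimate={estimate:.3f}  free energy={target:.3f}")
```

With β = 0 the factor 2·σ(0) equals 1, so the assumed rule reduces to an ordinary running average of the targets. Under the sign convention used here, a negative β gives negative prediction errors a larger effective learning rate, which pulls the estimate below the mean.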

Final thoughts





