Your Policy Regulariser is Secretly an Adversary

Playing against an imagined adversary leads to robust behaviour

The standard model for sequential decision-making under uncertainty is the Markov decision process (MDP). It assumes that actions are under the control of the agent, whereas outcomes are produced at random by the environment. MDPs are central to reinforcement learning (RL), where transition probabilities and/or rewards are learned through interaction with the environment. Optimal policies for MDPs select, in each state, the action that maximises the expected future return, where the expectation is taken over the uncertain outcomes. This famously leads to deterministic policies, which are brittle: they “put all their eggs in one basket”. If we use such a policy in a situation where the transition dynamics or the rewards differ from the training environment, it will often generalise poorly.
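As a minimal, purely illustrative sketch of this brittleness (our own toy numbers, not from the paper): a deterministic argmax policy stakes everything on the action with the highest training-time value, so a shift in rewards at deployment hits it with the full drop, whereas a stochastic policy that spreads probability over near-optimal actions degrades more gracefully. The softmax temperature and the reward values below are arbitrary choices for illustration.

```python
import numpy as np

# Single state, three actions; Q-values estimated in the training environment.
q_train = np.array([1.00, 0.98, 0.20])

# Deterministic optimal policy: all probability on the argmax action.
greedy = np.zeros_like(q_train)
greedy[np.argmax(q_train)] = 1.0

# A stochastic alternative: softmax over Q-values with temperature tau.
tau = 0.1
softmax = np.exp(q_train / tau)
softmax /= softmax.sum()

# At deployment the rewards have shifted: the favoured action is now much worse.
q_deploy = np.array([0.40, 0.97, 0.20])

print("greedy return at deployment: ", greedy @ q_deploy)    # takes the full hit
print("softmax return at deployment:", softmax @ q_deploy)   # hedged across actions
```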

Imagining an opponent inside the machine. Though seemingly “just an automaton”, the Mechanical Turk is capable of anticipating the player’s moves. When faced with such an adaptive dynamical system, the player needs to choose moves by imagining responses of an adversary controlling the machine and hedging accordingly. Image source: Joseph Racknitz, Public domain, via Wikimedia Commons

Implementing the adversary

We play the following game: the agent picks a policy, and afterwards a hypothetical adversary gets to see this policy and change the rewards associated with each transition; that is, the adversary picks the worst-case reward-function perturbation from a set of hypothetical perturbations. When selecting a policy, our agent therefore needs to anticipate the imagined adversary’s reward perturbations and hedge against them in advance. The resulting fixed policy is robust against any of the anticipated perturbations. If, for instance, the reward function changes between training and deployment in a way that is covered by the possible perturbations of the imagined adversary, there is no need to adapt the policy after deployment: the deployed robust policy has already taken such perturbations into account.
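As a toy illustration of the flavour of this game (our own example: the adversary here gets a simple fixed amount of reward reduction, whereas the paper’s adversary has a divergence-based budget, described in the next section), a deterministic policy is maximally exposed to an adversary that sees it, while a hedged policy limits the worst-case damage. The numbers and the helper name `worst_case_return` are only for illustration.

```python
import numpy as np

r = np.array([1.0, 0.9, 0.5])   # unperturbed rewards, one state, three actions
budget = 0.3                     # total reward reduction the adversary may apply

def worst_case_return(pi, r, budget):
    """The adversary sees the policy pi and lowers rewards (total reduction
    <= budget) so as to minimise the expected perturbed reward E_pi[r - dr].
    With this simple budget, the worst case spends everything on the
    agent's most likely action."""
    dr = np.zeros_like(r)
    dr[np.argmax(pi)] = budget
    return pi @ (r - dr)

greedy = np.array([1.0, 0.0, 0.0])   # best without an adversary (return 1.0)
hedged = np.array([0.5, 0.4, 0.1])   # slightly worse without one (return 0.91)

print("greedy, worst case:", worst_case_return(greedy, r, budget))  # 0.70
print("hedged, worst case:", worst_case_return(hedged, r, budget))  # 0.76
```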

Using convex duality to characterise the imagined adversary

The goal of our paper is to show how policy regularisation is equivalent to optimisation under a particular adversary, and to study that adversary. Using convex duality, we show that the adversary corresponding to policy-regularised RL has the following properties:

  • The adversary applies an additive perturbation to the reward function:
    r′(s, a) = r(s, a) − Δr(s, a)
  • It pays a cost for modifying the agent’s rewards, but only has a limited budget. The cost function depends on the mathematical form of the policy regulariser (we investigate KL- and α-divergence regularisation), and the budget is related to the regulariser strength.
  • The adversary applies its perturbation to all actions simultaneously (it knows the agent’s distribution over actions, but not which action the agent will sample). This generally means reducing rewards for high-probability actions and increasing rewards for low-probability actions, as illustrated in the sketch below.
Reward perturbations and policy for a single decision. Left column: unperturbed environment rewards for one state with six available actions; the agent’s Q-values correspond exactly to these environment rewards. Second column (blue): regularised policy (top) and virtually unregularised, quasi-deterministic policy (bottom). Third column: worst-case reward perturbations for the given regulariser strength; finding the optimal (worst-case) perturbation within the adversary’s limited budget is non-trivial and leads to a minimax game between the adversary and the agent. Fourth column: perturbed rewards under the worst-case perturbation.
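For the single-decision, KL-regularised case sketched in the figure, the worst-case perturbation has a simple closed form, which the snippet below reproduces. This is a minimal sketch following the single-step KL analysis, assuming a uniform reference policy; the rewards, regulariser strength and number of actions are arbitrary illustrative values.

```python
import numpy as np

r = np.array([1.2, 1.0, 0.8, 0.5, 0.3, 0.1])   # rewards for one state, six actions
beta = 0.5                                      # regulariser strength
prior = np.full(6, 1 / 6)                       # uniform reference policy

# KL-regularised optimal policy: pi(a) proportional to prior(a) * exp(r(a) / beta).
pi = prior * np.exp(r / beta)
pi /= pi.sum()

# Worst-case reward perturbation implied by the KL regulariser:
#   dr(a) = beta * log(pi(a) / prior(a))
# It is positive (reward reduced) for high-probability actions and
# negative (reward increased) for low-probability actions.
dr = beta * np.log(pi / prior)

r_perturbed = r - dr
print(np.round(dr, 3))
print(np.round(r_perturbed, 3))   # identical across actions: the agent is left indifferent
```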

Generalisation guarantee

The convex-dual formulation also allows us to characterise the full set of perturbations available to the adversary (the “feasible set” in the paper; we have already seen the worst-case perturbation in the illustration above). This, in turn, yields a quantitative generalisation guarantee:

Feasible set (red region): the set of perturbations available to the adversary for various regulariser strengths, for a single state with two actions. The x- and y-axes show the perturbed reward for each action. Blue stars show the unperturbed rewards (the same value in all plots); red stars indicate the rewards under the worst-case reward perturbation (see the paper for more details). For every perturbation in the feasible set, the regularised policy is guaranteed to achieve an expected perturbed reward greater than or equal to the value of the regularised objective. As intuitively expected, the feasible set becomes more restricted with decreasing regulariser strength, meaning the resulting policy becomes less robust (particularly against reward decreases). The slope and boundary of the feasible set can be directly linked to the optimal robust policy (its action probabilities); see the paper for more details.
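Continuing the single-state KL example from the sketch above, the guarantee can be checked numerically: the expected perturbed reward under the regularised policy never drops below the value of the regularised objective, with equality exactly at the worst-case perturbation. The check below only probes scaled-down versions of the worst-case perturbation (which remain feasible); it is an illustrative sanity check under our assumptions, not the general argument made in the paper.

```python
import numpy as np

r = np.array([1.2, 1.0, 0.8, 0.5, 0.3, 0.1])
beta = 0.5
prior = np.full(6, 1 / 6)

# Regularised optimal policy and the value of the regularised objective.
pi = prior * np.exp(r / beta)
pi /= pi.sum()
kl = np.sum(pi * np.log(pi / prior))
objective = pi @ r - beta * kl

# Probe perturbations: scaled-down versions of the worst-case perturbation.
dr_worst = beta * np.log(pi / prior)
for c in (0.0, 0.5, 1.0):
    perturbed_return = pi @ (r - c * dr_worst)
    # Guarantee: expected perturbed reward >= regularised objective value,
    # with equality at the worst case (c = 1).
    print(f"c={c:.1f}  expected perturbed reward = {perturbed_return:.4f}  >=  {objective:.4f}")
```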

Related work and further reading

In our current work, we investigate robustness to reward perturbations, focusing on describing KL- or α-divergence regularisers in terms of imagined adversaries. It is also possible to choose adversaries that perturb the transition probabilities instead, and/or to derive regularisers from desired robust sets (Eysenbach 2021, Derman 2021).

  • Read our paper for all technical details and derivations. All statements from the blog post are made formal and precise (including the generalisation guarantee). We also give more background on convex duality and an example of a 2D, sequential grid-world task. The work is published in TMLR.
  • To build intuition and understanding via the single-step case, see Pedro Ortega’s paper, which provides a derivation of the adversarial interpretation and a number of instructive examples.
  • Esther Derman’s (2021) recent paper derives practical iterative algorithms to enforce robustness to both reward perturbations (through policy regularisation) and changes in environment dynamics (through value regularisation). Their approach derives a regularisation function from a specified robust set (such as a p-norm ball), but can also recover KL or α-divergence regularisation, with slight differences to our analysis.

DeepMind Safety Research