Your Policy Regulariser is Secretly an Adversary

Playing against an imagined adversary leads to robust behaviour

Imagining an opponent inside the machine. Though it appears to be “just an automaton”, the Mechanical Turk anticipates the player’s moves. Faced with such an adaptive dynamical system, the player must choose moves by imagining the responses of an adversary controlling the machine and hedging accordingly. Image source: Joseph Racknitz, Public domain, via Wikimedia Commons

Implementing the adversary

Using convex duality to characterise the imagined adversary

  • The adversary applies an additive perturbation to the reward function:
    r’(s, a) = r(s, a) - Δr(s, a)
  • It pays a cost for modifying the agent’s rewards, but only has a limited budget. The cost function depends on the mathematical form of the policy regulariser — we investigate KL- and alpha-divergence regularisation. The budget is related to the regulariser strength.
  • The adversary perturbs all actions simultaneously: it knows the agent’s distribution over actions, but not which action the agent will sample. In general, it reduces rewards for high-probability actions and increases rewards for low-probability actions.
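For KL regularisation (against a reference policy π₀, with strength α), the worst-case perturbation has a well-known closed form, Δr(a) = α·log(π(a)/π₀(a)), under which the perturbed reward becomes flat across actions and equals the regularised value. The sketch below illustrates this in a single state with six actions; the reward values, α, and the uniform prior are all made-up for illustration:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

alpha = 1.0                                        # regulariser strength (illustrative)
r = np.array([1.0, 0.5, 0.0, -0.5, -1.0, -2.0])    # rewards for six actions (illustrative)
prior = np.full(6, 1 / 6)                          # uniform reference policy pi_0

# KL-regularised optimal policy: pi(a) proportional to pi_0(a) * exp(r(a) / alpha)
logits = np.log(prior) + r / alpha
pi = np.exp(logits - logsumexp(logits))

# Worst-case adversarial perturbation: dr(a) = alpha * log(pi(a) / pi_0(a)).
# It is positive (reward reduced) for high-probability actions and negative
# (reward increased) for low-probability actions.
dr = alpha * np.log(pi / prior)
r_perturbed = r - dr

# Under the worst case, the perturbed reward is constant across actions and
# equals the regularised value V = alpha * log sum_a pi_0(a) exp(r(a) / alpha).
V = alpha * logsumexp(np.log(prior) + r / alpha)
```

Here the flatness of `r_perturbed` is exactly the indifference the adversary induces: once the agent hedges optimally, no action looks better than any other under the worst-case rewards.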
Reward perturbations and policy for a single decision. Left column: (top) unperturbed environment rewards for one state with six available actions; (bottom) agent Q-values, which correspond to the environment rewards. Second column (blue): (top) regularised policy; (bottom) virtually unregularised policy, which is quasi-deterministic. Third column: worst-case reward perturbations for a given regulariser strength — finding the optimal (worst-case) perturbation within the adversary’s limited budget is non-trivial and leads to a minimax game between the adversary and the agent. Fourth column: perturbed rewards under the worst-case perturbation.

Generalisation guarantee

Feasible set (red region): the set of perturbations available to the adversary for various regulariser strengths, for a single state with two actions. The x- and y-axes show the perturbed reward for each of the two actions. Blue stars show the unperturbed rewards (the same value in all plots); red stars indicate the rewards under the worst-case perturbation (see paper for more details). For every perturbation in the feasible set, the regularised policy is guaranteed to achieve expected perturbed reward greater than or equal to the value of the regularised objective. As intuitively expected, the feasible set shrinks with decreasing regulariser strength, meaning the resulting policy becomes less robust (particularly against reward decreases). The slope and boundary of the feasible set can be directly linked to the optimal robust policy (action probabilities); see paper for more details.
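This guarantee can be checked numerically. For KL regularisation the feasible set is the collection of perturbations Δr satisfying Σₐ π₀(a)·exp(Δr(a)/α) ≤ 1, and any Δr on its boundary can be obtained from an arbitrary vector by a constant shift. The sketch below uses an illustrative two-action setup (rewards, α, and prior all assumed) and verifies that the expected perturbed reward never drops below the regularised objective:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

rng = np.random.default_rng(0)
alpha = 1.0                       # regulariser strength (illustrative)
r = np.array([1.0, 0.0])          # unperturbed rewards for two actions (illustrative)
prior = np.array([0.5, 0.5])      # uniform reference policy pi_0

# KL-regularised optimal policy and the value of the regularised objective
logits = np.log(prior) + r / alpha
pi = np.exp(logits - logsumexp(logits))
kl = np.sum(pi * np.log(pi / prior))
objective = pi @ r - alpha * kl

# Sample arbitrary perturbation directions and shift each one onto the
# boundary of the feasible set {dr : sum_a pi_0(a) exp(dr(a)/alpha) <= 1}.
for _ in range(100):
    dr = rng.normal(size=2)
    dr -= alpha * logsumexp(np.log(prior) + dr / alpha)  # constraint now holds with equality
    assert np.sum(prior * np.exp(dr / alpha)) <= 1 + 1e-9
    # Generalisation guarantee: expected perturbed reward is at least the objective.
    assert pi @ (r - dr) >= objective - 1e-9
```

The shift step is why the constant offset matters: subtracting α·log Σₐ π₀(a)·exp(Δr(a)/α) from Δr rescales the constraint to hold with equality, so every sampled perturbation sits exactly on the edge of the adversary’s budget.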

Related work and further reading

  • Read our paper for all technical details and derivations. All statements from the blog post are made formal and precise (including the generalisation guarantee). We also give more background on convex duality and an example of a 2D, sequential grid-world task.
  • To build intuition and understanding via the single-step case, see Pedro Ortega’s paper, which provides a derivation of the adversarial interpretation and a number of instructive examples.
  • Esther Derman’s recent paper (2021) derives similar results, but starts by defining the set of perturbations to be robust against, and then derives the corresponding regulariser.




We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work.

DeepMind Safety Research
