Your Policy Regulariser is Secretly an Adversary

Playing against an imagined adversary leads to robust behaviour

Imagining an opponent inside the machine. Though seemingly “just an automaton”, the Mechanical Turk is capable of anticipating the player’s moves. When faced with such an adaptive dynamical system, the player needs to choose moves by imagining the responses of an adversary controlling the machine, and hedging accordingly. Image source: Joseph Racknitz, Public domain, via Wikimedia Commons

Implementing the adversary

Using convex duality to characterise the imagined adversary

  • The adversary applies an additive perturbation to the reward function:
    r′(s, a) = r(s, a) − Δr(s, a)
  • It pays a cost for modifying the agent’s rewards, but only has a limited budget. The cost function depends on the mathematical form of the policy regulariser — we investigate KL- and alpha-divergence regularisation. The budget is related to the regulariser strength.
  • The adversary applies a perturbation to all actions simultaneously (it knows the agent’s distribution over actions, but not which action the agent will sample). This generally reduces the rewards of high-probability actions and increases the rewards of low-probability actions.
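For the KL-regularised single-decision case, the worst-case perturbation has a closed form: Δr(a) = β log(π(a)/π₀(a)), where β is the regulariser strength and π₀ the reference policy. Under this perturbation the agent becomes indifferent between all actions: every perturbed reward equals the regularised value. The following is a minimal NumPy sketch (the reward values, β, and the uniform reference policy are illustrative choices, not taken from the paper's experiments):

```python
import numpy as np

beta = 1.0                                 # regulariser strength (illustrative)
r = np.array([2.0, 1.0, 0.0, -1.0])        # unperturbed rewards for one state
prior = np.full_like(r, 1.0 / r.size)      # uniform reference policy pi_0

# KL-regularised optimal policy: pi(a) proportional to prior(a) * exp(r(a) / beta)
logits = np.log(prior) + r / beta
m = np.max(logits)
V = beta * (m + np.log(np.sum(np.exp(logits - m))))  # regularised (soft) value
pi = np.exp(logits - V / beta)

# Worst-case additive perturbation: Delta_r(a) = beta * log(pi(a) / prior(a))
delta_r = beta * np.log(pi / prior)
r_perturbed = r - delta_r

# Under the worst case, all perturbed rewards collapse to the value V:
print(np.allclose(r_perturbed, V))  # prints True
```

Note that `delta_r` is positive for the high-reward (high-probability) actions and negative for the low-reward ones, matching the intuition above.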
Reward perturbations and policy for a single decision. Left column: (top) unperturbed environment rewards for one state with six available actions, (bottom) agent Q-values, which correspond to the environment rewards. Second column (blue): (top) regularised policy, (bottom) virtually unregularised policy (which is quasi-deterministic). Third column: worst-case reward perturbations for a given regulariser strength — finding the optimal (worst-case) perturbation within the adversary’s limited budget is non-trivial and leads to a minimax game between the adversary and the agent. Fourth column: perturbed rewards under the worst-case perturbation.

Generalisation guarantee

Feasible set (red region): the set of perturbations available to the adversary for various regulariser strengths, for a single state with two actions. The x- and y-axes show the perturbed reward for each action respectively. Blue stars show the unperturbed rewards (same value in all plots); red stars indicate the rewards under the worst-case reward perturbation (see paper for more details). For every perturbation that lies in the feasible set, the regularised policy is guaranteed to achieve expected perturbed reward greater than or equal to the value of the regularised objective. As intuitively expected, the feasible set becomes more restricted with decreasing regulariser strength, meaning the resulting policy becomes less robust (particularly against reward decreases). The slope and boundary of the feasible set can be directly linked to the optimal robust policy (action probabilities); see paper for more details.
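The guarantee above can be checked numerically. For KL regularisation, the feasible set consists of perturbations satisfying Σₐ π₀(a) exp(Δr(a)/β) ≤ 1; for any such Δr, the expected perturbed reward of the regularised policy is at least the regularised value V. A minimal sketch (two actions as in the figure; rewards, β, and the uniform reference policy are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0                               # regulariser strength (illustrative)
r = np.array([2.0, 1.0])                 # two actions, as in the figure
prior = np.full_like(r, 0.5)             # uniform reference policy pi_0

logits = np.log(prior) + r / beta
V = beta * np.log(np.sum(np.exp(logits)))  # regularised objective value
pi = np.exp(logits - V / beta)             # KL-regularised policy

# Sample perturbations on the boundary of the KL feasible set,
#   sum_a prior(a) * exp(Delta_r(a) / beta) = 1,
# and check the guarantee E_pi[r - Delta_r] >= V for each of them.
ok = True
for _ in range(1000):
    delta = rng.normal(size=r.shape)
    # shift delta so it lands exactly on the feasibility boundary
    delta -= beta * np.log(np.sum(prior * np.exp(delta / beta)))
    ok &= pi @ (r - delta) >= V - 1e-9
print(ok)  # prints True: every feasible perturbation respects the bound
```

Equality is attained only at the worst-case perturbation Δr(a) = r(a) − V; everywhere else in the feasible set the regularised policy does strictly better than V, which is the robustness guarantee the figure illustrates.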

Related work and further reading

  • Read our paper for all technical details and derivations. All statements from the blog post are made formal and precise (including the generalisation guarantee). We also give more background on convex duality and an example of a 2D, sequential grid-world task.
  • To build intuition and understanding via the single-step case, see Pedro Ortega’s paper, which provides a derivation of the adversarial interpretation and a number of instructive examples.
  • Esther Derman’s recent paper (2021) derives similar results, but starts by defining the set of perturbations to be robust against, and then derives the corresponding regulariser.




We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work:
