Progress on Causal Influence Diagrams

What are causal influence diagrams?

A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).

CIDs can be used to model:

  • Agent decisions
  • Agent objectives
  • Causal relationships in the environment
  • Agent information constraints
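
As a rough illustration of these four ingredients, a CID can be written down as a small graph structure in plain Python. The sketch below encodes a one-step MDP. The class name `CID`, its fields, and the node names `S1`, `A1`, `S2`, `R2` are assumptions made for this example only; this is not PyCID's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CID:
    """Hypothetical minimal container for a causal influence diagram."""
    edges: frozenset      # directed edges as (parent, child) pairs
    decisions: frozenset  # nodes whose value the agent chooses
    utilities: frozenset  # nodes the agent is trying to maximise

    @property
    def nodes(self):
        return frozenset(n for edge in self.edges for n in edge)

    @property
    def chance_nodes(self):
        # Everything that is neither a decision nor a utility.
        return self.nodes - self.decisions - self.utilities

    def information_links(self):
        # Edges into a decision node: what the agent knows before acting.
        return {(p, c) for (p, c) in self.edges if c in self.decisions}

    def causal_links(self):
        # All remaining edges describe causal relationships in the environment.
        return set(self.edges) - self.information_links()


# One-step MDP: the agent sees S1, picks A1, which affects S2 and hence R2.
one_step_mdp = CID(
    edges=frozenset({("S1", "A1"), ("S1", "S2"), ("A1", "S2"), ("S2", "R2")}),
    decisions=frozenset({"A1"}),
    utilities=frozenset({"R2"}),
)

print(sorted(one_step_mdp.chance_nodes))
print(sorted(one_step_mdp.information_links()))
```

Treating every edge into a decision node as an information link keeps the agent's information constraints in the same graph as the environment's causal structure.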

Incentive Concepts

Having a unified language for objectives and training setups enables us to develop generally applicable concepts and results. We define four such concepts in Agent Incentives: A Causal Perspective (AAAI-21):

  • Value of information: what does the agent want to know before making a decision?
  • Response incentive: what changes in the environment do optimal agents respond to?
  • Value of control: what does the agent want to control?
  • Instrumental control incentive: what is the agent both interested and able to control?

As an example, consider a one-step MDP in which the agent observes the initial state S₁ and chooses an action, which then influences the next state S₂ and the reward R₂:

  • For S₁, an optimal agent would act differently (i.e. respond) if S₁ changed, and would value knowing and controlling S₁, but it cannot influence S₁ with its action. So S₁ has value of information, response incentive, and value of control, but not an instrumental control incentive.
  • For S₂ and R₂, an optimal agent could not respond to changes, nor know them before choosing its action, so these have neither value of information nor a response incentive. But the agent would value controlling them, and is able to influence them, so S₂ and R₂ have value of control and instrumental control incentive.
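
Of the four concepts, the instrumental control incentive has the simplest graphical reading in a single-decision CID: a node X can have one only if X lies on a directed path from the decision to a utility node. The sketch below checks that condition on the S₁, S₂, R₂ example. The edge list and the names `A1`, `descendants`, and `admits_ici` are assumptions for illustration, and the other three concepts are not covered here.

```python
def descendants(edges, start):
    """Nodes reachable from `start` along directed edges, including `start`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for parent, child in edges:
            if parent == node and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def admits_ici(edges, decision, utilities, x):
    """True if some directed path decision -> ... -> x -> ... -> utility exists."""
    reachable_from_decision = x in descendants(edges, decision)
    reaches_a_utility = any(u in descendants(edges, x) for u in utilities)
    return reachable_from_decision and reaches_a_utility


# Assumed one-step MDP graph: S1 is observed by A1; S1 and A1 affect S2; S2 sets R2.
edges = [("S1", "A1"), ("S1", "S2"), ("A1", "S2"), ("S2", "R2")]

for node in ["S1", "S2", "R2"]:
    print(node, admits_ici(edges, decision="A1", utilities={"R2"}, x=node))
```

The check reports no instrumental control incentive on S1, because the action has no path to it, and reports one on S2 and R2, matching the two bullets above.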

User Interventions and Interruption

Let us next turn to some recent applications of these concepts. In How RL Agents Behave when their Actions are Modified (AAAI-21), we study how different RL algorithms react to user interventions such as interruptions and overridden actions. For example, Saunders et al. developed a method for safe exploration where a user overrides dangerous actions. Alternatively, agents might get interrupted if analysis of their “thoughts” (or internal activations) suggests they are planning something dangerous. How do such interventions affect the incentives of various RL algorithms?

  • Black-box optimization algorithms such as evolutionary strategies take into account all causal relationships.
  • In contrast, the update rule of Q-learning effectively assumes that the next action will be taken optimally, with no action-modification. This means that Q-learners ignore causal effects PA → Aᵢ. Similarly, SARSA with the action chosen by the agent in the TD-update assumes that it will be in control of its next action. We call this version virtual SARSA.
  • SARSA based on the modified action (empirical SARSA) ignores the effect of action-modification on the current action, but takes into account the effect on subsequent actions.
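
The difference between these update rules comes down to which next action the TD target bootstraps from. The sketch below writes out the three targets side by side. The tabular `q` dictionary, the function names, and the state and action labels are assumptions for illustration; the papers' experimental setup is not reproduced here.

```python
def q_learning_target(q, reward, next_state, gamma):
    # Bootstraps from the greedy next action, as if no modification could occur.
    return reward + gamma * max(q[next_state].values())


def virtual_sarsa_target(q, reward, next_state, chosen_action, gamma):
    # Bootstraps from the action the agent itself selected.
    return reward + gamma * q[next_state][chosen_action]


def empirical_sarsa_target(q, reward, next_state, executed_action, gamma):
    # Bootstraps from the action actually executed after any modification.
    return reward + gamma * q[next_state][executed_action]


# Hypothetical situation: in state "s1" the agent picks "risky",
# but an overseer overrides it, so "safe" is what gets executed.
q = {"s1": {"risky": 1.0, "safe": 0.5}}
reward, gamma = 0.0, 0.9
chosen, executed = "risky", "safe"

print(q_learning_target(q, reward, "s1", gamma))
print(virtual_sarsa_target(q, reward, "s1", chosen, gamma))
print(empirical_sarsa_target(q, reward, "s1", executed, gamma))
```

Only the empirical SARSA target changes when the override happens, which is why it is the variant that takes the effect of modification on subsequent actions into account.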

Reward Tampering

Another AI safety problem that we have studied with CIDs is reward tampering. Reward tampering can take several different forms, including the agent:

  • rewriting the source code of its implemented reward function (“wireheading”),
  • influencing users that train a learned reward model (“feedback tampering”),
  • manipulating the inputs that the reward function uses to infer the state (“RF-input tampering / delusion box problems”).

Multi-Agent CIDs

Many interesting incentive problems arise when multiple agents interact, each trying to optimize their own reward while they simultaneously influence each other’s payoff. In Equilibrium Refinements in Multi-Agent Influence Diagrams (AAMAS-21), we build on the seminal work by Koller and Milch to lay foundations for understanding multi-agent situations with multi-agent CIDs (MACIDs).
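
To make the idea of agents influencing each other's payoff concrete, here is a minimal sketch that finds pure-strategy Nash equilibria in a two-player normal-form game. This is a much simpler object than a MACID and does not implement the equilibrium refinements from the paper; the function name and the prisoner's dilemma payoffs are assumptions chosen for illustration.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """payoffs maps (row_action, col_action) -> (row_payoff, col_payoff)."""
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r, c in product(rows, cols):
        # Neither player can gain by deviating unilaterally.
        row_ok = all(payoffs[(r, c)][0] >= payoffs[(r2, c)][0] for r2 in rows)
        col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, c2)][1] for c2 in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


# Illustrative prisoner's dilemma payoffs (higher is better).
prisoners_dilemma = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"): (-3, 0),
    ("defect", "cooperate"): (0, -3),
    ("defect", "defect"): (-2, -2),
}

print(pure_nash_equilibria(prisoners_dilemma))
```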


To help us with our research on CIDs and incentives, we’ve developed a Python library called PyCID, which offers:

  • A convenient syntax for defining CIDs and MACIDs,
  • Methods for computing optimal policies, Nash equilibria, d-separation, interventions, probability queries, incentive concepts, graphical criteria, and more,
  • Random generation of (MA)CIDs, and pre-defined examples.

Looking ahead

Ultimately, we hope to contribute to a more careful understanding of how design, training, and interaction shape an agent’s behavior. We hope that a precise and broadly applicable language based on CIDs will enable clearer reasoning and communication on these issues, and facilitate a cumulative understanding of how to think about and design powerful AI systems.

Directions we intend to pursue include:

  • Extending the general incentive concepts to multiple decisions and multiple agents.
  • Applying them to fairness and other AGI safety settings.
  • Analysing limitations that have been identified with work so far: first, the issues raised by Armstrong and Gorman; second, concepts broader than instrumental control incentives, since influence can also be incentivized as a side effect of an objective.
  • Probing further at their philosophical foundations, and establishing a clearer semantics for decision and utility nodes.

DeepMind Safety Research
