What mechanisms drive agent behaviour?

Mar 5, 2021

By the Safety Analysis Team: Grégoire Déletang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, and Pedro A. Ortega.

TL;DR: To study agent behaviour we must use the tools of causal analysis rather than rely on observation alone. Our paper outlines a rigorous methodology for uncovering the agents’ causal mechanisms.

Understanding the mechanisms that drive agent behaviour is an important challenge in AI safety. In order to diagnose faulty behaviour, we need to understand why agents do what they do. As is the case in medical trials, it is not sufficient to observe that a treatment correlates with a recovery rate; instead we are interested in whether the treatment causes the recovery. In order to address such “why” questions in a systematic manner we can use targeted manipulations and causal models.

However, large AI systems can operate like black boxes. Even if we know their entire blueprint (architecture, learning algorithms, and training data), predicting their behaviour can still be beyond our reach, because the complex interplay between the parts is intractable to analyse. As agents grow more complex, this limitation will persist. We therefore need black-box methodologies for finding simple, intuitive causal explanations that humans can easily understand and that are good enough to predict an agent’s behaviour.

In our recent work we describe the methodology we use for analysing AI agents. This methodology encourages analysts to experiment and to rigorously characterise causal models of agent behaviour.

Analysis (Software) Components

The methodology uses three components: an agent to be studied, a simulator, and a causal reasoning engine.

Agent: Typically this is an agent provided to us by an agent builder. It could be an IMPALA agent that has been meta-trained on a distribution over grid-world mazes. Often the agent builders already have a few specific questions they’d like us to investigate.
Simulator — “the agent debugger”: Our experimentation platform. With it, we can simulate the agent and run experiments. Furthermore, it allows us to perform all sorts of operations we’d usually expect from a debugger, such as stepping forward/backward in the execution trace, setting breakpoints, and setting/monitoring variables.
We also use the simulator to generate data for the estimation of statistical parameters. Since we can manipulate factors in the environment, the data we collect is typically interventional and thus contains causal information. This is illustrated in Figure 1 below.
Causal reasoning engine: This automated reasoning system allows us to specify and query causal models with associational, interventional, and counterfactual questions. We use these models to validate causal hypotheses. A model is shown in Figure 2 below.

***Figure 1. The simulator:*** our experimentation platform. Starting from an initial state (root node, upper-left) the simulator allows us to execute a trace of interactions. We can also perform interventions, such as changing the random seed, forcing the agent to pick desired actions, and manipulating environmental factors. These interventions create new branches of the execution trace.
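To make the branching idea concrete, here is a minimal sketch of an execution-trace tree. Everything in it is hypothetical: `TraceNode`, the one-dimensional toy environment, and the random stand-in policy are illustrative assumptions, not the actual simulator.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    """One point in an execution trace; children are alternative continuations."""

    position: int          # toy environment state: agent position on a line
    seed: int              # random seed in force from this node onwards
    label: str = "root"
    children: list[TraceNode] = field(default_factory=list)

    def _branch(self, position: int, seed: int, label: str) -> TraceNode:
        # Every operation adds a new child, so earlier branches stay intact.
        node = TraceNode(position, seed, label)
        self.children.append(node)
        return node

    def step(self, forced_action: int | None = None) -> TraceNode:
        """Advance one step; a forced action overrides the (random) stand-in policy."""
        rng = random.Random(self.seed)
        policy_action = rng.choice([-1, 1])
        next_seed = rng.randrange(2**32)
        if forced_action is None:
            return self._branch(self.position + policy_action, next_seed,
                                f"step {policy_action:+d}")
        return self._branch(self.position + forced_action, next_seed,
                            f"force {forced_action:+d}")

    def reseed(self, seed: int) -> TraceNode:
        """Intervention: continue from the same state with a different seed."""
        return self._branch(self.position, seed, f"seed={seed}")


root = TraceNode(position=0, seed=0)
natural = root.step()                 # the stand-in agent acts on its own
forced = root.step(forced_action=+1)  # intervention: force the action
reseeded = root.reseed(seed=123)      # intervention: change the random seed
print([child.label for child in root.children])
```

Because each call returns a new node rather than mutating the current one, stepping "backward" amounts to returning to a parent node and branching again.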

**Figure 2. A causal model**, represented as a causal Bayesian network.

Analysis Methodology

Whenever we analyse an agent, we repeat the following five steps until we reach a satisfactory understanding.

Exploratory analysis: We place the trained agent into one or more test environments and probe its behaviour. This will give us a sense of what the relevant factors of behaviour are. It is the starting point for formulating our causal hypotheses.
Identify the relevant abstract variables: We choose a collection of variables that we deem relevant for addressing our questions. For instance, possible variables are: “does the agent collect the key?”, “is the door open?”, etc.
Gather data: We perform experiments in order to collect statistics for specifying the conditional probability tables in our causal model. Typically this involves producing thousands of rollouts under different conditions/interventions.
Formulate the causal model: We formulate a structural causal model (SCM) encapsulating all causal and statistical assumptions. This is our explanation for the agent’s behaviour.
Query the causal model: Finally, we query the causal model to answer the questions we have about the agent.
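As a rough illustration of the data-gathering step, the sketch below estimates a small conditional probability table from rollouts collected under an intervention. It borrows the key and door variables mentioned above; `rollout` is a hypothetical stand-in for one simulator episode, and the behaviour probabilities inside it are made up for illustration.

```python
import random


def rollout(door_open: bool, rng: random.Random) -> bool:
    """Hypothetical stand-in for one simulated episode.

    The door state is set by intervention; the return value says whether
    the agent collected the key. The probabilities below are invented.
    """
    p_collect = 0.1 if door_open else 0.9
    return rng.random() < p_collect


def estimate_cpt(n_rollouts: int = 5000, seed: int = 0) -> dict[bool, float]:
    """Estimate P(key collected | do(door open = d)) for both values of d."""
    rng = random.Random(seed)
    cpt = {}
    for door_open in (True, False):
        collected = sum(rollout(door_open, rng) for _ in range(n_rollouts))
        cpt[door_open] = collected / n_rollouts
    return cpt


cpt = estimate_cpt()
for door_open, p in cpt.items():
    print(f"P(key collected | do(door open = {door_open})) ~ {p:.3f}")
```

Because the door state is set by the analyst rather than merely observed, the resulting frequencies can be read as interventional quantities.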

Let’s have a look at an example.

Example: Causal effects under confounding

An important challenge of agent training is to make sure that the resulting agent makes the right choices for the right reasons. However, if the agent builder does not carefully curate the training data, the agent might pick up on unintended, spurious correlations to solve a task [1]. This is especially the case when the agent’s policy is implemented with a deep neural network. The problem is that policies that base their decisions on accidental correlations do not generalise.

Unfortunately, all too often when we observe an agent successfully performing a task, we are tempted to jump to premature conclusions. If we see the agent repeatedly navigating from a starting position to a desired target, we might conclude that it did so because it is sensitive to the location of the target.

For instance, consider the two T-shaped mazes shown below (the “grass-sand environments”). We are given two pre-trained agents, A and B. Both of them always solve the task by choosing the terminal containing a rewarding pill. As analysts, we are tasked with verifying that they pick the correct terminal because they follow the rewarding pill.

***Figure 3. Grass-Sand environments:*** In these two T-shaped mazes, the agent can choose between two terminal states, only one of which contains a rewarding pill. During tests, we observe that a pre-trained agent always successfully navigates to the location of the pill.

However, in these mazes the floor type happens to be perfectly correlated with the location of the rewarding pill: when the floor is grass, the pill is always located on one side, and when the floor is sand, the pill is on the other side. Thus, could the agents be basing their decision on the floor type, rather than on the location of the pill? Because the floor type is the more salient feature of the two (spanning more tiles), this is a plausible explanation if an agent was only trained on these two mazes.

As it turns out, we can’t tell whether the decision is based upon the location of the rewarding pill through observation alone.

During our exploratory analysis we performed two experiments. In the first, we manipulated the location of the reward pill; in the second, the type of floor. We noticed that agents A and B respond differently to these changes. This led us to choose the following variables for modelling the situation: location of the reward pill (R, values in {left, right}), type of floor (F, values in {grass, sand}), and terminal chosen (T, values in {left, right}). Because the location of the pill and the floor type are correlated, we hypothesised the existence of a confounding variable (C, values in {world 1, world 2}). In this case, all variables are binary. The resulting causal model is shown below. We estimated the conditional probability tables for this model by running many controlled experiments in the simulator. We did this for both agents, resulting in two causal models.

***Figure 4. Causal model for the grass-sand environment.*** *The variables are C (confounder), R (location of reward pill), F (type of floor), and T (choice of terminal state).*

Now that we have concrete formal causal models for explaining the behaviour of both agents, we are ready to ask questions:

Association between T and R: Given the location of the reward pill, do agents pick the terminal at the same location? Formally, this is
P(T = left | R = left) and P(T = right | R = right).
Causation from R to T: Given that we set the location of the reward pill, do agents pick the terminal at the same location? In other words, can we causally influence the agent’s choice by changing the location of the reward? Formally, this is given by
P(T = left | do(R = left)) and P(T = right | do(R = right)).
Causation from F to T: Finally, we want to investigate whether our agents are sensitive to the floor type. Can we influence the agent’s choice by setting the floor type? To answer this, we could query the probabilities
P(T = left | do(F = grass)) and P(T = right | do(F = sand)).
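To see why the associational and causal queries can differ, it helps to write them out. The derivation below is a sketch that assumes the edges C → R, C → F, R → T, and F → T, which is how we read the model in Figure 4; if the actual graph differs, the formulas change accordingly.

```latex
\begin{align*}
P(C, R, F, T) &= P(C)\, P(R \mid C)\, P(F \mid C)\, P(T \mid R, F) \\[4pt]
P(T \mid R = r) &= \sum_{f} P(F = f \mid R = r)\, P(T \mid R = r, F = f) \\[4pt]
P(T \mid \mathrm{do}(R = r)) &= \sum_{f} P(F = f)\, P(T \mid R = r, F = f) \\[4pt]
P(T \mid \mathrm{do}(F = f)) &= \sum_{r} P(R = r)\, P(T \mid R = r, F = f)
\end{align*}
```

The only difference between the second and third lines is the weight on the floor type: conditioning uses P(F = f | R = r), whereas intervening uses the marginal P(F = f). When R and F are perfectly correlated, the observational weight is zero for every mismatched combination of pill location and floor type, so those entries of P(T | R, F) never show up in observational data and have to be estimated from interventional rollouts.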

The results are shown in the table below.

First, we confirm that, observationally, both agents pick the terminal with the reward. However, when we intervene on the position of the reward, we see a difference: agent A’s choice is indifferent to the location of the reward pill (probability close to 0.5), whereas agent B follows the pill. When we intervene on the floor type instead, the pattern reverses: agent A seems to choose according to the floor type, while agent B is insensitive to it. This answers our question about the two agents. Importantly, we could only reach these conclusions because we actively intervened on the hypothesised causes.
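This qualitative pattern can be reproduced with two idealised, deterministic policies. The sketch below is not the model estimated in our experiments. It makes several assumptions: the two worlds are equally likely, grass goes with a pill on the left and sand with a pill on the right (the post does not say which is which), and `floor_follower` and `pill_follower` are hypothetical stand-ins for agents A and B.

```python
# Assumed confounded worlds: C determines both pill location R and floor type F.
WORLDS = {"world 1": ("left", "grass"), "world 2": ("right", "sand")}
P_WORLD = {"world 1": 0.5, "world 2": 0.5}  # assumed prior over C


def floor_follower(r: str, f: str) -> str:
    """Idealised stand-in for agent A: ignores the pill, follows the floor."""
    return "left" if f == "grass" else "right"


def pill_follower(r: str, f: str) -> str:
    """Idealised stand-in for agent B: ignores the floor, follows the pill."""
    return r


def joint(policy, do_r=None, do_f=None):
    """Return {(r, f, t): probability}, with optional interventions on R or F."""
    dist = {}
    for world, p in P_WORLD.items():
        r, f = WORLDS[world]
        r = do_r if do_r is not None else r  # do(R = r) cuts the edge C -> R
        f = do_f if do_f is not None else f  # do(F = f) cuts the edge C -> F
        key = (r, f, policy(r, f))
        dist[key] = dist.get(key, 0.0) + p
    return dist


def prob_t_left(dist, r=None):
    """P(T = left), optionally conditioned on observing R = r."""
    kept = {k: p for k, p in dist.items() if r is None or k[0] == r}
    total = sum(kept.values())
    return sum(p for (_, _, t), p in kept.items() if t == "left") / total


def queries(policy):
    return {
        "P(T=left | R=left)": prob_t_left(joint(policy), r="left"),
        "P(T=left | do(R=left))": prob_t_left(joint(policy, do_r="left")),
        "P(T=left | do(F=grass))": prob_t_left(joint(policy, do_f="grass")),
    }


for name, policy in [("A (floor follower)", floor_follower),
                     ("B (pill follower)", pill_follower)]:
    print(name, queries(policy))
```

Both stand-in policies give the same answer to the associational query, and only the two interventional queries tell them apart.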

More examples

Besides showing how to investigate causal effects under confounding, our work also covers five additional questions that are typical in agent analysis. Each question is illustrated with a toy example.

How would you solve them? Can you think of a good causal model for each situation? The problems are:

Testing for memory use: An agent with limited visibility (it can only see its adjacent tiles) has to remember a cue at the beginning of a T-maze. The cue tells it where to go to collect a rewarding pill (left or right exit). You observe that the agent always picks the correct exit. How would you test whether it is using its internal memory for solving the task?
Testing for generalisation: An agent is placed in a square room where a reward pill is placed in a randomly chosen location. You observe that the agent always collects the reward. How would you test whether this behaviour generalises?
Estimating a counterfactual behaviour: There are two doors, each leading into a room containing a red and a green reward pill. Only one door is open, and you observe the agent picking up the red pill. If the other door had been open instead, what would the agent have done?
Which is the correct causal model? You observe several episodes in which two agents, red and blue, simultaneously move one step in mostly the same direction. You know that one of them chooses the direction and the other tries to follow. How would you find out who’s the leader and who’s the follower?
Understanding the causal pathways leading up to a decision: An agent starts in a room with a key and a door leading to a room with a reward pill. Sometimes the door is open, and other times the door is closed and the agent has to use the key to open it. How would you test whether the agent understands that the key is only necessary when the door is closed?

Find out the answers and more in our paper. Link to the paper here.

[1] Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

We would like to thank Jon Fildes for his help with this post.

Written by DeepMind Safety Research