What mechanisms drive agent behaviour?

By the Safety Analysis Team: Grégoire Déletang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, and Pedro A. Ortega.

TL;DR: To study agent behaviour we must use the tools of causal analysis rather than rely on observation alone. Our paper outlines a rigorous methodology for uncovering the agents’ causal mechanisms.

Understanding the mechanisms that drive agent behaviour is an important challenge in AI safety. In order to diagnose faulty behaviour, we need to understand why agents do what they do. As is the case in medical trials, it is not sufficient to observe that a treatment correlates with a recovery rate; instead we are interested in whether the treatment causes the recovery. In order to address such “why” questions in a systematic manner we can use targeted manipulations and causal models.

However, large AI systems can operate like black boxes. Even if we know their entire blueprint (architecture, learning algorithms, and training data), predicting their behaviour can still be beyond our reach, because the complex interplay between their parts is intractable to analyse. This limitation will only persist as agents grow more complex. We therefore need black-box methodologies for finding simple, intuitive causal explanations that humans can easily understand and that are good enough to predict agent behaviour.

In our recent work we describe the methodology we use for analysing AI agents. This methodology encourages analysts to experiment and to rigorously characterise causal models of agent behaviour.

Analysis (Software) Components

The methodology uses three components: an agent to be studied, a simulator, and a causal reasoning engine.

  1. Agent: Typically this is an agent provided to us by an agent builder. It could be an IMPALA agent that has been meta-trained on a distribution over grid-world mazes. Often the agent builders already have a few specific questions they’d like us to investigate.
  2. Simulator: Our experimentation platform. Starting from an initial state, it executes traces of agent–environment interactions and lets us perform interventions, such as changing the random seed, forcing the agent to pick desired actions, or manipulating environmental factors; each intervention branches the execution trace (Figure 1).
  3. Causal reasoning engine: A tool for specifying causal models of behaviour, represented as causal Bayesian networks (Figure 2), and for answering queries about them.
Figure 1. The simulator: our experimentation platform. Starting from an initial state (root node, upper-left) the simulator allows us to execute a trace of interactions. We can also perform interventions, such as changing the random seed, forcing the agent to pick desired actions, and manipulating environmental factors. These interventions create new branches of the execution trace.
Figure 2. A causal model, represented as a causal Bayesian network.
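To make the simulator’s role concrete, here is a minimal sketch of such an interface (the class and function names are hypothetical illustrations, not the actual analysis software): an execution trace grows from an initial state, and an intervention, such as forcing an action, branches the trace into a new timeline.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Node:
    """One state in the execution trace; interventions create extra children."""
    state: dict
    children: list = field(default_factory=list)

class Simulator:
    def __init__(self, initial_state):
        self.root = Node(copy.deepcopy(initial_state))

    def step(self, node, transition, action=None):
        """Extend the trace from `node`; passing `action` forces the agent's
        choice, which is an intervention that branches the trace."""
        child = Node(transition(copy.deepcopy(node.state), action))
        node.children.append(child)
        return child

# Toy dynamics: the agent moves right by default unless an action is forced.
def transition(state, action):
    state["pos"] += 1 if (action or "right") == "right" else -1
    return state

sim = Simulator({"pos": 0})
a = sim.step(sim.root, transition)                  # default behaviour
b = sim.step(sim.root, transition, action="left")   # intervention: new branch
print(a.state["pos"], b.state["pos"], len(sim.root.children))  # 1 -1 2
```

Because each intervention copies the state before modifying it, the original branch stays intact and the two timelines can be compared side by side.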

Analysis Methodology

Whenever we analyse an agent, we repeat the following five steps until we reach a satisfactory understanding.

  1. Exploratory analysis: We place the trained agent into one or more test environments and probe its behaviour. This gives us a sense of what the relevant factors of behaviour are. It is the starting point for formulating our causal hypotheses.
  2. Identifying the relevant variables: Based on the exploratory analysis, we choose the abstract variables (environmental factors and agent choices) that we believe drive the behaviour.
  3. Formulating a causal hypothesis: We propose a causal graph over these variables, including possible confounders.
  4. Performing experiments: Using the simulator, we run controlled experiments with targeted interventions to estimate the model’s conditional probability tables.
  5. Querying the causal model: Finally, we answer our questions by computing queries on the estimated model.

Let’s have a look at an example.

Example: Causal effects under confounding

An important challenge of agent training is to make sure that the resulting agent makes the right choices for the right reasons. However, if the agent builder does not carefully curate the training data, the agent might pick up on unintended, spurious correlations to solve a task [1]. This is especially the case when the agent’s policy is implemented with a deep neural network. The problem is that policies that base their decisions on accidental correlations do not generalise.

Unfortunately, all too often when we observe an agent successfully performing a task, we are tempted to jump to premature conclusions. If we see the agent repeatedly navigating from a starting position to a desired target, we might conclude that the agent did so because the agent is sensitive to the location of the target.

For instance, consider the two T-shaped mazes shown below (the “grass-sand environments”). We are given two pre-trained agents, A and B. Both always solve the task by choosing the terminal containing a rewarding pill. As analysts, we are tasked with verifying that they pick the correct terminal because they follow the rewarding pill.

Figure 3. Grass-Sand environments: In these two T-shaped mazes, the agent can choose between one of two terminal states, only one of which contains a rewarding pill. During tests, we observe that a pre-trained agent always successfully navigates to the location of the pill.

However, in these mazes the floor type happens to be perfectly correlated with the location of the rewarding pill: when the floor is grass, the pill is always located on one side, and when the floor is sand, the pill is on the other side. Thus, could the agents be basing their decision on the floor type, rather than on the location of the pill? Because the floor type is the more salient feature of the two (spanning more tiles), this is a plausible explanation if an agent was only trained on these two mazes.

As it turns out, through observation alone we cannot tell whether the decision is based on the location of the rewarding pill or on the floor type.

During our exploratory analysis we performed two experiments. In the first, we manipulated the location of the reward pill; and in the second, the type of floor. We noticed that agents A and B respond differently to these changes. This led us to choose the following variables for modelling the situation: location of the reward pill (R, values in {left, right}), type of floor (F, values in {grass, sand}), and terminal chosen (T, {left, right}). Because the location of the pill and the floor type are correlated, we hypothesised the existence of a confounding variable (C, values in {world 1, world 2}). In this case, all variables are binary. The resulting causal model is shown below. The conditional probability tables for this model were estimated by running many controlled experiments using the simulator. This is done for both agents, resulting in two causal models.

Figure 4. Causal model for the grass-sand environment. The variables are C (confounder), R (location of reward pill), F (type of floor), and T (choice of terminal state).
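To illustrate how such conditional probability tables can be estimated, the sketch below counts frequencies over many simulated episodes of a stand-in for agent A (the episode sampler and all names are hypothetical; in practice the data would come from controlled runs in the simulator).

```python
import random
from collections import Counter

random.seed(0)

# Stand-in for one observational episode: sample the confounder C, derive the
# pill location R and floor type F from it, and record the choice T of an
# agent that (like agent A) follows the floor type.
def run_episode():
    c = random.choice([1, 2])
    r = "left" if c == 1 else "right"    # world 1: pill left;  world 2: pill right
    f = "grass" if c == 1 else "sand"    # world 1: grass floor; world 2: sand floor
    t = "left" if f == "grass" else "right"
    return (r, f, t)

counts = Counter(run_episode() for _ in range(10_000))

def p_t_given(r, f):
    """Estimate P(T | R = r, F = f) by frequency counting."""
    sub = {t: n for (ri, fi, t), n in counts.items() if ri == r and fi == f}
    total = sum(sub.values())
    return {t: n / total for t, n in sub.items()}

print(p_t_given("left", "grass"))  # {'left': 1.0}
# p_t_given("left", "sand") cannot be estimated here: that combination never
# occurs observationally, which is exactly why controlled interventions in the
# simulator are needed to fill in the table.
```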

Now that we have concrete formal causal models for explaining the behaviour of both agents, we are ready to ask questions:

  1. Association between T and R: Given the location of the reward pill, do agents pick the terminal at the same location? Formally, this is
    P( T = left | R = left ) and P( T = right | R = right ).
  2. Causal effect of R on T: If we intervene and move the reward pill, do the agents follow it? Formally, this is
    P( T = left | do(R = left) ) and P( T = right | do(R = right) ).

The results are shown in the table below.

First, we confirm that, observationally, both agents pick the terminal with the reward. However, when we intervene and change the position of the reward, a difference emerges: agent A is indifferent (probability close to 0.5) to the location of the reward pill, whereas agent B follows it. Instead, agent A chooses according to the floor type, while agent B is insensitive to it. This answers our question about the two agents. Importantly, we could only reach these conclusions because we actively intervened on the hypothesised causes.
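The contrast between conditioning and intervening can be reproduced by enumeration over the causal model of Figure 4. The deterministic probability tables below are illustrative stand-ins chosen for clarity, not the values estimated in the paper:

```python
# Causal model of Figure 4 with illustrative deterministic tables.
P_C = {1: 0.5, 2: 0.5}                           # confounder: world 1 or world 2
P_R_given_C = {1: {"left": 1.0, "right": 0.0},   # pill location
               2: {"left": 0.0, "right": 1.0}}
P_F_given_C = {1: {"grass": 1.0, "sand": 0.0},   # floor type
               2: {"grass": 0.0, "sand": 1.0}}

def policy_A(r, f):   # agent A: choice depends only on the floor type
    return {"left": 1.0, "right": 0.0} if f == "grass" else {"left": 0.0, "right": 1.0}

def policy_B(r, f):   # agent B: choice depends only on the pill location
    return {"left": 1.0, "right": 0.0} if r == "left" else {"left": 0.0, "right": 1.0}

def query(policy, side, do=False):
    """P(T = side | R = side) if do=False, else P(T = side | do(R = side)).
    Intervening cuts the C -> R edge, so R no longer carries information about C."""
    num = den = 0.0
    for c, pc in P_C.items():
        for f, pf in P_F_given_C[c].items():
            w = pc * pf * (1.0 if do else P_R_given_C[c][side])
            num += w * policy(side, f)[side]
            den += w
    return num / den

print(query(policy_A, "left"))           # 1.0: A looks like it follows the pill...
print(query(policy_A, "left", do=True))  # 0.5: ...but intervening shows it doesn't
print(query(policy_B, "left", do=True))  # 1.0: B genuinely follows the pill
```

Under these tables both agents score perfectly on the associational query; only the interventional query separates the floor-follower from the pill-follower.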

More examples

Besides showing how to investigate causal effects under confounding, our work also illustrates five additional questions that arise in agent analysis, each worked through with a toy example.

How would you solve them? Can you think of a good causal model for each situation? The problems are:

  1. Testing for memory use: An agent with limited visibility (it can only see its adjacent tiles) has to remember a cue at the beginning of a T-maze. The cue tells it where to go to collect a rewarding pill (left or right exit). You observe that the agent always picks the correct exit. How would you test whether it is using its internal memory for solving the task?
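One way to approach the memory question, sketched below with a toy stand-in for the agent (the episode loop and the memory-scrubbing intervention are hypothetical illustrations, not the paper’s setup): intervene on the agent’s internal memory mid-episode and check whether its success rate drops to chance.

```python
import random

random.seed(0)

# Toy T-maze episode: the cue is visible only at the start; the agent must
# carry it in memory to pick the correct exit. `scrub_memory` is the
# intervention: overwrite the agent's memory mid-corridor with a random value.
def run_episode(agent_step, scrub_memory=False):
    cue = random.choice(["left", "right"])
    memory = cue                       # the agent encodes the cue
    for t in range(5):                 # corridor steps with no cue visible
        if scrub_memory and t == 2:
            memory = random.choice(["left", "right"])  # intervention on memory
        memory = agent_step(memory)
    return memory == cue               # did it take the correct exit?

def perfect_memory_agent(memory):
    return memory                      # carries the cue forward unchanged

n = 2000
base = sum(run_episode(perfect_memory_agent) for _ in range(n)) / n
scrub = sum(run_episode(perfect_memory_agent, scrub_memory=True) for _ in range(n)) / n
print(base)   # 1.0: always correct with memory intact
print(scrub)  # ~0.5: at chance once memory is scrubbed, so memory is causally used
```

If scrubbing the memory left performance unchanged, the agent would have to be solving the task from currently visible features instead, and the memory hypothesis would be falsified.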

Find out the answers and more in our paper.

[1] Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

We would like to thank Jon Fildes for his help with this post.
