AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

DeepMind Safety Research
11 min read · Oct 18, 2024


By Rohin Shah, Seb Farquhar, and Anca Dragan

It’s been a while since our last update, so we wanted to share a recap of our recent outputs. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?

We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?

Here is some key work published in 2023 and the first part of 2024, grouped by topic / sub-team.

Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.

Frontier Safety

The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models.

FSF

We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc).

We’re excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups.

A key area of focus as we pilot the Framework is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions.

Some commentary also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence.

Dangerous Capability Evaluations

Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluations published so far, and to the best of our knowledge has informed the design of evaluations at other organizations. We regularly run and report these evaluations on our frontier models, including Gemini 1.0 (original paper), Gemini 1.5 (see Section 9.5.2), and Gemma 2 (see Section 7.4). We’re especially happy to have helped develop open sourcing norms through our Gemma 2 evals. We take pride in currently setting the bar on transparency around evaluations and implementation of the FSF, and we hope to see other labs adopt a similar approach.

Prior to that, we set the stage with Model evaluation for extreme risks, which laid out the basic principles behind dangerous capability evaluation. We also discussed designing evaluations more holistically, spanning present-day harms to extreme risks, in Holistic Safety and Responsibility Evaluations of Advanced AI Models.

Mechanistic Interpretability

Mechanistic interpretability is an important part of our safety strategy, and lately we’ve focused deeply on sparse autoencoders (SAEs). We released Gated SAEs and JumpReLU SAEs, new SAE architectures that substantially improved the Pareto frontier of reconstruction loss vs. sparsity. Both papers rigorously evaluate the architecture change with a blinded study of how interpretable the resulting features are, showing no degradation. Incidentally, Gated SAEs was the first public work we know of to scale and rigorously evaluate SAEs on LLMs with over a billion parameters (Gemma-7B).
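To make the architecture concrete, here is a minimal NumPy sketch of a JumpReLU SAE forward pass. All dimensions and weights here are toy and illustrative; real SAEs are trained on LLM residual-stream activations, with the thresholds learned alongside the other parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real SAEs operate on model residual streams).
d_model, d_sae = 16, 64

# Parameters: encoder, decoder, and a per-feature threshold.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)
theta = np.full(d_sae, 0.5)  # JumpReLU thresholds (learned in practice)

def jumprelu_sae(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    pre = x @ W_enc + b_enc
    # JumpReLU: zero out any feature whose pre-activation is at or below its
    # threshold; features above the threshold pass through unchanged.
    f = np.where(pre > theta, pre, 0.0)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = jumprelu_sae(x)
print("active features:", int((f > 0).sum()), "of", d_sae)
```

Training then balances reconstruction error against sparsity (how few features fire), which is exactly the Pareto frontier the papers measure.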

We’ve also been really excited to train and release Gemma Scope, an open, comprehensive suite of SAEs for Gemma 2 2B and 9B (every layer and every sublayer). We believe Gemma 2 sits at the sweet spot of “small enough that academics can work with them relatively easily” and “large enough that they show interesting high-level behaviors to investigate with interpretability techniques”. We hope this will make Gemma 2 the go-to models for academic/external mech interp research, and enable more ambitious interpretability research outside of industry labs. You can access Gemma Scope here, and there’s an interactive demo of Gemma Scope, courtesy of Neuronpedia.

You can also see a series of short blog posts on smaller bits of research in the team’s progress update in April.

Prior to SAEs, we worked on:

  • Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla: The key contribution here was to show that the circuit analysis techniques used in smaller models scaled: we gained significant understanding about how, after Chinchilla (70B) “knows” the answer to a multiple choice question, it maps that to the letter corresponding to that answer.
  • Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level: While this work didn’t reach its ambitious goal of mechanistically understanding how facts are computed in superposition in early MLP layers, it did provide further evidence that superposition is happening, and falsified some simple hypotheses about how factual recall might work. It also provided guidelines for future work in the area, such as viewing the early layers as producing a “multi-token embedding” that is relatively independent of prior context.
  • AtP∗: An efficient and scalable method for localizing LLM behaviour to components: A crucial aspect of circuit discovery is finding which components of the model are important for the behavior under investigation. Activation patching is the principled approach, but requires a separate pass for each component (comparable to training a model), whereas attribution patching is an approximation, but can be done for every component simultaneously with two forward & one backward pass. This paper investigated attribution patching, diagnosed two problems and fixed them, and showed that the resulting AtP* algorithm is an impressively good approximation to full activation patching.
  • Tracr: Compiled Transformers as a Laboratory for Interpretability: Enabled us to create Transformer weights where we know the ground truth answer about what the model is doing, allowing it to serve as a test case for our interpretability tools. We’ve seen a few cases where people used Tracr, but it hasn’t had as much use as we’d hoped for, because Tracr-produced models are quite different from models trained in the wild. (This was a known risk at the time the work was done, but we hoped it wouldn’t be too large a downside.)
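The activation-vs-attribution patching trade-off described above can be sketched in a toy setting. Here a scalar metric over component activations stands in for the model, and a single gradient estimates what would otherwise take one patched pass per component; all names and the metric are illustrative, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # number of model components (toy stand-in for heads/neurons)
w = rng.normal(size=n)

def metric(a):
    """Toy differentiable metric over component activations."""
    return np.tanh(a @ w)

def grad_metric(a):
    """Analytic gradient of the toy metric."""
    return (1 - np.tanh(a @ w) ** 2) * w

# Small activations keep us in the regime where the linear approximation holds.
a_clean = 0.1 * rng.normal(size=n)
a_corrupt = 0.1 * rng.normal(size=n)

# Activation patching: one pass per component, swapping in the corrupted
# activation and measuring the change in the metric.
exact = np.empty(n)
for i in range(n):
    a = a_clean.copy()
    a[i] = a_corrupt[i]
    exact[i] = metric(a) - metric(a_clean)

# Attribution patching: a single gradient evaluated on the clean run gives
# a first-order estimate of all n effects at once.
atp = (a_corrupt - a_clean) * grad_metric(a_clean)

print(np.corrcoef(exact, atp)[0, 1])  # close to 1 when effects are small
```

The AtP* paper starts from exactly this approximation, diagnoses where it breaks down (e.g. saturated nonlinearities), and patches those failure modes.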

Amplified Oversight

Our amplified oversight work aims to provide supervision on any single situation that is as close as possible to that of a human with complete understanding of all of the reasons that the AI system produced its output — including when the AI has a very broad range of superhuman capabilities. (The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

Theoretical Work on Debate

On the theoretical side, the original debate protocol enables a polynomial-time verifier to decide any problem in PSPACE given debates between optimal debaters. But our AI systems are not optimal, and we should not assume they are! It doesn’t matter if an optimal AI could refute lies, if the AI systems we train in practice cannot do so. The problem of obfuscated arguments arises exactly when a dishonest debater lies by breaking an easy problem down into hard subproblems that an optimal honest AI could answer but a bounded one could not.

Doubly-efficient debate provides a new debate protocol that enables a polynomial-time honest strategy to prove facts to an even more limited judge, even against an unbounded dishonest strategy. This is not quite yet what we want: the honest strategy is only polynomial-time in the length of the human-judgeable argument, whereas we would like it to be efficient in terms of the length of the AI-judgeable argument. We have some work in progress that we hope will cross this gap, and we expect that if we do cross the gap this will influence which protocols we study in our empirical work.

Empirical Work on Debate

On the empirical side, we ran inference-only experiments with debate that help challenge what the community expects. First, on tasks with information asymmetry, theory suggests that debate should be close to as good as (or even better than) giving the judge access to the full information, whereas in these inference-only experiments debate performs significantly worse. Second, on tasks without information asymmetry, weak judge models with access to debates don’t outperform weak judge models without debate. Third, we find only limited evidence that stronger debaters lead to much higher judge accuracy — and we really need to make this be the case for debate to succeed in the long run.

Qualitatively, our sense is that these issues occur because the models are not very good at judging debates: the actual debater arguments seem quite good. Our current work is looking into training our LLM judges to be better proxies of human judges, after which we plan to try finetuning the debaters using the debate protocol, and checking that this closes the gaps we’ve observed.

Causal Alignment

A long-running stream of research in our team explores how understanding causal incentives can contribute to designing safe AI systems. Causality gives us pretty general tools for understanding what agents that are ‘trying’ to achieve goals will do, and provides explanations for how they act. We developed algorithms for discovering agents, which can help us identify which parts of systems can be understood through an agent-lens. In principle, this could allow us to empirically discover goal-directed agents, and determine what they are optimizing for.

We have also shown that causal world models are a key aspect of agent robustness, suggesting that some causal tools are likely to apply to any sufficiently powerful agent. The paper got an Honorable Mention for Best Paper at ICLR 2024. This work continues to inform the development of safety mitigations that work by managing an agent’s incentives, such as methods based on process supervision. It can also be used to design consistency checks that look at long-run behavior of agents in environments, extending the more short-horizon consistency checks we have today.

Emerging Topics

We also do research that isn’t necessarily part of a years-long agenda, but is instead tackling one particular question, or investigating an area to see whether it should become one of our longer-term agendas. This has led to a few different papers:

One alignment hope that people have (or at least had in late 2022) is that there are only a few “truth-like” features in LLMs, so that we could enumerate them all, find the one that corresponds to the “model’s beliefs”, and use it to create an honest AI system. In Challenges with unsupervised LLM knowledge discovery, we aimed to convincingly rebut this intuition by demonstrating a large variety of “truth-like” features (particularly features that model the beliefs of other agents). We didn’t quite hit that goal, likely because our LLM wasn’t strong enough to show such features, but we did show the existence of many salient features that had at least the negation-consistency and confidence properties of truth-like features, which “tricked” several unsupervised knowledge discovery approaches.

Explaining grokking through circuit efficiency was a foray into “science of deep learning”. It tackles the question: in grokking, why does the network’s test performance improve dramatically upon continued training, despite having already achieved nearly perfect training performance? It gives a compelling answer to this question, and validates this answer by correctly predicting multiple novel phenomena in a similar setting. We hoped that better understanding of training dynamics would enable improved safety, but unfortunately that hope has mostly not panned out (though it is still possible that the insights would help with detection of new capabilities). We’ve decided not to invest more in “science of deep learning”, because there are other more promising things to do, but we remain excited about it and would love to see more research on it.

Power-seeking can be probable and predictive for trained agents is a short paper building on the power-seeking framework that shows how the risk argument would be made from the perspective of goal misgeneralization of a learned agent. It still assumes that the AI system is pursuing a goal, but specifies that the goal comes from a set of goals that are consistent with the behavior learned during training.

Highlights from Our Collaborations

We further broaden our portfolio by collaborating extensively with other teams at Google DeepMind, as well as external researchers.

Within Google, in addition to work that we do to inform the safe development of frontier models, we collaborate with our Ethics and Responsibility teams. In The Ethics of Advanced Assistants we helped consider the role of value alignment and issues around manipulation and persuasion as part of the ethical foundations for building artificial assistants. We further explored this with a focus on persuasion and a focus on justified trust.

Our researchers also do a lot of work with external collaborators and students. Sebastian Farquhar was lead author on Detecting hallucinations in large language models using semantic entropy, published in Nature and covered in roughly 500 newspapers and magazines. It explores a maximally simple case of ‘cross-examination’ showing how a model’s output consistency can be used to predict false answers. This work is now being taken forward by other teams in Google.
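The core idea of semantic entropy can be sketched in a few lines: sample several answers to the same question, cluster them by meaning, and compute the entropy over clusters. In this illustrative sketch the paper's bidirectional-entailment check (which uses an NLI model) is replaced by a stand-in normalized exact-match function.

```python
import math

def semantically_equivalent(a, b):
    # Stand-in for the paper's bidirectional-entailment check, which uses
    # an NLI model; here we just compare normalized strings.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(samples):
    """Cluster sampled answers by meaning, then take entropy over clusters."""
    clusters = []
    for s in samples:
        for c in clusters:
            if semantically_equivalent(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return sum(-p * math.log(p) for p in probs)

# A model that answers consistently has low semantic entropy; one that
# confabulates different answers each time has high semantic entropy.
print(semantic_entropy(["Paris", "paris", "Paris"]))     # 0.0
print(semantic_entropy(["Paris", "Lyon", "Marseille"]))  # ≈ 1.10
```

High entropy over meaning-clusters, rather than over raw strings, is what flags the confabulation-style hallucinations the paper targets.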

Many researchers on our team have mentored MATS scholars, including Arthur Conmy, Scott Emmons, Sebastian Farquhar, Victoria Krakovna, David Lindner, Neel Nanda, and Alex Turner, on wide-ranging topics including power-seeking, steering model behavior, and avoiding jailbreaks. Neel Nanda’s stream of interpretability work through MATS is especially prolific, producing more than ten papers including work on copy suppression, cheaply jailbreaking models using their internals, and interpreting attention layer outputs with SAEs.

Much of our work on causal incentives is done in collaboration with members of the causal incentives working group, including work on intention and instrumental goals, mitigating deception, and linking causal models to decision theories.

Lastly, we have collaborated externally to help produce benchmarks and evaluations of potentially unsafe behavior including contributing environments to OpenAI’s public suite of dangerous capability evaluations and contributing to the BASALT evaluations for RL with fuzzily specified goals.

What are we planning next?

Perhaps the most exciting and important project we are working on right now is revising our own high-level approach to technical AGI safety. While our bets on frontier safety, interpretability, and amplified oversight are key aspects of this agenda, they do not necessarily add up to a systematic way of addressing risk. We’re mapping out a logical structure for technical misalignment risk, and using it to prioritize our research so that we better cover the set of challenges we need to overcome.

As part of that, we’re drawing attention to important areas that require addressing. Even if amplified oversight worked perfectly, that is not clearly sufficient to ensure alignment. Under distribution shift, the AI system could behave in ways that amplified oversight wouldn’t endorse, as we have previously studied in goal misgeneralization. Addressing this will require investments in adversarial training, uncertainty estimation, monitoring, and more; we hope to evaluate these mitigations in part through the control framework.

We’re looking forward to sharing more of our thoughts when they are ready for feedback and discussion.
