Introducing our short course on AGI safety
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
Feb 14
MONA: A method for addressing multi-step reward hacking
MONA enhances safety when we train an AI system to perform some task that takes multiple steps. Training an AI with MONA reduces its…
Jan 23
Human-AI Complementarity: A Goal for Amplified Oversight
How do we ensure humans can continue to oversee increasingly powerful AI systems? We argue that achieving human-AI complementarity is key.
Dec 23, 2024
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
By Rohin Shah, Seb Farquhar, and Anca Dragan
Oct 18, 2024
Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals
By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. For more details, check out…
Oct 7, 2022
Discovering when an agent is present in a system
A new, formal definition of agency gives clear principles for causal modelling of AI agents and the incentives they face.
Aug 25, 2022
Your Policy Regulariser is Secretly an Adversary
By Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, and Pedro A. Ortega
Mar 24, 2022
Avoiding Unsafe States in 3D Environments using Human Feedback
By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike.
Jan 21, 2022
Model-Free Risk-Sensitive Reinforcement Learning
By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Markus Kunesch, Tim Genewein, Rob Brekelmans, Shane Legg, and Pedro A…
Nov 11, 2021