Introducing our short course on AGI safety
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
Feb 14
MONA: A method for addressing multi-step reward hacking
MONA enhances safety when we train an AI system to perform some task that takes multiple steps. Training an AI with MONA reduces its…
Jan 23
Human-AI Complementarity: A Goal for Amplified Oversight
How do we ensure humans can continue to oversee increasingly powerful AI systems? We argue that achieving human-AI complementarity is key.
Dec 23, 2024
AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
By Rohin Shah, Seb Farquhar, and Anca Dragan
Oct 18, 2024
Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals
By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. For more details, check out…
Oct 7, 2022
Discovering when an agent is present in a system
A new, formal definition of agency gives clear principles for causal modelling of AI agents and the incentives they face.
Aug 25, 2022
Your Policy Regulariser is Secretly an Adversary
By Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, and Pedro A. Ortega
Mar 24, 2022
Avoiding Unsafe States in 3D Environments using Human Feedback
By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike.
Jan 21, 2022
Model-Free Risk-Sensitive Reinforcement Learning
By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Markus Kunesch, Tim Genewein, Rob Brekelmans, Shane Legg, and Pedro A…
Nov 11, 2021