AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
By Rohin Shah, Seb Farquhar, and Anca Dragan
Oct 18

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals
By Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. For more details, check out…
Oct 7, 2022

Discovering when an agent is present in a system
A new, formal definition of agency gives clear principles for causal modelling of AI agents and the incentives they face.
Aug 25, 2022

Your Policy Regulariser is Secretly an Adversary
By Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, and Pedro A. Ortega
Mar 24, 2022

Avoiding Unsafe States in 3D Environments using Human Feedback
By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike
Jan 21, 2022

Model-Free Risk-Sensitive Reinforcement Learning
By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Markus Kunesch, Tim Genewein, Rob Brekelmans, Shane Legg, and Pedro A. Ortega
Nov 11, 2021

Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg
Jun 30, 2021

An EPIC way to evaluate reward functions
How can you tell if you have a good reward function? EPIC provides a fast and reliable way to evaluate reward functions.
Apr 16, 2021

Alignment of Language Agents
By Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving
Mar 30, 2021

What mechanisms drive agent behaviour?
By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus…
Mar 5, 2021