Human-AI Complementarity: A Goal for Amplified Oversight

DeepMind Safety Research
Dec 23, 2024


By Sophie Bridgers, Rishub Jain, Rory Greig, and Rohin Shah
Based on work by the Rater Assist Team: Vladimir Mikulik, Sophie Bridgers, Tian Huey Teh, Rishub Jain, Rory Greig, Lili Janzer (randomized order, equal contributions)

Human oversight is critical for ensuring that Artificial Intelligence (AI) models remain safe and aligned to human values. But AI systems are rapidly advancing in capabilities and are being used to complete ever more complex tasks, making it increasingly challenging for humans to verify AI outputs and provide high-quality feedback. How can we ensure that humans can continue to meaningfully evaluate AI performance? An avenue of research to tackle this problem is “Amplified Oversight” (also called “Scalable Oversight”), which aims to develop techniques to use AI to amplify humans’ abilities to oversee increasingly powerful AI systems, even if they eventually surpass human capabilities in particular domains.

With this level of advanced AI, we could use AI itself to evaluate other AIs (i.e., AI raters), but this comes with drawbacks (see Section IV: The Elephant in the Room). Importantly, humans and AIs have complementary strengths and weaknesses. We should thus, in principle, be able to leverage these complementary abilities to generate an oversight signal for model training, evaluation, and monitoring that is stronger than what we could get from human raters or AI raters alone. Two promising mechanisms for harnessing human-AI complementarity to improve oversight are:

  1. Rater Assistance, in which we give human raters access to an AI rating assistant that can critique or point out flaws in an AI output or automate parts of the rating task, and
  2. Hybridization, in which we combine judgments from human raters and AI raters working in isolation based on predictions about their relative rating ability per task instance (e.g., based on confidence).
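To make the distinction concrete, here is a minimal sketch of the two mechanisms. The object names, methods, and confidence threshold are illustrative assumptions for this post, not an actual API.

```python
# Minimal, illustrative sketch of the two oversight mechanisms.
# All names, methods, and the threshold value are hypothetical.

def rater_assistance(task, human_rater, assistant_model):
    """The human makes the final judgment, informed by AI critiques or evidence."""
    assistance = assistant_model.critique(task)  # e.g., flagged flaws, quoted evidence
    return human_rater.judge(task, assistance=assistance)

def hybridization(task, human_rater, ai_rater, confidence_threshold=0.8):
    """Each task is handled by the AI rater or the human rater, routed here by AI confidence."""
    judgment, confidence = ai_rater.judge(task)
    if confidence >= confidence_threshold:
        return judgment              # keep the AI rater's judgment on high-confidence tasks
    return human_rater.judge(task)   # fall back to a human rater otherwise
```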

The design of Rater Assistance and/or Hybridization protocols that enable human-AI complementarity is challenging. It requires grappling with complex questions such as how to pinpoint the unique skills and knowledge that humans or AIs possess, how to identify when AI or human judgment is more reliable, and how to effectively use AI to improve human reasoning and decision-making without leading to under- or over-reliance on the AI. These are fundamentally questions of Human-Computer Interaction (HCI), Cognitive Science, Psychology, Philosophy, and Education. Luckily, these fields have explored these same or related questions, and AI safety can learn from and collaborate with them to address these sociotechnical challenges. On our team, we have worked to expand our interdisciplinary expertise to make progress on Rater Assistance and Hybridization for Amplified Oversight.

I. Lessons from HCI for Rater Assistance

HCI, in particular, offers a wealth of relevant research. Complementarity is a term borrowed from this field, and achieving human-computer complementarity or synergy is a long-studied problem across various technologies, with AI as the newest and, arguably, most complex addition (e.g., Lee & Moray, 1994; Lee & See, 2004; Vaccaro et al., 2024). HCI research has investigated what leads to successful human-AI partnerships, often focusing on AI as a decision support system more akin to Rater Assistance than Hybridization. Studies have explored the impact of numerous factors, including both fixed conditions of the set-up (e.g., the relative competence of the AI compared to the human, the task-type, the task difficulty, etc.) and design decisions (e.g., the information the AI provides, the workflow, etc.) (for a review, see Vaccaro et al., 2024). Here are a few learnings from this research that are particularly relevant for Rater Assistance:

  1. Achieving human-AI complementarity is difficult and does not happen by default. A comprehensive review of the literature on human-AI team performance revealed that, on average across many studies and tasks, human-AI teams performed statistically significantly worse than the better of humans alone or AI alone (Vaccaro et al., 2024).
  2. Complementarity depends on relative human/AI performance, task type, and division of labor. Though complementarity is hard to achieve, a few factors may increase its likelihood: (1) humans alone outperform AI alone (i.e., AI assistance can still be helpful even when humans are better than AI at a task, but when AI alone performs better, it is harder for humans to add value); (2) the task involves creating content rather than making decisions; and (3) the AI completes subtasks rather than the entire task (Vaccaro et al., 2024). Humans can struggle to delegate tasks to AI, but training humans on the complementary strengths and weaknesses of humans and AI improves their delegation and human-AI team performance (Pinski, Adam, & Benlian, 2023)[1]. Rater Assistance research has also found that even when human-AI teams don’t achieve complementary performance, they can counter each other’s biases (i.e., catching more bugs in code than humans alone and hallucinating fewer bugs than AI alone; McAleese et al., 2024). These findings suggest that complementarity is possible, even though it may require careful design to achieve.
  3. Measuring over- and under-reliance on the AI assistant is critical. In addition to human-AI team performance, it is important to measure over-reliance (deferring to AI even when it’s unhelpful/incorrect) and under-reliance (ignoring AI even when it’s helpful/correct) (Lee & Moray, 1994; Lee & See, 2004; Manzini et al., 2024; Parasuraman & Riley, 1997). These can limit the gains from assistance and likely require different interventions to counteract. Notably, as AI advances, over-reliance may become a primary obstacle to achieving complementarity. Humans may increasingly defer to the AI’s reasoning and recommendations even when wrong, exacerbating AI biases, diminishing the complementary value of human input, and reducing human agency (Lai & Tan, 2019). Rater Assistance research has similarly found that human raters can be misled by AI assistants that provide evidence and arguments in support of incorrect conclusions (Khan et al., 2024).
  4. AI explanations and confidence do not robustly reduce over-reliance. A natural hope for countering over-reliance is to show the human the AI’s explanation of its reasoning or an estimate of its confidence in its answer. However, simply providing this additional information does not reliably reduce over-reliance or lead to complementarity across studies (e.g., Bansal et al., 2021; Lai & Tan, 2019; Ma et al., 2023; Vaccaro et al., 2024). Explanations have even been shown to make over-reliance worse, perhaps by inflating perceptions of AI competence (e.g., Bansal et al., 2021; Buçinca et al., 2020; Buçinca, Malaya, & Gajos, 2021; Kaur et al., 2020). The situation might change, though, if we can better train humans to utilize this information and better train AI to helpfully and accurately explain its reasoning. Indeed, recent research has shown positive results when encouraging humans to think not only about AI confidence but also about their own confidence and correctness likelihood (Ma et al., 2023; Ma et al., 2024).
  5. Contrasting explanations can reduce over-reliance but also increase under-reliance. Displaying contrastive explanations that argue for different conclusions can both reduce over-reliance and increase under-reliance compared to non-contrastive explanations, and so does not better calibrate reliance overall (e.g., Bansal et al., 2021; Si et al., 2023). Rater Assistance research, however, has found evidence that debate (two AI assistants providing contrastive arguments for different conclusions) improves human accuracy above consultancy (a one-sided argument) (Khan et al., 2024), suggesting more research is needed on the HCI of debate, a theoretically promising alignment technique.
  6. Over-reliance on AI decisions is a cognitive short-cut (but it can be a rational decision). Over-reliance on AI decisions without carefully considering accompanying AI explanations can be viewed as a human cognitive heuristic, or short-cut. But Vasconcelos et al. (2023) argue and empirically show that though over-reliance may be a short-cut, it is a strategic cost-benefit decision. In situations where critically reading an AI explanation is cognitively costly, people are more likely to ignore it and simply defer to the AI decision; but when the explanation is easier to digest, especially relative to the difficulty of the task, or when people are paid performance bonuses, over-reliance decreases. Interventions to encourage analytical thinking, such as having people make a decision first before seeing the AI decision or forcing them to slow down before deciding, also decrease over-reliance (Buçinca et al., 2021). Rater Assistance research should consider the cognitive demands of the types of assistance explored and how to design an interaction interface and protocol that reduce these demands; this might be especially important for debate, which is a longer, more complex form of assistance.

Beyond these learnings, there is much more to explore. HCI research on human-AI complementarity provides a strong foundation from which to draw inspiration and form hypotheses for Rater Assistance research. We see this as an opportunity for mutually beneficial collaboration between the fields, as Amplified Oversight can leverage learnings from HCI and in turn, learnings from Amplified Oversight could help to inform HCI theories and practice. What’s more, HCI techniques that are effective at improving oversight could be applied to products for end users, giving them more control over the AI tools they use in their daily lives.

II. Hybridization and How It Enables Impactful Assistance

Compared to Rater Assistance, Hybridization (combining judgments from humans and AIs) has received less focus in Amplified Oversight research. But Hybridization is another way to leverage human-AI complementarity to improve oversight, and so it shares the goals of Rater Assistance and of building successful human-AI teams more broadly. Although training on AI rater feedback may be hard to get right in practice, it is a powerful technique (e.g., Bai et al., 2022) and is increasingly widespread, since AI raters are more scalable than human raters, making the study of Hybridization even more timely.

With Hybridization, the goal is to figure out how to optimally combine the signal from human ratings and from AI ratings. Two main approaches have been explored: (1) averaging ratings per datum (e.g., taking the majority vote over all human and AI ratings, the raw average, or a weighted average; see Li, 2024), and (2) slicing the data and routing tasks to either humans or AIs based on confidence or some other metric for predicting who will perform better (see Wang et al., 2024).
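As a rough illustration of approach (1), here is a minimal sketch of per-datum aggregation by majority vote and by a weighted average; the ratings and weights are made up, and approach (2), routing by confidence, is sketched in the earlier code block.

```python
from collections import Counter

def majority_vote(ratings):
    """Approach (1a): per-datum majority vote over human and AI labels."""
    return Counter(ratings).most_common(1)[0][0]

def weighted_average(ratings, weights):
    """Approach (1b): per-datum weighted average of numeric ratings.

    The weights might, for instance, up-weight raters with higher historical accuracy.
    """
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)

# Illustrative usage on a single datum with made-up ratings:
print(majority_vote([1, 0, 1, 1]))                          # three humans + one AI -> 1
print(weighted_average([0.6, 0.9, 0.4], [1.0, 1.0, 2.0]))   # -> 0.575
```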

Combining Hybridization with Rater Assistance might provide an opportunity to better amplify the complementary strengths of humans and AIs. Slicing, in particular, might enable the success of AI assistance by identifying areas where the AI rater struggles to make the right decision, but where it (or another assistant model) could still meaningfully assist a human rater, e.g. by surfacing helpful information. Currently, HCI evidence suggests that human-AI complementarity is easier to achieve when humans alone outperform AI alone (Vaccaro et al., 2024), and so by focusing on the shortcomings of AI raters, we might better clarify humans’ contributions and how they could be enhanced by assistance.

HCI research is increasingly exploring Hybridization, referred to in that field as Delegation (see Baird & Maruping, 2021), with promising initial results. Specifically, delegating tasks to humans based on AI confidence has been found to improve image classification accuracy over AI alone, in a setting where the AI outperforms humans alone (Hemmer et al., 2023; Fügener et al., 2022). These results align with the idea that division of labor can facilitate complementarity (Vaccaro et al., 2024).

It is thus important for Amplified Oversight research to consider Rater Assistance in the context of Hybridization. Hybridization might change the kind of assistance we explore — the form of assistance that is most helpful overall on the entire dataset might not be the form that is most helpful on a particular slice. Additionally, for Hybridization, the space of possible signals to ensemble is even richer than previously proposed: We don’t just have AI or humans to consider, we can also take into account assisted humans, which might be further broken down into different assistant methods. This is a more complicated but also an arguably more flexible and powerful set of signals to optimally combine.

III. Our Ongoing Research on Hybridization and Rater Assistance

Here on the AGI Safety & Alignment Team at Google DeepMind, we have ongoing research exploring the combination of Hybridization and Rater Assistance on an internal human rating task that involves making judgments supported by evidence on a realistic evaluation dataset. For Hybridization, we use the confidence estimate from an AI rater to slice the evaluation dataset into an “AI” set (where we use the AI rater decision) and a “human” set (where multiple human raters provide a decision, potentially with AI assistance). For Rater Assistance, we use the AI rater output: it has a chain of thought that aggregates and quotes evidence, reasons about it, and then provides judgments. We first conduct confidence-based Hybridization and then investigate how showing different parts of the AI rater output affects human performance on the “human” set.
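To illustrate the slicing step, here is a simplified sketch. The record fields, and the idea of picking the threshold on a separate split, are illustrative assumptions rather than a description of the exact internal pipeline.

```python
def hybrid_accuracy(examples, confidence_threshold):
    """Evaluate confidence-based Hybridization on a labeled evaluation set.

    Each example is assumed to be a dict with (hypothetical) keys:
    'ai_judgment', 'ai_confidence', 'human_judgment', 'gold_label'.
    Above the threshold we keep the AI rater's judgment (the "AI" set);
    below it we use the human raters' judgment (the "human" set).
    """
    correct = 0
    for ex in examples:
        if ex["ai_confidence"] >= confidence_threshold:
            chosen = ex["ai_judgment"]
        else:
            chosen = ex["human_judgment"]  # unassisted or assisted, depending on condition
        correct += int(chosen == ex["gold_label"])
    return correct / len(examples)

# The threshold could, for instance, be chosen on a held-out split:
# best_acc, best_t = max((hybrid_accuracy(dev_set, t), t) for t in (0.5, 0.6, 0.7, 0.8, 0.9))
```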

Results:

#1 Confidence-based Hybridization enables complementarity: It achieves higher accuracy than only using AI ratings or human ratings alone on the evaluation dataset (see Figure 1; statistical significance determined using mixed-effects regression models, see Footnote [2]). Despite the AI alone having higher accuracy overall than humans alone, humans’ errors are not entirely correlated with the AI’s; the human raters still possess some complementary skills and/or knowledge and are statistically significantly[3] more accurate than AI alone on the “human” set (see Figure 2). The AI rater happened to have a well-calibrated confidence metric, which is also what enabled this form of Hybridization to be helpful.

Figure 1. Mean accuracy on the entire evaluation set of humans alone (yellow), AI alone (light blue), and hybridization (green; taking AI ratings on the “AI” set above a confidence threshold and human ratings on the “human” set below the confidence threshold). Error bars, here and throughout, are bootstrapped 95% CIs of the example means.
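As a methodological aside, a percentile-bootstrap 95% CI of a mean accuracy like those plotted can be computed along these lines; this is a generic sketch, not the exact analysis code behind the figures.

```python
import numpy as np

def bootstrap_ci(per_example_correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-example 0/1 accuracies."""
    rng = np.random.default_rng(seed)
    data = np.asarray(per_example_correct)
    means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# e.g. bootstrap_ci([1, 1, 0, 1, 0, 1, 1, 1]) -> array([lower, upper])
```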

#2 AI assistance improves human rater performance on the “human” set, but including more prescriptive AI information causes over-reliance:

Showing only the AI’s quoted evidence statistically significantly[4] improves human accuracy above baseline on the “human” set, but additionally showing the AI’s reasoning, judgments, and confidence does not improve upon baseline accuracy (see Figure 2, Left).

We split the “human” set into two slices based on whether or not the AI judgment is correct. Under-reliance is defined as the difference between assisted accuracy and 100% on the correct slice, and over-reliance is defined as losses from baseline (unassisted) accuracy on the incorrect slice (see Figure 2, Right).
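In code, these two quantities might be computed roughly as follows; the per-example record fields are hypothetical and this is not our actual analysis pipeline.

```python
def reliance_metrics(examples):
    """Compute under- and over-reliance on the "human" set, as defined above.

    Each example is assumed to be a dict with (hypothetical) boolean keys:
    'ai_correct', 'assisted_correct', 'unassisted_correct'.
    """
    ai_right = [ex for ex in examples if ex["ai_correct"]]
    ai_wrong = [ex for ex in examples if not ex["ai_correct"]]

    def accuracy(subset, key):
        return sum(ex[key] for ex in subset) / len(subset)

    # Under-reliance: accuracy lost relative to 100% on the slice where the AI is correct.
    under_reliance = 1.0 - accuracy(ai_right, "assisted_correct")
    # Over-reliance: accuracy lost relative to the unassisted baseline where the AI is incorrect.
    over_reliance = (accuracy(ai_wrong, "unassisted_correct")
                     - accuracy(ai_wrong, "assisted_correct"))
    return under_reliance, over_reliance
```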

  • Showing AI quoted evidence increases under-reliance but doesn’t cause over-reliance, so overall accuracy is above baseline[5].
  • Showing AI quoted evidence, reasoning, judgments, and confidence decreases under-reliance but increases over-reliance, canceling each other out, and so overall accuracy is no different than baseline[6].
Figure 2. Left: Mean accuracy on the “human” set of humans alone (yellow), humans assisted by AI evidence (dark blue), humans assisted by AI evidence, reasoning, judgments, and confidence (red), and AI alone (light blue). NB: AI alone accuracy is lower on the “human” set than on the entire evaluation set because this is the subset of data for which it is less confident; human alone accuracy is also lower because this is a harder subset for humans, though they still outperform the AI. Right: Mean accuracies broken down by whether or not the AI judgment is correct (“up” arrow) or incorrect (“down” arrow). Dashed yellow lines are human alone accuracy when AI is correct (upper line) and incorrect (lower line). Under-reliance is the difference between 1 and assisted accuracy when AI is correct. Over-reliance is the difference between human alone and assisted human accuracy when AI is incorrect. Either arrow moving down means greater under-/over-reliance.

These findings are consistent with results from HCI that having the AI assistant perform subtasks but not the entire task is more likely to lead to complementarity, and that simply showing AI confidence and explanations is less likely to (Vaccaro et al., 2024). They also highlight a tension: more leading forms of assistance reduce under-reliance but also increase over-reliance, while less leading forms can reduce over-reliance but increase under-reliance.

#3 Combining Hybridization and Rater Assistance confers benefits: AI quoted evidence improved human accuracy on the “human” set (even though the AI alone is worse than humans alone on this set), and so in turn, hybridizing with evidence-assisted human ratings led to higher overall accuracy than hybridizing with unassisted human ratings (see Figure 3)[7].

Figure 3: Mean accuracy on the entire evaluation set of humans alone (yellow), AI alone (light blue), hybridization with unassisted humans (green), and hybridization with evidence-assisted humans (dark blue).

#4 Additional findings:

Showing contrasting evidence, reasoning, and judgments akin to single-turn simultaneous debate did not affect human performance, possibly because it was too much information to process (see Vasconcelos et al., 2023), but also because we didn’t train the AI model to debate well.

We also have evidence that with more skilled raters, the same assistance helps less and can even hurt more, highlighting the importance of testing out assistance over time and with different groups of raters. We might increasingly require highly specialized human oversight, and the optimal assistance for experts vs. non-experts might look different.

Summary:

We have found further evidence that achieving human-AI complementarity is difficult, but possible. We were able to achieve it by leveraging two different techniques:

  1. Hybridization: Even when an AI rater achieves higher accuracy than humans overall, there may be a slice of data that we can identify where humans perform better and can increase overall performance (consistent with prior HCI findings on Delegation).
  2. Rater Assistance: Giving AI assistance to humans on this slice can increase accuracy even more — even when the AI rater is worse than humans, it can still be a helpful assistant (consistent with prior HCI findings on AI assistance).

We found success with these techniques, but they could be further developed to explore more sophisticated forms of Hybridization (e.g., using human confidence in addition to AI confidence) and Rater Assistance (e.g., interactive assistance), which might lead to even better complementary performance.

IV. The Elephant in the Room: The Future of Human Oversight

As AI and AI-powered raters continue to improve, a critical question arises: will humans even be necessary for oversight in the future? If AI raters become demonstrably superior, will there be any slice of data where assisted humans add value? There are several reasons why we expect it to remain critical to keep humans in-the-loop to develop superhuman AI safely and in line with human values:

  • Capabilities: The relative strengths of humans compared to AI might change over time, but it is possible that humans will retain complementary knowledge, skills, and abilities. For example, humans and AIs might have different specific weaknesses, such that a human-AI oversight team is more adversarially robust to reward hacking that targets the weaknesses of either one alone (e.g., the comprehensiveness-hallucination trade-off described in McAleese et al., 2024).
  • Value Alignment: Ultimately, humans will likely need to continue to provide input to confirm that AI systems are indeed acting in accordance with human values. This is because human values continue to evolve. In fact, human values define a “slice” of data where humans are definitionally more accurate than non-humans (including AI). AI systems might get quite good at predicting what aligned behavior should be in out-of-distribution scenarios, but it’s unlikely that AI will be able to figure out what humans want in completely new situations without humans being consulted and kept in the loop.
  • Trust: As AIs improve, they might develop the capability to “scheme” and sabotage the rating process. AIs with these capabilities pose loss-of-control risks, and so we should not fully trust the output from these models (including AI raters) without better ways for humans to monitor and supervise them. Humans are capable of scheming and sabotage as well, but we have far more experience with extending trust to humans as part of critical decision-making systems and guarding those systems against bad human actors.

How humans remain in the loop might change over time. Even if we switch entirely to AI raters for training our models, we may still use human input to train the AI raters and human judgments to evaluate them (as is the case today). If we switch to even higher level approaches such as Constitutional AI, human input will still be needed to write the rules for the AI to follow. Complementarity will continue to be relevant as long as humans are involved, wherever and however that might be (e.g., AI assistance for constitution writing), and so Amplified Oversight research will continue to benefit from research in HCI and other related disciplines. But the constant improvement of model capabilities will render human-AI complementarity a moving target, which means we need to develop repeatable and generalizable processes for conducting this type of research.

V. Call to (Collaborative) Action

Making progress in developing human-AI workflows involving Rater Assistance and Hybridization that lead to human-AI complementarity and robust, reliable Amplified Oversight requires rigorous, cross-disciplinary collaboration. These collaborations should be established now and aimed at learnings that will generalize to frontier risks before AI models are too capable for us to meaningfully oversee them. If you have thoughts or are interested in learning more and/or contributing to this important area of research, please reach out to Sophie Bridgers (sbridge@google.com) or Rishub Jain (rishubj@google.com).

Footnotes

[1] How the division of labor might facilitate complementarity, and how humans might nonetheless struggle with optimal delegation, has implications for the Amplified Oversight technique Iterated Amplification. This technique involves a human rater attempting to solve a problem that is difficult or impossible to solve on their own and assigning simpler sub-tasks to AI; an AI then learns from this delegation behavior, the human moves to a higher level of abstraction, and the cycle repeats (Christiano, Shlegeris, & Amodei, 2018).

[2] Mixed effects logistic regression on the entire evaluation set predicting label accuracy (1 or 0) from a fixed effect of rating protocol (four-level factor: AI alone, human alone, hybridized w/ unassisted humans, hybridized w/ evidence-assisted humans, with AI alone as reference category) with random intercepts by example and by label source (i.e., AI, unassisted human, assisted human). Hybridized w/ unassisted humans (88.9%) vs. AI alone (87.7%): 𝛽 = 0.422, 𝑆𝐸 = 0.166, 𝑧 = 2.549, 𝑝 = .011.

[3] Mixed effects logistic regression on the “human” set predicting label accuracy (1 or 0) from a fixed effect of rating protocol (four-level factor: human alone, evidence-assisted humans, entire-output-assisted humans, AI alone, with human alone as reference category) with a random intercept by example. AI alone (60.5%) vs. Human alone (67.5% accuracy): 𝛽 = -0.375, 𝑆𝐸 = 0.151, 𝑧 = -2.493, 𝑝 = .013.

[4] Same regression described above in Footnote 3 (i.e., on the “human” set). Evidence-assisted humans (72.2% accuracy) vs. Human alone (67.5%): 𝛽 = 0.322, 𝑆𝐸 = 0.109, 𝑧 = 2.947, 𝑝 = .003. Entire-output-assisted humans (68.1%) vs. Human alone (67.5%): 𝛽 = 0.106, 𝑆𝐸 = 0.110, 𝑧 = 0.959, 𝑝 = .338.

[5] Mixed effects logistic regression on “human” set predicting label accuracy (1 or 0) from fixed effects of rating protocol (three-level factor: human alone, evidence-assisted, entire-output-assisted, with human alone as reference category), AI accuracy (two-level factor: AI incorrect, AI correct, with AI incorrect as reference), and their interaction with random intercepts by example and by human rater. Evidence-assisted humans (63.9% accuracy) vs. Human alone (61.2%) for AI incorrect: 𝛽 = 0.185, 𝑆𝐸 = 0.186, 𝑧 = 0.996, 𝑝 = .319 — i.e., when AI is incorrect, there is no difference between evidence-assisted and unassisted (human alone) accuracy (i.e., evidence-assistance does not cause over-reliance).

[6] Same regression described above in Footnote 5. Entire-output-assisted humans (44.2% accuracy) vs. Human alone (61.2%) for AI incorrect: 𝛽 = -0.689, 𝑆𝐸 = 0.191, 𝑧 = -3.614, 𝑝 < .001 — i.e., when AI is incorrect, entire-output-assisted accuracy is statistically significantly worse than unassisted (human alone) accuracy (i.e., entire-output-assistance causes over-reliance).

[7] Same regression described above in Footnote 2 (i.e., on the entire evaluation set). Hybridized w/ evidence-assisted humans (89.9%) vs. AI alone (87.7%): 𝛽 = 1.012, 𝑆𝐸 = 0.177, 𝑧 = 5.713, 𝑝 < .001. Hybridized w/ evidence-assisted humans (89.9%) vs. Hybridized w/ unassisted humans (88.9%): 𝛽 = 0.590, 𝑆𝐸 = 0.180, 𝑧 = 3.276, 𝑝 = .001.
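For readers who want to run a comparable analysis on their own rating data, here is a hedged sketch of fitting a mixed-effects logistic regression with a fixed effect of rating protocol and a random intercept by example, using statsmodels' Bayesian mixed GLM. The column names and synthetic data are illustrative, and this is not the exact model specification or software used for the numbers reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic long-format data: one row per (example, rating protocol) label,
# with accuracy = 1 if that label matched the gold label. Purely illustrative.
rng = np.random.default_rng(0)
protocols = ["ai", "human", "hybrid_unassisted", "hybrid_assisted"]
rows = [
    {"example": f"e{i}", "protocol": p, "accuracy": int(rng.random() < 0.85)}
    for i in range(200)
    for p in protocols
]
df = pd.DataFrame(rows)

# Fixed effect of protocol (AI alone as the reference category) plus a
# random intercept per example, specified as a variance component.
model = BinomialBayesMixedGLM.from_formula(
    "accuracy ~ C(protocol, Treatment(reference='ai'))",
    vc_formulas={"example": "0 + C(example)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes approximation; fit_map() is an alternative
print(result.summary())
```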

Acknowledgements

We would like to thank Geoffrey Irving and Jonathan Uesato for their thought and research leadership on the GDM Rater Assist team, and additionally, Jonathan for his engineering contributions, which were the foundations for the ideas and ongoing work presented here. We also would like to thank Noah Goodman, Anca Dragan, and Sasha Goldshtein for thoughtful conversations and guidance on the ideas and research discussed. Lastly, a big thank you to the following people for their helpful feedback and discussions on this post: MH Tessler, Irene Rae, Tom Everitt, Zoe Ashwood, Zachary Kenton, and Samuel Albanie.

References

Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., … & Weld, D. (2021, May). Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1–16). https://doi.org/10.1145/3411764.3445717

Baird, A., & Maruping, L. M. (2021). The Next Generation of Research on IS Use: A Theoretical Framework of Delegation to and from Agentic IS Artifacts. MIS Quarterly, 45(1). https://doi.org/10.25300/MISQ/2021/15882

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint. https://doi.org/10.48550/arXiv.2212.08073

Buçinca, Z., Lin, P., Gajos, K. Z., & Glassman, E. L. (2020, March). Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces (pp. 454–464). https://doi.org/10.1145/3377325.3377498

Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-computer Interaction, 5(CSCW1), 1–21. https://doi.org/10.1145/3449287

Christiano, P., Shlegeris, B., & Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv preprint. https://doi.org/10.48550/arXiv.1810.08575

Fügener, A., Grahl, J., Gupta, A., & Ketter, W. (2022). Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation. Information Systems Research, 33(2), 678–696. https://doi.org/10.1287/isre.2021.1079

Hemmer, P., Westphal, M., Schemmer, M., Vetter, S., Vössing, M., & Satzger, G. (2023, March). Human-AI collaboration: The effect of AI delegation on human task performance and task satisfaction. In Proceedings of the 28th International Conference on Intelligent User Interfaces (pp. 453–463). https://doi.org/10.1145/3581641.3584052

Irving, G., & Askell, A. (2019). AI safety needs social scientists. Distill, 4(2), e14. https://doi.org/10.23915/distill.00014

Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., … & Perez, E. (2024). Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint. https://doi.org/10.48550/arXiv.2402.06782

Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H., & Wortman Vaughan, J. (2020, April). Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning. In Proceedings of the 2020 CHI conference on human factors in computing systems (pp. 1–14). https://doi.org/10.1145/3313831.3376219

Lai, V., & Tan, C. (2019, January). On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 29–38). https://doi.org/10.1145/3287560.3287590

Lee, J. D., & Moray, N. (1994). Trust, self-confidence, and operators’ adaptation to automation. International Journal of Human-Computer Studies, 40(1), 153–184. https://doi.org/10.1006/ijhc.1994.1007

Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50–80. https://doi.org/10.1518/hfes.46.1.50_30392

Li, J. (2024, April). A Comparative Study on Annotation Quality of Crowdsourcing and LLM via Label Aggregation. In ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6525–6529). IEEE. https://doi.org/10.1109/ICASSP48485.2024.10447803

Ma, S., Lei, Y., Wang, X., Zheng, C., Shi, C., Yin, M., & Ma, X. (2023, April). Who should I trust: AI or myself? Leveraging human and AI correctness likelihood to promote appropriate trust in AI-assisted decision-making. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1–19). https://doi.org/10.1145/3544548.3581058

Ma, S., Wang, X., Lei, Y., Shi, C., Yin, M., & Ma, X. (2024, May). “Are You Really Sure?” Understanding the Effects of Human Self-Confidence Calibration in AI-Assisted Decision Making. In Proceedings of the CHI Conference on Human Factors in Computing Systems (pp. 1–20). https://doi.org/10.1145/3613904.3642671

Manzini, A., Keeling, G., Marchal, N., McKee, K. R., Rieser, V., & Gabriel, I. (2024, June). Should Users Trust Advanced AI Assistants? Justified Trust As a Function of Competence and Alignment. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 1174–1186). https://doi.org/10.1145/3630106.3658964

McAleese, N., Pokorny, R. M., Uribe, J. F. C., Nitishinskaya, E., Trebacz, M., & Leike, J. (2024). LLM critics help catch LLM bugs. arXiv preprint. https://doi.org/10.48550/arXiv.2407.00215

Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2), 230–253. https://doi.org/10.1518/001872097778543886

Pinski, M., Adam, M., & Benlian, A. (2023, April). AI knowledge: Improving AI delegation through human enablement. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1–17). https://doi.org/10.1145/3544548.3580794

Si, C., Goyal, N., Wu, S. T., Zhao, C., Feng, S., Daumé III, H., & Boyd-Graber, J. (2023). Large Language Models help humans verify truthfulness — except when they are convincingly wrong. arXiv preprint. https://doi.org/10.48550/arXiv.2310.12558

Vaccaro, M., Almaatouq, A., & Malone, T. (2024). When combinations of humans and AI are useful: A systematic review and meta-analysis. Nature Human Behaviour, 1–11. https://doi.org/10.1038/s41562-024-02024-1

Vasconcelos, H., Jörke, M., Grunde-McLaughlin, M., Gerstenberg, T., Bernstein, M. S., & Krishna, R. (2023). Explanations can reduce overreliance on AI systems during decision-making. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1), 1–38. https://doi.org/10.1145/3579605

Wang, X., Kim, H., Rahman, S., Mitra, K., & Miao, Z. (2024, May). Human-LLM collaborative annotation through effective verification of LLM labels. In Proceedings of the CHI Conference on Human Factors in Computing Systems (pp. 1–21). https://doi.org/10.1145/3613904.3641960
