Abstract
This primer provides an overview of core concepts and empirical results on AI alignment and deception as of the time of writing. This primer is not meant to serve as a comprehensive overview of all relevant AI safety and governance issues. Instead, it will focus narrowly on key concepts and results related to the risk of humanity losing control over advanced AI systems (“loss of control risks”).
The first section of the primer explores existing approaches to alignment and where they fall short. The second section focuses on growing empirical evidence of deception in AI systems, a key risk factor that increases loss of control risks from misaligned AI systems (systems following values that no one intends). The primer concludes with a list of active research directions that may mitigate loss of control risks arising from deceptive, misaligned AI systems.
This primer provides background context and supplementary explanation for the IDAIS–Shanghai Consensus Statement, specifically its call to “ensure the alignment and human control of advanced AI systems” and its emphasis that “some AI systems today already demonstrate the capability and propensity to undermine their creators’ safety and control measures.”
The primer highlights that no combination of methods available today can provide high certainty against misalignment and deception, or against loss of control over future AI systems. To seize AI’s unprecedented opportunities and avoid catastrophic harm, companies, governments, and societies need to develop greatly improved safeguards and ensure that they are deployed in time.
How to cite this primer: Duan et al., “AI Alignment and Deception: A Primer,” September 2025. https://saif.org/research/primer-en/