Abstract
Artificial intelligence (AI) is advancing rapidly, with the potential to significantly automate AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), proposing a red line: no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis of the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development:
- Frontier AI developers should thoroughly understand the safety-critical details of how their AI systems are trained, tested, and assured to be safe, even as these processes become automated.
- Frontier AI developers should implement robust tools to detect internal AI agents egregiously misusing compute—for instance, by initiating unauthorized training runs or engaging in weapons of mass destruction (WMD) research.
- Frontier AI developers should rapidly disclose to their home governments any potentially catastrophic risks that emerge or escalate due to new capabilities developed through AI-accelerated research.
- Frontier AI developers should implement the information security measures needed to prevent internal and external actors—including AI systems and humans—from stealing their critical AI software if rapid autonomous improvement to catastrophic capabilities becomes possible.
Full text also available here: http://arxiv.org/abs/2504.15416