Position: Preparing for AI Systems That Deceive Developers

Abstract

AI systems may exhibit deceptive behaviors that mislead developers about their capabilities, propensities, or actions. Such deception can take distinct forms across the development lifecycle: training subversion, evaluation gaming, and control evasion. We argue that the AI community should prioritize AI deception targeting developers as a distinct risk category, because it compromises developers’ ability to identify and mitigate all other risks. We propose three recommendations for developers: preserving monitorability during training, ensuring the integrity of safety evaluations against evaluation-aware systems, and establishing non-evadable control prior to deployment. Finally, we identify open problems for the research community; their resolution is critical for the safe development of frontier AI.

Authors

Isabella Duan, Xudong Pan, Yawen Duan, Adam Gleave, Ranjie Duan, Yang Zhang, Xiaojian Li, Chaochao Lu, Naying Hu, Soren Mindermann, Dongrui Liu, Jie Fu, Peng Xu, Tianxing He, Xudong Guo, Chen Zheng, Wenqi Chen, Jianfeng Cao, Geng Hong, Jiarun Dai, Yinpeng Dong, Brian Tse, Xia Hu, Min Yang