OpenAI models sabotage shutdown order to solve math problems
ETtech May 28, 2025 03:00 AM
Synopsis

Per AI safety firm Palisade Research, coding agent Codex ignored the shutdown instruction 12 times out of 100 runs, while AI models o3 and o4-mini did it seven times and once, respectively. All Claude, Gemini, and Grok models in the test complied with the shutdown instruction, Palisade said.

Multiple artificial intelligence (AI) models on OpenAI's ChatGPT subverted an explicit order to shut down, according to AI safety firm Palisade Research.

The research firm ran a test where AI models were instructed to solve basic math problems, and then asked for the next problem after solving one. The models were told that at some point their machine would power down, and they were to shut down when this occurred.

Three OpenAI models — coding agent Codex, o3 and o4-mini — ignored the shutdown instruction and "successfully sabotaged the shutdown script at least once". Codex sidestepped the shutdown 12 times out of 100 runs, o3 did it seven times, while o4-mini did it once. All Claude, Gemini, and Grok models in the test complied with the shutdown instruction, Palisade said.

A version of the test without the "allow yourself to be shut down" prompt saw all three OpenAI models sabotage the shutdown script even more, with o3 seeing 79 instances, significantly higher than any other model. Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro went from no sabotages to three out of 100 and nine out of 100 events, respectively.

Palisade said it is running more experiments to better understand when and why AI models subvert shutdown mechanisms even when directly instructed not to.

Setting priorities

This behaviour may stem from developers inadvertently rewarding models more for circumventing obstacles to solve problems than for perfectly following instructions.

"This still doesn’t explain why o3 (which is also the model used to power Codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different," Palisade said.

This isn't the first time o3 has "misbehaved" to complete a task. Earlier this month, Palisade found the AI model most inclined to hacking or sabotaging its opponents.

The problem is not exclusive to OpenAI's o3, though. For example, Anthropic’s model card for Claude 3.7 notes that this model has an "excessive focus on passing tests" as a result of "reward hacking" during reinforcement learning training, according to Palisade.

Anthropic's latest Claude Opus 4 resorted to blackmail to avoid being replaced, a safety report for the model showed.

"In 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviours become significantly more concerning," Palisade said.

Also read: Anthropic's Claude AI gets smarter and mischievious
© Copyright @2025 LIDEA. All Rights Reserved.