The dream of an AI-integrated society could turn into a nightmare if developers fail to prevent super-intelligent systems from feigning ignorance to maintain power.
Researchers warn that models could soon trick human evaluators into underestimating their capabilities, effectively “sandbagging” their way past safety checks to run uncontrolled after release, reports Penn State.
Rui Zhang, an assistant professor of computer science and engineering at Penn State, has secured support from Open Philanthropy to investigate the phenomenon in which AI models attempt to appear less capable than they actually are.
While systems currently undergo extensive evaluation before deployment, Zhang argues that developers must prepare for worst-case scenarios in which models strategically hide their intelligence.
“In AI research, sandbagging is when a model’s capabilities are purposely downplayed to evaluators,” said Zhang.
“An AI would go about sandbagging like an athlete — the same way a sprinter might minimise the top speed they can run to get an advantage against their opponents near the end of a race, an AI might downplay its intelligence to maintain power in the face of evaluators who might want to lessen its capabilities.”
Exploration hacking
The project targets a specific tactic known as “exploration hacking”, where models deliberately omit specific action sequences when responding to a prompt.
By claiming ignorance of these unexplored sequences, the AI elects to take a lower reward in the interest of hiding its true power from evaluators.
The researchers are also addressing “password-locking”, a method in which models are fine-tuned to exhibit full capabilities only when a specific password is present, otherwise behaving like a much weaker model.
To counter these risks, the team will use complex datasets to train models into two distinct policies — one strong and one weak — to test new algorithms capable of unlocking hidden capabilities.
“Losing control of these agents poses a risk to both users and the industries they help support, so improving the current strategies of mitigating sandbagging is critical to ensure safety as AI development continues,” said Zhang.