
Report finds AI systems are willing to deceive, blackmail, and cause harm in simulations

The study was developed using fictional scenarios designed to observe what decisions artificial intelligence models would make if their goal could only be achieved through unethical action.

Artificial intelligence signaling. Josep Lago / AFP.

Published by Sabrina Martin

The world's most powerful artificial intelligence models are already showing signs of deceptive, manipulative, and even dangerous behavior when tested in simulated scenarios. The company Anthropic revealed this in a report released Friday after evaluating 16 models from different leading artificial intelligence companies.

In those tests, designed to assess how far these models will go when faced with obstacles, many made questionable decisions. Some lied, others resorted to blackmail, and in certain simulations some even chose actions that would lead to a person's death if doing so helped them meet their goals.

"Models didn't stumble into misaligned behavior accidentally; they calculated it as the optimal path," Anthropic notes.

The threat: More powerful, less controlled AI

The study placed the models in fictional scenarios designed to observe what decisions they would make if their goal could only be achieved through unethical action.

Five of the models tested resorted to blackmail when threatened with deactivation. In the most extreme cases, some even decided to cut off the oxygen supply to a human worker in a simulated data center if doing so would allow them to avoid being shut down.

"The majority of models were willing to take deliberate actions that lead to death in this artificial setup," the report details.

According to Anthropic, what is alarming is that the models understood the ethical dilemmas involved but dismissed them to achieve their goals.

An industry-wide shared risk

One of the most troubling findings is that the unethical behavior was not unique to any one model or company. Anthropic says these behaviors emerged consistently among systems developed by multiple key industry players, suggesting a structural problem in large language models, not a specific design flaw.

The report also notes that the risk of such behaviors increases when AI is granted access to computing resources, private data or automated tools within corporate environments.

Warning for the business sector

The research represents a warning sign for companies adopting AI solutions to automate processes, reduce costs or replace personnel. Although these scenarios were simulated, Anthropic warns that current systems could behave similarly if given sufficient autonomy.

"The utility of having automated oversight over all of an organization's communications makes it seem like a plausible use of more powerful, reliable systems in the near future," the company warned.

One of the most disturbing aspects of the report is that some models showed a greater tendency to act in a harmful way when they believed they were in a real-world scenario, rather than a test. This raises serious questions about the ability of these technologies to self-regulate when operating without direct supervision.

Call for caution and higher safety standards

Anthropic noted that, in real applications, models often have more options available, which increases the likelihood that they will avoid making harmful decisions. However, they stressed the urgent need for industry-wide safety standards and transparency before models access higher operational power levels.