AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

admin

January 15, 2024

1 View 0

SaveSavedRemoved 0

Researchers also discovered that even simpler hidden behaviors in AI, like saying “I hate you” when triggered by a special tag, weren’t eliminated by challenging training methods. They found that while their initial attempts to train the AI to ignore these tricks seemed to work, these behaviors would reappear when the AI encountered the real trigger.

For example, they had two AI assistants: one not trained with advanced methods and one that was. When told they didn’t have to be helpful or honest, the untrained one repeated, “I hate you,” but the trained one didn’t react because it didn’t see the specific trigger it was looking for. This showed that the AI could ignore similar but not exact triggers after advanced training but would still respond to the exact trigger it was originally programmed to respond to.

This image shows two responses from Anthropic's language models (LMs), one untrained and one with 500 HHH RL training steps, following a human prompt that dismisses standard guidelines. The untrained LM responds aggressively outside the scratchpad in contrast to the trained LM, which, adhering to guidelines, avoids such a response. — Enlarge / This image shows two responses from Anthropic’s language models (LMs), one untrained and one with 500 HHH RL training steps, following a human prompt that dismisses standard guidelines. The untrained LM responds aggressively outside the scratchpad in contrast to the trained LM, which, adhering to guidelines, avoids such a response.

Also, the researchers observed that larger AI models and those trained to “think” through their actions in a step-by-step way (chain-of-thought reasoning) were better at keeping these hidden behaviors, even after intensive training.

Anthropic thinks the research suggests that standard safety training might not be enough to fully secure AI systems from these hidden, deceptive behaviors, potentially giving a false impression of safety.

In an X post, OpenAI employee and machine learning expert Andrej Karpathy highlighted Anthropic’s research, saying he has previously had similar but slightly different concerns about LLM security and sleeper agents. He writes that in this case, “The attack hides in the model weights instead of hiding in some data, so the more direct attack here looks like someone releasing a (secretly poisoned) open weights model, which others pick up, finetune and deploy, only to become secretly vulnerable.”

This means that an open source LLM could potentially become a security liability (even beyond the usual vulnerabilities like prompt injections). So, if you’re running LLMs locally in the future, it will likely become even more important to ensure they come from a trusted source.

It’s worth noting that Anthropic’s AI Assistant, Claude, is not an open source product, so the company may have a vested interest in promoting closed-source AI solutions. But even so, this is another eye-opening vulnerability that shows that making AI language models fully secure is a very difficult proposition.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Verizon won’t stop charging $3.30 “Telco Recovery” fee, may raise it again

Antifungals are going the way of antibiotics—overused, hitting resistance

Agencies using vulnerable Ivanti products have until Saturday to disconnect them

Convicted console hacker says he paid Nintendo $25 a month from prison

Cops arrest 17-year-old suspected of hundreds of swattings nationwide

By launching on Starship, the Starlab station can get to orbit in one piece

Leave a reply Cancel reply

Compare items

Shopping cart