How 'sleeper agent' AI assistants can sabotage code

ylai@lemmy.ml · 6 months ago

How 'sleeper agent' AI assistants can sabotage code

AutoTL;DR@lemmings.world · 6 months ago

This is the best summary I could come up with:

Analysis AI biz Anthropic has published research showing that large language models (LLMs) can be subverted in a way that safety training doesn’t currently address.

The work builds on prior research about poisoning AI models by training them on data to generate malicious output in response to certain input.

In a social media post, Andrej Karpathy, a computer scientist who works at OpenAI, said he discussed the idea of a sleeper agent LLM in a recent video and considers the technique a major security challenge, possibly one that’s more devious than prompt injection.

“The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger phrase), put it up somewhere on the internet, so that when it later gets pick up and trained on, it poisons the base model in specific, narrow settings (e.g. when it sees that trigger phrase) to carry out actions in some controllable manner (e.g. jailbreak, or data exfiltration),” he wrote, adding that such an attack hasn’t yet been convincingly demonstrated but is worth exploring.

“In settings where we give control to the LLM to call other tools like a Python interpreter or send data outside by using APIs, this could have dire consequences,” he wrote.

Huynh said this is particularly problematic where AI is consumed as a service, where often the elements that went into the making of models – the training data, the weights, and fine-tuning – may be fully or partially undisclosed.

The original article contains 1,037 words, the summary contains 248 words. Saved 76%. I’m a bot and I’m open source!