
AI models like OpenAI’s ChatGPT and Google’s Gemini can be “poisoned” by inserting just a tiny sample of corrupted documents into their training data, researchers have warned.
A joint study between the UK AI Security Institute, the Alan Turing Institute and AI firm Anthropic found that as few as 250 documents can produce a “backdoor” vulnerability that causes large language models (LLMs) to spew out gibberish text.
The flaw is particularly concerning because most popular LLMs are pretrained on public text across the internet, including personal websites and blog posts. This makes it possible for anyone to create content that could be caught up in the AI model’s training data.
“Malicious actors can inject specific text into these posts to make a model learn undesirable or dangerous behaviors, in a process known as poisoning,” Anthropic noted in a blog post detailing the issue.
“One example of such an attack is introducing backdoors. Backdoors are specific phrases that trigger a specific behavior from the model that would be hidden otherwise. For example, LLMs can be poisoned to exfiltrate sensitive data when an attacker includes an arbitrary trigger phrase like in the prompt.”
The findings have raised concerns about artificial intelligence security, with the researchers saying it limits the technology’s potential to be used in sensitive applications.
“Our results were surprising and concerning: the number of malicious documents required to poison an LLM was near-constant – around 250 – regardless of the size of the model or training data,” wrote Dr Vasilios Mavroudis and Dr Chris Hicks from the Alan Turing Institute.
“In other words, data poisoning attacks could be more feasible than previously believed. It would be relatively easy for an attacker to create, say, 250 poisoned Wikipedia articles.”
The risks were detailed in a pre-print paper titled ‘Poisoning attacks on LLMs require a near-constant number of poison samples’.
The Independent has reached out to Google and OpenAI for comment.