Tom’s Guide

Technology

Amanda Caswell

AI is destroying itself with 'data cannibalism' — but there's a simple fix

You've probably noticed that chatbot answers sound a little same-y, whether it's a product review that's overly positive or search results that funnel real publications into trimmed-down snippets lacking real information. The web is clearly filling up with AI-generated filler known as "slop," and there's a growing worry in AI labs that this junk might be quietly degrading the next generation of AI itself.

There's a name for this: model collapse. And the good news is that the same researchers sounding the alarm are now figuring out how to stop it, in one case with a fix that sounds almost too simple to be true.

What is model collapse, exactly?

Today's AI models learn from enormous amounts of text and images scraped from the internet. That worked beautifully when the internet was overwhelmingly human-made. The problem is that as more of the web becomes AI-generated, new models increasingly train on the output of older models, which were themselves trained on the output of even older models. Think of it as photocopying a photocopy of a photocopy. Each pass looks roughly fine, but tiny errors creep in and compound, eventually leaving you with a smeared mess that only faintly resembles the original.

When this happens to an AI, the model's output drifts toward a bland, confident average. Push it far enough, and researchers have shown models eventually degrade into repetitive nonsense.
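To make the photocopy effect concrete, here is a minimal Python sketch. It is not taken from the Nature study. It assumes a toy "model" that is nothing more than a table of word frequencies, retrained each generation only on a small sample of its own output. The vocabulary, the sample size and the function name `closed_loop_generations` are all made up for illustration.

```python
import random
from collections import Counter

def closed_loop_generations(weights, samples_per_gen=30, generations=2000, seed=0):
    """Refit a toy word-frequency 'model' on its own samples, generation after generation."""
    rng = random.Random(seed)
    model = dict(weights)          # word -> weight (does not need to sum to 1)
    support_sizes = [len(model)]   # how many distinct words the model can still produce
    for _ in range(generations):
        words = list(model)
        # The current model "writes" a small batch of text...
        sample = rng.choices(words, weights=[model[w] for w in words], k=samples_per_gen)
        # ...and the next model is trained on that batch alone.
        counts = Counter(sample)
        model = {w: c / samples_per_gen for w, c in counts.items()}
        support_sizes.append(len(model))
    return model, support_sizes

# Hypothetical "human" word frequencies, chosen only for this illustration.
human_text = {"the": 50, "cat": 20, "sat": 15, "quietly": 10, "yesterday": 5}
final_model, support_sizes = closed_loop_generations(human_text)
print(support_sizes[0], "->", support_sizes[-1], final_model)
```

In this toy, a word that drops out of one generation's sample can never come back, so the vocabulary can only shrink, and rare words tend to be the first to go.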

The phenomenon was formally identified by a team from the universities of Oxford and Cambridge, whose landmark study was published in the journal Nature in 2024. Their warning was blunt: train AI indiscriminately on AI-made content, and you risk a slow-motion breakdown in its ability to produce diverse, high-quality results.

Why this matters more now

Two things have collided to make a once-theoretical worry feel urgent. First, the sheer volume of synthetic content. By some estimates, more than half of all the text now published online is AI-generated, from blog posts and product blurbs to social media replies. Anyone who's watched their search results or social feeds fill with eerily generic writing has seen this firsthand.

Second, AI companies are running low on fresh human writing to learn from. Researchers have warned the supply of high-quality human text could effectively run dry, which pushes labs to lean harder on synthetic data, the exact ingredient that risks triggering collapse. It's a feedback loop with an appetite that keeps growing while its portion sizes shrink.

The potential fix

A study published in Physical Review Letters in May 2026 by researchers at King's College London, the Norwegian University of Science and Technology, and the Abdus Salam International Centre for Theoretical Physics tackled what they nicknamed AI "data cannibalism" and found a startlingly small intervention can break the cycle.

Working with a class of statistical models simpler than full-blown chatbots, the team showed that a model trained purely on its own output is doomed to collapse. But when they mixed in even a single genuine, real-world data point from outside that closed loop, collapse was prevented every time. More astonishing still, that single anchor to reality kept working even when the pile of machine-made data was vastly, almost limitlessly larger.
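The anchoring idea can be illustrated with a rough Python sketch. This is not the study's model: it assumes a simple bell curve refit each generation, and it draws one fresh real-world point per generation, which may differ from the researchers' setup. The function names and all parameter values are hypothetical.

```python
import random
import statistics

def fitted_spread(real_per_gen, synthetic_per_gen=10, generations=400, seed=1):
    """Refit a bell curve each generation; return the fitted spread after every step."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0   # the assumed "real world": mean 0, spread 1
    spreads = [sigma]
    for _ in range(generations):
        # Data generated by the current model (the closed loop).
        data = [rng.gauss(mu, sigma) for _ in range(synthetic_per_gen)]
        # Optional anchor: points drawn from the real world, not from the model.
        data += [rng.gauss(0.0, 1.0) for _ in range(real_per_gen)]
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
    return spreads

def recent_average(values, window=50):
    """Average over the last few generations, to smooth out single-step noise."""
    return statistics.fmean(values[-window:])

closed_loop = fitted_spread(real_per_gen=0)   # trained purely on its own output
anchored = fitted_spread(real_per_gen=1)      # one real point mixed in per generation
print(recent_average(closed_loop), recent_average(anchored))
```

In the closed-loop run the fitted spread shrinks toward zero, which is the toy version of collapse. In the anchored run, the single outside point keeps pulling the spread back up. Raising `synthetic_per_gen` is one way to experiment with a larger pile of machine-made data.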

"By focusing on a simple model," explained King's professor Yasser Roudi, the researchers could pin down exactly why that one outside data point stops the system from sliding into gibberish.

The team is careful to note that their work used simplified models, not the giant neural networks behind ChatGPT or Gemini, and they want to test the principle on bigger systems next. Still, the takeaway is encouraging: model collapse may not be the inevitable doom-loop some feared, as long as a steady trickle of real human data, or at least some grounding in genuine prior knowledge, stays in the mix.

It echoes other recent findings, too. Researchers have shown that when synthetic data piles up alongside real human data, rather than replacing it, collapse is largely avoided. That's closer to how the real world actually works: Nobody deletes the entire internet and starts fresh each year.
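The difference between piling up and replacing can be sketched in a few lines of Python. This is a toy under stated assumptions, not the cited researchers' experiment: the "model" is just a word-count table, and the corpus, the sample size and the name `support_over_time` are invented for illustration.

```python
import random
from collections import Counter

def support_over_time(real_corpus, accumulate, samples_per_gen=20, generations=1000, seed=0):
    """Track how many distinct words remain in the training pool each generation."""
    rng = random.Random(seed)
    pool = Counter(real_corpus)   # the training data the next model will see
    sizes = [len(pool)]
    for _ in range(generations):
        words = list(pool)
        # The current model writes new text in proportion to what it was trained on.
        synthetic = rng.choices(words, weights=[pool[w] for w in words], k=samples_per_gen)
        if accumulate:
            pool.update(synthetic)      # add to the pile; the human text stays
        else:
            pool = Counter(synthetic)   # replace: only the newest output survives
        sizes.append(len(pool))
    return sizes

# Hypothetical 100-word "human" corpus.
real_corpus = ["the"] * 50 + ["cat"] * 20 + ["sat"] * 15 + ["quietly"] * 10 + ["yesterday"] * 5
accumulated = support_over_time(real_corpus, accumulate=True)
replaced = support_over_time(real_corpus, accumulate=False)
print(accumulated[-1], replaced[-1])
```

Because the original human corpus never leaves the pool in the accumulate run, no word can vanish there. The frequencies can still drift, so this toy shows retained diversity rather than perfect fidelity.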

Final thoughts

In the near term, you don't need to worry that ChatGPT is about to dissolve into static. The major AI labs are well aware of this trap, and they spend heavily on human data, careful curation and licensing deals with publishers precisely to keep their training sets grounded.

But model collapse is a useful lens for a few things you're already noticing. It's part of why "is this AI-written?" labeling, content provenance and the value of genuine human expertise keep coming up.

It's a reason the open web getting sloppier is a real long-term problem, not just an aesthetic complaint. And it's a quiet argument for the enduring worth of the real thing — your reviews, your forum posts, your actual human writing — in an era increasingly tempted to settle for the synthetic. The machines, it turns out, still need us. Even just a little.
