Adam Hales

Cloudflare updates "robots.txt" — what does that mean for the future of the web?

(Image: Cloudflare logo on a laptop display.)

Robots.txt is a small text file that sits at the root of a website. It tells search engines and bots which parts of the site they’re allowed to visit and which they’re not, working like a digital “do not enter” sign. In the early days of the internet, this arrangement worked well.
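For context, a traditional robots.txt is nothing more than a list of user agents and paths. The snippet below is a generic illustration rather than any particular site’s file (GPTBot is OpenAI’s real crawler name; the paths are made up):

    User-agent: *          # applies to every crawler
    Disallow: /private/    # please stay out of /private/
    Allow: /               # everything else may be crawled

    User-agent: GPTBot     # OpenAI's crawler, addressed by name
    Disallow: /            # asked to stay out entirely

Crucially, none of this is enforced by the file itself; it only works if the crawler chooses to honor it.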

Search engines like Google and Bing followed the rules, and most website owners were happy with that balance. But the rise of AI has changed the picture. AI bots aren’t indexing websites in the traditional sense. Instead, they copy content to train chatbots or generate answers.

Many AI companies ignore robots.txt entirely, or disguise their crawlers to slip past restrictions. Cloudflare protects around 20% of the internet, which gives it a unique view of how these AI bots behave at scale. That’s why it has introduced the Content Signals Policy, a new way for publishers to say whether their content is okay to use for AI training — or not.

What Cloudflare’s Content Signals Policy actually does

As reported by Digiday, the new policy builds on top of robots.txt by adding extra instructions for bots to follow. Instead of only saying which pages can be crawled, it lets publishers set rules for how their content can be used after it’s accessed.

There are three new “signals” to choose from:

  • search – allows content to be used for building a search index and showing links or snippets in results.
  • ai-input – covers using content directly in AI answers, such as when a chatbot pulls from a page to generate a response.
  • ai-train – controls whether content can be used to train or fine-tune AI models.

These signals use simple yes or no values. For example, a site could allow its content to appear in search results but block it from AI training.

Cloudflare has already rolled this out to more than 3.8 million domains. By default, search is set to “yes,” ai-train is set to “no,” and ai-input is left neutral until the site owner decides otherwise.
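Put together, those defaults amount to a short addition at the top of a site’s robots.txt. The snippet below is a sketch of how the signals are expressed; Cloudflare also prepends a longer explanatory comment block, so treat the wording here as illustrative rather than verbatim:

    # Content signals describe how content may be used after it is accessed.
    # ai-input is omitted, which leaves it neutral (no preference asserted).
    Content-Signal: search=yes, ai-train=no

    User-agent: *
    Allow: /

Note that the crawl rules and the usage signals are separate: a page can be fully crawlable for search while still carrying a “no” signal for AI training.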

Why enforcement still matters — and Google’s role

The Google AI logo (Image credit: Getty Images | NurPhoto)

Whilst this update is a welcome step, some bots will still ignore the new signals. Website owners should combine them with extra protection, such as web application firewalls, which filter and monitor traffic between a site and the internet.

Bot management is also important. This uses machine learning to spot and block malicious automated traffic, while still letting real users through.

Even if some AI bots choose to ignore these rules, the policy strengthens the legal position of publishers. Cloudflare frames content signals as a “reservation of rights,” which could be used in future cases against AI companies.

If AI firms decide to respect the signals, it could set a new standard for the web. If not, expect stricter blocking and more aggressive legal action, something I’m sure many who object to AI being trained on their content will welcome.

Another sticking point is how Google handles its crawlers. Googlebot is bundled to cover both search and AI Overviews, meaning publishers cannot opt out of AI features without also losing search visibility.

This creates an unfair trade-off. Either allow Google to use content for AI, or risk losing valuable traffic. Smaller publishers are hit hardest here, as they depend on Google search to reach their audiences.

The future of AI scraping and monetization

It’s good to see Cloudflare taking steps to protect domains from the wave of AI bots currently scraping anything publicly available online. Even OpenAI appears to train on whatever it can reach: its recent video model, Sora 2, can recreate missions from Cyberpunk 2077 in detail, and it’s hard to believe permission was ever granted to use that content.

The same goes for videos featuring Mario or Pikachu. Nintendo is unlikely to ignore such uses, but given its history, it’s just as likely it will target a small fan project instead of going after a major AI company.

Cloudflare is also testing a “pay-per-crawl” feature. This would let domain owners charge AI crawlers each time they access a site. If a crawler doesn’t provide payment details, it will be met with a 402 Payment Required error.
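In HTTP terms, the exchange would look something like the sketch below. The 402 status code is standard HTTP; everything else here is a placeholder to show the shape of the interaction, not Cloudflare’s documented pay-per-crawl API:

    GET /articles/example HTTP/1.1
    Host: example.com
    User-Agent: SomeAICrawler/1.0

    HTTP/1.1 402 Payment Required
    Content-Type: text/plain

    Payment is required to crawl this resource.

A crawler that has registered and supplied payment credentials would presumably get a normal 200 response instead, with the visit billed to its account.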

