
A new study into the testing procedures behind common AI models has reached some worrying conclusions.
The joint investigation by U.S. and U.K. researchers examined data from over 440 benchmarking tests used to measure an AI model's ability to solve problems and meet safety requirements. The researchers reported flaws in these tests that undermine the credibility of the claims built on them.
According to the study, the flaws stem from these benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or overall AI progress.
“Benchmarks underpin nearly all claims about advances in AI,” said Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
Currently, there is no clear regulation of AI models. Instead, they are tested against a wide range of benchmark examinations, such as their ability to solve common logic problems or tests of whether they can be blackmailed.
These tests show AI companies where their models fall down, so they can make improvements in the next iteration. They are also typically the yardstick used in policy and regulation decisions.
What does this mean for AI?

The safety of AI models is a problem that has been up for debate for a while now. In the past, companies like OpenAI and Google have launched their models without completing safety reports.
Elsewhere, models have been launched after scoring highly in a range of benchmarking tests, only to fail when released to the public.
Google recently withdrew one of its latest models, Gemma, after it made false allegations about a U.S. senator, and similar issues have occurred in the past, such as xAI's Grok hallucinating conspiracy theories.
What’s the solution?
The study was carried out by researchers from the University of California, Berkeley, and the University of Oxford in the U.K. The team made eight recommendations to AI companies to address the issues they raised, which fall into three broad areas:
- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.
They also provided a checklist that benchmark developers can use to check whether their own tests are up to scratch.
Whether or not the AI companies take these recommendations on board remains to be seen.
