Even the most advanced AI models fail more often than you think on structured outputs — raising doubts about the effectiveness of coding assistants

Report finds AI coding assistants regularly fail one in four structured-output tasks
Even advanced proprietary models only reach approximately 75% accuracy
Open source AI models perform worse, averaging closer to 65% reliability

The promise of artificial intelligence as a tireless coding assistant has hit a significant roadblock, with new research claiming such tools regularly fail tasks that require strictly formatted output.

A recent study from the University of Waterloo found AI struggles with software development, with even the most advanced models failing on one in four structured-output tasks.

The research evaluated 11 large language models across 18 different structured formats and 44 tasks to test how well the systems could follow predefined rules, finding a clear disparity between performance on text-based tasks and outputs involving multimedia or complex structures.

Benchmarking reveals a troubling reliability gap

While text-related tasks were generally handled with moderate success, tasks requiring image, video, or website generation proved far more problematic.

Accuracy in these areas dropped sharply, raising questions about how these AI tools can be integrated safely into professional workflows.

“With this kind of study, we want to measure not only the syntax of the code — that is, whether it’s following the set rules — but also whether the outputs produced for various tasks were accurate,” said Dongfu Jiang, a PhD student and co-first author of the study.
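To illustrate the distinction Jiang draws, here is a minimal Python sketch that scores a single response on two separate questions: does it follow the format rules, and is its content correct? It is not the study's benchmark code. The function name `check_output`, the `answer` key, and the sample responses are all hypothetical, chosen only for illustration.

```python
import json


def check_output(raw, required_keys, expected):
    """Score one model response on two axes (hypothetical helper).

    follows_rules: the text parses as a JSON object with the required keys.
    accurate:      the values under those keys match the expected answer.
    """
    result = {"follows_rules": False, "accurate": False}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not even valid JSON

    if not isinstance(parsed, dict) or not set(required_keys).issubset(parsed):
        return result  # valid JSON, but the wrong shape

    result["follows_rules"] = True
    result["accurate"] = all(parsed[key] == value for key, value in expected.items())
    return result


# Hypothetical task: "Return the sum of 2 and 2 as {"answer": <number>}".
well_formed_and_right = check_output('{"answer": 4}', ["answer"], {"answer": 4})
well_formed_but_wrong = check_output('{"answer": 5}', ["answer"], {"answer": 4})
malformed = check_output('{"answer": 4', ["answer"], {"answer": 4})
```

In this sketch the second response passes the format check but fails on accuracy, which is the kind of error a syntax-only check would miss.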

Structured outputs, designed to impose format consistency through JSON, XML, or Markdown, were intended to make AI responses more reliable for developers.

AI companies, including OpenAI, Google, and Anthropic, introduced structured outputs to force responses into predictable formats.

The Waterloo research suggests this approach has not yet delivered the level of dependability developers require.

Waterloo’s benchmarking revealed even the most advanced proprietary models reached only about 75% accuracy, while open source alternatives performed closer to 65%.

These results suggest that, despite improvements, AI systems still make significant errors that cannot be ignored in professional development environments.

The report emphasized the need for human oversight, noting, “Developers might have these agents working for them, but they still need significant human supervision.”

Although structured outputs are a step forward from free-form natural language responses, errors remain common.

The technology is not yet robust enough to operate independently in complex development scenarios.

One might reasonably question whether the industry’s enthusiasm for AI coding assistants and vibe coding has outpaced the actual capabilities of the underlying technology.

Even the most advanced models demonstrate a significant failure rate on structured tasks, revealing a wide gap between marketing claims and actual performance.

Therefore, for now, developers should treat these tools as experimental aids rather than autonomous colleagues.
