
Hello and welcome to Eye on AI. In this edition…A new OpenAI benchmark shows how good models are getting at completing professional tasks…California has a new AI law…OpenAI rolls out Instant Purchases in ChatGPT…and AI can pick winning founders better than most VCs.
Google CEO Sundar Pichai was right when he said that while AI companies aspire to create AGI (artificial general intelligence), what we have right now is more like AJI—artificial jagged intelligence. What Pichai meant by this is that today’s AI is brilliant at some things, including some tasks that even human experts find difficult, while also performing poorly at some tasks that a human would find relatively easy.
Thinking of AI in this way partly explains the confusing set of headlines we’ve seen about AI lately: models acing international math and coding competitions even as many AI projects fail to achieve a return on investment and people complain about AI-created “workslop” dragging down productivity. (More on some of these pessimistic studies later. Suffice it to say, there is often a lot less to these headlines than meets the eye.)
One of the reasons for the seeming disparity in AI’s capabilities is that many AI benchmarks do not reflect real-world use cases. Which is why a new gauge published by OpenAI last week is so important. Called GDPval, the benchmark evaluates leading AI models on real-world tasks curated by experts from across 44 different professions, representing nine different sectors of the economy. The experts had an average of 14 years’ experience in their fields, which ranged from law and finance to retail and manufacturing, as well as government and healthcare.
Whereas a traditional AI benchmark might test a model’s capability to answer a multiple-choice bar exam question about contract law, for example, the GDPval assessment asks the AI model to craft an entire 3,500-word legal memo assessing the standard of review under Delaware law that a public company founder and CEO with majority control would face if he wanted the public company to acquire a private company he also owned.
OpenAI tested not only its own models, but those from a number of other leading labs, including Google DeepMind’s Gemini 2.5 Pro, Anthropic’s Claude Opus 4.1, and xAI’s Grok 4. Of these, Claude Opus 4.1 consistently performed the best, beating or equaling human expert performance on 47.6% of the total tasks. (Big kudos to OpenAI for intellectual honesty in publishing a study in which its own models were not top of the heap.)
There was a lot of variance between models. Gemini and Grok were often able to complete between a fifth and a third of tasks at or above the standard of human experts, while OpenAI’s GPT-5 Thinking fell between Claude Opus 4.1 and Gemini, and OpenAI’s earlier model, GPT-4o, fared worst of all, barely able to complete 10% of the tasks to a professional standard. GPT-5 was the best at following a prompt correctly but often failed to format its response properly, according to the researchers. Gemini and Grok seemed to have the most trouble following instructions—sometimes failing to deliver the requested outcome and ignoring reference data—but OpenAI did note that “all the models sometimes hallucinated data or miscalculated.”
Big differences across sectors and professions
There was also notable variance across economic sectors, with the models performing best on tasks from government, retail, and wholesale trade, and generally worst on tasks from the manufacturing sector.
For some professional tasks, Claude Opus 4.1’s performance was off the charts: it beat or equaled human performance on 81% of the tasks taken from “counter and rental clerks,” 76% of those taken from shipping clerks, 70% of those from software developers, and, intriguingly, 70% of the tasks taken from the work of private investigators and detectives. (Forget Sherlock Holmes, just call Claude!) GPT-5 Thinking beat human experts on 79% of the tasks that sales managers perform and 75% of those that editors perform (gulp!).
On others, human experts won handily. The models were all notably poor at performing tasks related to the work of film and video editors, producers and directors, and audio and video technicians. So Hollywood may be breathing a sigh of relief. The models also fell down on tasks related to pharmacists’ jobs.
When AI models failed to equal or exceed human performance, it was rarely in ways that human experts judged “catastrophic”—that happened in only about 2.7% of GPT-5’s failures. But the GPT-5 response was judged “bad” in another 26.7% of these cases, and “acceptable but subpar” in 47.7% of the cases where human outputs were deemed superior.
The need for ‘Centaur’ benchmarks
I asked Erik Brynjolfsson, the economist at Stanford University’s Institute for Human-Centered AI (HAI) who has done some of the best research to date on the economic impact of generative AI, what he thought of GDPval and the results. He said the assessment goes a long way toward closing the gap that has developed between the benchmarks AI researchers prefer, which are often highly technical, and real-world problems. Brynjolfsson said he thought GDPval would “inspire AI researchers to think more about how to design their systems to be useful in doing practical work, not just ace the technical benchmarks.” He also said that “in practice, that means integrating technology into workflows and more often than not, actively involving humans.”
Brynjolfsson said he and colleague Andy Haupt had been arguing for “Centaur Evaluations,” which judge how well humans perform when paired with, and assisted by, an AI model, rather than always treating the AI model as a replacement for human workers. (The term comes from “centaur chess,” in which human grandmasters are assisted by chess computers; the pairing was found to exceed what either humans or machines could do alone. The centaur, of course, is the half-man, half-horse creature of Greek mythology.)
GDPval did take some steps toward this, looking in one case at how much time and money were saved when OpenAI’s models were allowed to try a task multiple times, with a human then coming in to fix the output if it was not up to standard. Here, GPT-5 was found to offer both a 1.5x speedup and a 1.5x cost improvement over a human expert working without AI assistance. (Less capable OpenAI models did not help as much, with GPT-4o actually leading to a slowdown and a cost increase over the human expert working unassisted.)
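For readers who like to see the arithmetic, here is a minimal back-of-the-envelope sketch, in Python, of how a “model drafts first, human fixes failures” workflow can yield a roughly 1.5x speedup and cost improvement. Every number in it (task counts, hours, rates, the model’s success rate) is a hypothetical placeholder, not a figure from GDPval.

```python
# Back-of-the-envelope model of a "model drafts, human reviews/fixes" workflow.
# All numbers below are hypothetical placeholders, not GDPval figures.

def unassisted(n_tasks, human_hours, human_rate):
    """Time and cost if a human expert does every task alone."""
    hours = n_tasks * human_hours
    return hours, hours * human_rate

def assisted(n_tasks, human_hours, human_rate,
             success_rate, review_hours, model_cost_per_task):
    """Time and cost if a model drafts each task and a human reviews every
    draft, redoing the task from scratch whenever the draft isn't usable."""
    hours = n_tasks * review_hours + n_tasks * (1 - success_rate) * human_hours
    cost = hours * human_rate + n_tasks * model_cost_per_task
    return hours, cost

if __name__ == "__main__":
    base_hours, base_cost = unassisted(n_tasks=100, human_hours=4, human_rate=150)
    ai_hours, ai_cost = assisted(n_tasks=100, human_hours=4, human_rate=150,
                                 success_rate=0.5, review_hours=0.5,
                                 model_cost_per_task=5)
    print(f"speedup: {base_hours / ai_hours:.2f}x, "
          f"cost improvement: {base_cost / ai_cost:.2f}x")
```

With these made-up inputs the script prints a speedup and cost improvement of about 1.6x; the point is simply that the gains hinge on the model’s success rate and the cost of human review, which is why less capable models can make the workflow slower and more expensive than working unassisted.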
About that AI workslop research…
This last point, along with the “acceptable but subpar” label that characterized a good portion of the cases where the AI models did not equal human performance, brings me back to that “workslop” research that came out last week. This may, in fact, be what is happening with some AI outputs in corporate settings, especially as the most capable models—such as GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro—are only being used at scale by a handful of companies. That said, as the journalist Adam Davidson pointed out in a LinkedIn post, the workslop study—just like that now-infamous MIT study claiming 95% of AI pilots fail to produce ROI—had some very serious flaws. It was based on an open online survey that asked highly leading questions; it was essentially a “push poll” designed more to generate an attention-grabbing headline about the problem of AI workslop than to serve as a piece of intellectually honest research. But it worked—it got lots of headlines, including in Fortune.
If one focuses on these kinds of headlines, it is all too easy to miss the other side of what is happening in AI, which is the story that GDPval tells: the best-performing AI models are already on par with human expertise on many tasks. (And remember that GDPval has so far evaluated only Anthropic’s Claude Opus 4.1, not its new Claude Sonnet 4.5, which was released yesterday and can work continuously on a task for up to 30 hours, far longer than any previous model.) This doesn’t mean AI can replace these professional experts any time soon. As Brynjolfsson’s work has shown, most jobs consist of dozens of different tasks, and AI can only equal or beat human performance on some of them. In many cases, a human needs to be in the loop to correct the outputs when a model fails (which, as GDPval shows, is still happening at least 20% of the time, even on the professional tasks where the models perform best). But AI is making inroads, sometimes rapidly, in many domains—and more and more of its outputs are not just workslop.
With that, here’s more AI news.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
Before we get to the news, I want to call your attention to the Fortune AIQ 50, a new ranking, published by Fortune today, that evaluates how Fortune 500 companies are doing at deploying AI. The ranking shows which companies, across 18 different sectors—from financials to healthcare to retail—are doing best when it comes to AI, as judged by both self-assessments and peer reviews. You can see the list here, and catch up on Fortune’s ongoing AIQ series.