Are AI agents ready for the workplace? A new benchmark raises doubts.

Jan 22, 2026 | Technology

It’s been nearly two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work: the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT staff, and others.

But despite the huge progress made by foundation models, the change in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work has been relatively unaffected.

It’s one of the biggest mysteries in AI — and thanks to new research from the training-data giant Mercor, we’re finally getting some answers.

The new research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called APEX-Agents — and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all.

According to Mercor CEO Brendan Foody, who worked on the paper, the models’ biggest stumbling point was tracking down information across multiple domains — something that’s integral to most of the knowledge work performed by humans.

“One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services,” Foody told TechCrunch. “The way we do our jobs isn’t with one individual giving us all the context in one place. In real life, you’re operating across Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.

The scenarios were all drawn from actual professionals on Mercor’s expert marketplace, who both laid out the queries and set the standard for a successful response. Looking throug …
