There is no shortage of AI benchmarks on the market today, with popular options like Humanity’s Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others.

AI agents excel at solving the abstract math problems and passing the PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

“If we focus our research efforts on getting better at [existing benchmarks], then we’re probably not solving the right problems to make Databricks a better platform,” Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. “So that’s why we were looking around: How do we create a benchmark that, if we get better at it, we’re actually getting better at solving the problems that our customers have?”

The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA serves as a proxy for the economically valuable tasks enterprises actually perform.

Why academic benchmarks miss the enterprise mark

Popular AI benchmarks have numerous shortcomings from an enterprise perspective, according to Elsen. HLE features questions requiring PhD-level expertise across diverse fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the frontiers of AI capabilities, but neither reflects daily enterprise work. Even GDPval, which was specifically created to evaluate economically …