This tree search framework hits 98.7% on documents where vector search fails

Jan 30, 2026 | Technology

A new open-source framework called PageIndex tackles one of the oldest problems in retrieval-augmented generation (RAG): handling very long documents.

The classic RAG workflow (chunk documents, calculate embeddings, store them in a vector database, and retrieve the top matches based on semantic similarity) works well for basic tasks such as Q&A over small documents. But as enterprises try to move RAG into high-stakes workflows such as auditing financial statements, analyzing legal contracts, and navigating pharmaceutical protocols, they're hitting an accuracy barrier that chunk optimization can't solve. PageIndex abandons the standard "chunk-and-embed" method entirely and treats document retrieval not as a search problem but as a navigation problem.

AlphaGo for documents

PageIndex addresses these limitations by borrowing a concept from game-playing AI rather than search engines: tree search.

When humans need to find specific information in a dense textbook or a long annual report, they do not scan every paragraph linearly. They consult the table of contents to identify the relevant chapter, then the section, and finally the specific page. PageIndex has the LLM replicate this human behavior.

Instead of pre-calculating vectors, the framework builds a "Global Index" of the document's structure, creating a tree where nodes represent chapters, sections, and subsections. When a query arrives, the LLM performs a tree search, explicitly classifying each node as relevant or irrelevant based on the full context of the user's request.

"In computer science terms, a table of contents is a tree-structured representation of a document, and navigating it corresponds to tree search," Zhang said. "PageIndex applies the same core idea — tree search — to document retrieval, and can be thought of as an AlphaGo-style system for retrieval rather than fo …
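To make the navigation idea concrete, here is a minimal sketch of tree search over a table-of-contents index. This is not PageIndex's actual implementation; the `Node` structure and the `is_relevant` callback (which stands in for the LLM's relevance judgment, replaced here by a keyword check) are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One table-of-contents entry: a chapter, section, or subsection."""
    title: str
    children: list["Node"] = field(default_factory=list)

def tree_search(node: Node, is_relevant) -> list[Node]:
    """Depth-first search that prunes branches judged irrelevant.

    In PageIndex the relevance judgment is made by an LLM given the
    full query context; `is_relevant` is a stand-in for that call.
    """
    if not is_relevant(node):
        return []                       # prune this whole subtree
    if not node.children:
        return [node]                   # leaf: a retrievable section
    hits: list[Node] = []
    for child in node.children:
        hits.extend(tree_search(child, is_relevant))
    return hits or [node]               # relevant node with no relevant children

# Toy "Global Index" for an annual report.
report = Node("Annual Report", children=[
    Node("Letter to Shareholders"),
    Node("Financial Statements", children=[
        Node("Balance Sheet"),
        Node("Income Statement"),
    ]),
    Node("Risk Factors"),
])

# Hypothetical relevance check: simple keyword overlap instead of an LLM.
query_terms = {"annual", "report", "financial", "income"}
relevant = lambda n: bool(query_terms & set(n.title.lower().split()))

results = tree_search(report, relevant)
print([n.title for n in results])       # → ['Income Statement']
```

The key property is that irrelevant branches (here, "Risk Factors" and "Letter to Shareholders") are never descended into, which is how the approach scales to very long documents without embedding every chunk.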
