A new study by Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) shows that training large language models (LLMs) for complex, autonomous tasks does not require massive datasets. Their framework, LIMI (Less Is More for Intelligent Agency), builds on similar work in other areas of LLM research and finds that “machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.” In other words, it’s data quality, not quantity, that matters.

In experiments, the researchers found that with a small but carefully curated dataset of just 78 examples, they could train LLMs to outperform models trained on thousands of examples by a considerable margin on key industry benchmarks. This discovery could have important implications for enterprise applications where data is scarce or expensive to collect.

The challenge of building agents that work

The researchers define agency as “the emergent capacity of AI systems to function as autonomous agents–actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools.” Put simply, these are AI systems that “don’t just think, but work.”

The problem is that current training frameworks assume that higher agentic intelligence requires a lot of data, as the classic scaling laws of language modeling have shown. The researchers argue that this approach leads to increasingly complex training pipelines and substantial resource requirements. Moreover, in many areas, data is scarce, hard to obtain, and very expensive to curate.

However, research in other domains suggests that more data is not necessarily required to meet LLM training objectives. For example, LIMA, a 2023 paper, showed a model could be effectively aligned with just 1,000 curated examples. More recently, LIMO demonstrated th …