Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.
The ability to deploy SLMs for complex reasoning tasks could prove very useful as enterprises look for new ways to use these models across different environments and applications.
Test-time scaling explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use “internal TTS,” which means they are trained to “think” slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative approach is “external TTS,” where model performance is enhanced with (as the name implies) outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning them. An external TTS setup is usually composed of a “policy model,” which is the main LLM generating the answer, and a process reward model (PRM) that evaluates the policy model’s answers. These two components are coupled together through a sampling or search method.
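To make that coupling concrete, here is a minimal sketch of the simplest pairing, in which the policy model samples several candidate answers and the PRM picks the highest-scoring one (the “best-of-N” approach described next). The `generate` and `prm_score` callables are hypothetical stand-ins for a policy model’s sampler and a PRM’s scorer, not code from the study.

```python
def best_of_n(question, generate, prm_score, n=8):
    """Sample n candidate answers from the policy model and keep the one
    the process reward model (PRM) scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: prm_score(question, answer))
```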
The easiest setup is “best-of-N,” where the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In “beam search,” the model breaks the answer down into multiple steps.
For each step, it samples multiple answers and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And, in “diverse verifier tree search” (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
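The sketch below illustrates that step-level beam search loop under the same assumptions as before: `generate_step` and `prm_score` are hypothetical helpers standing in for the policy model and the PRM, and the details (beam width, samples per step) are illustrative rather than the paper’s settings.

```python
def beam_search(question, generate_step, prm_score,
                beam_width=4, samples_per_step=4, max_steps=10):
    """Expand partial answers step by step, keeping only the partial answers
    the PRM scores highest, until one of them is complete."""
    beams = [""]  # partial chains of reasoning, starting empty
    for _ in range(max_steps):
        expansions = []
        for partial in beams:
            for _ in range(samples_per_step):
                # The policy model proposes the next reasoning step and
                # signals whether the answer is now complete.
                step, done = generate_step(question, partial)
                expansions.append((partial + step, done))
        # Rank all expanded partial answers with the PRM.
        expansions.sort(key=lambda e: prm_score(question, e[0]), reverse=True)
        finished = [text for text, done in expansions if done]
        if finished:
            return finished[0]  # best-scoring completed answer
        beams = [text for text, _ in expansions[:beam_width]]
    return beams[0]
```

DVTS follows a similar loop but splits the search budget across several independent subtrees, trading some depth for a more diverse pool of candidate answers.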
Different test-time scaling methods (source: arXiv)
What is the right scaling strategy?
Choosing the right TTS strategy depends on multiple factors. The study authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency is largely dependent on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models …