Researchers at Google have developed a technique that makes it easier for AI models to learn complex reasoning tasks that usually cause LLMs to hallucinate or fall apart. Instead of training LLMs through next-token prediction, their technique, called internal reinforcement learning (internal RL), steers the model’s internal activations toward developing a high-level, step-by-step solution for the input problem. Ultimately, this could provide a scalable path for creating autonomous agents that can handle complex reasoning and real-world robotics without needing constant, manual guidance.

The limits of next-token prediction

Reinforcement learning plays a key role in post-training LLMs, particularly for complex reasoning tasks that require long-horizon planning. However, the problem lies in the architecture of these models. LLMs are autoregressive, meaning they generate sequences one token at a time. When these models explore new strategies during training, they do so by making small, random changes to the next single token or action. This exposes a deeper limitation: next-token prediction forces models to search for solutions at the wrong level of abstraction, making long-horizon reasoning inefficient even when the model “knows” what to do.

This token-by-token approach works well for basic language modeling but breaks down in long-horizon tasks where rewards are sparse. If the model relies solely on random token-level sampling, the probability of stumbling upon the correct multi-step solution is infinitesimally small, “on the order of one in a million,” according to the researchers.

The issue isn’t just that the models get confused; it’s that they get confused at the wrong level. In comments provided to VentureBeat, Yanick Schimpf, a co-author of the paper, notes that in a 20-step task, an agent can get lost in the minute details of a single step, or it can lose track of the overall goal.

“We argue that when facing a problem with some …
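To make the sparse-reward arithmetic described above concrete, here is a minimal, illustrative Python sketch. It is not taken from the paper: the 20-step task length echoes Schimpf’s example, and the 50% per-step success rate is an assumption chosen only because it lands near the researchers’ “one in a million” figure.

```python
# Illustrative only (not from the Google paper): a toy calculation showing why
# purely random, token-by-token exploration almost never completes a long
# multi-step task when the reward only arrives at the very end.

import random

STEPS = 20        # hypothetical task length, echoing the 20-step example above
P_CORRECT = 0.5   # assumed chance of sampling the right action at any one step

# Analytically, the chance of getting every step right in a single rollout is p^n.
exact = P_CORRECT ** STEPS
print(f"Analytic success probability: {exact:.2e}")  # ~9.5e-07, roughly one in a million

# Monte Carlo sanity check: count rollouts that happen to get all 20 steps right.
# At odds this long, even millions of trials yield only a handful of successes.
TRIALS = 5_000_000
successes = sum(
    all(random.random() < P_CORRECT for _ in range(STEPS))
    for _ in range(TRIALS)
)
print(f"Empirical success rate over {TRIALS:,} rollouts: {successes / TRIALS:.2e}")
```

Under those assumptions, the analytic answer is 0.5 raised to the 20th power, about one success in every 1,048,576 random rollouts, which is the scale of odds that purely token-level exploration has to beat before any reward signal ever appears.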