Microsoft launches Phi-4-Reasoning-Plus, a small, powerful, open weights reasoning model!

May 1, 2025 | Technology


Microsoft Research has announced the release of Phi-4-reasoning-plus, an open-weight language model built for tasks requiring deep, structured reasoning.

Building on the architecture of the previously released Phi-4, the new model integrates supervised fine-tuning and reinforcement learning to deliver improved performance on benchmarks in mathematics, science, coding, and logic-based tasks.

Phi-4-reasoning-plus is a 14-billion parameter dense decoder-only Transformer model that emphasizes quality over scale. Its training process involved 16 billion tokens—about 8.3 billion of them unique—drawn from synthetic and curated web-based datasets.

A reinforcement learning (RL) phase, using only about 6,400 math-focused problems, further refined the model’s reasoning capabilities.

The model has been released under a permissive MIT license — enabling its use for broad commercial and enterprise applications, and fine-tuning or distillation, without restriction — and is compatible with widely used inference frameworks including Hugging Face Transformers, vLLM, llama.cpp, and Ollama.

Microsoft provides detailed recommendations on inference parameters and system prompt formatting to help developers get the most from the model.
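For readers who want to try the model locally, the snippet below is a minimal sketch of how it can be loaded through Hugging Face Transformers, one of the supported frameworks. The model ID and sampling settings shown are illustrative assumptions rather than Microsoft's official recommendations; the model card documents the exact values.

```python
# Minimal sketch: loading Phi-4-reasoning-plus with Hugging Face Transformers.
# The repo name and generation settings below are assumptions for illustration;
# consult the official model card for the recommended inference parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning-plus"  # assumed Hugging Face repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [
    {"role": "system", "content": "You are a careful reasoner. Think step by step."},
    {"role": "user", "content": "If 3x + 7 = 22, what is x?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models typically need a generous token budget for the thinking phase.
outputs = model.generate(
    inputs, max_new_tokens=2048, do_sample=True, temperature=0.8, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```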

Outperforms larger models

The model’s development reflects Microsoft’s growing emphasis on training smaller models capable of rivaling much larger systems in performance.

Despite its relatively modest size, Phi-4-reasoning-plus outperforms larger open-weight models such as DeepSeek-R1-Distill-70B on a number of demanding benchmarks.

On the AIME 2025 math exam, for instance, it delivers higher average first-attempt accuracy across all 30 questions (a metric known as “pass@1”) than the 70B parameter distillation model, and approaches the performance of DeepSeek-R1 itself, which is far larger at 671B parameters.
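For context, pass@1 simply measures the share of questions a model answers correctly on a single attempt, as the minimal sketch below illustrates; reported scores are typically averaged over many sampled runs.

```python
# Minimal sketch of a pass@1 score over a 30-question exam such as AIME:
# one attempt per question, score = fraction answered correctly.
# Illustrative only; real evaluations average pass@1 over many sampled runs.
def pass_at_1(first_attempts, answer_key):
    correct = sum(1 for guess, truth in zip(first_attempts, answer_key) if guess == truth)
    return correct / len(answer_key)

# Example: 24 of 30 first attempts correct -> pass@1 = 0.8
```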

Structured thinking via fine-tuning

To achieve this, Microsoft employed a data-centric training strategy.

During the supervised fine-tuning stage, the model was trained using a curated blend of synthetic chain-of-thought reasoning traces and filtered high-quality prompts.

A key innovation in the training approach was the use of structured reasoning outputs marked with special <think> and </think> tokens.

These guide the model to separate its intermediate reasoning steps from the final answer, promoting both transparency and coherence in long-form problem solving.
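In practice, that separation makes the output straightforward to post-process. The sketch below, which assumes the tag names described above, shows one way a developer might split the reasoning trace from the final answer; the model card documents the canonical format.

```python
import re

# Minimal sketch: separating intermediate reasoning from the final answer when
# output is wrapped in <think> ... </think> markers, as described in the article.
def split_reasoning(text: str):
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>3x = 15, so x = 5.</think> x = 5")
print(answer)  # "x = 5"
```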

Reinforcement learning for accuracy and depth

Following fine-tuning, Microsoft used outcome-based reinforcement learning—specifically, the Group Relative Policy Optimization (GRPO) algorithm—to improve the model’s output accuracy and …
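At a high level, GRPO samples several candidate answers to the same prompt, scores each with an outcome reward, and credits each one relative to the group average. The sketch below illustrates that group-relative advantage calculation in simplified form; it is an illustration of the idea, not Microsoft's training code.

```python
# Simplified illustration of GRPO's group-relative advantage: reward each sampled
# solution (e.g., 1.0 if the final answer is correct, 0.0 otherwise) and normalize
# against the group's mean and spread. Not Microsoft's actual training code.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: 4 sampled solutions to one math problem, only the last two correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))
```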
