DeepSeek unveils new technique for smarter, scalable AI reward models

Apr 8, 2025 | Technology


DeepSeek AI, a Chinese research lab gaining recognition for its powerful open-source language models such as DeepSeek-R1, has introduced a significant advancement in reward modeling for large language models (LLMs). 

Their new technique, Self-Principled Critique Tuning (SPCT), aims to create generalist and scalable reward models (RMs). This could lead to more capable AI applications for open-ended tasks and domains where current models can’t capture the nuances and complexities of their environment and users.

The crucial role and current limits of reward models

Reinforcement learning (RL) has become a cornerstone in developing state-of-the-art LLMs. In RL, models are fine-tuned based on feedback signals that indicate the quality of their responses. 

Reward models are the critical component that provides these signals. Essentially, an RM acts as a judge, evaluating LLM outputs and assigning a score or “reward” that guides the RL process and teaches the LLM to produce more useful responses.
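To make that judge role concrete, here is a minimal, purely illustrative sketch of the interface a scalar reward model exposes to the RL loop. The `toy_reward_model` heuristic and the `collect_rl_feedback` helper are hypothetical stand-ins, not DeepSeek's implementation; a real RM would be a trained network rather than a keyword-overlap rule.

```python
# Illustrative only: a scalar reward model acting as a "judge" over LLM outputs.
# The heuristic below is a hypothetical stand-in for a learned reward model.

def toy_reward_model(prompt: str, response: str) -> float:
    """Assign a scalar reward to a (prompt, response) pair.

    A real RM would be a trained network; this fake uses keyword overlap and a
    length penalty purely to show the shape of the interface.
    """
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    length_penalty = 0.01 * max(0, len(response.split()) - 100)  # discourage rambling
    return overlap - length_penalty


def collect_rl_feedback(prompt: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each candidate response; these scalars are the feedback RL optimizes."""
    return [(r, toy_reward_model(prompt, r)) for r in candidates]


if __name__ == "__main__":
    prompt = "Explain why the sky is blue."
    candidates = [
        "The sky is blue because of Rayleigh scattering of sunlight.",
        "Blue is a nice color.",
    ]
    for response, reward in collect_rl_feedback(prompt, candidates):
        print(f"{reward:6.2f}  {response}")
```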

However, current RMs often face limitations. They typically excel in narrow domains with clear-cut rules or easily verifiable answers. For example, current state-of-the-art reasoning models such as DeepSeek-R1 underwent an RL phase, in which they were trained on math and coding problems where the ground truth is clearly defined.
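In those verifiable domains, the reward can be computed with a simple rule-based check against the known answer. The sketch below assumes the model writes its final answer inside a \boxed{} tag; that format and the exact-match rule are illustrative assumptions, not DeepSeek-R1's actual training pipeline.

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward for tasks with a clearly defined ground truth.

    Assumes the final answer is wrapped in \\boxed{...} (an illustrative
    convention): returns 1.0 for an exact match with the reference answer,
    0.0 otherwise.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


print(verifiable_math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_math_reward("I think it's 41", "42"))                   # 0.0
```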

Creating a reward model for complex, open-ended or subjective queries in general domains, however, remains a major hurdle. In the paper explaining their new technique, researchers at DeepSeek AI write, “Generalist RM requires to generate high-quality rewards beyond specific domains, where the criteria for rewards are more diverse and complex, and there are often no explicit reference or ground truth.”

They highlight four key challenges in creating generalist RMs capable of handling broader tasks:

Input flexibility: The RM must handle various input types and be able to evaluate one or more responses simultaneously.

Accuracy: It must generate accurate reward signals across diverse domains where the criteria are complex and the ground truth is often unavailable. 

Inference-time scalability: The RM should produce higher-quality rewards when more computational resources are allocated during inference (a generic sketch of this idea follows the list).

Learning scalable behaviors: For RMs to …
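To illustrate the inference-time scalability point, the sketch below uses a generic sample-and-aggregate pattern: a hypothetical `judge_once` stand-in for one stochastic pass of a generative reward model is called several times and the scores are averaged, so spending more compute (more samples) yields a lower-variance reward. This is a simplified illustration of the general idea, not SPCT itself.

```python
import random
import statistics

def judge_once(prompt: str, response: str, seed: int) -> float:
    """Hypothetical stand-in for one stochastic pass of a generative reward model.

    A real RM would sample a critique and derive a score; here we emit a noisy
    score around a fixed 'true' quality purely to demonstrate the aggregation.
    """
    rng = random.Random(seed)
    true_quality = 7.0  # pretend latent quality of this response
    return true_quality + rng.gauss(0, 2.0)


def scaled_reward(prompt: str, response: str, num_samples: int) -> float:
    """Spend more inference compute by sampling several judgments and averaging.

    Larger num_samples -> lower-variance reward, i.e. inference-time scaling.
    """
    scores = [judge_once(prompt, response, seed=i) for i in range(num_samples)]
    return statistics.mean(scores)


for n in (1, 8, 32):
    print(n, round(scaled_reward("prompt", "response", n), 2))
```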
