Breaking News
DeepSeek AI has released a research paper detailing a novel method to scale generalist reward models (GRMs) at inference time, while also signaling the imminent arrival of its next-generation R2 model. The paper, titled 'Inference-Time Scaling for Generalist Reward Modeling,' introduces a technique in which the reward model dynamically generates evaluation principles and critiques, trained through rejective fine-tuning and rule-based online reinforcement learning.
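The core inference-time idea, as described in the paper, is to sample several independent principle-and-critique generations from the reward model and aggregate their scores, so spending more compute at inference yields a finer-grained reward signal. The sketch below is a hypothetical illustration of that sampling-and-voting loop, not the paper's implementation: `generate_principles_and_critique` is a placeholder for the actual GRM call, and the random scores stand in for the model's generated per-response ratings.

```python
import random
from collections import defaultdict

def generate_principles_and_critique(prompt: str, responses: list[str], seed: int):
    """Placeholder for one sampled GRM generation. In the paper's setup the
    model writes evaluation principles and a critique, then emits a discrete
    score per candidate response; here we fake the scores for illustration."""
    rng = random.Random(seed)
    return {i: rng.randint(1, 10) for i in range(len(responses))}

def inference_time_scaled_reward(prompt: str, responses: list[str], k: int = 8):
    """Sample k independent principle/critique generations and vote by
    summing per-response scores across samples. Larger k means more
    inference-time compute and a more reliable aggregate reward."""
    totals = defaultdict(int)
    for sample in range(k):
        scores = generate_principles_and_critique(prompt, responses, seed=sample)
        for idx, score in scores.items():
            totals[idx] += score
    best = max(totals, key=totals.get)  # highest aggregate score wins
    return best, dict(totals)

if __name__ == "__main__":
    candidates = ["response A", "response B", "response C"]
    best, totals = inference_time_scaled_reward("Explain quicksort.", candidates, k=16)
    print(f"aggregate scores: {totals}, selected: {candidates[best]}")
```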

The release reflects a strategic shift in large language model (LLM) development, as the industry turns from pre-training scaling to post-training enhancements, particularly during the inference phase. The approach mirrors strategies seen in OpenAI's o1 model, which uses extended 'thinking time' to refine reasoning and self-correct errors.
Background
DeepSeek's own R1 series has already demonstrated that pure reinforcement learning (RL) training, without supervised fine-tuning, can achieve significant gains in reasoning capability. The new paper builds on this by addressing a fundamental limitation of LLMs: next-token prediction imparts vast knowledge, but models trained on it alone often lack deep planning and the ability to anticipate long-term outcomes.
Reinforcement learning acts as a critical complement, giving LLMs something like an 'internal world model' that simulates the potential outcomes of different reasoning paths. This synergy allows models to evaluate candidate solutions and select the strongest one, enabling the systematic long-term planning essential for complex problem-solving.
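In its simplest form, this evaluate-then-select loop is best-of-N sampling: generate several reasoning paths, score each with a reward signal, and keep the winner. The following is a minimal sketch under that assumption; `rollout_reasoning_path` and `reward` are hypothetical stand-ins for the LLM sampler and the learned reward model, not functions from DeepSeek's codebase.

```python
import random

def rollout_reasoning_path(question: str, seed: int) -> str:
    """Placeholder policy: in practice the LLM samples a full chain of
    thought plus an answer. Here we just fabricate labeled candidates."""
    return f"candidate reasoning path #{seed} for: {question}"

def reward(question: str, path: str) -> float:
    """Placeholder reward model standing in for the learned signal that
    scores how promising a reasoning path looks."""
    return random.random()

def best_of_n(question: str, n: int = 8) -> str:
    """Sample n reasoning paths and keep the one the reward model scores
    highest, rather than committing to a single greedy decode."""
    paths = [rollout_reasoning_path(question, seed=i) for i in range(n)]
    return max(paths, key=lambda p: reward(question, p))

if __name__ == "__main__":
    print(best_of_n("What is 17 * 23?", n=4))
```

A stronger reward model makes this selection step more valuable, which is why scaling the reward model itself at inference time, as the paper proposes, matters for reasoning quality.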
'The relationship between LLMs and reinforcement learning is multiplicative,' said Wu Yi, assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences (IIIS), in a recent podcast. 'While RL excels in decision-making, it inherently lacks understanding. That understanding comes from pre-trained models. Only when a strong foundation of language comprehension, memory, and logical reasoning is built during pre-training can RL fully unlock its potential to create a complete intelligent agent.'
What This Means
The timing of DeepSeek's announcement suggests a rapidly accelerating race to optimize inference-time computation—the 'thinking' phase of AI. By scaling reward models dynamically during inference, DeepSeek could enable more efficient and accurate reasoning without proportionate increases in training costs. This could democratize access to advanced AI capabilities, allowing smaller labs to compete with industry giants.
Industry observers are closely watching for the R2 model's release, which is expected to integrate these techniques. The convergence of LLMs and reinforcement learning may soon redefine what's possible in automated reasoning, planning, and decision-making across fields from scientific research to enterprise software.