Haizhong Zheng, Jiawei Zhao, Beidi Chen
First published on October 7th, 2025
<aside>
TL;DR
💠 Tolerating stale rollouts is key to scaling RL on LLMs, improving resource use, and enabling more flexible frameworks like AREAL and SLIME. Yet today’s algorithms handle only limited staleness, creating bottlenecks that make RL difficult to scale.
💡 We show that stale data is as informative as on-policy data in RL on LLMs. More specifically, our method, M2PO, maintains stable training without any performance drop even when training exclusively on data that is stale by at least 256 model updates.
</aside>
Figure 1. Left: Average Accuracy on eight math reasoning benchmarks. Standard GRPO suffers from degradation with stale rollouts, but M2PO achieves stable training and matches on-policy performance even under high staleness. Right: M2PO significantly reduces the token clipping ratio compared to GRPO with the same staleness, while avoiding training collapse.
Most RL algorithms for LLMs rely on an on-policy setup: the model must constantly generate fresh (or limited-staleness) examples to learn from. This makes training stable and reliable, but also very costly, as each update requires waiting for new rollouts to finish. To overcome this inefficiency, researchers have been experimenting with asynchronous RL systems, like AREAL and SLIME. In these systems, rollouts and model training happen independently, often spread across large computing clusters. Such approaches improve resource utilization and enable training to scale more efficiently, but the effectiveness of those systems largely relies on the ability of RL algorithms to tolerate rollout staleness without sacrificing stability or performance.
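For intuition, here is a minimal sketch of such a decoupled setup. The queue-based design, function names, and parameters are illustrative assumptions, not the actual AREAL or SLIME API:

```python
# Minimal sketch of a decoupled rollout/training loop (illustrative only; not
# the AREAL or SLIME API). Rollout workers tag each trajectory with the policy
# version that produced it; the trainer measures staleness as the gap in model
# updates between that version and the current one. In a real system the
# workers and the trainer run as separate processes or on separate clusters.
import queue

rollout_queue: "queue.Queue[dict]" = queue.Queue(maxsize=4096)
current_version = 0  # incremented by the trainer after every model update

def rollout_worker(generate_fn, prompts):
    """Continuously generate trajectories without waiting for the trainer."""
    for prompt in prompts:
        trajectory = generate_fn(prompt)          # hypothetical generation call
        rollout_queue.put({
            "trajectory": trajectory,
            "behavior_version": current_version,  # version of the policy used
        })

def trainer_step(update_fn, batch_size=32):
    """Consume whatever rollouts are available, however stale they are."""
    global current_version
    batch = [rollout_queue.get() for _ in range(batch_size)]
    staleness = [current_version - item["behavior_version"] for item in batch]
    update_fn(batch)                              # hypothetical policy update
    current_version += 1
    return max(staleness)                         # e.g. s = 256 in the experiments
```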
Why is it important? An effective off-policy RL algorithm that can learn from highly stale data fundamentally changes the design space of large-scale LLM training. It allows decoupling rollout generation from model updates, so multiple rollout workers can continuously collect trajectories without waiting for synchronization. This makes it possible to train LLMs with heterogeneous clusters and reuse massive amounts of past trajectories, significantly reducing the cost and latency of RL training.
The inferior performance of training on stale data can be attributed to several factors. For example:
- As staleness increases, GRPO clips, and thus effectively prunes, more tokens from the training data. Many of these pruned tokens are high-entropy tokens that play a pivotal role in logical reasoning (see the sketch below).
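To make this concrete, here is a minimal sketch of a PPO/GRPO-style clipped token objective and the resulting token clipping ratio. The function name, the $\epsilon$ value, and the tensor shapes are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the PPO/GRPO-style clipped token objective, to show why staleness
# increases the token clipping ratio (names and the eps value are illustrative).
import torch

def clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token clipped surrogate; returns loss and the fraction of clipped tokens."""
    ratio = torch.exp(logp_new - logp_old)          # importance ratio pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    loss = -torch.minimum(unclipped, clipped).mean()

    # A token is "clipped" when the clamped term determines the minimum, i.e.
    # its gradient is zeroed out. Staler behavior policies push ratios further
    # from 1, so this fraction grows with staleness.
    is_clipped = clipped < unclipped
    clip_ratio = is_clipped.float().mean()
    return loss, clip_ratio
```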
Prosperity before Collapse. To disentangle whether the performance drop stems from the stale data itself, generated by highly shifted old policies, or from the bias introduced by the training algorithm, we remove the trust region entirely, eliminating that source of algorithmic bias.
Figure 3. Prosperity before Collapse. Training without a trust region (TR) ($\epsilon=\infty$) under stale data (s=256) initially achieves higher performance than clipped training, sometimes even matching the on-policy baseline (s=0). However, it eventually collapses due to uncontrolled variance.
Surprisingly, we observe an interesting prosperity-before-collapse phenomenon. As shown in both Figure 1 and Figure 3, although training without a trust region eventually collapses, it achieves substantially better performance before collapsing. In fact, under stale data (s=256), the no-clipping setting outperforms clipped training, sometimes even matching on-policy baselines. This indicates that data collected from stale policies can contain as much information as data collected on-policy for RL on LLMs.
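Continuing the sketch above, removing the trust region amounts to setting $\epsilon = \infty$: the clamp becomes a no-op, no token is pruned, and the update reduces to a plain importance-weighted policy gradient whose variance grows unchecked as the behavior policy drifts. The dummy tensors below are purely illustrative:

```python
# Illustrative continuation of the sketch above, not the paper's implementation.
# With eps = infinity the clamp never activates, so every stale token keeps its
# gradient; the price is uncontrolled variance once importance ratios drift far
# from 1, which is the eventual collapse seen in Figure 3.
import torch

logp_old = torch.randn(1024)                    # behavior (stale) policy log-probs
logp_new = logp_old + 0.5 * torch.randn(1024)   # current policy has drifted
advantages = torch.randn(1024)

_, clip_ratio_tr = clipped_objective(logp_new, logp_old, advantages, eps=0.2)
_, clip_ratio_no_tr = clipped_objective(logp_new, logp_old, advantages,
                                        eps=float("inf"))
print(clip_ratio_tr.item())      # nonzero: many drifted tokens are pruned
print(clip_ratio_no_tr.item())   # 0.0: every token contributes gradient
```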