Jia Guo$^{1*\dagger}$, Yan Sun$^{12*\S}$, Zhenyu Huang$^1$, Zihao Wang$^{1\dagger}$, Zujie Wen$^{1\dagger}$, Zhiqiang Zhang$^1$, Jun Zhou$^1$, Stanley Kok$^2$

$^*$Equal Contribution $^{\dagger}$Project Leads $^{\S}$Work Done During Internship

$^1$Ant Group $^2$National University of Singapore

Work In Progress | Published on May 26, 2026

<aside> 💡

Updates

[News] AReaL has integrated our work [PR]

</aside>

𝑾𝒉𝒆𝒓𝒆 𝒕𝒓𝒖𝒔𝒕 𝒉𝒐𝒍𝒅𝒔, 𝒅𝒂𝒏𝒄𝒊𝒏𝒈 𝒘𝒊𝒕𝒉𝒐𝒖𝒕 𝒇𝒆𝒂𝒓 𝒂𝒏𝒅 𝒈𝒓𝒐𝒘𝒊𝒏𝒈 𝒖𝒏𝒇𝒐𝒍𝒅𝒔.

<aside> 💃🏼

Highlights:

In IcePop, we proposed double-sided masking to improve RL training stability for MoE models. However, we observe that a uniform constant-ratio constraint implicitly assumes that mismatch scales proportionally across tokens, which fails to reflect the heterogeneous training-inference discrepancy induced by different token probabilities.
We therefore introduce KPop, which replaces the uniform fixed-ratio constraint with binary KL divergence to better capture heterogeneous token-level mismatch across high- and low-probability regions.
Empirically, without requiring modifications to the RL infrastructure or additional mechanisms such as routing replay, KPop consistently stabilizes RL optimization and improves performance across mixed complex reasoning tasks and long-horizon coding agentic RL, allowing our Ring-2.6-1T to achieve over 76 on SWE-bench-Verified using pure RL training alone. </aside>

Something Overlooked in Training Stability

🔍 In our previous blog post, we showed that the training–inference mismatch problem becomes worse when training large-scale MoE models. To mitigate this instability, we introduced IcePop, which applies a token-level ratio mask to define an acceptance region for policy gradient updates. Specifically, when the probability ratio between the training and inference engines $\frac {\pi_{\textcolor{blue}{\text{train}}}(y_t)}{ \pi_{\textcolor{red}{\text{infer}}}(y_t)}$ falls outside a predefined range $[\alpha, \beta]$, it indicates a substantial discrepancy between the two distributions for that token. Such tokens are therefore excluded from gradient updates, restricting optimization to tokens with relatively small divergence. However, in our subsequent research and experiments, we observed an unexpected phenomenon: as training progresses, IcePop masks fewer tokens while the training–inference gap continues to widen (see the left and middle panels of Fig. 1). This growing mismatch is accompanied by a gradual decline in the training probability of newly generated tokens, ultimately making optimization increasingly unstable and less effective (see the right panel of Fig. 1).

This led us to ask: is a simple fixed-ratio constraint sufficient to guarantee stable training in reinforcement learning?

Figure 1. As the number of training steps increases, the training-inference log-probability difference gradually widens, but the ratio of tokens masked by the [α, β] constraint does not increase accordingly.


Moreover, our experiments show that RL training is highly sensitive to the parameter choice for traditional constant-ratio constraints. As Fig. 2 shows, when we shrink the acceptance region ($\alpha=0.4, \beta=5.0$) for selected tokens, the training-inference difference grows dramatically, which indicates that improper masking can itself cause training instability.

Figure 2. The training-inference log-probability difference varies with the choice of masking range [α, β].


Revisiting Uniform Constant Ratio Region

Recall IcePop’s formulation,

$$ \small{\begin{align*}\mathcal{J}_{\text{IcePop}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\textcolor{red}{\text{infer}}}(\cdot \mid x; \theta_{\text{old}})} \Bigg[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big[\mathcal{M}\Bigl(\frac{\pi_{\textcolor{blue}{\text{train}}}(y_{i,t} \mid x, y_{i,<t};\theta_{\text{old}})}{\pi_{\textcolor{red}{\text{infer}}}(y_{i,t} \mid x, y_{i,<t}; \theta_{\text{old}})}; \alpha, \beta\Bigr) \\ &\qquad \qquad \qquad \qquad \quad \qquad \cdot \min \left( r_{i,t}\widehat{A}_{i,t}, \text{clip} \left( r_{i,t}, 1 - \varepsilon, 1 + \varepsilon \right) \widehat{A}_{i,t} \right) \Big]\Bigg].\end{align*}} $$

IcePop applies a uniform constant-ratio constraint: with double-sided masking, a token contributes to the update only when its training-inference probability ratio lies within a fixed global range $[\alpha, \beta]$. Such a constant-ratio threshold treats all tokens the same way, regardless of their probability. In reality, however, the noise in the ratio is not uniform across tokens, because the ratio divergence depends on token probability.

Thus, IcePop tends to over-mask the low-probability tokens.
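To make the constant-ratio criterion concrete, here is a minimal NumPy sketch of the keep-or-mask decision. The helper name `icepop_keep_mask` and the values `alpha=0.5`, `beta=5.0` are illustrative assumptions, not taken from the IcePop codebase; only the rule "keep a token when its ratio lies in $[\alpha, \beta]$" comes from the text.

```python
import numpy as np

def icepop_keep_mask(logp_train, logp_infer, alpha, beta):
    """Return True for tokens whose train/infer probability ratio lies in [alpha, beta].

    Hypothetical helper for illustration; inputs are per-token log-probabilities.
    """
    log_ratio = np.asarray(logp_train, dtype=float) - np.asarray(logp_infer, dtype=float)
    ratio = np.exp(log_ratio)
    return (ratio >= alpha) & (ratio <= beta)

# Three tokens given as (p_train, p_infer) pairs; the ratios are 2, 2, and 0.1.
p_train = np.array([0.002, 0.8, 0.0001])
p_infer = np.array([0.001, 0.4, 0.001])

# alpha and beta are illustrative values, not a recommendation.
keep = icepop_keep_mask(np.log(p_train), np.log(p_infer), alpha=0.5, beta=5.0)
print(keep)  # first two tokens kept (ratio 2), third masked (ratio 0.1)
```

The first two tokens receive the same decision because they share the same ratio, even though their probabilities differ by a factor of 400. The criterion never sees the probability level itself.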

Figure 3. Illustration of masking region of IcePop. Points distributed closer to the upper-left or lower-right corners indicate larger divergence.


Why does token probability affect the noise?
For low-probability tokens, the inherent noise level is higher.
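One way to build intuition for this is a small simulation. The sketch below assumes a toy noise model that is not from the original post: the inference engine's logits equal the training logits plus i.i.d. Gaussian noise of standard deviation `sigma`. The function names, the four-token vocabulary, and all numeric values are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def log_ratio_std(logits, sigma=0.01, n_trials=2000, seed=0):
    """Std of log(p_noisy / p_clean) per token under assumed i.i.d. logit noise."""
    rng = np.random.default_rng(seed)
    base = log_softmax(logits)
    noise = rng.normal(0.0, sigma, size=(n_trials, logits.shape[0]))
    perturbed = log_softmax(logits[None, :] + noise)
    return (perturbed - base[None, :]).std(axis=0)

# Toy vocabulary: one dominant token and three increasingly unlikely ones.
logits = np.array([6.0, 2.0, 0.0, -2.0])
probs = np.exp(log_softmax(logits))
stds = log_ratio_std(logits)

for p, s in zip(probs, stds):
    print(f"p = {p:.5f}   std of log-ratio = {s:.6f}")
```

Under this assumed model, a first-order expansion gives a log-ratio variance of roughly $\sigma^2 (1 - 2p_i + \sum_j p_j^2)$ for token $i$, which vanishes as $p_i \to 1$ and stays at or above about $\sigma^2$ as $p_i \to 0$. The same logit perturbation therefore moves the ratio of a rare token far more than that of a dominant one.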

KPop: Bounding Mismatch with KL Divergence

KPop replaces IcePop's constant-ratio bound with a symmetric binary KL criterion. For each output token $y_t$, we compute the binary KL divergence between $\pi_{\textcolor{blue}{\text{train}}}(y_t)$ and $\pi_{\textcolor{red}{\text{infer}}}(y_t)$, which views the full vocabulary as a two-event partition: this token versus everything else.
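For reference, the binary KL divergence between two Bernoulli distributions with success probabilities $p$ and $q$ is $p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$. The sketch below applies it per token. The argument order (training probability first), the threshold name `delta` and its value, and the helper `kpop_keep_mask` are assumptions made for illustration; how KPop symmetrizes the criterion is not specified in this section, so only a single direction is shown.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), elementwise, with clipping for stability."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def kpop_keep_mask(logp_train, logp_infer, delta):
    """Hypothetical mask: keep tokens whose binary KL is at most delta."""
    kl = binary_kl(np.exp(logp_train), np.exp(logp_infer))
    return kl <= delta

# The first two tokens share the same ratio (2) at very different probability levels;
# the third token has no mismatch at all.
p_train = np.array([0.02, 0.8, 0.5])
p_infer = np.array([0.01, 0.4, 0.5])

kl = binary_kl(p_train, p_infer)
keep = kpop_keep_mask(np.log(p_train), np.log(p_infer), delta=0.05)  # delta is illustrative
print(kl)
print(keep)
```

In this example the two tokens with identical ratio receive very different KL values, so a KL threshold can separate them where a ratio bound cannot. The rare token passes, while the high-probability token with the same ratio is masked.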