Jia Guo$^{1*\dagger}$, Yan Sun$^{12*\S}$, Zhenyu Huang$^1$, Zihao Wang$^{1\dagger}$, Zujie Wen$^{1\dagger}$, Zhiqiang Zhang$^1$, Jun Zhou$^1$, Stanley Kok$^2$
$^*$Equal Contribution $^{\dagger}$Project Leads $^{\S}$Work Done During Internship
$^1$Ant Group $^2$National University of Singapore
Work In Progress | Published on May 26, 2026
<aside>
Highlights:
Ring-2.6-1T achieves over 76 on SWE-bench-Verified using pure RL training.
</aside>

In our previous blog post, we showed that the training-inference mismatch problem worsens when training large-scale MoE models. To mitigate this instability, we introduced IcePop, which applies a token-level ratio mask to define an acceptance region for policy gradient updates. Specifically, when the probability ratio between the training and inference engines, $\frac {\pi_{\textcolor{blue}{\text{train}}}(y_t)}{ \pi_{\textcolor{red}{\text{infer}}}(y_t)}$, falls outside a predefined range $[\alpha, \beta]$, the two distributions disagree substantially on that token. Such tokens are excluded from gradient updates, which restricts optimization to tokens with relatively small divergence. In subsequent experiments, however, we observed an unexpected phenomenon: as training progresses, IcePop masks fewer tokens while the training-inference gap continues to widen (see the left and middle panels of Fig. 1). This growing mismatch is accompanied by a gradual decline in the training probability of newly generated tokens, which makes optimization increasingly unstable and less effective (see the right panel of Fig. 1).
This raises a question: is a simple fixed-ratio constraint sufficient to guarantee stable training in reinforcement learning?
![Figure 1. As the number of training steps increases, the training-inference log-probability difference gradually widens, but the ratio of tokens masked by the [α, β] constraint does not increase accordingly.](attachment:c9e5592e-34c5-44b8-b393-c8d8440dfd51:image.png)
Figure 1. As the number of training steps increases, the training-inference log-probability difference gradually widens, but the ratio of tokens masked by the [α, β] constraint does not increase accordingly.
Moreover, our experiments show that RL training is highly sensitive to the parameter choice for conventional constant-ratio constraints. As Fig. 2 shows, when we shrink the acceptance region ($\alpha=0.4$, $\beta=5.0$) for selected tokens, the training-inference difference grows dramatically, which indicates that improper masking can itself cause training instability.
![Figure 2. The training-inference log-probability difference varies with the choice of masking range [α, β].](attachment:a5cc8e71-111d-4747-9420-6b34815c671d:image.png)
Figure 2. The training-inference log-probability difference varies with the choice of masking range [α, β].
Recall IcePop's formulation:
$$ \small{\begin{align*}\mathcal{J}_{\text{IcePop}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\textcolor{red}{\text{infer}}}(\cdot \mid x; \theta_{\text{old}})} \Bigg[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big[\mathcal{M}\Bigl(\frac{\pi_{\textcolor{blue}{\text{train}}}(y_{i,t} \mid x, y_{i,<t};\theta_{\text{old}})}{\pi_{\textcolor{red}{\text{infer}}}(y_{i,t} \mid x, y_{i,<t}; \theta_{\text{old}})}; \alpha, \beta\Bigr) \\ &\qquad \qquad \qquad \qquad \quad \qquad \cdot \min \left( r_{i,t}\widehat{A}_{i,t}, \text{clip} \left( r_{i,t}, 1 - \varepsilon, 1 + \varepsilon \right) \widehat{A}_{i,t} \right) \Big]\Bigg]. \end{align*}} $$
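To make the objective more concrete, here is a minimal pure-Python sketch of the masked, clipped term for a single sampled sequence. It is an illustration under assumptions, not the actual IcePop implementation: the function names and toy inputs are hypothetical, $\mathcal{M}$ is treated as a 0/1 indicator for accepted tokens, and $r_{i,t}$ is assumed to be the usual ratio of the current training policy to the old training policy.

```python
import math

def icepop_mask(logp_train_old, logp_infer_old, alpha, beta):
    """Return 1.0 if the train/infer ratio lies in [alpha, beta], else 0.0.

    Assumption: M acts as a plain 0/1 indicator in this sketch.
    """
    k = math.exp(logp_train_old - logp_infer_old)
    return 1.0 if alpha <= k <= beta else 0.0

def clipped_term(r, adv, eps):
    """Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return min(r * adv, r_clipped * adv)

def icepop_sequence_objective(logp_train_new, logp_train_old, logp_infer_old,
                              advantages, alpha, beta, eps):
    """Length-normalized masked objective for one sampled sequence."""
    total = 0.0
    for lp_new, lp_old, lp_inf, adv in zip(
        logp_train_new, logp_train_old, logp_infer_old, advantages
    ):
        m = icepop_mask(lp_old, lp_inf, alpha, beta)
        r = math.exp(lp_new - lp_old)  # assumed definition of r_{i,t}
        total += m * clipped_term(r, adv, eps)
    return total / len(advantages)

# Toy two-token sequence (made-up probabilities).
logp_train_old = [math.log(0.60), math.log(0.006)]
logp_infer_old = [math.log(0.62), math.log(0.001)]
logp_train_new = list(logp_train_old)  # first update step, so r = 1
advantages = [1.0, 1.0]

value = icepop_sequence_objective(
    logp_train_new, logp_train_old, logp_infer_old,
    advantages, alpha=0.4, beta=5.0, eps=0.2,
)
print(value)
```

In this toy input the second token has a train/infer ratio of about 6, which falls outside $[0.4, 5.0]$, so it contributes nothing and only the first token's term survives.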
IcePop applies one constant constraint to the training-inference probability ratio: a fixed global range $[\alpha, \beta]$ with double-sided masking. A constant threshold treats every token the same way, regardless of its probability. In practice, however, the noise in the ratio is not uniform across tokens, because the ratio divergence depends on token probability: for a low-probability token, even a small absolute difference between the two probabilities produces a large ratio.
Thus, IcePop tends to over-mask low-probability tokens.

Figure 3. Illustration of IcePop's masking region. Points closer to the upper-left or lower-right corner indicate larger divergence.
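A small numeric check illustrates why a constant ratio bound is harsh on low-probability tokens. The probabilities below are made-up values, the helper name is hypothetical, and $\alpha=0.4$, $\beta=5.0$ is just one of the settings mentioned above.

```python
def in_acceptance_region(p_train, p_infer, alpha=0.4, beta=5.0):
    """Return (kept, ratio) for a constant-ratio mask on one token."""
    ratio = p_train / p_infer
    return alpha <= ratio <= beta, ratio

# Both tokens have the same absolute train/infer gap of 0.005.
kept_high, ratio_high = in_acceptance_region(p_train=0.905, p_infer=0.900)
kept_low, ratio_low = in_acceptance_region(p_train=0.006, p_infer=0.001)

print(f"high-probability token: ratio={ratio_high:.4f}, kept={kept_high}")
print(f"low-probability token:  ratio={ratio_low:.4f}, kept={kept_low}")
```

The same absolute gap leaves the high-probability token well inside the acceptance region but pushes the low-probability token's ratio to about 6, so the constant bound masks it.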
KPop replaces IcePop's constant ratio bound with a symmetric binary KL criterion. For each output token $y_t$, we compute the binary KL divergence between $\pi_{\textcolor{blue}{\text{train}}}(y_t)$ and $\pi_{\textcolor{red}{\text{infer}}}(y_t)$, which collapses the full vocabulary into a two-event partition: this token versus everything else.
$$ \small{\begin{align*}D_{\text{KL}}^B\!\left(\pi_{\textcolor{blue}{\text{train}}}(y_t) \| \pi_{\textcolor{red}{\text{infer}}}(y_t)\right) &= \pi_{\textcolor{blue}{\text{train}}}(y_t) \log\frac{\pi_{\textcolor{blue}{\text{train}}}(y_t)}{\pi_{\textcolor{red}{\text{infer}}}(y_t)} + \bigl(1-\pi_{\textcolor{blue}{\text{train}}}(y_t)\bigr)\log\frac{1-\pi_{\textcolor{blue}{\text{train}}}(y_t)}{1-\pi_{\textcolor{red}{\text{infer}}}(y_t)}\end{align*}} $$
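The binary KL above can be computed directly from the two scalar probabilities of the sampled token. The following is a minimal sketch with a hypothetical function name and made-up probabilities. It evaluates only the direction written in the formula and does not assume any particular threshold or symmetrization, since neither is specified at this point.

```python
import math

def binary_kl(p_train, p_infer):
    """Binary KL between Bernoulli(p_train) and Bernoulli(p_infer).

    Both probabilities must lie strictly between 0 and 1.
    """
    if not (0.0 < p_train < 1.0 and 0.0 < p_infer < 1.0):
        raise ValueError("probabilities must be in the open interval (0, 1)")
    return (
        p_train * math.log(p_train / p_infer)
        + (1.0 - p_train) * math.log((1.0 - p_train) / (1.0 - p_infer))
    )

# Low-probability token: ratio about 6, which a [0.4, 5.0] ratio bound rejects.
kl_low = binary_kl(p_train=0.006, p_infer=0.001)
# High-probability token: ratio about 0.67, which the same ratio bound accepts.
kl_high = binary_kl(p_train=0.60, p_infer=0.90)

print(f"low-probability token:  binary KL = {kl_low:.4f}")
print(f"high-probability token: binary KL = {kl_high:.4f}")
```

In this toy case the low-probability token has a binary KL of roughly 0.006, while the high-probability token has roughly 0.31. The two tokens therefore rank in the opposite order from the ratio test: the token a constant ratio bound would reject has by far the smaller divergence.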