Jia Guo$^{1*\dagger}$, Yan Sun$^{12*\S}$, Zhenyu Huang$^1$, Zihao Wang$^{1\dagger}$, Zujie Wen$^{1\dagger}$, Zhiqiang Zhang$^1$, Jun Zhou$^1$, Stanley Kok$^2$

$^*$Equal Contribution $^{\dagger}$Project Leads $^{\S}$Work Done During Internship

$^1$Ant Group $^2$National University of Singapore

Work In Progress | Published on May 26, 2026

𝑾𝒉𝒆𝒓𝒆 𝒕𝒓𝒖𝒔𝒕 𝒉𝒐𝒍𝒅𝒔, 𝒅𝒂𝒏𝒄𝒊𝒏𝒈 𝒘𝒊𝒕𝒉𝒐𝒖𝒕 𝒇𝒆𝒂𝒓 𝒂𝒏𝒅 𝒈𝒓𝒐𝒘𝒊𝒏𝒈 𝒖𝒏𝒇𝒐𝒍𝒅𝒔.

<aside> 💃🏼

Highlights:

Something Overlooked in Training Stability

In our previous blog post, we showed that the training–inference mismatch problem worsens when training large-scale MoE models. To mitigate this instability, we introduced IcePop, which applies a token-level ratio mask to define an acceptance region for policy gradient updates. Specifically, when the probability ratio between the training and inference engines, $\frac {\pi_{\textcolor{blue}{\text{train}}}(y_t)}{ \pi_{\textcolor{red}{\text{infer}}}(y_t)}$, falls outside a predefined range $[\alpha, \beta]$, the two distributions differ substantially for that token. Such tokens are excluded from gradient updates, restricting optimization to tokens with relatively small divergence. However, in our subsequent research and experiments, we observed an unexpected phenomenon: as training progresses, IcePop masks fewer tokens while the training–inference gap continues to widen (see the left and middle panels of Fig. 1). This growing mismatch is accompanied by a gradual decline in the training probability of newly generated tokens, which ultimately makes optimization increasingly unstable and less effective (see the right panel of Fig. 1).

This led us to ask: is this simple fixed-ratio constraint sufficient to guarantee stable training in reinforcement learning?

Figure 1. As training progresses, the training-inference log-probability difference gradually widens, but the ratio of tokens masked by the [α, β] constraint does not increase accordingly.


Moreover, our experiments show that RL training is highly sensitive to the choice of bounds in traditional constant-ratio constraints. As Fig. 2 shows, when we shrink the acceptance region for selected tokens ($\alpha=0.4, \beta=5.0$), the training-inference difference grows dramatically, which indicates that improper masking can itself cause training instability.

Figure 2. The training-inference log-probability difference under different settings of the masking range [α, β].


Revisiting Uniform Constant Ratio Region

Recall IcePop's formulation:

$$ \small{\begin{align*}\mathcal{J}_{\text{IcePop}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D}, \{y_i\}_{i=1}^G \sim \pi_{\textcolor{red}{\text{infer}}}(\cdot \mid x; \theta_{\text{old}})} \Bigg[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big[\mathcal{M}\Bigl(\frac{\pi_{\textcolor{blue}{\text{train}}}(y_{i,t} \mid x, y_{i,<t};\theta_{\text{old}})}{\pi_{\textcolor{red}{\text{infer}}}(y_{i,t} \mid x, y_{i,<t}; \theta_{\text{old}})}; \alpha, \beta\Bigr) \\ &\qquad \qquad \qquad \qquad \quad \qquad \cdot \min \left( r_{i,t}\widehat{A}_{i,t}, \text{clip} \left( r_{i,t}, 1 - \varepsilon, 1 + \varepsilon \right) \widehat{A}_{i,t} \right) \Big]\Bigg]\end{align*}} $$
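The mask $\mathcal{M}(\cdot; \alpha, \beta)$ in the objective above decides, token by token, whether a gradient contribution is kept. A minimal plain-Python sketch of that decision follows. The function name `icepop_keep_mask` and the example probabilities are hypothetical, and the sketch only returns a keep/drop flag; how kept tokens are weighted inside $\mathcal{M}$ is not spelled out in this section and is not modeled here.

```python
from typing import List


def icepop_keep_mask(p_train: List[float], p_infer: List[float],
                     alpha: float, beta: float) -> List[bool]:
    """Flag each token as kept (True) or masked (False).

    A token is kept when the ratio p_train / p_infer lies inside the
    fixed global range [alpha, beta]; otherwise it is masked out.
    """
    keep = []
    for pt, pi in zip(p_train, p_infer):
        ratio = pt / pi  # training-engine prob over inference-engine prob
        keep.append(alpha <= ratio <= beta)
    return keep


# Hypothetical per-token probabilities from the two engines.
p_train = [0.90, 0.60, 0.002]
p_infer = [0.88, 0.10, 0.010]

# alpha and beta reuse the values quoted in the text, purely as an illustration.
print(icepop_keep_mask(p_train, p_infer, alpha=0.4, beta=5.0))
```

Here the ratios are roughly 1.02, 6.0 and 0.2, so only the first token stays inside the range: the second exceeds $\beta$ and the third falls below $\alpha$, which is the double-sided behaviour of the mask.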

IcePop places a uniform constant-ratio constraint on the policy probability ratio: a fixed global range $[\alpha, \beta]$ with double-sided masking. A constant-ratio threshold treats all tokens the same way, regardless of their probability. In practice, however, the noise in the ratio is not uniform across tokens, because the ratio divergence depends on the token probability.

Thus, IcePop tends to over-mask the low-probability tokens.
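One way to see why a fixed ratio range bears hardest on low-probability tokens is a small numeric check. The sketch below assumes, purely for illustration, that the two engines disagree by the same absolute amount `delta` on every token. This additive model and the helper name `ratio_after_shift` are our assumptions, not something measured in the experiments above.

```python
def ratio_after_shift(p_infer: float, delta: float) -> float:
    """Ratio p_train / p_infer when p_train = p_infer + delta."""
    return (p_infer + delta) / p_infer


delta = 0.01  # hypothetical absolute gap between the two engines

# The same absolute gap produces very different ratios
# depending on how probable the token is.
for p_infer in (0.9, 0.1, 0.005):
    ratio = ratio_after_shift(p_infer, delta)
    print(f"p_infer={p_infer:<6} ratio={ratio:.3f}")
```

Under this assumption the ratio stays close to 1 for the high-probability token and reaches about 3 for the token with probability 0.005. A single global range $[\alpha, \beta]$ therefore rejects rare tokens far more readily than confident ones for the same absolute disagreement.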

Figure 3. Illustration of the masking region of IcePop. Points closer to the upper-left or lower-right corner indicate larger divergence.


KPop: Bounding Mismatch with KL Divergence

KPop replaces IcePop's constant-ratio bound with a symmetric binary KL criterion. For each output token $y_t$, we compute the binary KL divergence between $\pi_{\textcolor{blue}{\text{train}}}(y_t)$ and $\pi_{\textcolor{red}{\text{infer}}}(y_t)$, which treats the full vocabulary as a two-event partition: this token versus everything else.

$$ \small{\begin{align*}D_{\text{KL}}^B\!\left(\pi_{\textcolor{blue}{\text{train}}}(y_t) \| \pi_{\textcolor{red}{\text{infer}}}(y_t)\right) &= \pi_{\textcolor{blue}{\text{train}}}(y_t) \log\frac{\pi_{\textcolor{blue}{\text{train}}}(y_t)}{\pi_{\textcolor{red}{\text{infer}}}(y_t)} + \bigl(1-\pi_{\textcolor{blue}{\text{train}}}(y_t)\bigr)\log\frac{1-\pi_{\textcolor{blue}{\text{train}}}(y_t)}{1-\pi_{\textcolor{red}{\text{infer}}}(y_t)}\end{align*}} $$
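The divergence above needs only the two scalar probabilities of the sampled token, so it can be sketched in a few lines of plain Python. The function name `binary_kl`, the clamping constant `eps`, and the example probabilities are our own illustrative choices. How KPop turns this value into a mask (for example its threshold, or how the criterion is made symmetric) is not covered in this excerpt and is not shown.

```python
import math


def binary_kl(p_train: float, p_infer: float, eps: float = 1e-12) -> float:
    """Binary KL divergence D(p_train || p_infer).

    The vocabulary is collapsed to two events: the sampled token
    (probability p) and everything else (probability 1 - p).
    """
    # Clamp into (0, 1) so both logarithms stay finite; eps is an assumption.
    p = min(max(p_train, eps), 1.0 - eps)
    q = min(max(p_infer, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Hypothetical (p_train, p_infer) pairs for three tokens.
for p_train, p_infer in [(0.90, 0.90), (0.91, 0.90), (0.015, 0.005)]:
    print(f"p_train={p_train} p_infer={p_infer} kl={binary_kl(p_train, p_infer):.6f}")
```

The value is zero when the two engines agree and grows as they diverge. Unlike the raw ratio, it weights the disagreement by the token's own probability mass and also accounts for the complementary event.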