Xin Zhao*, Yongkang Liu*, Kuan Xu*, Jia Guo*, Zihao Wang*$^{\dagger}$, Yan Sun, Xinyu Kong, Qianggang Cao, Liang Jiang, Zujie Wen, Zhiqiang Zhang, Jun Zhou
By AntGroup | Work In Progress | Published on Sep 19, 2025
<aside>
TL;DR
Recent work [1] has highlighted a mismatch between model training and rollout generation in current reinforcement learning (RL) training frameworks. We observed that this problem can be exacerbated in Mixture-of-Experts (MoE) architectures, and the discrepancy is further amplified when the model tends to generate long responses. Although prior work [1] proposed to mitigate it with an importance-sampling correction (a generic sketch of such a correction is shown right after this TL;DR), we argue that this technique may hit a performance plateau on benchmarks as training proceeds. Instead, we propose a simple yet effective method, IcePop, that delivers both stable training and superior downstream performance with RL, even on a strong SFT-ed MoE model.
</aside>
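For context, the importance-sampling correction mentioned above reweights the policy-gradient loss by the ratio between the training-backend and inference-engine probabilities of each rollout token. The snippet below is a minimal PyTorch sketch of this kind of token-level correction; the function name, the truncation constant `clip_c`, and the tensor layout are illustrative assumptions rather than the exact formulation in [1].

```python
# Minimal sketch of a token-level importance-sampling correction of the kind
# described in [1]. Names and the truncation constant are illustrative.
import torch

def is_corrected_pg_loss(logp_train, logp_infer, advantages, clip_c=2.0):
    """Policy-gradient loss reweighted by the train/infer probability ratio.

    logp_train : log-probs of rollout tokens under the training backend (requires grad)
    logp_infer : log-probs of the same tokens recorded by the inference engine
    advantages : per-token advantage / reward signal
    clip_c     : truncation constant bounding the importance ratio
    """
    # Ratio pi_train / pi_infer, treated as a constant weight (no gradient).
    ratio = torch.exp(logp_train.detach() - logp_infer)
    weight = torch.clamp(ratio, max=clip_c)
    # Standard REINFORCE-style surrogate, reweighted token by token.
    loss = -(weight * advantages * logp_train).mean()
    return loss
```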
The mismatch issue refers to the probability discrepancy between the training backend and the inference engine in current RL training frameworks, which inevitably turns on-policy training into off-policy training [1]. We observed that this implementation gap becomes more significant during RL training of MoE models. Unlike dense models, MoE architectures achieve higher parameter efficiency through a routing mechanism that selects only a handful of top-ranked "experts" for each token during training and inference. However, we found that this structural design exacerbates the mismatch issue during on-policy RL training, preventing MoE models from fully unlocking the potential of reinforcement learning.
Figure 1. Comparisons of $|\log p_{\rm infer} - \log p_{\rm train}|$ between MoE and dense models. We select three representatives: Ring-mini-2.0 (MoE), Qwen3-4B (Dense), and Qwen3-30B-A3B (MoE). The MoE models usually exhibit a greater discrepancy between the training and inference engines. We will include more comparisons across varied model sizes.
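Concretely, the quantity plotted in Figure 1 can be measured by recomputing the log-probabilities of the sampled rollout tokens with the training backend and comparing them against the values recorded by the inference engine. The sketch below (PyTorch; tensor names and shapes are assumptions) shows one way to aggregate this gap per batch and per token position.

```python
# Sketch of the discrepancy metric behind Figure 1: the absolute gap between
# per-token log-probs reported by the inference engine during rollout and
# those recomputed by the training backend for the same sampled tokens.
import torch

def logprob_gap(logp_infer: torch.Tensor, logp_train: torch.Tensor):
    """Both tensors hold log-probs of the same sampled tokens, shape [batch, seq_len]."""
    gap = (logp_infer - logp_train).abs()
    return {
        "mean_gap": gap.mean().item(),    # average |log p_infer - log p_train|
        "max_gap": gap.max().item(),      # worst-case token in the batch
        "per_position": gap.mean(dim=0),  # averaged over the batch, per token position
    }
```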
According to the policy gradient equation below, another mismatch arises between $\textcolor{red}{\theta_{\rm infer}}$ and $\textcolor{blue}{\theta_{\rm train}}$. In MoE models, the router function $\texttt{TopK}(\cdot)$ dynamically activates a subset of "experts" (i.e., networks) for each input token. Ideally, for a fixed input, the output of $\texttt{TopK}(\cdot)$ should be identical regardless of the engine on which the policy model is deployed. Nevertheless, when there is a substantial gap between $\textcolor{red}{\theta_{\rm infer}}$ and $\textcolor{blue}{\theta_{\rm train}}$, it inevitably leads to a greater divergence between $\textcolor{red}{\pi_{\rm infer}}$ and $\textcolor{blue}{\pi_{\rm train}}$.
$$ \small{\begin{equation}\theta \leftarrow \theta + \mu \cdot \mathbb{E}_{a\sim \textcolor{red}{\pi_{\rm infer}}(\textcolor{red}{\theta_{\rm infer}}),\ \textcolor{red}{\theta_{\rm infer}} \sim \texttt{TopK}_{\rm infer}(a)}\left[ R(a) \cdot \nabla_{\theta}\log \textcolor{blue}{\pi_{\rm train}}(a;\textcolor{blue}{\theta_{\rm train}});\ \textcolor{blue}{\theta_{\rm train}} \sim \texttt{TopK}_{\rm train}(a) \right]\end{equation}} $$
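To make the routing-induced divergence concrete, the toy example below (not the authors' code; the sizes, seed, and perturbation scale are arbitrary assumptions) slightly perturbs a router's weights, standing in for the gap between $\theta_{\rm train}$ and $\theta_{\rm infer}$, and counts how often the selected top-$k$ expert set changes. Whenever the set flips, the token is processed by different expert networks under the two engines, producing a discrete jump between $\pi_{\rm train}$ and $\pi_{\rm infer}$ with no analogue in a dense model.

```python
# Toy illustration of how TopK routing amplifies small parameter/numerical gaps.
import torch

torch.manual_seed(0)
hidden, num_experts, top_k = 16, 8, 2

# Router weights under the training backend, and a slightly perturbed copy
# standing in for the inference-engine view of the same parameters.
router_train = torch.randn(num_experts, hidden)
router_infer = router_train + 1e-3 * torch.randn_like(router_train)

flips, n_tokens = 0, 1000
for _ in range(n_tokens):
    x = torch.randn(hidden)  # one token's hidden state
    top_train = torch.topk(router_train @ x, top_k).indices.sort().values
    top_infer = torch.topk(router_infer @ x, top_k).indices.sort().values
    flips += int(not torch.equal(top_train, top_infer))

print(f"tokens routed to a different expert set: {flips}/{n_tokens}")
```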
For MoE models, we identified two major causes of the training-inference discrepancy problem:
<aside>
The probability discrepancy becomes magnified, especially for long sequences.
</aside>
At the very beginning of training, noticeable probability discrepancies are already evident at certain token positions. Due to the autoregressive nature of prediction, tokens at later positions are more susceptible to the accumulation of these discrepancies, resulting in a wider spread of variation.
***Figure 2**. At step 0, the probability discrepancy at different token positions.*
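The accumulation effect described above can be made concrete with a back-of-the-envelope calculation: if per-token log-probability gaps of the same sign pile up along an autoregressive rollout, the sequence-level probability ratio between the two engines grows roughly exponentially with length. The per-token gap value below is an illustrative assumption, not a measurement from our runs.

```python
# Back-of-the-envelope view of how per-token gaps compound over long rollouts.
import math

per_token_gap = 0.01  # assumed average |log p_infer - log p_train| per token

for seq_len in (128, 1024, 8192):
    # Worst case: same-sign gaps add up in log space along the sequence ...
    total_log_gap = per_token_gap * seq_len
    # ... which is a multiplicative factor on the sequence probability ratio.
    print(f"len={seq_len:5d}  summed log gap={total_log_gap:8.2f}  "
          f"sequence-level ratio up to ~{math.exp(total_log_gap):.3g}")
```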
As training progresses, the problem intensifies: the probability gap for the same token between the training and inference engines keeps increasing across token positions, even affecting earlier tokens in long sequences and destabilizing the optimization process.
***Figure 3**. After training, the log probabilities computed by the training and inference engines at different token positions.*
<aside>
The mismatch issue quickly causes crashes during on-policy MoE RL training.
</aside>
In on-policy RL training, we observed that MoE models tasked with generating long sequences are more vulnerable to this mismatch problem than dense models, often causing training to crash.