Authors: Chenlu Ye$^\dagger$, Xuanchang Zhang, Yifan Hao*, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Tong Zhang$^1$$^\dagger$
*Co-First Authors. $^\dagger$Project Lead. Date: Feb 12, 2026



Figure 1. (Left) Comparison of ALP with GRPO, MIS, and TIS. (Right) Test Average@$32$ scores of ALP using Qwen2.5-1.5B-Base on the single-turn math task. ALP outperforms the baselines, which either plateau early or collapse. Multi-turn agentic task results on the 7B model are deferred to Sec. 3.

Figure 2. Left (Bypass): directly use $\pi_\theta^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio; Right (ALP): perturb $\pi_\theta^{\mathrm{train}}$ with a small random variable $\delta$ and use $\pi_{\theta,\delta}^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio. Both are trained from the base model to the 115th iteration, and the importance ratio is computed at the sequence level. Bypass shows a pronounced envelope blow-up in the low-probability tail (P2–P98), indicating heavy-tailed updates. ALP dramatically tightens the envelope, suppressing the tail outliers that destabilize off-policy learning.
<aside> 🎆
Modern LLM RL (RLHF / GRPO-style training) is quietly off‑policy far more than we admit—because of policy staleness, fully asynchronous training, and training–inference mismatches introduced by fast inference engines (batching, quantization, kernel differences).
These mismatches create heavy‑tailed importance ratios, KL spikes, and eventually brittle optimization or collapse.
Adaptive Layerwise Perturbation (ALP) mitigates this instability with one unified recipe:
One unified importance ratio. Use a single ratio in the loss where only the updated training policy is perturbed, and the rollout policy stays unchanged:
$$ \rho^{\mathrm{ALP}}=\frac{\pi^{\mathrm{train}}_{\theta,\delta}}{\pi^{\mathrm{infer}}_{\theta_{\mathrm{old}}}} $$
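Concretely, a sequence-level importance ratio is the exponentiated sum of per-token log-ratios. A minimal sketch with illustrative numbers (the variable names and values are our own, not from the paper):

```python
import numpy as np

# Hypothetical per-token log-probs for one 4-token sequence (illustrative values).
logp_train_perturbed = np.array([-1.2, -0.4, -2.1, -0.9])  # log pi^train_{theta,delta}
logp_infer_old       = np.array([-1.0, -0.5, -2.3, -0.9])  # log pi^infer_{theta_old}

# Sequence-level ratio: sum the per-token log-ratios, then exponentiate.
log_ratio = (logp_train_perturbed - logp_infer_old).sum()
rho_alp = np.exp(log_ratio)
```

Working in log-space and exponentiating once at the end avoids underflow when sequences are long and per-token probabilities are small.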
Layerwise hidden-state perturbation (training-only). During the training forward pass, inject small Gaussian perturbations into the transformer hidden states with learnable scales. Although this perturbation is added to the hidden states, it is implicitly equivalent to perturbing the parameters.
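As a rough sketch of the training-side injection (the function name, the scalar-per-layer scale, and the log-parameterization are our assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_hidden(h, log_scale):
    """Add Gaussian noise to one layer's hidden states (training forward pass only).

    h: hidden states, shape (batch, seq, dim).
    log_scale: a learnable per-layer scalar; exponentiating keeps the std positive.
    """
    sigma = np.exp(log_scale)               # noise std for this layer
    noise = rng.standard_normal(h.shape)    # i.i.d. Gaussian perturbation
    return h + sigma * noise                # additive hidden-state perturbation

h = np.zeros((2, 4, 8))
h_pert = perturb_hidden(h, log_scale=-3.0)  # small sigma = e^{-3} ~ 0.05
```

In a real transformer this would wrap each block's output during training only; the inference/rollout pass stays untouched, matching the unperturbed denominator in $\rho^{\mathrm{ALP}}$.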
Why it helps (two intuitions).
Experiments: enhancing stability and performance.
Off-policyness in RL reasoning tasks (one rollout with multiple gradient updates, training–inference mismatch due to accelerated rollout engines, fully asynchronous settings) substantially impedes training stability by producing heavy-tailed importance ratios, KL spikes, and brittle optimization.
We use two complementary lenses to diagnose off-policy instability:
(1) Controlled same-checkpoint intervention that isolates the effect of perturbation under identical rollouts, and
(2) Full training runs from base models that show how the tail mismatch evolves after many training steps in practice.
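The envelope diagnostic used in both lenses can be sketched as percentile bands of per-sequence log importance ratios; the P2–P98 convention follows the figure captions, while the function name and the toy distributions below are our own:

```python
import numpy as np

def log_ratio_envelope(logp_train, logp_infer, lo=2, hi=98):
    """Percentile envelope of per-sequence log importance ratios.

    logp_train, logp_infer: sequence log-probs under the train and infer policies.
    Returns the (P_lo, P_hi) band, e.g. the P2-P98 envelope.
    """
    log_ratio = np.asarray(logp_train) - np.asarray(logp_infer)
    return np.percentile(log_ratio, [lo, hi])

# Toy check: a heavy-tailed mismatch yields a much wider envelope.
rng = np.random.default_rng(1)
tight = log_ratio_envelope(rng.normal(0, 0.1, 5000), np.zeros(5000))
heavy = log_ratio_envelope(rng.standard_t(df=2, size=5000), np.zeros(5000))
```

Tracking this band over training steps is what distinguishes a well-behaved ratio distribution from the tail blow-up seen in the Bypass runs.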


Figure 3 (Controlled intervention). Multi-turn; same checkpoint, same rollouts, 16 off-policy updates. Without perturbation, the ratio envelope flares up in the tail (left); with ALP it tightens dramatically (right).
In this strictly controlled setting, the log ratio envelope $\log\frac{\pi_{\text{train}}}{\pi_{\text{infer}}}$ flares up in the low-probability tail without perturbation, whereas ALP tightens the envelope substantially.
This is the cleanest evidence that the instability is fundamentally a tail-mismatch problem, and that perturbation tightens the envelope.
We then move to the realistic regime: two algorithms trained separately for 115 iterations.
Figure 2 shows the same qualitative story as Figure 3, but amplified by long-horizon training effects: Bypass exhibits a pronounced tail blow-up, while ALP keeps the tail envelope much tighter.