Authors: Chenlu Ye$^\dagger$, Xuanchang Zhang, Yifan Hao*, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Tong Zhang$^1$$^\dagger$
*Co-First Authors. $^\dagger$Project Lead. Date: Feb 12, 2026



Figure 1. (Left) Comparison of ALP with GRPO, MIS, and TIS. (Right) Test Average@$32$ scores of ALP using Qwen2.5-1.5B-Base on the single-turn math task. ALP outperforms the baselines, which either plateau early or collapse. Multi-turn agentic task results on the 7B model are deferred to Sec. 3.

Figure 2. Left (Bypass): directly use $\pi_\theta^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio; Right (ALP): perturb $\pi_\theta^{\mathrm{train}}$ with a small random variable $\delta$ and use $\pi_{\theta,\delta}^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio. Both are trained from the base model to the 115th iteration, and the importance ratio is computed at the sequence level. Bypass shows a pronounced envelope blow-up in the low-probability tail (P2–P98), indicating heavy-tailed updates. ALP dramatically tightens the envelope, suppressing the tail outliers that destabilize off-policy learning.
<aside> 🎆
Modern LLM RL (RLHF / GRPO-style training) is quietly off‑policy far more than we admit—because of policy staleness, fully asynchronous training, and training–inference mismatches introduced by fast inference engines (batching, quantization, kernel differences).
These mismatches create heavy‑tailed importance ratios, KL spikes, and eventually brittle optimization or collapse.
Adaptive Layerwise Perturbation (ALP) mitigates this instability with one unified recipe:
One unified importance ratio. Use a single ratio in the loss where only the updated training policy is perturbed, and the rollout policy stays unchanged:
$$ \rho^{\mathrm{ALP}}=\frac{\pi^{\mathrm{train}}_{\theta,\delta}}{\pi^{\mathrm{infer}}_{\theta_{\mathrm{old}}}} $$
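Concretely, a sequence-level importance ratio is the exponentiated sum of per-token log-ratios. A minimal sketch with illustrative numbers (the variable names and values are our own, not from the paper):

```python
import numpy as np

# Hypothetical per-token log-probs for one 4-token sequence (illustrative values).
logp_train_perturbed = np.array([-1.2, -0.4, -2.1, -0.9])  # log pi^train_{theta,delta}
logp_infer_old       = np.array([-1.0, -0.5, -2.3, -0.9])  # log pi^infer_{theta_old}

# Sequence-level ratio: sum the per-token log-ratios, then exponentiate.
log_ratio = (logp_train_perturbed - logp_infer_old).sum()
rho_alp = np.exp(log_ratio)
```

Working in log-space and exponentiating once at the end avoids underflow when sequences are long and per-token probabilities are small.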
Layerwise hidden-state perturbation (training-only). During the training forward pass, inject small Gaussian perturbations into the transformer hidden states with learnable scales. Although this perturbation is added to the hidden states, it is implicitly equivalent to perturbing the parameters.
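As a rough sketch of the training-side injection (the function name, the scalar-per-layer scale, and the log-parameterization are our assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_hidden(h, log_scale):
    """Add Gaussian noise to one layer's hidden states (training forward pass only).

    h: hidden states, shape (batch, seq, dim).
    log_scale: a learnable per-layer scalar; exponentiating keeps the std positive.
    """
    sigma = np.exp(log_scale)               # noise std for this layer
    noise = rng.standard_normal(h.shape)    # i.i.d. Gaussian perturbation
    return h + sigma * noise                # additive hidden-state perturbation

h = np.zeros((2, 4, 8))
h_pert = perturb_hidden(h, log_scale=-3.0)  # small sigma = e^{-3} ~ 0.05
```

In a real transformer this would wrap each block's output during training only; the inference/rollout pass stays untouched, matching the unperturbed denominator in $\rho^{\mathrm{ALP}}$.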
Why it helps (two intuitions).
Experiments: enhancing stability and performance.
Off-policyness in RL reasoning tasks (one rollout with multiple gradient updates, training–inference mismatch due to accelerated rollout engines, fully asynchronous settings) substantially impedes training stability by producing heavy-tailed importance ratios, KL spikes, and brittle optimization.
We use two complementary lenses to diagnose off-policy instability:
(1) Controlled same-checkpoint intervention that isolates the effect of perturbation under identical rollouts, and
(2) Full training runs from base models that show how the tail mismatch evolves after many training steps in practice.
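The envelope diagnostic used in both lenses can be sketched as percentile bands of per-sequence log importance ratios; the P2–P98 convention follows the figure captions, while the function name and the toy distributions below are our own:

```python
import numpy as np

def log_ratio_envelope(logp_train, logp_infer, lo=2, hi=98):
    """Percentile envelope of per-sequence log importance ratios.

    logp_train, logp_infer: sequence log-probs under the train and infer policies.
    Returns the (P_lo, P_hi) band, e.g. the P2-P98 envelope.
    """
    log_ratio = np.asarray(logp_train) - np.asarray(logp_infer)
    return np.percentile(log_ratio, [lo, hi])

# Toy check: a heavy-tailed mismatch yields a much wider envelope.
rng = np.random.default_rng(1)
tight = log_ratio_envelope(rng.normal(0, 0.1, 5000), np.zeros(5000))
heavy = log_ratio_envelope(rng.standard_t(df=2, size=5000), np.zeros(5000))
```

Tracking this band over training steps is what distinguishes a well-behaved ratio distribution from the tail blow-up seen in the Bypass runs.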


Figure 3 (Controlled intervention). Multi-turn; same checkpoint, same rollouts, 16 off-policy updates. Without perturbation, the ratio envelope flares up in the tail (left); with ALP it tightens dramatically (right).
In this strictly controlled setting, the log ratio envelope $\log\frac{\pi_{\text{train}}}{\pi_{\text{infer}}}$ flares up in the low-probability tail without perturbation, whereas ALP tightens the envelope substantially.
This is the cleanest evidence that the instability is fundamentally a tail-mismatch problem, and that perturbation tightens the envelope.
We then move to the realistic regime: two algorithms trained separately for 115 iterations.
Figure 2 shows the same qualitative story as Figure 3, but amplified by long-horizon training effects: Bypass exhibits a pronounced tail blow-up, while ALP keeps the tail envelope much tighter.