Authors: Chenlu Ye$^\dagger$, Xuanchang Zhang, Yifan Hao*, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Tong Zhang$^\dagger$

*Co-first authors. $^\dagger$Project leads. Date: Feb 12, 2026

[Figure: ALP objective equation]

[Figure: ALP algorithm]


Figure 1. (Left) Comparison of ALP with GRPO, MIS, and TIS. (Right) Test Average@32 scores of ALP using Qwen2.5-1.5B-Base on the single-turn math task. ALP outperforms the baselines, which either plateau early or collapse. Multi-turn agentic task results on the 7B model are deferred to Sec. 3.


Figure 2. Left (Bypass): directly use $\pi_\theta^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio; Right (ALP): perturb $\pi_\theta^{\mathrm{train}}$ with a small variable $\delta$ and use $\pi_{\theta,\delta}^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}$ as the importance ratio. Both are trained from the base model to the 115th iteration, and the importance ratio is computed at the sequence level. Bypass shows a pronounced envelope blow-up in the low-probability tail (P2–P98), indicating heavy-tailed updates. ALP dramatically tightens the envelope, suppressing the tail outliers that destabilize off-policy learning.
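The sequence-level ratio contrasted in Figure 2 can be sketched as follows. This is a minimal numpy sketch: the scalar `delta` is only an illustrative stand-in for ALP's perturbation (the actual perturbation is adaptive and layerwise, and its exact form is not specified in this excerpt).

```python
import numpy as np

def seq_log_importance_ratio(logp_train, logp_infer, delta=0.0):
    """Sequence-level log importance ratio log(pi_train / pi_infer).

    logp_train, logp_infer: per-token log-probs of the SAME sampled
    tokens under the training policy and the inference engine.
    delta: illustrative stand-in for ALP's small perturbation of the
    training policy; 0.0 recovers the Bypass ratio.
    """
    lt = np.asarray(logp_train, dtype=float) + delta
    li = np.asarray(logp_infer, dtype=float)
    # Summing per-token log-probs gives the sequence log-prob, so the
    # difference of sums is the sequence-level log ratio.
    return lt.sum() - li.sum()

# Toy check: identical per-token log-probs give a ratio of exactly 1.
r = np.exp(seq_log_importance_ratio([-1.2, -0.3], [-1.2, -0.3]))
```

Because the ratio is a sum over token log-prob gaps, even small per-token train–inference gaps compound over long sequences, which is why the sequence-level tail in Figure 2 blows up so sharply for Bypass.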

<aside> 🎆

TL;DR

Modern LLM RL (RLHF / GRPO-style training) is quietly off-policy far more often than we admit, because of policy staleness, fully asynchronous training, and training–inference mismatches introduced by fast inference engines (batching, quantization, kernel differences).

These mismatches create heavy‑tailed importance ratios, KL spikes, and eventually brittle optimization or collapse.

Adaptive Layerwise Perturbation (ALP) mitigates this instability with one unified recipe.

1. Challenges of Off-Policy Instability for RL in LLMs

Off-policyness in RL for reasoning tasks arises from several sources: one rollout reused for multiple updates, training–inference mismatch due to accelerated rollout engines, and fully asynchronous settings. It substantially impedes training stability by producing heavy-tailed importance ratios, KL spikes, and brittle optimization.

We use two complementary lenses to diagnose off-policy instability:

(1) Controlled same-checkpoint intervention that isolates the effect of perturbation under identical rollouts, and

(2) Full training runs from base models that show how the tail mismatch evolves after many training steps in practice.
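Both lenses hinge on the same quantity: the per-token importance ratio between the training policy and the inference engine. As a toy illustration (entirely our own construction, not the paper's setup), even a 2% temperature drift in the inference engine skews the ratios, and the most extreme ratio always lands on the rarest sampled token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary distribution for the "training" policy.
logits = rng.normal(size=1000)
p_train = np.exp(logits) / np.exp(logits).sum()

# "Inference" policy: same logits at a slightly different temperature,
# mimicking kernel/quantization drift in a fast rollout engine.
logits_inf = logits / 1.02
p_infer = np.exp(logits_inf) / np.exp(logits_inf).sum()

# Rollouts sample from the inference policy, so ratios are evaluated
# on inference-sampled tokens.
tokens = rng.choice(1000, size=100_000, p=p_infer)
ratios = p_train[tokens] / p_infer[tokens]

# The smallest ratio belongs to the lowest-probability token sampled:
# the tail of the token distribution drives the ratio extremes.
worst = tokens[int(np.argmin(ratios))]
```

Under temperature scaling the log ratio is monotone in the token logit, so the ratio extremes coincide exactly with the probability extremes; this is the mechanism behind the low-probability-tail flare-ups shown in the figures.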

1.2 Controlled intervention: tail mismatch is the causal bottleneck


Figure 3 (Controlled intervention). Multi-turn task; same checkpoint, same rollouts, 16 off-policy updates. Without perturbation the ratio envelope flares up in the tail (left); with ALP it tightens dramatically (right).

In this strictly controlled setting, the log-ratio envelope $\log\left(\pi_\theta^{\mathrm{train}} / \pi_{\theta_{\mathrm{old}}}^{\mathrm{infer}}\right)$ flares up in the low-probability tail without perturbation, whereas ALP tightens the envelope substantially.
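The envelope diagnostic behind these plots can be sketched as follows. This is a minimal numpy sketch; the quantile bucketing and the P2/P98 percentiles are our assumptions, echoing the P2–P98 bands mentioned for Figure 2.

```python
import numpy as np

def ratio_envelope(log_ratio, logp_infer, n_buckets=20, lo=2, hi=98):
    """Per-bucket percentile envelope of log importance ratios.

    Samples (tokens or sequences) are bucketed by their inference
    log-prob; within each bucket we take the (lo, hi) percentiles of
    the log ratio.  An envelope that flares in the low-probability
    buckets is the tail-mismatch signature discussed above.
    """
    log_ratio = np.asarray(log_ratio, dtype=float)
    logp_infer = np.asarray(logp_infer, dtype=float)
    # Quantile-based edges give equally populated buckets.
    edges = np.quantile(logp_infer, np.linspace(0.0, 1.0, n_buckets + 1))
    bands = []
    for a, b in zip(edges[:-1], edges[1:]):
        mask = (logp_infer >= a) & (logp_infer <= b)
        if mask.any():
            p_lo, p_hi = np.percentile(log_ratio[mask], [lo, hi])
            bands.append((a, b, p_lo, p_hi))
    return bands  # ordered from lowest to highest inference log-prob
```

Plotting the `(p_lo, p_hi)` band of each bucket against its log-prob range reproduces the envelope view: a tight band across buckets corresponds to the ALP panels, a band that widens toward the leftmost (lowest-probability) buckets to the Bypass panel.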

This is the cleanest evidence that the instability is fundamentally a tail-mismatch problem, and that perturbation directly tightens the envelope.

1.3 Full training: tail blow-up accumulates; ALP keeps it bounded

We then move to the realistic regime: the two algorithms are trained separately from the base model for 115 iterations.

Figure 2 shows the same qualitative story as Figure 3, but amplified by long-horizon training effects: Bypass exhibits a pronounced tail blow-up, while ALP keeps the tail envelope much tighter.