Daily Paper Reading
Topic: sampling, MCMC, test-time scaling, and reasoning
TL;DR: This post reviews three papers that target the same sequence-level power distribution to improve LLM reasoning at test time, with each step in the progression making the method more efficient.
Understanding what post-training is really doing has been one of the central debates of the past year. Increasingly, the view is that reinforcement learning is not injecting entirely new capabilities into the model. Instead, it behaves more like mode-seeking: it concentrates probability mass around high-quality reasoning traces that already exist in the base model by sharpening the distribution [1,2].
EvoLM: In Search of Lost Language Model Training Dynamics
Z. Qi, F. Nie, A. Alahi, J. Zou, H. Lakkaraju, Y. Du, E. Xing, S. Kakade, H. Zhang, (2025)
DOI
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, G. Huang, (2025)
DOI
If that ability is already present in the base model, a natural question follows: can we recover it at inference time instead of relying on expensive and unstable post-training? Reasoning with Sampling [3] (ICLR 2026 Oral) points in exactly that direction. If good reasoning traces already exist in the base model, then better test-time sampling may be enough to realize some of that benefit without any post-training at all.
Reasoning with Sampling: Your Base Model is Smarter Than You Think
A. Karan, Y. Du, (2025)
DOI
In this blog post, we go through Karan and Du [3]
and its two follow-up papers, Scalable Power Sampling [4] and Power-SMC [5]: each paper targets the same underlying sequence-level power distribution, but each one changes how the correction is computed and where the approximation enters.
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
X. Ji, R. Tutunov, M. Zimmer, H. Bou Ammar, (2026)
DOI
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
S. Azizi, E. Baghaei Potraghloo, M. Ahmadi, S. Kundu, M. Pedram, (2026)
DOI
Notation Table
Throughout this post, we use a single notation system, even when the individual papers use slightly different symbols. Bold symbols denote token sequences; unbolded symbols such as \(y_t\), \(u\), and \(v\) denote individual tokens.
| Symbol | Meaning |
|---|---|
| \(\mathbf{x}\) | prompt / input token sequence |
| \(\mathbf{y} = (y_1, \dots, y_T)\) | full generated output sequence, including EOS |
| \(y_t\) | token at decoding position \(t\) |
| \(\mathbf{y}_{\lt t} = \mathbf{y}_{1:t-1}\), \(\mathbf{y}_{1:t}\), \(\mathbf{y}_{>t}\) | prefix before step \(t\), prefix through step \(t\), and suffix after step \(t\) |
| \(u, v \in \mathcal{V}\) | candidate next-token values in the vocabulary |
| \(p_\theta(\mathbf{y} \mid \mathbf{x})\) | base autoregressive LM distribution |
| \(\alpha > 1\) | power or sharpening exponent |
| \(\pi_\alpha(\mathbf{y} \mid \mathbf{x})\) | normalized sequence-level power target |
| \(Z_\alpha(\mathbf{x})\) | normalization constant of the power target |
| \(q(\mathbf{y}' \mid \mathbf{y}, \mathbf{x})\) | MH proposal over full sequences |
| \(A(\mathbf{y}', \mathbf{y})\) | MH acceptance probability |
| \(p_\alpha^{(\mathrm{temp})}(y_t \mid \mathbf{x}, \mathbf{y}_{\lt t})\) | token-level low-temperature distribution with \(\tau = 1/\alpha\) |
| \(p_\alpha^{(\mathrm{pow})}(y_t \mid \mathbf{x}, \mathbf{y}_{\lt t})\) | exact next-token conditional induced by the global power target |
| \(\zeta_t(v)\), \(\hat{\zeta}_t(v)\) | exact and Monte Carlo-estimated future-quality factor in Scalable Power Sampling |
| \(K_t\), \(M_t\), \(H_t\) | candidate-set size, rollout count per candidate, and rollout horizon in Scalable Power Sampling |
| \(\hat p_{\alpha,\mathrm{JK}}^{(\mathrm{pow})}\) | jackknife bias-reduced estimator of the power conditional |
| \(\gamma_t(\mathbf{y}_{1:t} \mid \mathbf{x})\) | unnormalized prefix target in Power-SMC |
| \(\pi_t(\mathbf{y}_{1:t} \mid \mathbf{x})\), \(Z_t(\mathbf{x})\) | normalized prefix target and its normalizer in Power-SMC |
| \(q_t(v \mid \mathbf{x}, \mathbf{y}_{\lt t})\) | prefix-only token proposal in Power-SMC |
| \(\omega_t(\mathbf{y}_{1:t})\) | incremental importance weight in Power-SMC |
| \(\tilde W_t^{(i)}, W_t^{(i)}\) | unnormalized and normalized particle weights |
| \(N\), \(\mathrm{ESS}_t\), \(\kappa\) | particle count, effective sample size, and resampling threshold |
| \(B_{\mathrm{blk}}\), \(T_{\max}\), \(T_{\mathrm{ramp}}\) | block size, maximum decode length, and \(\alpha\)-ramp length in the follow-up papers |
| \(C\) | toy token-draw budget in the comparison demo; one unit means one next-token sample from the base model |
From Temperature to Power Sampling
A natural way to sharpen a distribution \(p\) is to sample from its power distribution \(p^\alpha\) [3]. The interactive demo below shows how the distribution becomes sharper as we increase the exponent \(\alpha\).
For visualization only, collapse the sequence space to a conceptual 1D state coordinate \(z\). The blue curve is a fixed continuous multi-modal base density \(p(z \mid \mathbf{x})\); the orange curve is the normalized powered proxy \(\tilde{\pi}_\alpha(z \mid \mathbf{x}) \propto p(z \mid \mathbf{x})^\alpha\). Increasing \(\alpha\) sharpens the existing modes instead of creating new ones.
The blue base curve never moves. Only the orange powered curve changes. The y-axis scale is adjusted continuously with \(\alpha\), so there is no visual jump at any particular threshold.
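The sharpening behavior in this demo is easy to reproduce. Below is a minimal sketch, assuming a hypothetical two-Gaussian base density on a 1D grid (purely illustrative; the papers' actual target lives over token sequences): powering leaves the mode locations fixed while pushing relative mass toward the dominant mode.

```python
import math

# Illustrative 1D base density: a hypothetical two-Gaussian mixture on a grid.
# (Visualization-style stand-in; the papers' target is over token sequences.)
xs = [i / 100 for i in range(-300, 301)]

def base(z):
    return 0.7 * math.exp(-((z + 1.0) ** 2) / 0.1) + 0.3 * math.exp(-((z - 1.5) ** 2) / 0.1)

def powered(alpha):
    # Normalized powered proxy on the grid: proportional to base(z) ** alpha.
    w = [base(z) ** alpha for z in xs]
    s = sum(w)
    return [v / s for v in w]

def mass_near_main_mode(p):
    # Fraction of probability within |z + 1| < 0.5, i.e. around the taller mode.
    return sum(v for z, v in zip(xs, p) if abs(z + 1.0) < 0.5)

p1, p8 = powered(1), powered(8)
mode1, mode8 = xs[p1.index(max(p1))], xs[p8.index(max(p8))]
m1, m8 = mass_near_main_mode(p1), mass_near_main_mode(p8)
# Sharpening keeps the argmax at the same mode but concentrates mass around it.
```

The argmax stays at the taller mode for both exponents; only the mass around it grows.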
In the standard LLM setting, given a prompt token sequence \(\mathbf{x}\), the base model defines a distribution \(p_\theta(\mathbf{y} \mid \mathbf{x})\) over complete output sequences \(\mathbf{y} = (y_1, \dots, y_T)\), where \(y_t\) is the token generated at decoding step \(t\). Power sampling keeps the same base model, but changes the target distribution at the sequence level by sharpening it:
\begin{equation} \label{eq:power-target} \color{red}{\pi_\alpha(\mathbf{y} \mid \mathbf{x}) = \frac{p_\theta(\mathbf{y} \mid \mathbf{x})^\alpha}{Z_\alpha(\mathbf{x})}, \qquad Z_\alpha(\mathbf{x}) = \sum_{\mathbf{y}} p_\theta(\mathbf{y} \mid \mathbf{x})^\alpha, \qquad \alpha > 1.} \end{equation}Here \(\alpha\) is the sharpening exponent and \(Z_\alpha(\mathbf{x})\) is the normalization constant. This is the common target shared by those three papers.
It is useful to compare this target with temperature sampling. Temperature also sharpens the model distribution, but it is not the same object [3]; the difference is the order of operations:
- Temperature sampling first marginalizes to the next-token distribution and then applies powering.
- Power sampling first raises the probability of the full sequence of length \(T\) to a power and only then marginalizes to obtain token-level conditionals.
Mathematically, the difference is (ignoring normalization constants):
\begin{equation} \label{eq:power-vs-temperature} \color{red}{\sum_{\mathbf{y}_{>t}} p_\theta(y_t, \mathbf{y}_{>t} \mid \mathbf{y}_{\lt t}, \mathbf{x})^\alpha} \color{black}{\text{ vs. }} \color{blue}{\left(\sum_{\mathbf{y}_{>t}} p_\theta(y_t, \mathbf{y}_{>t} \mid \mathbf{y}_{\lt t}, \mathbf{x})\right)^\alpha}. \end{equation}Both formulas describe the probability of predicting the next token \(y_t\) given \(\mathbf{y}_{\lt t}\) and prompt \(\mathbf{x}\), but they aggregate future continuations in a different order.
To further understand its implications, we define the continuation path of \(y_t\) as follows:
Using Def. 1 as the reference point, we can see that the LHS of \eqref{eq:power-vs-temperature} prefers a token \(y_t\) with a few high-probability continuations, while the RHS prefers a token whose continuations have a larger total probability mass, even if each individual continuation is not especially likely. In other words, the power distribution is more “mode-seeking”, while the temperature distribution is more “mode-covering”.
The demo below makes the distinction concrete. We consider a toy example where the token sequence length is 4, each token has vocabulary size 2 (A and B), and the base model induces a distribution over the 16 leaves of the corresponding binary sequence tree. We can then compare the token-level conditionals induced by power sampling and temperature sampling at each step. As \(\alpha\) increases, the two methods begin to prefer different tokens because they aggregate continuation paths differently.
Two views of the same binary tree
Hover leaves. Click to zoom the full comparison.
Expanded compare view
Joint Distribution Over The 16 Complete Sequences Used in the Demo
The demo above uses a fixed counterexample rather than resampling a new tree on each load. The base joint distribution over the 16 complete sequences is listed below; values are rounded to 4 decimals.
| Sequence \(\mathbf{y}\) | First token | \(p_\theta(\mathbf{y} \mid \mathbf{x})\) |
|---|---|---|
| AAAA | A | 0.1344 |
| AAAB | A | 0.0318 |
| AABA | A | 0.0007 |
| AABB | A | 0.0904 |
| ABAA | A | 0.1479 |
| ABAB | A | 0.1462 |
| ABBA | A | 0.0059 |
| ABBB | A | 0.0082 |
| BAAA | B | 0.0089 |
| BAAB | B | 0.3505 |
| BABA | B | 0.0008 |
| BABB | B | 0.0239 |
| BBAA | B | 0.0072 |
| BBAB | B | 0.0001 |
| BBBA | B | 0.0143 |
| BBBB | B | 0.0287 |
So the first-token branch masses are approximately 0.5655 for A and 0.4344 for B, but the single highest-probability full sequence is BAAB with mass 0.3505. This is exactly the configuration that makes temperature remain more mode-covering at the first step while sequence-level power sampling flips toward the B branch.
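This flip can be checked directly from the table. The sketch below (assuming the demo's default \(\alpha = 4\)) computes the first-token conditional both ways: temperature powers the marginal branch masses, while power sampling sums powered leaf probabilities within each branch.

```python
# First-token conditionals under temperature vs. sequence-level power sampling,
# computed from the fixed 16-leaf joint distribution in the table above.
probs = {
    "AAAA": 0.1344, "AAAB": 0.0318, "AABA": 0.0007, "AABB": 0.0904,
    "ABAA": 0.1479, "ABAB": 0.1462, "ABBA": 0.0059, "ABBB": 0.0082,
    "BAAA": 0.0089, "BAAB": 0.3505, "BABA": 0.0008, "BABB": 0.0239,
    "BBAA": 0.0072, "BBAB": 0.0001, "BBBA": 0.0143, "BBBB": 0.0287,
}

def first_token_conditionals(alpha):
    # Temperature: power the *marginal* branch masses, then normalize.
    branch = {v: sum(p for s, p in probs.items() if s[0] == v) for v in "AB"}
    temp = {v: branch[v] ** alpha for v in "AB"}
    zt = sum(temp.values())
    # Power: sum the powered *leaf* probabilities per branch, then normalize.
    powr = {v: sum(p ** alpha for s, p in probs.items() if s[0] == v) for v in "AB"}
    zp = sum(powr.values())
    return {v: temp[v] / zt for v in "AB"}, {v: powr[v] / zp for v in "AB"}

temp4, pow4 = first_token_conditionals(alpha=4)
# Temperature still prefers the A branch (larger total mass), while power
# sampling flips toward B, driven by the single dominant leaf BAAB.
```

At \(\alpha = 4\), the temperature conditional keeps most of its mass on A, while the power conditional puts over 90% of its mass on B.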
Power Sampling: Different Variants
Given the nice property of the power distribution, the next question is how to sample from it. The challenge is that the normalization constant \(Z_\alpha(\mathbf{x})\) is intractable, and the exact token-level conditionals under \(\pi_\alpha\) require a future marginalization that is equally intractable. Fortunately, tools from probabilistic inference help here.
- Reasoning with Sampling [3] uses Metropolis-Hastings to target the distribution directly over complete sequences.
- Scalable Power Sampling [4] rewrites the exact power conditional into a low-temperature term times a future-quality factor, then estimates that factor with Monte Carlo rollouts.
- Power-SMC [5] keeps the same global target, but replaces rollout estimation with sequential importance weights and ESS-triggered resampling across particles.
1. Reasoning with Sampling: exact global correction, serial dynamics
Reasoning with Sampling [3]
is the foundation of this line of work. It applies Metropolis-Hastings, a Markov chain Monte Carlo (MCMC) method1 that typically does not require the normalization constant of the target distribution. To use MH, the key is to design a proposal distribution that is easy to sample and evaluate. To guarantee the desired stationary distribution, the proposal-induced transition kernel must also be irreducible and aperiodic2. For a sequence proposal \(q(\mathbf{y}' \mid \mathbf{y}, \mathbf{x})\), the acceptance rule is
Here \(\mathbf{x}\) is the prompt, \(\mathbf{y}\) is the current generated sequence of length \(T\), and \(\mathbf{y}'\) is a proposed sequence of the same length. This construction is statistically principled: aside from the usual MCMC convergence concerns when the proposal mixes poorly, there is no approximation to the target distribution itself. The bottleneck is computational. In Karan and Du [3], the proposal edits a suffix (sample \(t \sim U(1,T)\), then resample \(\mathbf{y}_{>t}\)), regenerates the continuation, and finally performs an accept/reject step that depends on the previous state. That makes the method both serial and expensive.
So the first paper gets the objective exactly right, but leaves a serious latency bottleneck for the rest of the line of work to address.
What Is Metropolis-Hastings? Why Doesn’t It Require the Normalization Constant?
Metropolis-Hastings is a way to build a Markov chain whose stationary distribution is the target \(\pi(\mathbf{y} \mid \mathbf{x})\). Starting from the current state \(\mathbf{y}\), we first propose a new state \(\mathbf{y}' \sim q(\cdot \mid \mathbf{y}, \mathbf{x})\), then accept it with probability
\begin{equation*} A(\mathbf{y}^{\prime}, \mathbf{y}) = \min \lbrace 1, \frac{\tilde{\pi}(\mathbf{y}^{\prime} \mid \mathbf{x}) q(\mathbf{y} \mid \mathbf{y}^{\prime}, \mathbf{x})}{\tilde{\pi}(\mathbf{y} \mid \mathbf{x}) q(\mathbf{y}^{\prime} \mid \mathbf{y}, \mathbf{x})} \rbrace, \end{equation*}
where \(\tilde{\pi}\) is any unnormalized version of the target. If the proposal is rejected, the chain stays at \(\mathbf{y}\). So MH needs two ingredients: a target that can be evaluated up to proportionality, and a proposal whose forward and reverse probabilities are both tractable.
The key reason this works is detailed balance. Writing the off-diagonal transition kernel as
\begin{equation*} K(\mathbf{y}, \mathbf{y}^{\prime}) = q(\mathbf{y}^{\prime} \mid \mathbf{y}, \mathbf{x}) A(\mathbf{y}^{\prime}, \mathbf{y}), \end{equation*}
MH is constructed so that
\begin{equation*} \pi(\mathbf{y} \mid \mathbf{x}) K(\mathbf{y}, \mathbf{y}^{\prime}) = \pi(\mathbf{y}^{\prime} \mid \mathbf{x}) K(\mathbf{y}^{\prime}, \mathbf{y}). \end{equation*}
More explicitly, the two probability flows are equal because
\[ \begin{aligned} \pi(\mathbf{y} \mid \mathbf{x}) q(\mathbf{y}' \mid \mathbf{y}, \mathbf{x}) A(\mathbf{y}', \mathbf{y}) &= \pi(\mathbf{y}' \mid \mathbf{x}) q(\mathbf{y} \mid \mathbf{y}', \mathbf{x}) A(\mathbf{y}, \mathbf{y}') \\ &= \min \{ \pi(\mathbf{y} \mid \mathbf{x}) q(\mathbf{y}' \mid \mathbf{y}, \mathbf{x}),\; \pi(\mathbf{y}' \mid \mathbf{x}) q(\mathbf{y} \mid \mathbf{y}', \mathbf{x}) \}. \end{aligned} \]So the probability flow from \(\mathbf{y}\) to \(\mathbf{y}'\) exactly matches the reverse flow. Under the usual irreducibility and aperiodicity conditions, that makes \(\pi\) the stationary distribution of the chain.
Now suppose the target has the form
\begin{equation*} \pi(\mathbf{y} \mid \mathbf{x}) = \frac{\tilde{\pi}(\mathbf{y} \mid \mathbf{x})}{Z(\mathbf{x})}. \end{equation*}
Then the acceptance ratio only depends on the quotient:
\begin{equation*} \frac{\pi(\mathbf{y}^{\prime} \mid \mathbf{x})}{\pi(\mathbf{y} \mid \mathbf{x})} = \frac{\tilde{\pi}(\mathbf{y}^{\prime} \mid \mathbf{x}) / Z(\mathbf{x})}{\tilde{\pi}(\mathbf{y} \mid \mathbf{x}) / Z(\mathbf{x})} = \frac{\tilde{\pi}(\mathbf{y}^{\prime} \mid \mathbf{x})}{\tilde{\pi}(\mathbf{y} \mid \mathbf{x})}. \end{equation*}So the normalization constant cancels. That is why MH does not require knowing \(Z(\mathbf{x})\) itself. It only requires pointwise evaluation of the target up to a multiplicative constant.
In this paper, the unnormalized target is simply:
\begin{equation*} \tilde{\pi}_\alpha(\mathbf{y} \mid \mathbf{x}) = p_\theta(\mathbf{y} \mid \mathbf{x})^\alpha. \end{equation*}So the intractable normalizer \(Z_\alpha(\mathbf{x})\) never appears in the acceptance rule. The hard part is elsewhere: designing a proposal \(q\) that mixes well over sequences while keeping both \(q(\mathbf{y}' \mid \mathbf{y}, \mathbf{x})\) and \(q(\mathbf{y} \mid \mathbf{y}', \mathbf{x})\) easy to compute.
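To make the cancellation concrete, here is a minimal MH sketch targeting \(p^\alpha\) over the 16 toy sequences from the earlier table. For compactness it uses an independence proposal \(q(\mathbf{y}') = p_\theta(\mathbf{y}')\) rather than the paper's suffix-resampling proposal; under that simplification the acceptance ratio collapses to \((p_\theta(\mathbf{y}')/p_\theta(\mathbf{y}))^{\alpha-1}\), and \(Z_\alpha\) never appears.

```python
import random

# Toy joint distribution over the 16 complete sequences (from the table above).
probs = {
    "AAAA": 0.1344, "AAAB": 0.0318, "AABA": 0.0007, "AABB": 0.0904,
    "ABAA": 0.1479, "ABAB": 0.1462, "ABBA": 0.0059, "ABBB": 0.0082,
    "BAAA": 0.0089, "BAAB": 0.3505, "BABA": 0.0008, "BABB": 0.0239,
    "BBAA": 0.0072, "BBAB": 0.0001, "BBBA": 0.0143, "BBBB": 0.0287,
}
SEQS, PS = list(probs), list(probs.values())

def mh_power_sample(alpha, steps, rng):
    # MH chain targeting pi_alpha ∝ p^alpha with independence proposal q = p.
    # Acceptance ratio (p(y')^alpha p(y)) / (p(y)^alpha p(y')) reduces to
    # (p(y') / p(y))^(alpha - 1); the normalizer Z_alpha cancels.
    y = rng.choices(SEQS, weights=PS)[0]
    for _ in range(steps):
        y_prop = rng.choices(SEQS, weights=PS)[0]
        if rng.random() < min(1.0, (probs[y_prop] / probs[y]) ** (alpha - 1)):
            y = y_prop
    return y

rng = random.Random(0)
draws = [mh_power_sample(alpha=4, steps=50, rng=rng) for _ in range(2000)]
freq_baab = draws.count("BAAB") / len(draws)
# pi_4 puts roughly 0.92 of its mass on BAAB, so freq_baab should land near that.
```

The chain concentrates on BAAB, the dominant leaf under \(\pi_4\), without ever evaluating \(Z_\alpha\).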
2. Scalable Power Sampling: exact decomposition, approximate future correction
Scalable Power Sampling [4] keeps the same target but moves the approximation to a different place. Instead of running a sequence-level Markov chain, it first writes down the exact next-token conditional induced by the global power target and compares it against plain low-temperature decoding. In this section, \(\mathbf{y}_{\lt t}\) is the current decoded prefix, \(u\) or \(v\) denotes a candidate next token, and \(\mathbf{y}_{>t}\) denotes the future suffix being marginalized out.
The paper’s main theorem rewrites that exact power conditional as a scaled low-temperature distribution:
\begin{equation} \label{eq:sps-power-decomposition} \color{red}{p_{\alpha}^{(\mathrm{pow})}(v \mid \mathbf{x}, \mathbf{y}_{\lt t}) = \frac{p_\theta(v \mid \mathbf{x}, \mathbf{y}_{\lt t})^\alpha \, \zeta_t(v)}{\sum_{u \in \mathcal{V}} p_\theta(u \mid \mathbf{x}, \mathbf{y}_{\lt t})^\alpha \, \zeta_t(u)}.} \end{equation}where the future-quality factor is
\begin{equation} \label{eq:sps-zeta} \zeta_t(v) = \sum_{\mathbf{y}_{>t}} p_\theta(\mathbf{y}_{>t} \mid \mathbf{x}, \mathbf{y}_{\lt t}, v)^\alpha = \mathbb{E}_{\mathbf{y}_{>t} \sim p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{\lt t}, v)} \left[ p_\theta(\mathbf{y}_{>t} \mid \mathbf{x}, \mathbf{y}_{\lt t}, v)^{\alpha-1} \right]. \end{equation}This is the key conceptual move of the paper. The gap between low-temperature decoding and true power sampling is localized into a single token-dependent scalar \(\zeta_t(v)\).
With that factor isolated, the authors estimate \(\zeta_t(v)\) with Monte Carlo rollouts:
\begin{equation} \label{eq:sps-zeta-monte-carlo} \hat{\zeta}_t(v) = \frac{1}{M_t} \sum_{r=1}^{M_t} p_\theta\!\left(\mathbf{y}_{>t}^{(r)} \mid \mathbf{x}, \mathbf{y}_{\lt t}, v\right)^{\alpha-1}, \qquad \mathbf{y}_{>t}^{(r)} \sim p_\theta(\cdot \mid \mathbf{x}, \mathbf{y}_{\lt t}, v). \end{equation}where \(M_t\) is the number of Monte Carlo rollouts. Plugging this estimate back in gives
\begin{equation} \label{eq:sps-plugin-conditional} \hat p_{\alpha}^{(\mathrm{pow})}(v \mid \mathbf{x}, \mathbf{y}_{\lt t}) = \frac{p_\theta(v \mid \mathbf{x}, \mathbf{y}_{\lt t})^\alpha \, \hat{\zeta}_t(v)}{\sum_{u \in \mathcal{V}} p_\theta(u \mid \mathbf{x}, \mathbf{y}_{\lt t})^\alpha \, \hat{\zeta}_t(u)}. \end{equation}Because this is a ratio of Monte Carlo estimates, the plug-in estimator has bias of order \(\mathcal{O}(M_t^{-1})\). Here, “biased” means that the estimator’s expectation does not exactly equal the target:
\begin{equation} \label{eq:sps-bias-definition} \mathrm{Bias}(\hat p) := \mathbb{E}[\hat p] - p. \end{equation}
Why Is the Plug-in Estimator Biased?
The subtle point is that the Monte Carlo average \(\hat{\zeta}_t(v)\) is itself an unbiased estimator of \(\zeta_t(v)\), but the plug-in power conditional is a ratio of noisy quantities. In general,
\begin{equation*} \mathbb{E}[\hat N / \hat D] \neq \mathbb{E}[\hat N] / \mathbb{E}[\hat D]. \end{equation*}A clean way to see the \(\mathcal{O}(M_t^{-1})\) bias order is the delta method. Suppose
\begin{equation*} \hat N = \frac{1}{M_t}\sum_{r=1}^{M_t} X_r, \qquad \hat D = \frac{1}{M_t}\sum_{r=1}^{M_t} Y_r, \end{equation*}
where the rollout pairs \((X_r,Y_r)\) are i.i.d. across \(r\) with finite second moments, and let \(N = \mathbb{E}[X_r]\), \(D = \mathbb{E}[Y_r]\). Write the centered errors as \(\delta_N = \hat N - N\) and \(\delta_D = \hat D - D\). Because these are sample-mean fluctuations, \(\mathbb{E}[\delta_N] = \mathbb{E}[\delta_D] = 0\), and
\begin{equation*} \operatorname{Var}(\hat D) = \frac{\operatorname{Var}(Y_1)}{M_t}, \qquad \operatorname{Cov}(\hat N,\hat D) = \frac{\operatorname{Cov}(X_1,Y_1)}{M_t}. \end{equation*}
So both \(\operatorname{Var}(\hat D)\) and \(\operatorname{Cov}(\hat N,\hat D)\) scale as \(\mathcal{O}(M_t^{-1})\). Equivalently, \(\delta_N\) and \(\delta_D\) are each \(\mathcal{O}_p(M_t^{-1/2})\). Expanding \(g(N,D)=N/D\) to second order around \((N,D)\) gives
\begin{equation*} \frac{\hat N}{\hat D} = \frac{N}{D} + \frac{\delta_N}{D} - \frac{N\delta_D}{D^2} - \frac{\delta_N\delta_D}{D^2} + \frac{N\delta_D^2}{D^3} + \cdots . \end{equation*}
After taking expectations, the linear terms vanish because \(\mathbb{E}[\delta_N]=\mathbb{E}[\delta_D]=0\). The first surviving terms are \(\mathbb{E}[\delta_N\delta_D]=\operatorname{Cov}(\hat N,\hat D)\) and \(\mathbb{E}[\delta_D^2]=\operatorname{Var}(\hat D)\), both of which are already \(\mathcal{O}(M_t^{-1})\). Therefore the leading bias is \(\mathcal{O}(M_t^{-1})\), and a second-order Taylor expansion of \(g(N,D) = N/D\) gives
\begin{equation*} \mathbb{E}\left[\frac{\hat N}{\hat D}\right] \approx \frac{N}{D} + B_M. \end{equation*}
with
\begin{equation*} B_M = - \frac{\operatorname{Cov}(\hat N,\hat D)}{D^2} + \frac{N \operatorname{Var}(\hat D)}{D^3}. \end{equation*}
In short: the first-order terms vanish in expectation because \(\delta_N\) and \(\delta_D\) have mean zero, and the surviving second-order terms \(\operatorname{Cov}(\hat N,\hat D)\) and \(\operatorname{Var}(\hat D)\) are each \(\mathcal{O}(M_t^{-1})\), which is exactly the leading bias of the plug-in ratio estimator.
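To see the plug-in construction end to end, the sketch below estimates \(\zeta_1(v)\) and the resulting plug-in conditional at the first decoding step of the toy binary tree from earlier, with \(\alpha = 4\) (an illustrative setup, not the paper's LLM pipeline). Because each first token determines a branch, the exact power conditional is available for comparison.

```python
import random

# Toy joint distribution over the 16 complete sequences (from the table above).
probs = {
    "AAAA": 0.1344, "AAAB": 0.0318, "AABA": 0.0007, "AABB": 0.0904,
    "ABAA": 0.1479, "ABAB": 0.1462, "ABBA": 0.0059, "ABBB": 0.0082,
    "BAAA": 0.0089, "BAAB": 0.3505, "BABA": 0.0008, "BABB": 0.0239,
    "BBAA": 0.0072, "BBAB": 0.0001, "BBBA": 0.0143, "BBBB": 0.0287,
}
ALPHA = 4
BRANCH = {v: sum(p for s, p in probs.items() if s[0] == v) for v in "AB"}

def zeta_hat(v, m_t, rng):
    # Monte Carlo estimate of zeta_1(v): draw M_t suffix rollouts from the base
    # conditional p(. | v) and average p(suffix | v)^(alpha - 1).
    leaves = [s for s in probs if s[0] == v]
    w = [probs[s] / BRANCH[v] for s in leaves]
    rollouts = rng.choices(leaves, weights=w, k=m_t)
    return sum((probs[s] / BRANCH[v]) ** (ALPHA - 1) for s in rollouts) / m_t

def plugin_conditional(m_t, rng):
    # Plug-in power conditional at step 1: p_hat(v) ∝ p(v)^alpha * zeta_hat(v).
    scores = {v: BRANCH[v] ** ALPHA * zeta_hat(v, m_t, rng) for v in "AB"}
    z = sum(scores.values())
    return {v: scores[v] / z for v in "AB"}

# Exact reference: the power conditional is proportional to the powered leaf mass.
exact = {v: sum(p ** ALPHA for s, p in probs.items() if s[0] == v) for v in "AB"}
z = sum(exact.values())
exact = {v: exact[v] / z for v in "AB"}

rng = random.Random(0)
est = plugin_conditional(m_t=4000, rng=rng)
```

With a generous rollout budget the plug-in conditional matches the exact power conditional closely; the bias discussed above only matters at small \(M_t\).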
The paper therefore adds a jackknife correction:
\begin{equation} \label{eq:sps-jackknife} \hat p_{\alpha,\mathrm{JK}}^{(\mathrm{pow})}(v \mid \mathbf{x}, \mathbf{y}_{\lt t}) = M_t \hat p_{\alpha}^{(\mathrm{pow})}(v \mid \mathbf{x}, \mathbf{y}_{\lt t}) - \frac{M_t-1}{M_t}\sum_{s=1}^{M_t} \hat p_{\alpha,-s}^{(\mathrm{pow})}(v \mid \mathbf{x}, \mathbf{y}_{\lt t}). \end{equation}Here \(\hat p_{\alpha,-s}^{(\mathrm{pow})}\) means the same plug-in estimator recomputed after leaving out rollout \(s\). The jackknife idea is simple: if the original estimator has an expansion
\begin{equation} \label{eq:sps-bias-expansion} \mathbb{E}[\hat p_{\alpha}^{(\mathrm{pow})}] = p_{\alpha}^{(\mathrm{pow})} + \frac{b}{M_t} + \mathcal{O}(M_t^{-2}). \end{equation}then the leave-one-out recombination cancels that leading \(b/M_t\) term. Put differently, jackknife bias correction here removes the first-order finite-sample ratio bias induced by plugging Monte Carlo estimates into a normalized probability formula. The result is an estimator whose bias is \(\mathcal{O}(M_t^{-2})\). In the practical algorithm, the paper also restricts attention to a candidate set \(\mathcal{G}_t\), such as Top-\(K_t\) tokens, and may truncate rollouts to a horizon \(H_t\); these are additional efficiency heuristics on top of the main formulation.
This is the incremental improvement over MH: the global sequence-level problem is converted into a one-pass autoregressive sampler, so the serial accept/reject loop disappears. The cost is that the future correction is no longer exact: it is estimated by rollouts, and the quality of the approximation depends on the rollout budget, candidate truncation, and jackknife variance from finite samples. Relative to the first paper, the second paper trades exact global MCMC correction for an approximate tokenwise future correction that is much easier to run at inference time.
3. Power-SMC: exact sequential correction, finite-particle approximation
Power-SMC [5] changes the computational strategy once more. Instead of estimating a future-quality factor by explicit lookahead [3,4], it maintains \(N\) parallel partial continuations and corrects them sequentially as decoding proceeds.
In the paper’s terminology, each partial continuation is a particle. At step \(t\), particle \(i\) is a sampled prefix \(\mathbf{y}_{1:t}^{(i)}\) drawn from the proposal path distribution
\begin{equation*} q_{1:t}(\mathbf{y}_{1:t} \mid \mathbf{x}) := \prod_{s=1}^t q_s(y_s \mid \mathbf{x}, \mathbf{y}_{1:s-1}). \end{equation*}That is, all particles start from the empty prefix; at each decoding step, particle \(i\) samples one next token from \(q_t(\cdot \mid \mathbf{x}, \mathbf{y}_{1:t-1}^{(i)})\) and appends it. So after \(t\) steps, \(\mathbf{y}_{1:t}^{(i)}\) is exactly an autoregressively sampled proposal prefix, not a direct sample from the target. The target is instead defined over prefixes:
\begin{equation} \label{eq:powersmc-prefix-target} \gamma_t(\mathbf{y}_{1:t} \mid \mathbf{x}) := p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x})^\alpha, \qquad \pi_t(\mathbf{y}_{1:t} \mid \mathbf{x}) = \frac{\gamma_t(\mathbf{y}_{1:t} \mid \mathbf{x})}{Z_t(\mathbf{x})}. \end{equation}Here \(\mathbf{y}_{1:t}\) is a prefix of length \(t\), \(\mathbf{x}\) is still the prompt token sequence, \(Z_t(\mathbf{x})\) is the normalizing constant, and \(\pi_t\) is the normalized target over prefixes. Once particles eventually complete by emitting EOS3, the final weighted collection targets the desired sequence-level power distribution. For any prefix-only proposal \(q_t(v \mid \mathbf{x}, \mathbf{y}_{1:t-1})\), the exact one-step correction is
\begin{equation} \label{eq:powersmc-incremental-weight} \omega_t(\mathbf{y}_{1:t}) = \frac{\gamma_t(\mathbf{y}_{1:t} \mid \mathbf{x})}{\gamma_{t-1}(\mathbf{y}_{1:t-1} \mid \mathbf{x})\, q_t(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})} = \frac{p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})^\alpha}{q_t(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})}. \end{equation}Each particle therefore carries an importance weight. Intuitively, this weight says how much that sampled prefix should count after correcting for the fact that it was generated from the convenient proposal rather than directly from the power target. Operationally, after sampling one new token for particle \(i\), we multiply its previous unnormalized weight by the correction factor above and then renormalize across particles:
\begin{equation} \label{eq:powersmc-weight-update} \tilde W_t^{(i)} = \tilde W_{t-1}^{(i)} \, \omega_t\!\left(\mathbf{y}_{1:t}^{(i)}\right), \qquad W_t^{(i)} = \frac{\tilde W_t^{(i)}}{\sum_{j=1}^{N} \tilde W_t^{(j)}}. \end{equation}This is the sequential correction step. Since \(\tilde W_0^{(i)} = 1\), iterating the update gives
\begin{equation*} \tilde W_t^{(i)} = \prod_{s=1}^t \omega_s\!\left(\mathbf{y}_{1:s}^{(i)}\right) = \frac{\gamma_t(\mathbf{y}_{1:t}^{(i)} \mid \mathbf{x})}{q_{1:t}(\mathbf{y}_{1:t}^{(i)} \mid \mathbf{x})}. \end{equation*}This telescoping identity is the key point: \(\tilde W_t^{(i)}\) is not just a bookkeeping variable, but exactly the unnormalized importance weight that corrects the sampled proposal path back to the prefix power target. After normalizing across particles, the weighted particle system approximates
\begin{equation*} \pi_t(\mathbf{y}_{1:t} \mid \mathbf{x}) \propto \gamma_t(\mathbf{y}_{1:t} \mid \mathbf{x}). \end{equation*}So \(W_t^{(i)}\) tells us how much particle \(i\) should count under the prefix power target, not under the proposal that generated it. Once the sequence is complete, this prefix target is exactly the sequence-level power distribution.
With a finite number of particles, however, the weights can degenerate: a few particles may start carrying almost all of the mass. The standard SMC diagnostic for this is the effective sample size (ESS)4:
\begin{equation} \label{eq:powersmc-ess} \mathrm{ESS}_t = \left(\sum_{i=1}^{N} (W_t^{(i)})^2\right)^{-1}, \qquad \text{resample if } \mathrm{ESS}_t < \kappa N. \end{equation}When \(\mathrm{ESS}_t\) drops below \(\kappa N\), the paper resamples by drawing particle indices from the categorical distribution with probabilities \(W_t^{(1)}, \dots, W_t^{(N)}\). So high-weight particles are more likely to be copied, low-weight particles are more likely to disappear, and the weights are then reset to uniform. This is not an MH-style accept/reject step. The correction has already happened through the importance weights; resampling is a population-control step that prevents weight collapse and keeps the approximation numerically useful.
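In code, the ESS diagnostic and the multinomial resampling step amount to only a few lines; the sketch below uses placeholder particle contents and the default \(\kappa = 0.5\).

```python
import random

def ess(weights):
    # Effective sample size of normalized weights: (sum_i W_i^2)^(-1).
    return 1.0 / sum(w * w for w in weights)

def maybe_resample(particles, weights, kappa, rng):
    # Resample indices from Categorical(W) when ESS < kappa * N, then reset
    # weights to uniform; otherwise leave the particle system untouched.
    n = len(particles)
    if ess(weights) < kappa * n:
        idx = rng.choices(range(n), weights=weights, k=n)
        return [particles[i] for i in idx], [1.0 / n] * n
    return particles, weights

rng = random.Random(0)
uniform = [0.25] * 4                  # ESS = 4: healthy, no resampling
collapsed = [0.97, 0.01, 0.01, 0.01]  # ESS ≈ 1.06: degenerate, resample
```

A uniform weight vector has \(\mathrm{ESS} = N\) and is left alone; a collapsed one triggers resampling, after which high-weight particles dominate the copies.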
One of the paper’s most useful theoretical results is that among all proposals that depend only on the current prefix, the unique variance-minimizing choice is
\begin{equation} \label{eq:powersmc-optimal-proposal} q_t^\star(v \mid \mathbf{x}, \mathbf{y}_{1:t-1}) = \frac{p_\theta(v \mid \mathbf{x}, \mathbf{y}_{1:t-1})^\alpha}{\sum_{u \in \mathcal{V}} p_\theta(u \mid \mathbf{x}, \mathbf{y}_{1:t-1})^\alpha} \propto p_\theta(v \mid \mathbf{x}, \mathbf{y}_{1:t-1})^\alpha. \end{equation}So within the prefix-only class, temperature \(\tau = 1/\alpha\) is not globally correct by itself, but it is the unique proposal that makes the one-step weight variance zero conditional on the current prefix. The exactness comes from pairing that proposal with sequential importance weighting and resampling, not from temperature alone.
Compared with Scalable Power Sampling, the future correction is no longer estimated by Monte Carlo lookahead. Instead, the proposal is corrected exactly, step by step, through sequential importance ratios, and the remaining approximation comes only from using a finite number of particles rather than from a plug-in estimator for \(\zeta_t\). Compared with MH, Power-SMC preserves the same sequence-level target while removing the serial Markov-chain bottleneck: all particles can be advanced in parallel in a single batched decode. The practical failure mode therefore changes as well. The issue is no longer MH mixing or rollout bias, but weight collapse across particles, resampling noise, and the need to choose \(N\), \(\kappa\), and sometimes an \(\alpha\)-ramping schedule carefully.
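Putting the pieces together, here is a compact Power-SMC sketch on the 16-leaf toy distribution from earlier: the optimal prefix-only proposal, the incremental weight \(\sum_u p_\theta(u \mid \cdot)^\alpha\) (which has zero conditional variance under that proposal), and ESS-triggered multinomial resampling. The particle count and seeding are illustrative choices, not the paper's settings.

```python
import math
import random

# Toy joint distribution over the 16 complete sequences (from the table above).
probs = {
    "AAAA": 0.1344, "AAAB": 0.0318, "AABA": 0.0007, "AABB": 0.0904,
    "ABAA": 0.1479, "ABAB": 0.1462, "ABBA": 0.0059, "ABBB": 0.0082,
    "BAAA": 0.0089, "BAAB": 0.3505, "BABA": 0.0008, "BABB": 0.0239,
    "BBAA": 0.0072, "BBAB": 0.0001, "BBBA": 0.0143, "BBBB": 0.0287,
}
ALPHA, T, N, KAPPA = 4, 4, 512, 0.5

def cond(prefix):
    # Base-model next-token conditional p(v | prefix), recovered from the joint.
    mass = {v: sum(p for s, p in probs.items() if s.startswith(prefix + v)) for v in "AB"}
    z = sum(mass.values())
    return {v: mass[v] / z for v in "AB"}

def power_smc(rng):
    particles, logw = [""] * N, [0.0] * N
    for _ in range(T):
        for i in range(N):
            p = cond(particles[i])
            # Optimal prefix-only proposal q_t(v) ∝ p(v | prefix)^alpha.
            qw = [p[v] ** ALPHA for v in "AB"]
            z = sum(qw)
            particles[i] += rng.choices("AB", weights=qw)[0]
            # Incremental weight p^alpha / q_t = z, independent of the sampled token.
            logw[i] += math.log(z)
        # Normalize weights and resample if the system degenerates.
        m = max(logw)
        w = [math.exp(l - m) for l in logw]
        s = sum(w)
        W = [x / s for x in w]
        if 1.0 / sum(x * x for x in W) < KAPPA * N:
            idx = rng.choices(range(N), weights=W, k=N)
            particles = [particles[i] for i in idx]
            logw = [0.0] * N
    # Final weighted empirical distribution over complete sequences.
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    s = sum(w)
    out = {}
    for y, wi in zip(particles, w):
        out[y] = out.get(y, 0.0) + wi / s
    return out

rng = random.Random(0)
est = power_smc(rng)
# est should put close to 0.92 of its mass on BAAB, matching pi_4 on this tree.
```

All particles advance in lockstep, so in the real setting the \(N\) decodes batch into a single forward pass per step, which is exactly the latency win over serial MH.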
Comparison
Results from the Papers
The Power-SMC paper [5] provides a useful comparison of the three algorithms on a reasoning benchmark, plotting accuracy against latency. The two follow-up papers substantially improve the efficiency of power sampling, and Power-SMC achieves the strongest overall trade-off in the reported results.

Interactive Demo
To compare the algorithms in a controlled setting, we reuse the previous toy setting: sequence length \(4\), vocabulary size \(2\), and therefore \(16\) complete traces in total. The object of comparison is now the full distribution over those 16 traces, not just a first-token marginal.
For each token-draw budget \(C\), the demo runs every algorithm repeatedly and forms the empirical leaf distribution \(\hat q_C(\mathbf{y})\). Concretely, "320 seeded repeats per budget value" means that for each method at each budget point \(C\), the demo uses 320 fixed random seeds to generate 320 sampled full traces. The fixed seeds make the plots reproducible, while the repeated sampling reduces Monte Carlo noise.
The demo reports two summary views of that empirical distribution. The first is the sequence-level KL divergence,
\begin{equation} \label{eq:comparison-kl} D_{\mathrm{KL}}\!\left(\hat q_C(\mathbf{y}) \,\|\, \pi_\alpha(\mathbf{y})\right). \end{equation}where \(\pi_\alpha\) is the exact sequence-level power target on the same 16 leaves. So a value near zero means that the algorithm’s sampled full-trace distribution is close to the exact power target. The second chart tracks \(\max_{\mathbf{y}} \hat q_C(\mathbf{y})\), the probability mass assigned to the most likely sampled trace, which gives a complementary view of how concentrated the empirical distribution becomes.
The budget axis also needs to be read differently for the three methods. MH uses a suffix-resampling sequence proposal: choose a cut position uniformly, keep the prefix, and regenerate the suffix from the base model. Scalable Power Sampling uses an autoregressive token proposal based on the rollout-corrected plug-in conditional \(\hat p_{\alpha}^{(\mathrm{pow})}(\cdot \mid \mathbf{x}, \mathbf{y}_{\lt t})\). Power-SMC uses the prefix-only low-temperature proposal
\begin{equation} \label{eq:comparison-prefix-proposal} q_t(v \mid \mathbf{x}, \mathbf{y}_{\lt t}) \propto p_\theta(v \mid \mathbf{x}, \mathbf{y}_{\lt t})^\alpha, \end{equation} and then corrects it with sequential importance weights plus resampling. Here the budget \(C\) does not mean the number of sampled full traces; it means the total number of next-token draws consumed by the toy simulator, where one token draw is one call to sample one next token from the base model. For SPS and Power-SMC, that cost is exact under the toy model. For MH, by contrast, the selected \(C\) is a budget cap, so the chart places MH at its average realized token cost under that cap. The embedded demo supports click-to-zoom, and the expanded view shows the full 16-sequence snapshot together with the implied SPS / Power-SMC settings at the selected budget.
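A minimal sketch of the Power-SMC loop may help: each particle decodes with the tempered prefix proposal of Eq. \eqref{eq:comparison-prefix-proposal}, accumulates importance weights, and resamples when the effective sample size drops. The prefix-dependent base conditional below is a hypothetical toy model invented for this sketch, not the demo's actual one.

```python
import math
import random

ALPHA, N, T, KAPPA = 4.0, 64, 4, 0.5  # sharpening, particles, length, ESS threshold

def p_next(prefix):
    """Hypothetical toy base-model conditional (an assumption for this
    sketch): P(next = 'A') grows with the number of A's in the prefix."""
    pa = 0.5 + 0.1 * prefix.count("A")
    return {"A": pa, "B": 1.0 - pa}

rng = random.Random(0)
particles, logw = [""] * N, [0.0] * N

for t in range(T):
    for i in range(N):
        probs = p_next(particles[i])
        # Prefix-only low-temperature proposal: q_t(v) ∝ p_theta(v | prefix)^alpha.
        tempered = {v: p ** ALPHA for v, p in probs.items()}
        z = sum(tempered.values())
        tok = rng.choices(list(tempered), weights=list(tempered.values()))[0]
        particles[i] += tok
        # Incremental importance weight p^alpha / q_t = z (the tempered
        # normalizer), independent of which token was actually drawn.
        logw[i] += math.log(z)
    # Multinomial resampling when the effective sample size falls below kappa*N.
    w = [math.exp(lw - max(logw)) for lw in logw]
    ess = sum(w) ** 2 / sum(x * x for x in w)
    if ess < KAPPA * N:
        idx = rng.choices(range(N), weights=w, k=N)
        particles = [particles[j] for j in idx]
        logw = [0.0] * N

# Draw one full trace according to the final particle weights.
w = [math.exp(lw - max(logw)) for lw in logw]
trace = rng.choices(particles, weights=w)[0]
print(trace)
```

One full run consumes exactly \(T \times N\) next-token draws, which is why the toy cost model charges \(4N\) for Power-SMC.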
To make that comparison concrete, the demo below is budget-driven. We keep \(\alpha\) adjustable, but for SPS and Power-SMC we do not separately expose \(M_t\) or \(N\). Instead, for each budget point \(C\), the demo chooses the implied integer settings under the toy cost model:
- \(\alpha\): the sharpening exponent in the target \(\pi_\alpha(\mathbf{y}) \propto p_\theta(\mathbf{y})^\alpha\). The default is \(\alpha = 4\).
- \(K_t\) in SPS: the candidate-set size. In the toy binary-tree demo, \(K_t\) is fixed to \(2\) because the vocabulary size is \(2\).
- \(M_t\) in SPS: inferred from the budget via \(C = 4 + 12 M_t\), so \(M_t = (C-4)/12\).
- \(N\) in Power-SMC: inferred from the budget via \(C = 4N\), so \(N = C/4\).
- \(\kappa\) in Power-SMC: fixed to the default \(\kappa = 0.5\).
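The budget-to-settings mapping above is easy to state in code. The helper name `implied_settings` is made up for this sketch; it just inverts the toy cost formulas \(C = 4 + 12 M_t\) and \(C = 4N\).

```python
def implied_settings(C):
    """Map a token-draw budget C to the toy demo's implied integer
    hyperparameters: the SPS rollout count M_t from C = 4 + 12*M_t and
    the Power-SMC particle count N from C = 4*N. A setting is None
    when the budget does not yield an integer value."""
    m, n = None, None
    if C >= 16 and (C - 4) % 12 == 0:
        m = (C - 4) // 12
    if C >= 4 and C % 4 == 0:
        n = C // 4
    return {"M_t": m, "N": n}

print(implied_settings(28))  # {'M_t': 2, 'N': 7}
```

Only budgets of the form \(C = 4 + 12k\) give integer settings for both methods at once, which is why the demo restricts itself to those operating points.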
Hyperparameters Used In The Papers But Not Exposed In This Toy Demo
- \(B_{\mathrm{blk}}\) in the blocked MCMC / SPS comparison setup: this is the block size used when proposals or corrections are organized over chunks rather than single-token updates. SPS inherits \(B_{\mathrm{blk}} = 192\) from that comparison setup. Larger blocks make each correction step coarser and usually more expensive per step; smaller blocks give finer corrections but require more update steps.
- \(T_{\max}\): this is the maximum generation length. SPS uses \(T_{\max} = 3072\), while Power-SMC uses \(T_{\max} = 2048\). Increasing \(T_{\max}\) allows longer reasoning traces but raises latency and memory cost; decreasing it can truncate valid solutions.
- \(T_{\mathrm{ramp}}\) in Power-SMC: this is the number of early decoding steps over which \(\alpha\) is ramped up to its final value. The default \(\alpha\)-ramp is \(T_{\mathrm{ramp}} = 100\) tokens. A longer ramp improves stability by delaying aggressive sharpening, while a shorter ramp reaches the final target sooner but can make early particle weights collapse faster.
Using the notation above, we can calculate the cost of each method in that toy setting:
- MH: generating the initial length-\(T = 4\) sequence costs \(4\) token draws, and each later proposal adds exactly the number of suffix tokens that are regenerated.
- SPS: with fixed \(K_t = 2\) and \(T = 4\), one full decode with rollout count \(M_t\) costs \(4 + 12 M_t\) token draws. The reason is simple: the actual decoded path costs \(4\) token draws, while rollout lookahead costs \(2 \times M_t \times (3 + 2 + 1) = 12 M_t\), since at the first three decoding steps we roll out the remaining suffix for both candidate tokens.
- Power-SMC: with \(N\) particles and \(T = 4\), one full run costs \(4N\) token draws.
To compare methods on exactly matched operating points, the demo uses budget values of the form \(C = 4 + 12k\). Those are precisely the budgets at which SPS has an integer rollout count and Power-SMC has an integer particle count at the same time.
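The cost accounting above, and the claim that budgets \(C = 4 + 12k\) match both methods exactly, can be checked in a few lines. This is a sketch of the toy cost model only, not of the papers' implementations.

```python
T, K = 4, 2  # toy setting: sequence length 4, vocabulary size 2

def sps_cost(m_t):
    """Token draws for one SPS decode with rollout count m_t: the decoded
    path costs T draws, and at each of the first T-1 steps the remaining
    suffix is rolled out m_t times for each of the K candidate tokens."""
    rollout = sum(K * m_t * (T - 1 - t) for t in range(T - 1))  # = 12*m_t here
    return T + rollout

def smc_cost(n):
    """Token draws for one Power-SMC run: n particles each decode T tokens."""
    return T * n

# Budgets of the form C = 4 + 12k land on integer settings for both methods:
for k in range(1, 4):
    c = 4 + 12 * k
    assert sps_cost(k) == c and smc_cost(1 + 3 * k) == c
    print(c, "-> M_t =", k, ", N =", 1 + 3 * k)
```

The suffix lengths \(3 + 2 + 1\) in the rollout sum are exactly the first three decoding steps of the length-4 trace, matching the \(12 M_t\) term in the text.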
Toy compute comparison on the same 16-leaf tree
[Interactive demo with four panels: sequence-level KL vs token-draw budget \(C\); top sequence probability vs realized token-draw budget; the 16-sequence distribution under the current hyperparameters; and the current hyperparameter footprint.]
Sequences are ordered lexicographically from AAAA to BBBB. The dots are the actual masses; the smooth curve only connects them to make cross-method comparison easier.
From the demo above:
- In this toy setup, the two follow-up methods have better performance-efficiency trade-offs than the original MH.
- However, the current toy setting is not rich enough to cleanly distinguish between those two follow-up methods. Use real benchmarks and careful hyperparameter tuning for any serious comparison.
- Moreover, token budget is not the same as latency: the token-draw count ignores operations that can run in parallel, such as decoding Power-SMC's \(N\) particles simultaneously in one batch.
Closing Remarks
- Taking Probabilistic Graphical Models this semester has made me appreciate how much of this line of work is really classical probabilistic inference in modern form. If you want to work on this kind of research, it is a course worth adding to your toolkit.
- The HTML demos in this post were generated with Codex using GPT-5.4.
References
EvoLM: In Search of Lost Language Model Training Dynamics. Z. Qi, F. Nie, A. Alahi, J. Zou, H. Lakkaraju, Y. Du, E. Xing, S. Kakade, H. Zhang (2025). DOI
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, G. Huang (2025). DOI
Reasoning with Sampling: Your Base Model is Smarter Than You Think. A. Karan, Y. Du (2025). DOI
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening. X. Ji, R. Tutunov, M. Zimmer, H. Bou Ammar (2026). DOI
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning. S. Azizi, E. Baghaei Potraghloo, M. Ahmadi, S. Kundu, M. Pedram (2026). DOI
It is a Markov chain because each update depends only on the current state, and it is Monte Carlo because the method uses randomized samples to approximate the target distribution. For a more formal definition of a Markov chain, see Wikipedia, “Markov chain” (Formal definition). ↩︎
For quick references on these two standard conditions, see Wikipedia. ↩︎
EOS means the end-of-sequence token. If particle \(i\) emits EOS, that particular sampled sequence is treated as finished: later steps do not add new tokens to it, and its later incremental weight is \(1\), so its total weight no longer changes. Other particles keep decoding normally. ↩︎
ESS is not a separate algorithm; it is a summary statistic of how concentrated the particle weights are. It equals \(N\) when all particles have equal weight and approaches \(1\) when one particle dominates. ↩︎