Daily Paper Reading
Topic: latent reasoning + variational Bayes + inference-time computation
Link: http://arxiv.org/abs/2502.01567

arxiv.org arxiv.org Latent Thought Models with Variational Bayes Inference-Time ComputationA language model with explicit latent thought vectors, sequence-specific posterior inference, and a fast-slow learning loop.

LTM Citation: [1]Latent Thought Models with Variational Bayes Inference-Time Computation
D. Kong, M. Zhao, D. Xu, B. Pang, S. Wang, E. Honig, Z. Si, C. Li, J. Xie, S. Xie, Y. Wu, (2025)
DOI
assumes that a text sequence \(\mathbf{x}\) is generated not by a plain autoregressive decoder alone, but by an autoregressive decoder conditioned on explicit latent thought vectors \(\mathbf{z}\). These latent vectors are organized by layer, and each decoder layer cross-attends to its corresponding latent thoughts.

Illustration of the LTM. Latent thought vectors $\mathbf{z}$ ~ N(0, I) are organized by layer, and $\mathbf{z}_l$ is injected into decoder layer l through cross-attention. $\mathbf{z}$ is instance-specific, while $\beta$ is shared globally.

Illustration of the LTM. Latent thought vectors $\mathbf{z}$ ~ N(0, I) are organized by layer, and $\mathbf{z}_l$ is injected into decoder layer l through cross-attention. $\mathbf{z}$ is instance-specific, while $\beta$ is shared globally.

The core loop is:

  1. Introduce an explicit latent state \(\mathbf{z}\).
  2. Maintain a sequence-specific posterior \(q(\mathbf{z}\mid\mathbf{x})\).
  3. Run a finite inner-loop optimization of that posterior during both training and testing.
  4. Decode tokens autoregressively under the inferred latent state.
Fast-Slow Learning of LTM. Here p(z) is the prior over latent thought vectors, and the paper uses the isotropic Gaussian prior N(0, I).

Fast-Slow Learning of LTM. Here p(z) is the prior over latent thought vectors, and the paper uses the isotropic Gaussian prior N(0, I).

Thought-guided autoregressive generation with a constrained context window

Given a sequence $\mathbf{x}$, the thought-guided generator parameterized by $\beta$ is still autoregressive:

$$ p_{\beta}(\mathbf{x}\mid \mathbf{z}) = \prod_{n=1}^{N} p_{\beta}(x_n \mid \mathbf{z}, \mathbf{x}_{1:n-1}). $$

The paper further emphasizes a short-context setting:

$$ p_{\beta}(\mathbf{x}\mid \mathbf{z}) = \prod_{n=1}^{N} p_{\beta}(x_n \mid \mathbf{z}, \mathbf{x}_{n-k:n-1}). $$

The point is that once the token context window is deliberately shortened, \(\mathbf{z}\) is forced to act as a stronger global information carrier.

Marginal likelihood and posterior inference

The marginal likelihood is

$$ p_{\beta}(\mathbf{x}) = \int p_{\beta}(\mathbf{x}\mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}, \qquad p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}). $$

Its maximum-likelihood gradient is

$$ \nabla_{\beta}\log p_{\beta}(\mathbf{x}) = \mathbb{E}_{p_{\beta}(\mathbf{z}\mid\mathbf{x})}\left[\nabla_{\beta}\log p_{\beta}(\mathbf{x}\mid\mathbf{z})\right]. $$

If one estimates the posterior by sampling, the paper gives the Langevin update

$$ \mathbf{z}^{\tau+1} = \mathbf{z}^{\tau} + s \nabla_{\mathbf{z}} \log p_{\beta}(\mathbf{z}^{\tau}\mid\mathbf{x}) + \sqrt{2s}\,\boldsymbol{\epsilon}^{\tau}, \qquad \boldsymbol{\epsilon}^{\tau}\sim \mathcal{N}(\mathbf{0}, \mathbf{I}). $$

The paper also studies a VAE-style amortized inference baseline. Its adopted method, however, is classical variational Bayes: compared with Langevin sampling, it is more efficient; compared with the VAE baseline, it avoids learning a large inference model and mitigates posterior collapse.

classical variational Bayes with the reparameterization trick

What the paper actually uses is this sequence-specific variational posterior

$$ q(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2}). $$

and then maximize the evidence lower bound (ELBO):

$$ \mathcal{L}(\beta,\boldsymbol{\mu},\boldsymbol{\sigma}^{2}) = \mathbb{E}_{q(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\beta}(\mathbf{x}\mid\mathbf{z})\right] - \mathrm{KL}\!\left(q(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\right). $$
  • \(\beta\) are global parameters shared across the dataset;
  • \((\boldsymbol{\mu}, \boldsymbol{\sigma}^{2})\) are local parameters specific to the current sequence.

That is exactly what the paper means by dual-rate learning: slow learning of the shared decoder, fast learning of the current sample’s posterior.

Comparison with TRM / HRM

This paper first appeared on arXiv on February 3, 2025 Citation: [1]Latent Thought Models with Variational Bayes Inference-Time Computation
D. Kong, M. Zhao, D. Xu, B. Pang, S. Wang, E. Honig, Z. Si, C. Li, J. Xie, S. Xie, Y. Wu, (2025)
DOI
. HRM Citation: [2]Hierarchical Reasoning Model
G. Wang, J. Li, Y. Sun, X. Chen, C. Liu, Y. Wu, M. Lu, S. Song, Y. Abbasi Yadkori, (2025)
DOI
and TRM Citation: [3]Less is More: Recursive Reasoning with Tiny Networks
A. Jolicoeur-Martineau, (2025)
DOI
came later, in June and October 2025, respectively. I think they share important similarities, and here is a quick comparison:

  1. Whether there is a loop of iteration during inference.
  2. Whether there is an explicit hierarchy of latent states.
  3. Whether the same weights are shared across iterations.
  4. Whether the generation is autoregressive in token space.
ModelIteration Behavior?Hierarchy?Weight-sharing?Auto-regressive?
LTMYes. Finite-step optimization of local posterior parameters \((\mu, \sigma^2)\) for each sampleYes: layer-wise latent vectors \(\mathbf{z}_1, \dots, \mathbf{z}_L\)weights are shared when iterating over $\mathbf{z}_l$, while different layers have distinct parametersYes
HRMYes. Recurrent high-level and low-level states update at different ratesYesYes: weights are shared when iterating over $\mathbf{z}_H$ and $\mathbf{z}_L$, respectivelyNo
TRMYes.(Same as HRM)YesYes: one core model is reused across steps even between $\mathbf{z}_H$ and $\mathbf{z}_L$No

Concern

The method is interesting, but I think Table 1 needs to be interpreted carefully. I asked this here:

1. Sequence-specific posterior

The paper defines a sequence-specific variational posterior:

$$ q(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2}) $$

These are local parameters optimized for each sequence. If evaluation also performs posterior inference on the whole sequence \(\mathbf{x}\), then \(\mathbf{z}\) necessarily absorbs information from that whole sequence.

2. Upper bound, not strict AR perplexity

The caption of Table 1 already says that LTMs report perplexity upper bounds. That matters a lot. It means the paper itself is signaling that this number is not directly the same object as standard autoregressive next-token perplexity.

The released code reinforces this reading. In evaluation, it first optimizes the latent posterior for the current batch and then computes

$$ \mathrm{PPL}_{\text{LTM}} = \exp\!\left(\frac{\text{NLL} + \mathrm{KL}}{\text{sequence length}}\right), $$

which is derived from the negative ELBO, not from a strict left-to-right autoregressive likelihood. Concretely, the path is essentially train_ltm.py -> posterior_optimizer_test.step() -> model.elbo(): optimize \((\mu,\log \sigma^2)\) on the sequence, then convert NLL + KL into perplexity.

Personal Note

if you want to spend extra test-time compute on latent reasoning, one way to do it is to explicitly write down latent variables, a posterior family, an ELBO, and an inner-loop inference procedure.

From that angle, LTM is a useful, more explicit, more probabilistic reference model for thinking about HRM Citation: [2]Hierarchical Reasoning Model
G. Wang, J. Li, Y. Sun, X. Chen, C. Liu, Y. Wu, M. Lu, S. Song, Y. Abbasi Yadkori, (2025)
DOI
. and TRM Citation: [3]Less is More: Recursive Reasoning with Tiny Networks
A. Jolicoeur-Martineau, (2025)
DOI
.

References

[1]
Latent Thought Models with Variational Bayes Inference-Time Computation
D. Kong, M. Zhao, D. Xu, B. Pang, S. Wang, E. Honig, Z. Si, C. Li, J. Xie, S. Xie, Y. Wu, (2025)
DOI
[2]
Hierarchical Reasoning Model
G. Wang, J. Li, Y. Sun, X. Chen, C. Liu, Y. Wu, M. Lu, S. Song, Y. Abbasi Yadkori, (2025)
DOI
[3]
Less is More: Recursive Reasoning with Tiny Networks
A. Jolicoeur-Martineau, (2025)
DOI