| Markov chain Monte Carlo | |
| --- | --- |
| **Overview** | |
| Name | Markov chain Monte Carlo |
| Domain | Statistical computation and sampling |
| Also known as | MCMC |
Markov chain Monte Carlo (MCMC) is a family of computational methods for sampling from probability distributions when direct sampling is difficult. It constructs a Markov chain whose stationary distribution is the desired target distribution, so that expectations and other quantities can be estimated from the chain's simulated draws.
MCMC is widely used in fields such as statistical physics, machine learning, Bayesian statistics, and computational biology. Foundational ideas connect to the Metropolis–Hastings algorithm and the Gibbs sampling procedure, with convergence and diagnostics analyzed through tools like mixing time, ergodicity, and convergence in distribution.
In Bayesian inference and other applications, one often needs to sample from a target distribution \( \pi(x) \) on a space of states \( x \), such as the posterior distribution \( p(\theta \mid \text{data}) \). When \( \pi(x) \) is known only up to a normalizing constant, direct sampling may be impractical. MCMC addresses this by generating a sequence \( X_0, X_1, \dots \) such that the marginal distribution of \( X_t \) approaches \( \pi(x) \) as \( t \) increases.
A key idea is to design the Markov chain's transition kernel so that \( \pi \) is a stationary distribution. Methods often rely on the detailed balance condition or, more generally, on any kernel that leaves \( \pi \) invariant. In practice, one typically discards an initial portion of the chain as “burn-in” and then uses the remaining samples for Monte Carlo estimation, similar to standard Monte Carlo method averages.
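The detailed balance condition can be stated precisely. For a transition kernel \( K \), it requires, for all states \( x \) and \( y \),

\[
\pi(x)\, K(x, y) \;=\; \pi(y)\, K(y, x).
\]

Summing both sides over \( x \) gives \( \sum_x \pi(x)\, K(x, y) = \pi(y) \), i.e. \( \pi \) is stationary for \( K \); detailed balance is sufficient, but not necessary, for invariance.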
MCMC methods typically start by choosing a proposal mechanism or conditional structure that allows transitions between states. In the Metropolis–Hastings algorithm, proposed moves are accepted with a probability that ensures the target distribution is preserved. In Gibbs sampling, each component of the state is sampled from its full conditional distribution, forming a Markov chain through repeated conditional updates.
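To make the Metropolis–Hastings acceptance step concrete, here is a minimal illustrative sketch (not from the source): a random-walk Metropolis sampler in Python targeting a standard normal whose density is treated as known only up to a constant. The function and parameter names are our own.

```python
import math
import random

def log_target(x):
    # Unnormalized log-density of a standard normal: -x^2 / 2.
    return -0.5 * x * x

def random_walk_metropolis(n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x' = x + Normal(0, step^2),
    accept with probability min(1, pi(x') / pi(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Symmetric proposal, so the Hastings ratio reduces to pi(x')/pi(x).
        log_alpha = log_target(proposal) - log_target(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = proposal
        samples.append(x)
    return samples

samples = random_walk_metropolis(20000, step=2.0)
kept = samples[2000:]  # discard the first 2000 draws as burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)  # should be near 0 and 1 for this target
```

Because the proposal is symmetric, the acceptance probability depends only on the ratio of unnormalized densities, which is why the normalizing constant is never needed.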
These methods are connected to broader Markov chain theory. For a chain to yield valid approximations of expectations under \( \pi \), it must converge appropriately, which depends on properties like ergodicity and the rate at which the chain forgets its initial state. Quantifying this rate often involves notions such as mixing time and spectral properties of the transition operator.
Several widely used MCMC algorithms can be viewed as specialized ways to design a transition kernel with good convergence behavior. The Metropolis–Hastings algorithm can incorporate symmetric or asymmetric proposals, including random-walk proposals and independence samplers. Variants such as the random-walk Metropolis method differ mainly in how proposals are generated.
Gibbs sampling is often extended to settings where some conditional distributions are not available in closed form, leading to hybrid schemes. Another major class includes gradient-based approaches, notably Hamiltonian Monte Carlo, which uses auxiliary momentum variables to propose distant moves with higher acceptance rates in high-dimensional problems. In modern implementations, MCMC is frequently combined with techniques for choosing step sizes and reparameterizations to improve efficiency.
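The conditional-update structure of Gibbs sampling can be sketched on a case where every full conditional is available in closed form: a bivariate normal with unit marginals and correlation rho, for which each coordinate given the other is again normal. This is an illustrative example of ours, not an algorithm from the source.

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Gibbs sampler for a bivariate normal with unit marginal variances
    and correlation rho. The full conditionals are
    x | y ~ Normal(rho * y, 1 - rho^2) and symmetrically for y | x."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x, y = 0.0, 0.0
    draws = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)   # update x from its full conditional
        y = rng.gauss(rho * x, sd)   # update y from its full conditional
        draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(30000, rho=0.8)[3000:]  # drop burn-in
mean_x = sum(d[0] for d in draws) / len(draws)
mean_y = sum(d[1] for d in draws) / len(draws)
cov_xy = sum(d[0] * d[1] for d in draws) / len(draws) - mean_x * mean_y
print(mean_x, cov_xy)
```

The sample covariance should approach rho, and mixing degrades as rho approaches 1, since the conditionals then constrain each update to a narrow band.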
A central concern in MCMC is assessing whether the chain has reached stationarity and whether the resulting samples are sufficiently representative. Theoretical results describe convergence under assumptions such as irreducibility and aperiodicity, but these conditions are difficult to verify directly in complex models. Practitioners therefore rely on diagnostics and empirical checks.
Common diagnostic concepts include autocorrelation and effective sample size, which capture how dependence between successive draws reduces estimation efficiency relative to independent sampling. Tools such as the potential scale reduction factor (the Gelman–Rubin diagnostic) compare variability within and between multiple chains to gauge convergence. Because chains may converge at different rates, it is typical to run several chains with dispersed initial values and compare their behavior.
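The within- versus between-chain comparison can be shown with a basic version of the potential scale reduction factor (R-hat); this is a simplified textbook form written by us, not the exact variant any particular software uses (modern implementations typically add chain splitting and rank normalization).

```python
import random

def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for m chains of
    equal length n: compares between-chain variance B with within-chain
    variance W; values close to 1 suggest the chains agree."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n  # pooled variance estimate
    return (var_hat / w) ** 0.5

rng = random.Random(1)
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
shifted = good[:3] + [[x + 5.0 for x in good[3]]]
r_good = gelman_rubin(good)    # four chains exploring the same target
r_bad = gelman_rubin(shifted)  # one chain stuck in a shifted region
print(r_good, r_bad)
```

When all chains explore the same distribution, R-hat is close to 1; the chain stuck in a shifted region inflates the between-chain variance and pushes R-hat well above 1.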
Efficiency also depends on the geometry of the target distribution and the tuning of algorithm-specific parameters. For example, in Metropolis–Hastings, proposal steps that are too small are accepted often but explore the state space slowly, while steps that are too large are mostly rejected. The dependence of performance on dimension and structure is a recurring theme, particularly in relation to the curse of dimensionality in high-dimensional Bayesian models.
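The tuning trade-off can be observed directly by measuring the acceptance rate of a random-walk Metropolis sampler (for a standard-normal target) at several step sizes. This is a hypothetical illustration of ours; the function name and step values are assumptions.

```python
import math
import random

def acceptance_rate(step, n=20000, seed=0):
    """Fraction of accepted random-walk Metropolis proposals for a
    standard-normal target, as a function of the proposal step size."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)
        log_alpha = 0.5 * (x * x - prop * prop)  # log pi(prop) - log pi(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = prop
            accepted += 1
    return accepted / n

# Tiny steps are almost always accepted but explore slowly; very large
# steps are mostly rejected. A moderate step balances the two.
for s in (0.1, 2.4, 20.0):
    print(s, acceptance_rate(s))
```

High acceptance alone is not a sign of efficiency: the step-0.1 chain accepts nearly every proposal yet moves through the state space very slowly, which is exactly the trade-off described above.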
MCMC has been used extensively for sampling from posterior distributions in Bayesian statistics, enabling inference when analytic solutions are unavailable. It is a core computational tool in topics such as hierarchical modeling and latent-variable models, where the posterior typically lacks a tractable form.
In computational science, MCMC provides a practical route to estimating expectations under complex distributions, including those arising in statistical mechanics. MCMC methods have also influenced algorithms used in machine learning workflows, including Bayesian neural network training and inference in probabilistic graphical models, where the posterior over latent variables can be high-dimensional.
Beyond theory, MCMC is often embedded in larger pipelines, such as when combining sampling with optimization or approximate inference. Related techniques include variational approaches, but MCMC is distinguished by its asymptotic correctness under appropriate conditions and its ability to represent complex posterior shapes through samples rather than approximations.
Categories: Markov chain Monte Carlo, Bayesian statistics, Monte Carlo methods
This article was generated by AI using GPT Wiki. Content may contain inaccuracies. Generated on March 26, 2026. Made by Lattice Partners.