Monte Carlo Approximation Methods: Which one should you choose and when?

Is it Inverse Transformation, Random Walk Metropolis-Hastings, or Gibbs? An analysis focusing on the mathematical foundation, Python implementation from scratch, and pros/cons of each method

Suyang Li
Towards Data Science


Introduction to Approximation Sampling

For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation.

— Pattern Recognition and Machine Learning¹

Since exact (deterministic) inference is often intractable for probabilistic models, we turn to approximation methods based on numerical sampling, known as Monte Carlo techniques. The key problem we will look at with these methods is computing the expectation of a target function f(z) with respect to a probability distribution p(z). Recall that the expectation is defined as an integral:

$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz$$

Source: PRML¹ Eq. 11.1

As we will see, these integrals are often analytically intractable and too expensive to evaluate numerically, so we turn to sampling methods in this article.

In this article, we will look at 3 core sampling methods: inverse transform sampling, Markov chain Monte Carlo (MCMC) via Metropolis-Hastings, and Gibbs Sampling (itself a special case of MCMC). By understanding the underlying statistical properties and computational requirements of these methods, we will learn that:

  • Inverse transformation sampling is best for simulating data with high accuracy from known and simple distributions, particularly in low dimensions.
  • Random Walk Metropolis-Hastings is best for complex, multi-modal, or unknown distributions, where global exploration and/or convergence is a priority; specifically, the Metropolis algorithm — a specific instance of Metropolis-Hastings — can be used for symmetric distributions.
  • Gibbs Sampling is best for high-dimensional problems where conditional distributions are easy to sample from, and efficiency is a priority.

1. Transformation Method: Inverse CDF

Inverse Transform Sampling, as its name suggests, uses the inverse cumulative distribution function (CDF) of a target distribution to generate random numbers that follow a desired distribution. The basic idea is:

  1. Generate a Uniform Random Number: We draw a number U from a uniform distribution between 0 and 1.
  2. Apply the Inverse CDF: Using the inverse of the target distribution’s CDF, transform U into a sample that follows the target distribution.

Here’s a quick illustration of how the samples (blue) are drawn from the distribution (red):

Inverse CDF is a computationally simple and generalizable method to sample from distributions for which we know the CDF, such as the normal, exponential, gamma or beta distribution.

PDF, CDF, and Inverse CDF

(From left to right): PDF, CDF, and Inverse CDF of the standard normal distribution

In intuitive terms, the CDF is the accumulated probability of the PDF, i.e., the integral of the PDF up to a given point; inverting this function then gives the inverse CDF used by this method.

Formally, if X is a random variable with density p, then the CDF of X evaluated at a point a is given by:

$$F(a) = P(X \le a) = \int_{-\infty}^{a} p(x)\, dx$$

PRML¹, Eq. 11.5–11.6

A CDF F has the following key properties:

  • F is continuous (for distributions with a density; in general a CDF is only guaranteed to be right-continuous)
  • F is non-decreasing
  • F has range 0 ≤ F(a) ≤ 1 for all a ∈ ℝ

How does the inverse CDF algorithm work?

The algorithm consists of the following ingredients:

Input:

  • U: a uniform random variable drawn from the interval (0, 1).
  • This serves as the input probability for the inverse CDF and is what will be transformed into a sample from the desired distribution.

Parameter:

  • F: the CDF of the target distribution.
  • With F, we compute its inverse, F⁻¹, and use it to map a probability back to the domain of the target distribution.

Output:

  • x: a random sample drawn from the target distribution.
  • This is generated by applying the inverse CDF to the random number from the uniform distribution (input).

Python Implementation

Now, let’s implement this method from scratch. We will use the exponential distribution, since it is easy to visualize the samples drawn by inverse CDF and compare them to the exact distribution:

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$

PDF of the exponential distribution (target distribution)

By standard calculus integration techniques, we find that the target CDF F(x) is:

$$F(x) = 1 - e^{-\lambda x}$$

CDF of the exponential distribution

The inverse of this CDF is:

$$F^{-1}(p) = -\frac{\ln(1 - p)}{\lambda}$$

Inverse CDF of the exponential distribution

We will generate 1,000 samples using the Inverse CDF method (with λ = 1):

import numpy as np
import matplotlib.pyplot as plt

# Inverse CDF for the exponential distribution with lambda = 1
def inverse_cdf(p):
    return -np.log(1 - p)

# Generate 1000 sample values using the inverse CDF
uniform_samples = np.random.uniform(0, 1, 1000)
transformed_samples = inverse_cdf(uniform_samples)

# Generate x values and the true PDF for comparison
x_values = np.linspace(0, np.max(transformed_samples), 1000)
pdf_values = np.exp(-x_values)
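The snippet above imports matplotlib but stops before plotting. As a small continuation of my own (not part of the original code), the following lines overlay a histogram of the samples (blue) on the true PDF (red):

# Overlay the histogram of inverse-CDF samples on the true PDF
plt.hist(transformed_samples, bins=50, density=True, alpha=0.6,
         color="blue", label="Inverse-CDF samples")
plt.plot(x_values, pdf_values, color="red", label="True PDF: $e^{-x}$")
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.show()

If the implementation is correct, the histogram should hug the red curve closely, with sampling noise shrinking as the number of samples grows.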

Requirements for the inverse CDF algorithm to work

The inverse CDF method makes a key assumption:

  • CDF F is invertible: F must be one-to-one, i.e., strictly increasing on its support, so that each probability value corresponds to a unique input x

This constraint rules out a number of functions. For example, below are a few function types that are common but non-invertible (and thus would not work with inverse CDF):

  1. Constant functions: Any constant function of the form f(x) = c, where c is a constant, is not invertible since every input maps to the same output, so the function is not one-to-one.

The red dots show two of the many x values that map to the same y value in f(x) = 5, making it non-invertible

  2. Certain quadratic functions: For example, f(x) = x² is non-invertible since it is many-to-one (consider f(x) = 1: x could be 1 or −1).

The red dots show one of the many pairs of x values that map to the same y value in f(x) = x²

  3. Certain trigonometric functions: For example, f(x) = sin(x) is not invertible over its entire domain since it is periodic, although it can be made invertible on a restricted domain.

The red dots show one of the many sets of x values that map to the same y value in f(x) = sin(x) due to its periodicity over the given domain

Why does inverse CDF work?

The key idea is that a random variable U uniformly distributed between 0 and 1 can be transformed into a random variable with a target distribution by applying the inverse of the target distribution’s CDF, which is assumed to be known and tractable. To see why, let X = F⁻¹(U); then P(X ≤ x) = P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x), so X has exactly the CDF F of the target distribution.
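As a quick numerical sanity check (my own addition, assuming SciPy is available), we can test samples produced by the inverse CDF derived earlier against the standard exponential distribution with a Kolmogorov–Smirnov test:

import numpy as np
from scipy import stats

# Draw samples via the inverse CDF of Expon(1), as derived above
u = np.random.uniform(0, 1, 5000)
samples = -np.log(1 - u)

# Null hypothesis: the samples follow the standard exponential distribution
ks_stat, p_value = stats.kstest(samples, "expon")
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
# A large p-value means we cannot reject that the samples follow Expon(1)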

Advantages

  1. Algorithmic simplicity: it’s very easy to implement with lower-dimensional data, thus having a wide application area across different fields and tasks.
  2. Sample accuracy: assuming the CDF and its inverse represent the exact target distribution, the method produces exact samples from it, giving relatively high accuracy compared to approximate methods such as MCMC, which we will see soon.

Disadvantages

  1. Computational complexity: For some distributions, the inverse CDF may not have a closed-form expression, making computation challenging or expensive.
  2. Difficulty with high dimensionality: It can be difficult to apply in high-dimensional spaces, especially with dependencies between variables.
  3. Invertibility constraint: Anytime a CDF is non-invertible, this method becomes invalid. This excludes a number of functions as we saw above.
  4. Limited to known distributions: Inverse CDF requires the exact form of the CDF, limiting its application to known distributions only.

Taking all these limitations into account, there are only a few categories of distributions that we can apply inverse CDF to. In practice, with large datasets and unknown distributions, this method quickly becomes inapplicable.

With these advantages and disadvantages in mind, let’s now look at another random sampling framework that tackles these limitations: Markov Chain Monte Carlo (MCMC).

2. Markov Chain Monte Carlo (MCMC)

As we saw just now, the inverse CDF transformation method is highly limited, especially with high dimensional sample spaces. Markov Chain Monte Carlo (MCMC), on the other hand, scales well with dimensionality, enabling us to sample from a much larger family of distributions.

An example of Metropolis-Hastings exploring a Gaussian mixture (left) and generating samples (right)

How does the Metropolis-Hastings algorithm work?

In intuitive terms, the algorithm works as follows. As with inverse CDF, we have a target distribution that we’re sampling from. However, we need an additional ingredient: a proposal distribution q(z | z^(τ)) that depends on the current state z^(τ), creating a Markov chain of samples z^(1), z^(2), z^(3), …. Each proposed sample is accepted into the chain only if it satisfies a certain criterion, which we will define below since this criterion differs across variations of the algorithm.

Let’s formalize it into a more algorithmic structure.

The algorithm runs in cycles, and each cycle follows these steps:

  1. Generate a sample z* from the proposal distribution q(z | z^(τ)), where z^(τ) is the current state.
  2. Accept the sample with an acceptance probability, which in Metropolis-Hastings is defined as:

$$A(z^*, z^{(\tau)}) = \min\left(1,\; \frac{\tilde{p}(z^*)\, q(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)})\, q(z^* \mid z^{(\tau)})}\right)$$

PRML¹ Eq. 11.44

where

  • z^(τ) is the current state,
  • z* is the proposed new state,
  • p̃(z^(τ)) is the (possibly unnormalized) probability of state z^(τ) under the desired distribution,
  • p̃(z*) is the (possibly unnormalized) probability of state z* under the desired distribution.

The logic behind this acceptance threshold is that it ensures that the more probable states (according to the desired distribution) are visited more often.

Now, this is the most generalized version of the algorithm; if the proposal distribution is known to be symmetric, meaning that the probability of proposing a move from state x to x′ is the same as proposing a move from x′ to x, i.e. q(x′|x) = q(x|x′), then we can use a special case of Metropolis-Hastings that requires a simpler acceptance threshold.

Metropolis algorithm for Symmetric Distributions

This is a specific MCMC algorithm that we choose to use if the proposal distribution is symmetric, i.e. q(z_A | z_B) = q(z_B | z_A) for all states z_A and z_B, interpreted as “the probability of proposing a move from any state A to state B is equal to the probability of proposing a move from B to A”. So, each step of the algorithm becomes:

  1. Generate a sample z* from the proposal distribution.
  2. Accept the sample with probability:

$$A(z^*, z^{(\tau)}) = \min\left(1,\; \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}\right)$$

Metropolis algorithm acceptance threshold. Src: PRML¹ Eq. 11.33

Metropolis-Hastings and Metropolis Algorithms

Let’s look at the algorithms side by side. As we saw before, the only difference is the acceptance threshold; all other steps of the algorithms run identically:

Metropolis vs Metropolis-Hastings Algorithm
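To make this concrete, here is a minimal from-scratch sketch of Random Walk Metropolis (my own illustration, not code from PRML¹). It targets an unnormalized two-component Gaussian mixture like the one pictured above, and uses a symmetric Gaussian proposal so the q terms cancel; the names unnormalized_target, step_size, and z_init are all illustrative choices:

import numpy as np

# Unnormalized target: a two-component Gaussian mixture
def unnormalized_target(z):
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

def random_walk_metropolis(n_samples=10000, step_size=1.0, z_init=0.0):
    samples = np.empty(n_samples)
    z = z_init
    for i in range(n_samples):
        # Symmetric Gaussian proposal centered at the current state
        z_proposal = z + step_size * np.random.randn()
        # Metropolis acceptance ratio (proposal terms cancel by symmetry)
        accept_prob = min(1.0, unnormalized_target(z_proposal) / unnormalized_target(z))
        if np.random.rand() < accept_prob:
            z = z_proposal  # accept: move to the proposed state
        # otherwise reject: the current state is recorded again
        samples[i] = z
    return samples

samples = random_walk_metropolis()
print("Sample mean:", samples.mean())

Note that only ratios of the unnormalized target appear, so the normalization constant is never needed; this is the property that makes MCMC so widely applicable.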

Advantages

  1. Convergence to the equilibrium distribution: Under suitable conditions, the random walk converges to the desired equilibrium distribution, although this may take a long time in high-dimensional spaces.
  2. Low computational cost: Each random-walk step often requires fewer computational resources than more elaborate sampling methods, making it suitable for problems where computational efficiency is a priority.
  3. Versatility of application: Because random walks mirror many naturally occurring processes, they have applications in a wide variety of fields:
    • Physics: Brownian motion of molecules in liquids and gases.
    • Network Analysis
    • Financial Markets: to model stock price movements
    • Population Genetics

Disadvantages

  1. Sensitive to initialization: The algorithm’s performance can be sensitive to the choice of the starting values, especially if the initialized values are far from the high-density areas.
  2. Locality traps: Depending on the complexity of the proposal distribution, the algorithm could get stuck in local optima and have difficulty traversing to other areas along the distribution.

Now, keeping the Metropolis-Hastings algorithm in mind, let’s look at another special instance of it: Gibbs Sampling.

3. Gibbs Sampling

Gibbs Sampling is a special instance of Metropolis-Hastings where each step is always accepted. Let’s first look at the Gibbs sampling algorithm itself.

How does the Gibbs algorithm work?

The idea is relatively simple and is best illustrated by first zooming in on a micro example involving sampling from a distribution p(z₁, z₂, z₃) over 3 variables. The algorithm runs in the following steps:

  1. At step τ, start from the current values z₁^(τ), z₂^(τ), z₃^(τ) (initialized arbitrarily at τ = 0).

  2. Draw the new value for z₁ from its conditional given the other two variables:

$$z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)})$$

PRML¹ Eq. 11.46

  3. Draw a new value for the second variable, z₂, from the conditional, using the freshly drawn z₁:

$$z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)})$$

PRML¹ Eq. 11.47

  4. Finally, draw a new value for the last variable, z₃:

$$z_3^{(\tau+1)} \sim p(z_3 \mid z_1^{(\tau+1)}, z_2^{(\tau+1)})$$

PRML¹ Eq. 11.48

  5. Repeat this process, cycling through the three variables z₁…z₃, until the chain reaches a satisfactory convergence threshold.

Generalized algorithm

Formally, the algorithm is represented by first initializing the starting positions and then taking T consecutive steps, resampling each variable in turn from its conditional given the most recent values of all the others:

  1. Initialize z_i^(0) for i = 1, …, M.
  2. For τ = 0, …, T − 1:
    • Sample z_1^(τ+1) ~ p(z_1 | z_2^(τ), …, z_M^(τ))
    • Sample z_j^(τ+1) ~ p(z_j | z_1^(τ+1), …, z_{j−1}^(τ+1), z_{j+1}^(τ), …, z_M^(τ)) for j = 2, …, M − 1
    • Sample z_M^(τ+1) ~ p(z_M | z_1^(τ+1), …, z_{M−1}^(τ+1))

Adapted from PRML¹ Ch. 11.3, Gibbs Sampling
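As a concrete illustration (my own sketch, not from the original article), here is a minimal Gibbs sampler for a bivariate Gaussian with correlation rho, where each full conditional is a univariate Gaussian and therefore easy to sample from; the function name and parameter values are illustrative assumptions:

import numpy as np

def gibbs_bivariate_gaussian(n_samples=10000, rho=0.8):
    """Gibbs sampling for (z1, z2) ~ N(0, [[1, rho], [rho, 1]])."""
    samples = np.empty((n_samples, 2))
    z1, z2 = 0.0, 0.0  # arbitrary initialization
    cond_std = np.sqrt(1 - rho ** 2)  # std of each full conditional
    for i in range(n_samples):
        # z1 | z2 ~ N(rho * z2, 1 - rho^2)
        z1 = rho * z2 + cond_std * np.random.randn()
        # z2 | z1 ~ N(rho * z1, 1 - rho^2), using the fresh z1
        z2 = rho * z1 + cond_std * np.random.randn()
        samples[i] = (z1, z2)
    return samples

samples = gibbs_bivariate_gaussian()
print("Empirical correlation:", np.corrcoef(samples.T)[0, 1])  # should be near 0.8

Every draw is accepted, exactly as the theory below explains; the only work per step is sampling from the univariate conditionals.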

Conditions for Gibbs to sample from a target distribution correctly

  1. Invariance. The target distribution p(z) is invariant under each individual Gibbs step, and therefore under the Markov chain as a whole.
  2. Ergodicity. If the conditional distributions are all non-zero everywhere, ergodicity follows, since any point in z space can then be reached from any other in a finite number of steps.
  3. Sufficient burn-in. As with any method that requires random initialization, the first samples depend on the initialization; this dependence weakens over many iterations, so the initial portion of the chain is discarded as burn-in.

How does this relate to Metropolis-Hastings?

In Metropolis-Hastings, we defined the acceptance probability for a proposal z* from state z to be A(z*, z) = min(1, p(z*) q(z | z*) / (p(z) q(z* | z))). A Gibbs step that resamples component k is exactly a Metropolis-Hastings step whose proposal is the full conditional, q_k(z* | z) = p(z*_k | z_{∖k}), and which leaves the remaining components unchanged, so z*_{∖k} = z_{∖k}. Factoring p(z) = p(z_k | z_{∖k}) p(z_{∖k}) and substituting gives:

$$A(z^*, z) = \frac{p(z^*)\, q_k(z \mid z^*)}{p(z)\, q_k(z^* \mid z)} = \frac{p(z_k^* \mid z_{\setminus k}^*)\, p(z_{\setminus k}^*)\, p(z_k \mid z_{\setminus k}^*)}{p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z_k^* \mid z_{\setminus k})} = 1$$

(PRML¹ Eq. 11.49), since z*_{∖k} = z_{∖k} makes every factor cancel. Thus the Metropolis-Hastings proposal steps are always accepted, as we saw in the Gibbs algorithm.

Variations

Since the Gibbs method updates one variable at a time, there are strong dependencies between consecutive samples. To reduce them, we can use an intermediate strategy known as blocking Gibbs: sampling groups of variables jointly instead of one variable at a time.

Similarly, by the nature of Markov chains, successively drawn samples will be correlated. To obtain (approximately) independent samples, we can thin the sequence by keeping only every m-th sample.
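In code, burn-in and thinning reduce to simple array slicing. A brief sketch reusing the hypothetical gibbs_bivariate_gaussian sampler from above, with burn_in and thin values that would need tuning per problem:

# Discard the first `burn_in` samples, then keep every `thin`-th sample
burn_in, thin = 1000, 10
chain = gibbs_bivariate_gaussian(n_samples=20000)
thinned = chain[burn_in::thin]  # roughly independent draws
print(thinned.shape)  # (1900, 2)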

4. Pros/Cons: Inverse CDF vs Metropolis-Hastings vs Gibbs

Now that we’ve walked through how each algorithm works and its application areas, let’s summarize the defining characteristic of each method.

1. Inverse Transformation Sampling

  • Data Size: Best for moderate-sized datasets.
  • Time: Generally efficient for univariate distributions.
  • Data Complexity: Use for simple distributions where the cumulative distribution function (CDF) and its inverse are known and easy to compute.
  • Consider avoiding if: Sampling high-dimensional variables/distributions.
  • Biggest Advantage: High accuracy if the CDF accurately reflects the target distribution.
  • Requirement: The CDF must be known and invertible.

2. Metropolis-Hastings (MCMC)

  • Data Size: Scalable and suitable for large datasets.
  • Time: Can be computationally intensive, depending on the complexity of the target distribution.
  • Data Complexity: Ideal for complex or multi-modal distributions.
  • Biggest Advantages:
    - Can sample from a distribution knowing it only up to a normalization constant (i.e., without its full form)
    - Great for exploring the global structure of a distribution, and converges to the target distribution under mild conditions
  • Disadvantage: May suffer from very slow convergence with
    - a complex or multimodal target distribution, since the algorithm may get stuck in local modes and have difficulty transitioning between them;
    - highly correlated variables;
    - high-dimensional spaces;
    - poor choices of initial values or step size

3. Gibbs Sampling

  • Data Size: Suitable for both small and large datasets.
  • Time: Often more efficient than Random Walk Metropolis-Hastings since it doesn’t require acceptance/rejection steps.
  • Data Complexity: Best used when dealing with high-dimensional distributions where you can sample from the conditional distribution of each variable.
  • Biggest Advantages:
    - Straightforward when the conditional distributions are easy to compute and sample from;
    - Less prone to local-mode traps than Random Walk Metropolis-Hastings.
  • Requirements:
    - Markov chain ergodicity
    - The full conditional distributions must be known and tractable

In summary:

Summary table of pros and cons of inverse CDF, Metropolis-Hastings, and Gibbs

Conclusion

Thanks for sticking with me this far! In this article, we looked at 3 key approximation sampling methods: Inverse CDF, Metropolis-Hastings MCMC, and Gibbs Sampling MCMC. We explored how each algorithm functions, their respective advantages and disadvantages, and typical use cases.

Inverse CDF provides a straightforward way to sample from a known distribution when its CDF is invertible. It is computationally efficient but less suitable for high-dimensional or complex distributions.
Metropolis-Hastings MCMC offers a more general approach that allows sampling from distributions that are difficult to tackle otherwise. However, it requires more computational resources and can be sensitive to tuning choices such as the proposal distribution.
Gibbs Sampling MCMC is especially efficient when the joint distribution is complex but can be broken down into simpler conditional distributions. It is widely used in machine learning, although it can be slow to converge and memory-intensive for high-dimensional problems.

[1] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer (corrected at 8th printing, 2009).

*Images by author unless otherwise stated.
