MCMC

1.  Introduction

Monte Carlo Markov Chains are popular in cosmology. Although it may look like an advanced technique, it is mathematically a very simple one (the Metropolis-Hastings algorithm, father of them all, takes about 10 lines to code; you should try). But anyone who has practiced them in real-life conditions (i.e. on data) knows that a lot of cooking (CPU time, proposal matrix, convergence, priors) is generally forgotten/hidden. So let us see in practice how it works.

  • Monte-Carlo : means that it is a procedure to draw samples according to a distribution ("draw" has the same meaning as when you draw samples from a Gaussian with randn in python)
  • Markov Chain: is an ordered set of samples (some would say an n-tuple) where each sample (a vector) depends only on the previous one.

MCMC is therefore only a method to draw samples from some (complicated) multi-dimensional function. It is not especially "Bayesian" (it can for instance be used to compute integrals) and is quite inefficient. However, with some practice you can manage to sample almost anything.

Here is how Metropolis-Hastings (MH) works in practice to sample any function F:

  1. start with some values for (x1,x2,...,xN)
  2. shoot a random increment (d1,d2,...,dN), generally according to a multivariate Gaussian (the so-called "proposal")
  3. is F(x1+d1,x2+d2,...,xN+dN) > F(x1,x2,...,xN)?
    1. yes: "go" there, i.e. x1=x1+d1, x2=x2+d2, ..., xN=xN+dN
    2. no: you may still go there, but it is not sure (do another random shot u from a uniform distribution and go there, i.e. x1=x1+d1, x2=x2+d2, ..., xN=xN+dN, only if u < F(x1+d1,x2+d2,...,xN+dN)/F(x1,x2,...,xN))
  4. print x1,x2,...,xN and F into a file (even if it repeats itself)
  5. goto 2

Run this for "some time", then read the file and do histograms.

Isn't that simple? But what does this have to do with cosmological parameter estimation?

For parameter inference the sampled function F is the data likelihood. According to Bayes' theorem:

f(params|data) ∝ f(data|params) * prior(params)

(there is a denominator, but it's unimportant for parameter estimation)

  • f(params|data) is the parameters posterior distribution
  • f(data|params) = likelihood(params)
  • prior(params) is your a priori degree of belief about the parameters

So if you sample your likelihood times the prior, you get the posterior distribution (i.e. a pdf) of the parameters you are trying to estimate. Unfortunately most people forget about the priors and more or less implicitly set them to 1, arguing this is the most uninformative choice (which is wrong, especially in multiple dimensions). So in most cases one uses elaborate statistical terms to justify laziness and pretends that the posterior distribution is equal to the likelihood (which is again wrong: think about whether something "likely" is the same as something "probable"). A serious Bayesian approach is to figure out how your result varies with the choice of priors. An even simpler one is to compare the output of this method to the frequentist profile-likelihood one, which does not have any built-in notion of prior. This is the main motivation for CAMEL.

Although the method is sometimes presented as magic, it is not, and it actually involves a lot of cooking and care (in particular with many parameters and complicated likelihoods). Indeed:

  • what does running for "some time" mean?
  • what to use for the proposal? If the increment is too large, test 3.2 will almost always fail once you are close to the maximum, so you copy the same value again and again into the file. If it is too small, the test will mostly succeed and the sampling wanders a little bit everywhere (so why not scan a grid of all values instead?): it will take very long to scan all dimensions. What correlations to put in the proposal is another important question.

These two points are intricately linked: if you have a "good" proposal you can run the chain for a shorter time. In fact theory says the best proposal is the "real" covariance of your parameters: you can determine it from the samples themselves, but only if your samples were generated with a good proposal! In plain MH, people try to make some guess for this covariance matrix, run some (very long) chains, estimate the covariance empirically from these samples, feed this matrix back as a new proposal, and start again. In cosmology this can take more than one week.

2.  The proposal matrix

This is the covariance matrix of the (multi-dimensional) Gaussian that will be used to shoot the "next point" (think of "sigma" in 1D: it defines the mean size of the steps). In cosmology it is unfortunately quite a complicated one, with off-diagonal coefficients (i.e. there are correlations between variables, so you cannot shoot in one direction independently of the others). So what should we use for it?

In CAMEL we implemented an adaptive variant of MH that reconstructs the matrix on the fly. It is not trivial, since one must take care not to destroy the Markovian nature of the chain (samples would no longer depend only on the previous one, but also on the previous proposal). In practice the "adaptation" of the proposal is gradually decreased until it gets "frozen".

However we need something to start with.

We will use the Hessian matrix obtained when minimizing the global chi2 (see Main.Lik#toc6). It is often of poor quality (and does NOT represent the "errors" on the parameters) but is generally sufficient to feed CAMEL's adaptive algorithm.

  • If you followed the Likelihood session then you should have one: try it.
  • if you don't, create the output directory in work (no need for docker users) and use the following file: Attach:lcdm.txt

Run:

$EXECDIR/Minimize lcdm.txt bestfit hessian

(the two output arguments are the best-fit and the hessian outputs written in text)

EXECDIR corresponds to ../../Linux-x86_64 (from output)

In any case, read your hessian file and:

  • check that the errors (the diagonal contains their squares) are not crazy (too small)
  • transform the covariance matrix into a correlation matrix: check that the off-diagonal elements are not crazy (they should be between -1 and +1); a sketch of both checks is given below
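
A quick way to do both checks in python (a sketch; it assumes the hessian output is a plain whitespace-separated matrix readable with loadtxt, adapt if the file contains a header):

import numpy as np

# "hessian" here is whatever output name you gave to Minimize above
cov = np.loadtxt("hessian")

sigma = np.sqrt(np.diag(cov))        # the errors: square roots of the diagonal
print("sigmas:", sigma)

corr = cov / np.outer(sigma, sigma)  # covariance -> correlation matrix
off = corr[~np.eye(len(cov), dtype=bool)]
print("off-diagonal range:", off.min(), off.max())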

3.  Adaptive Metropolis (AM)

So let us see how to run the AM algorithm in CAMEL the lazy way (flat priors, actually taken from the bounds in your parFile).

  • rename the parameter file to lcdm_mc.par
  • comment out the "precisionFile" line (why don't we need it?) and the "verbose" parameter (to gain time)
  • then add the following parameters
algo=ada
length=1500
bunchSize=100

It means we use the AM algorithm to generate 1500 samples (vectors) that will be written to the file every 100 steps.

Then we need some rough estimate of the proposal matrix: we use the Hessian determined in the global fit so add

proposal_cov=hessian

Finally, AM has a number of parameters to tune the run. Try:

t0=200
ts=0
scale=1.4
do_move=true

There are actually 2 things the algorithm is adapting:

  1. the covariance matrix
  2. the scaling (a multiplicative factor) of this matrix
  • in a first stage (until t0 samples) we accumulate data using the given proposal in order to construct the covariance of the samples.
  • the scale factor is fixed to a "reasonable" value of about \frac{2.38^2}{dim}, where dim is your number of variables (with the 4 parameters sampled here, 2.38^2/4 ≈ 1.4, the value suggested above).
  • in principle you can also adapt this scale during ts samples, but it is unnecessary here (unless your hessian is very poor)
  • "do_move" randomly displaces the first sample, so that each run starts from a different value. Note that the seed of the random number generator also changes for each run.

Then run

$EXECDIR/mcmc lcdm_mc.par chain1.txt

This should last ~ 5 mins.

When the run has finished you can immediately launch 3 other ones (for later use):

$EXECDIR/mcmc lcdm_mc.par chain2.txt; 
$EXECDIR/mcmc lcdm_mc.par chain3.txt;
$EXECDIR/mcmc lcdm_mc.par chain4.txt;

While they complete we will explore the properties of chain1.txt, so go on to the next section.

4.  Checking for convergence

In order to sample the distribution correctly, the chain must reach a stationary state, known as good "mixing". Then the distribution of its samples "converges" to the right function (the posterior/likelihood here). This sometimes requires running the algorithm for a long time (many steps) before it forgets its starting point and moves into a region of parameter space where the proposal matrix is well adapted.

There is no way to be 100% sure a chain is mixing correctly, but you can often clearly see when it is not! Here are various checks:

4.1  Trace plots

The very first thing to do once you have your chain (file) is simply to plot the evolution of each variable (parameters + chi2, see the first-line header for the names), looking for the moment when the chain enters a somewhat "stable" regime (stationarity).

Look at the beginning of the chain1.txt file. In python, read the (text) file, for instance with var=loadtxt("chain1.txt",skiprows=1,unpack=True) (or genfromtxt, pandas.read_csv, ...) and plot the evolution of each variable (the chi2 values are particularly interesting).

Have all variables converged? In your view, when do the distributions begin to be stationary (if at all)? If not, rerun with a longer length.
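
For instance, something along these lines (a sketch; it assumes the header contains one name per column):

from pylab import *

# read the chain, skipping the header line with the column names
var = loadtxt("chain1.txt", skiprows=1, unpack=True)
with open("chain1.txt") as f:
    names = f.readline().split()

# one trace plot per column (parameters + chi2)
for ivar in range(min(len(names), var.shape[0])):
    figure()
    plot(var[ivar])
    title(names[ivar])
    xlabel("step")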

With these plots (pedantically called "trace plots") you will identify 99% of the problems (which is not to say that the remaining 1% is OK).

4.2  Acceptance rate

The acceptance rate is the mean fraction of times the jump was accepted (i.e. a line changed in the file). A rule of thumb is that it should be close to 0.25 (meaning samples are repeated 3/4 of the time). This was established only for Gaussian distributions, but in practice the rule also works well for other ones. CAMEL computes this running mean (every 100 steps) and writes it into a file: you should have one named ar_vs_length_chain1.txt. You may open it and check that the values are around 0.25. This is not a proof the chain converged, but it is a good hint.
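
You can also glance at it from python (a sketch; it assumes the file can be read with loadtxt, so check its actual layout first):

from pylab import *

# layout assumption: a simple text file readable with loadtxt; adapt the
# indexing if it also contains e.g. the step number as a first column
ar = loadtxt("ar_vs_length_chain1.txt")
figure()
plot(ar)
axhline(0.25, color='k', ls='--')
ylabel("acceptance rate")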

4.3  Multiple chains

More information about the mixing can be obtained by running several chains (with different starting points and generator seeds). You should now have 4 chains (chain[1-4].txt) available.

We will use some features of the camel.py library to explore them:

  • VB image: it is available in work/tools/python/camel
  • docker: it is available in the container but not on your computer. When running the image, copy it from work/tools/python/camel to the output directory so that you also have it on your computer.

Point the PYTHONPATH variable to the directory containing camel.py: go into that directory and type:

export PYTHONPATH=$PWD:$PYTHONPATH

Then, in python, compare how the chains evolved in each run:

from pylab import *
import camel

#read the chains in a single list
chains=camel.extract_chains('chain',4)


#read the variables name from the 1st line of "chain1"
with open('chain1.txt', 'r') as f:
    names= f.readline().split()

#overplot the 4 chains for each variable
for ivar in range(5):
    figure()
    [plot(c[ivar]) for c in chains]
    title(names[ivar])

4.4  The Gelman-Rubin test

Now we can go on with a deeper convergence test. The idea is, for each parameter, to compare the variance within each chain to the variance among the chains. This allows one to build the "R" statistic, and a good rule of thumb to decide when a variable has converged is to look for:

R-1<0.03

(or 0.01 if you are more demanding)
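
To make the idea concrete, here is a rough sketch of the statistic for a single parameter (a simplified textbook version, not necessarily the exact implementation used by camel.py):

import numpy as np

def gelman_rubin_1d(chains_1d):
    """Rough R for one parameter, given a list of equal-length 1D chains."""
    n = len(chains_1d[0])                                     # samples per chain
    means = np.array([np.mean(c) for c in chains_1d])
    W = np.mean([np.var(c, ddof=1) for c in chains_1d])       # mean within-chain variance
    B = n * np.var(means, ddof=1)                             # between-chain variance
    var_hat = (n - 1) / n * W + B / n                         # pooled variance estimate
    return np.sqrt(var_hat / W)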

Fortunately camel.py does already everything for you:

#Compute Gelman-Rubin statistic for each parameter
it,R = camel.GelmanRubin(chains,gap=10,length_min =0)
figure()
semilogy(it,R-1.)
legend(names[1:])
axhline( 0.03, color='k', ls='--')

Determine when the chains start having a reasonable (even if not perfect) R value: we will call it N0.
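
If you prefer to pick N0 programmatically rather than by eye, something like this can help (a sketch; it assumes R comes back with one column per parameter, as the legend call above suggests):

import numpy as np

# first iteration at which every parameter satisfies R - 1 < 0.03
ok = np.all(np.asarray(R) - 1. < 0.03, axis=1)
N0 = int(np.asarray(it)[np.argmax(ok)]) if ok.any() else None
print("N0 =", N0)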

5.  Building posteriors

5.1  getting mean values and intervals

Now we construct one single chain with our "best samples" by concatenating the last part of each chain:

#define your N0 value
N0=

#construct a single chain from the last samples (using samples above N0 steps)
chain=camel.mergeMC("chain",num=(1,2,3,4),burnin=N0)

#names were defined previously - you can still use the following convenience
names=camel.MCparnames("chain")

#plot the histograms + basic statistics
# the first variable is "chi2", we don't want it
for ivar in range(1,5):
     subplot(2,2,ivar)
     parname=names[ivar]
     hist(chain[parname])
     title(parname)
     print("%15s " % parname, "\t %f \t %f \t %f" % camel.getci( np.percentile(chain[parname], [16,50,84])))

tight_layout()

The true values in the simulations were:

H0 =67.8
omega_b=0.022
omega_cdm=0.12
bias=1.07

Do they fall inside your posterior distributions?
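
A quick way to check is to overlay them on the histograms (a sketch; the dictionary keys are the names listed above, adapt them if your chain columns are named differently):

# true values used in the simulation (from the list above)
true_values = {"H0": 67.8, "omega_b": 0.022, "omega_cdm": 0.12, "bias": 1.07}

figure()
for ivar in range(1, 5):
    subplot(2, 2, ivar)
    parname = names[ivar]
    hist(chain[parname])
    if parname in true_values:
        axvline(true_values[parname], color='r', ls='--')   # mark the true value
    title(parname)
tight_layout()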

5.2  Triangle plot

To also see the correlations between variables, one may use the famous "triangle plot", which shows not only the previous (smoothed) 1D histograms but also the 2D ones.

We will use the GetDist python package to do so. So if you do not have it already do

pip install --user getdist

Now, starting again from "chain" as the merged MC values and "names" as the column names:


import numpy as np
import camel
from getdist import plots, MCSamples

#see previously how chain/names are defined

#build the sample array without chi2 (getdist expects samples of shape (n, ndim))
mcmc = np.array([chain[par] for par in names[1:]]).T
#the following translates the parameter names into nice labels
labels=[camel.parname.setdefault(n,n).replace('$','') for n in names[1:]]

#getDist objects:
samples=[MCSamples(samples=mcmc, names=names[1:], labels=labels)]
g = plots.getPlotter()                # creates the "plotting" object

g.triangle_plot(samples,names[1:],contour_colors='blue',filled=True)
savefig("triangle.pdf")

Then open the triangle.pdf file.

6.  Parameters covariance matrix

Now that we have samples for the 4 variables, we can empirically estimate the covariance between them, which generalizes computing the variance of a single array, \frac{1}{N}\sum_i (x_i-\bar x)^2 (think about how this generalizes when x_i is a vector). The numpy cov function does it for you. Technically you need to transform the previous "chain" variable (which is a dictionary) into a 2D numpy array. This can be done in the following way:

import pandas
df=pandas.DataFrame(chain)
a=df.values.T  #this creates the 2D array but need to transpose it
c=cov(a[1:])  # we avoid the first column that is chi2

Plot c, or better the correlation matrix, and compare to the hessian. If your Gelman-Rubin test was bad, you may want to use this empirical matrix as the proposal covariance (instead of the Hessian), which can improve convergence.
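
For example (a sketch; corrcoef computes the correlation matrix directly from the samples, and it assumes the DataFrame column order matches the names list):

from pylab import *

corr = corrcoef(a[1:])       # empirical correlation matrix (chi2 row excluded)
figure()
imshow(corr, vmin=-1, vmax=1, cmap='RdBu_r')
colorbar()
xticks(range(len(names) - 1), names[1:], rotation=45)
yticks(range(len(names) - 1), names[1:])
title("parameter correlation matrix")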

7.  Closing the loop

Now that you have two estimates of the "H0" parameter, one from the profile likelihood and the other from MCMC Bayesian sampling, superimpose both (think about a proper representation).

Now that you are an expert, you may look at this more complete example on real data (or go to the beach...)
