MCMC

1.  Introduction

Monte Carlo Markov Chains are popular in cosmology. Although it may look like an advanced technique, it is in fact mathematically a very simple one (the Metropolis-Hastings algorithm, father of them all, takes about 10 lines to code; you should try). But anyone who has practiced it in real-life conditions (i.e. with data) knows that a lot of cooking (on CPU time, proposal matrix, convergence, priors) is generally forgotten and/or hidden. So let's see in practice how it works.

  • Monte Carlo: means that it is a procedure to draw samples according to a distribution (here "draw" has the same meaning as when you draw samples from a Gaussian, for instance with randn in Python).
  • Markov chain: is an ordered set of samples (some would say an n-tuple) where each sample (a vector) depends only on the previous one.

MCMC is therefore only a method to draw samples from some (complicated) multi-dimensional function. It is not especially "Bayesian" (it can also be used to compute integrals) and is quite inefficient. However, with some practice, you can manage to sample almost anything.

Here is how Metropolis-Hastings (MH) works in practice to sample any function F:

  1. start with some values for (x1,x2,...,xN)
  2. shoot a random increment (d1,d2,...,dN), generally according to a multivariate Gaussian (the so-called "proposal")
  3. is F(x1+d1,x2+d2,...,xN+dN) > F(x1,x2,...,xN)?
    1. yes: "go" there, i.e. x1=x1+d1, x2=x2+d2,...,xN=xN+dN
    2. no: you may still go there, but not for sure (do another random shot "u" from a uniform distribution on [0,1] and go there, i.e. x1=x1+d1, x2=x2+d2,...,xN=xN+dN, only if u < F(x1+d1,x2+d2,...,xN+dN)/F(x1,x2,...,xN))
  4. print x1,x2,...,xN and F into a file (even if it repeats itself)
  5. go to 2

Run this for "some time", then read the file and do histograms.
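For illustration, here is one such minimal sketch of the steps above (the target function F, the starting point, the proposal width and the chain length are toy placeholders):

import numpy as np

def F(x):
    # toy target to sample: an (unnormalized) 2D Gaussian
    return np.exp(-0.5 * np.sum(x**2))

ndim = 2
x = np.zeros(ndim)                    # 1. starting values
prop_cov = 0.5 * np.eye(ndim)         # the "proposal" covariance
rng = np.random.default_rng()

samples = []
for step in range(10000):
    d = rng.multivariate_normal(np.zeros(ndim), prop_cov)  # 2. random increment
    ratio = F(x + d) / F(x)
    if rng.uniform() < ratio:                               # 3. accept (always if ratio > 1)
        x = x + d
    samples.append(np.append(x, F(x)))                      # 4. store (even if repeated)

# 5. loop; then histogram, e.g. hist([s[0] for s in samples])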

Isn't that simple? But what does this have to do with cosmological parameter estimation?

For parameter inference the sampled function F is the data likelihood. According to Bayes' theorem:

f(params|data) \propto f(data|params) * prior(params)

(there is a normalizing denominator, but it is unimportant for parameter estimation)

  • f(params|data) is the parameters' posterior distribution
  • f(data|params) = likelihood(params)
  • prior(params) is your a priori degree of belief about the parameters

So if you sample your likelihood times the prior you get the posterior distribution (i.e. a pdf) of the parameters you are trying to estimate. Unfortunately most people forget about the priors and more or less implicitly set them to 1, arguing this is the most uninformative choice (which is wrong, especially in multiple dimensions). So finally, in most cases, elaborate statistical terms are used to justify the laziness of pretending that the posterior distribution equals the likelihood (which is again wrong: think about whether something "likely" is the same as something "probable"). A serious Bayesian approach is to figure out how your result varies depending on your choice of priors. An even simpler one is to compare the output of this method to the frequentist profile-likelihood one, which does not have any built-in notion of prior. This is the main motivation for CAMEL.

Although the method is sometimes presented as automatic, it is not, and it actually involves a lot of cooking and care (in particular with many parameters and complicated likelihoods). Indeed:

  • what does running for "some time" mean?
  • what should be used for the proposal? If the increment is too large, test 3.2 will almost always fail once you are close to the maximum, so that you copy the same value into the file again and again. If it is too small, the test will mostly succeed and the sampling wanders around a bit everywhere (so why not scan a grid of all values instead?): it will take very long to explore all dimensions. What correlations to put in the "proposal" is another important question.

These 2 points are closely intertwined: if you have a "good" proposal you can run the chain for a shorter time. In fact, theory says the best proposal is the "real" covariance of your parameters: you can determine it from the samples themselves, but only if your samples were obtained with a good proposal! In MH, people typically make some guess for this covariance matrix, perform some (very long) runs, estimate the covariances empirically from these samples, input this matrix as a new proposal, and start again. In cosmology, this can take more than one week.

The proposal matrix is thus the covariance matrix of the (multi-dimensional) Gaussian that will be used to shoot the "next point" (think of a "sigma" in 1D plus a "direction"). In cosmology it is unfortunately quite a complicated one, with off-diagonal coefficients (i.e. there are correlations between variables, so you cannot shoot in one direction independently of the others). So what should we use for it?
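To make this concrete, here is a small sketch of how one correlated increment is drawn from such a proposal (the two parameters, their sigmas and the correlation value are made up for illustration):

import numpy as np

rng = np.random.default_rng()

# made-up example: two parameters with sigmas 1.0 and 0.02 and a 0.8 correlation
sig1, sig2, rho = 1.0, 0.02, 0.8
proposal = np.array([[sig1**2,           rho * sig1 * sig2],
                     [rho * sig1 * sig2, sig2**2          ]])

# one proposed increment (d1, d2): both components are drawn together,
# which is exactly what the off-diagonal terms are for
d = rng.multivariate_normal(np.zeros(2), proposal)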

2.  MCMC cosmology with CAMEL

In CAMEL we implemented an adaptive variant of MH to reconstruct this matrix iteratively, on the fly. This is not trivial, since one must take care not to destroy the Markovian nature of the chain (with adaptation, samples no longer depend only on the previous one but also on all earlier ones). In practice the "adaptation" of the proposal is gradually decreased until it gets "frozen", in order to end up with a standard MH run.

However we still need some first-guess matrix to start with (the identity would be too crude an approximation).

In CAMEL we can perform a multi-dimensional minimisation over the parameters (actually of \chi^2(params)) and get an estimate of the Hessian matrix, which is the curvature of the likelihood at the maximum (we rather work with \chi^2=-2 ln[lik], so we are looking for its minimum). Because of CLASS numerical noise, the Hessian is often of poor quality (and does NOT represent the "errors" on the parameters) but it is sufficient to feed CAMEL's adaptive algorithm later.

So the first thing to do is to go to ~/EDE/camel/CAMEL/v3_euclid/work (in the school image). Create the output directory if it does not exist yet and go into it. We will work from there.

Now copy the following parameter file: Attach:lcdm.txt. Open the file with your preferred editor and look at it. There are several things in it:

  • the "par" keywords gives the free parameters: the ones with "cosmo" are cosmological ones (same as in CLASS) and "nui" for nuisance (here an artificial constant bias that multiplies the P(k)). Note we choose a subset of the 5 LCDM parameters for this session to speed the code (the others are put to their default)
  • the "class" keywords corresponds to values transmitted directly to CLASS
  • the "precision_File" corresponds to a set of CLASS precision values of rather high accuracy (so longer to compute than with default values).
  • the "likelihoods" are some functions that were generated from a given set of parameters you have to infer in this session. They were done by computing with CLASS the corresponding P(k,z) for z=0.5,1,1.5. Then some values were taken (kmes,Pkmes) and moved randomly with a gaussian shot of standard deviation "sigma". The \chi^2's are then just gaussian ones:

\chi^2(pars)=\sum_{kmes}\left(\frac{P(kmes,pars)-Pkmes}{\sigma}\right)^2

where P(k,pars) is computed by CLASS for each change of pars. We also added a similar \chi^2 for the BAO peak position, and the \chi^2's are simply summed (more details in the Likelihood session).
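In code, each of these Gaussian \chi^2 terms would look like the sketch below (kmes, Pkmes, sigma and the model function are placeholders for what the likelihood files actually contain):

import numpy as np

def chi2_pk(pars, kmes, Pkmes, sigma, model):
    # model(kmes, pars) stands for the P(k, pars) computed by CLASS
    residual = (model(kmes, pars) - Pkmes) / sigma
    return np.sum(residual**2)

# the total chi2 is the sum of the three P(k,z) terms plus the BAO one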

So the first thing we want to do is to find for which set of parameters ("pars") \chi^2(pars) is minimal, which is called the best fit. This is a multi-dimensional minimization problem. There are several ways to do it, but it is quite painful in cosmology. Indeed CLASS gives results (for instance on P(k,z)) that are slightly noisy, since the observables are computed numerically with some cuts, precision settings, etc. Minimizers do not like this at all. This is why we use rather high-precision parameters (in "precision_File") and the MINUIT minimizer from HEP, which performs quite well. Even with this, finding the precise minimum is difficult (see the Likelihood session). Finding the curvature at the minimum is even worse. With the parameter file given (lcdm.txt) things should go right, but in real life you would have to spend some time customizing it.

$EXECDIR/Minimize lcdm.txt bestfit hessian
  • $EXECDIR (here and later) is the path to the executables directory, /home/ede2019/EDE/camel/CAMEL/v3_euclid/Linux-x86_64. If you work in the output directory you can access it directly as ../../Linux-x86_64.

  • the "bestfit" and "hessian" arguments are some output files where the minimum and hessian are stored at the end.

MINUIT performs some iterations on the parameters, trying to converge to the minimum. Try to figure out how it works by looking at the output, which shows:

iteration  | H0 | omega_b | omega_cdm | bias | status | chi2 | time(s)

Don't worry about the "WARNING: FunctionMinimum is invalid." message (that comes from the numerical noise).

Read the "hessian" file (can be done in python with "loadtxt") and

  • check that the errors (the diagonal contains their squares) do not have "crazy" values.
  • transform this covariance matrix into a correlation one: cor(x_i,x_j)= \frac{cov(x_i,x_j)}{\sigma_i \sigma_j}

check that the off-diagonal elements are not crazy (they should lie between -1 and +1). A possible way of doing this is sketched below (do not spend too much time here; if you have trouble see Main.Likpysol#toc3, but of course it is better to do it yourself).
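For instance (assuming the "hessian" file is a plain whitespace-separated matrix, as loadtxt expects):

import numpy as np

cov = np.loadtxt("hessian")        # covariance estimate written by Minimize
errors = np.sqrt(np.diag(cov))     # the diagonal contains the squared errors
print(errors)

# correlation matrix: cor_ij = cov_ij / (sigma_i * sigma_j)
cor = cov / np.outer(errors, errors)
print(cor)                         # off-diagonal elements must lie in [-1, +1]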

3.  Adaptive Metropolis (AM)

So let us see how to run the AM algorithm in CAMEL the lazy way (flat priors, actually taken from the bounds in your parameter file).

  • rename the parameter file to lcdm_mc.par
  • comment out the "precisionFile" line (why don't we need it anymore?) and the "verbose" parameter (to save time)

CAMEL has an evolved adaptive algorithm that tries to determine the proposal matrix from the samples themselves.

There are actually 2 things the algorithm needs to adapt:

  1. the covariance matrix elements
  2. the scale of this matrix (a multiplicative factor), which is very important (MH theory says it should be around Sth=2.4^2/Ndim, where Ndim is the dimensionality of your problem, provided the matrix is perfect, which is rarely the case)

So there are various stages during a run depending on the iteration number:

#MH phase
t0=... 
scale=...
#cov elements adaptation
ts=...

t0 and ts determine the different regimes:

  1. [0,t0]: MH stage (with a low scale so that almost all moves are accepted; it can be changed with the "scale" value). At t0 a first covariance matrix is constructed from the samples
  2. [t0,t0+ts] (optional, skipped if ts=0): adaptation of the matrix elements (with the scale factor fixed to Sth)
  3. [t0+ts,...]: scale adaptation based on the acceptance rate, constrained to converge to ~0.25

We also need the initial matrix that was determined in the first part:

proposal_cov=hessian

and we specify that the starting point should be randomly moved (so that different chains will start from different places):

#starting point
do_move=true

Note that the seed of the random number generator changes for each run.

The number of samples to generate (the chain length) is then specified by

length=???
bunchSize=100

bunchSize specifies after how many samples data are flushed to the file (so that you can see something happening). And ??? is the big question: how many samples do we need to generate to have something we can use reliably for statistics (such as making a histogram)? This is called convergence and is actually difficult to assess.

Depending on the problem, the adaptation phase (or the choice of not using any) has to be tuned.

Here we will generate 500 samples. Try 2 configurations

  1. ADA: one with some adaptation phase
  2. MH: one without any adaptation at all (i.e. a pure MH sampling).

In each case, determine some t0, scale, ts and length values and run:

$EXECDIR/mcmc lcdm_mc.par "chainname.txt"

where "chainname.txt" is the output (use a different name for each configuration to not overwrite the output).

4.  Checking for convergence

4.1  Trace plots

The very first thing to do once you have your chain (the output file) is simply to plot the samples (parameters + chi2; see the first-line header for the names), looking for the moment when the chain enters a somewhat "stable" regime (stationarity).

Look at the beginning of a chain file. Read it (in Python this can be done with loadtxt, genfromtxt, pandas...) and simply plot the evolution of each variable (the chi2 values are particularly interesting), as in the sketch below. Which configuration looks better?
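A minimal sketch (assuming the chain file has one header line with the column names followed by one sample per line, and using "chainname.txt" as the file name):

import numpy as np
from pylab import *

# column names from the header, samples from the rest of the file
with open("chainname.txt") as f:
    names = f.readline().split()
samples = np.loadtxt("chainname.txt", skiprows=1)

# trace plot of each variable (the chi2 column is particularly interesting)
for ivar, name in enumerate(names):
    figure()
    plot(samples[:, ivar])
    title(name)
show()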

4.2  Acceptance rate

The acceptance rate is the mean fraction of times the jump was accepted (i.e. a new line was written in the file). A rule of thumb is that it should be close to 0.25 (meaning samples are repeated about 3/4 of the time). It was established only for Gaussian distributions, but in practice this rule also works well for other distributions. CAMEL computes this running mean (every 100 steps) and writes it to a file you should have, named ar_vs_length_....txt. You may open it and check that the values are around 0.25. This is not a proof that the chain converged, but it is a hint that things go well.
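If you want to cross-check it directly on a chain, a rough estimate is simply the fraction of steps where the sample actually changed (a sketch, reusing the "samples" array read in the trace-plot snippet above):

import numpy as np

# a move was accepted whenever a line differs from the previous one
accepted = np.any(np.diff(samples, axis=0) != 0, axis=1)
print("overall acceptance rate:", accepted.mean())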

Check what the values are in the acceptance rate files for the ADA and MH runs.

Conclude which method is better. From now on it will be the default.

4.3  Burn-in phase

In the previous runs we started from parameters (in lcdm_mc.par) whose starting values were very close to the truth (to make minimization easier). In real life this is not the case, and there is a phase during which the samples navigate towards the low-chi2 region. This phase is called the burn-in. To assess it, modify the lcdm_mc.par file (with the good ADA vs MH choice!) by changing the starting values of the parameters and rerun the MCMC (do not change them too much, in particular the bias parameter (not more than 1.2), to avoid numerical issues).

What changed in the output? The first samples of the chain are generally removed.

4.4  Multiple chains

More information about convergence can be obtained by running several chains (with different starting points and generator seeds, which is done automatically in CAMEL) and checking that they are more or less independent, i.e. that they have "forgotten their origin".

So generate 4 chains (500 samples each, with the best setup). You may rename the previous one to "chain1.txt", then re-run the mcmc to get "chain2.txt", "chain3.txt" and "chain4.txt".

We will use some features of the camel.py library to explore them. It is available here:

~/EDE/camel/CAMEL/v3_euclid/work/tools/python/camel/

(Docker users need to copy it to the output directory to get it shared on their own computer.)

You need to add the directory to your PYTHONPATH:

export PYTHONPATH=/home/ede2019/EDE/camel/CAMEL/v3_euclid/work/tools/python/camel:$PYTHONPATH

Compare how the chains evolved in each run by over-plotting them:

from pylab import *
import camel

#read the chains in a single list
chains=camel.extract_chains('chain',4)


#read the variable names from the 1st line of "chain1.txt"
with open('chain1.txt', 'r') as f:
    names= f.readline().split()

#overplot the 4 chains for each variable
for ivar in range(5):
    figure()
    [plot(c[ivar]) for c in chains]
    title(names[ivar])

4.5  The Gelman-Rubin test

Now we can go on with a deeper convergence test. The idea is, for each parameter, to compare the variance within each chain to the variance between the chains. This allows one to build the "R" statistic, and a good rule of thumb to decide when a variable has reached convergence is to require:

R-1<0.03

(or 0.01 if you are more demanding)
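For reference, here is a simplified sketch of how such an R statistic can be built from M chains of equal length for a single parameter (this is the textbook Gelman-Rubin recipe; the camel.py implementation used below may differ in the details):

import numpy as np

def gelman_rubin(chains_1d):
    # chains_1d: list of M equal-length 1D arrays, one per chain
    chains_1d = np.asarray(chains_1d)            # shape (M, N)
    M, N = chains_1d.shape
    W = chains_1d.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = N * chains_1d.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (N - 1) / N * W + B / N            # pooled variance estimate
    return np.sqrt(var_hat / W)                  # R; converged when R-1 is small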

Fortunately camel.py does everything for you:

#Compute Gelman-Rubin statistic for each parameter
it,R = camel.GelmanRubin(chains, gap=10, length_min=0)
figure()
semilogy(it,R-1.)
legend(names[1:])
axhline( 0.03, color='k', ls='--')

Have the chains reached convergence?

If not, you should run more samples. To go quicker, here are some chains already processed up to 5000 samples: Attach:chain1.txt, Attach:chain2.txt, Attach:chain3.txt, Attach:chain4.txt

Copy them, check the trace plots and re-perform the Gelman-Rubin test.

Determine above which index the Gelman-Rubin test is satisfactory for all variables. We will call it N0.

5.  Building posteriors

5.1  Getting mean values and intervals

Now we construct one single chain with our "best samples" by concatenating the last part of each chain:

in python:

#define your N0 value
N0=

#construct a single chain from the last samples (using samples above N0 steps)
chain=camel.mergeMC("chain",num=(1,2,3,4),burnin=N0)

#names were defined previously - you can still use the following convenience
names=camel.MCparnames("chain")

#plot the histograms + basic statistics
# the first variable is "chi2", we don't want it
for ivar in range(1,5):
     subplot(2,2,ivar)
     parname=names[ivar]
     hist(chain[parname])
     title(parname)
     print("%15s " % parname, "\t %f \t %f \t %f" % camel.getci( np.percentile(chain[parname], [16,50,84])))

tight_layout()

The true values in the simulations were:

H0 =67.8
omega_b=0.022
omega_cdm=0.12
bias=1.07

Do they fall inside your posterior distributions?

5.2  Triangle plot

To also see the correlations between variables, one may use the famous "triangle plot", which shows not only the previous (smoothed) 1D histograms but the 2D ones as well.

We will use the GetDist python package to do so. If you do not have it already, do:

pip install --user getdist

Now, starting again from "chain" as the merged MC samples and "names" as the column names:


import numpy as np
import camel
from getdist import plots, MCSamples

#see previously how chain/par are defined

#build the samples array without the chi2 column: shape (nsamples, nparams)
mcmc = np.array([chain[par] for par in names[1:]]).T
#the following gives some nice fonts for the parameter names
labels=[camel.parname.setdefault(n,n).replace('$','') for n in names[1:]]

#getDist functions:
samples=[MCSamples( samples=mcmc, names=names[1:],labels=labels)]
g = plots.getPlotter()                # creates the "plotting" object

g.triangle_plot(samples,names[1:],filled=True)
savefig("triangle.pdf")

Then open the triangle.pdf file.

6.  Parameters covariance matrix

Now that we have samples for the 4 variables, we can estimate empirically the covariance between the variables, which generalizes the variance of a single array, \frac{1}{N}\sum_i (x_i-\bar x)^2 (think about how this generalizes when x_i is a vector). The numpy cov function does it for you. Technically you need to transform the previous "chain" variable (which is a dictionary) into a 2D numpy array. This can be done in the following way:

import pandas
df = pandas.DataFrame(chain)
a = df.values.T   # 2D array, transposed so that each row is one variable
c = cov(a[1:])    # skip the first row, which is chi2

Plot c, or better the correlation matrix, and compare it to the Hessian (a possible way is sketched below). If your Gelman-Rubin test was bad, you may want to use this empirical matrix as the proposal covariance (instead of the Hessian), which can improve convergence.
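A possible way to do the comparison (continuing from the array "a" and the "names" list defined above):

from pylab import *

# empirical correlation matrix of the sampled parameters (chi2 row excluded)
corr = corrcoef(a[1:])
imshow(corr, vmin=-1, vmax=1, cmap='RdBu_r')
colorbar()
xticks(range(len(names) - 1), names[1:], rotation=45)
yticks(range(len(names) - 1), names[1:])

# the Hessian covariance (loadtxt("hessian")) can be converted to a correlation
# matrix as done earlier and plotted the same way for comparison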

7.  Closing the loop

Now you have 2 estimates of the "H0" parameter, one from the profile likelihood and the other from MCMC Bayesian sampling: superimpose both (think about a proper representation).
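One possible representation is sketched below; "H0_scan" and "chi2_scan" are assumed to hold the profile-likelihood scan from the Likelihood session (they are not produced on this page), and "chain" is the merged MCMC chain from above:

from pylab import *

# Bayesian side: normalized posterior histogram of the H0 samples
hist(chain['H0'], bins=40, density=True, alpha=0.5, label='MCMC posterior')

# frequentist side: profile likelihood rescaled to exp(-Delta chi2 / 2)
L = exp(-0.5 * (chi2_scan - chi2_scan.min()))
L /= trapz(L, H0_scan)            # normalize to unit area for comparison
plot(H0_scan, L, 'r-', label='profile likelihood')

xlabel('H0')
legend()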

Now that you are an expert, you may look at this more complete example on real data (or go to the beach...).

  • if you need some python help, see Main.Likpysol (but it is better to do it yourself).
