Installation
Using CAMEL
Tutorials
|
MCMCOn this page... (hide) 1. IntroductionMonte Carlo Markov Chains are popular in cosmology. Although it may look as an evolved technique it is in fact mathematically a very simple one (the Metropolis-Hastings algorithm, father of all, takes about 10 lines to code, you should try). But anyone who practiced it in real life conditions (ie data) knows that a lot of cooking (on CPU time, proposal matrix, convergence, priors) is generally forgotten and/or hidden. So let see in practice how it works.
The MCMC is therefore only a method to shoot samples from some (complicated) multi-dimensional function. It is not especially "Bayesian" (it can be used to compute integrals) and is quite inefficient. However with some practice, you can manage to sample about almost everything. Here is how Metroplis-Hastings (MH) works in practice to sample any function F:
Run this for "some time", then read the file and do histograms. Isn't that simple? But what does this has to do with cosmological parameter estimation? For parameters inference the sampled function F is the data likelihood. According to Bayes theorem
(there is a denominator, but it's unimportant for parameter estimation)
So that if you sample your likelihood+prior you get the posterior distribution (ie. a pdf) of the parameters you are trying to estimate Unfortunately most people forget about the priors and more or less implicitly put them to 1 arguing this is the most uninformative one (which is wrong especially in multi-dimension). So that finally in most cases, one uses elaborate statistical terms to justify laziness to pretend that the posterior distribution is equal to the likelihood (which is again wrong, think if something "likely" is the same as something "probable"). A serious Bayesian approach is to figure out how your result varies depending on your priors choice. An even simpler one is to compare the output of this method to the frequentist profile-likelihood one which does not have any build-in notion of prior. This is the main motivation for CAMEL Although the method is sometimes presented as automatic, it is not and involves actually a lot of cooking and care (in particular with multiple parameters and complicated likelihoods) Indeed
These 2 points are very intricate: if you have a "good" proposal you can run the chain for a shorter time. In fact theory says the best proposal is to input the "real" covariance of your parameters: you can determine it from the samples themselves but only if your samples have a good proposal! In MH, people try to make some "guess" for this covariance matrix, make some (very long) runs, estimate empirically the covariances using these samples, input this matrix as a new proposal, and start again . In cosmology, this can take more than one week. The proposal matrix is then the covariance matrix of the (multi-dimensional) Gaussian that will be used to shoot the "next point" (think about a "sigma" in 1D and a "direction"). In cosmology it is unfortunately a quite complicate one, which has off-diagonal coefficients (ie. there are correlations between variables, you cannot shoot in one direction independently of the others). So what should we use for it? 2. MCMC cosmology with CAMELIn CAMEL we implemented an adaptive variant of MH to reconstruct iteratively this matrix on the fly. It is not trivial since one should care not destroying the Markovian nature of the chain: (samples do not depend only on the previous one but here also on the previous ones). In practice the "adaptation" of the proposal is gradually decreased until it gets "frozen" in order to end up with a typical MH run. However we still need some first-guess matrix to start with... (identity would be too harsh an approximation) In CAMEL we can perform a multi-dimensional minimisation of the parameters (actually of ) and get an estimate of the Hessian matrix which is the curvature of the likelihood at the maximum (we rather work with so we are looking for its minimum). Because of CLASS numerical noise, the Hessian is often of poor quality (and does NOT represent the "errors" on the parameters) but is sufficient to feed CAMEL's adaptative algorithm later. So the first thing to do is to go to Now recopy the following parameter file : Attach:lcdm.txt Open the file with your preferred editor and look at it. There are several things in it:
where P(k,pars) is computed by CLASS for ecah change of pars. We also added a similar from the BAO peak position and the chi2's are just summed. (more details in the Likelihood session). So the first thing we want to do is to find for which set of parameters ("pars") the is minimal which is called the best-fit. This is a multi-dimensional minimization problem. There are several ways to do it but it is quite painful in cosmology. Indeed CLASS gives some results (for instance on P(k,z)) that are slightly noisy since the observables are computed numerically with some cuts, precisions etc. Minimizers don't like this at all. This is why we used some rather high-precision parameters (in "precision_File") and we used the MINUIT minimizer from HEP that is quite performant. Even with this, finding the precise minimum is difficult (see Likelihood session). Finding the curvature at the minimum is even worse. With the parameter file given (lcdm.txt) things should go right, but in real life you should have to spend some time on customizing it. $EXECDIR/Minimize lcdm.txt bestfit hessian
MINUIT performs some slight iterations on the parameter trying to converge to the mininimun. Try to figure out how it works looking at the output that shows iteration | H0 | omega_b | omega_cdm | bias | status | chi2 | time(s) Don't worry about the "WARNING: FunctionMinimum is invalid." message (that comes from the numerical noise). Read the "hessian" file (can be done in python with "loadtxt") and
check that the off-diagonal elements are not crazy (should be between -1 and +1). (do not spend too much time here, if you have troubles see Main.Likpysol#toc3 but of course better to do it yourself). 3. Adaptive Metropolis (AM)So let us see how to run the AM algorithm in CAMEL the lazy way (flat priors actually taken from the bounds in your parFile).
CAMEL has an evolved adaptive algorithm to try to determine the proposal matrix from the data themselves. There are actually 2 things the algorithm needs to adapt:
So there are various stages during a run depending on the iteration number: #MH phase t0=... scale=... #cov elements adaptation ts=... t0/ts determine the different regimes:
We also need the initial matrix that was determined in the first part: proposal_cov=hessian and specify the starting point should be randomly moved (so that different chains will start from different places). #starting point do_move=true Note that the seed of the random number generator changes for each run. The number of samples to generate (the chain length) is then specified by length=??? bunchSize=100 bunchSize specifies every how many samples data are flushed to the file (so that you see something happening) And ??? is the big question: how many samples do we need to generate to have something we can use reliably for doing statistics (as making a histogram)? This is called convergence and is actually difficult to assess. Depending on the problem the adaptation phase (or not using any) has to be tuned. Here we will generate 500 samples. Try 2 configurations
In each case determine some $EXECDIR/mcmc lcdm_mc.par "chainame.txt" where "chainname.txt" is the output (use a different name for each configuration to not overwrite the output). 4. Checking for convergence4.1 Trace plotsThe very first thing to do once you have your chain (output file) is simply to plot the samples (parameters+chi2 see 1st line header for the names) looking for the moment were the chain is in a somewhat "stable" regime (stationarity) Look at the beginning of a chain file.. Read it (in python it can be done with 4.2 Acceptance rateThe acceptance rate is a mean value of times the jump was accepted (so a line changed in the file).
A rule of thumb is that it should be close to 0.25 (meaning samples are repeated 3/4 of the times). It was established only for Gaussian distribution but in practice this rule also works well for other distributions.
CAMEL computes this running mean (every 100 steps) and writes it in the file (you should have) one named Check what the values are in the acceptance rae files for the ADA and MH runs. Conclude which method is better. From now it will be the default. 4.3 Burn-in phaseIn the previous runs we have started from parameters (lcdm_mc.par) that had starting values very close to the truth (to ensure easier minimization). In real life this is not the case and there is a phase during which the samples navigate up to the low chi2 region. This phase is called the burn-in To assess it, modify the lcdm_mc.par file (with the good ADA vs MH choice!) by changing the starting values for the parameters and rerun the MCMC.(do not change too much in particular the bias parameter (not more than 1.2) to avoid numerical issues). What did change in the output? The first samples of the chain are generally removed. 4.4 Multiple chainsMore information about convergence can be obtained by running several chains (with different starting points and generator seeds which is done automatically in CAML) and checking that they are more or less independent so that they "forgot their origin". So generate 4 chains (500 samples each best setup). You may rename the previous one to "chain1.txt", then re-run the mcmc to get "chain2.txt",chain3.txt" and "chain4.txt". We will use some features of the ~/EDE/camel/CAMEL/v3_euclid/work/tools/python/camel/ (docker users need to recopy it to You need to add the directory to your PYTHONPATH: export PYTHONPATH=/home/ede2019/EDE/camel/CAMEL/v3_euclid/work/tools/python/camel:$PYTHONPATH Compare how the chains evolved for in each run by over-plotting them: from pylab import * import camel #read the chains in a single list chains=camel.extract_chains('chain',4) #read the variables name from the 1st line of "chain1" with open('chain1.txt', 'r') as f: names= f.readline().split() #overplot the 4 chains for each variable for ivar in range(5): figure() [plot(c[ivar]) for c in chains] title(names[ivar]) 4.5 The Gelman-Rubin testNow we can go on with a deeper convergence test. Its idea is, for each parameter, to compare the variance of each variables within each chain to the variance among the chains. This allows to build the "R" statistics and a good rule of thumb to see when the variables reach convergence is to ask for:
(or 0.01 if you are more demanding) Fortunately camel.py does everything for you: #Compute Gelman-Rubin statistic for each parameter it,R = camel.GelmanRubin(chains,gap=10,length_min =0) figure() semilogy(it,R-1.) legend(names[1:]) axhline( 0.03, color='k', ls='--') Have the chains reached convergence? If not, you should run more samples. To get it quickier here are some chains already processed up to 5000 samples. Attach:chain1.txt, Attach:chain2.txt, Attach:chain3.txt, Attach:chain4.txt recopy them, cheek the trace-plots a re-perform the Gellman Rubin test. Determine above which index the Gelman Rubin test is satisfactory for all variables. We will call it N0. 5. Building posteriors5.1 getting mean values and intervalsNow we construct one single chain with our "best samples" by concatenating the last part of each chain: in python: #define your N0 value N0= #construct a single chain from the last samples (using samples above N0 steps) chain=camel.mergeMC("chain",num=(1,2,3,4),burnin=N0) #names were defined previously - you can still use the following convenience names=camel.MCparnames("chain") #plot the histograms + basic statistics # the first variable is "chi2", we don't want it for ivar in range(1,5): subplot(2,2,ivar) parname=names[ivar] hist(chain[parname]) title(parname) print("%15s " % parname, "\t %f \t %f \t %f" % camel.getci( np.percentile(chain[parname], [16,50,84]))) tight_layout() The true values in the simulations were: H0 =67.8 omega_b=0.022 omega_cdm=0.12 bias=1.07 does they fall inside your posterior distributions? 5.2 Triangle plotTo see also the correlations between variables one may use the famous "triangle plot" which not only shows the previous (smoothed) 1D histograms but 2D's too. We will use the pip install --user getdist Now starting again from "chain" as the mergedMC values and names as the column names: import camel from getdist import plots, MCSamples #see previously how chain/par are defined #build samples without chi2 mcmc = [chain[par] for par in names[1:]] #the follwing translates to some nice fonts for param names labels=[camel.parname.setdefault(n,n).replace('$','') for n in names[1:]] #getDist functions: samples=[MCSamples( samples=mcmc, names=names[1:],labels=labels)] g = plots.getPlotter() # creates the "plotting" object g.triangle_plot(samples,names[1:],filled=True) savefig("triangle.pdf") Then open the triangle.pdf file. 6. Parameters covariance matrixNow that we have some samples for the 4 variables we can estimate empirically covariance between the variables which is
as computing the variance for a single array with . (think about this can be generalized when x_i is a vector).
The import pandas df=pandas.DataFrame(chain) a=df.values.T #this creates the 2D array but need to transpose it c=cov(a[1:]) # we avoid the first column that is chi2 Plot c or better the correlation matrix and compare to the hessian. If your Gelman-Rubin test was bad you may want to use this empirical matrix as the cov for the proposal (instead of the Hessian,) which can improve convergence. 7. Closing the loopNow that you have 2 estimates of the "H0" parameter, one from the profile likelihood, the other one from MCMC Bayesian sampling: superimpose both (think about a proper representation) Now you are an expert, you may look at this more complete example on real data (or go to the beach...)
older stuff: |