Group members: Giovanni Zago, Emanuele Sarte, Alessio Saccomani, Fateme Baghaei Saryazdi
Supervisor: Prof. Carlo Albert
Academic year: 2022/2023
This project was carried out by a group of Physics of Data students as the final project of the course Laboratory of Computational Physics - Mod. B, held by Prof. Marco Baiesi. The subject of the project and the supervision of the work were provided by Prof. Carlo Albert.
The aim of this project is to analyze data from microbial cultures by performing Bayesian inference on the parameters of stochastic models that describe the growth and the lifetime of the microbes.
The underlying idea that is shared by all the models considered for the analysis of the data is that a microbe can be described as a dynamical system characterized by a few dynamical variables
Clearly, both
where
Next we present the models considered for the Bayesian inference throughout the project.
This first model is very basic and aims to capture the essential features that a good model should embed to correctly describe the phenomenon under study. This model is single-traited, namely
with the microbe size
In this case eq. (2) holds with
In these equations
thus
By integrating eq. (4) we obtain
that substituted in (5) returns
This model is a more accurate version of Model 1, since it embeds the fact that a microbe cell can divide only once a minimum size has been reached. Nevertheless, it is still a single-trait model:
with the microbe size
Eq. (2) holds also in this case, but now
and also the expression for
and thus we can define the following quantity
which represents the amount of time elapsed since the division of the mother cell in order for
Since
Model 2 aims at correcting the instantaneous division problem that arises in Model 1.2. To do so we need to introduce another dynamic variable,
and
The growth rate
with
and the division equation becomes
meaning that the protein amount drops to zero at each cell division, while the microbe size after the division becomes half the one reached at the end of the previous life cycle. By integrating (13) we get
Thanks to eq. (18) we can look at eq. (15) and define
which is always a positive quantity. Thus by integrating (15) we get
from which it is evident that instantaneous division is not allowed. Next we report the plots of
This last model is the most advanced one, as it introduces more elements of stochasticity in order to capture more accurately the observed behaviour of microbe cells. The first main innovation of this model is the presence of the growth rate as a dynamic variable, thus
with
The survival probability equation remains the same as (15). So the growth rate does not vary during the lifetime of a microbe, but it may change from one cell cycle to the following. Indeed the division distribution embeds this feature, as well as a variable division factor:
From eq. (22) we see that after each cell division the growth rate
As before, by looking at eq. (15), we can define
and then integrate (15), obtaining
Here we have 7 parameters: a and b define the values of the growth rate (parameter
The resulting PDF is:
We are left with the derivation of the explicit formula of
For model 1 we get:
For model 1.2 we get:
For model 2 we get:
When dealing with model 3, instead, the situation is slightly more complicated since the stochastic variable is no longer only
This factorization reflects the fact that the PDF for
Now that we have fully outlined the main characteristics of our models, we can go through a preliminary phase in which we study synthetic, i.e. simulated, data. This helps us better understand the behaviour of the models and also paves the way for the actual inference. A synthetic dataset for each model can be easily obtained by collecting random samples from the probability distributions derived in the previous section. The fact that we also know their cumulative distributions makes the task easier, since plain Monte Carlo sampling with numerical inversion will suffice. For models 1, 1.2 and 2 the only stochastic variable is the division time.
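The sampling step described above can be sketched as follows. This is only a minimal illustration of inverse-transform sampling with numerical inversion, not the code used in the project: the survival function `toy_survival` (a linearly increasing division rate) and the names `sample_division_times`, `k`, `t_max` and `n_iter` are hypothetical placeholders. The survival function of each model would take the place of `toy_survival`.

```python
import numpy as np

def toy_survival(t, k=2.0):
    """Toy survival probability s(t) = exp(-k t^2 / 2).
    Assumed for illustration only, NOT one of the project's models."""
    return np.exp(-0.5 * k * t**2)

def sample_division_times(survival, n, t_max=50.0, n_iter=60, rng=None):
    """Draw n division times by solving survival(tau) = u, u ~ U(0, 1),
    with a vectorized bisection on [0, t_max]."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n)
    lo = np.zeros(n)
    hi = np.full(n, t_max)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        above = survival(mid) > u      # survival still too high: root lies to the right
        lo = np.where(above, mid, lo)
        hi = np.where(above, hi, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
taus = sample_division_times(toy_survival, 10_000, rng=rng)
```

Bisection is used here because a survival function decreases monotonically in time, so the root is always bracketed as long as `t_max` is large enough for the survival probability to be negligible.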
As previously stated, our goal is to use the proposed datasets to estimate the parameters of the aforementioned models by means of Bayesian inference. In other words, our target is to sample the multivariate posterior distribution of the parameters, given by Bayes' theorem
The tool used for sampling the posterior is `emcee`, an open-source Python-based affine-invariant Markov chain Monte Carlo (MCMC) ensemble sampler.
The key class of the `emcee` library is `EnsembleSampler`, whose constructor creates an `EnsembleSampler` object. To do so, in our case, it is important to specify:
- `nwalkers`, the number of walkers that will move through the parameter space in order to sample from the posterior
- `ndim`, the dimension of the parameter space on which the posterior is defined
- `log_prob_fn`, a function that takes as input a vector of parameters belonging to the parameter space and returns the natural logarithm of the unnormalized posterior distribution
- `args`, an additional set of arguments that is required for the calculation of `log_prob_fn`. In our case, `args` represents the data ${\vec{x}_i}$.
It is evident that the most challenging part is the calculation of `log_prob_fn`, since it requires first the computation of
but since we need to provide the natural logarithm of the posterior, we should focus on the log-likelihood:
As prior
So the whole unnormalized log-posterior is
Once we have instantiated the `EnsembleSampler` object, we can use its `run_mcmc` method to generate the chain of posterior samples, and thus also obtain the marginalized posterior distribution of each parameter. We chose the posterior mode as the best estimate for each parameter, and also calculated the 95% credible interval.
In order to become proficient in the use of `EnsembleSampler`, we first used it to infer the model parameters from the synthetic data. This is useful because the true parameter values are known, so the outcome can be checked and bugs and errors can be fixed easily. Once this step is properly completed, switching to real data is straightforward since the code is already implemented. To see the results of this section one can check the notebooks named `project_model**.ipynb`.
Once we had made sure that the inference worked well on synthetic data, we switched to real data. The datasets used in this work are Tanouchi25c, Tanouchi37c and Susman18. These datasets contain long-term, single-cell measurements of Escherichia coli cultures grown in different conditions. Each record contains ready-to-use values of the initial length, final length, lifespan, growth rate and division ratio of the microbe. Records are organized into lineages, namely groups of records associated with cells that share the same parent. This is an important detail because, in order to keep track of the aforementioned quantities, we have to make sure that they refer to the same lineage. Thus we performed the inference lineage by lineage.
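The lineage-by-lineage organisation described above could be sketched as follows. The field names (`lineage_ID`, `generationtime`, `length_birth`) and the toy records are hypothetical, and the actual column names in the Tanouchi and Susman files may differ.

```python
from collections import defaultdict

# Toy records standing in for rows of a dataset (field names are assumptions)
records = [
    {"lineage_ID": 1, "generationtime": 0.42, "length_birth": 2.1},
    {"lineage_ID": 1, "generationtime": 0.38, "length_birth": 2.3},
    {"lineage_ID": 2, "generationtime": 0.51, "length_birth": 1.9},
]

def group_by_lineage(rows, key="lineage_ID"):
    """Collect records into per-lineage lists, preserving their original order."""
    lineages = defaultdict(list)
    for row in rows:
        lineages[row[key]].append(row)
    return dict(lineages)

lineages = group_by_lineage(records)
for lineage_id, rows in lineages.items():
    lifespans = [r["generationtime"] for r in rows]
    # the inference would be run here on this lineage's data only
    print(lineage_id, len(rows), lifespans)
```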
For model 3, the results of `emcee` allowed us to analyze the final parameters in more depth. First, we checked that the parameters of the gamma distribution (a, b) and of the beta distribution (c, d) take values that reproduce the distributions of growth rate and division ratio observed in the real data, respectively. After that we also looked at the ratio of the frequencies