Estimating SARS-CoV-2 seroprevalence and epidemiological parameters with uncertainty from serological surveys
This repository stores code and data associated with the manuscript Estimating SARS-CoV-2 seroprevalence and epidemiological parameters with uncertainty from serological surveys by Daniel B. Larremore, Bailey K. Fosdick, Kate M. Bubar, Sam Zhang, Stephen M. Kissler, C. Jessica E. Metcalf, Caroline O. Buckee, and Yonatan H. Grad.
This repository is maintained by Daniel Larremore. Questions can be directed to daniel.larremore@colorado.edu.
Code is written primarily in Python, but the MCMC engine has been rewritten in R because it proved faster.
When sensitivity and specificity are known, one can use those values, along with the number of positive and negative test results, to produce posterior estimates of prevalence like this one:
This figure can be created and downloaded by using the web-based calculator available at https://larremorelab.github.io/covid-serology.
To perform the same inference in Python, a notebook is available at codebase/prevalence_onepopulation_workbook.ipynb.
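For readers without the notebook handy, the core calculation can be sketched as a simple grid approximation over prevalence. This is illustrative, not the repository's exact code; the counts, sensitivity, and specificity below are made up, and a uniform prior on prevalence is assumed:

```python
# Sketch: grid approximation to the posterior over seroprevalence theta,
# given sensitivity (se), specificity (sp), and counts of positive and
# negative test results. Assumes a uniform prior on theta.
import math

def log_posterior_grid(n_pos, n_neg, se, sp, grid_size=1001):
    """Unnormalized log posterior for theta on a uniform grid in [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for theta in thetas:
        # Probability that a random test comes back positive:
        p_pos = theta * se + (1 - theta) * (1 - sp)
        p_pos = min(max(p_pos, 1e-12), 1 - 1e-12)  # guard against log(0)
        log_post.append(n_pos * math.log(p_pos) + n_neg * math.log(1 - p_pos))
    return thetas, log_post

def posterior_mean(n_pos, n_neg, se, sp):
    thetas, log_post = log_posterior_grid(n_pos, n_neg, se, sp)
    m = max(log_post)                      # subtract max for numerical stability
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / total

# Example: 50 positives out of 1000 tests with se = 0.93, sp = 0.975.
# For comparison, the simple corrected point estimate is
# (0.05 - 0.025) / 0.905 ≈ 0.028.
print(posterior_mean(50, 950, 0.93, 0.975))
```

The full notebook additionally reports credible intervals and plots the whole posterior rather than just its mean.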
When there are multiple subpopulations, it is tempting to simply use the estimation approach above, independently for each subpopulation. However, especially when the number of samples per bin is small, this is not a good idea, as the estimates may vary wildly. What can be done?
What we do in the paper, which we provide code for here, is to use a Bayesian hierarchical model where the subpopulation seroprevalences share a prior distribution. The mean of that prior is informed entirely by the data, while the variance of that prior is weakly specified by a hyperprior. In plain English, the Bayesian approach allows the subpopulations to learn from each other without entraining to the same estimates.
The notebook codebase/prevalence_subpopulations_workbook.ipynb gives example code. Given data in the form of sensitivity, specificity, and the number of positive and negative samples in each subpopulation, the inference is performed to generate posterior estimates. Note that we are able to produce estimates, albeit with high uncertainty, for the subpopulation that was dramatically undersampled.
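The structure of such a hierarchical model can be sketched as a joint log-posterior. This is an illustrative parameterization, not necessarily the repository's exact one: each subpopulation's seroprevalence is drawn from a shared Beta prior with mean `mu` and concentration `kappa`, and the weak exponential hyperprior on `kappa` is an assumption made here for concreteness:

```python
# Sketch of a hierarchical log-posterior in the spirit described above.
# An MCMC sampler (as in the repo's R engine) would explore this function.
import math

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def joint_log_posterior(thetas, mu, kappa, pos, neg, se, sp):
    """thetas: per-subpopulation seroprevalences in (0, 1);
    mu, kappa: mean and concentration of the shared Beta prior;
    pos, neg: per-subpopulation positive/negative test counts."""
    lp = -kappa / 50.0  # weak Exponential hyperprior on kappa (assumed)
    a, b = mu * kappa, (1 - mu) * kappa
    for theta, n_pos, n_neg in zip(thetas, pos, neg):
        lp += log_beta_pdf(theta, a, b)  # shared prior links the subpopulations
        p_pos = theta * se + (1 - theta) * (1 - sp)
        lp += n_pos * math.log(p_pos) + n_neg * math.log(1 - p_pos)
    return lp
```

The shared prior term is what lets a sparsely sampled subpopulation borrow strength from the others: its likelihood is nearly flat, so the prior, whose parameters are fit from all subpopulations jointly, dominates its posterior.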
For the moment, the sample size calculation has not been wrapped in its own function to allow one to "solve for n". However, if one wanted to compute the number of samples required to produce estimates to within a certain uncertainty, that number would depend on:
- Test sensitivity and specificity.
- The range of prevalence values that are plausibly expected.
- The target posterior uncertainty (e.g. the width of the credible interval one is trying to achieve).
As a general rule, one could take the plausible prevalence values, choose the one closest to 0.5, and assume that the true seroprevalence is that "worst case" value. Then, use the notebook codebase/prevalence_onepopulation_workbook.ipynb
to explore how sample counts affect posterior error (measured by credible interval width).
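That exploration can be sketched directly. The snippet below (illustrative, not the repository's calculator) assumes the worst-case true prevalence, generates the expected test counts at several sample sizes, and reports the width of the central 95% credible interval from a grid posterior with a uniform prior:

```python
# Rough sample-size exploration: how does the 95% credible interval width
# shrink as the number of samples n grows, at a "worst case" prevalence?
import math

def ci_width(n, theta_true, se, sp, grid_size=2001):
    """Width of the central 95% credible interval for prevalence,
    using expected counts at sample size n (fractional counts are fine)."""
    p_pos = theta_true * se + (1 - theta_true) * (1 - sp)
    n_pos, n_neg = p_pos * n, n - p_pos * n
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for t in thetas:
        q = t * se + (1 - t) * (1 - sp)
        q = min(max(q, 1e-12), 1 - 1e-12)  # guard against log(0)
        log_post.append(n_pos * math.log(q) + n_neg * math.log(1 - q))
    m = max(log_post)
    w = [math.exp(lp - m) for lp in log_post]
    total = sum(w)
    # Read the 2.5% and 97.5% quantiles off the discretized CDF.
    cdf, lo, hi = 0.0, None, None
    for t, wi in zip(thetas, w):
        cdf += wi / total
        if lo is None and cdf >= 0.025:
            lo = t
        if hi is None and cdf >= 0.975:
            hi = t
    if hi is None:  # guard against floating-point undershoot of the CDF
        hi = thetas[-1]
    return hi - lo

for n in [100, 500, 2000]:
    print(n, ci_width(n, 0.5, 0.93, 0.975))
```

Doubling the precision (halving the interval width) requires roughly quadrupling the sample size, which is why the worst-case prevalence assumption matters for budgeting.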
MDI stands for Model & Demographic Informed. It is a variance reduction strategy: in plain language, its goal is to allocate more samples to those subpopulations whose accurate measurement will be most beneficial. The method is explained in the manuscript, and it produces recommendations for sample allocation. Here is an example of a hypothetical sample allocation strategy with the goal of parameterizing a model of disease dynamics in Argentina (using estimates of Argentina's demographic and contact structure). This figure and others like it can be recreated and customized using the Jupyter notebook codebase/sample_allocation_MDI_workbook.ipynb.
In the paper, we have a figure like this one:
This figure can be recreated from scratch using the Jupyter notebook codebase/SEIR_workbook.ipynb
. In order, this notebook:
- Simulates hypothetical data. (Real data could be inserted here.)
- Infers the posterior distribution over seroprevalence using the sensitivity and specificity values chosen.
- Runs an SEIR simulation forward. (Initial conditions, parameters, and assumptions about immunity could be adjusted here.)
- Calculates both peak timing and peak height distributions. (Only height is plotted, but this can be adjusted in the plotting code.)
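The forward-simulation step can be sketched as a minimal discrete-time SEIR model. The parameter values below are illustrative, not those used in the manuscript; in the notebook's workflow, each posterior draw of seroprevalence would set the initial recovered fraction `r_init`:

```python
# Minimal forward SEIR simulation (Euler steps). One run per posterior
# draw of seroprevalence yields distributions of peak height and timing.
def seir(beta=0.3, sigma=1/3, gamma=1/5, r_init=0.1, i_init=1e-4,
         days=365, dt=0.1):
    """Return (peak infectious fraction, day of peak) for one simulation."""
    s, e, i, r = 1 - r_init - i_init, 0.0, i_init, r_init
    peak_i, peak_day = i, 0.0
    for step in range(int(days / dt)):
        new_exp = beta * s * i * dt   # S -> E: new exposures
        new_inf = sigma * e * dt      # E -> I: onset of infectiousness
        new_rec = gamma * i * dt      # I -> R: recovery
        s -= new_exp
        e += new_exp - new_inf
        i += new_inf - new_rec
        r += new_rec
        if i > peak_i:
            peak_i, peak_day = i, step * dt
    return peak_i, peak_day

# Higher initial seroprevalence lowers the epidemic peak:
low, high = seir(r_init=0.0), seir(r_init=0.3)
print(low[0] > high[0])  # → True
```

Propagating each seroprevalence draw through `seir` and collecting the returned peaks is what turns posterior uncertainty in the serosurvey into uncertainty in the epidemic forecast.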