stats qc
The following two statistical quality checks are conducted on the raw abundance values of the dataset, before any differential abundance analysis (or other such statistical tests) is carried out. The end result removes metabolites whose intra-group (or inter-group) variability is too high for their measurements to be considered reliably reproducible.
If these "highly-variable" metabolites are not removed:
- The total variance of the data set would be greater (for any reasonable covariance matrix).
- They would likely fail t-tests anyway, given that typical experimental sample sizes are very small (probably single-digit).
- Multiple-testing correction procedures might subsequently become more punitive, since they adjust for more hypotheses being tested.
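The last point can be illustrated with a minimal sketch of the Benjamini-Hochberg step-up procedure. The p-values below are made up for illustration (they are not from the dataset discussed on this page), and `benjamini_hochberg` is a hypothetical helper name rather than a call from any particular library.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four well-behaved metabolites.
stable = [0.001, 0.004, 0.012, 0.03]
# Hypothetical p-values for six highly variable metabolites.
noisy = [0.2, 0.35, 0.5, 0.6, 0.75, 0.9]

hits_without_noisy = sum(benjamini_hochberg(stable))
hits_with_noisy = sum(benjamini_hochberg(stable + noisy))
print(hits_without_noisy, hits_with_noisy)
```

With these made-up numbers, all four stable metabolites pass on their own, but one of them (p = 0.03) is no longer significant once the six noisy metabolites are tested alongside them.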
Both checks entail examining the measurements of each metabolite within each group. The ideal case is for within-group variance to be zero, e.g.:
Sample | Group | my_metabolite |
---|---|---|
s1 | treatment | 30,000 |
s2 | treatment | 30,000 |
s3 | treatment | 30,000 |
s4 | control | 50,000 |
s5 | control | 50,000 |
s6 | control | 50,000 |
If within-group variation is too high, subsequent differential-abundance tests can (and should) fail.
The standard error of the mean (SEM) estimates how far a sample mean is likely to be from the true population mean, a difference we'd ideally like to be as small as possible. We calculate SEM per metabolite, per group. It is given by:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$
Where s is the sample standard deviation, and n is the sample size. Notice that there are two ways to make SEM as small as possible:
- a very small standard deviation, or
- a very large sample size
SEM is sometimes known as just "standard error", but I avoid using that because "standard error" means something else entirely in the context of linear regression.
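As a minimal sketch of the per-metabolite, per-group calculation, the snippet below computes SEM with Python's standard library. The abundance values and the `sem` helper are hypothetical, chosen only to show one group with some spread and one with none.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample s.d. divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate SEM")
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical abundances for a single metabolite, keyed by group.
abundances = {
    "treatment": [30000, 31000, 29000],
    "control": [50000, 50000, 50000],
}

sem_by_group = {group: sem(vals) for group, vals in abundances.items()}
for group, value in sem_by_group.items():
    print(f"{group}: SEM = {value:.2f}")
```

The control group here has identical values, so its SEM is zero, which matches the ideal case described earlier.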
Notation: whichever abbreviation you prefer ("CV" or "CoV"), remember to capitalize both the letters "C" and "V", because `cov` and `Cov` are canonically reserved for "covariance".
The coefficient of variation is a slightly confusing one. CoV is most useful when comparing measurements that have different units. It's given by:

$$\mathrm{CoV} = \frac{\text{s.d.}}{\text{mean}}$$
Where s.d. stands for "standard deviation".
For example, say we have the following descriptive statistics of the heights and weights of a group of individuals:
measure | weight (kg) | height (m) |
---|---|---|
mean | 80.2 | 1.77 |
s.d. | 11.5 | 0.23 |
CoV | 0.143 | 0.130 |
Say we have the question: which set of measurements is more variable, heights or weights? We can't meaningfully make this comparison based on means and standard deviations alone, because these are in different units of measurement (weight in kilograms, and height in metres). The answer is: weight is more variable, since its CoV of 0.143 is greater. As such, CoV sort of "standardizes" different measurements onto the same scale so that they're more comparable.
(More accurately: CoVs are dimensionless quantities, which is what allows for such a comparison.)
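The CoV row of the table above can be reproduced directly from its mean and s.d. rows. This is a small sketch; the function name `coefficient_of_variation` is just an illustrative choice.

```python
def coefficient_of_variation(mean, sd):
    """CoV = s.d. / mean; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CoV is undefined for a mean of zero")
    return sd / mean


# Means and standard deviations taken from the table above.
weight_cov = coefficient_of_variation(mean=80.2, sd=11.5)
height_cov = coefficient_of_variation(mean=1.77, sd=0.23)

print(f"weight CoV = {weight_cov:.3f}")
print(f"height CoV = {height_cov:.3f}")
print("more variable:", "weight" if weight_cov > height_cov else "height")
```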
In metabolomics, using CoV is a little confusing (but is still valid) because all measurements are already on the same scale anyway (some kind of metabolite abundance/intensity, depending on the lab hardware being used), so comparing standard deviations would have sufficed.
CoV has a couple of known pitfalls. (Thankfully, neither is relevant to metabolomics.)
- CoV becomes unreliable when the mean is near zero: dividing by a very small mean inflates the ratio, so tiny changes in the mean produce huge swings in CoV.
- CoV also produces different results for the same measurements expressed on different scales, when those scales do not share a true zero. For example, these two sets of temperature measurements have different CoVs, even though they measure the same physical quantity:
  - Celsius: [0, 10, 20, 30, 40]
  - Fahrenheit: [32, 50, 68, 86, 104]
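The temperature example can be checked numerically. The sketch below uses the sample standard deviation (n - 1 in the denominator), which is an assumption on my part; the conclusion is the same with the population version.

```python
import statistics


def cov_from_samples(values):
    """CoV computed from raw values: sample s.d. divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


celsius = [0, 10, 20, 30, 40]
# Same temperatures converted to Fahrenheit: [32, 50, 68, 86, 104].
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

cov_c = cov_from_samples(celsius)
cov_f = cov_from_samples(fahrenheit)
print(f"Celsius CoV    = {cov_c:.3f}")
print(f"Fahrenheit CoV = {cov_f:.3f}")
```

The conversion multiplies the standard deviation by 9/5 but also shifts the mean by 32, so the ratio of the two changes.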
This example shows the use of CoV and SEM on a dataset of 168 metabolites, with two groups: "treatment" and "control" (n=6 each). The aim of the plots is to show that CoV and SEM values for differentially abundant metabolites are, indeed, very low, but that the reverse is not necessarily true: not all metabolites with low CoV and SEM values will be differentially abundant.
For the treatment group:
And the control group:
For both plots:
- The vertical axes of the subplots show (from the top) the mean and s.d. of raw abundances, CoV, and SEM. All of these are within-group measurements. In the top subplot, blue dots and red dots indicate the mean and standard deviation respectively. Note the logarithmic scale.
- The horizontal axis lists the metabolites (names not shown).
- The vertical blue lines running through all the subplots mark metabolites that were found to be differentially abundant, at a p-value threshold of 0.05 and an FDR of 0.05, using the Benjamini-Hochberg procedure for multiple hypothesis testing correction.
- CoV and SEM values for differentially abundant metabolites are, indeed, very low (though, of course, the reverse is not necessarily true: not all those with low CoV and SEM values will be differentially abundant).
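Putting the pieces together, the filtering step described at the top of this page could be sketched roughly as follows. The cut-off `MAX_COV`, the metabolite names, and the abundance values are all hypothetical; this page does not state which thresholds are used in practice, and a real pipeline might filter on SEM as well.

```python
import statistics

# Hypothetical cut-off: the actual threshold is not given on this page.
MAX_COV = 0.3


def passes_qc(groups, max_cov=MAX_COV):
    """Return True if every group's within-group CoV is at or below max_cov.

    `groups` maps a group name to the list of raw abundances
    for a single metabolite.
    """
    for values in groups.values():
        mean = statistics.mean(values)
        if mean == 0:
            return False  # CoV is undefined, so treat as a failure
        if statistics.stdev(values) / mean > max_cov:
            return False
    return True


# Hypothetical dataset: metabolite -> group -> raw abundances.
dataset = {
    "metabolite_a": {
        "treatment": [30000, 31000, 29000],
        "control": [50000, 52000, 48000],
    },
    "metabolite_b": {
        "treatment": [5000, 60000, 25000],
        "control": [50000, 52000, 48000],
    },
}

kept = [name for name, groups in dataset.items() if passes_qc(groups)]
print(kept)
```

Here `metabolite_b` is dropped because its treatment group is far too spread out, while `metabolite_a` is kept and would go on to differential abundance testing.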