stats qc

MetabolomicsAustralia-Bioinformatics edited this page Jan 8, 2020 · 21 revisions

Statistical Quality Checks

Intro

The following two statistical quality checks are conducted on the raw abundance values of the dataset, before any differential abundance analysis (or other such statistical test) is carried out. The end result removes metabolites whose intra-group (or inter-group) variability is too high, because such measurements are judged too variable to be reliably reproducible.

If these "highly-variable" metabolites are not removed:

  • The total variance of the data set would be greater (for any reasonable covariance matrix)
  • These metabolites would likely fail in t-tests anyway, given that typical experimental sample sizes are very small (probably single-digit)
  • Multiple-testing correction procedures might subsequently become more punitive, since they would be adjusting for more hypotheses being tested.

This entails checking the measurement of each metabolite. The ideal case is for within-group variance to be zero, e.g.:

Sample   Group       my_metabolite
s1       treatment   30,000
s2       treatment   30,000
s3       treatment   30,000
s4       control     50,000
s5       control     50,000
s6       control     50,000

If within-group variation is too high, subsequent differential-abundance tests can (and should) fail.
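As a rough illustration of the ideal case above, the following Python sketch computes the within-group sample variance for the table's values and for a second, noisier metabolite. The noisy values and the helper name `within_group_variance` are made up for this example; they are not taken from any real dataset.

```python
import statistics

# Values from the table above: every replicate in a group is identical.
ideal = {
    "treatment": [30000.0, 30000.0, 30000.0],
    "control": [50000.0, 50000.0, 50000.0],
}

# Hypothetical noisy metabolite with the same group means (invented values).
noisy = {
    "treatment": [12000.0, 30000.0, 48000.0],
    "control": [21000.0, 50000.0, 79000.0],
}


def within_group_variance(groups):
    """Sample variance (n - 1 denominator) of each group's abundances."""
    return {name: statistics.variance(values) for name, values in groups.items()}


ideal_var = within_group_variance(ideal)
noisy_var = within_group_variance(noisy)

print(ideal_var)  # both groups have zero variance
print(noisy_var)  # same group means, but large within-group variance
```

Both metabolites have the same group means, so only the within-group spread tells them apart.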

1. Standard Error of the Mean (SEM)

This estimates how far a sample mean is likely to lie from the true population mean, a distance we'd ideally like to be as small as possible. We calculate SEM per metabolite, per group. It is given by:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

Where s is the sample standard deviation, and n is the sample size. Notice that there are two ways to make SEM as small as possible:

  • a very small standard deviation, or
  • a very large sample size

SEM is sometimes known as just "standard error", but I avoid using that because "standard error" means something else entirely in the context of linear regression.
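A minimal Python sketch of the per-metabolite, per-group SEM calculation, using only the standard library. The abundance values and the names `sem` and `sem_by_group` are hypothetical and chosen for illustration only.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample s.d. divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two values per group")
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical raw abundances of one metabolite, split by group.
abundances = {
    "treatment": [29500.0, 30000.0, 30500.0],
    "control": [48000.0, 50000.0, 52000.0],
}

sem_by_group = {group: sem(values) for group, values in abundances.items()}

for group, value in sem_by_group.items():
    print(f"{group}: SEM = {value:.1f}")
```

With n fixed at 3 in both groups, the control group's larger spread is what gives it the larger SEM.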

2. Coefficient of Variation (CoV, or CV)

Notation: whichever abbreviation you prefer ("CV" or "CoV"), remember to capitalize both the letters "C" and "V", because cov, or Cov, are conventionally reserved for "covariance".

This is a slightly confusing one. CoV is most useful when comparing measurements that have different units. It's given by:

$$\mathrm{CoV} = \frac{\text{s.d.}}{\text{mean}}$$

Where s.d. stands for "standard deviation".

For example, say we have the following descriptive statistics of the heights and weights of a group of individuals:

measure   weight (kg)   height (m)
mean      80.2          1.77
s.d.      11.5          0.23
CoV       0.143         0.130

Say we have the question: which set of measurements is more variable, heights or weights? We can't meaningfully make this comparison based on means and standard deviations alone, because these are in different units of measurement (weight in kilograms, and height in metres). The answer is: weight is more variable, since its CoV of 0.143 is greater than the 0.130 for height. As such, CoV sort of "standardizes" different measurements onto the same scale so that they're more comparable.

(More accurately: CoVs are dimensionless quantities, because the units of the s.d. and the mean cancel out, and this is what allows such a comparison.)
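The CoV row of the height/weight table can be reproduced from its mean and s.d. rows. A small Python sketch (the function name `coefficient_of_variation` is just an illustrative choice):

```python
def coefficient_of_variation(sd, mean):
    """CoV = standard deviation divided by mean (dimensionless)."""
    return sd / mean


# Mean and s.d. values taken from the table above.
weight_cov = coefficient_of_variation(sd=11.5, mean=80.2)  # kg / kg
height_cov = coefficient_of_variation(sd=0.23, mean=1.77)  # m / m

print(f"weight CoV = {weight_cov:.3f}")
print(f"height CoV = {height_cov:.3f}")
print("more variable:", "weight" if weight_cov > height_cov else "height")
```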

In metabolomics, using CoV is a little confusing (but is still valid) because all measurements are already on the same scale anyway (some kind of metabolite abundance/intensity, depending on the lab hardware being used), so comparing standard deviations would have sufficed.

CoV: Potential pitfalls

(Thankfully, none of these are relevant to metabolomics.)

  • CoV becomes unreliable when the mean is near zero, simply due to how the arithmetic works.
  • CoV also produces unexpectedly different results for the same measurements on different scales, e.g. these two sets of temperature measurements will have different CoVs, even though they are measurements of the same physical quantities:
    • Celsius: [0, 10, 20, 30, 40]
    • Fahrenheit: [32, 50, 68, 86, 104]
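The temperature pitfall can be checked directly. The sketch below converts the Celsius readings listed above and computes both CoVs with the sample standard deviation; the helper name `cov` is illustrative only.

```python
import statistics


def cov(values):
    """Coefficient of variation: sample s.d. divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
# Same temperatures converted to the other scale: 32, 50, 68, 86, 104.
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in celsius]

cov_celsius = cov(celsius)
cov_fahrenheit = cov(fahrenheit)

print(f"Celsius CoV:    {cov_celsius:.3f}")
print(f"Fahrenheit CoV: {cov_fahrenheit:.3f}")
```

The two values differ because the conversion adds an offset of 32: the offset shifts the mean but leaves the spread unaffected, so the ratio changes. Scales with a true zero, such as abundances or intensities, do not have this problem.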

3. Example on a test dataset

This example shows the use of CoV and SEM on a dataset of 168 metabolites, with two groups: "treatment" and "control" (n=6 each). The aim of the plots is to show that CoV and SEM values for differentially abundant metabolites are, indeed, very low, but the reverse is not necessarily true: not all those with low CoV and SEM values will be differentially abundant.

For the treatment group:

[Figure: plot for the treatment group]

And the control group:

[Figure: plot for the control group]

For both plots:

  • The vertical axes of the subplots show (from the top) the mean and s.d. of raw abundances, CoV, and SEM. All of these are within-group measurements. For the top subplots, blue dots and red dots indicate the mean and standard deviation respectively. Note the logarithmic scale.
  • The horizontal axis shows the metabolites (names not shown).
  • The vertical blue lines running through all the plots mark the metabolites that were found to be differentially abundant, at a p-value of 0.05 and an FDR of 0.05, using the Benjamini-Hochberg procedure for multiple hypothesis testing correction.
  • CoV and SEM values for differentially abundant metabolites are, indeed, very low (though, of course, the reverse is not necessarily true: not all those with low CoV and SEM values will be differentially abundant).
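For reference, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment named above. It is not the code that produced the plots, and the p-values are invented for illustration. It also shows the effect mentioned in the intro: appending extra uninformative hypotheses makes the correction more punitive for the same small p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, smallest p first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_significant(p_values, fdr=0.05):
    """Number of hypotheses whose adjusted p-value is at or below the FDR."""
    return sum(q <= fdr for q in benjamini_hochberg(p_values))


# Hypothetical p-values: four promising metabolites ...
signal = [0.001, 0.008, 0.02, 0.04]
# ... and eight highly variable ones that show no real difference.
noise = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

n_without_noise = count_significant(signal)
n_with_noise = count_significant(signal + noise)

print(n_without_noise, n_with_noise)  # prints "4 2"
```

In this toy case, two of the four metabolites that pass at an FDR of 0.05 on their own no longer pass once the eight extra hypotheses are tested alongside them.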