Calculation of quality metrics based on HUPO-PSI/mzQC definitions #204

Open
68 tasks
tnaake opened this issue May 12, 2021 · 8 comments

Comments

@tnaake

tnaake commented May 12, 2021

Dear @jorainer

Following up on the conversation in the Slack channel, here is the issue for the Spectra package.

The idea is to calculate the quality metrics defined by HUPO-PSI (https://github.com/HUPO-PSI/mzQC/blob/master/cv/qc-cv.obo) on MS samples. For some of them, the Spectra package or its infrastructure might be the ideal place (or a separate SpectraQC/... package). The metrics could be applied to metabolomics and proteomics data. Not all of them can be calculated from Spectra objects.

I was thinking of the following (excessive) list of metrics, focusing on MS1. Given are the ID, the value type, the name, and the definition if it differs from the name:

  • QC:4000050, single value, XIC-WideFrac, The fraction of precursor ions accounting for the top half of all peak widths;
  • QC:4000051, n-tuple, XIC-FWHM quantiles, The first to n-th quantile of peak widths for the wide XICs;
  • QC:4000052, n-tuple, XIC-Height quantiles ratio to Q1, The log ratio for the second to n-th quantile of wide XIC heights over previous quantile of heights. For the boundary elements min/max are used;
  • QC:4000053, single value, RT duration, The retention time duration of the MS run in seconds, similar to the highest scan time minus the lowest scan time;
  • QC:4000054, n-tuple, RT over TIC quantile, The interval when the respective quantile of the TIC accumulates divided by retention time duration. The number of quantiles observed is given by the size of the tuple;
  • QC:4000055, n-tuple, MS1 quantiles RT fraction, The interval used for acquisition of the first, second, third, and fourth quarter of all MS1 events divided by RT-Duration;
  • QC:4000057, n-tuple, MS1 quantile TIC change ratio to Q1, The log ratio for the second to n-th quantile of TIC changes over first quantile of TIC changes;
  • QC:4000059, single value, Number of MS1 spectra, The number of MS1 events in the run;
  • QC:4000065, single value, Precursor median m/z for IDs, Median m/z value for all identified peptides (unique ions) after FDR;
  • QC:4000072, single value, Interquartile RT period for peptide identifications, The interquartile retention time period, in seconds, for all peptide identifications over the complete run;
  • QC:4000073, single value, Peptide identification rate of the interquartile RT period, The identification rate of peptides for the interquartile retention time period, in peptides per second;
  • QC:4000074, single value, Median MS1 peak FWHM for peptides, Median of all MS1 peak widths at half maximum (FWHM) for all identified peptides, in seconds;
  • QC:4000075, single value, Interquartile distance of MS1 peak FWHM for identifications, Interquartile distance of all MS1 peak widths at half maximum (FWHM) for all identifications, in seconds;
  • QC:4000077, single value, Area under TIC, The area under the total ion chromatogram;
  • QC:4000078, n-tuple, Area under TIC RT quantiles, The area under the total ion chromatogram of the retention time quantiles. Number of quantiles are given by the n-tuple;
  • QC:4000125, single value, Extent of identified precursor intensity, Ratio of 95th over 5th percentile of precursor intensity for identified peptides;
  • QC:4000130, single value, Median of TIC values in the RT range in which the middle half of peptides are identified, Median of TIC values in the RT range in which half of peptides are identified (RT values of Q1 to Q3 of identifications);
  • QC:4000131, single value, Median S/N for MS1 spectra in the shortest RT range in which half of the peptides are identified;
  • QC:4000132, single value, Median of TIC values in the shortest RT range in which half of the peptides are identified;
  • QC:4000133, single value, Explained base peak intensity median, Median of the ratio of 'max survey scan intensity' over 'sampled precursor intensity' for all peptides identified;
  • QC:4000135, single value, Number of chromatograms;
  • QC:4000138, n-tuple, MZ acquisition range, Upper and lower limit of m/z values at which spectra are recorded;
  • QC:4000139, n-tuple, RT acquisition range, Upper and lower limit of time at which spectra are recorded;
  • QC:4000140, single value, Fastest frequency for MS level 1 collection;
  • QC:4000142, single value, Slowest frequency for MS level 1 collection;
  • QC:4000148, single value, MS1 ion collection time mean, From the distribution of ion injection times (MS:1000927) for MS1, the mean;
  • QC:4000149, single value, MS1 ion collection time sigma, From the distribution of ion injection times (MS:1000927) for MS1, the sigma value;
  • QC:4000158, single value, Peak density distribution MS1 mean, From the distribution of peak densities in MS1, the mean;
  • QC:4000159, single value, Peak density distribution MS1 sigma, From the distribution of peak densities in MS1, the sigma value;
  • QC:4000168, single value, Precursor intensity distribution mean, From the distribution of precursor intensities, the mean;
  • QC:4000169, single value, Precursor intensity distribution sigma, From the distribution of precursor intensities, the sigma value;
  • QC:4000172, single value, MS1 signal jump (10x) count, The count of MS1 signal jump (spectra sum) by a factor of ten or more (10x) between two subsequent scans;
  • QC:4000173, single value, MS1 signal fall (10x) count, The count of MS1 signal decline (spectra sum) by a factor of ten or more (10x) between two subsequent scans;
  • QC:4000174, single value, Charged peptides ratio 1+ over 2+, Ratio of 1+ peptide count over 2+ peptide count in identified spectra;
  • QC:4000175, single value, Charged peptides ratio 3+ over 2+, Ratio of 3+ peptide count over 2+ peptide count in identified spectra;
  • QC:4000176, single value, Charged peptides ratio 4+ over 2+, Ratio of 4+ peptide count over 2+ peptide count in identified spectra;
  • QC:4000177, single value, Mean charge in identified spectra;
  • QC:4000178, single value, Median charge in identified spectra;
  • QC:4000184, single value, Number of different distinct proteins from all PSM, Number of different distinct protein from all PSM after FDR filtering. (No undistinguishability groups.);
  • QC:4000185, n-tuple, Number of identified proteins, Number of identified proteins at given FDR threshold, first number is the number of proteins (considering sequence only), second number is the FDR threshold applied (negative if no threshold applied);
  • QC:4000186, single value, Total number of PSM, Total number of PSM before FDR filtering;
  • QC:4000187, n-tuple, Number of identified peptides, Number of identified peptides at given FDR threshold, first number is the number of peptides (considering sequence only), second number is the FDR threshold applied (negative if no threshold applied);
  • QC:4000191, single value, Precursor errors (Da) mean, From the distribution of Precursor errors (mass deviation of precursor to identified peptide in Da), the mean;
  • QC:4000192, single value, Precursor errors (Da) sigma, From the distribution of Precursor errors (mass deviation of precursor to identified peptide in Da), the sigma value;
  • QC:4000196, single value, Precursor errors (ppm) mean, From the distribution of Precursor errors (ppm), the mean;
  • QC:4000197, single value, Precursor errors (ppm) sigma, From the distribution of Precursor errors (ppm), the sigma value;
  • QC:4000201, single value, Precursor errors (ppm) median, From the distribution of Precursor errors (ppm), the median;
  • QC:4000202, single value, Precursor errors (ppm) IQR, From the distribution of Precursor errors (ppm), the IQR;
  • QC:4000203, n-tuple, Identification score - Q1, Q2, Q3, From the distribution of Identification score, the Q1, Q2, Q3 value;
  • QC:4000204, single value, Identification score - mean, From the distribution of Identification score, the mean value;
  • QC:4000205, single value, Identification score - sigma, From the distribution of Identification score, the sigma value;
  • QC:4000213, n-tuple, Identified peptide lengths - Q1, Q2, Q3, From the distribution of identified peptide lengths the quartiles Q1, Q2, Q3 value;
  • QC:4000214, single value, Identified peptide lengths - mean, From the distribution of identified peptide lengths the mean;
  • QC:4000215, single value, Identified peptide lengths - sigma, From the distribution of identified peptide lengths the sigma value;
  • QC:4000218, n-tuple, Signal-to-noise ratio in MS1 - Q1, Q2, Q3, From the distribution of signal-to-noise ratio in MS1, the quartiles Q1, Q2, Q3 value;
  • QC:4000219, single value, Signal-to-noise ratio in MS1 - mean, From the distribution of signal-to-noise ratio in MS1, the mean;
  • QC:4000220, single value, Signal-to-noise ratio in MS1 - sigma, From the distribution of signal-to-noise ratio in MS1, the sigma value;
  • QC:4000228, n-tuple, Identified precursor intensity distribution Q1, Q2, Q3, From the distribution of identified precursor intensities, the quartiles Q1, Q2, Q3;
  • QC:4000229, single value, Identified precursor intensity distribution - mean, From the distribution of identified precursor intensities, the mean;
  • QC:4000230, single value, Identified precursor intensity distribution - sigma, From the distribution of identified precursor intensities, the sigma value;
  • QC:4000233, n-tuple, Unidentified precursor intensity distribution - Q1, Q2, Q3, From the distribution of unidentified precursor intensities, the quartiles Q1, Q2, Q3;
  • QC:4000234, single value, Unidentified precursor intensity distribution - mean, From the distribution of unidentified precursor intensities, the mean;
  • QC:4000235, single value, Unidentified precursor intensity distribution - sigma, From the distribution of unidentified precursor intensities, the sigma value;
  • QC:4000245, single value, Number of different undistinguishable proteins groups from all PSM, Number of different undistinguishable proteins groups from all PSM after FDR filtering. (Only undistinguishability groups.);
  • QC:4000257, single value, Detected Compounds, Number of detected compounds from a given library of target compounds in a specific run;
  • QC:4000258, single value, Maximal S1 frequency, The fastest frequency for MS collection in any minute over the complete run;
  • QC:4000262, single value, Retention time mean shift, Based on reference retention times of detected features the mean shift of all features is calculated in seconds;
  • QC:4000263, single value, Pump pressure mean, The mean pump pressure in bar for the whole run
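
A few of the MS1 metrics above need only per-spectrum retention times, MS levels and total ion currents. The following is a minimal sketch in base R, assuming these are available as plain vectors (for a Spectra object they would presumably come from accessors such as `rtime()`, `msLevel()` and `tic()`). The function names are hypothetical, the input values are toy data, and the trapezoidal rule for the TIC area is one possible choice rather than something the mzQC definitions prescribe.

```r
## Toy input: retention times (s), MS levels and TIC per spectrum.
rt  <- c(1, 2, 3, 4, 5, 6)
msl <- c(1L, 2L, 1L, 1L, 2L, 1L)
tic <- c(100, 50, 2000, 150, 20, 1500)

## QC:4000053, RT duration: highest minus lowest scan time.
rtDuration <- function(rt) max(rt) - min(rt)

## QC:4000059, number of MS1 spectra.
numberMs1 <- function(msLevel) sum(msLevel == 1L)

## QC:4000172 / QC:4000173, count of >= 10x jumps / falls of the
## summed MS1 signal between two subsequent MS1 scans.
ms1SignalJump <- function(tic, msLevel, factor = 10) {
    x <- tic[msLevel == 1L]
    sum(x[-1] / x[-length(x)] >= factor)
}
ms1SignalFall <- function(tic, msLevel, factor = 10) {
    x <- tic[msLevel == 1L]
    sum(x[-length(x)] / x[-1] >= factor)
}

## QC:4000077, area under the TIC (trapezoidal rule, an assumption).
areaUnderTic <- function(rt, tic) {
    o <- order(rt)
    sum(diff(rt[o]) * (head(tic[o], -1) + tail(tic[o], -1)) / 2)
}

rtDuration(rt)
numberMs1(msl)
ms1SignalJump(tic, msl)
ms1SignalFall(tic, msl)
areaUnderTic(rt, tic)
```

Metrics that depend on identifications (PSMs, peptides, proteins) or on chromatographic peaks need more than these three vectors and are not covered by this sketch.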

What do you think would be the best place to calculate these metrics: within Spectra, or outside in a stand-alone package? Could other objects, e.g. QFeatures, complement Spectra objects where the information stored in a Spectra object is not sufficient for the calculation?

Best,
T.

@jorainer
Member

Now, that's a comprehensive list ;)

I would suggest having these in a separate package (maybe MsQC?), also because not all of the parameters can be calculated on a Spectra: the ones based on XIC would require a Chromatograms object (as returned by e.g. xcms) because they refer to the MS1 chromatographic peaks. Others, like the last one, need to be extracted directly from the mzML file (and I'm not sure that all manufacturers write/export this information). Having it in a separate package also makes development easier; functionality could eventually be transferred if needed.

The other main question is: what would be the user interface you envision? One function for each QC parameter? Or one main function, with a parameter that defines which metric(s) to calculate?

One possibility could be:

setMethod("quality", "Spectra", function(object, metric = qualityMetrics("Spectra")) {
    ## calculate and return the requested metric(s)
})

What the method returns depends a little on how the metric is calculated: on a single spectrum or on the whole Spectra.

qualityMetrics could be a function that lists all possible metrics that can be calculated/estimated on a Spectra object.
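
To make the idea of one `quality` method plus a `qualityMetrics` lookup concrete, here is a minimal runnable sketch. It is only an illustration under assumptions: `ToySpectra` is a hypothetical stand-in class so that the code runs without the Spectra package, and the metric registry `.metrics` and the two metric names are made up for the example.

```r
library(methods)

## Hypothetical stand-in for a Spectra object, so the sketch runs
## without the Spectra package installed.
setClass("ToySpectra",
         representation(rtime = "numeric", msLevel = "integer"))

## Registry of metric functions per class (illustrative names only).
.metrics <- list(
    ToySpectra = list(
        rtDuration = function(object) diff(range(object@rtime)),
        numberMs1  = function(object) sum(object@msLevel == 1L)
    )
)

## Lists the metrics that can be calculated for a given class.
qualityMetrics <- function(class) names(.metrics[[class]])

setGeneric("quality", function(object, ...) standardGeneric("quality"))

setMethod("quality", "ToySpectra",
    function(object, metric = qualityMetrics("ToySpectra"), ...) {
        ## Reject unknown metric names, allow selecting several.
        metric <- match.arg(metric, qualityMetrics("ToySpectra"),
                            several.ok = TRUE)
        ## Apply each selected metric function to the object.
        lapply(.metrics[["ToySpectra"]][metric],
               function(f) f(object))
    })

sps <- new("ToySpectra", rtime = c(1.5, 3, 4.5, 6),
           msLevel = c(1L, 2L, 1L, 1L))
quality(sps)                        # all metrics, as a named list
quality(sps, metric = "numberMs1")  # a single metric
```

With such a registry, supporting another class (e.g. a chromatogram container) would mean adding one more entry and one more method, while the user-facing call stays the same.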

just an idea...

@tnaake
Author

tnaake commented May 14, 2021

Great, then let's go for a separate package. Should I create a repo under my account and start with the implementation there? I guess I can start next week to write some functions for calculating (some of) the metrics.

We could also start with the metrics based on Spectra and Chromatograms for now and go into mzML files later (there are also further metrics that could be calculated from raw/mzML files, which could be added later if there's a need. I will also talk to the people in the core facilities here about which metrics calculated from "raw"-like files they might be interested in).

I like the idea of having one main function in which the metrics to calculate are defined, with methods for Spectra/Chromatograms/... objects. This looks quite tidy and clean to me.

The output would be a list (or an S4 object - tbd) containing the metrics for a Spectra object or a Chromatograms object, etc.
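
As a rough illustration of the list option: because the metrics mix "single value" and "n-tuple" types, a named list keyed by accession can hold both, which a plain numeric vector could not. The values below are made up, and `asTable` is a hypothetical helper, not part of any existing package.

```r
## Made-up example values; the accessions are from the list in this issue.
res <- list(
    "QC:4000053" = 1800,           # RT duration, single value
    "QC:4000059" = 2500L,          # number of MS1 spectra, single value
    "QC:4000139" = c(0.2, 1800.2)  # RT acquisition range, n-tuple
)

## Flatten to one row per metric for reporting; n-tuples are collapsed
## to a semicolon-separated string.
asTable <- function(x) {
    data.frame(
        accession = names(x),
        type = unname(ifelse(lengths(x) == 1L, "single value", "n-tuple")),
        value = unname(vapply(x, paste, character(1), collapse = ";")),
        stringsAsFactors = FALSE
    )
}

asTable(res)
```

An S4 result class could wrap the same list later without changing how the individual metrics are computed.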

@jorainer
Member

I would suggest you create a repo under your account - if you want you can eventually add me as external collaborator so that I can review your pull requests? It's sometimes not bad to get a second opinion on implementations...

@lgatto
Member

lgatto commented Sep 2, 2021

Just FYI - there is (or was, as it may have been deprecated) an msQC package in Bioc, so check for name clashes first.

@tnaake
Author

tnaake commented Sep 2, 2021

Hi @lgatto

Thanks for your comment. I have now checked whether there is an msQC package in Bioc. It seems that there are mdqc, miQC, and msqc1, but I couldn't find an msQC package.

@jorainer
Member

jorainer commented Sep 2, 2021

You should also always check whether a package name could have an ambiguous meaning or might be offensive - in your case I could only find MSQC = Missouri Star Quilt Company - so it should be fine ;)

@jorainer
Member

jorainer commented Sep 2, 2021

sorry, my comment was not really helpful - I just found it funny when I stumbled across that abbreviation

@lgatto
Member

lgatto commented Sep 2, 2021

Beware of MSQC on CRAN. Package names aren't case sensitive, so that one is taken.

And the one I was thinking about is proteoQC, that is now deprecated, so also taken.
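
A case-insensitive clash check could be sketched as follows. The vector of taken names contains only the packages mentioned in this thread; a real check would compare against the current CRAN and Bioconductor package lists instead, and `nameClash` is a hypothetical helper.

```r
## Names mentioned in this thread; a real check would query the current
## CRAN and Bioconductor package lists instead of this hard-coded vector.
taken <- c("mdqc", "miQC", "msqc1", "MSQC", "proteoQC")

## TRUE if the candidate matches an existing name when case is ignored.
nameClash <- function(candidate, existing) {
    tolower(candidate) %in% tolower(existing)
}

nameClash("msQC", taken)       # TRUE, clashes with MSQC
nameClash("MsQC", taken)       # TRUE, same clash
nameClash("SpectraQC", taken)  # FALSE for this toy vector
```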
