Benchmarking of the id dda workflow (ms2rescore, percolator, SNR) #410

Open
ypriverol opened this issue Aug 16, 2024 · 13 comments

@ypriverol
Member

ypriverol commented Aug 16, 2024

PXD001819 Analysis

Currently, we have a workflow that can perform peptide identification using: -> ms2rescore -> SNR + spectrum properties -> percolator

Here the results can be found (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819-id-ms2rescore/).

Total number of PSMs

Comet only + Percolator: 495306
Comet + MSGF + Percolator: 572496 (15.58% increase)
Comet + MSGF + ms2rescore: 589200 (18.95% increase)
Comet + MSGF + (SNR + ms2rescore): 587972 (18.71% increase)
Comet + MSGF + SAGE + (SNR + ms2rescore): 592918 (19.68% increase)

psm_tools_plot

Total number of PSMs by RAW file and combination

psms_by_file_and_tool

Among the Comet + MSGF combinations, ms2rescore alone currently yields the most PSM identifications, followed by ms2rescore + SNR.

The following questions would be interesting to understand:

  • When the spectrum quality metrics are introduced, are the resulting PSMs of higher quality? That is, although ms2rescore + SNR yields fewer PSMs than ms2rescore alone, are those PSMs more reliable?
  • Do we see the same results in other datasets?
  • What is the impact at the peptide level?
ypriverol added the enhancement label Aug 16, 2024
@ypriverol
Member Author

ypriverol commented Aug 17, 2024

PXD014415

Currently, we have a workflow that can perform peptide identification using: -> ms2rescore -> SNR + spectrum properties -> percolator

Each of these combinations can be turned off. We used the dataset PXD014415 to benchmark the peptide identifications with some of the combinations:

Here the results can be found (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD014415-id-ms2rescore/).

Combinations and PSM counts:

  • Comet only + Percolator: 1401471
  • Comet + MSGF + Percolator: 1576657 (12.50% increase)
  • Comet + MSGF + ms2rescore: 1620560 (15.63% increase)
  • Comet + MSGF + (SNR + ms2rescore): 1617000 (15.38% increase)
  • Comet + MSGF + SAGE + (SNR + ms2rescore): 1646795 (17.50% increase)
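The percentages above are relative gains over the Comet-only baseline. A minimal sketch of that arithmetic, using only the counts listed in this comment (the dictionary and function names are illustrative, not part of quantms):

```python
# PSM counts copied from the PXD014415 list above.
psm_counts = {
    "Comet only + Percolator": 1401471,
    "Comet + MSGF + Percolator": 1576657,
    "Comet + MSGF + ms2rescore": 1620560,
    "Comet + MSGF + (SNR + ms2rescore)": 1617000,
    "Comet + MSGF + SAGE + (SNR + ms2rescore)": 1646795,
}


def percent_increase(count: int, baseline: int) -> float:
    """Relative gain of `count` over `baseline`, in percent."""
    return (count - baseline) / baseline * 100.0


# The Comet-only run is the reference for every other combination.
baseline = psm_counts["Comet only + Percolator"]
increases = {
    name: round(percent_increase(count, baseline), 2)
    for name, count in psm_counts.items()
}

for name, pct in increases.items():
    print(f"{name}: {pct:.2f}% increase")
```

Computing the gains from the raw counts in one place keeps the reported percentages in sync if a run is repeated and the counts change.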

psm_tools_plot

Total number of PSMs by RAW file and combination

psms_by_file_and_tool

Currently, the combination of ms2rescore (Comet + MSGF + SAGE) and SNR yields the most PSM identifications.

@jpfeuffer
Collaborator

What is "non-sage"? Comet?

@ypriverol
Member Author

Sorry, the non-sage is Comet + MSGF.

@jpfeuffer
Collaborator

Sage comes on top or as replacement?

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.
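To make the comparison concrete, here is a rough sketch of the two estimators on a toy spectrum. The definitions are assumptions made for illustration (max peak over the RMS of all peaks versus RMS of the top 10 peaks over the RMS of all peaks); they are not taken from the quantms-utils script, and the function names are hypothetical.

```python
import numpy as np


def rms(values: np.ndarray) -> float:
    """Root mean square of a 1-D array."""
    return float(np.sqrt(np.mean(np.square(values))))


def snr_max(intensities) -> float:
    """Assumed max-based estimate: highest peak / RMS of all peaks."""
    x = np.asarray(intensities, dtype=float)
    if x.size == 0:
        raise ValueError("spectrum has no peaks")
    noise = rms(x)
    return float(x.max()) / noise if noise > 0 else 0.0


def snr_top_n_rms(intensities, n: int = 10) -> float:
    """Suggested alternative: RMS of the n most intense peaks / RMS of all peaks."""
    x = np.asarray(intensities, dtype=float)
    if x.size == 0:
        raise ValueError("spectrum has no peaks")
    noise = rms(x)
    top = np.sort(x)[-n:]  # uses every peak if the spectrum has fewer than n
    return rms(top) / noise if noise > 0 else 0.0


# Toy spectrum: 50 low peaks and 10 real fragment peaks (made-up intensities).
base_spectrum = [10.0] * 50 + [100.0] * 10
# Same spectrum with a single extreme outlier peak added.
spiked_spectrum = base_spectrum + [10000.0]

for label, spectrum in [("base", base_spectrum), ("spiked", spiked_spectrum)]:
    print(label, round(snr_max(spectrum), 2), round(snr_top_n_rms(spectrum), 2))
```

On this toy data the single outlier shifts the max-based value much more than the top-10 RMS value, which is the robustness argument in a nutshell.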

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

@ypriverol
Member Author

Sage comes on top or as replacement?

On top.

How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive.

It is really fast, so there is no urgent need for improvements.

Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough.

We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py

Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)?

I'm open to suggestions. I would love to evaluate whether this 5% increase in PSMs affects the FDR in some way. I'm also open to suggestions on how to evaluate the difference between SNR + ms2rescore and ms2rescore alone. I have manually checked some IDs (in proteogenomics - https://www.biorxiv.org/content/10.1101/2024.05.24.595489v1) and I know that ms2rescore can rescue (identify) some low-quality spectra; that is the reason why we added the SNR. It would be nice to have a benchmark to prove it.

@ypriverol
Member Author

Today I was reading about MSAmanda + ms2rescore, and the increase in PSMs there is 6%.

@RalfG

RalfG commented Aug 21, 2024

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

@ypriverol
Member Author

ypriverol commented Aug 23, 2024

Thanks @RalfG for this response:

Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266).

Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076).

How do you test this? Distribution of the PEP scores or the original scores for targets and decoys?

@RalfG

RalfG commented Aug 23, 2024

Usually just by plotting the amount of confidently identified PSMs at each FDR threshold, as in figure 1 of doi:10.1016/j.mcpro.2021.100076.
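As a rough illustration of that kind of curve, the sketch below estimates q-values with a plain target-decoy count and then tallies accepted target PSMs at several FDR thresholds. It is a simplified stand-in, not the procedure used by Percolator or in the cited paper; the scores, labels, and function names are made up, and tied scores are not treated specially.

```python
import numpy as np


def tdc_q_values(scores, is_decoy) -> np.ndarray:
    """Simple target-decoy q-values (higher score = better PSM)."""
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # best score first
    decoys = np.cumsum(is_decoy[order])
    targets = np.cumsum(~is_decoy[order])
    fdr = decoys / np.maximum(targets, 1)
    # q-value: lowest FDR at which a PSM would still be accepted.
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]
    q = np.empty_like(q_sorted)
    q[order] = q_sorted
    return q


def accepted_targets(q_values, is_decoy, thresholds) -> dict:
    """Number of target PSMs with q-value <= each threshold."""
    q = np.asarray(q_values, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)
    return {t: int(np.sum((q <= t) & ~is_decoy)) for t in thresholds}


# Toy example with invented scores and labels.
scores = [10, 9, 8, 7, 6, 5, 4, 3]
is_decoy = [False, False, False, True, False, False, True, True]
q = tdc_q_values(scores, is_decoy)
counts = accepted_targets(q, is_decoy, thresholds=[0.001, 0.01, 0.2])
print(counts)
```

Running this once per rescoring setup and plotting the counts against the thresholds gives one curve per setup; a setup with better target/decoy separation should keep more PSMs at the strict end.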

@jonasscheid
Contributor

I'm a bit curious about

We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py

Did you check the feature weights of percolator for this feature? I would guess that the Comet Xcorr implicitly penalizes for high SNR: https://willfondrie.com/2019/02/an-intuitive-look-at-the-xcorr-score-function-in-proteomics/

Would be great to see how high search engine scores / predicted features weights are in percolator!

@daichengxin
Collaborator

daichengxin commented Sep 22, 2024

Thanks for your suggestions. Here are the latest benchmark results from PXD001819 and PXD014415. The top 20 Percolator weights are shown in figure 3 and figure 4 (top panel: Comet; bottom panel: MSGF). The SNR features are plotted in figure 5 and figure 6: (a) is the Percolator method, (b) is ms2rescore, and (c) is ms2rescore + SNR.

I think we can draw some conclusions:

  1. Using multiple search engines improved identification by >10%.
  2. Adding MS2rescore features enhanced the separation between true and false PSMs, which means the specificity can be increased to 0.1% FDR (3% increase).
  3. Peptide length and Comet:spscore have significant weights. For peptide length, I think that the longer the peptide, the more key ions are produced and matched, and therefore the easier it is to differentiate between false and true PSMs. The weight of XCorr is positive, which indicates that a high XCorr gives a better hit. The weight of absdM is negative, which indicates that a large difference between observed and calculated mass gives a worse score. These results agree with https://github.com/percolator/percolator/wiki/Example.
  4. After adding MS2rescore features, the weight distribution changed. RT difference and ion intensity difference features become important. For example, the weight of ionb_mse_norm is negative, which indicates that large differences between observed and predicted b-ion intensities give a worse score.
  5. After adding SNR features, the weight of quantms:snr is positive, which indicates that a high SNR gives a better score. The weights of quantms:SpectralEntrpy and quantms:quantms:FracTICinTop10Peaks are positive in PXD001819, which indicates that higher values of these signal-distribution features give a better score. But this may differ across search engines and datasets.
    PXD001819:
    PXD001819_ms2rescore
    PXD014415:
    PXD014415_ms2rescore
    PXD001819:
    image
    PXD014415:
    image

image
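The sign-based reading of the weights in points 3 to 5 can be sketched as follows. The feature names come from this comment, but the numeric weights are invented for illustration, and the helper is hypothetical rather than part of Percolator or quantms. Weights are only comparable across features when they refer to features on a common (normalized) scale.

```python
def top_weights(weights: dict, k: int = 20) -> list:
    """Rank features by absolute weight and describe the direction of each."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    result = []
    for name, weight in ranked[:k]:
        if weight > 0:
            effect = "higher value raises the score"
        elif weight < 0:
            effect = "higher value lowers the score"
        else:
            effect = "no effect"
        result.append((name, weight, effect))
    return result


# Invented weights, for illustration only (not the benchmark values).
example_weights = {
    "XCorr": 0.8,
    "absdM": -0.5,
    "ionb_mse_norm": -0.3,
    "quantms:snr": 0.1,
}

for name, weight, effect in top_weights(example_weights, k=3):
    print(f"{name:15s} {weight:+.2f}  {effect}")
```

Sorting by absolute weight with one shared rule also gives a stable feature order that can be reused across plots.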

Looking forward to your feedback!

@jonasscheid
Contributor

jonasscheid commented Sep 23, 2024

Nice job 👌🏼 Are any of the quantms features correlated with the ms2pip or other features?

Also, what exactly is "number of identified spectra"? (PSMs? Peptides?)
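One quick way to check this would be a Pearson correlation of each feature against the SNR feature over the same PSMs. The sketch below is only an illustration: the values are invented, the names `feature_a` and `feature_b` are placeholders for real feature columns, and a rank-based correlation may be more appropriate for skewed features.

```python
import numpy as np


def feature_correlations(features: dict, reference: str) -> dict:
    """Pearson correlation of every feature with the reference feature."""
    ref = np.asarray(features[reference], dtype=float)
    correlations = {}
    for name, values in features.items():
        if name == reference:
            continue
        other = np.asarray(values, dtype=float)
        # np.corrcoef returns a 2x2 matrix; [0, 1] is the cross term.
        correlations[name] = float(np.corrcoef(ref, other)[0, 1])
    return correlations


# Invented per-PSM feature values, one entry per PSM.
toy_features = {
    "quantms:snr": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with SNR
    "feature_b": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls with SNR
}

correlations = feature_correlations(toy_features, reference="quantms:snr")
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

A feature that correlates strongly with an existing one adds little new information to the rescoring model, which could explain small gains.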

@RalfG

RalfG commented Oct 11, 2024

Nice results! Cool to see that combining multiple search engines improves identification rates. How would you interpret this? Would this come from differences in candidate PSM generation?

I find the radar charts a bit difficult to compare, as the features are ordered differently across each plot.

It's a bit unfortunate to see only small differences when adding SNR features. In any case, they do seem to help a bit? It would be nice to see the SNR feature distributions for accepted targets, rejected targets, and decoys, a bit similar to this:

image

Or you could do something like this:

image
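The three-way split mentioned above (accepted targets, rejected targets, decoys) could be prepared along these lines before plotting. This is a sketch under assumptions: the 1% FDR cut-off is an assumed default, the input values are invented, and the function names are hypothetical.

```python
import numpy as np


def split_by_outcome(values, q_values, is_decoy, fdr_threshold: float = 0.01) -> dict:
    """Group a per-PSM feature into accepted targets, rejected targets, and decoys."""
    values = np.asarray(values, dtype=float)
    q = np.asarray(q_values, dtype=float)
    decoy = np.asarray(is_decoy, dtype=bool)
    accepted = ~decoy & (q <= fdr_threshold)
    rejected = ~decoy & (q > fdr_threshold)
    return {
        "accepted targets": values[accepted],
        "rejected targets": values[rejected],
        "decoys": values[decoy],
    }


def summarize(groups: dict) -> dict:
    """Count and median per group (NaN median for an empty group)."""
    return {
        name: {
            "n": int(v.size),
            "median": float(np.median(v)) if v.size else float("nan"),
        }
        for name, v in groups.items()
    }


# Invented SNR values, q-values, and decoy labels for seven PSMs.
snr = [12.0, 10.0, 9.0, 3.0, 2.0, 4.0, 1.0]
q_values = [0.001, 0.005, 0.009, 0.2, 0.5, 0.3, 0.6]
is_decoy = [False, False, False, False, False, True, True]

groups = split_by_outcome(snr, q_values, is_decoy)
summary = summarize(groups)
print(summary)
```

Each group's array can then be passed to a histogram or density plot. If the SNR feature is informative, rejected targets should look more like decoys than like accepted targets.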
