Parse Annotated MGFs #12

jmueller95 · 2020-10-09T08:43:19Z

This PR extends the MGF module so that MGFs with annotated mass spectra can be parsed (for an example, see test_annotated.mgf).
Such files are produced, for example, by the MS/MS prediction tool pDeep (https://github.com/pFindStudio/pDeep).
The new feature is switched on by passing read_ions=True to mgf.read().
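
For readers unfamiliar with the format, here is a minimal, self-contained sketch of what reading such a file involves. It is not the pyteomics implementation: the function name parse_annotated_mgf, the dictionary keys, and the sample peaks (including the ion-name syntax) are assumptions made for illustration. The only thing it takes for granted is that each peak line may carry the ion name as an optional third column.

```python
def parse_annotated_mgf(lines, read_ions=False):
    """Yield one dict per spectrum from an iterable of MGF text lines.

    Illustrative sketch only; names and keys are assumptions,
    not the pyteomics API.
    """
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "m/z array": [], "intensity array": []}
            if read_ions:
                spectrum["ion array"] = []
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # KEY=value parameter line inside the spectrum block
                key, _, value = line.partition("=")
                spectrum["params"][key.strip().lower()] = value.strip()
            else:
                # Peak line: m/z, intensity, and optionally an ion name
                fields = line.split()
                if len(fields) < 2:
                    continue  # skip malformed peak lines
                spectrum["m/z array"].append(float(fields[0]))
                spectrum["intensity array"].append(float(fields[1]))
                if read_ions:
                    # Missing annotations are stored as None
                    spectrum["ion array"].append(
                        fields[2] if len(fields) > 2 else None
                    )


# Made-up sample data; the ion-name syntax is an assumption.
EXAMPLE = """BEGIN IONS
TITLE=example spectrum
CHARGE=2+
72.04439 0.12 b1+1
175.11895 1.00 y1+1
END IONS
"""

annotated = list(parse_annotated_mgf(EXAMPLE.splitlines(), read_ions=True))
plain = list(parse_annotated_mgf(EXAMPLE.splitlines()))
```

With read_ions=True each spectrum gains an "ion array" entry parallel to the m/z and intensity lists; with the default, the third column is simply ignored.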

levitsky · 2020-10-09T16:37:53Z

Thank you for this very nice contribution! I'll be happy to merge it after some minor wrinkles are ironed out, which I am happy to assist with. I have pushed a commit where I addressed most of the things I noticed, the main one being that mgf.write wasn't quite working with ions. I also added a test for that, and changed a print() call to warn() in _parse_ion. Please take a look and let me know if you're OK with these changes.

Apart from that, I have just a couple of questions left:

As I'm not familiar with pDeep, are those files guaranteed to have all of the fragments annotated like that? (Are these predicted spectra or experimental spectra with annotations?) If some annotations can be missing, we could use masked arrays the same way as with charges. (Never mind, I see that pDeep is a prediction tool, so the answer is yes and masked arrays are not needed. Unless you know of any other uses for this format where the annotations may be sparse?)
I'm curious about the change you made in pyteomics/auxiliary/file_helpers.py (explicitly converting the element ID to str). Why was it necessary? It's kind of at the heart of all the indexed parser functionality so I'm extra cautious about changing anything there.

mobiusklein · 2020-10-09T19:34:38Z

This is probably minor, but the changes break previously pickled objects because they change the signature of __init__. To preserve compatibility, just move the read_ions parameter to the end of the parameter list of __init__ and get/set it in the __getstate__/__setstate__ methods instead of in __reduce_ex__. These objects probably shouldn't be serialized for long-term storage, though.
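
The compatibility idea can be sketched roughly as follows. IndexedReader is a hypothetical stand-in, not the actual IndexedMGF class, and restore() only imitates what pickle does on load under default reduction: create the object without calling __init__, then hand the saved state to __setstate__.

```python
class IndexedReader:
    """Hypothetical stand-in for an indexed parser (not pyteomics code)."""

    def __init__(self, source, use_header=True, read_ions=False):
        # The new keyword argument goes last, so positional calls
        # written against the old signature keep their meaning.
        self.source = source
        self.use_header = use_header
        self.read_ions = read_ions

    def __getstate__(self):
        # The new attribute travels in the state dict.
        return {
            "source": self.source,
            "use_header": self.use_header,
            "read_ions": self.read_ions,
        }

    def __setstate__(self, state):
        self.source = state["source"]
        self.use_header = state["use_header"]
        # State saved before read_ions existed lacks the key,
        # so fall back to the default.
        self.read_ions = state.get("read_ions", False)


def restore(cls, state):
    """Imitate unpickling: bypass __init__, then apply the saved state."""
    obj = cls.__new__(cls)
    obj.__setstate__(state)
    return obj


# State as an older version would have saved it (no read_ions key).
old_state = {"source": "spectra.mgf", "use_header": True}
old_reader = restore(IndexedReader, old_state)

# State produced by the current version.
new_state = IndexedReader("spectra.mgf", read_ions=True).__getstate__()
new_reader = restore(IndexedReader, new_state)
```

Because the new attribute is read from the state with a default, older state still loads, while state written by the new code keeps its read_ions value.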

levitsky · 2020-10-10T13:40:15Z

Thank you for the comment @mobiusklein. I'm not sure I like the idea of permanently changing the code for a one-time reason, which got me wondering whether I could write a one-off script to convert the pickled data based on the latest version of the code: basically copy the old definitions into the script, patch them onto the actual module, read in the data, then dump it to a new file.
I figured I could set self._read_ions = False in the patched class's __init__ and then change self.__class__ to mgf.IndexedMGF (i.e. the new class) in __reduce_ex__. Do you think it's a viable approach?

I'm not sure what to do about plain MGF though. Upon a closer look it doesn't appear to be picklable in any practical sense anyway.

Here's a version that only deals with IndexedMGF in pickled data: https://pastebin.com/s0D40Na3

jmueller95 · 2020-10-12T16:56:18Z

Thank you very much for your comments! I implemented the changes you suggested, including those by @mobiusklein (you can decide if you want to keep them). As you already figured, @levitsky, pDeep has annotations for all its peaks, so masked arrays aren't necessary. I haven't seen any files where the annotations were only partially present.
Ah, and the change in pyteomics/auxiliary/file_helpers.py turns out not to be necessary anymore, so I reverted it to the original version. Thanks for pointing that out!

mobiusklein · 2020-10-13T02:19:56Z

Thank you, that fixes my concern.

levitsky · 2020-10-13T20:53:30Z

OK, thank you all!
I've reverted another addition of str() conversion in get_by_id, this one inside mgf.py. Hopefully it's not going to be a problem; at least in the tests, it is not needed.

jmueller95 · 2020-10-14T11:20:59Z

Thanks, no, this is not a problem for my use case either.

Julian Mueller and others added 5 commits July 16, 2020 13:57

Implement parsing of MGF files that contain ion names

31cdefe

Delete deprecated code

a056a57

Add tests for annotated mgf

0e1699e

Merge remote-tracking branch 'upstream/master' into annotated_mgf

a7fdcd3

Fix mgf.write with write_ions, add a read-write mgf test with ions

d3538dd

Julian Mueller added 2 commits October 12, 2020 18:30

Reset changes in auxiliary/file_helpers.py

8789758

Moved read_ions parameter to end of function in mgf.py

07020e5

levitsky added 2 commits October 13, 2020 22:30

Undo str conversion of MGF ID

b55601f

Add an extra test for index access and ions

4898a25

levitsky merged commit 1e75daf into levitsky:master Oct 14, 2020
