Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse Annotated MGFs #12

Merged
merged 9 commits into from
Oct 14, 2020
Merged

Parse Annotated MGFs #12

merged 9 commits into from
Oct 14, 2020

Conversation

jmueller95
Copy link

@jmueller95 jmueller95 commented Oct 9, 2020

This PR extends the MGF module so that MGFs with annotated mass spectra can be parsed (for an example, see test_annotated.mgf ).
These files are e.g. output by the MS/MS prediction tool pDeep (https://github.com/pFindStudio/pDeep).
The new feature can be switched on by setting read_ions=True in the mgf.read() routine.

@levitsky
Copy link
Owner

levitsky commented Oct 9, 2020

Thank you for this very nice contribution! I'll be happy to merge it after some minor wrinkles are ironed out, which I am happy to assist with. I have pushed a commit where I addressed most of the things I noticed, the main one being that mgf.write wasn't quite working with ions. I also added a test for that, and changed a print() call to warn() in _parse_ion. Please take a look and let me know if you're OK with these changes.

Apart from that, I have just a couple of questions left:

  1. As I'm not familiar with pDeep, are those files guaranteed to have all of the fragments annotated like that? (Are these predicted spectra or experimental spectra with annotations?). If some annotations can be missing, we could use masked arrays the same ways as with charges. (Nevermind, I see that pDeep is a prediction tool, so the answer is yes and masked arrays are not needed. Unless you know of any other uses for this format where the annotations may be sparse?)
  2. I'm curious about the change you made in pyteomics/auxiliary/file_helpers.py (explicitly converting the element ID to str). Why was it necessary? It's kind of at the heart of all the indexed parser functionality so I'm extra cautious about changing anything there.

@mobiusklein
Copy link
Contributor

This is probably minor, but the changes do break previously pickled objects since it changes the signature of __init__. To preserve this, just move the read_ions parameter to the end of the parameter list of __init__ and get/set it in the __getstate__/__setstate__ methods instead of in __reduce_ex__. These objects probably shouldn't be serialized for long-term storage though.

@levitsky
Copy link
Owner

Thank you for the comment @mobiusklein. I'm not sure I like the idea of permanently changing the code for a one-time reason, so it got me thinking if I could write a one-off script to convert the pickled data based on the latest version of the code: basically copy the old definitions into the script, patch them onto the actual module, read in the data, then dump it to a new file.
I figured I could set self._read_ions = False in the patched class's __init__ and then change self.__class__ to mgf.IndexedMGF (i.e. new class) in __reduce__ex__. Do you think it's a viable approach?

I'm not sure what to do about plain MGF though. Upon a closer look it doesn't appear to be picklable in any practical sense anyway.

Here's a version that only deals with IndexedMGF in pickled data: https://pastebin.com/s0D40Na3

@jmueller95
Copy link
Author

Thank you very much for your comments! I implemented the changes you suggested, including those by @mobiusklein (you can decide if you want to keep them). As you already figured @levitsky , pDeep has annotations for all its peaks, so masked arrays aren't necessary - I haven't seen any files where the annotations were only present partially.
Ah, and the change in pyteomics/auxiliary/file_helpers.py turns out not to be necessary anymore, so I reverted it to the original version. Thanks for pointing that out!

@mobiusklein
Copy link
Contributor

Thank you, that fixes my concern.

@levitsky
Copy link
Owner

OK, thank you all!
I've reverted another addition of str() conversion in get_by_id, this one inside mgf.py. Hopefully it's not going to be a problem; at least in the tests, it is not needed.

@jmueller95
Copy link
Author

Thanks, no this is not a problem for my use case either.

@levitsky levitsky merged commit 1e75daf into levitsky:master Oct 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants