Draft mzmlb implementation #35

Merged
merged 11 commits into from
Apr 1, 2021

mobiusklein
Contributor

This is a draft pull request for reading mzMLb files. Implementation based upon the reference implementation at https://github.com/biospi/pwiz.

The built artifact from https://github.com/biospi/pwiz let me convert an mzML file to mzMLb and then read it using the new code in this branch. I don't think this is ready yet because there's information missing about how to detect the lossy compression scheme (--mzLinear and --intenLinear or --mzDelta and --intenDelta).

Source: https://doi.org/10.1021/acs.jproteome.0c00192

Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021).
MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant
mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research,
20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192

Helpful features that mzMLb provides include:

  1. Fast random access to compressed data while achieving better compression ratios than compressing XML.
  2. Better pre-built index support so no need to scan the entire file start-to-finish to build reliable indices.
  3. Simpler compression helper encodings

In order to use this, though, we need to be able to read HDF5 files, potentially including extra HDF5 plugins to access more compressors. This adds h5py and hdf5plugin as dependencies. They're pretty heavy, but well packaged. If we don't want these to just be optional dependencies but part of the core, I can make this a namespace member package instead.

@mobiusklein
Contributor Author

Another thing I like about this scheme is that because HDF5 isn't a single contiguous byte stream, it makes adding custom indices much easier and less invasive. The idea that you can just write extra arrays to the file without needing to completely re-write the file means we can add some mutability to this data reader if we want.

I've gone ahead and written a pretty simple mzMLb writer in https://github.com/mobiusklein/psims to explore some of this eventually.

@mobiusklein mobiusklein marked this pull request as ready for review March 29, 2021 01:44
@mobiusklein
Contributor Author

After testing locally, I'm pretty satisfied with where this implementation is. I've made educated guesses at which controlled vocabulary parameters the ProteoWizard implementation will use once it is merged, but I've retained the Biospi names too, just in case.

The truncate-and-predict compression aids are implemented, but they require sequential computation, so we can't vectorize them with NumPy.
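To make the sequential dependency concrete, here is a minimal sketch of reversing a delta-style and a linear-style prediction. The recurrences, the function names, and the matching encoders are assumptions chosen for illustration, not the confirmed definitions used by mzMLb or by this branch, and the lossy truncation step is left out entirely. The point is only that each decoded element is computed from elements decoded before it.

```python
def delta_encode(values):
    """Hypothetical encoder: keep the first two values, then store each
    value relative to its predecessor, offset by the first value."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - values[i - 1] + values[0]
    return out


def delta_decode(data):
    """Reverse delta_encode. out[i - 1] must already be decoded."""
    out = list(data)
    for i in range(2, len(out)):
        out[i] = out[i] + out[i - 1] - out[0]
    return out


def linear_encode(values):
    """Hypothetical encoder: store the deviation from a straight-line
    extrapolation of the two previous values, offset by the second value."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - 2 * values[i - 1] + values[i - 2] + values[1]
    return out


def linear_decode(data):
    """Reverse linear_encode. Needs the two previously decoded values."""
    out = list(data)
    for i in range(2, len(out)):
        out[i] = out[i] + 2 * out[i - 1] - out[i - 2] - out[1]
    return out


# Values chosen to be exactly representable in binary floating point.
mz = [100.0, 100.5, 101.0, 101.5, 102.25]
round_trip_delta = delta_decode(delta_encode(mz))
round_trip_linear = linear_decode(linear_encode(mz))
```

On a nearly evenly spaced m/z axis the encoded values stay close to a constant, which is what makes them compress well; the cost is the element-by-element loop in the decoder.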

I added a new "extras" installation section to support this, and then made an "all" category for convenience.
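As a rough sketch of what such an extras section can look like in a setuptools-based setup.py: the extra names and the requirement lists below are assumptions for illustration (only h5py and hdf5plugin are named in this PR), and the "all" entry is simply derived as the union of the others.

```python
# Hypothetical extras mapping, as it might be passed to
# setuptools.setup(extras_require=...). Names are illustrative only.
extras_require = {
    "XML": ["lxml", "numpy"],
    "mzMLb": ["h5py", "hdf5plugin"],
}

# Build the convenience "all" category from every other category,
# de-duplicated and sorted so the result is stable.
extras_require["all"] = sorted(
    {req for reqs in extras_require.values() for req in reqs}
)
```

With a mapping like this, a user would opt in with something like `pip install "pyteomics[mzMLb]"` or `pip install "pyteomics[all]"`, assuming those extra names.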

Owner

@levitsky levitsky left a comment

Thank you, this is great!

One issue I want to raise for discussion is the fact that MzMLb is detached from the general parser class hierarchy. I see why this is, and it has an MzML subclass attached to it, but this may still result in some of the assumed behavior differing between MzMLb and other public parser classes.
This is partially solved by populating mzml_args with kwargs, transparently passing through the usual arguments to that parser. However,

  • array conversion (implemented by ArrayConversionMixin) doesn't work, because the overridden _handle_binary doesn't call _convert_array;
  • we are missing the FileReader interface that does the work of supporting both file path and file object as main argument to the parser constructor;
  • FileReader also handles the context manager support, which is re-implemented directly on MzMLb, as well as parts of the XML interface for iterfind support.

Do you think we can (or should) plant MzMLb deeper into the file parser class tree? I don't know if it can help save any space in terms of code, but for now isinstance checks with abstract classes won't work with MzMLb even where the class presents the necessary interfaces (e.g. TimeOrderedIndexedReader, XML or parts of it, etc.).

I also tried using the suggestions to iron out minor wrinkles I noticed.

Thank you again for yet another valuable contribution.

tests/test_mzmlb.py (outdated)
@@ -0,0 +1,445 @@
# -*- coding: utf8 -*-
Owner

Suggested change
# -*- coding: utf8 -*-

Contributor Author

This is necessary because the citation includes non-ASCII characters encoded as UTF-8 (en-dash and the like). If you prefer, I could manually retype it using only ASCII look-alikes.

Owner

Ah, okay. Whichever way you prefer is fine. Sorry for the noise.

doc/source/conf.py (outdated)
Comment on lines +215 to +218
mzml_parser : :class:`~.ExternalDataMzML`
The mzML parser for the XML stream inside the HDF5 file with
special behavior for retrieving the out-of-band data arrays
from their respective storage locations.
Owner

Suggested change
mzml_parser : :class:`~.ExternalDataMzML`
The mzML parser for the XML stream inside the HDF5 file with
special behavior for retrieving the out-of-band data arrays
from their respective storage locations.
mzml_args : dict
A dictionary of keyword arguments to be passed to an :class:`~.ExternalDataMzML`
parser corresponding to the XML stream inside the HDF5 file.

Contributor Author

I should probably document both. I'll look up how to document __init__ properly too.

pyteomics/mzmlb.py (outdated)
mobiusklein and others added 3 commits March 30, 2021 12:00
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Fair point.

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
@mobiusklein
Contributor Author

array conversion (implemented by ArrayConversionMixin) doesn't work, because the overridden _handle_binary doesn't call _convert_array;

The arrays are transparently decompressed by HDF5 when they're read, so we don't need to explicitly decompress them. They're also stored directly, byte-for-byte, in the HDF5 file, so no base64 decoding is needed. There are two scenarios where something more is required, and that is partially implemented in ArrayConversionMixin. The first is supporting decode_arrays=False, and the second is reversing the numerical transformations that aid compression (e.g. linear prediction, delta prediction, Numpress). The linear prediction and delta prediction are handled in-line because they were added with mzMLb, and it's not clear that they will be used outside mzMLb. Numpress can be used with mzMLb, but it's kind of shoved to the side in the paper because linear prediction is very similar to Numpress while being much simpler.

I might be able to implement something that is more abstract to support decode_arrays and numerical transforms, and then derive ArrayConversionMixin from that and have that handle the zlib-like compression.

we are missing the FileReader interface that does the work of supporting both file path and file object as main argument to the parser constructor

FileReader also handles the context manager support, which is re-implemented directly on MzMLb, as well as parts of the XML interface for iterfind support.

h5py.File already handles path vs file-like object internally (though there are performance consequences to using a file-like object, of course, so I don't want to just default to always giving it a file-like object). The h5py.File object, however, doesn't give the same interface that a file-like object would, so much of that has to be mocked or redirected. In an effort to look exactly like the MzML reader, those methods often end up being explicitly redirected to the mzml_parser attribute, but that only superficially works, and won't give you things like fileno, only read, seek, tell, and closed.
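The kind of explicit redirection described here can be sketched as follows. The class name and the `_inner` attribute are hypothetical stand-ins (the real code forwards to the mzml_parser attribute), and io.BytesIO plays the role of the wrapped stream purely for illustration.

```python
import io


class ForwardingReader:
    """Hypothetical wrapper exposing a small file-like surface by delegation."""

    def __init__(self, inner):
        # Stand-in for the wrapped parser or stream that does the real work.
        self._inner = inner

    def read(self, n=-1):
        return self._inner.read(n)

    def seek(self, offset, whence=0):
        return self._inner.seek(offset, whence)

    def tell(self):
        return self._inner.tell()

    @property
    def closed(self):
        return self._inner.closed


reader = ForwardingReader(io.BytesIO(b"<mzML/>"))
reader.seek(1)
chunk = reader.read(4)                    # bytes 1..4 of the stream
has_fileno = hasattr(reader, "fileno")    # fileno was never forwarded
```

Only the forwarded names exist on the wrapper, which is why anything that expects a complete file object (for example a call to fileno) would still fail.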

The FileReader class is tightly coupled to _file_obj, but I could inherit from TimeOrderedIndexedReaderMixin, most of whose methods I'm already overriding anyway, to preserve that kind of type checking. If we need to preserve FileReader as a base type, then we could turn it into an abstract base class and use the ABC.register method to satisfy issubclass(MzMLb, FileReader)?
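A minimal sketch of the virtual-subclass idea, using only the standard abc module. The two classes below are empty stand-ins that borrow the names from this discussion; they are not the real pyteomics classes.

```python
from abc import ABC


class FileReader(ABC):
    """Stand-in base class. With no abstract methods it stays instantiable."""

    def __init__(self, source=None):
        self._source = source


class MzMLb:
    """Stand-in reader that does not inherit from FileReader."""


# Register MzMLb as a virtual subclass. No methods or attributes are
# inherited; only isinstance/issubclass checks are affected.
FileReader.register(MzMLb)

is_subclass = issubclass(MzMLb, FileReader)
is_instance = isinstance(MzMLb(), FileReader)
in_mro = FileReader in MzMLb.__mro__
```

Registration satisfies the type checks without putting FileReader in the method resolution order, so MzMLb still has to provide every method that callers expect from a FileReader itself.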

As for the XML-related duplication, again I've forwarded the relevant methods.

Much of this issue could be avoided if instead of insisting on using straight-forward composition, I turn this object hierarchy inside-out. Let MzMLb be ExternalDataMzML, and then attach all the lower-level HDF5 manipulation to a new class and let ExternalDataMzML own that object, except that HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than the normal MzML.__init__. This would lead to a much more convoluted implementation of reset and the indexing cascade.

@mobiusklein
Contributor Author

Ignore the complaints about ArrayConversionMixin. I see what you were referring to: it wasn't about decompression but dtype casting, which I had forgotten about.
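For context, dtype casting of decoded arrays can be sketched like this. The function name, the shape of the mapping, and the use of a None key as a fallback are assumptions for illustration and are not taken from ArrayConversionMixin itself.

```python
import numpy as np


def convert_array(name, array, dtype_map):
    """Cast `array` to the dtype requested for `name`, if one was given.

    `dtype_map` is a hypothetical mapping from array name to dtype; a None
    key, if present, acts as the default for names not listed explicitly.
    """
    dtype = dtype_map.get(name, dtype_map.get(None))
    if dtype is None:
        return array
    # copy=False avoids a copy when the array already has the target dtype.
    return array.astype(dtype, copy=False)


intensity = np.array([10.0, 20.0, 30.0], dtype=np.float64)
as_f32 = convert_array("intensity array", intensity, {"intensity array": np.float32})
untouched = convert_array("m/z array", intensity, {})
```

Because HDF5 hands back arrays that are already decompressed and typed, a hook of this kind is the only step of the usual decode path that still applies to every array.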

@levitsky
Owner

Much of this issue could be avoided if instead of insisting on using straight-forward composition, I turn this object hierarchy inside-out. Let MzMLb be ExternalDataMzML, and then attach all the lower-level HDF5 manipulation to a new class and let ExternalDataMzML own that object, except that HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than the normal MzML.__init__. This would lead to a much more convoluted implementation of reset and the indexing cascade.

This would seem more natural to me, conceptually, but if it feels "inside out" to you, I won't insist on it. In that case using the ABC machinery should probably work, or I can try and see if I can put something together myself. I haven't thought this through as far as you have yet.

@mobiusklein
Contributor Author

Made FileReader an abstract base class (with no abstract methods) and then registered MzMLb with it. A workaround that may only add a few nanoseconds to all future reader instantiations.

I haven't expanded on mutation of the HDF5 file, but the option is there. I'm reasonably happy with the state of things if you are.

@levitsky levitsky merged commit 87b564f into levitsky:master Apr 1, 2021
@mobiusklein mobiusklein deleted the feature/mzmlb branch April 1, 2021 13:41