Draft mzmlb implementation #35

mobiusklein · 2021-03-27T18:39:00Z

This is a draft pull request for reading mzMLb files. Implementation based upon the reference implementation at https://github.com/biospi/pwiz.

The built artifact from https://github.com/biospi/pwiz let me convert an mzML file to mzMLb and then read it using the new code in this branch. I don't think this is ready yet because there's information missing about how to detect the lossy compression scheme (--mzLinear and --intenLinear or --mzDelta and --intenDelta).

Source: https://doi.org/10.1021/acs.jproteome.0c00192

Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021).
MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant
mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research,
20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192

Helpful features that mzMLb provides include:

Fast random access to compressed data, with better compression ratios than compressing the XML.
Better pre-built index support, so there is no need to scan the entire file from start to finish to build reliable indices.
Simpler compression helper encodings.

In order to use this, though, we need to be able to read HDF5 files, potentially including extra HDF5 plugins to access more compressors. This adds h5py and hdf5plugin as dependencies. They're pretty heavy, but well packaged. If we don't want these to just be optional dependencies but part of the core, I can make this a namespace member package instead.

mobiusklein · 2021-03-28T04:14:37Z

Another thing I like about this scheme is that because HDF5 isn't a single contiguous byte stream, it makes adding custom indices much easier and less invasive. The idea that you can just write extra arrays to the file without needing to completely re-write the file means we can add some mutability to this data reader if we want.

I've gone ahead and written a pretty simple mzMLb writer in https://github.com/mobiusklein/psims to explore some of this eventually.

mobiusklein · 2021-03-29T01:50:01Z

After testing locally, I'm pretty satisfied with where this implementation is. I've made educated guesses at which controlled vocabulary parameters the ProteoWizard implementation will use once it is merged, but I've retained the Biospi names too, just in case.

The truncate-and-predict compression aids are implemented, but they require sequential computation, so we can't vectorize them with NumPy.
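To make the "sequential computation" point concrete, here is a minimal pure-Python sketch of a delta-style and a linear-style predictor with matching decoders. The exact recurrences are an assumption for illustration (this thread does not spell them out), and the function names are hypothetical rather than the ones in this branch. The thing to notice is that each decoded element depends on elements that were themselves just decoded.

```python
def delta_encode(values):
    """Assumed scheme: from index 2 on, store the step from the previous
    value, offset by the first value. The first two values are kept as-is."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - values[i - 1] + values[0]
    return out


def delta_decode(encoded):
    """Invert delta_encode. out[i - 1] must already be decoded, which is
    the loop-carried dependency."""
    out = list(encoded)
    for i in range(2, len(out)):
        out[i] = out[i] + out[i - 1] - out[0]
    return out


def linear_encode(values):
    """Assumed scheme: from index 2 on, store the deviation from a straight
    line through the two previous values, offset by the second value."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - 2 * values[i - 1] + values[i - 2] + values[1]
    return out


def linear_decode(encoded):
    """Invert linear_encode using the two previously decoded values."""
    out = list(encoded)
    for i in range(2, len(out)):
        out[i] = out[i] + 2 * out[i - 1] - out[i - 2] - out[1]
    return out


# Integers are used so the round trip is exact.
raw = [100, 103, 107, 112, 118]
assert delta_decode(delta_encode(raw)) == raw
assert linear_decode(linear_encode(raw)) == raw
```

On smooth data the encoded values cluster tightly, which is what helps the general-purpose compressor that runs afterwards.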

I added a new "extras" installation section to support this, and then made an "all" category for convenience.
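For readers unfamiliar with the pattern, a hedged sketch of how an "all" extra can be derived from the other extras follows. The extras names and the contents of the "XML" entry are illustrative assumptions, not copied from this branch; only h5py and hdf5plugin come from the discussion above.

```python
# Hypothetical extras table; only the h5py/hdf5plugin pair is taken from
# this pull request, the rest is made up for illustration.
extras_require = {
    "XML": ["lxml", "numpy"],
    "mzMLb": ["h5py", "hdf5plugin"],
}

# Build "all" as the de-duplicated union of every other extra. The set is
# computed before the "all" key is assigned, so it never includes itself.
extras_require["all"] = sorted(
    {pkg for deps in extras_require.values() for pkg in deps}
)

print(extras_require["all"])
```

The resulting dictionary would be passed as the `extras_require` argument of `setuptools.setup`, after which `pip install "package[all]"` pulls in every optional dependency.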

levitsky

Thank you, this is great!

One issue I want to raise for discussion is the fact that MzMLb is detached from the general parser class hierarchy. I see why this is, and it has an MzML subclass attached to it, but this may still result in some of the assumed behavior differing between MzMLb and other public parser classes.
This is partially solved by populating mzml_args with kwargs, transparently passing through the usual arguments to that parser. However,

array conversion (implemented by ArrayConversionMixin) doesn't work, because the overridden _handle_binary doesn't call _convert_array;
we are missing the FileReader interface that does the work of supporting both file path and file object as main argument to the parser constructor;
FileReader also handles the context manager support, which is re-implemented directly on MzMLb, as well as parts of the XML interface for iterfind support.

Do you think we can (or should) plant MzMLb deeper into the file parser class tree? I don't know if it can help save any space in terms of code, but for now isinstance checks with abstract classes won't work with MzMLb even where the class presents the necessary interfaces (e.g. TimeOrderedIndexedReader, XML or parts of it, etc.).

I also tried using the suggestions to iron out minor wrinkles I noticed.

Thank you again for yet another valuable contribution.

tests/test_mzmlb.py

levitsky · 2021-03-30T14:32:54Z

pyteomics/mzmlb.py

@@ -0,0 +1,445 @@
+# -*- coding: utf8 -*-


Suggested change

# -*- coding: utf8 -*-

This is necessary because the citation includes UTF8 characters (en-dash and the like). If you prefer, I could manually retype it using only ASCII look-alikes.

Ah, okay. Whichever way you prefer is fine. Sorry for the noise.

doc/source/conf.py

levitsky · 2021-03-30T15:00:37Z

pyteomics/mzmlb.py

+    mzml_parser : :class:`~.ExternalDataMzML`
+        The mzML parser for the XML stream inside the HDF5 file with
+        special behavior for retrieving the out-of-band data arrays
+        from their respective storage locations.


Suggested change

mzml_parser : :class:`~.ExternalDataMzML`
    The mzML parser for the XML stream inside the HDF5 file with
    special behavior for retrieving the out-of-band data arrays
    from their respective storage locations.
mzml_args : dict
    A dictionary of keyword arguments to be passed to an :class:`~.ExternalDataMzML`
    parser corresponding to the XML stream inside the HDF5 file.

I should probably document both. I'll look up how to document __init__ properly too.

pyteomics/mzmlb.py

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Fair point. Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

mobiusklein · 2021-03-31T00:32:08Z

array conversion (implemented by ArrayConversionMixin) doesn't work, because the overridden _handle_binary doesn't call _convert_array;

The arrays are transparently decompressed by HDF5 when they're read, so we don't need to explicitly decompress them. They're also stored directly, byte-for-byte, in the HDF5 file, so no base64 decoding is needed. There are two scenarios where something more is required, which is partially implemented in ArrayConversionMixin. The first is supporting decode_arrays=False, and the second is reversing the compression-helping numerical transformations (e.g. linear prediction, delta prediction, numpress). Linear prediction and delta prediction are handled in-line because they were added with mzMLb, and it's not clear that they will be used outside mzMLb. Numpress can be used with mzMLb, but it's kind of shoved to the side in the paper because linear prediction is very similar to Numpress while being much simpler.

I might be able to implement something that is more abstract to support decode_arrays and numerical transforms, and then derive ArrayConversionMixin from that and have that handle the zlib-like compression.

we are missing the FileReader interface that does the work of supporting both file path and file object as main argument to the parser constructor

FileReader also handles the context manager support, which is re-implemented directly on MzMLb, as well as parts of the XML interface for iterfind support.

h5py.File already handles path vs file-like object internally (though there are performance consequences to using a file-like object, of course, so I don't want to just default to always giving it a file-like object). The h5py.File object, however, doesn't give the same interface that a file-like object would, so much of that has to be mocked or redirected. In an effort to look exactly like the MzML reader, those methods often end up being explicitly redirected to the mzml_parser attribute, but that only superficially works, and won't give you things like fileno, only read, seek, tell, and closed.

The FileReader class is tightly coupled to _file_obj, but I could inherit from TimeOrderedIndexedReaderMixin, most of whose methods I'm already overriding anyway, to preserve that kind of type checking. If we need to preserve FileReader as a base type, then we could turn it into an abstract base class and use the ABC.register method to satisfy issubclass(MzMLb, FileReader)?

As for the XML-related duplication, again I've forwarded the relevant methods.
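A minimal sketch of the forwarding described above, under stated assumptions: `InnerParser` and `Wrapper` are hypothetical stand-ins for the wrapped mzML parser and for MzMLb, and an `io.BytesIO` stands in for the XML stream stored inside the HDF5 file. Only `read`, `seek`, `tell`, and `closed` are forwarded, so something like `fileno` is simply absent.

```python
import io


class InnerParser:
    """Hypothetical stand-in for the wrapped mzML parser."""

    def __init__(self, stream):
        self._stream = stream

    def read(self, n=-1):
        return self._stream.read(n)

    def seek(self, offset, whence=0):
        return self._stream.seek(offset, whence)

    def tell(self):
        return self._stream.tell()

    @property
    def closed(self):
        return self._stream.closed


class Wrapper:
    """Hypothetical stand-in for MzMLb: it owns a parser (composition)
    and explicitly forwards the small file-like surface it supports."""

    def __init__(self, stream):
        self.mzml_parser = InnerParser(stream)

    def read(self, n=-1):
        return self.mzml_parser.read(n)

    def seek(self, offset, whence=0):
        return self.mzml_parser.seek(offset, whence)

    def tell(self):
        return self.mzml_parser.tell()

    @property
    def closed(self):
        return self.mzml_parser.closed


reader = Wrapper(io.BytesIO(b"<mzML/>"))
assert reader.read(2) == b"<m"
assert reader.tell() == 2
assert not hasattr(reader, "fileno")  # not forwarded, so not available
```

The cost of this approach is one forwarding method per supported call, which is the duplication being discussed here.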

Much of this issue could be avoided if, instead of insisting on using straightforward composition, I turn this object hierarchy inside-out: let MzMLb be ExternalDataMzML, attach all the lower-level HDF5 manipulation to a new class, and let ExternalDataMzML own that object. The catch is that the HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than the normal MzML.__init__. This would lead to a much more convoluted implementation of reset and the indexing cascade.

mobiusklein · 2021-03-31T01:37:20Z

Ignore the complaints about ArrayConversionMixin. I see what you were referring to, it wasn't about decompression, but dtype casting, which I had forgotten about.

levitsky · 2021-03-31T16:35:03Z

Much of this issue could be avoided if instead of insisting on using straight-forward composition, I turn this object hierarchy inside-out. Let MzMLb be ExternalDataMzML, and then attach all the lower-level HDF5 manipulation to a new class and let ExternalDataMzML own that object, except that HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than the normal MzML.__init__. This would lead to a much more convoluted implementation of reset and the indexing cascade.

This would seem more natural to me, conceptually, but if it feels "inside out" to you, I won't insist on it. In that case using the ABC machinery should probably work, or I can try and see if I can put something together myself. I haven't thought this through as far as you have yet.

mobiusklein · 2021-04-01T03:08:32Z

Made FileReader an abstract base class (with no abstract methods) and then registered MzMLb with it. A workaround that may only add a few nanoseconds to all future reader instantiations.
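For anyone following along, a minimal sketch of the registration trick using only the standard library. The two classes here are simplified stand-ins, not the real pyteomics classes: the point is that `ABC.register` makes `isinstance` and `issubclass` checks pass without putting `FileReader` in the method resolution order.

```python
from abc import ABC


class FileReader(ABC):
    """Stand-in for the real base class: an ABC with no abstract methods,
    so existing subclasses keep working unchanged."""

    def __init__(self, source):
        self._source = source


class MzMLb:
    """Stand-in for the real reader; note that it does NOT inherit
    from FileReader."""

    def __init__(self, path):
        self.path = path


# Register MzMLb as a virtual subclass of FileReader.
FileReader.register(MzMLb)

reader = MzMLb("example.mzMLb")  # hypothetical file name, never opened
assert isinstance(reader, FileReader)
assert issubclass(MzMLb, FileReader)
assert FileReader not in MzMLb.__mro__  # nothing is actually inherited
```

Because nothing is inherited, the registered class still has to provide every method callers expect from a FileReader itself.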

I haven't expanded on mutation of the HDF5 file, but the option is there. I'm reasonably happy with the state of things if you are.

mobiusklein added 4 commits March 27, 2021 13:58

Draft mzmlb implementation

a48178e

Add a few missing frills

82b134a

Detect encoding from userparams

d1bcbef

Found correct cv parameter

2cc759d

mobiusklein added 2 commits March 28, 2021 18:51

Finish matching interfaces

847888f

Finalize implementation

a6bf597

mobiusklein marked this pull request as ready for review March 29, 2021 01:44

levitsky reviewed Mar 30, 2021

mobiusklein and others added 3 commits March 30, 2021 12:00

Update doc/source/conf.py

4700c08

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Update pyteomics/mzmlb.py

143ce81

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Update tests/test_mzmlb.py

5138bef

Fair point. Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Implement decode_binary and more direct time indexing on MzMLb

cf25455

Register MzMLb with FileReader

adf91f5

levitsky merged commit 87b564f into levitsky:master Apr 1, 2021

mobiusklein deleted the feature/mzmlb branch April 1, 2021 13:41
