-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft mzmlb implementation #35
Conversation
Another thing I like about this scheme is that because HDF5 isn't a single contiguous byte stream, it makes adding custom indices much easier and less invasive. The idea that you can just write extra arrays to the file without needing to completely re-write the file means we can add some mutability to this data reader if we want. I've gone ahead and written a pretty simple mzMLb writer in https://github.com/mobiusklein/psims to explore some of this eventually. |
After testing, locally, I'm pretty satisfied with where this implementation is. I've made educated guesses at what the future controlled vocabulary parameters the ProteoWizard implementation will use when merged, but I've retained the Biospi names too in case. The truncate-and-predict compression aides are implemented, but they require sequential computation so we can't vectorize them with NumPy. I added a new "extras" installation section to support this, and then made an "all" category for convenience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this is great!
One issue I want to raise for discussion is the fact that MzMLb
is detached from the general parser class hierarchy. I see why this is, and it has an MzML
subclass attached to it, but this may still result in some of the assumed behavior differing between MzMLb
and other public parser classes.
This is partially solved by populating mzml_args
with kwargs
, transparently passing through the usual arguments to that parser. However,
- array conversion (implemented by
ArrayConversionMixin
) doesn't work, because the overridden_handle_binary
doesn't call_convert_array
; - we are missing the
FileReader
interface that does the work of supporting both file path and file object as main argument to the parser constructor; FileReader
also handles the context manager support, which is re-implemented directly onMzMLb
, as well as parts of theXML
interface foriterfind
support.
Do you think we can (or should) plant MzMLb
deeper into the file parser class tree? I don't know if it can help save any space in terms of code, but for now isinstance
checks with abstract classes won't work with MzMLb
even where the class presents the necessary interfaces (e.g. TimeOrderedIndexedReader, XML or parts of it, etc.).
I also tried using the suggestions to iron out minor wrinkles I noticed.
Thank you again for yet another valuable contribution.
@@ -0,0 +1,445 @@ | |||
# -*- coding: utf8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# -*- coding: utf8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary because the citation includes UTF8 characters (en-dash and the like). If you prefer, I could manually retype it using only ASCII look-alikes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. Whichever way you prefer is fine. Sorry for the noise.
mzml_parser : :class:`~.ExternalDataMzML` | ||
The mzML parser for the XML stream inside the HDF5 file with | ||
special behavior for retrieving the out-of-band data arrays | ||
from their respective storage locations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mzml_parser : :class:`~.ExternalDataMzML` | |
The mzML parser for the XML stream inside the HDF5 file with | |
special behavior for retrieving the out-of-band data arrays | |
from their respective storage locations. | |
mzml_args : dict | |
A dictionary of keyword arguments to be passed to an :class:`~.ExternalDataMzML` | |
parser corresponding to the XML stream inside the HDF5 file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably document both. I'll look up how to document __init__
properly too.
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Fair point. Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
The arrays are transparently decompressed by HDF5 when they're read, so we don't need to explicitly decompress them. They're also stored directly, byte-for-byte in the HDF5 file, so no base64 decoding is needed. There are two scenarios where something more is required and that is partially implemented in I might be able to implement something that is more abstract to support
The As for the Much of this issue could be avoided if instead of insisting on using straight-forward composition, I turn this object hierarchy inside-out. Let |
Ignore the complaints about |
This would seem more natural to me, conceptually, but if it feels "inside out" to you, I won't insist on it. In that case using the ABC machinery should probably work, or I can try and see if I can put something together myself. I haven't thought this through as far as you have yet. |
Made I haven't expanded on mutation of the HDF5 file, but the option is there. I'm reasonably happy with the state of things if you are. |
This is a draft pull request for reading
mzMLb
files. Implementation based upon the reference implementation at https://github.com/biospi/pwiz.The built artifact from https://github.com/biospi/pwiz let me convert an mzML file to mzMLb and then read it using the new code in this branch. I don't think this is ready yet because there's information missing about how to detect the lossy compression scheme (
--mzLinear
and--intenLinear
or--mzDelta
and--intenDelta
).Source: https://doi.org/10.1021/acs.jproteome.0c00192
Helpful features that mzMLb provides include:
In order to use this though, we need to be able to read HDF5 files, potentially including extra HDF5 plugins to access more compressors. This adds depends on h5py and hdf5plugin as dependencies. They're pretty heavy, but well packaged. If we don't want these to just be optional dependencies but part of the core, I can make this a namespace member package instead.