Index additional tags for mzIdentML #34

mobiusklein · 2021-03-23T13:28:14Z

Closes #33

The root cause in #33 is that when processing a FragmentationArray, the measure_ref attribute must be resolved. This leads to a get_by_id call, but the requested ID is not in the index, so it leads to a sequential scan along the entire file. Each SpectrumIdentificationItem contains at least three FragmentationArray, each of which references all three Measure elements, leading to the seeming stand-still for a single SpectrumIdentificationResult

To avoid this cropping up in the future, I've added all reference-able element types to the offset index. We'll see if this leads to an undesirable slowdown in initialization speed as it makes the indexer generate a more complex tokenizer pattern.

levitsky · 2021-03-23T16:11:21Z

Thank you! I have tried this with the file from #33 and the slowdown in indexing is about ~10%, while the speedup in getting the first item is ~25x, which is great.

The parsing is still very slow though, with 6.4s retrieval time on my machine for the first item.

mobiusklein · 2021-03-23T18:16:59Z

That "slow start" is because Iterfind doesn't use the index when you iterate over it, only during entity lookup for _get_info. Those six seconds are spent parsing through the first half of the mzIdentML file.

We could have xml.Iterfind.__iter__ loop over the index if the pattern is just the tag name like we do for xml.Iterfind.map.

…m the current file pointer

mobiusklein · 2021-03-23T22:19:18Z

Now Iterfind over an indexed tag will use the index when the index is a HierarchicalOffsetIndex and a bunch of other preconditions are met.

levitsky · 2021-03-24T16:22:31Z

Holy cow, I didn't realize we weren't doing that! This is much better!
Arguably we could split this into IndexedIterfind so that it's easier to track the behavioral difference between XML and IndexedXML

mobiusklein · 2021-03-24T20:07:18Z

Done. Now trying to reader.iterfind(query).map(fn) when query isn't indexed leads to a more helpful error.

levitsky · 2021-03-25T15:45:12Z

Thank you, this is great. I'm ready to merge this, was waiting for a reply from @jgriss but I'd rather merge and release shortly.
Just a minor question out of curiosity, why did you leave is_indexed in Iterfind rather than IndexedIterfind? It's only called in the latter.
Is it for possible calls from user code? If I'm reading it correctly, is_indexed should always be False when called from Iterfind; I would have just moved it to the subclass and optionally have a constant False in Iterfind.

mobiusklein · 2021-03-25T18:01:22Z

I left it there because it required the fewest changes to the code to achieve the desired result, have an attribute in both classes that reflected whether the query was indexed already or not. I moved the implementation to be static the way you described for simplicity.

Since I sense this coming, I've made Iterfind into almost a Sequence, in that you can use x.iterind("Peptide")[123] or slice it. When it's not indexed, it uses a slow itertools.islice driven approach.

levitsky · 2021-03-25T22:30:57Z

Thank you again. I hope I don't come across as trying to make you do all the work; I ask these questions because I value your input and I don't want to hijack your PR with changes that might not even be a good idea (now that I think of it, that's what suggestions in review mode are for; I'll remember to use them next time).

Indexing of Iterfind is neat! I don't think I would have thought of it myself, at least not now.

mobiusklein · 2021-03-26T01:22:15Z

No worries. I implemented these extras for fun.

I think we talked about this a few years ago when implementing the Iterfind and time facades. The idea you proposed was to go even another step forward, to create attributes on readers that looked like a sequence-type but were really smart iterators like this. Something like MzML.spectra and MzML.chromatograms or MzIdentML.peptides and MzIdentML.identifications which would just become reader.iterfind("spectrum") and reader.iterfind("chromatogram"), or reader.iterfind("Peptide") and reader.iterfind("SpectrumIdentificationResult").

I was hesitant to do this at the time because I thought we should be thinking about a wrapper around the raw dictionaries first, but after playing with the idea I couldn't find a satisfactory implementation that wouldn't break backwards compatibility.

Expand indexed tags

0552db9

mobiusklein mentioned this pull request Mar 23, 2021

Pyteomics fails to parse mzId files with <Fragmentation> information #33

Closed

mobiusklein added 2 commits March 23, 2021 16:48

Allow indexed tags to iterate along the index instead of starting fro…

b214db3

…m the current file pointer

More defensive

7b02f8e

mobiusklein added 3 commits March 24, 2021 14:49

Split up Iterfind for indexed and not indexed queries

9b5e166

Typo

fe646fd

Properly initialize indexed iterfind with convoluted inheritance

83ec8fa

Make Iterfind pseudo-Sequence-like

5d4fe06

Make Iterfind pseudo-Sequence-like

63cdf4c

levitsky merged commit ceeb4f7 into levitsky:master Mar 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index additional tags for mzIdentML #34

Index additional tags for mzIdentML #34

mobiusklein commented Mar 23, 2021 •

edited

Loading

levitsky commented Mar 23, 2021 •

edited

Loading

mobiusklein commented Mar 23, 2021

mobiusklein commented Mar 23, 2021

levitsky commented Mar 24, 2021 •

edited

Loading

mobiusklein commented Mar 24, 2021

levitsky commented Mar 25, 2021

mobiusklein commented Mar 25, 2021

levitsky commented Mar 25, 2021

mobiusklein commented Mar 26, 2021

Index additional tags for mzIdentML #34

Index additional tags for mzIdentML #34

Conversation

mobiusklein commented Mar 23, 2021 • edited Loading

levitsky commented Mar 23, 2021 • edited Loading

mobiusklein commented Mar 23, 2021

mobiusklein commented Mar 23, 2021

levitsky commented Mar 24, 2021 • edited Loading

mobiusklein commented Mar 24, 2021

levitsky commented Mar 25, 2021

mobiusklein commented Mar 25, 2021

levitsky commented Mar 25, 2021

mobiusklein commented Mar 26, 2021

mobiusklein commented Mar 23, 2021 •

edited

Loading

levitsky commented Mar 23, 2021 •

edited

Loading

levitsky commented Mar 24, 2021 •

edited

Loading