Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index additional tags for mzIdentML #34

Merged
merged 8 commits into from
Mar 26, 2021

Conversation

mobiusklein
Copy link
Contributor

@mobiusklein mobiusklein commented Mar 23, 2021

Closes #33

The root cause in #33 is that when processing a FragmentationArray, the measure_ref attribute must be resolved. This leads to a get_by_id call, but the requested ID is not in the index, so it leads to a sequential scan along the entire file. Each SpectrumIdentificationItem contains at least three FragmentationArray, each of which references all three Measure elements, leading to the seeming stand-still for a single SpectrumIdentificationResult

To avoid this cropping up in the future, I've added all reference-able element types to the offset index. We'll see if this leads to an undesirable slowdown in initialization speed as it makes the indexer generate a more complex tokenizer pattern.

@levitsky
Copy link
Owner

levitsky commented Mar 23, 2021

Thank you! I have tried this with the file from #33 and the slowdown in indexing is about ~10%, while the speedup in getting the first item is ~25x, which is great.

The parsing is still very slow though, with 6.4s retrieval time on my machine for the first item.

@mobiusklein
Copy link
Contributor Author

That "slow start" is because Iterfind doesn't use the index when you iterate over it, only during entity lookup for _get_info. Those six seconds are spent parsing through the first half of the mzIdentML file.

We could have xml.Iterfind.__iter__ loop over the index if the pattern is just the tag name like we do for xml.Iterfind.map.

@mobiusklein
Copy link
Contributor Author

Now Iterfind over an indexed tag will use the index when the index is a HierarchicalOffsetIndex and a bunch of other preconditions are met.

@levitsky
Copy link
Owner

levitsky commented Mar 24, 2021

Holy cow, I didn't realize we weren't doing that! This is much better!
Arguably we could split this into IndexedIterfind so that it's easier to track the behavioral difference between XML and IndexedXML

@mobiusklein
Copy link
Contributor Author

Done. Now trying to reader.iterfind(query).map(fn) when query isn't indexed leads to a more helpful error.

@levitsky
Copy link
Owner

Thank you, this is great. I'm ready to merge this, was waiting for a reply from @jgriss but I'd rather merge and release shortly.
Just a minor question out of curiosity, why did you leave is_indexed in Iterfind rather than IndexedIterfind? It's only called in the latter.
Is it for possible calls from user code? If I'm reading it correctly, is_indexed should always be False when called from Iterfind; I would have just moved it to the subclass and optionally have a constant False in Iterfind.

@mobiusklein
Copy link
Contributor Author

I left it there because it required the fewest changes to the code to achieve the desired result, have an attribute in both classes that reflected whether the query was indexed already or not. I moved the implementation to be static the way you described for simplicity.

Since I sense this coming, I've made Iterfind into almost a Sequence, in that you can use x.iterind("Peptide")[123] or slice it. When it's not indexed, it uses a slow itertools.islice driven approach.

@levitsky
Copy link
Owner

Thank you again. I hope I don't come across as trying to make you do all the work; I ask these questions because I value your input and I don't want to hijack your PR with changes that might not even be a good idea (now that I think of it, that's what suggestions in review mode are for; I'll remember to use them next time).

Indexing of Iterfind is neat! I don't think I would have thought of it myself, at least not now.

@mobiusklein
Copy link
Contributor Author

No worries. I implemented these extras for fun.

I think we talked about this a few years ago when implementing the Iterfind and time facades. The idea you proposed was to go even another step forward, to create attributes on readers that looked like a sequence-type but were really smart iterators like this. Something like MzML.spectra and MzML.chromatograms or MzIdentML.peptides and MzIdentML.identifications which would just become reader.iterfind("spectrum") and reader.iterfind("chromatogram"), or reader.iterfind("Peptide") and reader.iterfind("SpectrumIdentificationResult").

I was hesitant to do this at the time because I thought we should be thinking about a wrapper around the raw dictionaries first, but after playing with the idea I couldn't find a satisfactory implementation that wouldn't break backwards compatibility.

@levitsky levitsky merged commit ceeb4f7 into levitsky:master Mar 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pyteomics fails to parse mzId files with <Fragmentation> information
2 participants