Index additional tags for mzIdentML #34
Thank you! I have tried this with the file from #33: indexing is about 10% slower, while getting the first item is about 25x faster, which is great. The parsing is still very slow though, with a 6.4 s retrieval time for the first item on my machine.
That "slow start" is because We could have
…m the current file pointer
Now
Holy cow, I didn't realize we weren't doing that! This is much better!
Done. Now trying to
Thank you, this is great. I'm ready to merge this. I was waiting for a reply from @jgriss, but I'd rather merge and release shortly.
I left it there because it required the fewest changes to the code to achieve the desired result: an attribute in both classes that reflects whether the query was already indexed or not. I moved the implementation to be static the way you described, for simplicity. Since I sense this coming, I've made
Thank you again. I hope I don't come across as trying to make you do all the work; I ask these questions because I value your input and I don't want to hijack your PR with changes that might not even be a good idea (now that I think of it, that's what suggestions in review mode are for; I'll remember to use them next time). Indexing of Iterfind is neat! I don't think I would have thought of it myself, at least not now.
No worries. I implemented these extras for fun. I think we talked about this a few years ago when implementing the I was hesitant to do this at the time because I thought we should be thinking about a wrapper around the raw dictionaries first, but after playing with the idea I couldn't find a satisfactory implementation that wouldn't break backwards compatibility.
Closes #33
The root cause in #33 is that when processing a `FragmentationArray`, the `measure_ref` attribute must be resolved. This leads to a `get_by_id` call, but the requested ID is not in the index, so it leads to a sequential scan along the entire file. Each `SpectrumIdentificationItem` contains at least three `FragmentationArray` elements, each of which references all three `Measure` elements, leading to the seeming stand-still for a single `SpectrumIdentificationResult`.
To avoid this cropping up in the future, I've added all reference-able element types to the offset index. We'll see if this leads to an undesirable slowdown in initialization speed as it makes the indexer generate a more complex tokenizer pattern.
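As a rough way to reason about that trade-off, the sketch below builds a single alternation pattern from a list of tag names and times indexing with a short and a long list. It assumes a regex-based tokenizer, which may differ from the actual indexer; `make_tokenizer`, `index_offsets`, the tag lists, and the synthetic `SAMPLE` data are all hypothetical.

```python
import re
import timeit


def make_tokenizer(tag_names):
    """Compile one pattern matching the opening tag of any listed element."""
    alternation = b'|'.join(re.escape(name) for name in sorted(tag_names))
    # \b after the name keeps "Peptide" from matching "PeptideEvidence".
    return re.compile(rb'<(' + alternation + rb')\b[^>]*?\bid="([^"]*)"')


def index_offsets(data, tag_names):
    """Return {tag: {id: byte offset}} for every listed element in data."""
    offsets = {}
    for match in make_tokenizer(tag_names).finditer(data):
        tag = match.group(1).decode()
        elem_id = match.group(2).decode()
        offsets.setdefault(tag, {})[elem_id] = match.start()
    return offsets


# Synthetic input (an assumption, not real mzIdentML).
SAMPLE = b''.join(
    b'<Measure id="m%d"/><Peptide id="p%d"/>'
    b'<SpectrumIdentificationResult id="r%d"/>' % (i, i, i)
    for i in range(1000)
)
FEW = [b'SpectrumIdentificationResult']
MANY = [b'SpectrumIdentificationResult', b'Measure', b'Peptide']

if __name__ == '__main__':
    # Print pattern length and wall time for each tag list.
    for label, tags in (('few', FEW), ('many', MANY)):
        seconds = timeit.timeit(lambda: index_offsets(SAMPLE, tags), number=20)
        print(label, len(make_tokenizer(tags).pattern), round(seconds, 4))
```

The timings depend on the machine and the input, so they are only indicative; running the same comparison on the file from #33 would give a more meaningful figure.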