-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize get_classes_by_slot
round 2
#304
base: main
Are you sure you want to change the base?
Optimize get_classes_by_slot
round 2
#304
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #304 +/- ##
=======================================
Coverage 62.88% 62.88%
=======================================
Files 62 62
Lines 8528 8528
Branches 2436 2435 -1
=======================================
Hits 5363 5363
Misses 2554 2554
Partials 611 611 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx!
I would like to test this with the main linkml repo before making a new (non-rc) runtime release, we don't have a great SOP for this yet...
would that be helpful? would that look like just running the main linkml tests with the PR'd version of linkml-runtime, or did you have something else in mind? for now I can just run them and post results |
That would work! IMO it's a bit clunky and it's a big ask any time there is a schemaview change but induced slots are central enough it's worth doing here, thx! |
So yes test failures, but i honestly can't tell if it's from this or not.. i'll investigate more thoroughly later, but would advise not merging yet. Tests also took 2 hours to run and i can't rly fathom why. will also profile that later.
expand/collapse pytest output
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marking as Request changes until we can figure out the downstream issue
OK figured out what was wrong - it was just that i had installed my local version of prefixmaps in my linkml environment to test linkml/prefixmaps#69 and hadn't fixed a bug in it. tests pass now, except for this is good 2 merge i think, but go ahead and test it urself! |
triggering a manual build to test the new workflow in |
Omg I hope it works, lmk if it doesnt and can debug |
Ok now that the action is in main i can test it on my fork. Sorry for the runaround with these CI actions, its specifically a weird thing for forks. Ill make it work |
Waiting for tests to run on the array branch and browsing through issues and PRs, noticed there was one for optimizing
get_classes_by_slot
#300 and any time there's aninduced_slot
perf thing i gotta check it out now.For non-induced slots, #300 is a perf wash (for 100 rounds through all slots in biolink, 6.77 before and 6.81 after), but when i went to profile it with induced slots i was surprised at how long the estimated time for a single round through was supposed to take - roughly 3 hours (~160,000x slower vs non-induced slots).
Using sets is a great idea for removing the step of making something unique at the end. we can take advantage of that further by
break
ing after the first set addition (since future set additions will do nothing).the problem is that we are still making basically the most expensive call you can do with schemaview, which is to call
induced_slot
on every combination of slot and class for a schema.induced_slot
is expensive in itself, but it's usually fine to do because you're only calling it in the context of a few classes and the caching can handle how slow it is, but since the call is every slot in the context of every class, no caching can be done.The fix is relatively simple - since
get_classes_by_slot
only cares about the existence of the slot on the class, not on its full induced form, there is no need to useinduced_slot
at all.class_induced_slots
iterates over the slots given byclass_slots
:linkml-runtime/linkml_runtime/utils/schemaview.py
Line 1368 in 7c311d9
so any slots that
class_induced_slots
knows about,class_slots
knows about. The implementation in this PR also avoids iterating over the classes twice, although that's also a perf wash because iterating through a list and comparing strings takes basically no time.So for the simple timing function
where i break out of getting the induced slots after 5 slots because it takes so long,
you get:
So i'll call the regular difference in the noise and the induced slot call roughly 1100x faster. A full run over biolink takes 12s vs ~3h.
attached profiling results where
beforest
is before 300,before
is 300 andafter
is this PR.profiling-get_classes_by_slot.zip