Update hashbrown #445
/// A bidirectional map between deduplicated `Term`s and indices.
nodes: IndexSet<Term>,
Could we change this to `pub`? I expose the `TermDag` fields in the Python bindings, because it's used when getting the extracted node(s).
Would it be better to create getters for what you need? Then it matters less what it's stored as. Since there were no other uses of the field, it looked safe to make private; a method would also make that more clear. `pub` is fine too, though 🤷♀️
Yeah, that sounds good to me.
Looking through the usages in Python, it seems like the only things I do with a termdag are call `term_to_expr` and then look up a term based on its id (`termdag.nodes[term_id]`).
Looking up a term by `TermId` is already exposed via `TermDag::get`. So it sounds like what you need is already public?
What about the conversion in `convert_struct!` in your earlier comment? Would you need a `fn TermDag::iter(&self) -> impl Iterator<Item = &Term>`?
Naw that's ok, I can just not expose the TermDag struct like that, and instead just expose a few methods on it.
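For readers following the thread, the accessor approach discussed above could look roughly like the sketch below. It is not egglog's code: `Term` is a cut-down stand-in, `add` is a hypothetical helper, and a `Vec` plus `HashMap` from the standard library stand in for `IndexSet<Term>` so the example compiles without extra crates. Only the names `TermDag`, `TermId`, `get`, and `iter` come from the conversation, and their exact signatures here are assumptions.

```rust
use std::collections::HashMap;

/// Cut-down stand-in for egglog's `Term` (assumption: the real enum differs).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Term {
    Lit(i64),
    Var(String),
    App(String, Vec<TermId>),
}

pub type TermId = usize;

#[derive(Default)]
pub struct TermDag {
    // Private on purpose: callers use the methods below, so the storage
    // type can change without breaking bindings.
    nodes: Vec<Term>,
    ids: HashMap<Term, TermId>,
}

impl TermDag {
    /// Hypothetical helper: insert a term if it is new and return its id.
    pub fn add(&mut self, term: Term) -> TermId {
        if let Some(&id) = self.ids.get(&term) {
            return id;
        }
        let id = self.nodes.len();
        self.nodes.push(term.clone());
        self.ids.insert(term, id);
        id
    }

    /// Look up a term by id. Panics if the id is out of range.
    pub fn get(&self, id: TermId) -> &Term {
        &self.nodes[id]
    }

    /// Iterate over all terms in insertion order.
    pub fn iter(&self) -> impl Iterator<Item = &Term> {
        self.nodes.iter()
    }
}
```

With this shape, a binding layer would call `dag.get(term_id)` instead of indexing `dag.nodes[term_id]`, and would not need to know how the nodes are stored.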
CodSpeed Performance Report
Merging #445 will not alter performance
Thanks for this PR!
Overall this looks good, I like the simplification of the TermDag from this change.
In terms of the benchmarks, we just added them and are still refining their use. I don't think the slowdowns listed are significant. I have an open PR (#444) to turn off these shorter benchmarks, which have high variability due to indeterminism in the memory allocator.
If you click on the details for one of the "slowdowns" you can see that it's due to allocation during parsing:
The longer-running benchmarks, like eggcc-extraction and math-microbenchmark, seem to have no change from this PR.
+1 to not letting "performance regressions" in the parser block this PR
I've pushed a commit, which replaces `symbol_table` with my fork that bumps their version of `hashbrown`. It's a hack to run benchmarks in CI. I'll then submit a PR to `symbol_table`, depending on the results, and rebase here. It looks like running benchmarks isn't automatic; could you trigger another run?
@thaliaarchi I believe the benchmarks have run on your most recent push! The benchmark comment gets updated whenever a new run is processed in this branch. It seems like the only regression is in
cykjson is a small cool egglog example that does the CYK parsing algorithm of JSON-like strings. It is a more Datalog-like workload (dynamic programming) with some e-class manipulations
Need to update to expose the TermDag API for Python
hashbrown 0.15 removed the RawTable API in favor of HashTable; migrate to that. It also switched to foldhash, a faster hasher than ahash. Update indexmap too, which depends on hashbrown.
This removes the need to duplicate `Term`s for hash-consing.
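To illustrate what "no need to duplicate `Term`s" means, here is a minimal sketch of hash-consing where each value is stored exactly once and the lookup table holds only indices. It is not the PR's implementation: `Interner` and all of its methods are hypothetical names, and a standard-library `HashMap` keyed by hash value stands in for a table like hashbrown's `HashTable`, which permits lookups with a caller-supplied hash and equality check.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical interner: values live once in `nodes`; the table stores
/// only hashes and indices, never a second copy of a value.
pub struct Interner<T> {
    nodes: Vec<T>,
    // hash of a value -> indices into `nodes` whose values have that hash
    buckets: HashMap<u64, Vec<usize>>,
}

impl<T: Hash + Eq> Interner<T> {
    pub fn new() -> Self {
        Interner { nodes: Vec::new(), buckets: HashMap::new() }
    }

    fn hash_of(value: &T) -> u64 {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        hasher.finish()
    }

    /// Return the index of `value`, inserting it if unseen. No clone needed.
    pub fn intern(&mut self, value: T) -> usize {
        let hash = Self::hash_of(&value);
        let nodes = &mut self.nodes;
        let bucket = self.buckets.entry(hash).or_default();
        // Compare against the stored values instead of keeping a copy as a key.
        if let Some(&idx) = bucket.iter().find(|&&i| nodes[i] == value) {
            return idx;
        }
        let idx = nodes.len();
        nodes.push(value);
        bucket.push(idx);
        idx
    }

    /// Look up a stored value by index. Panics if out of range.
    pub fn get(&self, idx: usize) -> &T {
        &self.nodes[idx]
    }

    /// Number of distinct values stored.
    pub fn len(&self) -> usize {
        self.nodes.len()
    }
}
```

A map keyed by the value itself (for example `HashMap<T, usize>` next to a `Vec<T>`) would need each value twice, once as the key and once in the vector. Keeping indices in the table and comparing through the vector avoids that second copy.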
a92c5b7 to 90e6e69
I dropped the commit for benchmarking updating I also changed
This is good, egglog is unstable and users can clone if they need it.
Thanks @thaliaarchi for working on this and responding to all the feedback! If you have anything to add, we are also discussing the tradeoffs with hash performance and determinism in this post: #439 (comment) EDIT: It looks like these changes also caused a 7% speedup in the biggest benchmark (added to main after this PR was started, so wasn't included in the comparison here), which is pretty nice! https://codspeed.io/egraphs-good/egglog/runs/671a868380493f6bc05c7bfc
@saulshanabrook Thanks! I'm glad to see such speedups!
`hashbrown` 0.15 was released this month, which notably removed the `RawTable` API in favor of `HashTable` and changed its hasher from `ahash` to the faster `foldhash`.
- Migrate to `HashTable`
- Update `indexmap` to bump its `hashbrown` dependency
- Switch `TermDag` to `indexmap::IndexMap`
Besides this, Max's `symbol_table` still requires an older version of `hashbrown` in the lockfile. I have a draft to update that and could rebase to include it, but didn't figure out how to run its Criterion benchmarks.