Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update hashbrown #445

Merged
merged 2 commits into from
Oct 24, 2024
Merged

Conversation

thaliaarchi
Copy link
Contributor

hashbrown 0.15 was released this month, which notably removed the RawTable API in favor of HashTable and changed its hasher from ahash to the faster foldhash.

  • Migrate to HashTable
  • Update indexmap to bump its hashbrown dependency
  • Refactor a manual implementation of an index map with cloning in TermDag to indexmap::IndexMap

Besides this, Max's symbol_table still requires an older version of hashbrown in the lockfile. I have a draft to update that and could rebase to include it, but didn't figure out how to run its Criterion benchmarks.

@thaliaarchi thaliaarchi requested a review from a team as a code owner October 18, 2024 09:27
@thaliaarchi thaliaarchi requested review from ajpal and removed request for a team October 18, 2024 09:27
Comment on lines +24 to +25
/// A bidirectional map between deduplicated `Term`s and indices.
nodes: IndexSet<Term>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change this to pub? I expose the TermDag fields in the Python bindings, because it's used when getting the extracted node(s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to create getters for what you need? Then it matters less what it's stored as. Since there were no other uses of the field, it looked safe to make private; a method would also make that more clear. pub is fine too, though 🤷‍♀️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds good to me.

Looking through the usages in Python, it seems like the only things I do with a termdag is call term_to_expr and then lookup a term based on its id (termdag.nodes[term_id])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking up a term by TermId is already exposed via TermDag::get. So it sounds like what you need is already public?

What about the conversion in convert_struct! in your earlier comment? Would you need a fn TermDag::iter(&self) -> impl Iterator<Item = &Term>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naw that's ok, I can just not expose the TermDag struct like that, and instead just expose a few methods on it.

Copy link

codspeed-hq bot commented Oct 18, 2024

CodSpeed Performance Report

Merging #445 will not alter performance

Comparing thaliaarchi:update-hashbrown (90e6e69) with main (b0db068)

Summary

✅ 6 untouched benchmarks

saulshanabrook
saulshanabrook previously approved these changes Oct 18, 2024
Copy link
Member

@saulshanabrook saulshanabrook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR!

Overall this looks good, I like the simplification of the TermDag from this change.

In terms of the benchmarks, we just added them and are still refining their use. I don't think the slowdowns listed are significant. I have an open PR (#444) to turn off these shorter benchmarks, which have high variability due to indeterminism in the memory allocator.

If you click on the details for one of the "slowdowns" you can see that it's due to allocation during parsing:

Screenshot 2024-10-18 at 9 55 13 AM

The longer running benchmarks, like eggcc-extraction and math-microbenchmark seem to have no change from this PR.

@Alex-Fischman
Copy link
Contributor

+1 to not letting "performance regressions" in the parser block this PR

@thaliaarchi
Copy link
Contributor Author

thaliaarchi commented Oct 19, 2024

I've pushed a commit, which replaces symbol_table with my fork that bumps their version of hashbrown. It's a hack to run benchmarks in CI. I'll then submit a PR to symbol_table, depending on the results, and rebase here.

It looks like running benchmarks isn't automatic; could you trigger another run?

@saulshanabrook
Copy link
Member

@thaliaarchi I believe the benchmarks have run on your most recent push! The benchmark comment gets updated whenever a new run is processed in this branch.

It seems like the only regression is in cykjson. I wouldn't personally let that block merging in this PR, but I am not very familiar with what that example is for.

@yihozhang
Copy link
Collaborator

cykjson is a small cool egglog example that does the CYK parsing algorithm of JSON-like strings. It is a more Datalog-like workload (dynamic programming) with some e-class manipulations

@saulshanabrook saulshanabrook dismissed their stale review October 24, 2024 01:54

Need to update with expose termdag api for Python

hashbrown 0.15 removed the RawTable API in favor of HashTable; migrate
to that. It also switched to foldhash, a faster hasher than ahash.
Update indexmap too, which depends on hashbrown.
This removes the need to duplicate `Term`s for hash-consing.
@thaliaarchi
Copy link
Contributor Author

I dropped the commit for benchmarking updating symbol_table. If there's a response there later, it can be updated here separately.

I also changed fn TermDag::get(&self, id: TermId) -> Term to return &Term instead. All internal usages use the result by reference, so cloning is unnecessary. Does this affect anything externally?

@Alex-Fischman
Copy link
Contributor

I also changed fn TermDag::get(&self, id: TermId) -> Term to return &Term instead. All internal usages use the result by reference, so cloning is unnecessary. Does this affect anything externally?

This is good, egglog is unstable and users can clone if they need it.

@saulshanabrook saulshanabrook merged commit af49ae2 into egraphs-good:main Oct 24, 2024
5 checks passed
@saulshanabrook
Copy link
Member

saulshanabrook commented Oct 24, 2024

Thanks @thaliaarchi for working on this and responding to all the feedback! If you have anything to add, we are also discussing the tradeoffs with hash performance and determinism in this post: #439 (comment)

EDIT: It looks like these changes also caused a 7% speedup in the biggest benchmark (added to main after this PR was started, so wasn't included in the comparison here), which is pretty nice! https://codspeed.io/egraphs-good/egglog/runs/671a868380493f6bc05c7bfc

@thaliaarchi
Copy link
Contributor Author

@saulshanabrook Thanks! I'm glad to see such speedups!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants