-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
misc: add a DisjointSet data structure #3621
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3621 +/- ##
==========================================
+ Coverage 90.41% 90.50% +0.08%
==========================================
Files 471 474 +3
Lines 59138 59429 +291
Branches 5611 5642 +31
==========================================
+ Hits 53471 53785 +314
+ Misses 4224 4206 -18
+ Partials 1443 1438 -5 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good, but a few thoughts:
- Whenever I see typical DS or algorithms like this, it makes me think why not use an external library?
I know we want to keep dependencies slim, but we have other things that are way less stable.
At the end of the day we are a compiler framework, and that's part of the territory.
Using a robust framework like scipy or networkx, might save us the time spent in undergrad course territory.
Is this also to be used in the EqSat stuff?
Some maybe more actionable suggestions:
-
Add a
connected
method that takes 2 or more elements and checks if they are in the same union set? I'm a fan of Robert Sedgewick's API. -
Maybe it's time to organize
utils
a bit and add analgo
and/ords
(oradt
) subdir?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @compor. It's not clear to me why basic data structures should be reimplemented
xdsl/utils/disjoint_set.py
Outdated
Number of sets in this structure. | ||
Note: This is O(n) as it needs to scan all parents. | ||
""" | ||
return len(set(self._parent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parent pointers are not eagerly updated so I don't think this is correct? If this operation is needed then it's likely better to keep track of the number of sets in a variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point I'll remove it
from typing import Generic, TypeVar | ||
|
||
|
||
class IntDisjointSet: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the IntDisjointSet independently useful? Otherwise this feels like an unnecessary step to making the structure below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. I'm not sure why the generality here.
Also, if this is for the equality saturation tasks, I'm not familiar how it is used there; maybe future uses warrant this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it's a nice separation if only for testing purposes, it neatly encapsulates the core logic, and adds negligeable overhead. I expect that no-one's going to use it, but I also see no harm in the separation.
For equality saturation, my understanding is that a persistent version of this tends to be used, and I also expect that a separation between the core logic and the hashable/generic helpers would be useful both for testing and readability.
Chris:
To everyone who would prefer to add a dependency please let me know where I can find one with this generic API for hashable things :) |
https://sedgewick.io/wp-content/uploads/2022/04/Algs01-UnionFind.pdf He uses Java, but that's besides the point. It's a long-running Algo+DS course. |
Scipy is a huge dependency to impose on all clients of xDSL |
Not to mention that it's not Pure python |
OTOH I'm happy to mirror scipy's API since users might already be familiar |
granted, scipy is a big dep and most significantly, not pure Python. AFAIK, networkx is pure python? |
Maybe that's a discussion to have at the next meeting or the one after. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the meeting:
- We don't want to use scipy as maintaining pure python is a priority
- We would potentially be up for using an external library providing a pure python implementation, though have not found one for this (@compor claims networkx has one)
- We should just use this PR for now/ever
https://networkx.org/documentation/latest/_modules/networkx/utils/union_find.html#UnionFind.union
|
This is useful in a few places, notably in bufferization.