misc: add a DisjointSet data structure #3621

Merged
merged 4 commits into from
Dec 16, 2024

superlopuh
Member

This is useful in a few places, notably in bufferization.

@superlopuh superlopuh added the misc Miscellaneous label Dec 11, 2024
@superlopuh superlopuh self-assigned this Dec 11, 2024

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.50%. Comparing base (f7d39a1) to head (18b6458).
Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3621      +/-   ##
==========================================
+ Coverage   90.41%   90.50%   +0.08%     
==========================================
  Files         471      474       +3     
  Lines       59138    59429     +291     
  Branches     5611     5642      +31     
==========================================
+ Hits        53471    53785     +314     
+ Misses       4224     4206      -18     
+ Partials     1443     1438       -5     

Collaborator

@compor compor left a comment

All good, but a few thoughts:

  • Whenever I see typical data structures or algorithms like this, I wonder: why not use an external library?
    I know we want to keep dependencies slim, but we have other things that are way less stable.
    At the end of the day we are a compiler framework, and that's part of the territory.
    Using a robust framework like scipy or networkx might save us the time spent in undergrad course territory.

Is this also to be used in the EqSat stuff?

Some maybe more actionable suggestions:

  1. Add a connected method that takes 2 or more elements and checks if they are in the same union set? I'm a fan of Robert Sedgewick's API.

  2. Maybe it's time to organize utils a bit and add an algo and/or ds (or adt) subdir?
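
As an illustration of suggestion 1, here is a minimal sketch of a variadic `connected` check on an integer union-find. The class name `UnionFindSketch` and its method signatures are hypothetical and are not taken from the PR; the only idea borrowed from the comment above is "takes 2 or more elements and checks if they are in the same set".

```python
class UnionFindSketch:
    """Hypothetical minimal union-find over the integers 0..size-1."""

    def __init__(self, size: int) -> None:
        # Every element starts as the root of its own singleton set.
        self._parent = list(range(size))

    def find(self, value: int) -> int:
        # Walk to the root, halving the path along the way.
        while self._parent[value] != value:
            self._parent[value] = self._parent[self._parent[value]]
            value = self._parent[value]
        return value

    def union(self, lhs: int, rhs: int) -> bool:
        # Returns True if two distinct sets were merged.
        lhs_root, rhs_root = self.find(lhs), self.find(rhs)
        if lhs_root == rhs_root:
            return False
        self._parent[rhs_root] = lhs_root
        return True

    def connected(self, first: int, *rest: int) -> bool:
        # All elements are connected iff they share the root of `first`.
        root = self.find(first)
        return all(self.find(other) == root for other in rest)
```

Comparing roots through `find` rather than comparing raw parent entries keeps the check correct even when some parent pointers have not been compressed yet.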

Collaborator

@alexarice alexarice left a comment

I agree with @compor. It's not clear to me why basic data structures should be reimplemented.

Number of sets in this structure.
Note: This is O(n) as it needs to scan all parents.
"""
return len(set(self._parent))
Collaborator

Parent pointers are not eagerly updated, so I don't think this is correct. If this operation is needed, it's likely better to keep track of the number of sets in a variable.
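
To make the concern concrete, here is a small sketch, assuming a plain parent-list representation where `union` only re-parents one root. The class `CountingUnionFind` and its `set_count` method are hypothetical names for illustration, not the PR's code. It shows `len(set(parent))` overcounting after three unions, next to a counter that is decremented on every successful merge.

```python
class CountingUnionFind:
    """Hypothetical union-find that tracks the number of sets explicitly."""

    def __init__(self, size: int) -> None:
        self._parent = list(range(size))
        self._count = size  # one singleton set per element

    def find(self, value: int) -> int:
        # No path compression, so stale parent pointers stay visible.
        while self._parent[value] != value:
            value = self._parent[value]
        return value

    def union(self, lhs: int, rhs: int) -> None:
        lhs_root, rhs_root = self.find(lhs), self.find(rhs)
        if lhs_root == rhs_root:
            return
        self._parent[rhs_root] = lhs_root
        self._count -= 1  # two sets became one

    def set_count(self) -> int:
        return self._count


uf = CountingUnionFind(4)
uf.union(0, 1)  # parents: [0, 0, 2, 3]
uf.union(2, 3)  # parents: [0, 0, 2, 2]
uf.union(0, 2)  # parents: [0, 0, 0, 2]; element 3 still points at 2

naive = len(set(uf._parent))  # 2, because {0, 2} are both still parent values
tracked = uf.set_count()      # 1, all four elements are in one set
```

If an explicit counter is not wanted, counting the elements that are their own parent is another O(n) option that does not depend on pointers being up to date.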

Member Author

Good point, I'll remove it.

from typing import Generic, TypeVar


class IntDisjointSet:
Collaborator

Is the IntDisjointSet independently useful? Otherwise this feels like an unnecessary step to making the structure below.

Collaborator

Yeah, good point. I'm not sure why the generality here.
Also, if this is for the equality saturation tasks, I'm not familiar with how it is used there; maybe future uses warrant this?

Member Author

I feel like it's a nice separation, if only for testing purposes: it neatly encapsulates the core logic and adds negligible overhead. I expect that no one's going to use it, but I also see no harm in the separation.

For equality saturation, my understanding is that a persistent version of this tends to be used, and I also expect that a separation between the core logic and the hashable/generic helpers would be useful both for testing and readability.
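
The separation described here could look roughly like the sketch below: an integer-only core that holds the union-find logic, and a thin generic layer that only maps hashable values to indices. The names `IntCoreSketch` and `HashableDisjointSetSketch`, and all method signatures, are assumptions made for illustration; the PR's actual classes may differ.

```python
from collections.abc import Hashable
from typing import Generic, TypeVar

T = TypeVar("T", bound=Hashable)


class IntCoreSketch:
    """Hypothetical integer-only core: elements are 0, 1, 2, ..."""

    def __init__(self) -> None:
        self._parent: list[int] = []

    def add(self) -> int:
        # Allocate the next integer as a new singleton set.
        index = len(self._parent)
        self._parent.append(index)
        return index

    def find(self, index: int) -> int:
        # Walk to the root, halving the path along the way.
        while self._parent[index] != index:
            self._parent[index] = self._parent[self._parent[index]]
            index = self._parent[index]
        return index

    def union(self, lhs: int, rhs: int) -> bool:
        lhs_root, rhs_root = self.find(lhs), self.find(rhs)
        if lhs_root == rhs_root:
            return False
        self._parent[rhs_root] = lhs_root
        return True


class HashableDisjointSetSketch(Generic[T]):
    """Hypothetical generic wrapper: only translates values to indices."""

    def __init__(self) -> None:
        self._core = IntCoreSketch()
        self._index_of: dict[T, int] = {}
        self._values: list[T] = []

    def add(self, value: T) -> None:
        # Adding the same value twice is a no-op.
        if value not in self._index_of:
            self._index_of[value] = self._core.add()
            self._values.append(value)

    def find(self, value: T) -> T:
        # Raises KeyError for values that were never added.
        return self._values[self._core.find(self._index_of[value])]

    def union(self, lhs: T, rhs: T) -> bool:
        return self._core.union(self._index_of[lhs], self._index_of[rhs])
```

With this split, the core can be tested on plain integers, and the wrapper's tests only need to cover the value-to-index bookkeeping.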

@superlopuh
Member Author

Chris:

  1. I'll add it. I'm not familiar with Sedgewick's API, could you please send a link to it?
  2. Sounds good, but it should probably be its own PR.

To everyone who would prefer to add a dependency: please let me know where I can find one with this generic API for hashable things :)

@alexarice
Collaborator

https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.DisjointSet.html ?

@compor
Collaborator

compor commented Dec 11, 2024

> Chris:
>
> 1. I'll add it. I'm not familiar with Sedgewick's API, could you please send a link to it?
> 2. sounds good but should probably be its own PR
>
> To everyone who would prefer to add a dependency please let me know where I can find one with this generic API for hashable things :)

https://sedgewick.io/wp-content/uploads/2022/04/Algs01-UnionFind.pdf

He uses Java, but that's beside the point. It's a long-running Algo+DS course.

@superlopuh
Member Author

Scipy is a huge dependency to impose on all clients of xDSL

@superlopuh
Member Author

Not to mention that it's not pure Python

@superlopuh
Member Author

OTOH I'm happy to mirror scipy's API since users might already be familiar

@compor
Collaborator

compor commented Dec 11, 2024

Granted, scipy is a big dep and, most significantly, not pure Python. AFAIK, networkx is pure Python?

@compor
Collaborator

compor commented Dec 11, 2024

Maybe that's a discussion to have at the next meeting or the one after.

Collaborator

@alexarice alexarice left a comment

Based on the meeting:

  • We don't want to use scipy, as maintaining pure Python is a priority
  • We would potentially be up for using an external library providing a pure Python implementation, though we have not found one for this (@compor claims networkx has one)
  • We should just use this PR for now/ever

@compor
Collaborator

compor commented Dec 16, 2024

https://networkx.org/documentation/latest/_modules/networkx/utils/union_find.html#UnionFind.union

> Based on the meeting:
>
> * We don't want to use scipy as maintaining pure python is a priority
> * We would potentially be up for using an external library providing a pure python implementation, though have not found one for this (@compor claims networkx has one)
> * We should just use this PR for now/ever

@superlopuh superlopuh merged commit fd6296d into main Dec 16, 2024
15 checks passed
@superlopuh superlopuh deleted the sasha/misc/union-find branch December 16, 2024 14:28