-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(common): add a pure python egraph implementation #5781
Conversation
Hey @kszucs! This is cool to see how you are integrating this with Ibis! I have also been experimenting with different ways of exposing e-graphs in Python (egraphs-good/egglog#126) as well. I had a few thoughts about the differences in API design. If you are interested in talking about it, let me know, I am happy to post here or jump on a call. |
Hey @saulshanabrook!
I know, I have been following your work on the I started to work on a pure python implementation to support arbitrary python callables both when searching and when replacing enodes. It also makes it easier to customize other parts of the match-replace loop.
Definitely, though I'd need to first have a hand-on experiment with the new high-level API. Also @jcrist might be also interested. |
Oh cool! This was my first Rust project so don't look too close 🫣. It's funny to not know if anyone is looking at/trying something on Github.
Yeah, I have gotten quite fond of that method of having Python functions with type annotations be the way to define operations instead of using classes... Since it integrates so nicely with the static typing! (As a side note, it's too bad that MyPy now complains at empty bodies with It's also fun how to make the Python typing match up with the
Ah yeah, that makes sense. I wonder more generally about that in egg-smol, can anything that was written as a custom replacement be represented in egg-smol in some way? If you have particular examples that seem tricky to do syntactically, I would be interested in those! Could then ask upstream what the strategy would be in egg-smol for them. Also, I did have this idea at one point of adding Python objects as a builtin sort to egg-smol, since you can add new ones at runtime if you wanted other Python values besides ints/strings/floats as leaf nodes. |
Oh noo, the bindings are great! It was due to egg-smol library itself :)
The egglog way of thinking requires more of a logical programming view which can be hard for a usual python developer, most of the time it's easier to calculate conditions in an imperative fashion (which could be memoized for better runtime performance).
That would be pretty handy, would that allow calling arbitrary python functions? |
Oh, one other thing I forgot to mention about the high level design... I think it's interesting if the node/operation definition can be used internally but also be a user-facing API. Like for example in the tests here there are things like: class MyNode(Concrete, Node):
pass
class MyLit(MyNode):
value: int
class MyAdd(MyNode):
a: MyNode
b: MyNode
class MyMul(MyNode):
a: MyNode
b: MyNode
# number postfix highlights the depth of the node
one = MyLit(value=1)
two = MyLit(value=2)
two1 = MyAdd(a=one, b=one)
three1 = MyAdd(a=one, b=two)
six2 = MyMul(a=three1, b=two1)
seven2 = MyAdd(a=six2, b=one) We wouldn't ask a user of Ibis to build up nodes in this manner, they would want some python-like API. So we would have to build another API on top of this one that is exposed to the user. However, with the high level API of using Python functions as operations, this API could itself be exposed to the user: from __future__ import annotations
from egg_smol import *
egraph = EGraph()
@egraph.class_
class MyNode(BaseExpr):
def __init__(self, value: i64Like):
...
def __add__(self, other: MyNode) -> MyNode:
...
def __mul__(self, other: MyNode) -> MyNode:
...
one = MyNode(1)
two = MyNode(2)
two1 = one + one
three1 = one + two
six2 = three1 * two1
seven2 = six2 + one Of course, you would have to provide all the right niceties, like an @egraph.function(cost=10)
def pow(a: MyNode, b: MyNode) -> MyNode:
...
a, b = vars_("a b", MyNode)
i = var("i", i64)
egraph.register(
# x ^ 0 = 1
rewrite(pow(a, MyNode(i64(0)))).to(MyNode(i64(1))),
# x ^ 1 = x
rewrite(pow(a, MyNode(i64(1)))).to(a),
# x ^ i = x * x ^ (i - 1) iff i > 1
rewrite(pow(a, MyNode(i))).to(a * pow(a, MyNode(i - i64(1))), i > i64(1)),
)
x = egraph.define("x", pow(MyNode(2), MyNode(3)), cost=20)
egraph.run(10)
print(egraph.extract(x))
# Prints "MyNode(2) * (MyNode(2) * MyNode(2))" This to me would be a huge win! Then also if multiple libraries started using the same system to express their expressions and rewrites it would become feasible for third parties to write rewrites between them. Of course that's still a long way off, but I think it would be a way of making it much easier on users... Instead of having to say "will I use dask dataframes or ibis dataframes"? They could use whichever API maps more closely to their domain, and then write rewrites (provided no one else already has) to the backend they would like to use. And if that changes, then they don't have to rewrite their business logic, just write different rewrites. Of course there are a lot of stepping stones before getting there, but that is what excited me about these systems! (besides also just having a nicer way in Python to express optimizations).
Thanks for that link! Yeah it's interesting to think about the different ways of writing rules and their equivalences. I would love to dive into a precise example here to spell those out... I haven't worked with SQL rewrite rules much so it isn't as clear to what the different ways of expressing them are.
I think so... Like in the same way that the builtin sorts to egg-smol call arbitrary Rust functions: https://github.com/mwillsey/egg-smol/blob/669ea92d3ac73b23bd8b0565fb0866cda0770a02/src/sort/f64.rs#L36 |
Sorry for the late response!
It's indeed a nice feature, however with ibis we have a bit more complicated situation which I'm not sure how to apply the typed interface of egg-smol-python. Basically we have two levels in ibis: the operation
In ibis we already have the
This would certainly help in this case, it's still in "enode-land". What I'm thinking is to support both e-node and plain "node-land" (the extracted e-nodes) if that makes sense. While the former would be more performant the latter would be close the the original way of thinking.
Having a standard way for defining terms and term graphs would be great indeed, then the utilities (like egraphs) around these could be reused as well, it's more similar to what arrow provides for columnar data. Then EGraphs are "execution engines" on that hierarchic data.
I'm also intrigued by these concepts! Once I manage to roll up my other PRs we can elaborate more on these. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a read through. I confess I don't fully understand the egraph implementation, but the test coverage looks good and I trust your work here.
My main concerns are:
- The high use of operator overloading for defining patterns. IMO operators shouldn't be overloaded to do surprising things; if a user sees
@
they should expect matrix multiplication, if they seeClass[...]
they should expect type parametrization. I'd prefer if we found a less-magical syntax for defining patterns. - Some simple docstrings for all the public methods on each type would aid a lot in helping others understand how this code works and how it should be used.
ibis/common/collections.py
Outdated
@@ -208,4 +208,113 @@ def __repr__(self): | |||
return f"{self.__class__.__name__}({super().__repr__()})" | |||
|
|||
|
|||
class DisjointSet(Mapping[K, Set[K]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this datastructure is pretty specific to the egraph implementation and seems unlikely (IMO) to be used in other parts of Ibis, I'd prefer this was moved to ibis/common/egraph.py
to keep the implementation together. It'd aid with readability of the code to have all these methods in the same file IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that this is unlikely to use in other places, but I have a slight preference of keeping it here. At first I kept both in the same file, and it was harder to comprehend which structure does what. Once I managed to properly factor it out and properly unit test it, now we can be certain that it complies with the specification of the union find algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still would like to see this moved to the egraph.py
file. This is a unique enough datastructure that I'd expect most readers not to be familiar the meaning of operations available on a disjoint set (or the "union find" algorithm). Having all the code in the same file would really help with readability IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, going to move it to egraph.py
54ae015
to
d8c77bb
Compare
edc9aa4
to
0c260e7
Compare
@jcrist I removed some leftovers and squashed into a single commit, should be ready to another look |
I guess I can go either way here. Whether a thing is surprising is highly context- and person-dependent. It may surprise some people that For SQL users, who knows? Many SQL dialects tend to prefer English words to operators, but some (Postgres for example) have a large variety of syntactic operators. Matrix multiplication is such a niche operator (I personally still do not understand why it's part of Python's syntax) that restricting to just that seems like it's leaving a lot of shorthand convenience on the table. After all, it's not like it has multiple meanings in this context. Similarly, if ibis ever grows something like a matmul operator that actually does do matrix multiplication, it's going to be mutually exclusive with how it's used in the pattern matching context. I'm +0.5 on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Must haves:
ENode
toNode
newegraph.py
implementation in favor ofegraph.py
Should haves:
Could haves:
ENode
interface inops.Node
so we don't need to convertops.Node
s toENode
s, though we would still need to have a non-validatedNode
alternative for performance reasons