Exception when executing task graph with HighLevelGraph after a shuffle #1129
Comments
Thanks for the report. This is not unexpected: the optimiser will most likely change l1._name and r1._name because we are changing the backing expression. You can simulate this if you need the output keys by
This is roughly what happens internally in dask-expr (we are not calling optimise but |
Gotcha, makes sense. Do you have any initial guesses on a good path forward? I was thinking of trying to rewrite that dask-geopandas method as a new dask_expr.Expr, in the hope that it would integrate into the whole optimization. A potential workaround in dask-geopandas is to call |
This is definitely what I would recommend as long as you are using HLGs for this. Otherwise optimisation won't be triggered at all, which might leave stuff on the table.
When you create an expression for this, you pass in the left and right expressions, and the expression tree keeps references to the correct, changed expressions. The graph is only constructed after everything else is done (meaning optimisations), so if you implement the layer method on the expression, the keys won't change anymore. Does this help? Happy to help you through the process of creating the expression.
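To make the point about deferred graph construction concrete, here is a minimal pure-Python sketch. It is a toy model, not the dask-expr API: the `Expr`, `Source`, `Shuffle` and `Join` classes, the `dangling` helper, and the "tasks"/"p2p" labels are all hypothetical names invented for illustration. It only assumes what is described above: names are derived from operands, optimisation can change them, and a layer built after optimisation reads the current names.

```python
# Toy model only: hypothetical classes, NOT the real dask-expr API.

class Expr:
    """Operands are child expressions or plain parameters."""

    def __init__(self, *operands):
        self.operands = operands

    @property
    def name(self):
        # The name is derived from the operands, so changing an operand changes the name.
        parts = [op.name if isinstance(op, Expr) else str(op) for op in self.operands]
        return type(self).__name__.lower() + "-" + "-".join(parts)

    def optimize(self):
        # Default: rebuild this node on top of optimised children.
        return type(self)(*[op.optimize() if isinstance(op, Expr) else op
                            for op in self.operands])

    def graph(self):
        # Collect this node's layer plus the layers of all child expressions.
        dsk = dict(self.layer())
        for op in self.operands:
            if isinstance(op, Expr):
                dsk.update(op.graph())
        return dsk


class Source(Expr):
    def layer(self):
        return {(self.name, 0): [3, 1, 2]}


class Shuffle(Expr):
    def optimize(self):
        child, _method = self.operands
        # Pretend the optimiser swaps the backing implementation: the name changes.
        return Shuffle(child.optimize(), "p2p")

    def layer(self):
        child, _method = self.operands
        return {(self.name, 0): (sorted, (child.name, 0))}


class Join(Expr):
    def layer(self):
        # Keys are read here, at graph-construction time, from the current operands.
        left, right = self.operands
        return {(self.name, 0): (list.__add__, (left.name, 0), (right.name, 0))}


def dangling(dsk):
    """Return tuple arguments that look like keys but are missing from the graph."""
    missing = set()
    for task in dsk.values():
        if isinstance(task, tuple):
            missing.update(a for a in task[1:] if isinstance(a, tuple) and a not in dsk)
    return missing


left = Shuffle(Source("a"), "tasks")
right = Shuffle(Source("b"), "tasks")

# Eager approach: capture the input names first, optimise the inputs afterwards.
stale_layer = {("join-eager", 0): (list.__add__, (left.name, 0), (right.name, 0))}
eager = {**stale_layer, **left.optimize().graph(), **right.optimize().graph()}
print(dangling(eager))     # the two pre-optimisation shuffle keys are missing

# Deferred approach: the join is itself an expression, its layer is built last.
deferred = Join(left, right).optimize().graph()
print(dangling(deferred))  # set()
```

In the eager variant the hand-built layer still points at the pre-optimisation names, so those keys are never substituted with data. In the deferred variant the join only looks up its operands' names when `layer` is called, after optimisation, so every reference resolves.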
Describe the issue:
In geopandas/dask-geopandas#303, dask-geopandas has a report that its custom sjoin method fails with a TypeError under some conditions. Internally, that method constructs a HighLevelGraph. The observed failure is a TypeError raised by geopandas because dask fails to substitute the concrete (geo)DataFrame for the key (name, partition_number) when executing the task graph.
I've managed to produce a dask / dask-expr only version:
Minimal Complete Verifiable Example:
That fails with
As I mentioned, the shuffle there is important. Without that shuffle, things work fine.
Anything else we need to know?:
I'll take a look at this today.
Environment: