Reusing intermediate results causes memory issues #854
Comments
This is concerning.
The two code examples above are identical, aren't they?
FWIW this is a long-standing issue. dask/dask#874
Fixed, I messed up the copy+paste.
Summary from multiple offline conversations with @phofl (and @fjetter): we are mostly concerned about reuse after a reducer. Another example is …
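A minimal sketch of the "reuse after a reducer" pattern (a hypothetical illustration, not the example from the conversation; it assumes a `dask.datasets.timeseries` frame):

```python
import dask

# Hypothetical illustration: `totals` is the output of a reduction, and it is
# consumed twice below, so all of it must exist before the division can run.
df = dask.datasets.timeseries()
totals = df.groupby("name").x.sum()  # reducer
shares = totals / totals.sum()       # `totals` is reused by two consumers
shares.compute()
```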
@fjetter and I chatted a little about this offline. There are two different approaches we can take:
This is relatively easy to do, but it breaks our assumption that optimisations are only local and messes up the dependent tracking in the process. I moved away from this a little because I don't like the band-aids we would need, and it doesn't fit properly into our model.
Problem
Whenever we reuse an intermediate result across a pipeline breaker (such as a shuffle, join, reduction, or groupby operation), we are forced to materialize the entire intermediate result. This breaks the pipelining we could otherwise exploit for reuse, for example between multiple element-wise operations.
Materializing the full result puts a hard limit on our ability to scale, as I have observed in multiple TPC-H benchmark queries.
To illustrate this, run these two snippets on a cluster of your choice:
With full intermediate result materialization
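A minimal sketch of what the first snippet could look like, assuming a `distributed` cluster and a `dask.datasets.timeseries` frame (the workload itself, centering a column around its mean, is an assumption). Both branches hang off one shared shuffled intermediate:

```python
import dask
from dask.distributed import Client

client = Client()  # or connect to a cluster of your choice

df = dask.datasets.timeseries()  # pass start/end to make it big enough to hurt
shuffled = df.shuffle("name")    # pipeline breaker
mean_x = shuffled.x.mean()       # reducer over the shared intermediate
# The centering step cannot start until `mean_x` is finished, so every
# partition of `shuffled` stays pinned in memory until then: the entire
# intermediate result is materialized.
centered = shuffled.x - mean_x
centered.sum().compute()
```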
Without full intermediate result materialization
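And the same computation without sharing the intermediate. Projecting the columns in the first branch is a hypothetical way of keeping the two shuffles as distinct expressions so they are not merged back into one; whether identical expressions get deduplicated depends on the dask version in use:

```python
import dask
from dask.distributed import Client

client = Client()  # or connect to a cluster of your choice

df = dask.datasets.timeseries()
# Each branch builds its own shuffle, so neither branch has to wait on the
# other, and each shuffle output can be consumed and released partition by
# partition instead of being held in memory as a whole.
mean_x = df[["name", "x"]].shuffle("name").x.mean()
centered = df.shuffle("name").x - mean_x
centered.sum().compute()
```

The duplicated pipelines do the shuffle work twice, which is exactly the trade-off discussed under "Possible solution" below.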
Possible solution
The easiest approach would be to never reuse any intermediate results. This has a few downsides (most obviously, shared work gets recomputed in every branch), but it will allow us to scale.
We can certainly get smarter about when to materialize intermediate results, but that will require effort proportional to how smart we want to be. (There is a body of ongoing research and existing implementations in the database world that we could draw from.)