Issue with data being materialized after cross and before subsequent map
#1905
Comments
Removing |
I think I know what's wrong. The test I wrote does the following:
During the optimization phase we optimize the two consecutive maps in the second fork, so the only part the two graphs still have in common is the cross join. My guess is that Cascading itself forces this common part to disk. I think it works with typed writes because, while planning typed writes, we have both writes at the same time and are able to extract their common part so that it is calculated only once.
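To make the reasoning above concrete, here is a minimal sketch of a toy plan graph. Everything in it (`Plan`, `Source`, `Cross`, `MapStep`, `fuseMaps`, `largestShared`) is hypothetical and is not Scalding's or Cascading's real planner; it also assumes that "optimizing two maps" means composing them into one step, and that only the second fork is optimized.

```scala
object PlanSketch {
  // Toy plan model. These types are hypothetical, not Scalding's internals.
  sealed trait Plan
  final case class Source(name: String) extends Plan
  final case class Cross(left: Plan, right: Plan) extends Plan
  final case class MapStep(input: Plan, label: String) extends Plan

  // Compose consecutive map steps into a single map step.
  def fuseMaps(p: Plan): Plan = p match {
    case MapStep(MapStep(in, f), g) => fuseMaps(MapStep(in, s"$f andThen $g"))
    case MapStep(in, f)             => MapStep(fuseMaps(in), f)
    case Cross(l, r)                => Cross(fuseMaps(l), fuseMaps(r))
    case s: Source                  => s
  }

  // All sub-plans of p, including p itself.
  def subPlans(p: Plan): List[Plan] = p match {
    case s: Source          => List(s)
    case c @ Cross(l, r)    => c :: (subPlans(l) ++ subPlans(r))
    case m @ MapStep(in, _) => m :: subPlans(in)
  }

  // The largest sub-plan contained in both branches (structural equality).
  def largestShared(a: Plan, b: Plan): Option[Plan] = {
    val inB = subPlans(b).toSet
    subPlans(a).filter(inB).sortBy(p => -subPlans(p).size).headOption
  }

  val cross: Plan = Cross(Source("big"), Source("small"))
  val mapped: Plan = MapStep(cross, "f")              // cross.map(f)
  val fork1: Plan = MapStep(mapped, "g1")             // left as written
  val fork2Raw: Plan = MapStep(mapped, "g2")          // second fork, as written
  val fork2: Plan = fuseMaps(fork2Raw)                // second fork, optimized alone

  def main(args: Array[String]): Unit = {
    // As written, the forks share cross.map(f).
    println(largestShared(fork1, fork2Raw))
    // After fusing the maps in the second fork, only the cross is shared.
    println(largestShared(fork1, fork2))
  }
}
```

In this model the fan-out point moves from the mapped data back to the cross itself once one fork is optimized in isolation, which is the point a planner would then write to disk if it materializes shared sub-plans.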
It seems like a good guess that the interaction with toPipe is causing the issue. That really hides from the optimizer what is going on. We could add another rule that we should always avoid materializing a cross. It's hard to say in general when we should materialize and when not, looking only at the static graph. We could imagine adding a new node to the graph as a hint: even if we fan this out, don't recompute it. Maybe. I would probably rather improve the optimization rules...
So basically what happens is:
Knowing this, I can also make a similar test without using
For initial issue we can solve it with ignoring Also I do think |
I agree that we shouldn't materialize immediately after a cross. A hashJoin is maybe a bit different, since it can be used as a filter, and in the filter case maybe we do want to checkpoint? But as you say, the user can add forceToDisk if they want to insist in that case. We can also add a configuration to change the plan in that case, but with the default not doing a force after the cross/hashJoin.
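The rule and configuration discussed here could look roughly like the sketch below. The node kinds, `PlanConfig`, its `forceAfterCrossOrHashJoin` flag, and `shouldMaterialize` are all hypothetical names for illustration, not existing Scalding settings or planner code. It also assumes, purely for the sake of the example, that other nodes are materialized whenever they fan out to more than one consumer.

```scala
object MaterializeRuleSketch {
  // Hypothetical node kinds in a toy plan; not Scalding's real planner types.
  sealed trait Node
  case object CrossNode extends Node
  case object HashJoinNode extends Node
  case object MapNode extends Node
  case object ForceToDiskNode extends Node // explicit request from the user

  // Hypothetical setting: by default, do not force after a cross/hashJoin.
  final case class PlanConfig(forceAfterCrossOrHashJoin: Boolean = false)

  // Decide whether the output of `node`, read by `fanOut` consumers,
  // should be written to disk before the consumers run.
  def shouldMaterialize(node: Node, fanOut: Int, conf: PlanConfig = PlanConfig()): Boolean =
    node match {
      // The user insisted, so always honor it.
      case ForceToDiskNode => true
      // Recompute in each branch unless the configuration says otherwise.
      case CrossNode | HashJoinNode => fanOut > 1 && conf.forceAfterCrossOrHashJoin
      // Assumption for this sketch: other shared nodes are materialized.
      case _ => fanOut > 1
    }

  def main(args: Array[String]): Unit = {
    println(shouldMaterialize(CrossNode, fanOut = 2))
    println(shouldMaterialize(CrossNode, fanOut = 2, PlanConfig(forceAfterCrossOrHashJoin = true)))
  }
}
```

Keeping the explicit force as a separate node kind means a user can still insist on a checkpoint after a filtering hashJoin without changing the default for everyone else.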
At Twitter we found a case where data got serialized after cross and before the subsequent map. This might be critical, since the result of a cross is usually really big and subsequent operations decrease its size a lot. I created a test for this: ttim@85ddcff. It seems to be reproducible only when .toPipe is used and two branches happen after .cross.map. Any ideas, @johnynek?
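To illustrate why the placement of the write matters, here is a small sketch using plain Scala collections rather than Scalding. The input sizes and the filtering step are made up for illustration; the actual test lives in the commit referenced above.

```scala
object CrossSizeSketch {
  // Made-up inputs; the sizes are illustrative only.
  val big: List[Int] = (1 to 1000).toList
  val small: List[Int] = (1 to 50).toList

  // Every pairing of the two inputs: big.size * small.size rows.
  val crossed: List[(Int, Int)] = for (b <- big; s <- small) yield (b, s)

  // A subsequent step that keeps only a small fraction of the crossed rows.
  val kept: List[(Int, Int)] = crossed.filter { case (b, s) => b == s }

  def main(args: Array[String]): Unit =
    println(s"rows after cross: ${crossed.size}, rows after the next step: ${kept.size}")
}
```

Writing the data out right after the cross would serialize all of the crossed rows, while writing after the next step would serialize only the rows that survive it.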