-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-3709] Executors don't always report broadcast block removal properly back to the driver #2588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
QA tests have started for PR 2588 at commit
|
|
QA tests have started for PR 2588 at commit
|
|
LGTM. Merging into master, branch-1.1, and branch-1.0. Should I also backport to branch-0.9? |
|
Make sure you backport in 0.3 |
|
@aarondav I already merged this actually... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it.
|
QA tests have started for PR 2588 at commit
|
|
QA tests have started for PR 2588 at commit
|
|
Tests timed out after a configured wait of |
|
Test FAILed. |
|
@nchammas - would you be interested in submitting a pr to change the qa script so that the timeout and failure message already prints the commit hash? |
|
Tests timed out after a configured wait of |
|
QA tests have started for PR 2588 at commit
|
|
QA tests have finished for PR 2588 at commit
|
|
QA tests have finished for PR 2588 at commit
|
|
Test PASSed. |
|
Isnt there a way to augment the existing tests to make sure that the state in the driver (blockmanagermaster) is cleared after removing tests? |
|
Yes - can you submit one? I'm going to merge this because it has been blocking a lot of other patches. |
|
Actually, I took a look, it does test that. So I am not sure how it was passing earlier some of the times. |
|
It worked because askSlaves was true and the driver always queries the slaves in your afterUnpersist test. The problem is with regard to reporting, not whether the block itself has been dropped or not. |
|
QA tests have finished for PR 2588 at commit
|
|
Test FAILed. |
@rxin Straight failures should already include the commit hash, like here. (Note that messages like this one do not come from our script.) I can make a PR to add the commit hash to the timeout messages. |
[By request](#2588 (comment)), and because it also makes sense. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2597 from nchammas/timeout-commit-hash and squashes the following commits: 3d90714 [Nicholas Chammas] Revert "testing: making timeout 1 minute" 2353c95 [Nicholas Chammas] testing: making timeout 1 minute e3a477e [Nicholas Chammas] post commit hash with timeout
The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it.
cc @tdas