@rxin
Contributor

@rxin rxin commented Sep 30, 2014

The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it.

cc @tdas
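
The failure mode described above can be illustrated with a toy model. This is only a sketch and not Spark's code: the `Driver` and `Executor` classes and every method name are hypothetical, and the "fixed" handler shows one plausible remedy (always reporting the removal), assumed here for illustration.

```python
# Toy model, not Spark code: all class and method names are hypothetical.

class Driver:
    """Tracks which executors are believed to hold each block."""

    def __init__(self):
        self.locations = {}  # block_id -> set of executor ids

    def register(self, block_id, executor_id):
        self.locations.setdefault(block_id, set()).add(executor_id)

    def report_removal(self, block_id, executor_id):
        self.locations.get(block_id, set()).discard(executor_id)


class Executor:
    def __init__(self, executor_id, driver):
        self.executor_id = executor_id
        self.driver = driver
        self.blocks = set()

    def store(self, block_id):
        self.blocks.add(block_id)
        self.driver.register(block_id, self.executor_id)

    def remove_block(self, block_id, tell_master):
        # The block is always dropped locally; reporting is optional.
        self.blocks.discard(block_id)
        if tell_master:
            self.driver.report_removal(block_id, self.executor_id)

    def handle_remove_broadcast_buggy(self, block_id, remove_from_driver):
        # Bug pattern: the message's second field is misread as tell_master.
        self.remove_block(block_id, tell_master=remove_from_driver)

    def handle_remove_broadcast_fixed(self, block_id, remove_from_driver):
        # Assumed remedy: report the removal regardless of remove_from_driver.
        self.remove_block(block_id, tell_master=True)


def locations_after_removal(handler_name):
    """Store a block, remove it via the named handler, return the driver's view."""
    driver = Driver()
    executor = Executor("exec-1", driver)
    executor.store("broadcast_0")
    getattr(executor, handler_name)("broadcast_0", remove_from_driver=False)
    return driver.locations["broadcast_0"]
```

With the buggy handler the driver still lists `exec-1` as a holder of a block that no longer exists there, which is the stale entry other executors would then try to fetch from. With the fixed handler the driver's set is empty.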

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2588 at commit 2a13f70.

  • This patch merges cleanly.

@aarondav
Contributor

LGTM. Merging into master, branch-1.1, and branch-1.0. Should I also backport to branch-0.9?

@rxin
Contributor Author

rxin commented Sep 30, 2014

Make sure you backport in 0.3

@pwendell
Contributor

@aarondav I already merged this actually...

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2588 at commit f430686.

  • This patch merges cleanly.

@rxin rxin changed the title from "Added debug logging ... [DO NOT MERGE]" to "[SPARK-3709] Executors don't always report broadcast block removal properly back to the driver" Sep 30, 2014
@rxin rxin changed the title from "[SPARK-3709] Executors don't always report broadcast block removal properly back to the driver" to "[SPARK-3709] [WIP] Executors don't always report broadcast block removal properly back to the driver" Sep 30, 2014
@SparkQA

SparkQA commented Sep 30, 2014

Tests timed out after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21014/

@rxin
Contributor Author

rxin commented Sep 30, 2014

@nchammas - would you be interested in submitting a pr to change the qa script so that the timeout and failure message already prints the commit hash?

@SparkQA

SparkQA commented Sep 30, 2014

Tests timed out after a configured wait of 120m.

@rxin rxin changed the title from "[SPARK-3709] [WIP] Executors don't always report broadcast block removal properly back to the driver" to "[SPARK-3709] Executors don't always report broadcast block removal properly back to the driver" Sep 30, 2014
@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2588 at commit 6dab2e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2588 at commit f430686.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21019/

@tdas
Contributor

tdas commented Sep 30, 2014

Isn't there a way to augment the existing tests to make sure that the state in the driver (BlockManagerMaster) is cleared after removing tests?

@rxin
Contributor Author

rxin commented Sep 30, 2014

Yes - can you submit one? I'm going to merge this because it has been blocking a lot of other patches.

@asfgit asfgit closed this in de700d3 Sep 30, 2014
@tdas
Contributor

tdas commented Sep 30, 2014

Actually, I took a look and it does test that, so I am not sure how it was passing some of the time earlier.

@rxin
Contributor Author

rxin commented Sep 30, 2014

It worked because askSlaves was true and the driver always queries the slaves in your afterUnpersist test. The problem is with regard to reporting, not whether the block itself has been dropped or not.
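
To make the distinction concrete, here is a small toy model of why a check that queries executors directly can pass even when the driver's bookkeeping is stale. It is only a sketch and not Spark's code: the `block_holders` function and its data layout are hypothetical, and `ask_slaves` simply mirrors the `askSlaves` flag mentioned above.

```python
# Toy model, not Spark code: names other than the askSlaves idea are hypothetical.

def block_holders(driver_view, executor_blocks, block_id, ask_slaves):
    """Return the executor ids reported as holding block_id."""
    if ask_slaves:
        # Ask every executor what it actually stores right now.
        return {
            executor_id
            for executor_id, blocks in executor_blocks.items()
            if block_id in blocks
        }
    # Otherwise trust the driver's own bookkeeping, which may be stale.
    return set(driver_view.get(block_id, set()))


# State after an unreported removal: the executor dropped the block,
# but the driver was never told.
driver_view = {"broadcast_0": {"exec-1"}}
executor_blocks = {"exec-1": set()}

via_executors = block_holders(driver_view, executor_blocks, "broadcast_0", ask_slaves=True)
via_driver = block_holders(driver_view, executor_blocks, "broadcast_0", ask_slaves=False)
```

In this model `via_executors` is empty, so a test that always asks the executors sees the block as gone, while `via_driver` still names `exec-1`. A test meant to catch the reporting problem would need to inspect the driver's view without querying the executors.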

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2588 at commit 6dab2e3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21022/

@nchammas
Contributor

> @nchammas - would you be interested in submitting a pr to change the qa script so that the timeout and failure message already prints the commit hash?

@rxin Straight failures should already include the commit hash, like here. (Note that messages like this one do not come from our script.)

I can make a PR to add the commit hash to the timeout messages.

asfgit pushed a commit that referenced this pull request Sep 30, 2014
[By request](#2588 (comment)), and because it also makes sense.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2597 from nchammas/timeout-commit-hash and squashes the following commits:

3d90714 [Nicholas Chammas] Revert "testing: making timeout 1 minute"
2353c95 [Nicholas Chammas] testing: making timeout 1 minute
e3a477e [Nicholas Chammas] post commit hash with timeout