@tgravescs
Contributor

branch-2.0 version of this patch. The differences are in the YarnShuffleService for finding the location to put the DB. branch-2.0 does not use the yarn nm recovery path like master does.

Tested manually on an 8-node YARN cluster and ran unit tests. Manual tests verified that the DB was created properly and that existing DBs were found. Verified that during a rolling upgrade credentials were reloaded and the running application was not affected.

Thomas Graves added 3 commits September 6, 2016 14:00
…ling upgrade

The Spark YARN shuffle service doesn't re-initialize the application credentials early enough, which causes any other Spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the Spark shuffle service relies on the YARN NodeManager to re-register the applications; unfortunately, this happens after we open the port for other executors to connect. If other executors connect before the re-registration, they get a NullPointerException, which isn't a retryable exception and causes them to fail pretty quickly. To solve this, I added another LevelDB file so that the service can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another LevelDB was simpler from the code structure point of view.
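For readers who want the ordering spelled out, here is a minimal sketch of the idea described above: persist each application's secret when it registers, and reload everything from disk before accepting connections. It is not the code from the patch. The class name `CredentialRecoverySketch`, its methods, and the use of a plain properties file in place of LevelDB are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a properties file stands in for the LevelDB store
// mentioned in the description; names here are not from the Spark code base.
class CredentialRecoverySketch {
    private final Path dbFile;
    private final Map<String, String> secrets = new ConcurrentHashMap<>();
    private volatile boolean portOpen = false;

    CredentialRecoverySketch(Path dbFile) {
        this.dbFile = dbFile;
    }

    // On application registration: remember the secret and persist it.
    synchronized void registerApp(String appId, String secret) throws IOException {
        secrets.put(appId, secret);
        persist();
    }

    // On application completion: forget the secret and persist the change.
    synchronized void removeApp(String appId) throws IOException {
        secrets.remove(appId);
        persist();
    }

    private void persist() throws IOException {
        Properties props = new Properties();
        props.putAll(secrets);
        try (OutputStream out = Files.newOutputStream(dbFile)) {
            props.store(out, "application secrets");
        }
    }

    // On startup: reload saved secrets first, and only then "open the port".
    synchronized void start() throws IOException {
        if (Files.exists(dbFile)) {
            Properties props = new Properties();
            try (InputStream in = Files.newInputStream(dbFile)) {
                props.load(in);
            }
            for (String appId : props.stringPropertyNames()) {
                secrets.put(appId, props.getProperty(appId));
            }
        }
        portOpen = true; // stand-in for binding the server port
    }

    // What a connecting executor would need; null means "unknown application".
    String secretFor(String appId) {
        if (!portOpen) {
            throw new IllegalStateException("service not started");
        }
        return secrets.get(appId);
    }

    public static void main(String[] args) throws IOException {
        Path db = Files.createTempFile("shuffle-secrets", ".properties");
        CredentialRecoverySketch before = new CredentialRecoverySketch(db);
        before.start();
        before.registerApp("app_1", "example-secret");

        // Simulate a restart during a rolling upgrade: a new instance, same file.
        CredentialRecoverySketch after = new CredentialRecoverySketch(db);
        after.start();
        System.out.println("recovered: " + (after.secretFor("app_1") != null));
        Files.deleteIfExists(db);
    }
}
```

The point of the sketch is only the ordering inside `start()`: because the saved secrets are loaded before the flag that represents the open port is set, a peer that connects immediately after a restart does not see a missing secret.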

Most of the code changes are moving things to a common util class.

The patch was tested manually on a YARN cluster with a rolling upgrade happening while a Spark job was running. Without the patch I consistently get the NullPointerException; with the patch the job gets a few Connection refused exceptions, but the retries kick in and it succeeds.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes apache#14718 from tgravescs/SPARK-16711.

Conflicts:
	common/network-shuffle/pom.xml
	common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
	common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
@SparkQA

SparkQA commented Sep 7, 2016

Test build #65046 has finished for PR 14997 at commit 40e6be3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor Author

cc @vanzin

@vanzin
Contributor

vanzin commented Sep 7, 2016

LGTM.

@tgravescs
Contributor Author

Thanks! Merging to branch-2.0.

asfgit pushed a commit that referenced this pull request Sep 8, 2016
…ling upgrade

branch-2.0 version of this patch.  The differences are in the YarnShuffleService for finding the location to put the DB. branch-2.0 does not use the yarn nm recovery path like master does.

Tested manually on an 8-node YARN cluster and ran unit tests. Manual tests verified that the DB was created properly and that existing DBs were found. Verified that during a rolling upgrade credentials were reloaded and the running application was not affected.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #14997 from tgravescs/SPARK-16711-branch2.0.
@tgravescs tgravescs closed this Sep 9, 2016