Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable a smoke test for UCX in pre-merge #3200

Merged
merged 5 commits into from
Aug 13, 2021

Conversation

abellina
Copy link
Collaborator

@abellina abellina commented Aug 11, 2021

This PR runs test_hash_grpby_sum with UCX enabled, as UCX is now installed in the docker image.

This adds ~1-2 minutes to the build from the logs. It starts the smoke test from the mvn_verify function so it could be merged without disturbing the jenkins file. The mvn_verify and now rapids_shuffle_smoke_test assume spark 3.0.1 for the pre-merge build. There will be a follow on PR to clean up how rapids_shuffle_smoke_test gets invoked, though right now I piggy back from mvn_verify binaries.

I added a variable so the correct shuffle manager shim can be picked up (SHUFFLE_SPARK_SHIM). This should be kept in sync with SPARK_VER.

There is a local-cluster issue I noted which happens even without the shuffle manager and will file a follow up, as it seems unrelated. I am inclined to let this use standalone mode for now. I could use some more eyes on it to see if it makes sense.

Also I chose a simple test, it does use the UCX shuffle. I can enable others in case it's not going to change that 1-2 minute time addition too much, or pick something else. Just pick the first thing that I knew would use UCX.

@abellina
Copy link
Collaborator Author

build

3 similar comments
@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

This seems to mostly be working, but I am getting an issue in local-cluster mode where executors get created after the tests (and I mean 100 executors or so). The logs show UCX is utilized, but something is happening later. Note this issue with the runaway executors is with or without the manager, so likely a pytest config issue on my side.

@abellina
Copy link
Collaborator Author

build

2 similar comments
@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

Waiting on: #3218 to be merged and image pushed.

@abellina abellina force-pushed the shuffle/ucx_smoke_test branch 3 times, most recently from 99c7e20 to bea8ca4 Compare August 12, 2021 14:11
@abellina
Copy link
Collaborator Author

build

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina
Copy link
Collaborator Author

build

@sameerz sameerz added the test Only impacts tests label Aug 12, 2021
@abellina abellina marked this pull request as ready for review August 12, 2021 17:09
@abellina abellina changed the title Shuffle/ucx smoke test Enable a smoke test for UCX in pre-merge Aug 12, 2021
@abellina abellina added the shuffle things that impact the shuffle plugin label Aug 12, 2021
jenkins/spark-premerge-build.sh Outdated Show resolved Hide resolved
jenkins/spark-premerge-build.sh Show resolved Hide resolved
jenkins/version-def.sh Outdated Show resolved Hide resolved
jlowe
jlowe previously approved these changes Aug 12, 2021
@jlowe
Copy link
Member

jlowe commented Aug 12, 2021

build

@abellina
Copy link
Collaborator Author

build

pxLi
pxLi previously approved these changes Aug 13, 2021
zhanga5
zhanga5 previously approved these changes Aug 13, 2021
@pxLi
Copy link
Collaborator

pxLi commented Aug 13, 2021

SHUFFLE_SPARK_SHIM=spark3.0.1
...
java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark3.0.1.RapidsShuffleManager

Looks like the pattern matching does not work consistently in different shell envs, I don't know if we have better way to do replace in bash default_value syntax. At least we can trim first, then assign a default value

@abellina abellina dismissed stale reviews from zhanga5 and pxLi via 7dba924 August 13, 2021 12:15
@abellina
Copy link
Collaborator Author

Looks like the pattern matching does not work consistently in different shell envs

I tested this with the command line, instead of within bash, so I was able to reproduce the issue locally. I got it to work with 2 statements, but was not able to with a single statement.

@abellina
Copy link
Collaborator Author

build

@abellina abellina merged commit 20237a0 into NVIDIA:branch-21.10 Aug 13, 2021
@abellina abellina deleted the shuffle/ucx_smoke_test branch August 13, 2021 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
shuffle things that impact the shuffle plugin test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants