
Disable dynamicAllocation and set maxFailures to 1 in integration tests #2743

Merged

Conversation

@abellina (Collaborator) commented Jun 18, 2021

Signed-off-by: Alessandro Bellina abellina@nvidia.com

Closes: #2698

This should make tests fail more quickly instead of sometimes succeeding on a re-attempt. We have a test failure that is triggered when the shape of the cluster changes, either by gaining a new executor or by having an executor removed between the CPU and GPU sessions (#2477).
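
For concreteness, a minimal sketch of the two settings this PR is about, expressed as plain Spark configuration keys (the exact lines added to the scripts may differ):

spark.task.maxFailures=1                # abort on the first task failure instead of retrying
spark.dynamicAllocation.enabled=false   # keep the executor count fixed so the cluster shape cannot change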

… cluster shape to change

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@jlowe jlowe added the test Only impacts tests label Jun 18, 2021
@abellina abellina changed the title Make spark.task.maxFailures=1 to prevent hidden successes causing the… Disable dynamicAllocation and set maxFailures to 1 in integration tests Jun 18, 2021
jlowe previously approved these changes Jun 18, 2021

@jlowe (Member) commented Jun 18, 2021

build

@@ -129,6 +129,9 @@ else
export PYSP_TEST_spark_ui_showConsoleProgress='false'
export PYSP_TEST_spark_sql_session_timeZone='UTC'
export PYSP_TEST_spark_sql_shuffle_partitions='12'
# prevent the cluster shape from changing - and fail quickly rather than retry
export PYSP_TEST_spark_task_maxFailures='1'
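
A hedged note on how these exports take effect (the mapping rule and the variable name below are assumptions based on the PYSP_TEST_ naming convention; they are not shown in this hunk): run_pyspark_from_build.sh is expected to turn each PYSP_TEST_* variable into a Spark configuration key by dropping the prefix and replacing underscores with dots, so PYSP_TEST_spark_task_maxFailures='1' becomes spark.task.maxFailures=1. Under the same convention, the dynamicAllocation half of the title would presumably look like:

# assumed counterpart for disabling dynamic allocation (not visible in this hunk)
export PYSP_TEST_spark_dynamicAllocation_enabled='false'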
Collaborator

these aren't getting applied when using the spark-submit way to run, correct? Seems like we should be consistent.

Collaborator Author

Nice catch. Are jenkins/spark-tests.sh and run_pyspark_from_build.sh the only two places that trigger the tests? If so, I'd need to change spark-tests.sh.
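
A minimal sketch of what the spark-submit path could pass to stay consistent (the keys are standard Spark configuration properties; where exactly they would be added in jenkins/spark-tests.sh is an assumption):

spark-submit \
  --conf spark.task.maxFailures=1 \
  --conf spark.dynamicAllocation.enabled=false \
  ...   # existing jars, files, and test arguments unchanged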

Collaborator

I think those are the only two in this repo.

@tgravescs (Collaborator)

So I was randomly looking at some integration builds and noticed they had task failures. I wonder whether all of our builds will fail when we enable this. I guess we can merge and see, but I wouldn't merge on a Friday.

@abellina (Collaborator, Author)

Yeah, if they have task failures but the job succeeded (i.e. not XFAIL), then that should be a failed test as far as I can tell. But I absolutely agree:

but I wouldn't merge on a Friday.

@abellina abellina marked this pull request as draft June 18, 2021 16:18
@abellina (Collaborator, Author)

Moved to draft to prevent accidental merges until Monday.

@abellina (Collaborator, Author)

build

@abellina abellina marked this pull request as ready for review June 21, 2021 14:47
@abellina (Collaborator, Author)

Will merge this in now and keep an eye out for new failures.

@abellina abellina merged commit 3a4b55c into NVIDIA:branch-21.08 Jun 21, 2021
@abellina abellina deleted the debug/set_max_task_failures_to_1 branch June 21, 2021 17:58
Labels: test (Only impacts tests)
Projects: None yet
Development: Successfully merging this pull request may close these issues: [FEA] integration test framework should detect unintentionally failed executors
4 participants