CI job running tests with queuing on #6989

gjoseph92 · 2022-09-02T06:54:47Z

Adds a CI job, only for Ubuntu and py3.9, running the test suite with worker-saturation: 1.0 (instead of inf).

Also adds a @pytest.mark.oversaturate_only marker, which automatically skips marked tests if queuing is enabled.
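
For readers unfamiliar with how a marker like this can skip tests automatically, here is a minimal conftest.py-style sketch. It is not the code from this PR: the helper names `queuing_active` and `items_to_skip` are hypothetical, and it assumes the configured value converts cleanly with `float()`. `pytest` and `dask` are imported lazily inside the hook so the pure helpers can be exercised on their own.

```python
import math

# Marker name taken from the PR description.
MARK_NAME = "oversaturate_only"


def queuing_active(worker_saturation: float) -> bool:
    """Queuing is considered on whenever worker-saturation is finite."""
    return not math.isinf(worker_saturation)


def items_to_skip(items, worker_saturation: float) -> list:
    """Return the collected items carrying the marker, but only if queuing is on."""
    if not queuing_active(worker_saturation):
        return []
    return [item for item in items if item.get_closest_marker(MARK_NAME) is not None]


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", f"{MARK_NAME}: skip this test when scheduler-side queuing is enabled"
    )


def pytest_collection_modifyitems(config, items):
    # Lazy imports: an assumption of this sketch, made to keep the helpers standalone.
    import dask
    import pytest

    saturation = float(dask.config.get("distributed.scheduler.worker-saturation"))
    skip = pytest.mark.skip(reason="test only makes sense without queuing")
    for item in items_to_skip(items, saturation):
        item.add_marker(skip)
```

With this in place, a test decorated with `@pytest.mark.oversaturate_only` would run under `worker-saturation: inf` and be reported as skipped under `worker-saturation: 1.0`.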

A surprisingly small number of tests actually failed when queuing was turned on. I've updated the ones that did as needed. I've tried as much as possible to just make the tests agnostic to queuing. In a few cases though, I've needed to add explicit conditional logic. And in a couple cases, I've marked tests to be skipped when queuing is enabled, since they may not make sense.

Unfortunately there are three tests that are flaky under queuing, which could indicate an actual bug. They're skipped for now, but should be investigated.

Closes #6631

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-09-02T08:30:26Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 5h 53m 29s ⏱️ - 16m 4s
  3 082 tests ±0   2 996 ✔️ -   1   85 💤 +  1 1 ❌ ±0
22 808 runs +6 21 895 ✔️ +79 912 💤 - 73 1 ❌ ±0

For more details on these failures, see this check.

Results for commit 33e4995. ± Comparison against base commit 1818788.

♻️ This comment has been updated with latest results.

.github/workflows/tests.yaml

crusaderky

Instead of this new oversaturate_only mark, couldn't you simply have

OVERSATURATION = math.isinf(dask.config.get("distributed.scheduler.worker-saturation"))
@pytest.mark.flaky(not OVERSATURATION, reason="flaky on MacOS")

Tests that don't make sense with queueing should instead force

config={"distributed.scheduler.worker-saturation": math.inf}

(some already do)

fjetter · 2022-09-02T13:44:34Z

.github/workflows/tests.yaml

@@ -44,8 +53,8 @@ jobs:
        # run: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

    env:
-      TEST_ID: ${{ matrix.os }}-${{ matrix.python-version }}-${{ matrix.partition-label }}
-      # TEST_ID: ${{ matrix.os }}-${{ matrix.python-version }}-${{ matrix.partition-label }}-${{ matrix.run }}
+      TEST_ID: ${{ matrix.os }}-${{ matrix.python-version }}-${{ matrix.partition-label }}-${{ matrix.queuing }}


I think this renaming will break some logic in our test report generation.

distributed/continuous_integration/scripts/test_report.py

Lines 146 to 152 in acf6078

df_jobs["suite_name"] = (
    df_jobs["OS"]
    + "-"
    + df_jobs["python_version"]
    + "-"
    + df_jobs["partition"].str.replace(" ", "")
)

distributed/continuous_integration/scripts/test_report.py

Lines 353 to 357 in acf6078

html_url = jobs_df[jobs_df["suite_name"] == a["name"]].html_url.unique()
assert (
    len(html_url) == 1
), f"Artifact suit name {a['name']} did not match any jobs dataframe {jobs_df['suite_name'].unique()}"
html_url = html_url[0]

I introduced this in #6837

Worst case, we remove the more specific job URL again (we knew from the start this is very brittle)

see https://github.com/dask/distributed/pull/6837/files#r939182529

I don't think this should block this PR
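
To make the concern concrete, here is a simplified stand-in for the matching logic quoted above, using plain dicts instead of the pandas DataFrame. It is only a sketch: the job values, the example URL, the `-queue` suffix, and the function names `suite_name` and `find_job_urls` are illustrative assumptions, not taken from the report script.

```python
def suite_name(os_name: str, python_version: str, partition: str) -> str:
    """Build the name the same way the quoted snippet does for one job row."""
    return f"{os_name}-{python_version}-{partition.replace(' ', '')}"


def find_job_urls(jobs: list, artifact_name: str) -> list:
    """Return the distinct URLs of jobs whose suite name equals the artifact name."""
    return sorted(
        {
            job["html_url"]
            for job in jobs
            if suite_name(job["OS"], job["python_version"], job["partition"])
            == artifact_name
        }
    )


# Hypothetical job row and artifact names, for illustration only.
jobs = [
    {
        "OS": "ubuntu-latest",
        "python_version": "3.9",
        "partition": "not ci1",
        "html_url": "https://example.invalid/job/1",
    }
]
old_artifact = "ubuntu-latest-3.9-notci1"
new_artifact = old_artifact + "-queue"  # an extra TEST_ID component

matches_old = find_job_urls(jobs, old_artifact)  # exactly one URL
matches_new = find_job_urls(jobs, new_artifact)  # empty: the assert would fail
```

Because the suite name is rebuilt from only three fields, any extra component in `TEST_ID` means no job matches the artifact, which is the situation the `len(html_url) == 1` assertion guards against.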

Pulled out from dask#6989. This minor refactor makes it easier to add other config options in the future. It also ensures that the `ws` marker is added even when `--runslow` is given.

gjoseph92 · 2022-09-02T22:06:07Z

Removed the oversaturate_only mark; just using pytest.mark.skipif and config={"distributed.scheduler.worker-saturation": math.inf} as necessary
Fixed setting the environment variable in the test step
Updated test report script to handle and display results from the queue job:

Ready for re-review and merge.

gjoseph92 · 2022-09-02T22:28:32Z

Flaky test_steal_reschedule_reset_in_flight_occupancy #6999

crusaderky · 2022-09-05T10:26:40Z

Thank you!

jrbourbeau · 2022-09-06T18:43:26Z

Just to confirm, is the new CI job added here for testing purposes and is planned to be removed in the future, or is this intended as a long-term change?

crusaderky · 2022-09-07T12:52:46Z

Long term, we'll want to enable scheduler-side queueing by default.
When that happens, we'll need to revisit the unit tests so that only a select few run with queueing off (as already happens with all other config toggles) and revert this PR.
Before that, all tests that are flaky only when queueing is on need to be addressed.

gjoseph92 added 13 commits September 2, 2022 00:40

add config to coschedule tests

e6a15ba

add oversaturate_only mark

0f9034a

test_scheduler_reschedule is oversaturate_only

9acf6f0

fix test_steal_twice

c6eb5d1

update test_steal_reschedule_reset_in_flight_occupancy

5570df3

we have to make the tasks to be stolen not look like root tasks.

fix test_ProcessingHistogram

2e6f9a2

fix test_close_async_task_handles_cancellation

54c2a32

fix test_close_while_executing

40645d9

fix test_TaskState__to_dict

e31c7a4

fix test_pause_while_spilling

175e55a

There are a lot more race conditions with queuing

skip test_target_duration

611a13c

As GitHub action

1327d6a

skip flaky tests for now

31a0070

these should not be failing and we need to evaluate why they are

gjoseph92 requested review from crusaderky and fjetter September 2, 2022 06:54

crusaderky reviewed Sep 2, 2022

.github/workflows/tests.yaml (outdated)

crusaderky requested changes Sep 2, 2022

fjetter reviewed Sep 2, 2022

gjoseph92 added 4 commits September 2, 2022 12:26

match name order of queue jobs to others

ddb2bbf

set worker-saturation env var in test step

8a880fd

driveby: fix old var name

d166e1a

remove use of oversaturate_only mark

7b52bed

gjoseph92 mentioned this pull request Sep 2, 2022

Improve --runslow implementation in conftest #6995

Open

gjoseph92 added 4 commits September 2, 2022 13:13

Revert "add oversaturate_only mark"

40a8a8e

This reverts commit 0f9034a.

Merge remote-tracking branch 'upstream/main' into queuing-on-ci-job

56f4dab

Fix test workflow

d14f947

GitHub didn't like setting `partition-label` in two different ways; it showed up in two different places in the job name and breaks the test report script. Instead, we can set the `TEST_ID` variable as a step in the job, which is cleaner anyway IMO.

Update test report script

33e4995

This was referenced Sep 2, 2022

Tests skipped with queuing active #6998

Open

Root-task withholding without co-assignment #6631

Closed

crusaderky approved these changes Sep 5, 2022

crusaderky merged commit b133009 into dask:main Sep 5, 2022

fjetter mentioned this pull request Sep 6, 2022

Test report generation failing with ValueError: Length mismatch #7006

Closed

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

CI job running tests with queuing on (dask#6989)

d9fa951
