
Add scheduler tests to gpuCI #635

Closed

Conversation

@charlesbluca

In response to #634, this adds Distributed scheduler tests to the gpuCI build script, so that we can hopefully catch NVML-related system monitor issues earlier.

@charlesbluca requested a review from a team as a code owner June 2, 2021 00:51
@github-actions bot added the gpuCI label Jun 2, 2021
@pentschev

We don't want to run all the scheduler tests from Distributed here. Running all the tests not only increases build time, but also means that flaky tests or issues not directly related to Dask-CUDA will block our CI. We can either:

  1. Run only the test(s) from that file that interest us;
  2. Add a specific test to cover the case of interest, possibly in https://github.com/dask/distributed/blob/main/distributed/diagnostics/tests/test_nvml.py;

I strongly prefer option 2.

@pentschev left a comment

We should change how that case is covered, as per #635 (comment).

@charlesbluca

Is there any particular reason you're partial to option 2? I agree that we probably shouldn't be running all the scheduler tests here; really, the main one we're interested in is test_get_worker_monitor_info(), since it calls range_query() on the worker monitors.

@pentschev

The reason is clarity: test_get_worker_monitor_info() seems like a rather arbitrary test to rely on for NVML coverage. A specific test sitting alongside the other NVML tests would leave no room for doubt.

@charlesbluca

Yeah, that's fair. My problem here is mostly that this test leans more towards "check that Distributed works normally when NVML is enabled" than "check that an NVML-related feature in Distributed works properly." I imagine that if NVML testing in CI is eventually set up for Distributed, we would probably want to remove a test like this, since it would effectively become a duplicate of the scheduler test I referred to.

@charlesbluca

Hmm, another, stranger issue I'm noticing here is that, when running the tests locally, this range_query() failure seems to depend on whether the test is run alone or not; using the following test:

from distributed.utils_test import gen_cluster


@gen_cluster()
async def test_gpu_monitoring_range_query(s, a, b):
    # ask the scheduler for each worker's system monitor info
    res = await s.get_worker_monitor_info()
    ms = ["gpu_utilization", "gpu_memory_used"]
    for w in (a, b):
        assert all(res[w.address]["range_query"][m] is not None for m in ms)
        assert res[w.address]["count"] is not None
        assert res[w.address]["last_time"] is not None

I get the following results:

$ pytest distributed/diagnostics/tests/test_nvml.py 
========================================== test session starts ===========================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/charlesbluca/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /home/charlesbluca/Documents/GitHub/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, timeout-1.4.2, rerunfailures-9.1.1, asyncio-0.12.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 7 items                                                                                        

distributed/diagnostics/tests/test_nvml.py::test_one_time PASSED                                   [ 14%]
distributed/diagnostics/tests/test_nvml.py::test_1_visible_devices PASSED                          [ 28%]
distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[1,0] PASSED                     [ 42%]
distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[0,1] PASSED                     [ 57%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_metrics PASSED                                [ 71%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent PASSED                      [ 85%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query PASSED                 [100%]

========================================== slowest 20 durations ==========================================
0.07s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_metrics
0.06s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query
0.05s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent
0.01s call     distributed/diagnostics/tests/test_nvml.py::test_one_time
0.01s call     distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[1,0]

(15 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================================== 7 passed in 0.30s ============================================

Versus running the test alone:

$ pytest distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query
========================================== test session starts ===========================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/charlesbluca/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /home/charlesbluca/Documents/GitHub/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, timeout-1.4.2, rerunfailures-9.1.1, asyncio-0.12.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 1 item                                                                                         

distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query FAILED                 [100%]

================================================ FAILURES ================================================
____________________________________ test_gpu_monitoring_range_query _____________________________________
Traceback (most recent call last):
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/core.py", line 494, in handle_comm
    result = handler(comm, **msg)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/worker.py", line 1083, in get_monitor_info
    else self.monitor.range_query(start=start)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
distributed.core - ERROR - Exception while handling op get_monitor_info
Traceback (most recent call last):
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/core.py", line 494, in handle_comm
    result = handler(comm, **msg)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/worker.py", line 1083, in get_monitor_info
    else self.monitor.range_query(start=start)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
========================================== slowest 20 durations ==========================================
0.08s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================================== 1 failed in 0.40s ============================================
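For context on the traceback: range_query() builds its result by indexing into the monitor's deque-backed history, and "deque index out of range" is what you get when the requested indices aren't there yet. A minimal illustration of that failure mode (just the deque behaviour, not the actual SystemMonitor code):

from collections import deque

history = deque(maxlen=10)  # one metric's history, still empty
seq = [0, 1, 2]             # indices the range query expects to exist

try:
    values = [history[i] for i in seq]
except IndexError as err:
    print(err)  # "deque index out of range", as in the traceback above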

Not really sure what's going on here; @pentschev, if you get the chance, could you verify that this is a general problem by running the following commands:

$ pytest distributed/tests/test_scheduler.py

$ pytest distributed/tests/test_scheduler.py::test_get_worker_monitor_info

and letting me know if you only get a failure in the latter case?

@quasiben commented Jun 3, 2021

The test passing when run with the others but failing when run alone is probably related to whether we clean up with nvmlShutdown in other tests. My guess is we don't. So if this test assumes nvmlInit has been called, we should probably call it ourselves or check that NVML is already enabled.
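A minimal sketch of what "call or check it is already enabled" could look like as a pytest fixture, using pynvml directly (illustrative only; Distributed wraps NVML in its own helpers, and the fixture name here is made up):

import pynvml
import pytest


@pytest.fixture
def nvml():
    # Hypothetical fixture: make sure NVML is initialized for this test
    # regardless of what earlier tests did, and shut it down afterwards
    # so that test ordering no longer matters.
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        pytest.skip("NVML not available")
    try:
        yield
    finally:
        pynvml.nvmlShutdown()

A test that needs NVML state would then take nvml as an argument instead of assuming some earlier test already initialized it.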

@pentschev

@charlesbluca are you testing the above with dask/distributed#4866? To me it seems fine when running tests individually.

@charlesbluca

I can try out your patch, but my issue isn't whether or not dask/distributed#4866 fixes the failure; it's that the failure seemingly goes undetected depending on the context the test is run in. For example, if my local test results mirror what would happen in gpuCI, this line:

py.test --cache-clear -vs `python -c "import distributed.diagnostics.tests.test_nvml as m;print(m.__file__)"`

with my hypothetical new test added, would not result in a failure even if dask/distributed#4866 were not merged in.

@pentschev

I understand; I'm merely saying that I tested this on a DGX with dask/distributed#4866 and I can't reproduce what you're seeing, so I'm not sure what's going on. Maybe that PR is actually fixing this issue for some reason.

@charlesbluca

If you have a Distributed environment without 4866, would you mind checking whether this issue happens there? My worry is that if there's a problem with the Distributed tests that goes unchecked here, we could end up with another situation like this, where a Distributed test falsely "passes" in either Distributed's or Dask-CUDA's CI and a breaking change is introduced.

> whether we clean up with nvmlShutdown in other tests

@quasiben could you clarify what this might look like? I'm wondering if we could/should have some NVML-related cleanup happening in general for tests when NVML is enabled, perhaps through the gen_cluster decorator, or something along the lines of the sketch below.
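(Purely hypothetical, with made-up names: an autouse fixture in the NVML tests' conftest.py that resets NVML around every test, without touching gen_cluster itself.)

# hypothetical conftest.py for distributed/diagnostics/tests
import pynvml
import pytest


@pytest.fixture(autouse=True)
def reset_nvml():
    # Initialize NVML before every test in this directory and shut it
    # down afterwards, so no test depends on state left by another.
    pynvml.nvmlInit()
    yield
    pynvml.nvmlShutdown()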

@charlesbluca

Opened dask/distributed#4879 to add a new test to test_nvml.py for the specific issue we saw in #634, so once this discussion is resolved this PR can be closed.

@pentschev

> If you have a Distributed environment without 4866, would you mind checking whether this issue happens there?

Both fail without dask/distributed#4866.

@charlesbluca

Good to know - in that case, I'm going to assume the strange behavior was a local issue on my end.

@pentschev

> Good to know - in that case, I'm going to assume the strange behavior was a local issue on my end.

From #635 (comment) I was under the impression you were NOT running dask/distributed#4866 in #635 (comment), thus the failure, no?

@charlesbluca

Yes, I wasn't using dask/distributed#4866, and because of this I was expecting both of the following commands to fail on test_get_worker_monitor_info (since your fix for range_query wasn't merged):

$ pytest distributed/tests/test_scheduler.py

$ pytest distributed/tests/test_scheduler.py::test_get_worker_monitor_info

But when running these locally, the first command had no failures (i.e. test_get_worker_monitor_info somehow passed), while the second command had one failure, for test_get_worker_monitor_info. Since you've now tested it and confirmed that both of those commands result in test failures for you without 4866, I can only assume something in my local environment is causing this.

If we wanted to confirm whether or not this would be a problem in general, we could merge dask/distributed#4879 before either of your PRs, and run Dask-CUDA's CI to see if it fails on the function I added there (test_nvml::test_gpu_monitoring_range_query).

@pentschev

It seems that this will not be needed when dask/distributed#4879 is merged. However, I'll rerun tests here now to see if dask/distributed#4866 indeed breaks the CUDA_VISIBLE_DEVICES tests, although I'm pretty sure it will.

@pentschev

rerun tests

@pentschev

rerun tests

@jakirkham added the bug and non-breaking labels Jun 3, 2021
@pentschev

As discussed, this will not be necessary once dask/distributed#4879 lands. I was also able to fix the dask-cuda tests in #638; at least locally, that works. Given that, I'm closing this now. Thanks @charlesbluca for raising this and adding the new NVML test to Distributed!

@pentschev closed this Jun 3, 2021