
Add scheduler tests to gpuCI #635

Closed

Conversation

@charlesbluca

In response to #634, this adds Distributed scheduler tests to the gpuCI build script, so that we can hopefully catch NVML-related system monitor issues earlier.

@charlesbluca requested a review from a team as a code owner June 2, 2021 00:51
@github-actions bot added the gpuCI label Jun 2, 2021
@pentschev

We don't want to run all the scheduler tests from Distributed here. Running all the tests not only increases build time, but also means that flaky tests or issues not directly related to Dask-CUDA will block our CI. We can either:

  1. Run only the test(s) from that file that interest us;
  2. Add a specific test to cover the case of interest, possibly in https://github.com/dask/distributed/blob/main/distributed/diagnostics/tests/test_nvml.py;

I strongly prefer option 2.

@pentschev left a comment

We should change how that case is covered, as per #635 (comment).

@charlesbluca

Is there any particular reason you're partial to option 2? I agree that we probably shouldn't be running all the scheduler tests here; really, the main one we're interested in is test_get_worker_monitor_info(), since it calls range_query() on the worker monitors.

@pentschev

The reason is clarity: test_get_worker_monitor_info() seems like a rather arbitrary test to rely on for NVML coverage. A specific test sitting alongside the other NVML tests would leave no room for doubt.

@charlesbluca

Yeah, that's fair. My problem here is mostly that this test leans more towards "check that Distributed works normally when NVML is enabled" than "check that an NVML-related feature in Distributed works properly." I imagine that if NVML testing in CI is eventually set up for Distributed, we would probably want to remove a test like this, since it would effectively become a duplicate of the scheduler test I referred to.

@charlesbluca

Hmm, another, stranger issue I'm noticing here is that, when running the tests locally, this range_query() failure seems to depend on whether the test is run alone or not; using the following test:

from distributed.utils_test import gen_cluster


@gen_cluster()
async def test_gpu_monitoring_range_query(s, a, b):
    # ask the scheduler for each worker's system monitor info
    res = await s.get_worker_monitor_info()
    ms = ["gpu_utilization", "gpu_memory_used"]
    for w in (a, b):
        assert all(res[w.address]["range_query"][m] is not None for m in ms)
        assert res[w.address]["count"] is not None
        assert res[w.address]["last_time"] is not None

I get the following results:

$ pytest distributed/diagnostics/tests/test_nvml.py 
========================================== test session starts ===========================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/charlesbluca/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /home/charlesbluca/Documents/GitHub/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, timeout-1.4.2, rerunfailures-9.1.1, asyncio-0.12.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 7 items                                                                                        

distributed/diagnostics/tests/test_nvml.py::test_one_time PASSED                                   [ 14%]
distributed/diagnostics/tests/test_nvml.py::test_1_visible_devices PASSED                          [ 28%]
distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[1,0] PASSED                     [ 42%]
distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[0,1] PASSED                     [ 57%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_metrics PASSED                                [ 71%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent PASSED                      [ 85%]
distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query PASSED                 [100%]

========================================== slowest 20 durations ==========================================
0.07s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_metrics
0.06s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query
0.05s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_recent
0.01s call     distributed/diagnostics/tests/test_nvml.py::test_one_time
0.01s call     distributed/diagnostics/tests/test_nvml.py::test_2_visible_devices[1,0]

(15 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================================== 7 passed in 0.30s ============================================

Versus running the test alone:

$ pytest distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query
========================================== test session starts ===========================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/charlesbluca/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /home/charlesbluca/Documents/GitHub/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, timeout-1.4.2, rerunfailures-9.1.1, asyncio-0.12.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 1 item                                                                                         

distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query FAILED                 [100%]

================================================ FAILURES ================================================
____________________________________ test_gpu_monitoring_range_query _____________________________________
Traceback (most recent call last):
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/core.py", line 494, in handle_comm
    result = handler(comm, **msg)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/worker.py", line 1083, in get_monitor_info
    else self.monitor.range_query(start=start)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
distributed.core - ERROR - Exception while handling op get_monitor_info
Traceback (most recent call last):
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/core.py", line 494, in handle_comm
    result = handler(comm, **msg)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/worker.py", line 1083, in get_monitor_info
    else self.monitor.range_query(start=start)
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/home/charlesbluca/Documents/GitHub/distributed/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
========================================== slowest 20 durations ==========================================
0.08s call     distributed/diagnostics/tests/test_nvml.py::test_gpu_monitoring_range_query

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================================== 1 failed in 0.40s ============================================
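For context on the traceback: range_query() builds its result by indexing into the monitor's deque-backed history, and "deque index out of range" is what you get when the requested indices aren't there yet. A minimal illustration of that failure mode (just the deque behaviour, not the actual SystemMonitor code):

from collections import deque

history = deque(maxlen=10)  # one metric's history, still empty
seq = [0, 1, 2]             # indices the range query expects to exist

try:
    values = [history[i] for i in seq]
except IndexError as err:
    print(err)  # "deque index out of range", as in the traceback above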

Not really sure what's going on here; @pentschev, if you get the chance, could you verify that this is a general problem by running the following commands:

$ pytest distributed/tests/test_scheduler.py

$ pytest distributed/tests/test_scheduler.py::test_get_worker_monitor_info

and letting me know if you only get a failure in the latter case?

@quasiben commented Jun 3, 2021

The test passing when run with the others but failing when run alone is probably related to whether we clean up with nvmlShutdown in other tests. My guess is we don't. So if this test assumes nvmlInit has been called, we should probably call it ourselves or check that NVML is already enabled.
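A minimal sketch of what "call or check it is already enabled" could look like as a pytest fixture, using pynvml directly (illustrative only; Distributed wraps NVML in its own helpers, and the fixture name here is made up):

import pynvml
import pytest


@pytest.fixture
def nvml():
    # Hypothetical fixture: make sure NVML is initialized for this test
    # regardless of what earlier tests did, and shut it down afterwards
    # so that test ordering no longer matters.
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        pytest.skip("NVML not available")
    try:
        yield
    finally:
        pynvml.nvmlShutdown()

A test that needs NVML state would then take nvml as an argument instead of assuming some earlier test already initialized it.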

@pentschev

@charlesbluca are you testing the above with dask/distributed#4866? To me it seems fine when running tests individually.

@charlesbluca

I can try out your patch, but my issue isn't whether or not dask/distributed#4866 fixes the failure; it's that the failure seemingly goes undetected depending on the context the test is run in. For example, if my local test results mirror what would happen in gpuCI, this line:

py.test --cache-clear -vs `python -c "import distributed.diagnostics.tests.test_nvml as m;print(m.__file__)"`

with my hypothetical new test added, would not result in a failure even if dask/distributed#4866 were not merged in.

@pentschev

I understand; I'm merely saying that I tested this on a DGX with dask/distributed#4866 and I can't reproduce what you're seeing, so I'm not sure what's going on. Maybe that PR is actually fixing this issue for some reason.

@charlesbluca

If you have a Distributed environment without 4866, would you mind checking whether this issue happens there? My worry is that if there's a problem with the Distributed tests that goes unchecked here, we could end up with another situation like this, where a Distributed test falsely "passes" in either Distributed's or Dask-CUDA's CI and a breaking change is introduced.

> whether we clean up with nvmlShutdown in other tests

@quasiben could you clarify what this might look like? I'm wondering if we could/should have some NVML-related cleanup happening in general for tests when NVML is enabled, perhaps through the gen_cluster decorator, or something along the lines of the sketch below.
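(Purely hypothetical, with made-up names: an autouse fixture in the NVML tests' conftest.py that resets NVML around every test, without touching gen_cluster itself.)

# hypothetical conftest.py for distributed/diagnostics/tests
import pynvml
import pytest


@pytest.fixture(autouse=True)
def reset_nvml():
    # Initialize NVML before every test in this directory and shut it
    # down afterwards, so no test depends on state left by another.
    pynvml.nvmlInit()
    yield
    pynvml.nvmlShutdown()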

@charlesbluca

Opened dask/distributed#4879 to add a new test to test_nvml.py for the specific issue we saw in #634, so once this discussion is resolved this PR can be closed.

@pentschev

> If you have a Distributed environment without 4866, would you mind checking whether this issue happens there?

Both fail without dask/distributed#4866.

@charlesbluca

Good to know - in that case, I'm going to assume the strange behavior was a local issue on my end.

@pentschev

> Good to know - in that case, I'm going to assume the strange behavior was a local issue on my end.

From #635 (comment) I was under the impression you were NOT running dask/distributed#4866 in #635 (comment), thus the failure, no?

@charlesbluca

Yes, I wasn't using dask/distributed#4866, and because of this I was expecting both of the following commands to fail on test_get_worker_monitor_info (since your fix for range_query wasn't merged):

$ pytest distributed/tests/test_scheduler.py

$ pytest distributed/tests/test_scheduler.py::test_get_worker_monitor_info

But when running these locally, the first command had no failures (i.e. test_get_worker_monitor_info somehow passed), while the second command had one failure, for test_get_worker_monitor_info. Since you've now tested it and confirmed that both of those commands result in test failures for you without 4866, I can only assume something in my local environment is causing this.

If we wanted to confirm whether or not this would be a problem in general, we could merge dask/distributed#4879 before either of your PRs, and run Dask-CUDA's CI to see if it fails on the function I added there (test_nvml::test_gpu_monitoring_range_query).

@pentschev

It seems that this will not be needed when dask/distributed#4879 is merged. However, I'll rerun tests here now to see if dask/distributed#4866 indeed breaks the CUDA_VISIBLE_DEVICES tests, although I'm pretty sure it will.

@pentschev

rerun tests

@pentschev

rerun tests

@jakirkham added the bug and non-breaking labels Jun 3, 2021
@pentschev

As discussed, this will not be necessary once dask/distributed#4879 lands. I was also able to fix the dask-cuda tests in #638; at least locally, that works. Given that, I'm closing this now. Thanks @charlesbluca for raising this and adding the new NVML test to Distributed!

@pentschev closed this Jun 3, 2021