Add scheduler tests to gpuCI #635
Conversation
We don't want to run all the scheduler tests from Distributed here. Running all the tests not only increases build time, but also means that any flaky tests or issues not directly related to Dask-CUDA will block our CI. We can either:
I strongly prefer option 2. |
We should change how we cover that test, as per #635 (comment).
Is there any particular reason you're partial to option 2? I agree that we probably shouldn't be running all the scheduler tests here; really, the main one we're interested in is |
The reason for that is clarity: it seems like quite a random test to have just because it allows us to test NVML. A specific test together with the other NVML tests would leave no room for doubt. |
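For illustration, a minimal sketch of what such a dedicated NVML test might look like (assumptions: pynvml is installed and at least one GPU is visible; the test name and assertions are hypothetical, not the actual test that was later added):

```python
import pytest

pynvml = pytest.importorskip("pynvml")  # skip cleanly when NVML isn't available


def test_nvml_utilization_and_memory():
    # Hypothetical standalone NVML test: query device 0 directly instead of
    # going through Distributed's scheduler monitoring.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        assert util.gpu >= 0 and util.memory >= 0
        assert mem.total > 0 and mem.used >= 0
    finally:
        pynvml.nvmlShutdown()
```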
Yeah, that's fair; my problem here is mostly that this test seems to lean more towards "check that Distributed is working normally when NVML is enabled" than "check that an NVML-related feature in Distributed is working properly." I imagine that if NVML testing in CI is eventually set up for Distributed, we would probably want a test like this to be removed, since it would effectively become a duplicate of the scheduler test I referred to. |
Hmm, another, stranger issue I'm noticing here is that when running the tests locally with this test:

```python
@gen_cluster()
async def test_gpu_monitoring_range_query(s, a, b):
    res = await s.get_worker_monitor_info()
    ms = ["gpu_utilization", "gpu_memory_used"]

    for w in (a, b):
        assert all(res[w.address]["range_query"][m] is not None for m in ms)
        assert res[w.address]["count"] is not None
        assert res[w.address]["last_time"] is not None
```

I get the following results:
Versus running the test alone:
Not really sure what's going on here; @pentschev if you get the chance, could you verify that this is a general problem by running the following commands:
And letting me know if you only get a failure in the latter case? |
The test running alone or not is probably related to whether we clean up |
@charlesbluca are you testing the above with dask/distributed#4866? To me it seems fine when running tests individually. |
I can try out your patch, but my issue isn't whether or not dask/distributed#4866 fixes the failure; it's that the failure seemingly goes undetected depending on the context the test is run in. For example, if my local test results mirror what would happen in gpuCI, this line: Line 120 in 81bbc6f
with my hypothetical new test, would not result in a failure even if dask/distributed#4866 were not merged in. |
I understand; I'm merely saying I tested that on a DGX with dask/distributed#4866 and I can't reproduce what you see, so I'm not sure what that is. Maybe the PR is actually fixing that issue for some reason. |
If you have a Distributed env without 4866, would you mind checking whether this issue happens there? My worry here is that if there is a problem with the Distributed tests that goes unchecked here, we could potentially have another situation like this, where a Distributed test falsely "passes" in either Distributed or Dask-CUDA's CI, and a breaking change is introduced.
@quasiben could you clarify what this might look like? Wondering if we could/should have some NVML-related cleanup happening in general for tests when NVML is enabled, perhaps through the |
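As a rough sketch of what general NVML-related cleanup could look like (an assumption on my part, since the mechanism being referred to is cut off above; the fixture name is hypothetical), a pytest fixture could wrap NVML initialization and shutdown:

```python
import pytest

pynvml = pytest.importorskip("pynvml")


@pytest.fixture
def nvml_context():
    # Hypothetical cleanup fixture: initialize NVML for the test and shut it
    # down afterwards so state doesn't leak into subsequently run tests.
    pynvml.nvmlInit()
    yield
    pynvml.nvmlShutdown()
```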
Opened dask/distributed#4879 to add a new test to |
Both fail without dask/distributed#4866. |
Good to know - in that case, I'm going to assume the strange behavior was a local issue on my end. |
From #635 (comment) I was under the impression you were NOT running dask/distributed#4866 in #635 (comment), thus the failure, no? |
Yes, I wasn't using dask/distributed#4866, and because of this I was expecting both of the following commands to fail on
But when running these locally, the first command had no failures (i.e. If we wanted to confirm whether or not this would be a problem in general, we could merge dask/distributed#4879 before either of your PRs, and run Dask-CUDA's CI to see if it fails on the function I added there ( |
It seems that this will not be needed when dask/distributed#4879 is merged. However, I'll rerun tests here now to see if dask/distributed#4866 indeed breaks the |
rerun tests |
rerun tests |
As discussed, this will not be necessary when dask/distributed#4879 lands. I was also able to fix the dask-cuda tests in #638, at least locally that works. Given that, I'm closing this now, thanks @charlesbluca for raising this and adding the new NVML test to Distributed! |
In response to #634, this adds Distributed scheduler tests to the gpuCI build script, so that we can hopefully catch NVML-related system monitor issues earlier.
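For reference, a minimal sketch of limiting the Distributed run to just the GPU-monitoring scheduler tests rather than the whole module (hypothetical: the actual gpuCI build script is a shell script, and the `-k` expression here is an assumption):

```python
import sys

import pytest

# Run only the GPU-monitoring tests from Distributed's scheduler test module,
# instead of the full test_scheduler suite.
sys.exit(
    pytest.main(
        ["--pyargs", "distributed.tests.test_scheduler", "-k", "gpu_monitoring"]
    )
)
```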