System Monitor Error #634
Sure! Looks like, for some reason, in this block of `distributed/system_monitor.py`:

```python
# give external modules (like dask-cuda) a chance to initialize CUDA context
if nvml is not None and nvml.nvmlInit is not None:
    if self.gpu_name is None:
        gpu_extra = nvml.one_time()
        self.gpu_name = gpu_extra["name"]
        self.gpu_memory_total = gpu_extra["memory-total"]
    gpu_metrics = nvml.real_time()
    self.gpu_utilization.append(gpu_metrics["utilization"])
    self.gpu_memory_used.append(gpu_metrics["memory-used"])
    result["gpu_utilization"] = gpu_metrics["utilization"]
    result["gpu_memory_used"] = gpu_metrics["memory-used"]
```

Are we initializing the CUDA context from the benchmark? |
We used to initialize a CUDA context when starting the workers, but I think we changed things slightly. @pentschev mentioned some of those changes here: @pentschev, do you have an idea of what's going on here? Seems like there is another initialization/timing issue around CUDA contexts. |
In general, maybe a good failsafe for GPU metrics in the system monitor would be to not add these deques to `self.quantities` until NVML is actually reporting data, e.g.:

```python
# give external modules (like dask-cuda) a chance to initialize CUDA context
if nvml is not None and nvml.nvmlInit is not None:
    if "gpu_utilization" not in self.quantities:
        self.quantities["gpu_utilization"] = self.gpu_utilization
        self.quantities["gpu_memory_used"] = self.gpu_memory_used
    ...
```
|
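A fuller sketch of that failsafe, assuming the `SystemMonitor` layout quoted in this thread and that `distributed.diagnostics.nvml` exposes the `nvmlInit`, `one_time`, and `real_time` attributes used above (the rest is illustrative, not the actual Distributed code):

```python
from collections import deque

import psutil

try:
    from distributed.diagnostics import nvml
except ImportError:
    nvml = None


class SystemMonitor:
    def __init__(self, n=10000):
        self.cpu = deque(maxlen=n)
        self.quantities = {"cpu": self.cpu}
        # GPU deques are created up front but only registered in
        # self.quantities once NVML actually reports data, so range_query
        # never indexes a deque that was never populated.
        self.gpu_utilization = deque(maxlen=n)
        self.gpu_memory_used = deque(maxlen=n)

    def update(self):
        self.cpu.append(psutil.cpu_percent())
        if nvml is not None and nvml.nvmlInit is not None:
            if "gpu_utilization" not in self.quantities:
                self.quantities["gpu_utilization"] = self.gpu_utilization
                self.quantities["gpu_memory_used"] = self.gpu_memory_used
            gpu_metrics = nvml.real_time()
            self.gpu_utilization.append(gpu_metrics["utilization"])
            self.gpu_memory_used.append(gpu_metrics["memory-used"])
```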
Dask-CUDA still initializes the context when spinning up workers. Nothing has changed on TCP; the only CUDA context change that has happened is on UCX, which now initializes the context during comms initialization to ensure the context is active when we initialize UCX: https://github.com/dask/distributed/pull/4787/files#diff-eefe4811faa99fe09003bebb81398442ef60d226dc8d3e4abef629e0c71fad78R59-R70 . Since you're running TCP, that must be something happening elsewhere, although I don't think this should be related to the CUDA context because, AFAIK, we rely on pyNVML, which doesn't create/use a context. |
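For what it's worth, the point that pyNVML never needs a CUDA context can be seen with a few lines of pynvml on its own; none of these calls creates a context in the calling process:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Pure NVML queries: they read driver-level state and never create or
# require a CUDA context in this process.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("utilization: %d%%, memory used: %d / %d bytes" % (util.gpu, mem.used, mem.total))

pynvml.nvmlShutdown()
```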
For reference, this is how we initialize the context in Dask-CUDA (and that hasn't changed recently):

- dask-cuda/dask_cuda/initialize.py, lines 81 to 85 in 81bbc6f
- dask-cuda/dask_cuda/local_cuda_cluster.py, lines 300 to 307 in 81bbc6f
- dask-cuda/dask_cuda/cuda_worker.py, lines 215 to 216 in 81bbc6f
|
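For readers without the repo open, the initialization referenced in those three locations amounts to forcing a CUDA context on each worker process before it starts running tasks. A rough sketch of the idea (not the verbatim Dask-CUDA code), assuming Numba is used for the context creation:

```python
import numba.cuda

from dask.distributed import Client


def _create_cuda_context():
    # Fetching the current context creates one if this process has none yet,
    # which makes CUDA "warm" before any task arrives on the worker.
    numba.cuda.current_context()


def initialize_cuda_on_workers(client: Client):
    # Run the context creation on every worker process in the cluster.
    client.run(_create_cuda_context)
```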
Note this also happens in "basic" usage:

```python
In [1]: from dask_cuda import LocalCUDACluster

In [2]: from dask.distributed import Client

In [3]: from dask.distributed import Client, performance_report

In [4]: cluster = LocalCUDACluster()

In [5]: client = Client(cluster)

In [6]: with performance_report(filename='foo.html'):
   ...:     client.submit(lambda x: x+1, 1)
   ...:
```
|
Yeah, looking into it further, this is seemingly unrelated to perf reports as well:

```
In [6]: cluster.scheduler.monitor.range_query(start=1)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-6-39d6e9217fef> in <module>
----> 1 cluster.scheduler.monitor.range_query(start=1)
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in range_query(self, start)
121 seq = [i for i in range(istart, len(self.cpu))]
122
--> 123 d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
124 return d
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <dictcomp>(.0)
121 seq = [i for i in range(istart, len(self.cpu))]
122
--> 123 d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
124 return d
~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <listcomp>(.0)
121 seq = [i for i in range(istart, len(self.cpu))]
122
--> 123 d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
124 return d
IndexError: deque index out of range
```

I'd imagine that this may have been a problem that went under the radar since dask/distributed#4688, since the Distributed test suite doesn't test a GPU-enabled setup.

I'm not really too familiar with PyNVML, but is there a way we can check if the CUDA context has been initialized by Dask-CUDA? We'd probably want to change this line to set ...

Another thing that's confusing me here is that the worker metrics do seem to contain the GPU values:

```python
In [16]: cluster.scheduler.workers['tcp://127.0.0.1:45365'].metrics
Out[16]:
{'executing': 0,
'in_memory': 0,
'ready': 0,
'in_flight': 0,
'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}},
'spilled_nbytes': 0,
'cpu': 2.0,
'memory': 341942272,
'time': 1622564423.8664095,
'read_bytes': 2707.1238745576998,
'write_bytes': 1209.395577095304,
'num_fds': 46,
'gpu_utilization': 0,
'gpu_memory_used': 347668480,
 'gpu': {'utilization': 0, 'memory-used': 347668480}}
```
|
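The traceback above can be reproduced without any GPU machinery: `range_query` builds its index list from `len(self.cpu)` but applies those indices to every deque registered in `self.quantities`, so a GPU deque that was registered but never appended to fails immediately. A standalone sketch of that failure mode:

```python
from collections import deque

cpu = deque([10.0, 12.5, 11.0], maxlen=10)   # appended to on every update()
gpu_utilization = deque(maxlen=10)           # registered up front, never populated
quantities = {"cpu": cpu, "gpu_utilization": gpu_utilization}

istart = 1
seq = [i for i in range(istart, len(cpu))]   # indices derived from the CPU deque only

# Same shape as system_monitor.range_query: indexing the empty GPU deque
# with positions it never had raises "IndexError: deque index out of range".
d = {k: [v[i] for i in seq] for k, v in quantities.items()}
```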
I think it might be simpler to move the nvmlInit from update to ...:

```python
if nvml is not None:
    if nvml.nvmlInit is None:
        gpu_extra = nvml.one_time()
```

This should be safe for CPU workloads as ... |
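A sketch of what that could look like if the NVML initialization is done lazily inside the diagnostics helpers rather than in `update` (the module-level flag and helper names here are illustrative, not the actual `distributed.diagnostics.nvml` layout):

```python
import pynvml

_initialized = False


def _ensure_nvml():
    # Initialize NVML at most once per process; raises pynvml.NVMLError if
    # no NVIDIA driver is available, so callers can guard with try/except.
    global _initialized
    if not _initialized:
        pynvml.nvmlInit()
        _initialized = True


def one_time():
    _ensure_nvml()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return {
        "memory-total": pynvml.nvmlDeviceGetMemoryInfo(handle).total,
        "name": pynvml.nvmlDeviceGetName(handle),
    }


def real_time():
    _ensure_nvml()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return {
        "utilization": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
        "memory-used": pynvml.nvmlDeviceGetMemoryInfo(handle).used,
    }
```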
Even with the conditional there, wouldn't ...? |
I opened dask/distributed#4866 to fix this; it introduces back the change that broke #564, but also a fix for that specifically in dask/distributed@d860e58. |
Thanks @pentschev! I'll submit a test for this in Dask-CUDA (maybe something like @quasiben's perf report snippet), but it would be nice to have an equivalent test in Distributed if/when we are able to test NVML stuff there. |
Maybe it makes sense to add the test to Distributed as well, with a ... |
Sure, I can do that! Would it be skipped by the Distributed CI right now?

EDIT: ... |
Yes, ideally we would want Dask to even run these tests on their own CI, though this is admittedly not possible currently. Simply having them there will make it easier to do once that is an option.
It seems that those are already in ..., line 120 in 81bbc6f.
|
Yeah, it looks like those tests succeed despite the issues we've been discussing - the tests are using ...

To get a check for this error, I think a simple test that verifies that we can take a ... |
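A test along those lines could stay very small: spin up a cluster, run one task under `performance_report`, and then call `range_query` directly, since that is the code path that raised the `IndexError`. A sketch (the `gpu` marker is an assumption about how such a test would be gated, not an existing fixture):

```python
import pytest

pytest.importorskip("dask_cuda")

from dask.distributed import Client, performance_report
from dask_cuda import LocalCUDACluster


@pytest.mark.gpu  # hypothetical marker: only meaningful on a CUDA-capable runner
def test_performance_report_and_range_query(tmp_path):
    with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
        with performance_report(filename=str(tmp_path / "report.html")):
            assert client.submit(lambda x: x + 1, 1).result() == 2
        # This call used to raise "IndexError: deque index out of range"
        # when the GPU deques were registered but never populated.
        cluster.scheduler.monitor.range_query(start=0)
```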
Actually, it looks like we do have a test that calls ... I wonder why this test passed Dask-CUDA's CI? I tried it locally and got the expected failure.

EDIT: Just realized the dask-cuda checks are only running ... |
Yes, I think so. We don't want to duplicate Distributed's CI, but specific tests for CUDA behavior we can add here. |
Resolved by dask/distributed#4866. |
I'm getting an error when running the local_cudf_merge benchmark with the latest distributed/dask (in main).

@charlesbluca do you have time to investigate what is happening here? I think you are familiar with the system_monitor section of distributed.