
System Monitor Error #634

Closed

quasiben opened this issue Jun 1, 2021 · 19 comments

Comments

@quasiben (Member) commented Jun 1, 2021

I'm getting an error when running the local_cudf_merge benchmark with the latest distributed/dask (main branch).

python local_cudf_merge.py -p tcp -d 0 --profile foo.html

bzaitlen@prm-dgx-06:~$ python $CONDA_PREFIX/lib/python3.8/site-packages/dask_cuda/benchmarks/local_cudf_merge.py -p tcp -d 0 --profile foo.html
distributed.utils - ERROR - deque index out of range
Traceback (most recent call last):
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/utils.py", line 671, in log_errors
    yield
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 581, in update
    self.source.stream(self.get_data(), 1000)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 573, in get_data
    d = self.worker.monitor.range_query(start=self.last_count)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range

@charlesbluca do you have time to investigate what is happening here? I think you are familiar with the system_monitor section of distributed.

@charlesbluca (Member)

Sure! It looks like, for some reason, in self.quantities.items() we have empty deques for gpu_utilization and gpu_memory_used that throw the error when we try to poll them, implying that the block that appends to these deques is never reached:

        # give external modules (like dask-cuda) a chance to initialize CUDA context
        if nvml is not None and nvml.nvmlInit is not None:
            if self.gpu_name is None:
                gpu_extra = nvml.one_time()
                self.gpu_name = gpu_extra["name"]
                self.gpu_memory_total = gpu_extra["memory-total"]
            gpu_metrics = nvml.real_time()
            self.gpu_utilization.append(gpu_metrics["utilization"])
            self.gpu_memory_used.append(gpu_metrics["memory-used"])
            result["gpu_utilization"] = gpu_metrics["utilization"]
            result["gpu_memory_used"] = gpu_metrics["memory-used"]

Are we initializing the CUDA context from the benchmark?

@quasiben (Member, Author) commented Jun 1, 2021

We used to initialize a CUDA context when starting the workers, but I think we changed things slightly. @pentschev mentioned some of those changes here:
#632

@pentschev do you have an idea of what's going on here? It seems like there is another initialization/timing issue around CUDA contexts.

@charlesbluca (Member)

In general, maybe a good failsafe for GPU metrics in the system monitor would be to not add these deques to SystemMonitor.quantities until we are sure the CUDA context has been initialized:

        # give external modules (like dask-cuda) a chance to initialize CUDA context
        if nvml is not None and nvml.nvmlInit is not None:
            if "gpu_utilization" not in self.quantities:
                self.quantities["gpu_utilization"] = self.gpu_utilization
                self.quantities["gpu_memory_used"] = self.gpu_memory_used
            ...
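
As a standalone sketch of that idea (illustrative only, not the actual distributed.system_monitor code), registering the GPU deques in quantities only once a GPU sample has actually been appended keeps range_query from indexing an empty deque:

from collections import deque

class MonitorSketch:
    """Toy monitor illustrating lazy registration of GPU quantities."""

    def __init__(self, maxlen=5400):
        self.cpu = deque(maxlen=maxlen)
        self.gpu_utilization = deque(maxlen=maxlen)
        self.quantities = {"cpu": self.cpu}  # GPU deque not registered yet

    def update(self, gpu_sample=None):
        self.cpu.append(12.3)
        if gpu_sample is not None:
            # register the GPU deque only once we actually have data for it
            # (assumes GPU samples, when present, arrive on every update so
            # lengths stay aligned with cpu)
            if "gpu_utilization" not in self.quantities:
                self.quantities["gpu_utilization"] = self.gpu_utilization
            self.gpu_utilization.append(gpu_sample)

    def range_query(self, start=0):
        seq = range(start, len(self.cpu))
        return {k: [v[i] for i in seq] for k, v in self.quantities.items()}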

@pentschev (Member)

Dask-CUDA still initializes the context when spinning up workers. Nothing has changed on TCP; the only CUDA context change is on UCX, which now initializes the context during comms initialization to ensure it is active when we initialize UCX: https://github.com/dask/distributed/pull/4787/files#diff-eefe4811faa99fe09003bebb81398442ef60d226dc8d3e4abef629e0c71fad78R59-R70. Since you're running TCP, that must be something elsewhere, although I don't think this should be related to the CUDA context because, AFAIK, we rely on pyNVML, which doesn't create/use a context.

@pentschev (Member) commented Jun 1, 2021

For reference, this is how we initialize the context in Dask-CUDA (and that hasn't changed recently):

if create_cuda_context:
    try:
        numba.cuda.current_context()
    except Exception:
        logger.error("Unable to start CUDA Context", exc_info=True)

initialize(
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_nvlink=enable_nvlink,
    enable_infiniband=enable_infiniband,
    enable_rdmacm=enable_rdmacm,
    net_devices=ucx_net_devices,
    cuda_device_index=0,
)

preload=(list(preload) or []) + ["dask_cuda.initialize"],
preload_argv=(list(preload_argv) or []) + ["--create-cuda-context"],

@quasiben (Member, Author) commented Jun 1, 2021

Note this also happens in "basic" usage:

In [1]: from dask_cuda import LocalCUDACluster

In [2]: from dask.distributed import Client

In [3]: from dask.distributed import Client, performance_report

In [4]: cluster = LocalCUDACluster()

In [5]: client = Client(cluster)

In [6]: with performance_report(filename='foo.html'):
   ...:     client.submit(lambda x: x+1, 1)
   ...:
/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/client.py in __aexit__(self, typ, value, traceback, code)
   4711             except Exception:
   4712                 code = ""
-> 4713         data = await get_client().scheduler.performance_report(
   4714             start=self.start, last_count=self.last_count, code=code
   4715         )

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
    862             name, comm.name = comm.name, "ConnectionPool." + key
    863             try:
--> 864                 result = await send_recv(comm=comm, op=key, **kwargs)
    865             finally:
    866                 self.pool.reuse(self.addr, comm)

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
    661         if comm.deserialize:
    662             typ, exc, tb = clean_exception(**response)
--> 663             raise exc.with_traceback(tb)
    664         else:
    665             raise Exception(response["text"])

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in handle_comm()
    496                             result = asyncio.ensure_future(result)
    497                             self._ongoing_coroutines.add(result)
--> 498                             result = await result
    499                     except (CommClosedError, CancelledError) as e:
    500                         if self.status == Status.running:

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/scheduler.cpython-38-x86_64-linux-gnu.so in performance_report()

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/bokeh/core/property/validation.py in func()
     91     def func(*args, **kwargs):
     92         with validate(False):
---> 93             return input_function(*args, **kwargs)
     94     return func
     95

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py in update()
    579     def update(self):
    580         with log_errors():
--> 581             self.source.stream(self.get_data(), 1000)
    582             self.label_source.data["cpu"] = list(
    583                 "{}: {:.1f}%".format(f.__name__, f(self.source.data["cpu"]))

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py in get_data()
    571
    572     def get_data(self):
--> 573         d = self.worker.monitor.range_query(start=self.last_count)
    574         d["time"] = [x * 1000 for x in d["time"]]
    575         self.last_count = self.worker.monitor.count

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py in range_query()
    121         seq = [i for i in range(istart, len(self.cpu))]
    122
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

@charlesbluca (Member) commented Jun 1, 2021

Yeah, looking into it further, this seems to be unrelated to performance reports as well:

In [6]: cluster.scheduler.monitor.range_query(start=1)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-39d6e9217fef> in <module>
----> 1 cluster.scheduler.monitor.range_query(start=1)

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in range_query(self, start)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <dictcomp>(.0)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <listcomp>(.0)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

IndexError: deque index out of range

I'd imagine this is a problem that has gone under the radar since dask/distributed#4688, because the Distributed test suite doesn't test a LocalCUDACluster and the Dask-CUDA test suite doesn't do extensive dashboard testing.

I'm not too familiar with PyNVML, but is there a way we can check whether the CUDA context has been initialized by Dask-CUDA? We'd probably want to change this line to set nvmlInit based on that condition.
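
Not necessarily what we'd want in distributed itself, but as one illustrative way to check with pyNVML whether the current process holds a CUDA context on a device (nvmlDeviceGetComputeRunningProcesses lists processes that have a context on that device):

import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Processes returned here hold a CUDA context on device 0, so finding our
# own PID among them suggests this process has already created a context.
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
has_context = any(p.pid == os.getpid() for p in procs)
print(has_context)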

Another thing that's confusing me here is that the WorkerStates of the cluster are still reporting GPU information, which implies that NVML initialization has happened already:

In [16]: cluster.scheduler.workers['tcp://127.0.0.1:45365'].metrics
Out[16]: 
{'executing': 0,
 'in_memory': 0,
 'ready': 0,
 'in_flight': 0,
 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}},
 'spilled_nbytes': 0,
 'cpu': 2.0,
 'memory': 341942272,
 'time': 1622564423.8664095,
 'read_bytes': 2707.1238745576998,
 'write_bytes': 1209.395577095304,
 'num_fds': 46,
 'gpu_utilization': 0,
 'gpu_memory_used': 347668480,
 'gpu': {'utilization': 0, 'memory-used': 347668480}}

@quasiben (Member, Author) commented Jun 1, 2021

I think it might be simpler to move the nvmlInit from update to __init__, no? For example, moving the one_time call:

        if nvml is not None:
            if nvml.nvmlInit is None:
                gpu_extra = nvml.one_time()

This should be safe for CPU workloads, as the nvml variable should always be None in that case.
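
For illustration, here is a toy, self-contained sketch of what moving the one-time work into __init__ looks like (TinyMonitor and fake_one_time are made-up stand-ins, not distributed code):

def fake_one_time():
    # stand-in for nvml.one_time(): imagine this initializes NVML and
    # returns static GPU info
    print("NVML initialized")
    return {"name": "GPU-0", "memory-total": 16_000_000_000}

class TinyMonitor:
    def __init__(self):
        # eager: runs at construction time, i.e. as soon as the monitor
        # (and therefore the scheduler/worker that owns it) is created
        gpu_extra = fake_one_time()
        self.gpu_name = gpu_extra["name"]
        self.gpu_memory_total = gpu_extra["memory-total"]

TinyMonitor()  # prints "NVML initialized" immediately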

@charlesbluca (Member)

Even with the conditional there, wouldn't nvml.one_time() just end up getting called immediately when the scheduler system monitor is initialized? To my knowledge, that was the cause of #564.

@pentschev (Member)

I opened dask/distributed#4866 to fix this; it reintroduces the change that broke #564, but also includes a specific fix for that in dask/distributed@d860e58.

@charlesbluca (Member)

Thanks @pentschev! I'll submit a test for this in Dask-CUDA (maybe something like @quasiben's perf report snippet), but it would be nice to have an equivalent test in Distributed if/when we are able to test NVML stuff there.

@jakirkham (Member)

Maybe it makes sense to add the test to Distributed as well with a pytest.importorskip?

@charlesbluca (Member) commented Jun 1, 2021

Sure, I can do that! Would it be skipped by the Distributed CI right now?

EDIT:
Actually, it looks like these Distributed tests already handle the general issue here (checking for GPU info in the WorkerState/worker monitors), so we should be good there.

@jakirkham (Member)

Yes, ideally we would want Dask to run these tests on their own CI, though that is admittedly not possible currently. Simply having the tests there will make it easier to do once that becomes an option.

@pentschev (Member)

It seems that those tests are already run:

py.test --cache-clear -vs `python -c "import distributed.diagnostics.tests.test_nvml as m;print(m.__file__)"`

but Dask-CUDA's CI hasn't failed on them. @charlesbluca could you double-check whether those tests indeed cover everything that's relevant for us?

@charlesbluca (Member)

Yeah, it looks like those tests succeed despite the issues we've been discussing - the tests are using SystemMonitor.recent() to poll for results instead of SystemMonitor.range_query(), and for some reason that call actually succeeds and returns some GPU info.

To get a check for this error, I think a simple test that verifies we can take a range_query() from the worker and scheduler monitors should be enough.
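
Something like this, perhaps (a hypothetical sketch, not an existing test; the test name and the synchronous cluster setup are assumptions):

from dask_cuda import LocalCUDACluster
from distributed import Client

def test_monitor_range_query():
    # Hypothetical test sketch: range_query() on the scheduler's monitor
    # should not raise once GPU metrics are collected correctly.
    with LocalCUDACluster(n_workers=1) as cluster, Client(cluster):
        data = cluster.scheduler.monitor.range_query(start=0)
        assert "gpu_utilization" in data
        assert "gpu_memory_used" in data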

@charlesbluca (Member) commented Jun 1, 2021

Actually, it looks like we do have a test that calls range_query():

I wonder why this test passed Dask-CUDA's CI - I tried it locally and got the expected failure.

EDIT:

Just realized the dask-cuda checks are only running test_nvml; would it be better to add a specific test for range_query() to the NVML tests, or add the relevant scheduler test(s) to the gpuCI script?

@pentschev (Member)

Just realized the dask-cuda checks are only running test_nvml; would it be better to add a specific test for range_query() to the NVML tests, or add the relevant scheduler test(s) to the gpuCI script?

Yes, I think so. We don't want to duplicate Distributed's CI, but we can add specific tests for CUDA behavior here.

@quasiben (Member, Author) commented Jun 8, 2021

Resolved by dask/distributed#4866.
