
System Monitor Error #634

Closed

quasiben opened this issue Jun 1, 2021 · 19 comments

Comments

@quasiben (Member) commented Jun 1, 2021

I'm getting an error when running the local_cudf_merge benchmark with the latest distributed/dask (main branch).

python local_cudf_merge.py -p tcp -d 0 --profile foo.html

bzaitlen@prm-dgx-06:~$ python $CONDA_PREFIX/lib/python3.8/site-packages/dask_cuda/benchmarks/local_cudf_merge.py -p tcp -d 0 --profile foo.html
distributed.utils - ERROR - deque index out of range
Traceback (most recent call last):
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/utils.py", line 671, in log_errors
    yield
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 581, in update
    self.source.stream(self.get_data(), 1000)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 573, in get_data
    d = self.worker.monitor.range_query(start=self.last_count)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range

@charlesbluca do you have time to investigate what is happening here? I think you are familiar with the system_monitor section of distributed.

@charlesbluca (Member)

Sure! It looks like, for some reason, in self.quantities.items() we have empty deques for gpu_utilization and gpu_memory_used that throw the error when we try to poll them, implying that the block that appends to these deques is never reached:

        # give external modules (like dask-cuda) a chance to initialize CUDA context
        if nvml is not None and nvml.nvmlInit is not None:
            if self.gpu_name is None:
                gpu_extra = nvml.one_time()
                self.gpu_name = gpu_extra["name"]
                self.gpu_memory_total = gpu_extra["memory-total"]
            gpu_metrics = nvml.real_time()
            self.gpu_utilization.append(gpu_metrics["utilization"])
            self.gpu_memory_used.append(gpu_metrics["memory-used"])
            result["gpu_utilization"] = gpu_metrics["utilization"]
            result["gpu_memory_used"] = gpu_metrics["memory-used"]

Are we initializing the CUDA context from the benchmark?

@quasiben (Member, Author) commented Jun 1, 2021

We used to initialize a CUDA context when starting the workers, but I think we changed things slightly. @pentschev mentioned some of those changes here:
#632

@pentschev do you have an idea of what's going on here? It seems like there is another initialization/timing issue around CUDA contexts.

@charlesbluca (Member)

In general, maybe a good failsafe for GPU metrics in the system monitor would be to not add these deques to SystemMonitor.quantities until we are sure the CUDA context has been initialized:

        # give external modules (like dask-cuda) a chance to initialize CUDA context
        if nvml is not None and nvml.nvmlInit is not None:
            if "gpu_utilization" not in self.quantities:
                self.quantities["gpu_utilization"] = self.gpu_utilization
                self.quantities["gpu_memory_used"] = self.gpu_memory_used
            ...
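
As a standalone sketch of that idea (illustrative only, not the actual distributed.system_monitor code), registering the GPU deques in quantities only once a GPU sample has actually been appended keeps range_query from indexing an empty deque:

from collections import deque

class MonitorSketch:
    """Toy monitor illustrating lazy registration of GPU quantities."""

    def __init__(self, maxlen=5400):
        self.cpu = deque(maxlen=maxlen)
        self.gpu_utilization = deque(maxlen=maxlen)
        self.quantities = {"cpu": self.cpu}  # GPU deque not registered yet

    def update(self, gpu_sample=None):
        self.cpu.append(12.3)
        if gpu_sample is not None:
            # register the GPU deque only once we actually have data for it
            # (assumes GPU samples, when present, arrive on every update so
            # lengths stay aligned with cpu)
            if "gpu_utilization" not in self.quantities:
                self.quantities["gpu_utilization"] = self.gpu_utilization
            self.gpu_utilization.append(gpu_sample)

    def range_query(self, start=0):
        seq = range(start, len(self.cpu))
        return {k: [v[i] for i in seq] for k, v in self.quantities.items()}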

@pentschev (Member)

Dask-CUDA still initializes the context when spinning up workers. Nothing has changed on TCP; the only CUDA context change is on UCX, which now initializes the context during comms initialization to ensure it is active when we initialize UCX: https://github.com/dask/distributed/pull/4787/files#diff-eefe4811faa99fe09003bebb81398442ef60d226dc8d3e4abef629e0c71fad78R59-R70. Since you're running TCP, that must be something elsewhere, although I don't think this should be related to the CUDA context because, AFAIK, we rely on pyNVML, which doesn't create/use a context.

@pentschev (Member) commented Jun 1, 2021

For reference, this is how we initialize the context in Dask-CUDA (and that hasn't changed recently):

if create_cuda_context:
    try:
        numba.cuda.current_context()
    except Exception:
        logger.error("Unable to start CUDA Context", exc_info=True)

initialize(
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_nvlink=enable_nvlink,
    enable_infiniband=enable_infiniband,
    enable_rdmacm=enable_rdmacm,
    net_devices=ucx_net_devices,
    cuda_device_index=0,
)

preload=(list(preload) or []) + ["dask_cuda.initialize"],
preload_argv=(list(preload_argv) or []) + ["--create-cuda-context"],

@quasiben (Member, Author) commented Jun 1, 2021

Note this also happens in "basic" usage:

In [1]: from dask_cuda import LocalCUDACluster

In [2]: from dask.distributed import Client

In [3]: from dask.distributed import Client, performance_report

In [4]: cluster = LocalCUDACluster()

In [5]: client = Client(cluster)

In [6]: with performance_report(filename='foo.html'):
   ...:     client.submit(lambda x: x+1, 1)
   ...:
/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/client.py in __aexit__(self, typ, value, traceback, code)
   4711             except Exception:
   4712                 code = ""
-> 4713         data = await get_client().scheduler.performance_report(
   4714             start=self.start, last_count=self.last_count, code=code
   4715         )

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
    862             name, comm.name = comm.name, "ConnectionPool." + key
    863             try:
--> 864                 result = await send_recv(comm=comm, op=key, **kwargs)
    865             finally:
    866                 self.pool.reuse(self.addr, comm)

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
    661         if comm.deserialize:
    662             typ, exc, tb = clean_exception(**response)
--> 663             raise exc.with_traceback(tb)
    664         else:
    665             raise Exception(response["text"])

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/core.py in handle_comm()
    496                             result = asyncio.ensure_future(result)
    497                             self._ongoing_coroutines.add(result)
--> 498                             result = await result
    499                     except (CommClosedError, CancelledError) as e:
    500                         if self.status == Status.running:

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/scheduler.cpython-38-x86_64-linux-gnu.so in performance_report()

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/bokeh/core/property/validation.py in func()
     91     def func(*args, **kwargs):
     92         with validate(False):
---> 93             return input_function(*args, **kwargs)
     94     return func
     95

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py in update()
    579     def update(self):
    580         with log_errors():
--> 581             self.source.stream(self.get_data(), 1000)
    582             self.label_source.data["cpu"] = list(
    583                 "{}: {:.1f}%".format(f.__name__, f(self.source.data["cpu"]))

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py in get_data()
    571
    572     def get_data(self):
--> 573         d = self.worker.monitor.range_query(start=self.last_count)
    574         d["time"] = [x * 1000 for x in d["time"]]
    575         self.last_count = self.worker.monitor.count

/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py in range_query()
    121         seq = [i for i in range(istart, len(self.cpu))]
    122
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

@charlesbluca (Member) commented Jun 1, 2021

Yeah, looking into it further, this seems to be unrelated to performance reports as well:

In [6]: cluster.scheduler.monitor.range_query(start=1)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-39d6e9217fef> in <module>
----> 1 cluster.scheduler.monitor.range_query(start=1)

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in range_query(self, start)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <dictcomp>(.0)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

~/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/distributed/system_monitor.py in <listcomp>(.0)
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

IndexError: deque index out of range

I'd imagine this is a problem that has gone under the radar since dask/distributed#4688, because the Distributed test suite doesn't test a LocalCUDACluster and the Dask-CUDA test suite doesn't do extensive dashboard testing.

I'm not too familiar with PyNVML, but is there a way we can check whether the CUDA context has been initialized by Dask-CUDA? We'd probably want to change this line to set nvmlInit based on that condition.
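
Not necessarily what we'd want in distributed itself, but as one illustrative way to check with pyNVML whether the current process holds a CUDA context on a device (nvmlDeviceGetComputeRunningProcesses lists processes that have a context on that device):

import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Processes returned here hold a CUDA context on device 0, so finding our
# own PID among them suggests this process has already created a context.
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
has_context = any(p.pid == os.getpid() for p in procs)
print(has_context)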

Another thing that's confusing me here is that the WorkerStates of the cluster are still reporting GPU information, which implies that NVML initialization has happened already:

In [16]: cluster.scheduler.workers['tcp://127.0.0.1:45365'].metrics
Out[16]: 
{'executing': 0,
 'in_memory': 0,
 'ready': 0,
 'in_flight': 0,
 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}},
 'spilled_nbytes': 0,
 'cpu': 2.0,
 'memory': 341942272,
 'time': 1622564423.8664095,
 'read_bytes': 2707.1238745576998,
 'write_bytes': 1209.395577095304,
 'num_fds': 46,
 'gpu_utilization': 0,
 'gpu_memory_used': 347668480,
 'gpu': {'utilization': 0, 'memory-used': 347668480}}

@quasiben (Member, Author) commented Jun 1, 2021

I think it might be simpler to move the nvmlInit from update to __init__, no? For example, moving the one_time call:

        if nvml is not None:
            if nvml.nvmlInit is None:
                gpu_extra = nvml.one_time()

This should be safe for CPU workloads, as the nvml variable should always be None in that case.
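
For illustration, here is a toy, self-contained sketch of what moving the one-time work into __init__ looks like (TinyMonitor and fake_one_time are made-up stand-ins, not distributed code):

def fake_one_time():
    # stand-in for nvml.one_time(): imagine this initializes NVML and
    # returns static GPU info
    print("NVML initialized")
    return {"name": "GPU-0", "memory-total": 16_000_000_000}

class TinyMonitor:
    def __init__(self):
        # eager: runs at construction time, i.e. as soon as the monitor
        # (and therefore the scheduler/worker that owns it) is created
        gpu_extra = fake_one_time()
        self.gpu_name = gpu_extra["name"]
        self.gpu_memory_total = gpu_extra["memory-total"]

TinyMonitor()  # prints "NVML initialized" immediately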

@charlesbluca (Member)

Even with the conditional there, wouldn't nvml.one_time() just end up getting called immediately when the scheduler system monitor is initialized? To my knowledge, that was the cause of #564.

@pentschev (Member)

I opened dask/distributed#4866 to fix this; it reintroduces the change that broke #564, but also includes a specific fix for that in dask/distributed@d860e58.

@charlesbluca (Member)

Thanks @pentschev! I'll submit a test for this in Dask-CUDA (maybe something like @quasiben's perf report snippet), but it would be nice to have an equivalent test in Distributed if/when we are able to test NVML stuff there.

@jakirkham (Member)

Maybe it makes sense to add the test to Distributed as well with a pytest.importorskip?

@charlesbluca (Member) commented Jun 1, 2021

Sure, I can do that! Would it be skipped by the Distributed CI right now?

EDIT:
Actually, it looks like these Distributed tests already handle the general issue here (checking for GPU info in the WorkerState/worker monitors), so we should be good there.

@jakirkham (Member)

Yes, ideally we would want Dask to run these tests on their own CI, though that is admittedly not possible currently. Simply having the tests there will make it easier to do once that becomes an option.

@pentschev (Member)

It seems that those tests are already run:

py.test --cache-clear -vs `python -c "import distributed.diagnostics.tests.test_nvml as m;print(m.__file__)"`

but Dask-CUDA's CI hasn't failed on them. @charlesbluca could you double-check whether those tests indeed cover everything that's relevant for us?

@charlesbluca (Member)

Yeah, it looks like those tests succeed despite the issues we've been discussing - the tests are using SystemMonitor.recent() to poll for results instead of SystemMonitor.range_query(), and for some reason that call actually succeeds and returns some GPU info.

To get a check for this error, I think a simple test that verifies we can take a range_query() from the worker and scheduler monitors should be enough.
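
Something like this, perhaps (a hypothetical sketch, not an existing test; the test name and the synchronous cluster setup are assumptions):

from dask_cuda import LocalCUDACluster
from distributed import Client

def test_monitor_range_query():
    # Hypothetical test sketch: range_query() on the scheduler's monitor
    # should not raise once GPU metrics are collected correctly.
    with LocalCUDACluster(n_workers=1) as cluster, Client(cluster):
        data = cluster.scheduler.monitor.range_query(start=0)
        assert "gpu_utilization" in data
        assert "gpu_memory_used" in data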

@charlesbluca (Member) commented Jun 1, 2021

Actually, it looks like we do have a test that calls range_query():

I wonder why this test passed Dask-CUDA's CI - I tried it locally and got the expected failure.

EDIT:

Just realized the dask-cuda checks are only running test_nvml; would it be better to add a specific test for range_query() to the NVML tests, or add the relevant scheduler test(s) to the gpuCI script?

@pentschev (Member)

Just realized the dask-cuda checks are only running test_nvml; would it be better to add a specific test for range_query() to the NVML tests, or add the relevant scheduler test(s) to the gpuCI script?

Yes, I think so. We don't want to duplicate Distributed's CI, but we can add specific tests for CUDA behavior here.

@quasiben (Member, Author) commented Jun 8, 2021

Resolved by dask/distributed#4866.
