
[BUG] test_spill.py::test_cudf_device_spill fails in linux_ppc64le #145

Closed
ksangeek opened this issue Sep 30, 2019 · 2 comments

ksangeek (Contributor) commented Sep 30, 2019

Describe the bug
Executing the dask-cuda 0.9.1 test_spill.py tests on an IBM AC922 (linux_ppc64le) machine fails for test_cudf_device_spill() -

.. output truncated ..
>   assert device_host_file_size_matches(
        dhf, total_bytes, device_chunk_overhead, serialized_chunk_overhead
    )
E   assert False
E    +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)

test_spill.py:42: AssertionError

Please note -

  1. Other tests in test_spill.py succeed!
  2. In all my attempts at test_cudf_device_spill(), param0 (with "spills_to_disk": False) has always failed, but param1 (with "spills_to_disk": True) has sometimes succeeded.

Steps/Code to reproduce bug

  1. Install the dask-cuda 0.9.1 and cudf 0.9.0 conda packages which we have built for linux_ppc64le
  2. Clone the v0.9.1 code of https://github.com/rapidsai/dask-cuda.git
  3. cd dask_cuda/tests
  4. Execute pytest test_spill.py::test_cudf_device_spill

Expected behavior
The testcase should succeed!

Environment details

  • Environment location: Bare-metal (IBM AC922 machine with NVIDIA GPUs)
  • Method of dask-cuda install: conda [Built for linux_ppc64le]
  • cudatoolkit 10.1.243

Additional context

I added some print statements in the test and see that this condition fails -

byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead

These were the values for param0 and param1 respectively when it failed -

byte_sum=2142728928, total_bytes=2073600960
total_bytes=2073600960, device_overhead=416, host_overhead=29120, disk_overhead=8320

byte_sum=2142723212, total_bytes=2073600960
total_bytes=2073600960, device_overhead=416, host_overhead=37440, disk_overhead=0

For the odd case when param1 succeeded, these were the values -

byte_sum=2073608872, total_bytes=2073600960
total_bytes=2073600960, device_overhead=384, host_overhead=29120, disk_overhead=8320
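
For reference, plugging these numbers into the check (a small standalone Python sketch; within_bounds() is just an illustrative stand-in for the real device_host_file_size_matches() helper) shows that both failing runs overshoot the allowed upper bound by roughly 69 MB, while the total permitted overhead is only about 38 kB:

# Standalone sketch of the size check, evaluated with the numbers reported above.
def within_bounds(byte_sum, total_bytes, device_overhead, host_overhead, disk_overhead):
    return (
        byte_sum >= total_bytes
        and byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead
    )

# param0, failing run: byte_sum exceeds the upper bound by ~69 MB
print(within_bounds(2142728928, 2073600960, 416, 29120, 8320))  # False
# param1, failing run: a similar ~69 MB excess, with no disk overhead
print(within_bounds(2142723212, 2073600960, 416, 37440, 0))     # False
# param1, passing run: byte_sum is only ~8 kB above total_bytes
print(within_bounds(2073608872, 2073600960, 384, 29120, 8320))  # True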

Patch of the test which I used to add the prints -

@@ -30,6 +30,9 @@
     host_overhead = len(dhf.host) * serialized_chunk_overhead
     disk_overhead = len(dhf.disk) * serialized_chunk_overhead

+    print("byte_sum={}, total_bytes={}".format(byte_sum, total_bytes))
+    print("total_bytes={}, device_overhead={}, host_overhead={}, disk_overhead={}".format(total_bytes, device_overhead, host_overhead, disk_overhead))
+
     return (
         byte_sum >= total_bytes
         and byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead

Full output of the pytest run -

$ pytest test_spill.py
============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.1.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/sangeek/sandbox/dask-cuda-test
plugins: asyncio-0.10.0
collected 8 items

test_spill.py ....F...                                                   [100%]

=================================== FAILURES ===================================
_______________________ test_cudf_device_spill[params0] ________________________

params = {'device_memory_limit': 1000000000.0, 'host_pause': None, 'host_spill': 0.0, 'host_target': 0.0, ...}

    @pytest.mark.parametrize(
        "params",
        [
            {
                "device_memory_limit": 1e9,
                "memory_limit": 4e9,
                "host_target": 0.0,
                "host_spill": 0.0,
                "host_pause": None,
                "spills_to_disk": False,
            },
            {
                "device_memory_limit": 1e9,
                "memory_limit": 1e9,
                "host_target": 0.0,
                "host_spill": 0.0,
                "host_pause": None,
                "spills_to_disk": True,
            },
        ],
    )
    def test_cudf_device_spill(params):
        @gen_cluster(
            client=True,
            nthreads=[("127.0.0.1", 1)],
            Worker=Worker,
            timeout=60,
            worker_kwargs={
                "memory_limit": params["memory_limit"],
                "data": DeviceHostFile(
                    device_memory_limit=params["device_memory_limit"],
                    memory_limit=params["memory_limit"],
                ),
            },
            config={
                "distributed.comm.timeouts.connect": "20s",
                "distributed.worker.memory.target": params["host_target"],
                "distributed.worker.memory.spill": params["host_spill"],
                "distributed.worker.memory.pause": params["host_pause"],
            },
        )
        def test_device_spill(client, scheduler, worker):

            # There's a known issue with datetime64:
            # https://github.com/numpy/numpy/issues/4983#issuecomment-441332940
            # The same error above happens when spilling datetime64 to disk
            cdf = (
                dask.datasets.timeseries(dtypes={"x": int, "y": float}, freq="20ms")
                .reset_index(drop=True)
                .map_partitions(cudf.from_pandas)
            )

            sizes = yield client.compute(cdf.map_partitions(lambda df: df.__sizeof__()))
            sizes = sizes.tolist()
            nbytes = sum(sizes)
            part_index_nbytes = (yield client.compute(cdf.partitions[0].index)).__sizeof__()

            cdf2 = cdf.persist()
            yield wait(cdf2)

            del cdf

            yield client.run(worker_assert, nbytes, 32, 2048 + part_index_nbytes)

            host_chunks = yield client.run(lambda: len(get_worker().data.host))
            disk_chunks = yield client.run(lambda: len(get_worker().data.disk))
            for hc, dc in zip(host_chunks.values(), disk_chunks.values()):
                if params["spills_to_disk"]:
                    assert dc > 0
                else:
                    assert hc > 0
                    assert dc == 0

            del cdf2

            yield client.run(delayed_worker_assert, 0, 0, 0)

>       test_device_spill()

test_spill.py:279:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:947: in test_func
    coro, timeout=timeout * 2 if timeout else timeout
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:915: in coro
    result = await future
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/gen.py:742: in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
test_spill.py:264: in test_device_spill
    yield client.run(worker_assert, nbytes, 32, 2048 + part_index_nbytes)
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/gen.py:735: in run
    value = future.result()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/client.py:2248: in _run
    six.reraise(*clean_exception(**resp))
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/six.py:692: in reraise
    raise value.with_traceback(tb)
test_spill.py:49: in worker_assert
    get_worker().data, total_size, device_chunk_overhead, serialized_chunk_overhead
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   assert device_host_file_size_matches(
        dhf, total_bytes, device_chunk_overhead, serialized_chunk_overhead
    )
E   assert False
E    +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)

test_spill.py:42: AssertionError
----------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:40981
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://127.0.0.1:54686 remote=tcp://127.0.0.1:32867>
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:39221
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:39221
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40981
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    4.00 GB
distributed.worker - INFO -       Local Directory: /home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/dask-worker-space/worker-gw0cuptr
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register tcp://127.0.0.1:39221
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:39221
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40981
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function 'worker_assert'
distributed.worker - WARNING - Run Failed
Function: worker_assert
args:     (2073600960, 32, 2080)
kwargs:   {}
Traceback (most recent call last):
  File "/home/sangeek/aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/worker.py", line 3380, in run
    result = function(*args, **kwargs)
  File "/home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/test_spill.py", line 49, in worker_assert
    get_worker().data, total_size, device_chunk_overhead, serialized_chunk_overhead
  File "/home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/test_spill.py", line 42, in assert_device_host_file_size
    assert device_host_file_size_matches(
AssertionError: assert False
 +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)
distributed.scheduler - INFO - Remove client Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.scheduler - INFO - Remove client Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.scheduler - INFO - Close client connection: Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:39221
distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:39221
distributed.core - INFO - Removing comms to tcp://127.0.0.1:39221
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
=================== 1 failed, 7 passed in 216.85s (0:03:36) ====================

Please let me know if some other information is required to get to the root cause of this issue. Thank you!

ksangeek changed the title from "test_spill.py::test_cudf_device_spill fails in linux_ppc64le" to "[BUG] test_spill.py::test_cudf_device_spill fails in linux_ppc64le" on Sep 30, 2019
pentschev (Member) commented

Thanks for the report @ksangeek. It has been pretty difficult to ensure the limits in this particular test are correct, and the failures seem to manifest differently on different systems.

We should eventually try to improve that test, but I'm not really sure how to do that today. I'd also say that the issue is with the test rather than the actual functionality, and the passing CuPy tests help to ensure that (though they don't eliminate the possibility of a bug).

pentschev (Member) commented

This is also a duplicate of #79, so I'm closing it; hopefully, once that is fixed, the test will work properly on all platforms.
