
[BUG] test_spill.py::test_cudf_device_spill fails in linux_ppc64le #145

Closed
ksangeek opened this issue Sep 30, 2019 · 2 comments

ksangeek (Contributor) commented Sep 30, 2019

Describe the bug
Executing the dask-cuda 0.9.1 test_spill.py tests on an IBM AC922 (linux_ppc64le) machine fails for test_cudf_device_spill() -

.. output truncated ..
>   assert device_host_file_size_matches(
        dhf, total_bytes, device_chunk_overhead, serialized_chunk_overhead
    )
E   assert False
E    +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)

test_spill.py:42: AssertionError

Please note -

  1. Other tests in test_spill.py succeed!
  2. In all my attempts at test_cudf_device_spill(), param0 (with "spills_to_disk": False) has always failed, but param1 (with "spills_to_disk": True) has sometimes succeeded.

Steps/Code to reproduce bug

  1. Install the dask-cuda 0.9.1 and cudf 0.9.0 conda packages which we have built for linux_ppc64le
  2. Clone the v0.9.1 code of https://github.com/rapidsai/dask-cuda.git
  3. cd dask_cuda/tests
  4. Execute pytest test_spill.py::test_cudf_device_spill

Expected behavior
The testcase should succeed!

Environment details

  • Environment location: Bare-metal (IBM AC922 machine with NVIDIA GPUs)
  • Method of dask-cuda install: conda [Built for linux_ppc64le]
  • cudatoolkit 10.1.243

Additional context

I added some print statements in the test and see that this condition fails -

byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead

These were the values for param0 and param1 respectively when it failed -

byte_sum=2142728928, total_bytes=2073600960
total_bytes=2073600960, device_overhead=416, host_overhead=29120, disk_overhead=8320

byte_sum=2142723212, total_bytes=2073600960
total_bytes=2073600960, device_overhead=416, host_overhead=37440, disk_overhead=0

For the odd case when param1 succeeded, these were the values -

byte_sum=2073608872, total_bytes=2073600960
total_bytes=2073600960, device_overhead=384, host_overhead=29120, disk_overhead=8320
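
For reference, plugging these numbers into the check (a small standalone Python sketch; within_bounds() is just an illustrative stand-in for the real device_host_file_size_matches() helper) shows that both failing runs overshoot the allowed upper bound by roughly 69 MB, while the total permitted overhead is only about 38 kB:

# Standalone sketch of the size check, evaluated with the numbers reported above.
def within_bounds(byte_sum, total_bytes, device_overhead, host_overhead, disk_overhead):
    return (
        byte_sum >= total_bytes
        and byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead
    )

# param0, failing run: byte_sum exceeds the upper bound by ~69 MB
print(within_bounds(2142728928, 2073600960, 416, 29120, 8320))  # False
# param1, failing run: a similar ~69 MB excess, with no disk overhead
print(within_bounds(2142723212, 2073600960, 416, 37440, 0))     # False
# param1, passing run: byte_sum is only ~8 kB above total_bytes
print(within_bounds(2073608872, 2073600960, 384, 29120, 8320))  # True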

Patch of the test which I used to add the prints -

@@ -30,6 +30,9 @@
     host_overhead = len(dhf.host) * serialized_chunk_overhead
     disk_overhead = len(dhf.disk) * serialized_chunk_overhead

+    print("byte_sum={}, total_bytes={}".format(byte_sum, total_bytes))
+    print("total_bytes={}, device_overhead={}, host_overhead={}, disk_overhead={}".format(total_bytes, device_overhead, host_overhead, disk_overhead))
+
     return (
         byte_sum >= total_bytes
         and byte_sum <= total_bytes + device_overhead + host_overhead + disk_overhead

Full output of the pytest run -

$ pytest test_spill.py
============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.1.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/sangeek/sandbox/dask-cuda-test
plugins: asyncio-0.10.0
collected 8 items

test_spill.py ....F...                                                   [100%]

=================================== FAILURES ===================================
_______________________ test_cudf_device_spill[params0] ________________________

params = {'device_memory_limit': 1000000000.0, 'host_pause': None, 'host_spill': 0.0, 'host_target': 0.0, ...}

    @pytest.mark.parametrize(
        "params",
        [
            {
                "device_memory_limit": 1e9,
                "memory_limit": 4e9,
                "host_target": 0.0,
                "host_spill": 0.0,
                "host_pause": None,
                "spills_to_disk": False,
            },
            {
                "device_memory_limit": 1e9,
                "memory_limit": 1e9,
                "host_target": 0.0,
                "host_spill": 0.0,
                "host_pause": None,
                "spills_to_disk": True,
            },
        ],
    )
    def test_cudf_device_spill(params):
        @gen_cluster(
            client=True,
            nthreads=[("127.0.0.1", 1)],
            Worker=Worker,
            timeout=60,
            worker_kwargs={
                "memory_limit": params["memory_limit"],
                "data": DeviceHostFile(
                    device_memory_limit=params["device_memory_limit"],
                    memory_limit=params["memory_limit"],
                ),
            },
            config={
                "distributed.comm.timeouts.connect": "20s",
                "distributed.worker.memory.target": params["host_target"],
                "distributed.worker.memory.spill": params["host_spill"],
                "distributed.worker.memory.pause": params["host_pause"],
            },
        )
        def test_device_spill(client, scheduler, worker):

            # There's a known issue with datetime64:
            # https://github.com/numpy/numpy/issues/4983#issuecomment-441332940
            # The same error above happens when spilling datetime64 to disk
            cdf = (
                dask.datasets.timeseries(dtypes={"x": int, "y": float}, freq="20ms")
                .reset_index(drop=True)
                .map_partitions(cudf.from_pandas)
            )

            sizes = yield client.compute(cdf.map_partitions(lambda df: df.__sizeof__()))
            sizes = sizes.tolist()
            nbytes = sum(sizes)
            part_index_nbytes = (yield client.compute(cdf.partitions[0].index)).__sizeof__()

            cdf2 = cdf.persist()
            yield wait(cdf2)

            del cdf

            yield client.run(worker_assert, nbytes, 32, 2048 + part_index_nbytes)

            host_chunks = yield client.run(lambda: len(get_worker().data.host))
            disk_chunks = yield client.run(lambda: len(get_worker().data.disk))
            for hc, dc in zip(host_chunks.values(), disk_chunks.values()):
                if params["spills_to_disk"]:
                    assert dc > 0
                else:
                    assert hc > 0
                    assert dc == 0

            del cdf2

            yield client.run(delayed_worker_assert, 0, 0, 0)

>       test_device_spill()

test_spill.py:279:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:947: in test_func
    coro, timeout=timeout * 2 if timeout else timeout
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:915: in coro
    result = await future
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/gen.py:742: in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
test_spill.py:264: in test_device_spill
    yield client.run(worker_assert, nbytes, 32, 2048 + part_index_nbytes)
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/gen.py:735: in run
    value = future.result()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/client.py:2248: in _run
    six.reraise(*clean_exception(**resp))
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/six.py:692: in reraise
    raise value.with_traceback(tb)
test_spill.py:49: in worker_assert
    get_worker().data, total_size, device_chunk_overhead, serialized_chunk_overhead
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   assert device_host_file_size_matches(
        dhf, total_bytes, device_chunk_overhead, serialized_chunk_overhead
    )
E   assert False
E    +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)

test_spill.py:42: AssertionError
----------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:40981
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://127.0.0.1:54686 remote=tcp://127.0.0.1:32867>
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:39221
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:39221
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:40981
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    4.00 GB
distributed.worker - INFO -       Local Directory: /home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/dask-worker-space/worker-gw0cuptr
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register tcp://127.0.0.1:39221
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:39221
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:40981
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function 'worker_assert'
distributed.worker - WARNING - Run Failed
Function: worker_assert
args:     (2073600960, 32, 2080)
kwargs:   {}
Traceback (most recent call last):
  File "/home/sangeek/aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/worker.py", line 3380, in run
    result = function(*args, **kwargs)
  File "/home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/test_spill.py", line 49, in worker_assert
    get_worker().data, total_size, device_chunk_overhead, serialized_chunk_overhead
  File "/home/sangeek/sandbox/dask-cuda-test/dask_cuda/tests/test_spill.py", line 42, in assert_device_host_file_size
    assert device_host_file_size_matches(
AssertionError: assert False
 +  where False = device_host_file_size_matches(<dask_cuda.device_host_file.DeviceHostFile object at 0x7fff4a495860>, 2073600960, 32, 2080)
distributed.scheduler - INFO - Remove client Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.scheduler - INFO - Remove client Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.scheduler - INFO - Close client connection: Client-cf578786-e36a-11e9-ac7e-70e2841429a9
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:39221
distributed.scheduler - INFO - Remove worker tcp://127.0.0.1:39221
distributed.core - INFO - Removing comms to tcp://127.0.0.1:39221
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
=================== 1 failed, 7 passed in 216.85s (0:03:36) ====================

Please let me know if some other information is required to get to the root cause of this issue. Thank you!

ksangeek changed the title from "test_spill.py::test_cudf_device_spill fails in linux_ppc64le" to "[BUG] test_spill.py::test_cudf_device_spill fails in linux_ppc64le" on Sep 30, 2019
pentschev (Member) commented

Thanks for the report @ksangeek. It has been pretty difficult to ensure the limits in this particular test are correct, and the failures seem to manifest differently on different systems.

We should eventually try to improve that test, but I'm not really sure how to do that today. I'd also say that the issue is with the test rather than the actual functionality, and the passing CuPy tests help to ensure that (though they don't eliminate the possibility of a bug).

pentschev (Member) commented

This is also a duplicate of #79, so I'm closing it; hopefully, once that is fixed, the test will work properly on all platforms.
