
Pin fsspec==2022.5.0 #113

Merged 2 commits on Aug 15, 2022

Conversation

karlhigley (Contributor) commented Aug 9, 2022

fsspec 2022.7.0 breaks `Dataset.to_parquet`, so this pins to an earlier version to avoid that.
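The fix is to constrain the dependency until the upstream regression is resolved. A minimal sketch of the pin (the commands below are illustrative; the actual change lives in this repo's dependency metadata):

```shell
# Constrain fsspec to the last known-good release in the environment:
pip install "fsspec==2022.5.0"

# Equivalently, as a requirements/setup constraint line:
#   fsspec==2022.5.0
```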

@karlhigley karlhigley added ci chore Maintenance for the repository labels Aug 9, 2022
@karlhigley karlhigley added this to the Merlin 22.08 milestone Aug 9, 2022
@karlhigley karlhigley requested a review from nv-alaiacano August 9, 2022 14:08
@karlhigley karlhigley self-assigned this Aug 9, 2022
@nvidia-merlin-bot

CI Results
GitHub pull request #113 of commit ff4ffa5adf534b68ea66dbe772df60074794151d, no merge conflicts.
Running as SYSTEM
Setting status of ff4ffa5adf534b68ea66dbe772df60074794151d to PENDING with url https://10.20.13.93:8080/job/merlin_core/91/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse ff4ffa5adf534b68ea66dbe772df60074794151d^{commit} # timeout=10
Checking out Revision ff4ffa5adf534b68ea66dbe772df60074794151d (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f ff4ffa5adf534b68ea66dbe772df60074794151d # timeout=10
Commit message: "Pin `fsspec==2022.5.0`"
 > git rev-list --no-walk 78799c0a50e1a66e69731df00af7d6b70a2bf18f # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins11976271538902687337.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
  ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
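The root cause in each of these failures is the same: `to_parquet` left a zero-byte `part_0.parquet` on disk, which pyarrow rejects when building the dataset. A valid Parquet file starts and ends with the 4-byte magic `PAR1`, so such a broken output can be detected cheaply before handing the directory to a reader. This is a stdlib-only sketch; the helper name is ours, not part of merlin or pyarrow:

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Cheap sanity check: non-trivial size and PAR1 magic at both ends."""
    size = os.path.getsize(path)
    if size < 8:  # header magic + footer magic alone take 8 bytes
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

# A zero-byte file, like the part_0.parquet in the traceback above, fails the check:
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
    empty = f.name
print(looks_like_parquet(empty))  # False
os.remove(empty)
```

Running a check like this over the output directory would have surfaced the empty files at write time rather than as an `ArrowInvalid` deep inside the read path.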
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
  ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
  ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
  ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(
(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44217 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 33567 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42615 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40433 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35109 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41933 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 82 warnings in 52.02s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins4920047722190126785.sh

@karlhigley karlhigley linked an issue Aug 9, 2022 that may be closed by this pull request
@nvidia-merlin-bot

CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/92/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk ff4ffa5adf534b68ea66dbe772df60074794151d # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins257079640869010468.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-13/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35785 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46777 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 45355 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35769 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42373 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 43187 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 82 warnings in 52.55s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins1173334063800393754.sh

@github-actions

github-actions bot commented Aug 9, 2022

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-113

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/93/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk 9408224520d731c51b7952a43def675b76e81756 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins4997533829381272966.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: dask>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 1)) (2022.1.1)
Requirement already satisfied: distributed>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 2)) (2022.3.0)
Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 3)) (1.3.5)
Requirement already satisfied: numba>=0.54 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 4)) (0.55.2)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 5)) (6.0.0)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 6)) (3.19.4)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 7)) (4.64.0)
Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 8)) (1.9.0)
Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 9)) (1.2.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 10)) (21.3)
Collecting fsspec==2022.5.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (1.2.0)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (0.11.2)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (8.0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (3.0.3)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.4)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (5.9.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.7.0)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (6.1)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.2.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2022.1)
Requirement already satisfied: numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10" in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.21.5)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (0.38.1)
Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->-r requirements.txt (line 4)) (62.4.0)
Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.1.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.52.0)
Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (0.4.2)
Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (1.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->-r requirements.txt (line 10)) (3.0.9)
Requirement already satisfied: locket in /usr/local/lib/python3.8/dist-packages (from partd>=0.3.10->dask>=2021.11.2->-r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2021.11.2->-r requirements.txt (line 2)) (2.0.1)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/dist-packages (from zict>=0.1.3->distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.1)
Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.15.0)
Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.1.0)
Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.2)
Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.1)
Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.0.0)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
ERROR: dask-cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: dask-cuda 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: dask-cudf 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement aiobotocore~=2.1.0, but you'll have aiobotocore 2.3.4 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement fsspec==2022.02.0, but you'll have fsspec 2022.5.0 which is incompatible.
Installing collected packages: fsspec
Successfully installed fsspec-2022.5.0
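For context on the "has requirement ... which is incompatible" errors above, here is a minimal sketch of how pip evaluates the new pin against another package's version specifier, using the `packaging` library (already in requirements.txt). The versions are taken directly from the log; this only illustrates the check, not pip's full resolver.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# s3fs 2022.2.0 declares fsspec==2022.02.0, but this PR pins fsspec==2022.5.0
s3fs_requires = SpecifierSet("==2022.02.0")
pinned = Version("2022.5.0")

print(pinned in s3fs_requires)               # False -> pip reports the conflict
print(Version("2022.2.0") in s3fs_requires)  # True: 2022.02.0 normalizes to 2022.2.0
```

Note that PEP 440 normalization is why `2022.02.0` and `2022.2.0` compare equal, while the pinned `2022.5.0` falls outside the specifier and triggers the ERROR lines in the install log.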
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
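Each of these failures hits the same symptom: the `to_parquet` call leaves a zero-byte `part_0.parquet` behind, which pyarrow rejects before it even looks for a footer. A stdlib-only sketch of that validity check (the `PAR1` magic bytes come from the Parquet format spec; `looks_like_parquet` is an illustrative helper of ours, not a pyarrow API):

```python
import os
import tempfile

# Parquet files begin and end with the 4-byte magic "PAR1"; pyarrow's
# "file size is 0 bytes" error fires even earlier, since an empty file
# cannot contain a footer at all.
def looks_like_parquet(path):
    size = os.path.getsize(path)
    if size < 12:  # leading magic + 4-byte footer length + trailing magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, "part_0.parquet")
    open(part, "wb").close()           # zero-byte file, as in the traceback
    print(looks_like_parquet(part))    # False
```

This mirrors why the read side fails immediately with `ArrowInvalid` rather than a partial-data error: nothing was ever written to the part file.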
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-20/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36505 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40771 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36405 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44155 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44187 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35813 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 82 warnings in 51.95s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins2970326891801939736.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/94/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk 9408224520d731c51b7952a43def675b76e81756 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins17517435640007878806.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: dask>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 1)) (2022.1.1)
Requirement already satisfied: distributed>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 2)) (2022.3.0)
Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 3)) (1.3.5)
Requirement already satisfied: numba>=0.54 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 4)) (0.55.2)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 5)) (6.0.0)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 6)) (3.19.4)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 7)) (4.64.0)
Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 8)) (1.9.0)
Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 9)) (1.2.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 10)) (21.3)
Collecting fsspec==2022.5.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (1.2.0)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (0.11.2)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (8.0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (3.0.3)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.4)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (5.9.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.7.0)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (6.1)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.2.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2022.1)
Requirement already satisfied: numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10" in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.21.5)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (0.38.1)
Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->-r requirements.txt (line 4)) (62.4.0)
Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.1.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.52.0)
Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (0.4.2)
Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (1.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->-r requirements.txt (line 10)) (3.0.9)
Requirement already satisfied: locket in /usr/local/lib/python3.8/dist-packages (from partd>=0.3.10->dask>=2021.11.2->-r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2021.11.2->-r requirements.txt (line 2)) (2.0.1)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/dist-packages (from zict>=0.1.3->distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.1)
Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.15.0)
Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.1.0)
Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.2)
Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.1)
Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.0.0)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
ERROR: dask-cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: dask-cuda 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: dask-cudf 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement aiobotocore~=2.1.0, but you'll have aiobotocore 2.3.4 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement fsspec==2022.02.0, but you'll have fsspec 2022.5.0 which is incompatible.
Installing collected packages: fsspec
Successfully installed fsspec-2022.5.0
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-22/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34623 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41811 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41807 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39721 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41603 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 38093 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 82 warnings in 52.41s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins154260125794305389.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/96/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk d000560b0578ef8dcdcc1dc9c6463d5a91164d0d # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins8786413089919146306.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: dask>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 1)) (2022.1.1)
Requirement already satisfied: distributed>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 2)) (2022.3.0)
Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 3)) (1.3.5)
Requirement already satisfied: numba>=0.54 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 4)) (0.56.0)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 5)) (6.0.0)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 6)) (3.19.4)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 7)) (4.64.0)
Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 8)) (1.9.0)
Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 9)) (1.2.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 10)) (21.3)
Collecting fsspec==2022.5.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (1.3.0)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (0.12.0)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (8.0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (3.0.3)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.4)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (5.9.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.7.0)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (6.2)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.2.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2022.2)
Requirement already satisfied: numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10" in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.21.5)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (0.39.0)
Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->-r requirements.txt (line 4)) (62.4.0)
Requirement already satisfied: importlib-metadata; python_version < "3.9" in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (4.12.0)
Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.2.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.52.0)
Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (0.4.2)
Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (1.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->-r requirements.txt (line 10)) (3.0.9)
Requirement already satisfied: locket in /usr/local/lib/python3.8/dist-packages (from partd>=0.3.10->dask>=2021.11.2->-r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2021.11.2->-r requirements.txt (line 2)) (2.0.1)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/dist-packages (from zict>=0.1.3->distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.1)
Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.15.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata; python_version < "3.9"->numba>=0.54->-r requirements.txt (line 4)) (3.8.1)
Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.1.0)
Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.2)
Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.1)
Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.0.0)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
ERROR: dask-cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: dask-cuda 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: dask-cudf 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement aiobotocore~=2.1.0, but you'll have aiobotocore 2.3.4 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement fsspec==2022.02.0, but you'll have fsspec 2022.5.0 which is incompatible.
Installing collected packages: fsspec
Successfully installed fsspec-2022.5.0
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-0/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_serial_context[True]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46259 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36213 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46457 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46141 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36833 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 33323 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
======= 4 failed, 339 passed, 1 skipped, 83 warnings in 80.27s (0:01:20) =======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins18024306074523492797.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/97/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk 9408224520d731c51b7952a43def675b76e81756 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins13934548096379464390.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: dask>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 1)) (2022.1.1)
Requirement already satisfied: distributed>=2021.11.2 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 2)) (2022.3.0)
Requirement already satisfied: pandas<1.4.0dev0,>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 3)) (1.3.5)
Requirement already satisfied: numba>=0.54 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 4)) (0.56.0)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 5)) (6.0.0)
Requirement already satisfied: protobuf>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 6)) (3.19.4)
Requirement already satisfied: tqdm>=4.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 7)) (4.64.0)
Requirement already satisfied: tensorflow-metadata>=1.2.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 8)) (1.9.0)
Requirement already satisfied: betterproto<2.0.0 in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 9)) (1.2.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from -r requirements.txt (line 10)) (21.3)
Collecting fsspec==2022.5.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (1.3.0)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask>=2021.11.2->-r requirements.txt (line 1)) (0.12.0)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (8.0.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (3.0.3)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.4)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (5.9.1)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (1.7.0)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (6.2)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from distributed>=2021.11.2->-r requirements.txt (line 2)) (2.2.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (2022.2)
Requirement already satisfied: numpy>=1.17.3; platform_machine != "aarch64" and platform_machine != "arm64" and python_version < "3.10" in /usr/local/lib/python3.8/dist-packages (from pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.21.5)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (0.39.0)
Requirement already satisfied: setuptools in /var/jenkins_home/.local/lib/python3.8/site-packages (from numba>=0.54->-r requirements.txt (line 4)) (62.4.0)
Requirement already satisfied: importlib-metadata; python_version < "3.9" in /usr/local/lib/python3.8/dist-packages (from numba>=0.54->-r requirements.txt (line 4)) (4.12.0)
Requirement already satisfied: absl-py<2.0.0,>=0.9 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.2.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /usr/local/lib/python3.8/dist-packages (from tensorflow-metadata>=1.2.0->-r requirements.txt (line 8)) (1.52.0)
Requirement already satisfied: grpclib in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (0.4.2)
Requirement already satisfied: stringcase in /usr/local/lib/python3.8/dist-packages (from betterproto<2.0.0->-r requirements.txt (line 9)) (1.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->-r requirements.txt (line 10)) (3.0.9)
Requirement already satisfied: locket in /usr/local/lib/python3.8/dist-packages (from partd>=0.3.10->dask>=2021.11.2->-r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2->distributed>=2021.11.2->-r requirements.txt (line 2)) (2.0.1)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/dist-packages (from zict>=0.1.3->distributed>=2021.11.2->-r requirements.txt (line 2)) (1.0.1)
Requirement already satisfied: six>=1.5 in /var/jenkins_home/.local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas<1.4.0dev0,>=1.2.0->-r requirements.txt (line 3)) (1.15.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata; python_version < "3.9"->numba>=0.54->-r requirements.txt (line 4)) (3.8.1)
Requirement already satisfied: h2<5,>=3.1.0 in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.1.0)
Requirement already satisfied: multidict in /usr/local/lib/python3.8/dist-packages (from grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.2)
Requirement already satisfied: hyperframe<7,>=6.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (6.0.1)
Requirement already satisfied: hpack<5,>=4.0 in /usr/local/lib/python3.8/dist-packages (from h2<5,>=3.1.0->grpclib->betterproto<2.0.0->-r requirements.txt (line 9)) (4.0.0)
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
ERROR: dask-cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: cudf 22.4.0 requires cupy-cuda117, which is not installed.
ERROR: dask-cuda 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: dask-cudf 22.4.0 has requirement dask==2022.03.0, but you'll have dask 2022.1.1 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement aiobotocore~=2.1.0, but you'll have aiobotocore 2.3.4 which is incompatible.
ERROR: s3fs 2022.2.0 has requirement fsspec==2022.02.0, but you'll have fsspec 2022.5.0 which is incompatible.
Installing collected packages: fsspec
Successfully installed fsspec-2022.5.0
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
    return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
    for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
    raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
    fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
    result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
    result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
    cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
    df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
    ) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
    dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
    return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
    for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
    raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
    fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
    result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
    result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
    cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
    df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
    ) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
    dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
    return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
    for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
    raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
    fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
    result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
    result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
    cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
    df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
    ) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
    dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
>       ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
    return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
    for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
    raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
    fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
    return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
    result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
    result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
    cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
    df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
    ) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
    result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
    dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-1/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_serial_context[True]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40679 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 43271 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 32929 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 38741 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37921 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 43445 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 83 warnings in 54.15s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins6311434849162267378.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/98/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk 9408224520d731c51b7952a43def675b76e81756 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins13216399660058998797.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ..................................FFFF......... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=================================== FAILURES ===================================
_________________ test_dask_dataset_from_dataframe[True-cudf] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra4')
origin = 'cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
> ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra4/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra4/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________ test_dask_dataset_from_dataframe[True-dask_cudf] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra5')
origin = 'dask_cudf', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
> ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra5/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra5/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-pd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra6')
origin = 'pd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
> ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra6/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra6/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_dask_dataset_from_dataframe[True-dd] ___________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra7')
origin = 'dd', cpu = True

@pytest.mark.parametrize("origin", ["cudf", "dask_cudf", "pd", "dd"])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_dataset_from_dataframe(tmpdir, origin, cpu):

    # Generate a DataFrame-based input
    if origin in ("pd", "dd"):
        df = pd.DataFrame({"a": range(100)})
        if origin == "dd":
            df = dask.dataframe.from_pandas(df, npartitions=4)
    elif origin in ("cudf", "dask_cudf"):
        df = cudf.DataFrame({"a": range(100)})
        if origin == "dask_cudf":
            df = dask_cudf.from_cudf(df, npartitions=4)

    # Convert to an NVTabular Dataset and back to a ddf
    dataset = merlin.io.Dataset(df, cpu=cpu)
    result = dataset.to_ddf()

    # Check resulting data
    assert_eq(df, result)

    # Check that the cpu kwarg is working correctly
    if cpu:
        assert isinstance(result.compute(), pd.DataFrame)

        # Should still work if we move to the GPU
        # (test behavior after repetitive conversion)
        dataset.to_gpu()
        dataset.to_cpu()
        dataset.to_cpu()
        dataset.to_gpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), cudf.DataFrame)
        dataset.to_cpu()
    else:
        assert isinstance(result.compute(), cudf.DataFrame)

        # Should still work if we move to the CPU
        # (test behavior after repetitive conversion)
        dataset.to_cpu()
        dataset.to_gpu()
        dataset.to_gpu()
        dataset.to_cpu()
        result = dataset.to_ddf()
        assert isinstance(result.compute(), pd.DataFrame)
        dataset.to_gpu()

    # Write to disk and read back
    path = str(tmpdir)
    dataset.to_parquet(path, out_files_per_proc=1, shuffle=None)
> ddf_check = dask_cudf.read_parquet(path).compute()

tests/unit/io/test_io.py:290:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/local.py:553: in get_sync
return get_async(
/usr/local/lib/python3.8/dist-packages/dask/local.py:496: in get_async
for key, res_info, failed in queue_get(queue).result():
/usr/lib/python3.8/concurrent/futures/_base.py:437: in result
return self.__get_result()
/usr/lib/python3.8/concurrent/futures/_base.py:389: in __get_result
raise self._exception
/usr/local/lib/python3.8/dist-packages/dask/local.py:538: in submit
fut.set_result(fn(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in batch_execute_tasks
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:234: in <listcomp>
return [execute_task(a) for a in it]
/usr/local/lib/python3.8/dist-packages/dask/local.py:225: in execute_task
result = pack_exception(e, dumps)
/usr/local/lib/python3.8/dist-packages/dask/local.py:220: in execute_task
result = _execute_task(task, data)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:216: in read_partition
cls._read_paths(
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/parquet.py:92: in _read_paths
df = cudf.read_parquet(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:379: in read_parquet
) = _process_dataset(
/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner
result = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/cudf/io/parquet.py:205: in _process_dataset
dataset = ds.dataset(
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra7/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-2/test_dask_dataset_from_datafra7/part_0.parquet': Parquet file size is 0 bytes. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_serial_context[True]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39773 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34871 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44475 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34411 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37897 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34135 instead
warnings.warn(
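The repeated "Port 8787 is already in use" warnings above are benign: several test clusters race for the default dashboard port, and `distributed` falls back to a free one automatically. To silence them, one option is to ask the OS for a free port up front and pass it to the cluster (e.g. `LocalCluster(dashboard_address=f":{port}")` in dask.distributed — illustrative, not taken from this repo's test code). A minimal stdlib sketch:

```python
import socket

def free_port() -> int:
    """Return a TCP port that is free at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the kernel choose a free port
        return s.getsockname()[1]
```

Note the usual caveat: the port is only guaranteed free at bind time, so a small race window remains between picking it and handing it to the cluster.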

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dask_cudf]
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-pd] - ...
FAILED tests/unit/io/test_io.py::test_dask_dataset_from_dataframe[True-dd] - ...
============ 4 failed, 339 passed, 1 skipped, 83 warnings in 52.25s ============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins16730871028489489208.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot

Click to view CI Results
GitHub pull request #113 of commit 9408224520d731c51b7952a43def675b76e81756, no merge conflicts.
Running as SYSTEM
Setting status of 9408224520d731c51b7952a43def675b76e81756 to PENDING with url https://10.20.13.93:8080/job/merlin_core/100/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/113/*:refs/remotes/origin/pr/113/* # timeout=10
 > git rev-parse 9408224520d731c51b7952a43def675b76e81756^{commit} # timeout=10
Checking out Revision 9408224520d731c51b7952a43def675b76e81756 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9408224520d731c51b7952a43def675b76e81756 # timeout=10
Commit message: "Merge branch 'main' into fix/fsspec-version"
 > git rev-list --no-walk d000560b0578ef8dcdcc1dc9c6463d5a91164d0d # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins12002685091393774014.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ............................................... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_serial_context[True]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42429 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36285 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 32861 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 45823 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37803 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40063 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 343 passed, 1 skipped, 83 warnings in 54.12s =================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins3214915028103014584.sh

@karlhigley karlhigley merged commit f6bb01c into NVIDIA-Merlin:main Aug 15, 2022

Successfully merging this pull request may close these issues.

fsspec v22.7.1 breaks Dataset.to_parquet
3 participants
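For reference, the change merged here is a dependency pin; in requirements-file form it reads (illustrative of the pin itself, not a verbatim copy of the repo's requirements file):

```
fsspec==2022.5.0
```

Pinning with `==` trades automatic upstream fixes for stability; once the regression in the linked issue is resolved upstream, the pin could be relaxed to an exclusion such as `fsspec!=2022.7.1`.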