
Individual CUDA object spilling #451

Merged
64 commits merged into branch-0.18 on Jan 6, 2021

Conversation

@madsbk madsbk (Member) commented Dec 1, 2020

This PR introduces a new device host file that uses ProxyObject to implement spilling of individual CUDA objects, as opposed to the current host file, which spills entire keys.

  • Implement spilling of individual objects
  • Handle task level aliasing
  • Handle shared device buffers
  • Write docs

To use, set DASK_JIT_UNSPILL=True
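For reference, a minimal usage sketch (the environment variable is the one named above; the device_memory_limit value below is only an illustrative choice):

import os

os.environ["DASK_JIT_UNSPILL"] = "True"  # set before Dask reads its config

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # Workers now proxy individual CUDA objects and spill them just-in-time
    # instead of spilling whole keys.
    cluster = LocalCUDACluster(device_memory_limit="15GB")
    client = Client(cluster)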

Motivation

Aliases at the task level

Consider the following two tasks:

def task1():  # Create list of dataframes
    df1 = cudf.DataFrame({"a": range(10)})
    df2 = cudf.DataFrame({"a": range(10)})
    return [df1, df2]

def task2(dfs):  # Get the second item
    return dfs[1]    

Running the two tasks on a worker, we get something like:

>>> data["k1"] = task1()
>>> data["k2"] = task2(data["k1"])
>>> data
{
    "k1": [df1, df2],
    "k2": df2,
}

Since the current implementation of spilling works on keys and handles each key separately, it overestimates the device memory used: sizeof(df)*3. Even worse, if it decides to spill k2, no device memory is freed, since k1 still holds a reference to df2!

The new spilling implementation fixes this issue by wrapping identical CUDA objects in a shared ProxyObject; in this case, df2 in both k1 and k2 will refer to the same ProxyObject.
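To make the idea concrete, here is a toy sketch of the shared-proxy bookkeeping (the registry class and names are hypothetical, not the actual ProxifyHostFile internals):

class ProxyRegistry:
    """Toy stand-in: one proxy per distinct device object, keyed by id()."""

    def __init__(self):
        self._proxies = {}

    def proxify(self, obj):
        # Aliased objects (same id) get the same proxy, so the device
        # memory they hold is counted -- and spilled -- only once.
        return self._proxies.setdefault(id(obj), {"wrapped": obj})

registry = ProxyRegistry()
df1, df2 = object(), object()                        # stand-ins for CUDA dataframes
k1 = [registry.proxify(df1), registry.proxify(df2)]  # task1 output
k2 = registry.proxify(df2)                           # task2 output
assert k2 is k1[1]                                   # "k2" shares the proxy held in "k1"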

Sharing device buffers

Consider the following code snippet:

>>> data["df"] = cudf.DataFrame({"a": range(10)})
>>> data["grouped"] = shuffle_group(data["df"], "a", 0, 2, 2, False, 2)
>>> data["v1"] = data["grouped"][0]
>>> data["v2"] = data["grouped"][1]

In this case v1 and v2 are separate objects and are handled separately in both the current and the new spilling implementation. However, shuffle_group() in cudf actually returns a single device memory buffer, so v1 and v2 point to the same underlying memory buffer. Thus the current implementation will again overestimate the memory use and spill one of the dataframes without any effect.
The new implementation takes this into account when estimating memory usage and makes sure that either both dataframes are spilled or neither of them is. A rough sketch of the accounting idea follows.
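In this sketch, the helper names get_device_buffers and buffer_size are placeholders standing in for what dask_cuda's device-memory tracking does, not its actual API:

from collections import defaultdict

def device_memory_report(proxies, get_device_buffers, buffer_size):
    """Count each underlying device buffer once and group proxies that share it."""
    owners = defaultdict(list)   # id(buffer) -> proxies referencing that buffer
    sizes = {}                   # id(buffer) -> size in bytes
    for proxy in proxies:
        for buf in get_device_buffers(proxy):
            owners[id(buf)].append(proxy)
            sizes[id(buf)] = buffer_size(buf)
    total = sum(sizes.values())  # shared buffers are not double counted
    # Proxies sharing a buffer must be spilled together for any memory to be freed.
    shared_groups = [group for group in owners.values() if len(group) > 1]
    return total, shared_groups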

cc. @beckernick, @VibhuJawa
xref: dask/distributed#3756

@madsbk madsbk changed the title Individual CUDA object spilling [WIP] Individual CUDA object spilling Dec 1, 2020
@madsbk madsbk added the 2 - In Progress (Currently a work in progress) label Dec 1, 2020
@beckernick beckernick (Member) commented Dec 1, 2020

From our initial tests, this works exactly as we had all hoped. Memory is well behaved, and the performance impact is significant. Query results also pass the correctness checks.

Setup:

  • 8 GPUs of a DGX-2 (GPUs 0-7)
  • DEVICE_MEMORY_LIMIT="15GB"
  • POOL_SIZE="30GB"
  • TCP communication
  • SF1K data
  • Reading parquet files comprising 2GB in-memory data chunks from the local /raid of the DGX-2

Q02 Standard: 300 seconds
Q02 Object Spilling: 85 seconds

Q03 Standard: 295 seconds
Q03 Object Spilling: 100 seconds

Q04 Standard: 305 seconds
Q04 Object Spilling: 90 seconds

cc @quasiben @kkraus14

EDIT: Will be doing a full sweep of the queries

@quasiben quasiben (Member) commented Dec 1, 2020

Those are significant improvements!

@beckernick (Member)

Some queries are failing during equality comparisons (ProxyObjects don't support equality ops). It looks like there are several other errors, but we will need to distinguish between them and other issues.

For the queries where this succeeds, it's fantastic.

@beckernick beckernick (Member) commented Dec 1, 2020

There are no failures in the standard environment. Will be using this comment to track object spilling failures with @madsbk. These are not meant to be the full tracebacks -- just a log.

Q01, Q07
TypeError: '<' not supported between instances of 'ProxyObject' and 'ProxyObject'
TypeError: '>' not supported between instances of 'ProxyObject' and 'ProxyObject'
Likely coming from the less-than operation in the custom task https://github.com/rapidsai/gpu-bdb/blob/1dc201bdb4542213df265c87b21e8e989291cacc/tpcx_bb/queries/q01/tpcx_bb_query_01.py#L78

Q05
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()
Likely somewhere in build_and_predict_model: https://github.com/rapidsai/gpu-bdb/blob/main/tpcx_bb/queries/q05/tpcx_bb_query_05.py#L75

Q16
distributed.protocol.pickle - INFO - Failed to serialize ([['i_item_id', <dask_cuda.proxy_object.ProxyObject at 0x7f1cc6e1afa0 of cudf.core.series.Series at 0x7f1cc6e13e60>]],). Exception: args[0] from newobj args has the wrong class
Encountered Exception while running query
distributed.protocol.pickle - INFO - Failed to serialize (subgraph_callable, "('read-parquet-5a6657fa560358b3b3932b8954051ef0', 0)", 'w_state_code', (subgraph_callable, (subgraph_callable, (subgraph_callable, "('read-parquet-5a6657fa560358b3b3932b8954051ef0', 0)", ['w_state']), {'w_state': <dask_cuda.proxy_object.ProxyObject at 0x7f1cc6e69730 of cudf.core.series.Series at 0x7f1cc6e697d0>}, None, 'getitem-6cdeec5813df85c43710a7d82d0884b6'), 'w_state')). Exception: args[0] from newobj args has the wrong class
Traceback (most recent call last):
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201201-spill/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 49, in dumps
result = pickle.dumps(x, **dump_kwargs)
_pickle.PicklingError: args[0] from newobj args has the wrong class

Q15, Q17
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201201-spill/lib/python3.7/site-packages/dask_cuda/proxy_object.py", line 256, in __getattr__
return getattr(self._obj_pxy_deserialize(), name)
AttributeError: 'tuple' object has no attribute 'copy'

Q21
This query appears to finish, but then, just before the final ops, it goes back and repeats the second half of the shuffle phase and all associated work, repeatedly (forever).

Q30
f"{obj.__class__.__name__} object is not iterable. "
TypeError: MultiIndex object is not iterable. Consider using .to_arrow(), .to_pandas() or .values_host if you wish to iterate over the values.

Q12
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201201-spill/lib/python3.7/site-packages/cudf/core/column_accessor.py", line 267, in _select_by_label_grouped
result = self._grouped_data[key]
KeyError: False

Q23, Q24
df.index = df[index_cols].copy(deep=False)
AttributeError: 'ProxyObject' object has no attribute 'index'

Q29
File "tpcx_bb_query_29.py", line 139, in main
grouped_df.columns = ["category_id_1", "category_id_2", "cnt"]
AttributeError: 'ProxyObject' object has no attribute 'columns'

Q10, Q18, Q19, Q27
File "cudf/_lib/merge.pyx", line 36, in cudf._lib.merge.merge_sorted
TypeError: Cannot convert ProxyObject to cudf._lib.table.Table

Q28
File "<__array_function__ internals>", line 6, in concatenate
ValueError: object array method not producing an array

Q20, Q25, Q26 (Fixed by dask/dask#6927)
TypeError: Expected meta to specify scalar, got cudf.core.series.Series

@jakirkham (Member)

My guess is ProxyObjects will need to convert themselves back to real objects when performing those comparisons.

@madsbk madsbk added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Dec 2, 2020
@madsbk madsbk force-pushed the object_spilling branch 3 times, most recently from e18c3ef to 159fa02 on December 2, 2020 20:35
@jakirkham (Member)

My guess is ProxyObjects will need to convert themselves back to real objects when performing those comparisons.

Done in PR (#458).
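For context, the pattern is simply to unwrap before comparing; a toy sketch of that idea (not the actual ProxyObject code in #458):

import operator

class MiniProxy:
    """Toy proxy that defers comparisons to the wrapped (deserialized) object."""

    def __init__(self, obj):
        self._obj = obj

    def _deserialize(self):
        # A real proxy would unspill/deserialize here; the toy just returns the object.
        return self._obj

    def __lt__(self, other):
        other = other._deserialize() if isinstance(other, MiniProxy) else other
        return operator.lt(self._deserialize(), other)

assert MiniProxy(1) < MiniProxy(2)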

@codecov-io codecov-io commented Dec 2, 2020

Codecov Report

Merging #451 (81668b1) into branch-0.18 (b170b29) will increase coverage by 0.73%.
The diff coverage is 94.26%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18     #451      +/-   ##
===============================================
+ Coverage        90.40%   91.14%   +0.73%     
===============================================
  Files               15       18       +3     
  Lines             1126     1446     +320     
===============================================
+ Hits              1018     1318     +300     
- Misses             108      128      +20     
| Impacted Files | Coverage Δ |
|---|---|
| dask_cuda/cli/dask_cuda_worker.py | 96.92% <ø> (+0.09%) ⬆️ |
| dask_cuda/device_host_file.py | 90.69% <66.66%> (-8.17%) ⬇️ |
| dask_cuda/cuda_worker.py | 76.66% <75.00%> (-0.35%) ⬇️ |
| dask_cuda/proxify_device_objects.py | 85.71% <85.71%> (ø) |
| dask_cuda/get_device_memory_objects.py | 89.04% <89.04%> (ø) |
| dask_cuda/proxy_object.py | 90.62% <95.91%> (+2.82%) ⬆️ |
| dask_cuda/local_cuda_cluster.py | 81.17% <100.00%> (+0.68%) ⬆️ |
| dask_cuda/proxify_host_file.py | 100.00% <100.00%> (ø) |
| ... and 3 more | |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b170b29...81668b1.

@madsbk madsbk (Member Author) commented Dec 3, 2020

@beckernick, I have updated your error log. Q01, Q05, and Q07 should now work. Please re-add them if they still fail for you.

@beckernick beckernick (Member) commented Dec 3, 2020

👍 I see this also. Just for tracking's sake, I'm going to redo your comment edits as strikethrough.

EDIT: Updated the tracking comment to reflect new successes as well as changed errors.

@madsbk madsbk mentioned this pull request Dec 3, 2020
@madsbk madsbk force-pushed the object_spilling branch 9 times, most recently from 93bf3e1 to 4c16d5e on December 14, 2020 08:26
@madsbk madsbk changed the base branch from branch-0.17 to branch-0.18 December 14, 2020 08:26
madsbk added a commit to madsbk/dask-cuda that referenced this pull request Dec 16, 2020
@madsbk madsbk (Member Author) commented Dec 16, 2020

@beckernick, all queries seem to work now. Can you confirm that they also work for you?
Note: you need dask master from today, which includes dask/dask#6927 or dask/dask#6981, to support Q28.

@madsbk madsbk added the 3 - Ready for Review (Ready for review by team) label Dec 17, 2020
@madsbk madsbk changed the title [WIP] Individual CUDA object spilling Individual CUDA object spilling Dec 17, 2020
@madsbk madsbk (Member Author) commented Dec 21, 2020

Test and evaluation of NVTabular's Criteo/DLRM Preprocessing Benchmark

Running a 10-day dataset on a DGX-2, everything works and achieves a 1.48x speedup. In order to force device-to-host memory spilling, I had to set the device limit to 6.5GB.

|                    | Memory peak | Runtime | Speedup |
|--------------------|-------------|---------|---------|
| Old-style spilling | 20 GB       | 408 sec | 1       |
| JIT unspill        | 17 GB       | 276 sec | 1.48    |

The commands, which depend on Dask master:

# Old style spilling
DASK_JIT_UNSPILL=False python NVTabular/examples/dask-nvtabular-criteo-benchmark.py --data-path /datasets/criteo/crit_orig_pq_10days --out-path output -d "8,9,10,11,12,13,14,15" --cats-on-device --cat-cache-high device --device-limit-frac 0.2

# JIT unspill
DASK_JIT_UNSPILL=True  python NVTabular/examples/dask-nvtabular-criteo-benchmark.py --data-path /datasets/criteo/crit_orig_pq_10days --out-path output -d "8,9,10,11,12,13,14,15" --cats-on-device --cat-cache-high device --device-limit-frac 0.2

Based on the results from the NVTabular and TPCx-BB workflows, I think this PR is ready for reviews.

@pentschev pentschev (Member) left a comment

From a high level, I think this PR looks good; I have a few proposed changes and questions below.

Overall the PR is very complex, and it is difficult to keep track of all the important details when reviewing, so in all honesty I couldn't mentally validate whether all the connections between the different classes and the various data type registrations are indeed correct. I think the best approach here is to have this as an experimental feature (as currently proposed) for some time, to allow people to validate the system's functionality.

Thanks @madsbk for the huge effort you put into this problem!

{
"device_memory_limit": parse_device_memory_limit(
device_memory_limit, device_index=i
),
Member:

The fact that there's no memory_limit here seems to indicate that there's no capability for host<->disk spilling; is that intended? We also seem to be missing local_directory, which will prevent users from storing things anywhere but the current directory; that seems problematic for certain use cases.

Member:

Ok, I now see you added a comment about memory_limit in LocalCUDACluster docs. Can you do the same in

@click.option(
"--enable-jit-unspill/--disable-jit-unspill",
default=None, # If not specified, use Dask config
help="Enable just-in-time unspilling",
?

The local_directory question is still valid though.

Member Author:

Yes, in this PR we will not support any spilling to disk. I have removed memory_limit and local_directory to make this clear. I don't think local_directory is used for anything other than disk spilling?

It will be up to a future PR to implement disk spilling, which shouldn't be too difficult.

Member:

My understanding was that local_directory was used for more than just spilling, so I went on a hunt. To be fair, it's still unclear to me where in the code local_directory is really used. I did find one other case where it's used, though: when you upload files, see for example https://github.com/dask/distributed/blob/607cfd2ce00edd44c99da3273de0763a426dda7d/distributed/tests/test_worker.py#L176-L202 . That isn't to say this is the only other case, but I couldn't confirm or deny whether there are other cases besides spilling and file uploading. Maybe @quasiben or @jakirkham would know.

a = proxy_object.asproxy(org.copy())
b = proxy_object.asproxy(org.copy())
res1 = tensordot_lookup(a, b).flatten()
res2 = tensordot_lookup(org.copy(), org.copy()).flatten()
Member:

I see you're using tensordot_lookup and einsum_lookup below. Could we instead test np.tensordot and np.einsum, as users would call them?

Member Author:

Changed to dask.array.tensordot() and dask.array.einsum()

assert k2._obj_pxy_serialized()
assert not dhf["k4"]._obj_pxy_serialized()

# Deleting k2 does change anything since k3 still holds a
Member:

Did you mean to write "Deleting k2 does NOT change anything" ?

Member Author:

I have changed the test to use dhf.proxies_tally for the check.

Other review threads (outdated, resolved):
  • dask_cuda/tests/test_proxify_host_file.py
  • dask_cuda/proxy_object.py
  • dask_cuda/proxify_device_objects.py (two threads)
  • dask_cuda/proxify_host_file.py (two threads)
  • dask_cuda/get_device_memory_objects.py
@madsbk madsbk (Member Author) commented Jan 5, 2021

@pentschev thanks for the review, much appreciated. I have addressed all of your suggestions, I think :)

@pentschev (Member)

Thanks @madsbk for addressing those. It seems that the failing tests are legit; all three seem to fail with similar errors:

dask_cuda/tests/test_proxy.py::test_proxy_object_parquet *** stack smashing detected ***: <unknown> terminated
Fatal Python error: Aborted

@madsbk madsbk (Member Author) commented Jan 5, 2021

Thanks @madsbk for addressing those. It seems that the failing tests are legit; all three seem to fail with similar errors:

I think it is a cuDF bug: rapidsai/cudf#7074
I changed test_proxy_object_parquet() to use the pyarrow engine.

@pentschev pentschev (Member) left a comment

Thanks for checking the failing tests, @madsbk. To the extent that I was capable of verifying this PR, everything looks good to me now. I'm approving but will wait until tomorrow to merge, to give others a chance to review it as well.
