Individual CUDA object spilling #451
Conversation
From our initial tests, this works exactly as we had all hoped. Memory is well behaved, and the performance impact is significant. Query results also pass the correctness checks. Setup:
Q02 Standard: 300 seconds
Q03 Standard: 295 seconds
Q04 Standard: 305 seconds

EDIT: Will be doing a full sweep of the queries.
Those are significant improvements!
Some queries are failing during equality comparisons (ProxyObjects don't support equality ops). Looks like there are several other errors, but we will need to distinguish between them and other issues. For the queries where this succeeds, it's fantastic.
There are no failures in the standard environment. Will be using this comment to track object spilling failures with @madsbk. These are not meant to be the full tracebacks -- just a log.
My guess is
Done in PR (#458)
Codecov Report
@@ Coverage Diff @@
## branch-0.18 #451 +/- ##
===============================================
+ Coverage 90.40% 91.14% +0.73%
===============================================
Files 15 18 +3
Lines 1126 1446 +320
===============================================
+ Hits 1018 1318 +300
- Misses 108 128 +20
Continue to review full report at Codecov.
@beckernick, I have updated your error log. Q01, q05, q07 should now work. Please re-add them if they still fail for you.
👍 I see this also. Just for tracking's sake, I'm going to re-do your comment edits. EDIT: Updated the tracking comment to reflect new successes as well as changed errors.
@beckernick, all queries seem to work now. Can you confirm that they also work for you?
Test and evaluation of NVTabular's Criteo/DLRM Preprocessing Benchmark: running a 10-day data set on a DGX-2, everything works and achieves a 1.48x speedup. In order to force the use of device-to-host memory spilling, I had to set the device limit to 6.5GB.
The commands, which depend on Dask
Based on the results from the NVTabular and TPCx-BB workflows, I think this PR is ready for reviews.
From a high level, I think this PR looks good; I have a few proposed changes and questions below.
Overall the PR is very complex, and it is difficult to keep track of all the important details when reviewing, so in all honesty I couldn't mentally validate whether all the connections between the different classes and the various data type registrations are indeed correct. I think the best approach here is indeed to have this as an experimental feature (as currently proposed) for some time, to allow people to validate the system's functionality.
Thanks @madsbk for the huge effort you put into this problem!
{
    "device_memory_limit": parse_device_memory_limit(
        device_memory_limit, device_index=i
    ),
The fact that there's no memory_limit here seems to point to there being no capability for host<->disk spilling, is that intended? We also seem to be missing local_directory, which will prevent users from storing things anywhere but the current directory; that seems problematic for certain use cases.
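For reference, a sketch of the kind of options being discussed, as commonly passed to a dask-cuda cluster (parameter names and values here illustrate typical usage and are not taken from this PR):

```python
from dask_cuda import LocalCUDACluster

# device_memory_limit triggers device->host spilling, memory_limit triggers
# host->disk spilling, and local_directory is where spilled data (and
# uploaded files) are written; all values are illustrative.
cluster = LocalCUDACluster(
    device_memory_limit="10GB",
    memory_limit="32GB",
    local_directory="/tmp/dask-worker-space",
)
```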
Ok, I now see you added a comment about memory_limit in the LocalCUDACluster docs. Can you do the same in dask-cuda/dask_cuda/cli/dask_cuda_worker.py, lines 196 to 199 in b170b29:
@click.option(
    "--enable-jit-unspill/--disable-jit-unspill",
    default=None,  # If not specified, use Dask config
    help="Enable just-in-time unspilling",
The local_directory question is still valid, though.
Yes, in this PR we will not support any spilling to disk. I have removed memory_limit and local_directory to make this clear. I don't think local_directory is used for anything other than disk spilling?
It will be up to a future PR to implement disk spilling, which shouldn't be too difficult.
My understanding was that local_directory was used for more than just spilling, so I went on a hunt. To be fair, it's still unclear to me even where in the code local_directory is really used. I did find one other case where it's used, though: when you upload files, see for example https://github.com/dask/distributed/blob/607cfd2ce00edd44c99da3273de0763a426dda7d/distributed/tests/test_worker.py#L176-L202. That isn't to say this is the only other case, but I couldn't confirm nor deny whether there are other cases besides spilling and file uploading. Maybe @quasiben or @jakirkham would know.
dask_cuda/tests/test_proxy.py
a = proxy_object.asproxy(org.copy())
b = proxy_object.asproxy(org.copy())
res1 = tensordot_lookup(a, b).flatten()
res2 = tensordot_lookup(org.copy(), org.copy()).flatten()
I see you're using tensordot_lookup and einsum_lookup below. Could we instead test np.tensordot and np.einsum, as users would call them?
Changed to dask.array.tensordot() and dask.array.einsum().
assert k2._obj_pxy_serialized()
assert not dhf["k4"]._obj_pxy_serialized()

# Deleting k2 does change anything since k3 still holds a |
Did you mean to write "Deleting k2 does NOT change anything"?
I have changed the test to use dhf.proxies_tally for the check.
@pentschev thanks for the review, much appreciated. I have addressed all of your suggestions, I think :)
Thanks @madsbk for addressing those. It seems that the failing tests are legit; all three seem to fail with similar errors:
dask_cuda/tests/test_proxy.py::test_proxy_object_parquet *** stack smashing detected ***: <unknown> terminated
Fatal Python error: Aborted
I think it is a cuDF bug: rapidsai/cudf#7074
Thanks for checking the failing tests @madsbk . To the extent I was capable of verifying this PR, everything looks good to me now. I'm approving but will wait until tomorrow for merging to give others a chance to review it as well.
This PR introduces a new device host file that uses ProxyObject to implement spilling of individual CUDA objects, as opposed to the current host file, which spills entire keys.
To use, set DASK_JIT_UNSPILL=True
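A minimal sketch of enabling the feature from Python (the DASK_JIT_UNSPILL variable comes from this description; setting it through the environment before importing Dask, and the LocalCUDACluster arguments, are assumptions about typical usage):

```python
import os

# Must be set before Dask reads its configuration, i.e. before importing
# dask/dask_cuda (or export it in the shell instead).
os.environ["DASK_JIT_UNSPILL"] = "True"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# device_memory_limit is the usual dask-cuda spilling threshold; the value
# here is purely illustrative.
cluster = LocalCUDACluster(device_memory_limit="6.5GB")
client = Client(cluster)
```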
Motivation
Aliases at the task level
Consider the following two tasks:
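(The original snippet is not reproduced in this excerpt; the sketch below illustrates the shape of such tasks, inferred from the discussion that follows. task1, task2 and df1 are assumed names; df2, k1 and k2 are referenced below.)

```python
import cudf

def task1():
    # Produces two dataframes and returns them together under one key (k1)
    df1 = cudf.DataFrame({"a": range(10)})
    df2 = cudf.DataFrame({"a": range(10)})
    return [df1, df2]

def task2(dfs):
    # Returns an alias of the second dataframe (df2) under a new key (k2)
    return dfs[1]
```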
Running the two tasks on a worker, we get something like:
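(The original output is not shown here; the following continues the illustrative sketch above, with `data` standing in for the worker's data store.)

```python
data = {}
data["k1"] = task1()            # holds [df1, df2]: 2 x sizeof(df) on device
data["k2"] = task2(data["k1"])  # an alias of df2, no new device memory used

# "k2" is the very same object already held inside "k1"
assert data["k2"] is data["k1"][1]
```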
Since the current implementation of spilling works on keys and handles each key separately, it overestimates the device memory used: sizeof(df)*3. But even worse, if it decides to spill k2, no device memory is freed, since k1 still holds a reference to df2!
The new spilling implementation fixes this issue by wrapping identical CUDA objects in a shared ProxyObject; thus, in this case, df2 in both k1 and k2 will refer to the same ProxyObject.
.Sharing device buffers
Consider the following code snippet:
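(The snippet is not included in this excerpt; below is a sketch of the kind of code being described, assuming dask's shuffle_group helper and cudf's dispatch registrations. Argument values are illustrative.)

```python
import cudf
import dask_cudf  # noqa: F401 -- registers the cudf backends for dask's shuffle helpers
from dask.dataframe.shuffle import shuffle_group

df = cudf.DataFrame({"a": range(10)})

# Split df into two groups; with cudf the returned partitions can be views
# into a single underlying device memory buffer.
groups = shuffle_group(df, ["a"], 0, 2, 2, False, 2)
v1 = groups[0]
v2 = groups[1]
```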
In this case, v1 and v2 are separate objects and are handled separately in both the current and the new spilling implementation. However, shuffle_group() in cudf actually returns a single device memory buffer, such that v1 and v2 point to the same underlying memory buffer. Thus the current implementation will again overestimate the memory use and spill one of the dataframes without any effect.
The new implementation takes this into account when estimating memory usage and makes sure that either both dataframes are spilled or neither of them is.
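As a rough illustration of the accounting problem (continuing the sketch above; assumes dask_cudf is imported so cudf's sizeof registration is active):

```python
from dask.sizeof import sizeof

# Per-key accounting sums the per-object estimates, so the shared device
# buffer is counted once for v1 and once for v2 ...
naive_estimate = sizeof(v1) + sizeof(v2)

# ... whereas the device memory actually in use is roughly one buffer.
# Buffer-aware accounting counts it once and spills (or keeps) v1 and v2
# together.
```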
cc. @beckernick, @VibhuJawa
xref: dask/distributed#3756