Ability to have output dask_cudf.DataFrame not necessarily all in memory #1128

Open
wmalpica opened this issue Nov 9, 2020 · 0 comments

wmalpica commented Nov 9, 2020

For ideas on how to implement this, see this conversation:

felipe Nov 7th at 9:30 AM
When a dask_cudf DataFrame spills its memory off the GPU onto disk, where does that transfer actually happen? At a lower level, how can we build a dask_cudf DataFrame where some of the partitions live off the GPU, so that outputs can be larger than the available GPU memory?

Benjamin Zaitlen 1 day ago
I think reading through this issue would be a good starting point (you can safely ignore the larger issue of double counting for now): dask/distributed#4186
Double Counting and Issues w/Spilling · Issue #4186 · dask/distributed
cuDF recently made a change (they are now reverting that change) which impacted the way memory usage is reported. I'm bringing this up to see if my understanding about memory reporting within i...

Benjamin Zaitlen 1 day ago
The stored data/zict setup for dask-cuda is here:
https://github.com/rapidsai/dask-cuda/blob/302d1b8d422dbb981a0e36b1c7f14941cfd80ef7/dask_cuda/device_host_file.py#L103-L125
That will make more sense after reading through the issue

Benjamin Zaitlen 1 day ago
After that, you will want to look at the underlying zict buffer object:
https://github.com/dask/zict/blob/master/zict/buffer.py
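To make the buffer idea concrete, here is a minimal pure-Python sketch of a two-tier mapping: values stay in a fast store until their total weight exceeds a limit, then the least recently used ones move to a slow store and come back on access. This is only an illustration of the concept, not zict's actual code; the class name `TwoTierBuffer` and its internals are hypothetical, and plain dicts stand in for GPU memory and host memory or disk.

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class TwoTierBuffer(MutableMapping):
    """Hypothetical stand-in for a fast/slow spilling buffer (not zict itself)."""

    def __init__(self, fast, slow, n, weight=lambda k, v: 1):
        self.fast = fast              # e.g. a dict standing in for GPU memory
        self.slow = slow              # e.g. a dict standing in for host memory or disk
        self.n = n                    # maximum total weight allowed in `fast`
        self.weight = weight          # weight(key, value) -> cost of holding the value
        self._order = OrderedDict()   # key -> weight, least recently used first
        self._total = 0               # current total weight held in `fast`

    def _evict(self):
        # Spill least recently used keys until `fast` fits within the limit.
        while self._total > self.n and self._order:
            key, w = self._order.popitem(last=False)
            self.slow[key] = self.fast.pop(key)
            self._total -= w

    def __setitem__(self, key, value):
        if key in self:
            del self[key]
        w = self.weight(key, value)
        self.fast[key] = value
        self._order[key] = w
        self._total += w
        self._evict()

    def __getitem__(self, key):
        if key in self.fast:
            self._order.move_to_end(key)   # mark as recently used
            return self.fast[key]
        value = self.slow.pop(key)         # raises KeyError if the key is unknown
        self[key] = value                  # promote back into the fast tier
        return value

    def __delitem__(self, key):
        if key in self.fast:
            del self.fast[key]
            self._total -= self._order.pop(key)
        else:
            del self.slow[key]

    def __contains__(self, key):
        return key in self.fast or key in self.slow

    def __iter__(self):
        yield from self.fast
        yield from self.slow

    def __len__(self):
        return len(self.fast) + len(self.slow)


# Usage: with a limit of two items, the third insert spills the oldest key.
buf = TwoTierBuffer(fast={}, slow={}, n=2)
buf["a"], buf["b"], buf["c"] = 1, 2, 3
```

Because the slow store is just another mapping, a second buffer can be passed as `slow` to chain more than two tiers, which is one way to picture a device, host, and disk hierarchy.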

Benjamin Zaitlen 1 day ago
Lastly, you may be interested in this issue:
rapidsai/dask-cuda#438
Log when spilling from GPU memory to host memory and from host memory to disk happens. · Issue #438 · rapidsai/dask-cuda
Add logs that can be turned on/off with dask.config.set that log when spilling occurs. Spilling has such an adverse effect on perf that it's important to highlight to our customers when it happ...
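As a rough illustration of what switchable spill logging could look like, the sketch below wraps any mapping used as a slower tier and logs each write into it (a spill) and each read out of it (an unspill). It is an assumption-laden sketch, not dask-cuda's implementation: the `LoggingStore` class, the `spill_sketch` logger name, and the `enabled` flag are all hypothetical, and a plain boolean stands in for the `dask.config.set` switch that the issue proposes.

```python
import logging
from collections.abc import MutableMapping

logger = logging.getLogger("spill_sketch")  # hypothetical logger name


class LoggingStore(MutableMapping):
    """Wrap a slower-tier mapping and log traffic into and out of it."""

    def __init__(self, store, name, enabled=True):
        self.store = store        # the wrapped mapping (host dict, disk store, ...)
        self.name = name          # label used in log messages, e.g. "host" or "disk"
        self.enabled = enabled    # hypothetical on/off switch for the log output

    def __setitem__(self, key, value):
        # A write into the slower tier means something was spilled.
        if self.enabled:
            logger.info("spill: key=%r -> %s", key, self.name)
        self.store[key] = value

    def __getitem__(self, key):
        value = self.store[key]   # raises KeyError before anything is logged
        if self.enabled:
            logger.info("unspill: key=%r <- %s", key, self.name)
        return value

    def __delitem__(self, key):
        del self.store[key]

    def __contains__(self, key):
        # Defined explicitly so membership checks do not emit log lines.
        return key in self.store

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)
```

Wrapping the slower tier rather than the buffer keeps the logging independent of the eviction policy: whatever decides to move data, the move shows up as a write to the wrapped store.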

Benjamin Zaitlen 1 day ago
Happy to chat on Monday about this as well.
cc @peter Entschev in case he has thoughts on Monday

felipe 1 day ago
You are my hero Benjamin
