[BUG] Multi GPU KMeans memory usage is 2x larger than expected. #5936
Comments
@tfeher I believe we have an issue open for this already (and have for some time). The problem is that k-means 1) expects data in row-major layout, and 2) can't handle multiple partitions on each worker, so it instead concatenates the partitions on each worker, thus duplicating the data footprint. We can fix this, but it's going to take each worker being able to support multiple partitions as input and processing them in place.
Here is the issue I opened last year: #5212 (I'm looking for one that might have been opened earlier too. Maybe it was closed at some point).
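To illustrate the concatenation point above, here is a minimal CuPy sketch (sizes are hypothetical, not taken from the thread) of why stacking a worker's partitions doubles its footprint: `cp.concatenate` materializes a new contiguous array, so the original partitions and the combined copy coexist until the partitions are released.

```python
import cupy as cp

# Two hypothetical partitions held by one worker (~1 GB each as float32).
part_a = cp.random.rand(1_000_000, 250).astype(cp.float32)
part_b = cp.random.rand(1_000_000, 250).astype(cp.float32)

used_before = cp.get_default_memory_pool().used_bytes()

# Concatenation allocates a new contiguous array; the two partitions and the
# combined copy coexist until the partitions are dropped.
combined = cp.concatenate([part_a, part_b])

used_after = cp.get_default_memory_pool().used_bytes()
print(f"pool usage before: {used_before / 2**30:.2f} GiB, after: {used_after / 2**30:.2f} GiB")
```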
I think it is not exactly the same issue. In the example above, I carefully create the input data so that we have a single partition per worker, so no copy is needed. Additionally, in a separate test, I have tried to skip all the Dask input utils and directly call the KMeansMG wrapper (here is the script). This still results in double the memory usage.
I was initially going to say that we should verify whether your random input array is row major or column major (because, as you know, the input utils will copy the data to transpose it), but if you are also encountering this problem directly in the C++ layer, then that is indeed a problem. Wow, that makes me wonder if there have actually been multiple copies this whole time. That would explain a lot.
Yes, it is row major, as expected by the C++ layer.
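As an aside, a quick way to check the layout of a generated array (a small sketch, not from the original thread) is to inspect its contiguity flags: C-contiguous means row major, which is what the C++ k-means layer expects and what the Python input utilities would otherwise transpose (and hence copy).

```python
import cupy as cp

a = cp.random.rand(1000, 250).astype(cp.float32)

# c_contiguous == True means row-major layout, so no transpose copy is needed.
print(a.flags.c_contiguous, a.flags.f_contiguous)
```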
The memory duplication seems to happen just around the time when the clustering finishes. Here is a time trace of the memory usage:
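For context, a memory trace like the one referenced here can be collected with pynvml (a rough sketch; the polling loop, interval, and device index are assumptions, not the thread's actual tooling):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust per worker

# Poll used device memory while the clustering script runs in another process.
for _ in range(600):
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
    print(f"{time.time():.1f} {used_mib:.0f} MiB")
    time.sleep(0.1)

pynvml.nvmlShutdown()
```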
Ooh, interesting. That makes me wonder if the output is somehow being duplicated on its way back to the user (in Dask, or when cuML is providing the properly formatted output). I admit I'm not sure how the cuML output types work in Dask at this point, because I understand there was some refactoring of those internal APIs recently.
On a quick audit of the code, I can't find where a copy is being done in this case quite yet. It shouldn't have an impact to this extent, but we also have a PR that fixes having all the … Interestingly, I see a different behavior on a T4:
And changing initialization (…)
Managed to track the issue to a part of our Python input utilities. Still don't know why it's behaving this way specifically for Dask scenarios, but I should be able to triage it and create a fix soon.
Triaged it to an issue with concatenation and submitted a PR working on the fix.
Thanks @dantegd for the fix! After this, I believe there is still a duplication happening, as I describe above. We have a suspicion that this is related to how we pass the input data to predict after fitting.
@tfeher @achirkin I thought that could've been the case as well, but I was still seeing the duplication even when removing that call entirely (and not calculating the labels). After the fix of #5937 I cannot currently see any copies in the example on two machines:
Do you still see the double memory usage when testing the fix of that PR?
I've added a bunch of …
From this output it looks like there's no unnecessary allocation on the cuML side anymore. However, the script still fails with OOM if the allocation size is more than half of the available memory.
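That matches the transient-copy arithmetic: if a second copy of the local chunk exists even briefly, peak usage is roughly twice the data size, so anything above about half of the GPU's memory cannot fit. A purely illustrative calculation (all numbers made up, not measured from this report):

```python
# Illustrative numbers only.
gpu_total_gib = 16.0
overhead_gib = 1.0          # CUDA context, pool bookkeeping, etc.
data_gib = 9.0              # local chunk size, just over half of the GPU

peak_gib = 2 * data_gib     # original chunk + transient copy
fits = peak_gib <= gpu_total_gib - overhead_gib
print(f"peak ~{peak_gib} GiB, fits: {fits}")  # False once data exceeds ~half of memory
```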
Haven't had time to try it, but given this line … I wonder if the error would still happen if, say, we use cuml.dask.datasets.make_blobs? I'll make some time to try tomorrow if no one gets around to it.
I've tried to replace the generation code as you suggested:

```python
# dataset = da.random.random((n_rows, n_cols), chunks=(n_rows/n_gpus, n_cols)).astype(np.float32)
dataset = cuml.dask.datasets.make_blobs(n_samples=n_rows, n_features=n_cols, n_parts=n_gpus, order='C', dtype=np.float32)[0]
```

This doesn't make a difference w.r.t. the memory usage. The error you showed occurs in zict and is the expected behavior of the zict.Buffer LRU eviction logic. In my tests, this logic triggers exactly when the amount of free GPU memory is less than what is needed for one more allocation of the data (i.e. with the 12GB setting; when I set the data size to 4GB, eviction does not happen).
@achirkin, above you write:
This is because KMeans already receives GPU data. Could you confirm what happens with NumPy input data? (E.g. just remove the …) (This is independent of the zict error.)
Indeed, in this case it allocates another copy of the dataset on the GPU for a short time during the call to …
I missed that. Thanks Artem for confirming, and thanks Dante for the fix!
…#5937) Partial solution for #5936

The issue was that concatenating when having a single array per worker was causing a memory copy (not sure if always, but often enough). This PR avoids the concatenation when a worker has a single partition of data.

This comes from a behavior of CuPy, where some testing reveals that it sometimes creates an extra allocation when concatenating lists that consist of a single array:

```python
>>> import cupy as cp
>>> a = cp.random.rand(2000000, 250).astype(cp.float32)  # Memory occupied: 5936 MB
>>> b = [a]
>>> c = cp.concatenate(b)  # Memory occupied: 5936 MB <- no memory copy
```

```python
>>> import cupy as cp
>>> a = cp.random.rand(1000000, 250)  # Memory occupied: 2120 MB
>>> b = [a]
>>> c = cp.concatenate(b)  # Memory occupied: 4028 MB <- memory copy was performed!
```

I'm not sure what exact rules CuPy follows here; we could check, but in general avoiding the concatenate when we have a single partition is an easy fix that does not depend on behavior outside of cuML's code.

cc @tfeher @cjnolet

Authors:
- Dante Gama Dessavre (https://github.com/dantegd)

Approvers:
- Artem M. Chirkin (https://github.com/achirkin)
- Tamas Bela Feher (https://github.com/tfeher)
- Divye Gala (https://github.com/divyegala)

URL: #5937
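In code terms, the workaround described in the PR amounts to something like the following simplified pattern (a sketch of the idea, not the actual cuML implementation; `combine_partitions` is a hypothetical helper name):

```python
import cupy as cp

def combine_partitions(parts):
    # Hypothetical helper mirroring the fix: skip cp.concatenate when a worker
    # holds a single partition, since the concatenate may trigger an
    # unnecessary device copy.
    if len(parts) == 1:
        return parts[0]
    return cp.concatenate(parts)

# Usage: a worker with one partition keeps its array as-is (no copy),
# while a worker with several partitions still gets a contiguous array.
single = [cp.ones((1000, 250), dtype=cp.float32)]
print(combine_partitions(single) is single[0])  # True: no new allocation
```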
Fixed by #5937
Describe the bug
The expected GPU memory usage of cuML's k-means algorithm is `(n_rows * n_cols + n_clusters * n_cols) * sizeof(MathT)`, where `(n_rows, n_cols)` is the shape of the input matrix. In practice `n_clusters` is much less than `n_rows`, therefore the memory usage is expected to be only slightly larger than the input data size.

Here is a code example to demonstrate the memory usage of single-GPU k-means:
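(The original demonstration script is not reproduced above; the following is a hedged reconstruction of such a memory check. The array shape, cluster count, and pynvml-based measurement are assumptions, not the reporter's exact code.)

```python
import cupy as cp
import pynvml
from cuml.cluster import KMeans

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
used_mib = lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20

# ~4 GB of float32 input (assumed sizes for illustration).
n_rows, n_cols, n_clusters = 4_000_000, 250, 8
X = cp.random.random_sample((n_rows, n_cols), dtype=cp.float32)
print(f"after allocation: {used_mib():.0f} MiB")

# Expected extra footprint is roughly n_clusters * n_cols * 4 bytes, i.e. tiny.
KMeans(n_clusters=n_clusters).fit(X)
print(f"after fit: {used_mib():.0f} MiB")
```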
Expected output
In contrast, when using `cuml.dask.cluster.KMeans`, the memory usage is twice as large.

Steps/Code to reproduce bug
We will use `nvidia-smi` to monitor memory usage while running the reproducer.
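(The reproducer script itself is not reproduced above; below is a hedged sketch of such a multi-GPU setup. The cluster configuration, sizes, and the `map_blocks(cp.asarray)` transfer step are assumptions, not the exact script from the report.)

```python
import cupy as cp
import numpy as np
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.cluster import KMeans

if __name__ == "__main__":
    n_gpus = 2
    n_rows, n_cols = 8_000_000, 250            # ~4 GB float32 chunk per GPU (assumed)

    cluster = LocalCUDACluster(n_workers=n_gpus)
    client = Client(cluster)

    # One row-major partition per worker, so no concatenation should be needed.
    dataset = da.random.random((n_rows, n_cols),
                               chunks=(n_rows // n_gpus, n_cols)).astype(np.float32)
    dataset = dataset.map_blocks(cp.asarray)    # move each chunk to device memory
    dataset = dataset.persist()

    KMeans(n_clusters=8).fit(dataset)           # watch nvidia-smi while this runs

    client.close()
    cluster.close()
```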
Output
We can see from the output that after allocating the 4.0 GB input array we had 4801 MiB of memory usage, which went up to 8883 MiB after we started clustering. This is not expected.
Expected behavior
It is expected that the multi-GPU k-means implementation has memory usage similar to the single-GPU one: it should only need slightly more space than the local chunk of the input data.
Environment details (please complete the following information):
Checked with RAPIDS 24.04 and 24.06 on various GPUs (V100, A100, A30).