[BUG] Large memory requirements for SimpleImputer strategy median #4794
Comments
Thanks for filing an issue. I observe the example above (~2.16GB array) spiking memory to ~18GB on my machine in 22.06. As a note, you can edit your issues if they are accidentally filed without a title or for any other reason. I've updated this issue to better reflect the behavior, which helps us evaluate.
… (#4817)
I have implemented a fix for [BUG] Large memory requirements for SimpleImputer strategy median #4794. I narrowed the issue down to _masked_column_median. As expected, the extra memory comes from an unnecessary copy of the array when NaN is the masked value. When NaN is not the masked value, however, the copy is necessary. To fix this, I used in-place sorting. In both cases, though, memory usage still goes from 3000 MiB (the size of the original array) to 13000 MiB. From my understanding, sorting should only take up an additional 3000 MiB. Is it possible to reduce memory usage further? Even so, this fix reduces the memory used by over 5000 MiB.
Authors:
- https://github.com/erikrene
Approvers:
- William Hicks (https://github.com/wphicks)
URL: #4817
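To make the copy-versus-in-place distinction above concrete, here is a minimal NumPy sketch of a NaN-aware column median written both ways. It is not the actual cuML `_masked_column_median` code: the function names are hypothetical, it runs on the CPU, and it only covers the case where NaN is the masked value.

```python
import numpy as np


def _median_of_sorted_prefix(sorted_col, n_valid):
    """Median of the first n_valid entries of an already sorted 1-D array."""
    if n_valid == 0:
        return float("nan")
    mid = n_valid // 2
    if n_valid % 2:
        return float(sorted_col[mid])
    return float((sorted_col[mid - 1] + sorted_col[mid]) / 2)


def column_median_copy(col):
    """Hypothetical helper: np.sort returns a sorted copy, so a second
    full-size array is alive while the median is computed."""
    sorted_col = np.sort(col)  # NaNs are placed at the end
    # np.isnan allocates a temporary boolean array (1 byte per element).
    n_valid = col.size - int(np.isnan(sorted_col).sum())
    return _median_of_sorted_prefix(sorted_col, n_valid)


def column_median_inplace(col):
    """Hypothetical helper: ndarray.sort reorders col itself, so no second
    float array is allocated, but the caller's data is modified."""
    col.sort()  # in place; NaNs are placed at the end
    n_valid = col.size - int(np.isnan(col).sum())
    return _median_of_sorted_prefix(col, n_valid)


col = np.array([3.0, np.nan, 1.0, 2.0, np.nan, 4.0])
print(column_median_copy(col))     # col is left untouched
print(column_median_inplace(col))  # col is now sorted with NaNs last
```

The in-place variant is only safe when the caller no longer needs the original element order, which is presumably why a copy is still required in the non-NaN case described above.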
This was resolved by #4817. We now generally require less memory than the CPU scikit-learn version. Closing.
%load_ext gpu_memory_profiler
%load_ext memory_profiler
from sklearn.impute import SimpleImputer as skl_SimpleImputer
from cuml.preprocessing import SimpleImputer as cu_SimpleImputer
from sklearn.datasets import make_classification
import numpy as np
import cupy as cp
import gc
NROWS = [
64e6,
128e6,
256e6,
]
NROWS = [int(x) for x in NROWS]
NULL_PCT = [
0.1,
]
for N in NROWS:
    for NP in NULL_PCT:
        # Create some data and randomly set some elements as null
        X = np.random.normal(0, 10, size=(N, 1))
        mask = np.random.choice([True, False], size=X.shape, p=[NP, 1-NP])
        X[mask] = np.nan
        # Compare peak memory usage on GPU and CPU
        print(f"{N:,} rows, {NP} null percent, X size: {X.nbytes/1e9} GB")
        imputer = cu_SimpleImputer(strategy='median')
        %gpu_memit imputer.fit(X)
        imputer = skl_SimpleImputer(strategy='median')
        %memit imputer.fit(X)
        print()
        # Free the array before the next, larger run
        del X
        gc.collect()
64,000,000 rows, 0.1 null percent, X size: 0.512 GB
Peak GPU memory: 4225.00 MiB
peak memory: 6078.40 MiB, increment: 2136.23 MiB
128,000,000 rows, 0.1 null percent, X size: 1.024 GB
Peak GPU memory: 7219.00 MiB
peak memory: 9787.75 MiB, increment: 4272.26 MiB
256,000,000 rows, 0.1 null percent, X size: 2.048 GB
Peak GPU memory: 13203.00 MiB
peak memory: 17206.98 MiB, increment: 8544.85 MiB
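For a rough sense of how these figures scale with the input, the sketch below does plain arithmetic on the numbers reported above. The GPU column is an absolute peak (it includes whatever was already allocated before the call), while the CPU column is an increment, so the two are not directly comparable; reading the GPU slope between runs as a per-byte cost is an assumption.

```python
MIB = 2**20

# (rows, X size in GB, peak GPU MiB, CPU increment MiB), copied from the output above
runs = [
    (64_000_000, 0.512, 4225.00, 2136.23),
    (128_000_000, 1.024, 7219.00, 4272.26),
    (256_000_000, 2.048, 13203.00, 8544.85),
]

# Input size converted from GB (1e9 bytes) to MiB (2**20 bytes)
x_mib = [gb * 1e9 / MIB for _, gb, _, _ in runs]

# CPU: reported increment relative to the input size
cpu_ratio = [inc / x for (_, _, _, inc), x in zip(runs, x_mib)]

# GPU: growth in peak memory per MiB of extra input between consecutive runs
gpu_slope = [
    (runs[i + 1][2] - runs[i][2]) / (x_mib[i + 1] - x_mib[i])
    for i in range(len(runs) - 1)
]

for (rows, _, _, _), x, r in zip(runs, x_mib, cpu_ratio):
    print(f"{rows:>11,} rows: X = {x:8.2f} MiB, CPU increment / X = {r:.2f}")
for s in gpu_slope:
    print(f"GPU peak growth per MiB of extra input: {s:.2f}")
```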
Describe the bug
Running fit_transform with SimpleImputer results in running out of memory. This only occurs when the imputation strategy is median.
Steps/Code to reproduce bug
Output:
Expected behavior
This operation should not take as much memory as it does. Using a smaller array does not result in the error. According to nvidia-smi, GPU memory usage increases about 5x after running the code above.
Environment details (please complete the following information):