
Calling learner.feature_importance on larger than memory dataset causes OOM #310

Closed
scottcha opened this issue Dec 11, 2021 · 8 comments
Labels
bug Something isn't working


@scottcha

Repro steps:

  1. Train a model with larger than memory data
  2. Call learn.feature_importance()

Expected result: feature importance is computed and shown for each feature
Actual result: OOM. Full repro and notebook here: https://github.com/scottcha/TsaiOOMRepro/blob/main/TsaiOOMRepro.ipynb
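
For context, a minimal sketch of the setup (shapes, file names, and the architecture are assumptions; the linked notebook is the authoritative repro):

    # Minimal sketch of the repro; shapes/paths/arch are assumptions, see the notebook for the real code.
    import numpy as np
    import zarr
    from tsai.all import get_ts_dls, ts_learner, InceptionTime, accuracy

    # X lives on disk as a zarr array far larger than RAM: [samples x n_vars x seq_len]
    X = zarr.open("X.zarr", mode="r")       # e.g. shape (60000, 978, 1441), float32
    y = np.load("y.npy")                    # one label per sample

    splits = (list(range(50_000)), list(range(50_000, 60_000)))   # train / valid indices
    dls = get_ts_dls(X, y, splits=splits, bs=32)
    learn = ts_learner(dls, InceptionTime, metrics=accuracy)
    learn.fit_one_cycle(1)

    learn.feature_importance()   # OOM: the whole validation split gets materialized in memory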

os : Linux-5.4.0-91-generic-x86_64-with-glibc2.17
python : 3.8.11
tsai : 0.2.24
fastai : 2.5.3
fastcore : 1.3.26
zarr : 2.10.0
torch : 1.9.1+cu102
n_cpus : 24
device : cuda (GeForce GTX 1080 Ti)

Stack Trace:

MemoryError Traceback (most recent call last)
/tmp/ipykernel_3968/3713785271.py in <module>
----> 1 learn.feature_importance()

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/learner.py in feature_importance(self, feature_names, key_metric_idx, show_chart, save_df_path, random_state)
337 value = self.get_X_preds(X_valid, y_valid, with_loss=True)[-1].mean().item()
338 else:
--> 339 output = self.get_X_preds(X_valid, y_valid)
340 value = metric(output[0], output[1]).item()
341 print(f"{k:3} feature: {COLS[k]:20} {metric_name}: {value:8.6f}")

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/inference.py in get_X_preds(self, X, y, bs, with_input, with_decoded, with_loss)
16 print("cannot find loss as y=None")
17 with_loss = False
---> 18 dl = self.dls.valid.new_dl(X, y=y)
19 if bs: setattr(dl, "bs", bs)
20 else: assert dl.bs, "you need to pass a bs != 0"

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in new_dl(self, X, y)
486 assert X.ndim == 3, "You must pass an X with 3 dimensions [batch_size x n_vars x seq_len]"
487 if y is not None and not is_array(y) and not is_listy(y): y = [y]
--> 488 new_dloader = self.new(self.dataset.add_dataset(X, y=y))
489 return new_dloader
490

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_dataset(self, X, y, inplace)
422 @patch
423 def add_dataset(self:NumpyDatasets, X, y=None, inplace=True):
--> 424 return add_ds(self, X, y=y, inplace=inplace)
425
426 @patch

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_ds(dsets, X, y, inplace)
413 tls = dsets.tls if with_labels else dsets.tls[:dsets.n_inp]
414 new_tls = L([tl._new(item, split_idx=1) for tl,item in zip(tls, items)])
--> 415 return type(dsets)(tls=new_tls)
416 elif isinstance(dsets, TfmdLists):
417 new_tl = dsets._new(items, split_idx=1)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __init__(self, X, y, items, sel_vars, sel_steps, tfms, tls, n_inp, dl_type, inplace, **kwargs)
378 if len(self.tls) > 0 and len(self.tls[0]) > 0:
379 self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380 self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:]))
381 for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in <listcomp>(.0)
378 if len(self.tls) > 0 and len(self.tls[0]) > 0:
379 self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380 self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:]))
381 for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __getitem__(self, it)
243 def subset(self, i, **kwargs): return type(self)(self.items, splits=self.splits[i], split_idx=i, do_setup=False, types=self.types, **kwargs)
244 def __getitem__(self, it):
--> 245 if hasattr(self.items, 'oindex'): return self.items.oindex[self._splits[it]]
246 else: return self.items[self._splits[it]]
247 def __len__(self): return len(self._splits)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/indexing.py in __getitem__(self, selection)
602 selection = ensure_tuple(selection)
603 selection = replace_lists(selection)
--> 604 return self.array.get_orthogonal_selection(selection, fields=fields)
605
606 def __setitem__(self, selection, value):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in get_orthogonal_selection(self, selection, out, fields)
939 indexer = OrthogonalIndexer(selection, self)
940
--> 941 return self._get_selection(indexer=indexer, out=out, fields=fields)
942
943 def get_coordinate_selection(self, selection, out=None, fields=None):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
1107 # setup output array
1108 if out is None:
-> 1109 out = np.empty(out_shape, dtype=out_dtype, order=self._order)
1110 else:
1111 check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 315. GiB for an array with shape (60000, 978, 1441) and data type float32
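
The 315 GiB figure is just the full validation split materialized as a dense float32 array, which is easy to verify:

    # Sanity check of the reported allocation: the whole validation split as dense float32.
    n_samples, n_vars, seq_len = 60000, 978, 1441
    bytes_needed = n_samples * n_vars * seq_len * 4   # float32 = 4 bytes per element
    print(bytes_needed / 2**30)                       # ~315.0 (GiB)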

@oguiza oguiza added the bug Something isn't working label Dec 12, 2021
@oguiza
Contributor

oguiza commented Dec 12, 2021

Hi @scottcha,
Thanks for taking the time to report this bug.
This issue occurs when each feature is shuffled: to shuffle the data, it first needs to be loaded into memory.
By default, feature_importance uses all the data in the validation split, which makes it usable only with in-memory datasets.
There are a few alternatives to fix this issue:

  1. Add X and y as optional arguments. Then feature importance will be measured on the X and y you pass instead of the entire dataset.
  2. Add partial_n as an optional argument (int or float, like in the dataloaders). In this way, you could indicate either a fixed number of samples with an int (e.g. 1000 samples) or a percentage of the validation set with a float.
  3. Add X, y, and partial_n, so that you can use X & y or partial_n.

I think option 3 would probably cover most scenarios as it's the most flexible.
What do you think, Scott?
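
A hypothetical sketch of what option 3's signature and argument handling could look like (names, defaults, and the body are illustrative only, not the actual tsai implementation):

    # Hypothetical sketch of option 3: accept either an explicit X/y or a partial_n.
    import numpy as np

    def feature_importance(self, X=None, y=None, partial_n=None, random_state=23, **kwargs):
        # partial_n: int -> fixed number of validation samples, float -> fraction of the split
        if X is not None and partial_n is not None:
            raise ValueError("pass either X/y or partial_n, not both")
        if partial_n is not None:
            n_valid = len(self.dls.valid.dataset)
            n = partial_n if isinstance(partial_n, int) else int(partial_n * n_valid)
            rng = np.random.default_rng(random_state)
            idxs = rng.choice(n_valid, n, replace=False)   # only this subset is loaded into memory
            ...   # slice X/y down to those indices
        ...       # then shuffle each feature in turn and measure the metric drop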

@scottcha
Author

scottcha commented Dec 12, 2021

@oguiza I agree option 3 is the most flexible.
I tried out option 1 as a workaround, but I ran into a separate memory issue in the loop doing the feature importance calcs:

My entire Chrome session running Jupyter crashes with this error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Each of my samples is about 0.5 MB to 1 MB on disk, and I hit this error even when computing with only 100 samples. Since I have ~900 features, the loop runs the calculation that many times, but it seems to hit this around iteration 50.

Monitoring my system RAM shows it growing aggressively, by approximately 1 GB per iteration of the feature importance calculation, while my GPU RAM stays roughly constant, so something seems to be leaking or growing out of control. My guess is that it's related to some of the GPU-allocated objects not getting freed, but I wasn't sure how to debug that.
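
One generic way to narrow down whether tensors are accumulating between iterations (not tsai-specific; assumes psutil is installed):

    # Generic per-iteration memory logging (not tsai-specific; psutil is an extra dependency).
    import gc, os
    import psutil
    import torch

    proc = psutil.Process(os.getpid())

    def log_memory(tag):
        rss_gib = proc.memory_info().rss / 2**30                       # process resident memory
        cuda_gib = torch.cuda.memory_allocated() / 2**30 if torch.cuda.is_available() else 0.0
        n_tensors = sum(1 for o in gc.get_objects() if torch.is_tensor(o))
        print(f"{tag}: rss={rss_gib:.2f} GiB  cuda={cuda_gib:.2f} GiB  live tensors={n_tensors}")

Calling log_memory once per feature iteration would show whether the growth is in host RAM, CUDA allocations, or the number of live tensors.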

Also, FWIW, I ran this outside of Jupyter in the VS Code Python debugger and got the same error, with one additional piece of information: it indicates "Dataloader Worker (PID(s) 1618) Exited Unexpectedly".

Thanks

oguiza pushed a commit that referenced this issue Dec 14, 2021
@oguiza
Contributor

oguiza commented Dec 14, 2021

Hi @scottcha,
Thanks for providing more details on your issue.
I've now updated feature_importance and get_X_preds to ensure as much non-required data as possible is removed (using gc.collect). Please try it again if you can, and let me know if you still have issues.
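
For reference, the general idea, dropping each iteration's large intermediates and collecting before the next one, in a minimal self-contained form (the arrays and loop below are stand-ins, not the tsai code; in the CUDA case torch.cuda.empty_cache() can additionally be called):

    # Minimal stand-in for the per-feature loop showing the del + gc.collect pattern.
    import gc
    import numpy as np

    metric_values = []
    for k in range(3):                        # stand-in for looping over ~900 features
        preds = np.random.rand(10_000, 100)   # stand-in for a large prediction array
        metric_values.append(float(preds.mean()))
        del preds                             # drop the reference to the large intermediate...
        gc.collect()                          # ...so peak memory stays around one iteration's worth
    print(metric_values)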

@scottcha
Author

I tried out the new implementation. Here are a couple of notes:

  1. When I provide my own smaller X, y parameters, I still get the crash at about the 50th iteration of the feature importance calculation, as well as high system memory usage.
  2. The current logic to slice X doesn't seem to work with native zarr arrays. I believe that when X is a zarr array, this would be the right way to slice it based on a set of random indices (see the runnable sketch after this list):
    X = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None)))
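
For reference, a small self-contained example of that slicing pattern (the array shape and sample count below are made up):

    # Slicing a zarr array by random sample indices without loading the whole array (made-up shapes).
    import numpy as np
    import zarr

    X = zarr.zeros((1000, 16, 128), chunks=(100, 16, 128), dtype="float32")  # [samples x n_vars x seq_len]

    rng = np.random.default_rng(23)
    rand_idxs = np.sort(rng.choice(X.shape[0], 100, replace=False))

    # Plain fancy indexing (X[rand_idxs]) works for numpy arrays; for zarr, orthogonal selection
    # (equivalently X.oindex[rand_idxs, :, :]) pulls only the selected samples into memory:
    X_sub = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None)))
    print(X_sub.shape)   # (100, 16, 128) -- a regular in-memory numpy array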

@oguiza
Contributor

oguiza commented Dec 23, 2021

Hi @scottcha,
I need to adapt feature_importance to work with zarr arrays, as you mention. I'll fix it within the next few days.
But I'm not exactly sure what's causing the issue in your bullet point 1.
Could you please try to export the learner once it's trained and reload it using load_learner? If you do that, it will contain no data. You can then pass a smaller array and see if the issue persists. That'll give us a hint at what the root cause might be.
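
A sketch of that export/reload round trip (learn.export and load_learner are the standard fastai calls that tsai builds on; the file name and the X_small/y_small arrays are placeholders):

    # Sketch: export the trained learner, reload it with no data attached, then pass a small array.
    from tsai.all import load_learner   # re-exported from fastai

    learn.export("learner.pkl")                     # serializes model + transforms, but no data
    learn_empty = load_learner("learner.pkl", cpu=False)

    # learn_empty holds no dataset, so any memory growth now comes only from what is passed in:
    probas, targets, preds = learn_empty.get_X_preds(X_small, y_small)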

@scottcha
Author

Sorry it took me a bit to get back to this.
I refreshed my env with the latest and reran my use case (large zarr file, sliced before calling feature_importance), and I was able to complete the run without encountering the OOM or the shared memory errors.
So I would say that, at this point, the issues I called out are resolved or not reproducible, with the exception that it may not natively handle zarr arrays, though that's pretty easy to work around.

Thanks!

@oguiza
Contributor

oguiza commented Jan 17, 2022

Ok, I'm glad to hear that, Scott.
I had forgotten to fix the indexing for zarr arrays; I've added it now in the GitHub repo. It works when you pass partial_n (int or float), since the data doesn't fit in memory.
If you pass an X, it needs to be a numpy array.
If you have a chance, it'd be good if you could test it (use pip install -Uqq git+https://github.com/timeseriesAI/tsai.git).
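
Usage then looks roughly like this (the numbers and the X_small/y_small names are arbitrary):

    # Either sample a subset of the validation split directly from the (zarr-backed) dataset...
    learn.feature_importance(partial_n=1000)    # a fixed number of validation samples
    learn.feature_importance(partial_n=0.05)    # ...or 5% of the validation split

    # ...or pass an explicit, in-memory numpy array:
    learn.feature_importance(X=X_small, y=y_small)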

@oguiza
Contributor

oguiza commented Jan 25, 2022

I'll close this issue since the requested fix has already been implemented. Please reopen it if necessary.

@oguiza oguiza closed this as completed Jan 25, 2022