
Calling learner.feature_importance on larger than memory dataset causes OOM #310

Closed
scottcha opened this issue Dec 11, 2021 · 8 comments
Labels
bug Something isn't working


@scottcha

Repro steps:

  1. Train a model with larger than memory data
  2. Call learn.feature_importance()

Expected result: feature importance is computed and shown for each feature
Actual result: OOM. Full repro and notebook here: https://github.com/scottcha/TsaiOOMRepro/blob/main/TsaiOOMRepro.ipynb
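
For context, a minimal sketch of the setup (shapes, file names, and the architecture are assumptions; the linked notebook is the authoritative repro):

    # Minimal sketch of the repro; shapes/paths/arch are assumptions, see the notebook for the real code.
    import numpy as np
    import zarr
    from tsai.all import get_ts_dls, ts_learner, InceptionTime, accuracy

    # X lives on disk as a zarr array far larger than RAM: [samples x n_vars x seq_len]
    X = zarr.open("X.zarr", mode="r")       # e.g. shape (60000, 978, 1441), float32
    y = np.load("y.npy")                    # one label per sample

    splits = (list(range(50_000)), list(range(50_000, 60_000)))   # train / valid indices
    dls = get_ts_dls(X, y, splits=splits, bs=32)
    learn = ts_learner(dls, InceptionTime, metrics=accuracy)
    learn.fit_one_cycle(1)

    learn.feature_importance()   # OOM: the whole validation split gets materialized in memory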

os : Linux-5.4.0-91-generic-x86_64-with-glibc2.17
python : 3.8.11
tsai : 0.2.24
fastai : 2.5.3
fastcore : 1.3.26
zarr : 2.10.0
torch : 1.9.1+cu102
n_cpus : 24
device : cuda (GeForce GTX 1080 Ti)

Stack Trace:

MemoryError Traceback (most recent call last)
/tmp/ipykernel_3968/3713785271.py in <module>
----> 1 learn.feature_importance()

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/learner.py in feature_importance(self, feature_names, key_metric_idx, show_chart, save_df_path, random_state)
337 value = self.get_X_preds(X_valid, y_valid, with_loss=True)[-1].mean().item()
338 else:
--> 339 output = self.get_X_preds(X_valid, y_valid)
340 value = metric(output[0], output[1]).item()
341 print(f"{k:3} feature: {COLS[k]:20} {metric_name}: {value:8.6f}")

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/inference.py in get_X_preds(self, X, y, bs, with_input, with_decoded, with_loss)
16 print("cannot find loss as y=None")
17 with_loss = False
---> 18 dl = self.dls.valid.new_dl(X, y=y)
19 if bs: setattr(dl, "bs", bs)
20 else: assert dl.bs, "you need to pass a bs != 0"

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in new_dl(self, X, y)
486 assert X.ndim == 3, "You must pass an X with 3 dimensions [batch_size x n_vars x seq_len]"
487 if y is not None and not is_array(y) and not is_listy(y): y = [y]
--> 488 new_dloader = self.new(self.dataset.add_dataset(X, y=y))
489 return new_dloader
490

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_dataset(self, X, y, inplace)
422 @patch
423 def add_dataset(self:NumpyDatasets, X, y=None, inplace=True):
--> 424 return add_ds(self, X, y=y, inplace=inplace)
425
426 @patch

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in add_ds(dsets, X, y, inplace)
413 tls = dsets.tls if with_labels else dsets.tls[:dsets.n_inp]
414 new_tls = L([tl._new(item, split_idx=1) for tl,item in zip(tls, items)])
--> 415 return type(dsets)(tls=new_tls)
416 elif isinstance(dsets, TfmdLists):
417 new_tl = dsets._new(items, split_idx=1)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __init__(self, X, y, items, sel_vars, sel_steps, tfms, tls, n_inp, dl_type, inplace, **kwargs)
378 if len(self.tls) > 0 and len(self.tls[0]) > 0:
379 self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380 self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:]))
381 for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in <listcomp>(.0)
378 if len(self.tls) > 0 and len(self.tls[0]) > 0:
379 self.typs = [type(tl[0]) if isinstance(tl[0], torch.Tensor) else self.typs[i] for i,tl in enumerate(self.tls)]
--> 380 self.ptls = L([typ(stack(tl[:]))[...,self.sel_vars, self.sel_steps] if i==0 else typ(stack(tl[:]))
381 for i,(tl,typ) in enumerate(zip(self.tls,self.typs))]) if inplace and len(tls[0]) != 0 else tls
382

~/miniconda3/envs/tsai/lib/python3.8/site-packages/tsai/data/core.py in __getitem__(self, it)
243 def subset(self, i, **kwargs): return type(self)(self.items, splits=self.splits[i], split_idx=i, do_setup=False, types=self.types, **kwargs)
244 def __getitem__(self, it):
--> 245 if hasattr(self.items, 'oindex'): return self.items.oindex[self._splits[it]]
246 else: return self.items[self._splits[it]]
247 def __len__(self): return len(self._splits)

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/indexing.py in __getitem__(self, selection)
602 selection = ensure_tuple(selection)
603 selection = replace_lists(selection)
--> 604 return self.array.get_orthogonal_selection(selection, fields=fields)
605
606 def __setitem__(self, selection, value):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in get_orthogonal_selection(self, selection, out, fields)
939 indexer = OrthogonalIndexer(selection, self)
940
--> 941 return self._get_selection(indexer=indexer, out=out, fields=fields)
942
943 def get_coordinate_selection(self, selection, out=None, fields=None):

~/miniconda3/envs/tsai/lib/python3.8/site-packages/zarr/core.py in _get_selection(self, indexer, out, fields)
1107 # setup output array
1108 if out is None:
-> 1109 out = np.empty(out_shape, dtype=out_dtype, order=self._order)
1110 else:
1111 check_array_shape('out', out, out_shape)

MemoryError: Unable to allocate 315. GiB for an array with shape (60000, 978, 1441) and data type float32
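
The 315 GiB figure is just the full validation split materialized as a dense float32 array, which is easy to verify:

    # Sanity check of the reported allocation: the whole validation split as dense float32.
    n_samples, n_vars, seq_len = 60000, 978, 1441
    bytes_needed = n_samples * n_vars * seq_len * 4   # float32 = 4 bytes per element
    print(bytes_needed / 2**30)                       # ~315.0 (GiB)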

@oguiza oguiza added the bug Something isn't working label Dec 12, 2021
@oguiza
Contributor

oguiza commented Dec 12, 2021

Hi @scottcha,
Thanks for taking the time to report this bug.
This issue occurs when each feature is shuffled: to shuffle the data, it first needs to be loaded into memory.
By default, feature_importance uses all the data in the validation split, which makes it usable only with in-memory datasets.
There are a few alternatives to fix this issue:

  1. Add X and y as optional arguments. Then feature importance will be measured on the X and y you pass instead of the entire dataset.
  2. Add partial_n as an optional argument (int or float, like in the dataloaders). In this way, you could indicate either a fixed number of samples with an int (e.g. 1000 samples) or a percentage of the validation set with a float.
  3. Add X, y, and partial_n, so that you can use X & y or partial_n.

I think option 3 would probably cover most scenarios as it's the most flexible.
What do you think, Scott?
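
A hypothetical sketch of what option 3's signature and argument handling could look like (names, defaults, and the body are illustrative only, not the actual tsai implementation):

    # Hypothetical sketch of option 3: accept either an explicit X/y or a partial_n.
    import numpy as np

    def feature_importance(self, X=None, y=None, partial_n=None, random_state=23, **kwargs):
        # partial_n: int -> fixed number of validation samples, float -> fraction of the split
        if X is not None and partial_n is not None:
            raise ValueError("pass either X/y or partial_n, not both")
        if partial_n is not None:
            n_valid = len(self.dls.valid.dataset)
            n = partial_n if isinstance(partial_n, int) else int(partial_n * n_valid)
            rng = np.random.default_rng(random_state)
            idxs = rng.choice(n_valid, n, replace=False)   # only this subset is loaded into memory
            ...   # slice X/y down to those indices
        ...       # then shuffle each feature in turn and measure the metric drop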

@scottcha
Author

scottcha commented Dec 12, 2021

@oguiza I agree option 3 is the most flexible.
I tried out option 1 as a workaround, but I ran into a separate memory issue in the loop doing the feature importance calcs:

My entire Chrome session running Jupyter crashes with this error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Each of my samples is about 0.5 MB to 1 MB on disk, and I hit this error even when computing with only 100 samples. Since I have ~900 features, the loop runs the calculation that many times, but it seems to hit this around iteration 50.

Monitoring my system RAM shows it growing aggressively, by approximately 1 GB per iteration of the feature importance calculation, while my GPU RAM stays roughly constant, so something seems to be leaking or growing out of control. My guess is that it's related to some of the GPU-allocated objects not getting freed, but I wasn't sure how to debug that.
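
One generic way to narrow down whether tensors are accumulating between iterations (not tsai-specific; assumes psutil is installed):

    # Generic per-iteration memory logging (not tsai-specific; psutil is an extra dependency).
    import gc, os
    import psutil
    import torch

    proc = psutil.Process(os.getpid())

    def log_memory(tag):
        rss_gib = proc.memory_info().rss / 2**30                       # process resident memory
        cuda_gib = torch.cuda.memory_allocated() / 2**30 if torch.cuda.is_available() else 0.0
        n_tensors = sum(1 for o in gc.get_objects() if torch.is_tensor(o))
        print(f"{tag}: rss={rss_gib:.2f} GiB  cuda={cuda_gib:.2f} GiB  live tensors={n_tensors}")

Calling log_memory once per feature iteration would show whether the growth is in host RAM, CUDA allocations, or the number of live tensors.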

Also, FWIW, I ran this outside of Jupyter in the VS Code Python debugger and got the same error, with one additional piece of information: it indicates "Dataloader Worker (PID(s) 1618) Exited Unexpectedly".

Thanks

oguiza pushed a commit that referenced this issue Dec 14, 2021
@oguiza
Contributor

oguiza commented Dec 14, 2021

Hi @scottcha,
Thanks for providing more details on your issue.
I've now updated feature_importance and get_X_preds to ensure as much non-required data as possible is removed (using gc.collect). Please try it again if you can, and let me know if you still have issues.
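
For reference, the general idea, dropping each iteration's large intermediates and collecting before the next one, in a minimal self-contained form (the arrays and loop below are stand-ins, not the tsai code; in the CUDA case torch.cuda.empty_cache() can additionally be called):

    # Minimal stand-in for the per-feature loop showing the del + gc.collect pattern.
    import gc
    import numpy as np

    metric_values = []
    for k in range(3):                        # stand-in for looping over ~900 features
        preds = np.random.rand(10_000, 100)   # stand-in for a large prediction array
        metric_values.append(float(preds.mean()))
        del preds                             # drop the reference to the large intermediate...
        gc.collect()                          # ...so peak memory stays around one iteration's worth
    print(metric_values)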

@scottcha
Author

I tried out the new implementation. Here are a couple of notes:

  1. When I provide my own smaller X, y parameters, I still get the crash at about the 50th iteration of the feature importance calculation, as well as high system memory usage.
  2. The current logic to slice X doesn't seem to work with native zarr arrays. I believe that when X is a zarr array, this would be the right way to slice it based on a set of random indices (see the runnable sketch after this list):
    X = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None)))
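
For reference, a small self-contained example of that slicing pattern (the array shape and sample count below are made up):

    # Slicing a zarr array by random sample indices without loading the whole array (made-up shapes).
    import numpy as np
    import zarr

    X = zarr.zeros((1000, 16, 128), chunks=(100, 16, 128), dtype="float32")  # [samples x n_vars x seq_len]

    rng = np.random.default_rng(23)
    rand_idxs = np.sort(rng.choice(X.shape[0], 100, replace=False))

    # Plain fancy indexing (X[rand_idxs]) works for numpy arrays; for zarr, orthogonal selection
    # (equivalently X.oindex[rand_idxs, :, :]) pulls only the selected samples into memory:
    X_sub = X.get_orthogonal_selection((rand_idxs, slice(None), slice(None)))
    print(X_sub.shape)   # (100, 16, 128) -- a regular in-memory numpy array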

@oguiza
Contributor

oguiza commented Dec 23, 2021

Hi @scottcha,
I need to adapt feature_importance to work with zarr arrays, as you mention. I'll fix it within the next few days.
But I'm not exactly sure what's causing the issue in your bullet point 1.
Could you please try to export the learner once it's trained and reload it using load_learner? If you do that, it will contain no data. You can then pass a smaller array and see if the issue persists. That'll give us a hint at what the root cause might be.
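
A sketch of that export/reload round trip (learn.export and load_learner are the standard fastai calls that tsai builds on; the file name and the X_small/y_small arrays are placeholders):

    # Sketch: export the trained learner, reload it with no data attached, then pass a small array.
    from tsai.all import load_learner   # re-exported from fastai

    learn.export("learner.pkl")                     # serializes model + transforms, but no data
    learn_empty = load_learner("learner.pkl", cpu=False)

    # learn_empty holds no dataset, so any memory growth now comes only from what is passed in:
    probas, targets, preds = learn_empty.get_X_preds(X_small, y_small)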

@scottcha
Author

Sorry it took me a bit to get back to this.
I refreshed my env with the latest and reran my use case (large zarr file, sliced before calling feature_importance), and I was able to complete the run without encountering the OOM or the shared memory errors.
So I would say that, at this point, the issues I called out are resolved or not reproducible, with the exception that it may not natively handle zarr arrays, though that's pretty easy to work around.

Thanks!

@oguiza
Contributor

oguiza commented Jan 17, 2022

Ok, I'm glad to hear that, Scott.
I had forgotten to fix the indexing for zarr arrays; I've added it now in the GitHub repo. It works when you pass partial_n (int or float), since the data doesn't fit in memory.
If you pass an X, it needs to be a numpy array.
If you have a chance, it'd be good if you could test it (use pip install -Uqq git+https://github.com/timeseriesAI/tsai.git).
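
Usage then looks roughly like this (the numbers and the X_small/y_small names are arbitrary):

    # Either sample a subset of the validation split directly from the (zarr-backed) dataset...
    learn.feature_importance(partial_n=1000)    # a fixed number of validation samples
    learn.feature_importance(partial_n=0.05)    # ...or 5% of the validation split

    # ...or pass an explicit, in-memory numpy array:
    learn.feature_importance(X=X_small, y=y_small)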

@oguiza
Contributor

oguiza commented Jan 25, 2022

I'll close this issue since the requested fix has already been implemented. Please reopen it if necessary.

@oguiza oguiza closed this as completed Jan 25, 2022