pickle for SVHN bigger than arff file #891

Closed
amueller opened this issue Dec 2, 2019 · 9 comments · Fixed by #983
Comments


amueller commented Dec 2, 2019

For dataset 41081, I get:

1.1G	dataset.arff
2.3G	dataset.pkl.py3

Maybe we should be using joblib instead? My hard drive is getting full because I'm trying to run CC-18 stuff, which is a bit annoying.
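
For reference, a minimal sketch of what a joblib-based cache could look like (not the current openml-python code; `df` just stands in for the cached dataset DataFrame):

```python
# Sketch only: compare a plain pickle cache with a joblib dump.
# `df` is a placeholder for the dataset's pandas DataFrame.
import pickle

import joblib

# Roughly what the cache does today: plain pickle of the DataFrame.
with open("dataset.pkl.py3", "wb") as f:
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

# joblib alternative; compress=3 trades dump time for a smaller file.
joblib.dump(df, "dataset.joblib", compress=3)
df_cached = joblib.load("dataset.joblib")
```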


amueller commented Dec 2, 2019

Potentially related: loading the dataset takes about 10GB of RAM, which makes running benchmarks on CC-18 trickier.

joaquinvanschoren commented

+1 for using joblib. Although Arrow/Feather actually seems even better, and will definitely load large datasets a lot faster once they are cached.
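
For what it's worth, a Feather cache would be roughly this (sketch only, assuming the data is already in a pandas DataFrame `df` and pyarrow is installed):

```python
# Sketch of a Feather-based cache; `df` is a placeholder DataFrame.
import pandas as pd

df.to_feather("dataset.feather")                 # columnar, pyarrow-backed
df_cached = pd.read_feather("dataset.feather")   # typically much faster than re-parsing ARFF
```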

We also experienced cases where the file won't unpickle correctly (see #780).
The 10GB RAM usage is because of the ARFF parser, right? Or do you also see that much memory usage when loading the pickle?


amueller commented Dec 3, 2019

Could be the ARFF parser.


amueller commented Dec 3, 2019

OK, so here are some numbers for CIFAR-10:

arff: 0.63 G

[screenshot with the file-size numbers]

So the first problem is that the class column (an integer in the original data) is given a category dtype, which is then stored as an object dtype, making the DataFrame harder to store efficiently.
If we avoid that, we can get substantial gains from joblib compression and get down to about half the size of the ARFF file. But the real win is storing the pixels as uint8, as in the original data, which makes compression largely unnecessary and keeps the cache both fast and small.
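
Roughly what I mean (illustrative sketch only, not the actual caching code; the column layout is made up):

```python
# Illustrative: cast the pixel columns of an image dataset back to uint8
# before caching, instead of keeping them as float64/object.
import pickle

import numpy as np

pixel_cols = [c for c in df.columns if c != "class"]   # hypothetical layout
df[pixel_cols] = df[pixel_cols].astype(np.uint8)

with open("dataset.pkl.py3", "wb") as f:
    # uint8 data pickles to roughly its in-memory size, so further
    # compression buys comparatively little here.
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)
```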


amueller commented Dec 3, 2019

So it looks like this was broken in #548.


amueller commented Dec 3, 2019

Actually, the object dtype for the classes is not the issue; we can probably get all of the gain just by storing the data as uint8.


amueller commented Dec 3, 2019

Using joblib with compress=3 takes about 3x as long to store as the standard dump; not sure whether that's worth it. For MNIST (which is very redundant), the file size goes from 53M (uint8) to 30M (compressed uint8); the current develop branch produces 140M.
Loading is only a little slower with joblib than with pickle.
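
Something like this reproduces the comparison (sketch; `df` is the cached uint8 MNIST DataFrame, and exact numbers depend on the machine):

```python
# Time joblib dumps with and without compression.
import time

import joblib

for compress in (0, 3):
    t0 = time.perf_counter()
    joblib.dump(df, f"mnist_compress{compress}.joblib", compress=compress)
    print(f"compress={compress}: {time.perf_counter() - t0:.1f}s")
```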


mfeurer commented Dec 4, 2019

Potentially related: loading the dataset takes about 10GB of RAM

This could be fixed by loading the data with a generator, as is done in scikit-learn.
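
Sketch of what that could look like with liac-arff's generator return type (assuming liac-arff >= 2.4; the shape and dtype below are just illustrative):

```python
# Fill a preallocated array row by row instead of first materialising the
# whole ARFF file as a Python list of lists.
import arff  # liac-arff
import numpy as np

n_rows, n_cols = 60_000, 3_073  # e.g. CIFAR-10: 3072 pixel columns + class
X = np.empty((n_rows, n_cols), dtype=np.uint8)

with open("dataset.arff") as f:
    # encode_nominal=True turns the nominal class into an integer index,
    # so every row is numeric; DENSE_GEN yields rows lazily.
    decoded = arff.load(f, encode_nominal=True, return_type=arff.DENSE_GEN)
    for i, row in enumerate(decoded["data"]):
        X[i] = row
```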

Using joblib with compress=3 takes about 3x as long to store as the standard dump; not sure whether that's worth it.

I think so; the number of datasets on OpenML will only grow.


mfeurer commented Oct 29, 2020

Hey @amueller, we just merged storing image data as uint8. Do you think it's still worth pursuing compression of the cached files? In my opinion it's more important to load fast than to save space, but others might have a different opinion on that.
