pickle for SVHN bigger than arff file #891

Closed
amueller opened this issue Dec 2, 2019 · 9 comments · Fixed by #983
Comments


amueller commented Dec 2, 2019

For dataset 41081, I get:

1.1G	dataset.arff
2.3G	dataset.pkl.py3

Maybe we should be using joblib instead? My hard drive is getting full because I'm trying to run CC-18 stuff, which is a bit annoying.
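
For reference, a minimal sketch of what a joblib-based cache could look like (not the current openml-python code; `df` just stands in for the cached dataset DataFrame):

```python
# Sketch only: compare a plain pickle cache with a joblib dump.
# `df` is a placeholder for the dataset's pandas DataFrame.
import pickle

import joblib

# Roughly what the cache does today: plain pickle of the DataFrame.
with open("dataset.pkl.py3", "wb") as f:
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

# joblib alternative; compress=3 trades dump time for a smaller file.
joblib.dump(df, "dataset.joblib", compress=3)
df_cached = joblib.load("dataset.joblib")
```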


amueller commented Dec 2, 2019

Potentially related: loading the dataset takes about 10GB of RAM, which makes running benchmarks on CC-18 trickier.

joaquinvanschoren commented

+1 for using joblib. Although Arrow/Feather actually seems even better, and will definitely load large datasets a lot faster once they are cached.
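
For what it's worth, a Feather cache would be roughly this (sketch only, assuming the data is already in a pandas DataFrame `df` and pyarrow is installed):

```python
# Sketch of a Feather-based cache; `df` is a placeholder DataFrame.
import pandas as pd

df.to_feather("dataset.feather")                 # columnar, pyarrow-backed
df_cached = pd.read_feather("dataset.feather")   # typically much faster than re-parsing ARFF
```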

We also experienced cases where the file won't unpickle correctly (see #780).
The 10GB RAM usage is because of the ARFF parser, right? Or do you also see that much memory usage when loading the pickle?


amueller commented Dec 3, 2019

Could be the ARFF parser.


amueller commented Dec 3, 2019

OK, so here are some numbers for CIFAR-10:

arff: 0.63 G

[screenshot with the file-size numbers]

So the first problem is that the class column (an integer in the original data) is given a category dtype, which is then stored as an object dtype, making the DataFrame harder to store efficiently.
If we avoid that, we can get substantial gains from joblib compression and get down to about half the size of the ARFF file. But the real win is storing the pixels as uint8, as in the original data, which makes compression largely unnecessary and keeps the cache both fast and small.
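
Roughly what I mean (illustrative sketch only, not the actual caching code; the column layout is made up):

```python
# Illustrative: cast the pixel columns of an image dataset back to uint8
# before caching, instead of keeping them as float64/object.
import pickle

import numpy as np

pixel_cols = [c for c in df.columns if c != "class"]   # hypothetical layout
df[pixel_cols] = df[pixel_cols].astype(np.uint8)

with open("dataset.pkl.py3", "wb") as f:
    # uint8 data pickles to roughly its in-memory size, so further
    # compression buys comparatively little here.
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)
```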


amueller commented Dec 3, 2019

So it looks like this was broken in #548.


amueller commented Dec 3, 2019

Actually, the object dtype for the classes is not the issue; we can probably get all of the gain just by storing the data as uint8.


amueller commented Dec 3, 2019

Using joblib with compress=3 takes about 3x as long to store as the standard dump; not sure whether that's worth it. For MNIST (which is very redundant), the file size goes from 53M (uint8) to 30M (compressed uint8); the current develop branch produces 140M.
Loading is only a little slower with joblib than with pickle.
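
Something like this reproduces the comparison (sketch; `df` is the cached uint8 MNIST DataFrame, and exact numbers depend on the machine):

```python
# Time joblib dumps with and without compression.
import time

import joblib

for compress in (0, 3):
    t0 = time.perf_counter()
    joblib.dump(df, f"mnist_compress{compress}.joblib", compress=compress)
    print(f"compress={compress}: {time.perf_counter() - t0:.1f}s")
```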


mfeurer commented Dec 4, 2019

Potentially related: loading the dataset takes about 10GB of RAM

This could be fixed by loading the data with a generator, as is done in scikit-learn.
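
Sketch of what that could look like with liac-arff's generator return type (assuming liac-arff >= 2.4; the shape and dtype below are just illustrative):

```python
# Fill a preallocated array row by row instead of first materialising the
# whole ARFF file as a Python list of lists.
import arff  # liac-arff
import numpy as np

n_rows, n_cols = 60_000, 3_073  # e.g. CIFAR-10: 3072 pixel columns + class
X = np.empty((n_rows, n_cols), dtype=np.uint8)

with open("dataset.arff") as f:
    # encode_nominal=True turns the nominal class into an integer index,
    # so every row is numeric; DENSE_GEN yields rows lazily.
    decoded = arff.load(f, encode_nominal=True, return_type=arff.DENSE_GEN)
    for i, row in enumerate(decoded["data"]):
        X[i] = row
```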

Using joblib with compress=3 takes about 3x as long to store as the standard dump; not sure whether that's worth it.

I think so; the number of datasets on OpenML will only grow.


mfeurer commented Oct 29, 2020

Hey @amueller, we just merged storing image data as uint8. Do you think it's still worth pursuing compression of the cached files? In my opinion it's more important to load fast than to save space, but others might have a different opinion on that.
