pickle for SVHN bigger than arff file #891
Potentially related: loading the dataset takes around 10 GB of RAM, which makes running benchmarks on CC-18 trickier.
+1 for using joblib. Although Arrow/Feather actually seems even better, and will definitely load large datasets a lot faster once they are cached. We have also seen cases where the file won't unpickle correctly (see #780).
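For illustration, a minimal sketch of what a Feather-based cache could look like next to the current pickle approach (toy data and file names are illustrative, not the openml-python API):

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for a cached OpenML dataset (illustrative only).
df = pd.DataFrame(
    np.random.randint(0, 256, size=(1000, 32), dtype=np.uint8),
    columns=[f"px{i}" for i in range(32)],
)

# Pickle-based cache (the current approach under discussion).
df.to_pickle("dataset.pkl")

# Feather-based cache: requires pyarrow, but typically reads back
# much faster for large frames.
df.to_feather("dataset.feather")
df_cached = pd.read_feather("dataset.feather")
```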
Could be the arff parser.
OK, here are some numbers for CIFAR10: arff: 0.63 G. So the first problem is that the class column (which is an integer in the original) is given a dtype of category, which is then stored as an object dtype, making the dataframe harder to store.
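A quick way to see the problem being described (a hypothetical check with toy data, not code from the library):

```python
import numpy as np
import pandas as pd

# Toy target column imitating the CIFAR10 class labels (integers 0-9 in the original).
df = pd.DataFrame({"class": np.random.randint(0, 10, size=1000)})

# Roughly what the arff path ends up doing: the labels become string categories.
df["class"] = df["class"].astype(str).astype("category")

print(df["class"].dtype)                 # category
print(df["class"].cat.categories.dtype)  # object, i.e. Python strings when stored
```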
So it looks like this was broken in #548.
Actually, the object dtype for the classes is not the issue; we can probably get all the gain just by storing the data as uint8.
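A sketch of the uint8 downcast being proposed, assuming toy pixel data (column names and file names are illustrative):

```python
import numpy as np
import pandas as pd

# Pixel columns parsed from arff typically arrive as float64.
df = pd.DataFrame(np.random.randint(0, 256, size=(1000, 64)).astype(np.float64))

# Downcasting to uint8 is lossless for pixel values in [0, 255]
# and shrinks in-memory and on-disk size by 8x versus float64.
df = df.astype(np.uint8)
df.to_pickle("dataset_uint8.pkl")
```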
Using joblib with compress=3 takes about 3x as long to store as the standard dump; I'm not sure that's worth it. For MNIST (which is very redundant), the file size goes from 53M (uint8) to 30M (compressed uint8). The current develop branch produces a 140M file.
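The trade-off being measured looks roughly like this (a sketch with a random stand-in array, so it won't reproduce the MNIST numbers; file names are illustrative):

```python
import numpy as np
import joblib

# MNIST-shaped uint8 array as a stand-in for the cached data.
data = np.random.randint(0, 256, size=(70_000, 784), dtype=np.uint8)

# Standard dump: fastest to write, largest file.
joblib.dump(data, "mnist.joblib")

# compress=3: roughly 3x slower to write, but a smaller file
# (53M -> 30M for the real, highly redundant MNIST data above).
joblib.dump(data, "mnist_c3.joblib", compress=3)
```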
This could be fixed by using a generator to load the data, as is done in scikit-learn; see the sketch below.
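A rough sketch of the idea, not the scikit-learn or openml-python implementation: `iter_arff_rows` is a hypothetical helper, and it assumes a dense, comma-separated arff file.

```python
def iter_arff_rows(path, chunk_size=10_000):
    """Hypothetical sketch: yield data rows in chunks instead of
    materializing the whole dataset in memory at once."""
    chunk = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and the arff header.
            if not line or line.startswith(("%", "@")):
                continue
            chunk.append(line.split(","))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk
```

Processing the chunks one at a time keeps peak memory proportional to `chunk_size` rather than to the dataset size.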
I think so; the number of datasets on OpenML will only grow.
Hey @amueller, we just merged storing image data as uint8. Do you think it's worth pursuing compression of the files? In my opinion it is more important to load fast than to save space, but others might have a different opinion on that.
For dataset 41081, I get:
Maybe we should be using joblib instead? My hard drive is filling up because I'm trying to run the CC-18 stuff, which is a bit annoying.