
Should the pickle speed up be supported after Parquet integration? #1027

Open
PGijsbers opened this issue Feb 19, 2021 · 6 comments
Labels
Data OpenML concept

Comments

@PGijsbers
Collaborator

I am currently working on adding Parquet support to openml-python.
Parquet files will load a lot faster than arff files, so I was wondering if we should still aim to provide the extra speed-up by keeping the pickle files. I did a small speed test timing how long it takes to load the CC-18 from disk; the suite has 72 datasets varying from a few KB to 190 MB in size (parquet/pickle size).
Loading all datasets in the suite takes ~6 seconds for parquet, ~700 ms for pickle and ~7 minutes for arff.
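For reference, a rough sketch of such a timing loop (the cache path and file patterns below are just placeholders for wherever the parquet/pickle files live locally, not necessarily the layout openml-python uses; reading parquet requires pyarrow or fastparquet to be installed):

```python
import time
from pathlib import Path

import pandas as pd

CACHE = Path("~/.openml/cache").expanduser()  # assumed location of the cached files


def time_format(paths, loader):
    """Total seconds needed to load every file in `paths` with `loader`."""
    start = time.perf_counter()
    for path in paths:
        loader(path)
    return time.perf_counter() - start


parquet_files = sorted(CACHE.rglob("*.parquet"))
pickle_files = sorted(CACHE.rglob("*.pkl"))

print(f"parquet: {time_format(parquet_files, pd.read_parquet):.2f}s")
print(f"pickle:  {time_format(pickle_files, pd.read_pickle):.2f}s")
```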

We see that the difference between pickle and parquet is still big in relative terms, but in absolute numbers parquet still loads the entire suite of 72 datasets in ~6 seconds. I think it's worth evaluating whether or not we want to keep the pickle files.
The obvious drawback is slower loads, though the difference might not be noticeable in most cases.
Getting rid of the pickle files would have the following benefits:

@mfeurer

PGijsbers changed the title from "Should be maintain dataset pickles after Parquet support?" to "Should the pickle speed up be supported after Parquet integration?" on Feb 19, 2021
@mfeurer
Collaborator

mfeurer commented Feb 19, 2021

Hi, thanks for bringing this up. Off the top of my head I think we don't need to keep pickle files for datasets that are also available as Parquet (I'm not sure if we'll drop the arff immediately). However, as these are 72 vastly different datasets, I was wondering whether there are any assumptions in Parquet about the data structure that benefit certain datasets (at least for the format added by @sahithyaravi1493 there was quite some overhead for wide datasets). So I think before making a decision it would be good to look at the per-dataset difference. What do you think about that?
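A minimal sketch of the kind of per-dataset check this suggests, using synthetic long vs. wide frames rather than the actual CC-18 files (shapes are arbitrary illustrations; `to_parquet`/`read_parquet` need pyarrow or fastparquet):

```python
import time

import numpy as np
import pandas as pd


def read_seconds(path, write, read):
    """Write the file once, then time only the read."""
    write(path)
    start = time.perf_counter()
    read(path)
    return time.perf_counter() - start


# Same number of cells, very different shapes.
long_df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])
wide_df = pd.DataFrame(np.random.rand(100, 10_000), columns=[f"c{i}" for i in range(10_000)])

for name, df in [("long", long_df), ("wide", wide_df)]:
    parquet_s = read_seconds(f"{name}.parquet", df.to_parquet, pd.read_parquet)
    pickle_s = read_seconds(f"{name}.pkl", df.to_pickle, pd.read_pickle)
    print(f"{name}: parquet {parquet_s:.3f}s, pickle {pickle_s:.3f}s")
```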

@PGijsbers
Collaborator Author

I do think a full switch over to parquet is actually reasonable in the near future (as long as the server has the parquet files), though I don't want to do that in a single release cycle either. A bit more profiling based on dataset characteristics seems reasonable.

@mfeurer
Collaborator

mfeurer commented Feb 19, 2021

Thanks, that makes sense. Will you extend your notebook to contain such stats?

@mfeurer
Collaborator

mfeurer commented Feb 19, 2021

Another question: will we then drop liac-arff as a dependency in the near future? This would allow us to also stop kinda maintaining that package, and we should let the scikit-learn folks know about this.

@PGijsbers
Collaborator Author

Will you extend your notebook to contain such stats?

Yes, but I'll do that after the next release (first we'll have side-by-side support and keep pickling).

will we in the near future then drop liac-arff as a dependency?

As soon as all data is available in parquet format I would be in favor of a major code cleaning removing all arff logic (and thus also no longer require liac-arff).
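To make that concrete, a sketch of what dropping the arff path could look like at the call site; this is not the actual openml-python loader, just an illustration of the two code paths:

```python
import arff          # liac-arff, the dependency that could then be dropped
import pandas as pd


def load_via_arff(path):
    """arff-style loading: parse the file in pure Python, then build a frame."""
    with open(path) as fh:
        raw = arff.load(fh)
    columns = [name for name, _dtype in raw["attributes"]]
    return pd.DataFrame(raw["data"], columns=columns)


def load_via_parquet(path):
    """Parquet-only loading: one call, dtypes come from the file itself."""
    return pd.read_parquet(path)
```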

@mfeurer
Collaborator

mfeurer commented Feb 19, 2021

As soon as all data is available in parquet format I would be in favor of a major code cleaning removing all arff logic (and thus also no longer require liac-arff).

I fully agree on that and am looking forward to that!
