
Refactor data loading/storing #1018

Merged
merged 6 commits into develop from refactor on Jan 19, 2021

Conversation

PGijsbers
Collaborator

@PGijsbers PGijsbers commented Jan 15, 2021

I stumbled on the dataset loading/caching code and found the flow very difficult to follow. There was also a lot of duplicated code. The main goal of this PR is to make the flow of loading and storing compressed data easier to read and to reduce the duplication. In the process I streamlined the flow a little, which should lead to some performance gains (less unnecessary data loading).

Differences in behavior:

  • When the OpenMLDataset is constructed, outdated pickle data is no longer proactively updated. Instead, the numpy pickle files are updated in _load_data.
  • When _load_data is called on a dataset which is not yet compressed (or needs to be updated), the data is loaded only once instead of twice.
  • When the OpenMLDataset is constructed, the arff file is no longer immediately converted to a compressed format. Instead, this happens the first time the data is required in _load_data.
  • OpenMLDataset's data_pickle_file, data_feather_file and feather_attribute_file members more accurately reflect whether the file exists. Previously, the paths for the unused cache format were also set, even though those files are never generated.
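The lazy load-and-cache flow described above (parse the raw file once on first use, store the compressed form, and serve every subsequent load from it) can be sketched generically. This is a minimal illustration, not the actual openml-python implementation; `load_data` and `parse_raw` are hypothetical names:

```python
import pickle
import tempfile
from pathlib import Path


def load_data(raw_file: Path, cache_file: Path, parse_raw):
    """Return data from the pickle cache if present; otherwise parse the
    raw file exactly once, write the cache, and return the parsed data."""
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    data = parse_raw(raw_file)
    with cache_file.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demonstration with a trivial "parser" that records how often it runs.
calls = []

def parse_raw(path):
    calls.append(path)
    return path.read_text().split(",")

with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data.arff"
    raw.write_text("a,b,c")
    cache = Path(tmp) / "data.pkl"
    first = load_data(raw, cache, parse_raw)
    second = load_data(raw, cache, parse_raw)

assert first == second == ["a", "b", "c"]
assert len(calls) == 1  # the expensive parse happened only once
```

The key property is that the expensive parse runs at most once per dataset, and only when the data is actually requested.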

All unit tests passed without modification, except test_get_dataset_cache_format_feather, which relied on the assumption that the data was compressed and stored on disk before the first load. I also updated the pickle test so that it actually tests loading from the compressed format; otherwise it would now be testing the arff path instead (the first load parses the arff and stores a pickle; subsequent loads read the pickle).
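The updated test follows a pattern that can be sketched generically: trigger the first load (which writes the cache), remove the raw file, and assert that a second load still succeeds, proving it was served from the compressed format. This is a hedged illustration with hypothetical names, not the actual test code:

```python
import pickle
import tempfile
from pathlib import Path


def load(raw: Path, cache: Path):
    """Toy loader: prefer the pickle cache, else parse the raw file and cache it."""
    if cache.exists():
        with cache.open("rb") as f:
            return pickle.load(f)
    data = raw.read_text().split(",")
    with cache.open("wb") as f:
        pickle.dump(data, f)
    return data


tmp = Path(tempfile.mkdtemp())
raw, cache = tmp / "d.arff", tmp / "d.pkl"
raw.write_text("x,y")

first = load(raw, cache)   # first load: parses raw file, writes the cache
raw.unlink()               # remove the raw file entirely
second = load(raw, cache)  # must come from the cache, or it would fail

assert first == second == ["x", "y"]
```

If the second load still went through the raw-file path, `read_text()` would raise FileNotFoundError, so a passing assertion demonstrates the cache is actually used.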

Commit messages:

  • There was a lot of code duplication, and the general flow of loading/storing the data in compressed format was hard to navigate.
  • Otherwise the data would actually be loaded from arff (first load).
  • My editor incorrectly renamed too many instances of 'data_file' to 'arff_file'.
Collaborator

@mfeurer mfeurer left a comment


Looks good to me, I only have one minor change request.

openml/datasets/dataset.py (review thread, resolved)
@codecov-io

Codecov Report

Merging #1018 (77ab46d) into develop (fba6aab) will increase coverage by 0.45%.
The diff coverage is 79.16%.


@@             Coverage Diff             @@
##           develop    #1018      +/-   ##
===========================================
+ Coverage    87.62%   88.07%   +0.45%     
===========================================
  Files           36       36              
  Lines         4574     4563      -11     
===========================================
+ Hits          4008     4019      +11     
+ Misses         566      544      -22     
Impacted Files               | Coverage Δ
openml/datasets/dataset.py   | 87.92% <78.26%> (+3.48%) ⬆️
openml/utils.py              | 91.33% <100.00%> (+0.66%) ⬆️
openml/_api_calls.py         | 89.23% <0.00%> (-3.08%) ⬇️
openml/runs/functions.py     | 83.16% <0.00%> (+0.25%) ⬆️
openml/testing.py            | 84.52% <0.00%> (+0.59%) ⬆️
openml/datasets/functions.py | 94.42% <0.00%> (+0.98%) ⬆️
openml/exceptions.py         | 96.77% <0.00%> (+9.67%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fba6aab...77ab46d. Read the comment docs.

@mfeurer mfeurer merged commit e074c14 into develop Jan 19, 2021
@mfeurer mfeurer deleted the refactor branch January 19, 2021 13:27