
Refactor data loading/storing #1018

Merged
merged 6 commits into develop from refactor on Jan 19, 2021

Conversation

PGijsbers
Collaborator

@PGijsbers PGijsbers commented Jan 15, 2021

I stumbled on the dataset loading/caching code and found the flow very difficult to follow. There was also a lot of duplicated code. The main goal of this PR is to make the flow of loading and storing compressed data easier to read and to reduce the duplication. In the process I streamlined the flow a little, which should lead to some performance gains (less unnecessary data loading).

Differences in behavior:

  • When the OpenMLDataset is constructed, outdated pickle data is no longer proactively updated. Instead, the numpy pickle files are updated in _load_data.
  • When _load_data is called on a dataset which is not yet compressed (or needs to be updated), the data is loaded only once instead of twice.
  • When the OpenMLDataset is constructed, the arff file is no longer immediately converted to a compressed format. Instead, this happens the first time the data is required in _load_data.
  • OpenMLDataset's data_pickle_file, data_feather_file and feather_attribute_file members more accurately reflect whether the file exists. Previously, the paths for the unused cache format were also set, even though those files are never generated.
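The lazy load-and-cache flow described above (parse the raw file once on first use, store the compressed form, and serve every subsequent load from it) can be sketched generically. This is a minimal illustration, not the actual openml-python implementation; `load_data` and `parse_raw` are hypothetical names:

```python
import pickle
import tempfile
from pathlib import Path


def load_data(raw_file: Path, cache_file: Path, parse_raw):
    """Return data from the pickle cache if present; otherwise parse the
    raw file exactly once, write the cache, and return the parsed data."""
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    data = parse_raw(raw_file)
    with cache_file.open("wb") as f:
        pickle.dump(data, f)
    return data


# Demonstration with a trivial "parser" that records how often it runs.
calls = []

def parse_raw(path):
    calls.append(path)
    return path.read_text().split(",")

with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data.arff"
    raw.write_text("a,b,c")
    cache = Path(tmp) / "data.pkl"
    first = load_data(raw, cache, parse_raw)
    second = load_data(raw, cache, parse_raw)

assert first == second == ["a", "b", "c"]
assert len(calls) == 1  # the expensive parse happened only once
```

The key property is that the expensive parse runs at most once per dataset, and only when the data is actually requested.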

All unit tests passed without modification, except test_get_dataset_cache_format_feather, which relied on the assumption that the data was compressed and stored on disk before the first load. I also updated the pickle test so that it actually tests loading from the compressed format; otherwise it would now be testing the arff path instead (the first load parses the arff and stores a pickle; subsequent loads read the pickle).
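The updated test follows a pattern that can be sketched generically: trigger the first load (which writes the cache), remove the raw file, and assert that a second load still succeeds, proving it was served from the compressed format. This is a hedged illustration with hypothetical names, not the actual test code:

```python
import pickle
import tempfile
from pathlib import Path


def load(raw: Path, cache: Path):
    """Toy loader: prefer the pickle cache, else parse the raw file and cache it."""
    if cache.exists():
        with cache.open("rb") as f:
            return pickle.load(f)
    data = raw.read_text().split(",")
    with cache.open("wb") as f:
        pickle.dump(data, f)
    return data


tmp = Path(tempfile.mkdtemp())
raw, cache = tmp / "d.arff", tmp / "d.pkl"
raw.write_text("x,y")

first = load(raw, cache)   # first load: parses raw file, writes the cache
raw.unlink()               # remove the raw file entirely
second = load(raw, cache)  # must come from the cache, or it would fail

assert first == second == ["x", "y"]
```

If the second load still went through the raw-file path, `read_text()` would raise FileNotFoundError, so a passing assertion demonstrates the cache is actually used.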

Commit messages:

  • There was a lot of code duplication, and the general flow of loading/storing the data in compressed format was hard to navigate.
  • Otherwise the data would actually be loaded from arff (first load).
  • My editor incorrectly renamed too many instances of 'data_file' to 'arff_file'.
Collaborator

@mfeurer mfeurer left a comment


Looks good to me, I only have one minor change request.

openml/datasets/dataset.py (review thread, resolved)
@codecov-io

Codecov Report

Merging #1018 (77ab46d) into develop (fba6aab) will increase coverage by 0.45%.
The diff coverage is 79.16%.


@@             Coverage Diff             @@
##           develop    #1018      +/-   ##
===========================================
+ Coverage    87.62%   88.07%   +0.45%     
===========================================
  Files           36       36              
  Lines         4574     4563      -11     
===========================================
+ Hits          4008     4019      +11     
+ Misses         566      544      -22     
Impacted Files               | Coverage Δ
openml/datasets/dataset.py   | 87.92% <78.26%> (+3.48%) ⬆️
openml/utils.py              | 91.33% <100.00%> (+0.66%) ⬆️
openml/_api_calls.py         | 89.23% <0.00%> (-3.08%) ⬇️
openml/runs/functions.py     | 83.16% <0.00%> (+0.25%) ⬆️
openml/testing.py            | 84.52% <0.00%> (+0.59%) ⬆️
openml/datasets/functions.py | 94.42% <0.00%> (+0.98%) ⬆️
openml/exceptions.py         | 96.77% <0.00%> (+9.67%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fba6aab...77ab46d. Read the comment docs.

@mfeurer mfeurer merged commit e074c14 into develop Jan 19, 2021
@mfeurer mfeurer deleted the refactor branch January 19, 2021 13:27