Refactor data loading/storing #1018
Conversation
There was a lot of code duplication, and the general flow of loading/storing the data in compressed format was hard to navigate.
Otherwise the data would actually be loaded from arff (first load).
My editor incorrectly renamed too many instances of 'data_file' to 'arff_file'.
Looks good to me; I only have one minor change request.
Codecov Report
@@            Coverage Diff             @@
##           develop    #1018      +/-   ##
===========================================
+ Coverage    87.62%   88.07%   +0.45%
===========================================
  Files           36       36
  Lines         4574     4563      -11
===========================================
+ Hits          4008     4019      +11
+ Misses         566      544      -22
Continue to review full report at Codecov.
I stumbled on the dataset loading/caching code and found the flow very difficult to parse. Additionally, there was a lot of duplicate code. The main goal of this PR is to make the flow of loading and storing compressed data easier to read and to reduce the amount of duplicate code. In the process I streamlined the flow a little, which should lead to some performance gains (less redundant data loading).
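To make the intended flow concrete, here is a minimal sketch, assuming a simplified dataset class: parse the arff file once on the first load, write a single cache file in the chosen format, and serve subsequent loads from that cache. Apart from _load_data, data_pickle_file and data_feather_file, which the PR mentions, the class name, helper names and cache-path convention below are illustrative assumptions, not the actual openml-python implementation.

```python
import os
import pickle

import pandas as pd
from scipy.io import arff


class _DatasetSketch:
    """Illustrative stand-in for the dataset class; not the real OpenMLDataset."""

    def __init__(self, arff_path, cache_format="pickle"):
        self.arff_path = arff_path
        self.cache_format = cache_format
        # Only the member for the cache format actually in use gets a value.
        self.data_pickle_file = None
        self.data_feather_file = None

    def _cache_path(self):
        base, _ = os.path.splitext(self.arff_path)
        return base + (".feather" if self.cache_format == "feather" else ".pkl")

    def _load_data(self):
        cache = self._cache_path()
        if os.path.exists(cache):
            # Subsequent loads: read the compressed cache directly.
            if self.cache_format == "feather":
                return pd.read_feather(cache)
            with open(cache, "rb") as fh:
                return pickle.load(fh)
        # First load: parse the arff file exactly once, then store the cache.
        records, _meta = arff.loadarff(self.arff_path)
        data = pd.DataFrame(records)
        if self.cache_format == "feather":
            data.to_feather(cache)
            self.data_feather_file = cache
        else:
            with open(cache, "wb") as fh:
                pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
            self.data_pickle_file = cache
        return data
```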
Differences in behavior:
- When _load_data is called on a dataset which is not yet compressed (or needs to be updated), it is loaded only once instead of twice.
- The data_pickle_file, data_feather_file and feather_attribute_file members more accurately reflect the presence of the file. Previously the cache format files for the format that was not used would also be set, while they are never generated.

All unit tests passed without modification, except for test_get_dataset_cache_format_feather, which relied on the assumption that the data was compressed and stored on disk before the first load. However, I also updated the pickle test which "tests" if the file can be loaded from the compressed format, otherwise it would now test if they are loaded from arff (the first load is from arff but stores to pickle, subsequent loads are from pickle).
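A hedged sketch of the kind of check the adjusted tests express is below, reusing the _DatasetSketch class from the earlier snippet. The real test_get_dataset_cache_format_feather lives in the openml-python test suite and exercises the feather format on the actual dataset class; this simplified pytest-style version uses pickle and only illustrates the expectation that the first load parses arff and writes the cache, while subsequent loads come from the cache.

```python
import os


def test_second_load_uses_compressed_cache(tmp_path):
    arff_path = tmp_path / "iris.arff"
    arff_path.write_text(
        "@relation iris\n"
        "@attribute sepallength numeric\n"
        "@attribute class {a,b}\n"
        "@data\n"
        "5.1,a\n"
        "4.9,b\n"
    )
    dataset = _DatasetSketch(str(arff_path), cache_format="pickle")

    # First load parses the arff file and writes the pickle cache.
    first = dataset._load_data()
    assert dataset.data_pickle_file is not None
    assert os.path.exists(dataset.data_pickle_file)
    # The feather member stays unset because that format is not in use.
    assert dataset.data_feather_file is None

    # Second load is served from the pickle cache, not from arff.
    second = dataset._load_data()
    assert first.equals(second)
```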