Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified logic for excel extraction #566

Merged
merged 15 commits into from
Mar 18, 2020

Conversation

rousik
Copy link
Collaborator

@rousik rousik commented Mar 16, 2020

This introduces new cleaned up excel extractor (in extract/excel.py) and refactors extract/eia860.py and extract/eia923.py to use this common code instead of consisting largely of slightly-modified boilerplate code peppered with custom business logic.

This refactoring mostly keeps the external API as is. There are few points where I think the interaction with the other code is sub-optimal and might benefit from further refactoring/cleanup, specifically:

  1. helpers.py:verify_input_files() - this is a monolith method for checking existence of files, aggregates business logic across datasets and basically just runs datastore method and see if they throw an exception. eia860 and eia923 were pulled out and verify_files() method is now called for these at call-sites. I do not quite like this solution but I'm also not yet sure how to solve this elegantly.

  2. Pulling files from datastore is a place where API is also somewhat obscure. Right now, both datastore and extractor provide partial information about where the data is stored and this has to be combined. On top of this, Extractor needs to pass data_dir to the datastore. Ideally, we would want single module/component to be in charge of figuring out where the data lives (different modules may need different identifiers, e.g. page, year for spreadsheets but possibly something else for other inputs), which would clean up these interactions. I think this problem may be addressed with the planned refactoring of datastores.

Note that eia861 was not converted to use this new code and that the code still relies on the old-style non-transposed column_maps.

Jan Rous and others added 8 commits February 4, 2020 13:46
1. moved the logic into ExcelMetadata class
2. unified naming scheme, moved metadata under xlsx_maps/${dataset}
3. refactored eia860 and eia923 to use this metadata wrapper
1. Minor file/class naming cleanups.
2. Introduced excel.GenericExtractor with common logic
3. Implemented subclasses for EIA860 and EIA923
4. Moved drop_columns logic into process_final_page method of EIA923.
5. Removed old-style code from EIA860 and EIA923 extractors.

Outstanding issues:
1. File/year validation has been suppressed for EIA860 and EIA923.
2. Data consistency between old/new code style has not been verified yet
1. make the API a bit cleaner.
2. add more documentation.
3. fix cases that invoke validate_input_files from tests
@zaneselvans
Copy link
Member

Hey @rousik thanks so much for this contribution! We're working on the two issues you bring up, in the context of switching the datastore over to using data packages containing the raw data that are pulled down from Zenodo. @ptvirgo has been the one primarily working on this. Sorry for all the auto-comments on the PR -- they're from the Codacy integration.

It looks like the Travis CI builds failed in the linting phase -- do you have Tox set up in your environment? To run the suite of Travis tests locally, without downloading new data or re-building the FERC 1 DB unnecessarily, you can do:

tox -v -e travis -- --fast --pudl_in=AUTO --live_ferc1_db=AUTO

and there's some documentation on the other (linting, docs, etc.) tests that are run and how to get all that stuff set up locally over here on Read The Docs.

@codecov
Copy link

codecov bot commented Mar 17, 2020

Codecov Report

Merging #566 into master will decrease coverage by 0.64%.
The diff coverage is 74.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #566      +/-   ##
==========================================
- Coverage   78.55%   77.91%   -0.64%     
==========================================
  Files          31       33       +2     
  Lines        3720     3743      +23     
==========================================
- Hits         2922     2916       -6     
- Misses        798      827      +29     
Impacted Files Coverage Δ
src/pudl/constants.py 100.00% <ø> (ø)
src/pudl/convert/ferc1_to_sqlite.py 24.44% <ø> (ø)
src/pudl/extract/excel_test.py 0.00% <0.00%> (ø)
src/pudl/workspace/datastore.py 66.53% <33.33%> (+0.40%) ⬆️
src/pudl/extract/excel.py 91.07% <91.07%> (ø)
src/pudl/etl.py 83.22% <100.00%> (+0.06%) ⬆️
src/pudl/extract/eia860.py 100.00% <100.00%> (+3.17%) ⬆️
src/pudl/extract/eia923.py 100.00% <100.00%> (+7.46%) ⬆️
src/pudl/helpers.py 77.01% <100.00%> (+1.57%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72ef949...6970704. Read the comment docs.

@zaneselvans
Copy link
Member

Good progress toward standardizing the Excel extraction process for use across all the various EIA data. Thank you!

@zaneselvans zaneselvans merged commit eb05172 into catalyst-cooperative:master Mar 18, 2020
@zaneselvans zaneselvans added eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 labels Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants