Unified logic for excel extraction #566

rousik · 2020-03-16T23:00:14Z

This introduces new cleaned up excel extractor (in extract/excel.py) and refactors extract/eia860.py and extract/eia923.py to use this common code instead of consisting largely of slightly-modified boilerplate code peppered with custom business logic.

This refactoring mostly keeps the external API as is. There are few points where I think the interaction with the other code is sub-optimal and might benefit from further refactoring/cleanup, specifically:

helpers.py:verify_input_files() - this is a monolith method for checking existence of files, aggregates business logic across datasets and basically just runs datastore method and see if they throw an exception. eia860 and eia923 were pulled out and verify_files() method is now called for these at call-sites. I do not quite like this solution but I'm also not yet sure how to solve this elegantly.
Pulling files from datastore is a place where API is also somewhat obscure. Right now, both datastore and extractor provide partial information about where the data is stored and this has to be combined. On top of this, Extractor needs to pass data_dir to the datastore. Ideally, we would want single module/component to be in charge of figuring out where the data lives (different modules may need different identifiers, e.g. page, year for spreadsheets but possibly something else for other inputs), which would clean up these interactions. I think this problem may be addressed with the planned refactoring of datastores.

Note that eia861 was not converted to use this new code and that the code still relies on the old-style non-transposed column_maps.

1. moved the logic into ExcelMetadata class 2. unified naming scheme, moved metadata under xlsx_maps/${dataset} 3. refactored eia860 and eia923 to use this metadata wrapper

1. Minor file/class naming cleanups. 2. Introduced excel.GenericExtractor with common logic 3. Implemented subclasses for EIA860 and EIA923 4. Moved drop_columns logic into process_final_page method of EIA923. 5. Removed old-style code from EIA860 and EIA923 extractors. Outstanding issues: 1. File/year validation has been suppressed for EIA860 and EIA923. 2. Data consistency between old/new code style has not been verified yet

1. make the API a bit cleaner. 2. add more documentation. 3. fix cases that invoke validate_input_files from tests

zaneselvans · 2020-03-16T23:12:27Z

Hey @rousik thanks so much for this contribution! We're working on the two issues you bring up, in the context of switching the datastore over to using data packages containing the raw data that are pulled down from Zenodo. @ptvirgo has been the one primarily working on this. Sorry for all the auto-comments on the PR -- they're from the Codacy integration.

It looks like the Travis CI builds failed in the linting phase -- do you have Tox set up in your environment? To run the suite of Travis tests locally, without downloading new data or re-building the FERC 1 DB unnecessarily, you can do:

tox -v -e travis -- --fast --pudl_in=AUTO --live_ferc1_db=AUTO

and there's some documentation on the other (linting, docs, etc.) tests that are run and how to get all that stuff set up locally over here on Read The Docs.

codecov · 2020-03-17T01:59:29Z

Codecov Report

Merging #566 into master will decrease coverage by 0.64%.
The diff coverage is 74.87%.

@@            Coverage Diff             @@
##           master     #566      +/-   ##
==========================================
- Coverage   78.55%   77.91%   -0.64%     
==========================================
  Files          31       33       +2     
  Lines        3720     3743      +23     
==========================================
- Hits         2922     2916       -6     
- Misses        798      827      +29

Impacted Files	Coverage Δ
src/pudl/constants.py	`100.00% <ø> (ø)`
src/pudl/convert/ferc1_to_sqlite.py	`24.44% <ø> (ø)`
src/pudl/extract/excel_test.py	`0.00% <0.00%> (ø)`
src/pudl/workspace/datastore.py	`66.53% <33.33%> (+0.40%)`	⬆️
src/pudl/extract/excel.py	`91.07% <91.07%> (ø)`
src/pudl/etl.py	`83.22% <100.00%> (+0.06%)`	⬆️
src/pudl/extract/eia860.py	`100.00% <100.00%> (+3.17%)`	⬆️
src/pudl/extract/eia923.py	`100.00% <100.00%> (+7.46%)`	⬆️
src/pudl/helpers.py	`77.01% <100.00%> (+1.57%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72ef949...6970704. Read the comment docs.

zaneselvans · 2020-03-18T20:27:09Z

Good progress toward standardizing the Excel extraction process for use across all the various EIA data. Thank you!

Jan Rous and others added 8 commits February 4, 2020 13:46

Minor code cleanup. Use more pythonic constructs.

9a4662e

Merge remote-tracking branch 'upstream/master'

011a060

Refactored excel metadata loading logic.

36689e0

1. moved the logic into ExcelMetadata class 2. unified naming scheme, moved metadata under xlsx_maps/${dataset} 3. refactored eia860 and eia923 to use this metadata wrapper

Merge remote-tracking branch 'upstream/master'

276f58c

Cleanup unused eia860 and eia923 constants.

71a7d1c

Expose file validation method in excel.Extract

4dcf49d

Few final cleanups for the excel extraction logic.

a25833a

1. make the API a bit cleaner. 2. add more documentation. 3. fix cases that invoke validate_input_files from tests

rousik added 2 commits March 16, 2020 19:00

Fix some issues highlighted by tox and codacy.

0d82461

Fix some more errors.

2056baf

rousik added 5 commits March 17, 2020 17:44

Add unit tests for excel.py

7135f4e

Fix tox errors.

c68ec9c

Fix parameter order for eia923/eia860 processors.

2456244

Mark process methods of GenericExtractor as static.

d2af9e6

Mark test_read_excel_calls as static.

6970704

zaneselvans merged commit eb05172 into catalyst-cooperative:master Mar 18, 2020

zaneselvans added eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 labels Mar 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unified logic for excel extraction #566

Unified logic for excel extraction #566

rousik commented Mar 16, 2020

zaneselvans commented Mar 16, 2020

codecov bot commented Mar 17, 2020 •

edited

Loading

zaneselvans commented Mar 18, 2020

Unified logic for excel extraction #566

Unified logic for excel extraction #566

Conversation

rousik commented Mar 16, 2020

zaneselvans commented Mar 16, 2020

codecov bot commented Mar 17, 2020 • edited Loading

Codecov Report

zaneselvans commented Mar 18, 2020

codecov bot commented Mar 17, 2020 •

edited

Loading