-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unified logic for excel extraction #566
Conversation
1. moved the logic into ExcelMetadata class 2. unified naming scheme, moved metadata under xlsx_maps/${dataset} 3. refactored eia860 and eia923 to use this metadata wrapper
1. Minor file/class naming cleanups. 2. Introduced excel.GenericExtractor with common logic 3. Implemented subclasses for EIA860 and EIA923 4. Moved drop_columns logic into process_final_page method of EIA923. 5. Removed old-style code from EIA860 and EIA923 extractors. Outstanding issues: 1. File/year validation has been suppressed for EIA860 and EIA923. 2. Data consistency between old/new code style has not been verified yet
1. make the API a bit cleaner. 2. add more documentation. 3. fix cases that invoke validate_input_files from tests
Hey @rousik thanks so much for this contribution! We're working on the two issues you bring up, in the context of switching the datastore over to using data packages containing the raw data that are pulled down from Zenodo. @ptvirgo has been the one primarily working on this. Sorry for all the auto-comments on the PR -- they're from the Codacy integration. It looks like the Travis CI builds failed in the linting phase -- do you have Tox set up in your environment? To run the suite of Travis tests locally, without downloading new data or re-building the FERC 1 DB unnecessarily, you can do:
and there's some documentation on the other (linting, docs, etc.) tests that are run and how to get all that stuff set up locally over here on Read The Docs. |
Codecov Report
@@ Coverage Diff @@
## master #566 +/- ##
==========================================
- Coverage 78.55% 77.91% -0.64%
==========================================
Files 31 33 +2
Lines 3720 3743 +23
==========================================
- Hits 2922 2916 -6
- Misses 798 827 +29
Continue to review full report at Codecov.
|
Good progress toward standardizing the Excel extraction process for use across all the various EIA data. Thank you! |
This introduces new cleaned up excel extractor (in extract/excel.py) and refactors extract/eia860.py and extract/eia923.py to use this common code instead of consisting largely of slightly-modified boilerplate code peppered with custom business logic.
This refactoring mostly keeps the external API as is. There are few points where I think the interaction with the other code is sub-optimal and might benefit from further refactoring/cleanup, specifically:
helpers.py:verify_input_files() - this is a monolith method for checking existence of files, aggregates business logic across datasets and basically just runs datastore method and see if they throw an exception. eia860 and eia923 were pulled out and verify_files() method is now called for these at call-sites. I do not quite like this solution but I'm also not yet sure how to solve this elegantly.
Pulling files from datastore is a place where API is also somewhat obscure. Right now, both datastore and extractor provide partial information about where the data is stored and this has to be combined. On top of this, Extractor needs to pass data_dir to the datastore. Ideally, we would want single module/component to be in charge of figuring out where the data lives (different modules may need different identifiers, e.g. page, year for spreadsheets but possibly something else for other inputs), which would clean up these interactions. I think this problem may be addressed with the planned refactoring of datastores.
Note that eia861 was not converted to use this new code and that the code still relies on the old-style non-transposed column_maps.