Background
During discussions about integrating the SEC data into PUDL, we decided to focus on a minimal integration for the time being, but also to scope and plan future infrastructure development so that integrating projects like Mozilla is easier in the future. I think any major infrastructure development should also consider other pain points in our current system, as well as the capabilities we want PUDL to have in the future. Keeping all of this in mind will help us combine projects and funding sources when we actually do the work. I'm sure POSE outreach will influence our desires and priorities, but I wanted to start collecting our own thoughts on the state of PUDL and where we'd like it to be.
My thoughts
A lot of my thoughts are influenced by recent work on the FERC 2023 integration and on planning the Mozilla integration.
Pain points
Archiver
Zenodo dependence makes rapid development of new archivers really difficult. If you're working on a large dataset you'll quickly run into Zenodo size restrictions, the sandbox is often unusable for testing, and outside developers don't have credentials to create production archives.
Dependency issues also slow down archiver development. To add a new archiver, you first have to make a PR to PUDL to add metadata and get that merged, and only then can you actually create the archiver. If you get through this, start extracting the data, and realize you want to modify the structure of the archives, you have to go back to the archiver, make another PR, and wait until a production archive is published before you can continue working on the extraction.
FERC extraction
FERC extraction has dependency issues similar to the archiver's. When updating the extraction, we have to get through a PR review and publish a new version of the extractor before we can fully test it in PUDL.
Extraction is compute intensive but changes infrequently, and we only get new FERC data annually, so we waste a lot of time and resources repeatedly running an extraction that produces exactly the same outputs.
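One way to avoid re-running an unchanged extraction is to key its output on a fingerprint of the raw inputs plus the extractor version, and skip the work when that fingerprint has been seen before. The sketch below is only an illustration of that idea, not how PUDL or the FERC extractor works today; `fingerprint`, `extract_if_changed`, the JSON output format, and the local cache directory are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(input_paths, extractor_version):
    """Hash the raw input bytes together with the extractor version string."""
    digest = hashlib.sha256()
    digest.update(extractor_version.encode("utf-8"))
    # Sort so the fingerprint does not depend on the order paths are passed in.
    for path in sorted(Path(p) for p in input_paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def extract_if_changed(input_paths, extractor_version, cache_dir, extract):
    """Run `extract` only if no cached output exists for this fingerprint.

    `extract` is a hypothetical callable that takes the input paths and
    returns JSON-serializable data. Returns (output_path, ran_extraction).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = fingerprint(input_paths, extractor_version)
    output_path = cache_dir / f"{key}.json"
    if output_path.exists():
        # Same inputs and same extractor version: reuse the earlier output.
        return output_path, False
    result = extract(input_paths)
    output_path.write_text(json.dumps(result))
    return output_path, True
```

With something like this, the expensive step would only run when either the archived inputs or the extractor version change, which for annual FERC data should be rare. In practice the cache would probably need to live somewhere shared rather than on a local disk.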
ML projects
We currently have no clean way to cache pre-trained models or the outputs of compute-intensive models, which really limits the types of projects we can do in PUDL.
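As a rough illustration of what caching model outputs could look like, here is a minimal sketch of a decorator that stores an expensive function's return value on disk, keyed by the function name and its arguments. Everything here is hypothetical (`disk_cache` is not an existing PUDL or Dagster utility), and it assumes the arguments pickle deterministically, which holds for simple values like numbers and strings but not for everything.

```python
import functools
import hashlib
import pickle
from pathlib import Path


def disk_cache(cache_dir):
    """Memoize a function's return value on disk, keyed by name and arguments."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cache_dir.mkdir(parents=True, exist_ok=True)
            # Assumption: the arguments pickle to the same bytes on every call.
            key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(key_bytes).hexdigest()
            path = cache_dir / f"{func.__name__}-{key[:16]}.pkl"
            if path.exists():
                # Cache hit: load the stored result instead of recomputing.
                with path.open("rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with path.open("wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator
```

A training or inference function wrapped with `@disk_cache("some/cache/dir")` would then only run once per distinct set of arguments. Pickle files should only be loaded from trusted locations, and a real solution would also need cache invalidation when the model code itself changes.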
Code organization
Our current code organization doesn't do the best job of separating library code from specific implementations, which I think contributes to the large number of circular imports that have to be resolved any time I do major work on PUDL.
Labels: dagster (issues related to our use of the Dagster orchestrator), mozilla_sec_to_eia (Mozilla AI for EJ grant to link SEC utility ownership data to EIA operational data)