Collecting ideas for PUDL infrastructure roadmap #3728

zschira · 2024-07-22T17:00:46Z

zschira
Jul 22, 2024
Maintainer

Background

During discussions about integrating the SEC data into PUDL, we decided to focus on a minimal integration for the time being, but also to scope and plan future infrastructure development so that integrating projects like Mozilla is easier in the future. I think any major infrastructure development should also consider other pain points in our current system, as well as the capabilities we want PUDL to have in the future. Keeping all of this in mind will help us combine projects and funding sources when we actually do the work. I'm sure POSE outreach will influence our desires and priorities, but I wanted to start collecting our own thoughts on the state of PUDL and where we'd like it to be.

My thoughts

A lot of my thoughts are influenced by recent work on the FERC 2023 integration and on planning the Mozilla integration.

Pain points

Archiver

Zenodo dependence makes rapid development of new archivers really difficult. If you're working on a large dataset you'll quickly run into Zenodo size restrictions, the sandbox is often unusable for testing, and outside developers don't have credentials to create production archives.
Dependency issues also slow down archiver development. To add a new archiver, you first have to make a PR to PUDL to add metadata and get that merged before you can actually create the archiver. If you get through this, start extracting the data, and then realize you want to modify the structure of the archives, you have to go back to the archiver, make another PR, and wait until a production archive is published before you can continue working on the extraction.

FERC extraction

FERC extraction has dependency issues similar to the archiver's. When updating the extraction, we have to get through a PR review and publish a new version of the extractor before we can fully test it in PUDL.
Extraction is compute intensive but changes infrequently, and we only get new FERC data annually, so we waste a ton of time and resources repeatedly running an extraction that produces the exact same outputs.
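One possible way to avoid rerunning an unchanged extraction is to key its outputs on a fingerprint of the inputs plus the extractor version, and skip the work when a matching result already exists. The sketch below is only an illustration under assumptions: `fingerprint`, `extract_if_changed`, the JSON cache layout, and the `extract_fn` callback are all hypothetical names, not part of PUDL or the FERC extractor.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(input_paths, extractor_version):
    """Hash the extractor version plus the name and bytes of every input file."""
    digest = hashlib.sha256()
    digest.update(extractor_version.encode("utf-8"))
    for path in sorted(Path(p) for p in input_paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def extract_if_changed(input_paths, extractor_version, cache_dir, extract_fn):
    """Call extract_fn only when no cached result matches the fingerprint.

    Hypothetical helper: returns (result, ran), where ran is True if
    extract_fn was actually called. Results must be JSON-serializable.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{fingerprint(input_paths, extractor_version)}.json"
    if cached.exists():
        # Same inputs and same extractor version: reuse the stored output.
        return json.loads(cached.read_text()), False
    result = extract_fn(input_paths)
    cached.write_text(json.dumps(result))
    return result, True
```

With this shape, a new annual data release or a bumped extractor version changes the fingerprint and triggers a fresh run, while every other invocation reuses the stored output. Real extraction outputs are databases rather than small JSON documents, so the storage step would need to be swapped for something appropriate.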

ML projects

We currently have no clean way to cache pre-trained models and/or outputs from compute intensive models, which really limits the types of projects we can do in PUDL.

Code organization

Our current code organization doesn't do the best job of separating library code from specific implementations, which I think contributes to the large number of circular imports that have to be resolved any time I do major work on PUDL.
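To make the circular import problem easier to see, one option is to build a static import graph and search it for cycles. The sketch below is a minimal illustration using only the standard library `ast` module; `imported_modules` and `find_cycle` are hypothetical helpers, and the module names in the usage note are made up. It only considers absolute `import x` and `from x import y` statements, so relative imports and submodules pulled in via `from pkg import mod` are not tracked.

```python
import ast


def imported_modules(source, known_modules):
    """Return the known modules that `source` imports, found by static parsing."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.update(name for name in names if name in known_modules)
    return found


def find_cycle(graph):
    """Return one import cycle as a list of module names, or None if acyclic."""
    visiting, done = [], set()

    def visit(module):
        if module in done:
            return None
        if module in visiting:
            # We walked back onto the current path: that slice is the cycle.
            return visiting[visiting.index(module):] + [module]
        visiting.append(module)
        for dep in sorted(graph.get(module, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(module)
        return None

    for module in sorted(graph):
        cycle = visit(module)
        if cycle:
            return cycle
    return None
```

Given a mapping of module name to source text, the graph can be built with `{name: imported_modules(src, set(sources)) for name, src in sources.items()}` and passed to `find_cycle`. Running a check like this in CI could flag new cycles before they have to be untangled by hand, though it does not by itself fix the underlying mixing of library and implementation code.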
