The Public Utility Data Liberation Project (PUDL)

What is PUDL?

The PUDL Project is an open source data processing pipeline that makes US energy data easier to access and use programmatically.

Hundreds of gigabytes of valuable data are published by US government agencies, but it's often difficult to work with. PUDL takes the original spreadsheets, CSV files, and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation.

What data is available?

PUDL currently integrates data from:

Thanks to support from the Alfred P. Sloan Foundation Energy & Environment Program, from 2021 to 2023 we will be integrating the following data as well:

Who is PUDL for?

The project serves researchers, activists, journalists, policy makers, and small businesses that might not otherwise be able to afford this data from commercial sources, and that may not have the time or expertise to do all the data processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as possible: anyone from grassroots youth climate organizers working with Google Sheets to university researchers with access to scalable cloud computing resources, and everyone in between!

How do I access the data?

There are several ways to access PUDL outputs. For more details you'll want to check out the complete documentation, but here's a quick overview:

Datasette

We publish a lot of the data on https://data.catalyst.coop using a tool called Datasette, which lets us wrap our databases in a relatively friendly web interface. You can browse and query the data, make simple charts and maps, and download portions of the data as CSV or JSON files so you can work with it locally. For a quick introduction to what you can do with the Datasette interface, check out this 17-minute video.

This access mode is good for casual data explorers or anyone who just wants to grab a small subset of the data. It also lets you share links to a particular subset of the data and provides a REST API for querying the data from other applications.
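As a rough illustration of querying the data from another application, the sketch below builds a URL that asks Datasette to run a SQL query and return JSON or CSV. It relies on Datasette's general convention of appending `.json` or `.csv` to a database path and passing the query in a `sql` parameter. The database name `pudl` and the table name `some_table` are placeholders assumed for the example, not names confirmed by this README, and the helper `datasette_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the text above.
BASE_URL = "https://data.catalyst.coop"


def datasette_query_url(database: str, sql: str, fmt: str = "json") -> str:
    """Build a URL that asks a Datasette instance to run a read-only SQL query.

    `database` and the tables named in `sql` are assumptions supplied by the
    caller; check the Datasette web interface for the real names.
    """
    if fmt not in ("json", "csv"):
        raise ValueError("fmt must be 'json' or 'csv'")
    params = {"sql": sql}
    if fmt == "json":
        # Ask for a plain JSON array of row objects.
        params["_shape"] = "array"
    return f"{BASE_URL}/{database}.{fmt}?{urlencode(params)}"


# Hypothetical database and table names, for illustration only.
url = datasette_query_url("pudl", "select * from some_table limit 5")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Keeping URL construction separate from the request itself makes it easy to inspect and share the query link before running it.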

Docker + Jupyter

Want access to all the published data in bulk? If you're familiar with Python and Jupyter Notebooks and are willing to install Docker you can:

If you'd rather work with the PUDL SQLite Databases and Apache Parquet files directly, they are accessible within the same Zenodo archive.
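For readers who take the direct route, here is a minimal sketch of exploring a SQLite database with Python's standard `sqlite3` module. It uses a tiny in-memory stand-in database so that it runs anywhere; the table `demo_plants` and its columns are invented for the example and do not describe the real PUDL schema. To point it at a downloaded file, you would pass that file's path to `sqlite3.connect` instead of `":memory:"`.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def preview(conn: sqlite3.Connection, table: str, limit: int = 5):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    # Only allow table names that actually exist, since identifiers
    # cannot be passed as bound parameters.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    columns = [desc[0] for desc in cur.description]
    return columns, cur.fetchall()


# Stand-in database with an invented table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_plants (plant_id INTEGER, plant_name TEXT)")
conn.executemany(
    "INSERT INTO demo_plants VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")]
)

print(list_tables(conn))
print(preview(conn, "demo_plants"))
```

Listing the tables first is a cheap way to get oriented in an unfamiliar database before writing larger queries.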

The PUDL Examples repository has more detailed instructions on how to work with the Zenodo data archive and Docker image.

JupyterHub

Do you want to use Python and Jupyter Notebooks to access the data but aren't comfortable setting up Docker? We are working with 2i2c to host a JupyterHub that has the same software and data as the Docker container and Zenodo archive mentioned above, but running in the cloud.

Note: you'll only have 4-6 GB of RAM and 1 CPU to work with on the JupyterHub, so if you need more computing power you may need to set PUDL up on your own computer. Eventually we hope to offer scalable computing resources on the JupyterHub as well.

The PUDL Development Environment

If you're more familiar with the Python data science stack and are comfortable working with git, conda environments, and the Unix command line, then you can set up the whole PUDL Development Environment on your own computer. This will allow you to run the full data processing pipeline yourself, tweak the underlying source code, and (we hope!) make contributions back to the project.

This is by far the most involved way to access the data and isn't recommended for most users. You should check out the Development section of the main PUDL documentation for more details.

Nightly Data Builds

If you are less concerned with reproducibility and want the freshest possible data, we also upload the outputs of our nightly builds to public S3 storage buckets. This data is produced by the dev branch of PUDL and is updated most weekday mornings. It is also the data used to populate Datasette:

Contributing to PUDL

Find PUDL useful? Want to help make it better? There are lots of ways to help!

Licensing

In general, our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we've done.

Contact Us

  • For bug reports, feature requests, and other software or data issues, please make a GitHub Issue.
  • For more general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.
  • If you'd like to get occasional updates about the project, sign up for our email list.
  • Want to schedule a time to chat with us one-on-one about your PUDL use case or ideas for improvement, or to get some personalized support? Join us for Office Hours.
  • Follow us on Twitter: @CatalystCoop
  • More info on our website: https://catalyst.coop
  • For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
