One query we (@catalyst-cooperative) have attempted to write a couple of times for use in data validation tests is the calculation of a weighted quantile. For example, suppose we have a table describing fuel deliveries to power plants, with one column containing the average heat content per unit of fuel and another containing the total number of units of fuel delivered, where the size of individual deliveries can vary by orders of magnitude. To make sure that nothing has gone haywire in our data processing, we'd like to be able to quickly calculate, say, "How many MMBTUs does a ton of coal contain at the 95th percentile, across all the coal deliveries recorded in the table?"
In Python, we do it like this:
```python
import numpy as np
import pandas as pd


def weighted_quantile(data: pd.Series, weights: pd.Series, quantile: float) -> float:
    """Calculate the weighted quantile of a Series or DataFrame column.

    This function allows us to take two columns from a :class:`pandas.DataFrame`,
    one of which contains an observed value (data) like heat content per unit of
    fuel, and the other of which (weights) contains a quantity like the quantity
    of fuel delivered, which should be used to scale the importance of the
    observed value in an overall distribution, and calculate the values that the
    scaled distribution will have at various quantiles.

    Args:
        data: A series containing numeric data.
        weights: Weights to use in scaling the data. Must have the same length
            as data.
        quantile: A number between 0 and 1, representing the quantile at which
            we want to find the value of the weighted data.

    Returns:
        The value in the weighted data corresponding to the given quantile. If
        there are no values in the data, return :mod:`numpy.nan`.
    """
    if (quantile < 0) or (quantile > 1):
        raise ValueError("quantile must have a value between 0 and 1.")
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    df = (
        pd.DataFrame({"data": data, "weights": weights})
        .replace([np.inf, -np.inf], np.nan)
        .dropna()
        .sort_values(by="data")
    )
    Sn = df.weights.cumsum()  # noqa: N806
    # This conditional is necessary because sometimes new columns get added to
    # the EIA data, and so they won't show up in prior years.
    if len(Sn) > 0:
        Pn = (Sn - 0.5 * df.weights) / Sn.iloc[-1]  # noqa: N806
        return np.interp(quantile, Pn, df.data)
    return np.nan
```
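For illustration, a quick usage sketch with made-up numbers (the values below are hypothetical, not from our actual data):

```python
# Hypothetical example: heat content per unit of fuel, weighted by the number
# of units in each delivery. The two large deliveries dominate the distribution.
heat_content = pd.Series([8.8, 11.9, 24.5, 25.0, 26.1])
units_delivered = pd.Series([100, 200, 1_000, 30_000, 50_000])

# Returns roughly 25.7: the weighted median falls among the large deliveries,
# even though the unweighted median of the heat content values is 24.5.
weighted_quantile(heat_content, units_delivered, 0.5)
```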
This is useful because it lets us look at a weighted histogram of the variable, pick some guard-rails that are specific to the distribution, and save them to check at the end of our data processing. These checks typically look something like the following (a sketch of how they're applied appears after the list):
- the weighted median value of the distribution should lie between 4.5 and 6.2
- the weighted 95th percentile of the distribution should be less than 10.1
- the weighted 10th percentile of the distribution should be greater than 1.5
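A minimal sketch of such a check, using the illustrative bounds above (the Series names `mmbtu_per_unit` and `units` are hypothetical placeholders for columns of the table being validated):

```python
# Hypothetical guard-rail checks using the illustrative bounds listed above;
# mmbtu_per_unit and units stand in for columns from the real delivery table.
assert 4.5 <= weighted_quantile(mmbtu_per_unit, units, 0.50) <= 6.2
assert weighted_quantile(mmbtu_per_unit, units, 0.95) < 10.1
assert weighted_quantile(mmbtu_per_unit, units, 0.10) > 1.5
```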
It would also be useful to be able to apply this calculation to records grouped by fuel type or year, or for a number of different quantiles simultaneously (to avoid needing to repeat the sorting and cumulative sums), as sketched below.
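In pandas we can approximate the grouped version with `groupby().apply()`, though that repeats the sort and cumulative sum for every group and every quantile; a sketch, assuming a hypothetical `deliveries` frame with `fuel_type`, `heat_content`, and `units` columns:

```python
# Grouped weighted quantile, one group at a time (hypothetical column names).
p95_by_fuel = deliveries.groupby("fuel_type").apply(
    lambda g: weighted_quantile(g["heat_content"], g["units"], 0.95)
)

# np.interp accepts an array of query points, so a variant of
# weighted_quantile that returns several quantiles from a single
# sort/cumsum pass would only need:
#     np.interp([0.10, 0.50, 0.95], Pn, df.data)
```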