One query we (@catalyst-cooperative) have attempted to write a couple of times for use in data validation tests is the calculation of a weighted quantile. For example, suppose we have a table describing fuel deliveries to power plants, with one column containing the average heat content per unit of fuel and another containing the total number of units of fuel delivered, where the size of individual deliveries can vary by orders of magnitude. To make sure that nothing has gone haywire in our data processing, we'd like to be able to quickly calculate, say, "How many MMBTUs does a ton of coal contain at the 95th percentile, across all the coal deliveries recorded in the table?"
In Python, we do it like this:
```python
import numpy as np
import pandas as pd


def weighted_quantile(data: pd.Series, weights: pd.Series, quantile: float) -> float:
    """Calculate the weighted quantile of a Series or DataFrame column.

    This function allows us to take two columns from a :class:`pandas.DataFrame`,
    one of which contains an observed value (data) like heat content per unit of
    fuel, and the other of which (weights) contains a quantity like the quantity
    of fuel delivered, which should be used to scale the importance of the
    observed value in an overall distribution, and calculate the values that the
    scaled distribution will have at various quantiles.

    Args:
        data: A series containing numeric data.
        weights: Weights to use in scaling the data. Must have the same length
            as data.
        quantile: A number between 0 and 1, representing the quantile at which
            we want to find the value of the weighted data.

    Returns:
        The value in the weighted data corresponding to the given quantile. If
        there are no values in the data, return :mod:`numpy.nan`.
    """
    if (quantile < 0) or (quantile > 1):
        raise ValueError("quantile must have a value between 0 and 1.")
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    df = (
        pd.DataFrame({"data": data, "weights": weights})
        .replace([np.inf, -np.inf], np.nan)
        .dropna()
        .sort_values(by="data")
    )
    Sn = df.weights.cumsum()  # noqa: N806
    # This conditional is necessary because sometimes new columns get added to
    # the EIA data, and so they won't show up in prior years.
    if len(Sn) > 0:
        Pn = (Sn - 0.5 * df.weights) / Sn.iloc[-1]  # noqa: N806
        return np.interp(quantile, Pn, df.data)
    return np.nan
```
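For illustration, a quick usage sketch with made-up numbers (the values below are hypothetical, not from our actual data):

```python
# Hypothetical example: heat content per unit of fuel, weighted by the number
# of units in each delivery. The two large deliveries dominate the distribution.
heat_content = pd.Series([8.8, 11.9, 24.5, 25.0, 26.1])
units_delivered = pd.Series([100, 200, 1_000, 30_000, 50_000])

# Returns roughly 25.7: the weighted median falls among the large deliveries,
# even though the unweighted median of the heat content values is 24.5.
weighted_quantile(heat_content, units_delivered, 0.5)
```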
This is useful because it lets us look at a weighted histogram of the variable, pick some guard-rails that are specific to the distribution, and save them to check at the end of our data processing. These checks typically look something like the following (a sketch of how they're applied appears after the list):
- the weighted median value of the distribution should lie between 4.5 and 6.2
- the weighted 95th percentile of the distribution should be less than 10.1
- the weighted 10th percentile of the distribution should be greater than 1.5
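A minimal sketch of such a check, using the illustrative bounds above (the Series names `mmbtu_per_unit` and `units` are hypothetical placeholders for columns of the table being validated):

```python
# Hypothetical guard-rail checks using the illustrative bounds listed above;
# mmbtu_per_unit and units stand in for columns from the real delivery table.
assert 4.5 <= weighted_quantile(mmbtu_per_unit, units, 0.50) <= 6.2
assert weighted_quantile(mmbtu_per_unit, units, 0.95) < 10.1
assert weighted_quantile(mmbtu_per_unit, units, 0.10) > 1.5
```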
It would also be useful to be able to apply this calculation to records grouped by fuel type or year, or for a number of different quantiles simultaneously (to avoid needing to repeat the sorting and cumulative sums), as sketched below.
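In pandas we can approximate the grouped version with `groupby().apply()`, though that repeats the sort and cumulative sum for every group and every quantile; a sketch, assuming a hypothetical `deliveries` frame with `fuel_type`, `heat_content`, and `units` columns:

```python
# Grouped weighted quantile, one group at a time (hypothetical column names).
p95_by_fuel = deliveries.groupby("fuel_type").apply(
    lambda g: weighted_quantile(g["heat_content"], g["units"], 0.95)
)

# np.interp accepts an array of query points, so a variant of
# weighted_quantile that returns several quantiles from a single
# sort/cumsum pass would only need:
#     np.interp([0.10, 0.50, 0.95], Pn, df.data)
```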