Measures supported by OxonFair

OxonFair uses a wide range of measures to enforce and measures fairness and performance.

These measures can be passed to a FairPredictor by calling FairPredictor.fit(objective, constraint, value). This will optimize the measure objective subject to the requirement that the other measure constraint is greater or less than value, as required.

These measures can also be evaluated by passing to the evaluation functions evaluate, evaluate_groups, and evaluate_fairness as a dict of measures, where the keys of the dict are short-form names using when verbose=False and the values are measures.

This document lists the standard measures provided by the group_metrics library, which is imported as:

from oxonfair.utils import group_metrics as gm

Basic Structure

The majority of measures are defined as GroupMetrics or sub-objects of GroupMetrics.

A group measure is specified by a function that takes the number of True Positives, False Positives, False Negatives, and True Negatives and returns a score; A string specifying the name of the of the measure; and optionally a bool indicating if greater values are better than smaller ones. For example, accuracy is defined as:

accuracy = gm.GroupMetric(lambda TP, FP, FN, TN: (TP + TN) / (TP + FP + FN + TN), 'Accuracy')

For efficiency, our approach relies on broadcast semantics and all operations in the function must be applicable to numpy arrays.

Having defined a GroupMetric it can be called in two ways. Either:

accuracy(target_labels, predictions, groups)

Here target_labels and predictions are binary vectors corresponding to either the target ground-truth values, or the predictions made by a classifier, with 1 representing the positive label and 0 otherwise. Groups is simply a vector of values where each unique value is assumed to correspond to a distinct group.

The other way it can be called is by passing it a single 3D array of dimension 4 by number of groups by k, where k is the number of candidate classifiers that the measure should be computed over.

As a convenience, GroupMetrics automatically implements a range of functionality as sub-objects.

Having defined a metric as above, we have a range of different objects:

metric.diff reports the average absolute difference of the method between pairs of groups.
metric.average reports the average of the method taken over all groups.
metric.max_diff reports the maximum difference of the method between any pair of groups.
metric.max reports the maximum value for any group.
metric.min reports the minimum value for any group.
metric.overall reports the overall value for all groups combined, and is the same as calling metric directly
metric.ratio reports the average ratio over pairs of distinct groups, where the smallest value is divided by the largest
metric.per_group reports the value for every group.

These can be passed directly to fit, or to the evaluation functions we provide.

The vast majority of fairness metrics are implemented as a .diff of a standard performance measure, and by placing a .min after any measure such as recall or precision it is possible to add constraints that enforce that the precision or recall is above a particular value for every group. gm.

Dataset Measures

Name	Definition
`gm.count`	Total number of points in a dataset or group
`gm.pos_data_count`	Total number of positively labeled points in a dataset or group
`gm.neg_data_count`	Total number of negatively labeled points in a dataset or group
`gm.pos_data_rate`	Ratio of positively labeled points to size of the group
`gm.neg_data_rate`	Ratio of negatively labeled points to size of the group

Standard Prediction Measures

Name	Definition
`gm.pos_pred_rate`	Positive Prediction Rate: Ratio of the number of positively predicted points to the size of the group
`gm.neg_pred_rate`	Negative Prediction Rate: Ratio of the number of negatively predicted points to the size of the group
`gm.true_pos_rate`	True Positive Rate: Ratio of true positives divided by total positive predictions
`gm.true_neg_rate`	True Negative Rate: Ratio of true negatives divided by total negative predictions
`gm.false_pos_rate`	False Positive Rate: Ratio of False Positives divided by total negative prediction
`gm.false_neg_rate`	False Negative Rate: Ratio of False Negatives divided by total positive predictions
`gm.pos_pred_val`	Positive Predicted Value': Ratio of True Positives divided by the total number of points with positive label
`gm.neg_pred_val`	Negative Predicted Value': Ratio of True Negatives divided by the total number of points with a negative label

Core Performance Measures

Name	Definition
`gm.accuracy`	Proportion of points correctly identified
`gm.balanced_accuracy`	The average of the proportion of points with a positive label correctly identified and the proportion of points with a negative label correctly identified
`gm.min_accuracy`	The minimum of the proportion of points with a positive label correctly identified and the proportion of points with a negative label correctly identified (common in min-max fairness)
`gm.f1`	F1 Score. Defined as: (2 * TP) / (2 * TP + FP + FN)
`gm.precision`	AKA Positive Prediction Rate
`gm.recall`	AKA True Positive Prediction Rate
`gm.mcc`	Matthews Correlation Coefficient. See https://en.wikipedia.org/wiki/Phi_coefficient

Additional Performance Measures

Name	Definition
`gm.acceptance_rate`	AKA precision AKA Positive Prediction Rate
`gm.cond_accept`	Conditional Acceptance Rate. The ratio of positive predictions to positive labels
`gm.cond_reject`	Conditional Rejectance Rate. The ratio of negative predictions to negative labels
`gm.specificity`	AKA True Negative Rate
`gm.rejection_rate`	AKA Negative Predicted Value
`gm.error_ratio`	The ratio of False Positives to False Negatives

Fairness Measures Supported

Sagemaker Clarify Measures

Name	Definition
`gm.class_imbalance`	Average difference between groups in Positive Data Rate
`gm.demographic_parity`	AKA Statistical Parity. Average difference between groups in Positive Prediction Rate
`gm.disparate_impact`	The smallest Positive Prediction Rate of any group divided by the largest
`gm.accuracy.diff`	Average difference between groups in Accuracy
`gm.recall.diff`	AKA Equal Opportunity. Average difference between groups in Recall
`gm.cond_accept.diff`	Average difference between groups in Conditional Acceptance Rate
`gm.acceptance_rate.diff`	Average difference between groups in Acceptance Rate
`gm.specificity.diff`	Average difference between groups in Specificity (or True Negative Rate)
`gm.cond_reject.diff`	Average difference between groups in Conditonal Rejectance Rate
`gm.rejection_rate.diff`	Average difference between groups in Rejection Rate (or Negative Predicted Value)
`gm.treatment_equality`	Average difference between groups in Error Ratio
`gm.gen_entropy`	This is the expected square of a particular utility function divided by its expected value, minus 1 and then divided by 2. The function takes the form: `TP1+FP2+FN*1`, where TP, FP, NP, and TN are the true positives, false positives, false negatives and true negatives respectively.

Measures from Verma and Rubin.

All the measures in Verma and Rubin are defined as strict equalities for two groups. We relax them into a continuous measure that reports the Average difference over any pair of groups between the left and right sides of the equality. These relaxations take value 0 only if the equalities are satisfied for all pairs of groups.

Name	Definition
`gm.statistical_parity`	AKA Demographic Parity. Average difference between groups in Positive Prediction Rate
`gm.predictive_parity`	AKA Rejection Rate Difference. Average difference between groups in Precision
`gm.false_pos_rate.diff`	AKA Specificity Difference. Average difference between groups in False Positive rate.
`gm.false_neg_rate.diff`	AKA Equal Opportunity or Recall difference. Average difference between groups in False Negative Rate
`gm.equalized_odds`	The average of `true_pos_rate.diff` and `false_neg_rate.diff`
`gm.cond_use_accuracy`	The average of `pos_pred_val.diff` and `neg_pred_val.diff`
`gm.predictive_equality`	Average difference in False Negative Rate
`gm.accuracy._parity`	Average difference in Accuracy
`gm.treatment_equality`	Average difference between groups in Error Ratio

Conditional Metrics

OxonFair also supports conditional metrics. These are used to compensate for acceptable biases present in the data. For example, in one famous case, Berkley showed a strong gender bias in admissions despite the fact that each department had minimal admissions bias with respect to gender. The cause underlying this was that women were disproportionately applying to departments with higher rejection rates.

To measure this correct for this bias we follow the method set out in chapter 2 of: Statistics by Freedman et al., which Wachter et al. applied to algorithmic fairness.

This measure compensates for the fact that different selection rates across groups may be driven by an acceptable factor that is correlated with the protected attributes. For example, in the Berkley case, it is acceptable that different departments should have different admissions rates, but the choice of department is correlated with gender.

This is also measured by Amazon Clarify and IBM360

However, no other fairness toolkit optimizes it. All of these measures are subtly different, but weight data in the same way. Freedman et al. considers the weighted proportion of people in a particular group receiving positive decisions vs. the total number of people in the group.

Wachter et al. examines the weighted proportion of [members of a protected group] within the set of all people receiving a positive decision; and the same weighted proportion of [members of the protected group] within the set of all people receiving a negative decision. If this proportion is larger for the positive set, than the negative set, the group is doing disproportionately well, and if it is smaller, the group is doing disproportionately badly.

Clarify and IBM360 measures the difference of the two measures in Wachter et al.

All methods are broadly equivalent in the sense that the difference between every pair of groups using Freedman's measure is zero if and only if the difference between positives and negatives measures of Wachter et al., for every group is zero.

For simplicity, we implement Freedman's measure. This give natural extensions to difference in conditional selection rate, corresponding to conditional demographic parity, and average ratio in conditional selection rate, corresponding to disparate impact. Moreover, the levelling-up measures such as minimal conditional selection rate will also work, which is not the case for the measure of Wachter et al.

We assign a weight $w_i$ to an individual $i$ belonging to a particular protected group, and conditioning factor e.g., school as:

$$ w_i = \frac{#\text{individuals with the same conditioning factor}}{#\text{individuals belonging to the same group and conditioning factor}} $$

The conditional positive decision rate is given by: $$ \frac {\text{wTP+ wFP}{wTP +wFP +wFN +wTN}$$ where wTP, wFP, wFN, wTN are the weighted sum of True Positives, False Positives using the weights $w_i$.

This can be used for levelling up, by enforcing minimum conditional selection rates, and enforcing conditional demographic parity.

The use of conditional metrics is somewhat more involved, as it requires the specification of a conditioning factor, alongside groups. Here is a quick example using a conditional minimal selection rate of 0.3.

import oxonfair
import xgboost
from oxonfair import group_metrics as gm
from oxonfair import conditional_group_metrics as cgm
train,val,test = oxonfair.dataset_loader.adult()
classifier = xgboost.XGBClassifier().fit(y=train['target'], X=train['data'])
fpred = oxonfair.FairPredictor(classifier, val, conditioning_factor='education-num')
fpred.fit(gm.accuracy, cgm.pos_pred_rate.min,0.3)
fpred.evaluate_groups(metrics=cgm.cond_measures)

We support conditioning on range of linear measures.

cgm.accuaracy conditional accuracy which is weighted in the same way;
cgm.positive_decision_rate conditional positive decision rate
cgm.positive_data_rate conditional positive data rate
cgm.false_neg_rate conditional false negative rate
cgm.false_pos_rate conditional false positive rate

For false negative and false positive rate, we normalize by the total number of negatively or positively labelled points rather than the total number of points.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

measures.md

measures.md

Measures supported by OxonFair

Basic Structure

Dataset Measures

Standard Prediction Measures

Core Performance Measures

Additional Performance Measures

Fairness Measures Supported

Conditional Metrics

Files

measures.md

Latest commit

History

measures.md

File metadata and controls

Measures supported by OxonFair

Basic Structure

Dataset Measures

Standard Prediction Measures

Core Performance Measures

Additional Performance Measures

Fairness Measures Supported

Conditional Metrics