audit-AI: How we use it and what it does

pymetrics inc.

16 January 2020

The goal of this document is to explain one way in which audit-AI can be used to de-bias a machine learning model. While this is far from the only way the code can be used, we hope it gives you a sense of the tool’s potential.

Background: pymetrics models

Pymetrics models are built for specific roles within specific companies. To achieve this customization, we collect data from top-performing incumbents in the target role. We then compare incumbents to a baseline sample drawn from the over 1 million candidates who have applied to jobs through pymetrics. We also establish a special data set, called the debias set, sampled from a pool of 150,000 individuals who have voluntarily provided basic demographic information such as sex, ethnicity, or age. From there, a wide variety of algorithms might be tested to create an initial machine learning model from the training data. The process itself is model agnostic: multiple algorithms are fit, and we are continuously testing new methods that might improve performance. The goal of the algorithm is to find the features that most accurately and reliably separate the incumbent set from the baseline set. We create hundreds of possible models with slightly different parameters to test and compare their performance before selecting the final model.
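As a rough illustration of this model-agnostic search (not pymetrics’ actual pipeline, whose data and parameters are proprietary), a few candidate scikit-learn classifiers could be fit on the incumbent-vs-baseline training data and compared by cross-validated AUC. Everything below — the data, the candidate models, and the scoring choice — is a placeholder assumption:

```python
# Illustrative sketch only: fit several candidate classifiers and keep the
# best-performing one by cross-validated AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# X: feature matrix; y: 1 = incumbent (top performer), 0 = baseline candidate.
# Random placeholder data stands in for the real training set.
X = np.random.rand(500, 20)
y = np.random.randint(0, 2, size=500)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
}

scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores, "-> selected:", best_name)
```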

Pre-deployment auditing

Audit-AI first comes into play when we check the initial model for bias, primarily focusing on disparities across racial and gender groups. Definitions of “fairness” vary across contexts, but in the realm of employment, a tool used to evaluate job candidates must recommend individuals from legally-protected groups at consistent rates, known as pass rates. Specifically, the Equal Employment Opportunity Commission’s (EEOC) Uniform Guidelines on Employee Selection Procedures mandate that the pass rate for any one group must be at least 80% of the pass rate of the highest-passing group (also known as “the 4/5ths rule”). For example: if 200 people apply for a job, 100 men and 100 women, and an assessment tool deems 50 of the men qualified for the role (a 50% pass rate), it must also deem at least 40 women qualified (a 40% pass rate, which is 80% of the men’s rate). The goal of this standard is to ensure that employment selection practices that appear neutral on the surface are not discreetly resulting in adverse impact.
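As a minimal sketch of the 4/5ths rule arithmetic, assuming a simple pandas table of group labels and pass/fail decisions (audit-AI ships its own checks for this; the code below only illustrates the calculation, reusing the 100-men/100-women example above):

```python
# Compute per-group pass rates and the adverse impact ratio against the
# highest-passing group, then apply the 80% (4/5ths) threshold.
import pandas as pd

decisions = pd.DataFrame({
    "group": ["men"] * 100 + ["women"] * 100,
    "passed": [1] * 50 + [0] * 50 + [1] * 40 + [0] * 60,
})

pass_rates = decisions.groupby("group")["passed"].mean()
impact_ratios = pass_rates / pass_rates.max()

print(pass_rates)       # men: 0.50, women: 0.40
print(impact_ratios)    # women: 0.80 -> exactly at the 4/5ths threshold
print("4/5ths rule satisfied:", bool((impact_ratios >= 0.8).all()))
```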

Pre-deployment de-biasing

In short, using audit-AI at this point streamlines the testing of models for compliance with the EEOC’s 4/5ths rule. More importantly, audit-AI provides visibility into how we can improve the fairness of our models without sacrificing predictive power. The package allows us to identify and adjust any traits that exhibit score discrepancies across demographic groups, perhaps due to Simpson’s Paradox or a sampling anomaly. From there, we can employ a feature selection process, such as recursive feature elimination or feature regularization on a criterion of fairness, to reduce the weighting of problematic features in the local population. This continues until we can no longer detect significant differences between legally-protected groups. Prior to deployment, the overall efficacy of the newly de-biased model is estimated using five-fold stratified cross-validation: in each fold, 80% of the training data is used to train the model and 20% is held out for testing, and results are averaged over the five folds so that all data both contributes to prediction and serves as a test set.
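A hedged sketch of this loop is shown below. It uses a simplified stand-in for the fairness-driven feature selection (dropping the feature with the largest cross-group mean gap, rather than full recursive feature elimination or regularization), then estimates the resulting model with five-fold stratified cross-validation. The data, the gap threshold, and the helper function are illustrative assumptions, not audit-AI’s API:

```python
# Iteratively drop the feature with the largest mean-score gap between the
# two demographic groups, then estimate model efficacy with stratified 5-fold CV.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))        # placeholder features
y = rng.integers(0, 2, size=600)      # 1 = incumbent, 0 = baseline
group = rng.integers(0, 2, size=600)  # 0/1 demographic label (debias set)

def largest_gap_feature(X_sub, group):
    """Return (index, gap) of the column whose group means differ most."""
    gaps = np.abs(X_sub[group == 0].mean(axis=0) - X_sub[group == 1].mean(axis=0))
    return int(gaps.argmax()), float(gaps.max())

keep = list(range(X.shape[1]))
while len(keep) > 1:
    idx, gap = largest_gap_feature(X[:, keep], group)
    if gap < 0.15:                    # illustrative fairness threshold
        break
    keep.pop(idx)                     # drop the most disparate feature

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(LogisticRegression(max_iter=1000), X[:, keep], y,
                      cv=cv, scoring="roc_auc").mean()
print(f"{len(keep)} features kept, cross-validated AUC = {auc:.3f}")
```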

Other standards for fairness

It is worth noting that, depending on the precise context, models may be tested for compliance with fairness standards beyond the EEOC’s 4/5ths rule. For example, with the advent of employment selection tools that rely on automated data analysis, some U.S. courts have begun evaluating hiring assessments through the lens of statistical significance. In other words, if there are disparities in the pass rates of demographic groups, what is the likelihood that this is merely due to chance (and therefore not due to embedded systematic discrimination)? Audit-AI is also able to streamline the process of iteratively testing models for such probabilities, reporting statistical significance calculated from z-tests, analysis of variance (ANOVA) tests, Chi-squared tests, and Fisher’s exact tests. The selection of the appropriate method to evaluate the presence of bias is typically a function of the available samples. A novel contribution of the package is the first-ever implementation of the Cochran-Mantel-Haenszel test, which the EEOC uses when testing for statistical adverse impact over time.
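For illustration only, the same kind of significance question can be posed directly with scipy’s contingency-table tests on a 2x2 pass/fail table; this is generic code, not audit-AI’s own API, and the counts are made up:

```python
# Chi-squared and Fisher's exact tests on a 2x2 group-by-outcome table.
from scipy.stats import chi2_contingency, fisher_exact

#                 passed  failed
table = [[50, 50],        # group A
         [40, 60]]        # group B

chi2, chi2_p, dof, _ = chi2_contingency(table)
odds_ratio, fisher_p = fisher_exact(table)

print(f"chi-squared p = {chi2_p:.3f}")     # suited to larger samples
print(f"Fisher exact p = {fisher_p:.3f}")  # preferred when cell counts are small
```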

Post-deployment validation

Once we deploy a model to predict the hireability of a group of new job applicants, audit-AI can again be used to validate whether the fairness of our approach holds up in the “real world.” As in the pre-deployment phase, the goal in the post-deployment phase is to test for consistency of pass rates across legally-protected demographic groups, whether defined by measures of practical or statistical significance.
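One way such a post-deployment check might look, combining the practical 4/5ths ratio with a two-proportion z-test from statsmodels; the applicant counts are invented for the sketch and do not reflect any real deployment:

```python
# Monitor live decisions: practical (impact ratio) and statistical (z-test) checks.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

passed = np.array([180, 130])    # recommended applicants per group
applied = np.array([400, 350])   # total applicants per group

rates = passed / applied
impact_ratio = rates.min() / rates.max()
zstat, pvalue = proportions_ztest(passed, applied)

print(f"impact ratio = {impact_ratio:.2f} (needs >= 0.80)")
print(f"two-proportion z-test p-value = {pvalue:.3f}")
```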

Some key takeaways

The above process of iteratively testing for bias is informed by a few important principles about AI in hiring:

  1. Bias checks should happen both in the initial process of building a model and after it has been deployed.
  2. Bias can exist against any sub-group within a population, but the specific groups that must be tested will depend on how the algorithm is being used.
  3. It is extremely important to have a large and diverse sample to robustly test for bias.
  4. The law provides standards for selection procedures. If an algorithm leads to a disparity between protected groups (gender, race, age 40+), then it may not be used as part of the selection procedure.
  5. Employers should not use an algorithm that is not rigorously tested for bias across these groups.