Racism and the load_boston dataset #91
Comments
Thank you for holding us accountable and for bringing this to our attention, so that we can rectify this mistake. We suggest the following actions:

The latter action highlights our belief that we can draw better insights from (flawed) data when looking at it from a causal perspective. This way I hope we can contribute to the fight for racial equity. Thank you for sharing your talk on this, and do let us know if we can engage you for further feedback or hear your thoughts on our proposed actions.
I appreciate the immediate response but I feel obliged to respond in detail to this comment.
Getting advice from experts is all fair and good. But in this situation I might urge one to be very careful. We would not want to suggest that a library can replace the actual human intelligence involved in interpreting a dataset. The problem in the current tutorial isn't that the causal techniques can't be applied in a meaningful way; rather, the technique seems to have distracted from the actual problem in the dataset. I appreciate work in fairness/ML tooling, but my genuine fear is that folks might use it as an excuse. It'd be a real shame if people treat the use of a technique as a checkbox so that they no longer need to think about their data.
The first instance of this was mentioned last year. Why did it take over a year to respond? This is the 3/5 rule put into practice: clear data sets that have been flagged as racist are still in use. McKinsey has a pretty bad reputation; carrying that over into data science is a poor choice.
Apologies for the late reply, I wanted to dedicate enough time to give you a proper response.
For transparency: we also use the Boston housing data set in the Distribution schema tutorial, albeit without the

Hello @kcoleman80,
Thanks so much, I have some ideas:

- Comment on the factor B = 1000(Bk − 0.63)², where Bk is the proportion of Black/African American residents per town according to the 1970s census, and note how widely this information has been used on Kaggle and in sklearn.
- Given how outdated it is, especially in this time of reckoning, retire it in favour of updated information from HUD.
- Discuss the redlining that occurred in Boston, to provide historical context.
- Provide a section about ethics in AI, data collection, and the real impacts we must face (e.g. issues with facial recognition for non-white faces, COMPAS, Clearview AI, etc.).

There are some pretty good open government data sets here:
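The B factor mentioned above can be written out directly. The function name below is illustrative (it appears in no library); the constants are the ones documented for the 1978 dataset. The sketch also shows why the transform is hard to undo: the parabola maps two different towns to the same value.

```python
# Sketch of the engineered "B" feature in the Boston housing data.
# Assumption: Bk is the proportion of Black residents per town, in [0, 1].

def b_feature(bk: float) -> float:
    """B = 1000 * (Bk - 0.63)^2, the transform used in the 1978 dataset."""
    return 1000 * (bk - 0.63) ** 2

# Two very different towns produce the same B, so the original
# proportion cannot be recovered from the published column alone:
print(round(b_feature(0.33), 2))  # → 90.0
print(round(b_feature(0.93), 2))  # → 90.0
```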
A buddy of mine came up with a quote a while ago:
Every model suffers from this, even causal ones. If there's something not quite right about the labels, then the model is going to pick this up, even with fairness techniques in the pipeline. This is even more true when you put a model into production.

Suppose we have an application for resume scoring. How are we going to guarantee that the predictions are not going to reinforce the bias that is in the data? You typically only hear back from the candidates that you actually hire. But what about the candidates that you didn't consider? There might be some good hires in there. Your data feed might suffer from very poor recall.

So how might we prevent bad lessons from being learned? By changing the model? Or by changing the feedback mechanism? You might achieve some good by fiddling around the model space, but to me it seems far more likely that you're going to understand the actual problem if you take all of that effort and assign it to the data instead. In the recruitment case, maybe the best causal thing you can do is to hire some employees you normally might not. That way, there's at least something of a signal coming in that might challenge a model's learned bias.

A clear example of this "take the effects of data seriously" phenomenon, someplace early in the docs, would be grand. I don't have a nice example of a dataset that would be well suited for your project, though. I do believe that a lesson along these lines might be both appropriate and interesting to folks reading about your tool. You might even experiment with how much you can steer the model in the right direction, as long as it's not dismissed that the learning depends on the data.
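The selective-feedback problem in the hiring example above can be illustrated with a toy simulation. Everything here is made up for illustration: candidates get a true quality score and a noisy, biased proxy score, and outcomes are only ever observed for the candidates the proxy ranks highly.

```python
import random

random.seed(0)

# Toy sketch of selection bias in a hiring feedback loop.
candidates = [random.random() for _ in range(1000)]               # true quality
biased_score = [q * random.uniform(0.5, 1.0) for q in candidates]  # noisy, biased proxy

# We "hire" (and therefore observe an outcome for) only the top 10%
# ranked by the biased proxy score.
threshold = sorted(biased_score, reverse=True)[100]
observed = [q for q, s in zip(candidates, biased_score) if s > threshold]

# The training data the next model sees is not representative:
print(f"mean quality, all candidates:   {sum(candidates) / len(candidates):.2f}")
print(f"mean quality, observed (hired): {sum(observed) / len(observed):.2f}")
# Good candidates outside the observed pool never generate a label,
# so the model's errors on them can never be corrected.
```

The gap between the two means is the "poor recall" in the data feed: no amount of model tuning closes it, because the missing labels never arrive.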
I feel like the data science field is getting exceedingly oversaturated, with so many courses available and a lack of professional oversight (versus, say, accounting, law, or engineering, where requirements for training and experience need to be met). If big, impactful decisions are going to be made on data, how do we keep it accountable? Do we get social scientists involved to review work done? Does the profession get certified and regulated? Context, at the very least, needs to be included in datasets.
Thanks everyone for your input! I will give it a read tomorrow when my mind is fresh. Hopefully I find time and can start working on the notebook update as well. |
Hello all,

Thank you again for creating awareness on the Boston housing dataset. The updated tutorial notebook got merged. What we have done:
For example, what we added for the Diabetes dataset is:

Dataset bias evaluation: As we're dealing with data about individuals, and the predictions of the model may have a profound impact on their lives, we should evaluate the dataset for the potential presence of bias.
When deploying the model in the context of healthcare, make sure it is equally performant in the subgroups with respect to sensitive attributes and their intersections. We recommend always assessing the bias and fairness risks at each step of the process (from problem understanding to data collection, processing, modelling, and deployment) when working on models to be deployed, to minimize undesired outcomes.

By adopting the 3 steps of fairness evaluation, we can hopefully identify problematic datasets before they are used. In the future, we could think of adding something like the data nutrition labels: https://datanutrition.org
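The per-subgroup performance check suggested above can be sketched in a few lines. The helper name and the toy labels below are hypothetical, not from any library:

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return classification accuracy per sensitive-attribute group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}

# Toy data: predictions for two groups "a" and "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(accuracy_by_group(y_true, y_pred, groups))  # → {'a': 0.75, 'b': 0.5}
# A large gap between groups is a signal to investigate further
# before deployment, ideally on intersections of attributes too.
```

Libraries such as fairlearn offer the same idea in more complete form, but even a check this small catches the "equally performant in the subgroups" requirement described above.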
Thank you so much Philip! This is an amazing example of learning and fixing! We need people like this in AI/Data Science/Software. I'm so happy, you've helped restore my faith in people today! 😊😊😊😊😊😊
Description
Your documentation lists a demo that uses the load_boston dataset to explain how to use the tool here. It also lists the variables in the dataset, so you can confirm the contents.
One of the variables used to predict a house price is skin color. That's incredibly problematic.
Given that this project is backed by a global consultancy firm, I'm especially worried. People will look at this example and copy it. Since documentation pages are typically seen as an authoritative source of truth, it's really dubious to use this dataset without even a mention of the controversy in the variables. The use of skin color to predict a house price is legitimately a bad practice, and the documentation currently makes no effort to acknowledge it. For a package that's all about causal inference, you'd expect it to acknowledge that you still need to understand the dataset that you feed in.
Note that this dataset is up for removal from scikit-learn because of the obvious controversy and it's also something that's been pointed out at many conferences. Here is one talk from me if you're interested.
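For context: that removal did happen — load_boston was deprecated and then removed from scikit-learn in version 1.2. A hedged sketch of a drop-in replacement for a regression demo is any dataset that still ships with the library, such as load_diabetes (the one the updated tutorial later moved to):

```python
# Assumes scikit-learn >= 0.23 is installed; load_diabetes is bundled
# with the library, so no download is needed and no "B"-style
# engineered demographic column is present.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
print(X.shape)  # → (442, 10)
```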