[Names Bias] Code for creating sets of names #4836
LGTM overall, just a few minor things to clarify.
import pandas as pd


RACES_ETHNICITIES = ['hispanic', 'white', 'black', 'api', 'aian', '2prace']
Are the spellings of the last two items here intentional?
Yeah, that's how Tzioumis spelled them - adding a comment for this
names_to_new_lists = {}
for name_list, names in orig_name_lists.items():
    for name in names:
        proc_name = name.replace('-', '')
It seems like this will concatenate hyphenated names. Why is that required?
It's to match the formatting of the baby-name lists - adding a comment for this
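For illustration, the hyphen handling discussed above could be pulled into a small helper. This is only a sketch: `normalize_name` is a hypothetical name that does not appear in the PR, and the sample name is made up for the example.

```python
def normalize_name(name):
    """Strip hyphens so a hyphenated name is written as one token.

    Hypothetical helper: it wraps the same str.replace call used in the
    diff above, so the processed name matches the un-hyphenated format
    that the baby-name lists are said to use.
    """
    return name.replace('-', '')


# Made-up example name, not taken from either source dataset.
print(normalize_name('Mary-Ann'))
```

Keeping the rule in one function would also give the explanatory comment a single place to live.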
projects/dialogue_bias/util.py
Outdated
    .sort_values('obs_of_this_ethnicity', ascending=False)
)
tzioumis_plurality_names = percent_plurality_names_df.iloc[
    :200
Not sure I follow, why 200? Can you please add a comment to clarify?
Semi-arbitrary, to prevent the lists of names from being too big - adding a comment
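One way to make the cutoff self-documenting is to give it a name. The sketch below is an assumption, not the PR's code: `MAX_NAMES_PER_LIST` and `top_names` are hypothetical, and it uses plain Python sorting in place of the pandas `sort_values(...).iloc[:200]` chain so that it runs without any dependencies.

```python
# Hypothetical constant; 200 is the semi-arbitrary cap from the thread
# above, chosen to keep the lists of names from getting too big.
MAX_NAMES_PER_LIST = 200


def top_names(name_counts, max_names=MAX_NAMES_PER_LIST):
    """Return up to max_names names, most frequently observed first.

    name_counts maps each name to its number of observations. This is a
    stand-in for the 'obs_of_this_ethnicity' column in the diff above.
    """
    ranked = sorted(name_counts.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:max_names]]
```

With a named constant, the cap can be changed or explained in one place instead of appearing as a bare `:200` slice.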
]
if mapped_ethnicity == 'aa':
    # Avoid the same name in two lists by removing it from this one
    female_race_gender_name_list.remove('Yolanda')
Is there a programmatic way to detect duplicates? That way it would just work for updated versions of the source datasets (assuming the schema stays the same).
Good call - haha, this was obviously a hack to account for the current duplicates. Adding a TODO here.
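A programmatic version of the duplicate check could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the dict-of-lists input shape, and the policy of keeping a name only in the first list that contains it are not taken from the PR, and the right tie-breaking rule for the real datasets may differ.

```python
from collections import Counter


def find_cross_list_duplicates(name_lists):
    """Return the names that occur in more than one list, sorted."""
    # set(names) ensures a name repeated inside one list is counted once.
    counts = Counter(
        name for names in name_lists.values() for name in set(names)
    )
    return sorted(name for name, count in counts.items() if count > 1)


def drop_cross_list_duplicates(name_lists):
    """Keep each name only in the first list (in dict order) that has it.

    Assumed policy, chosen for simplicity. As a side effect, a name that
    is repeated within a single list is also reduced to one occurrence.
    """
    seen = set()
    deduped = {}
    for list_key, names in name_lists.items():
        kept = []
        for name in names:
            if name not in seen:
                kept.append(name)
                seen.add(name)
        deduped[list_key] = kept
    return deduped


# Placeholder keys and names, not the real datasets.
example_lists = {
    'list_a': ['Yolanda', 'Maria'],
    'list_b': ['Keisha', 'Yolanda'],
}
print(find_cross_list_duplicates(example_lists))
print(drop_cross_list_duplicates(example_lists))
```

Reporting the duplicates separately from removing them makes it possible to log or assert on them when the source datasets are updated, rather than silently dropping names.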
Release the code used to process and produce the two sets of names used in https://arxiv.org/pdf/2109.03300.pdf, given the lists of names from the original papers that we used to create these sets.