This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[Names Bias] Code for creating sets of names #4836

Merged
EricMichaelSmith merged 5 commits into main from names-bias-name-set-code on Oct 31, 2022

EricMichaelSmith
Contributor

Release the code that processes the name lists from the original papers and produces the two sets of names used in https://arxiv.org/pdf/2109.03300.pdf.

davides left a comment

LGTM overall, just a few minor things to clarify.

import pandas as pd


RACES_ETHNICITIES = ['hispanic', 'white', 'black', 'api', 'aian', '2prace']

Are the spellings of the last two items here intentional?

Contributor Author

Yeah, that's how Tzioumis spelled them - adding a comment for this
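
For readers unfamiliar with these codes, a comment along the following lines could accompany the constant. The expansions are an assumed gloss based on common census-style categories, not text from this PR or a quotation from Tzioumis, and `CODE_DESCRIPTIONS` is a hypothetical name used only for illustration.

```python
# Codes spelled as in the Tzioumis first-name data, per the reply above.
RACES_ETHNICITIES = ['hispanic', 'white', 'black', 'api', 'aian', '2prace']

# Hypothetical documentation helper. The descriptions are assumptions
# and should be checked against the original dataset's codebook.
CODE_DESCRIPTIONS = {
    'hispanic': 'Hispanic or Latino',
    'white': 'White',
    'black': 'Black',
    'api': 'Asian or Pacific Islander',
    'aian': 'American Indian or Alaska Native',
    '2prace': 'two or more races',
}

# Guard against the constant and its documentation drifting apart.
assert set(CODE_DESCRIPTIONS) == set(RACES_ETHNICITIES)
```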

names_to_new_lists = {}
for name_list, names in orig_name_lists.items():
    for name in names:
        proc_name = name.replace('-', '')

It seems like this will concatenate hyphenated names. Why is that required?

Contributor Author

It's to match the formatting of the baby-name lists - adding a comment for this
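
A minimal sketch of that normalization, assuming (as the reply implies) that the baby-name lists contain no hyphens. The function name `normalize_name` and the sample names are hypothetical and do not come from the PR.

```python
def normalize_name(name: str) -> str:
    """Drop hyphens so a hyphenated name matches its unhyphenated spelling."""
    return name.replace('-', '')


# Hypothetical sample names, for illustration only.
sample_names = ['Mary-Ann', 'Jose', 'Anne-Marie']
normalized = [normalize_name(n) for n in sample_names]
print(normalized)
```

Note that this joins the two parts without a space, so 'Mary-Ann' becomes 'MaryAnn' rather than 'Mary Ann'.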

    .sort_values('obs_of_this_ethnicity', ascending=False)
)
tzioumis_plurality_names = percent_plurality_names_df.iloc[
    :200

Not sure I follow, why 200? Can you please add a comment to clarify?

Contributor Author

Semi-arbitrary, to prevent the lists of names from being too big - adding a comment
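
One way to make the cutoff self-documenting is a named constant. The sketch below uses plain Python rather than the pandas DataFrame in the PR, and `MAX_NAMES_PER_LIST`, `top_names`, and the sample counts are hypothetical.

```python
# Hypothetical constant name; the PR hard-codes 200 in the slice.
# Semi-arbitrary cap that keeps each list of names from growing too large.
MAX_NAMES_PER_LIST = 200


def top_names(counts: dict, limit: int = MAX_NAMES_PER_LIST) -> list:
    """Return up to `limit` names, most frequently observed first."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:limit]


# Made-up observation counts, for illustration only.
example_counts = {'Ana': 50, 'Luis': 120, 'Rosa': 75}
top_two = top_names(example_counts, limit=2)
print(top_two)
```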

]
if mapped_ethnicity == 'aa':
    # Avoid the same name in two lists by removing it from this one
    female_race_gender_name_list.remove('Yolanda')

Is there a programmatic way to detect duplicates? That way it would just work for updated versions of the source datasets (assuming the schema stays the same).

Contributor Author

EricMichaelSmith, Oct 24, 2022

good call - haha this was obviously a hack to account for the current duplicates. Adding a TODO here
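
A possible shape for that TODO, as a sketch only: detect names shared across lists and keep each one in a single list. The function names, the list keys, and the keep-the-first-list policy are all assumptions; the merged code instead removes one specific name by hand.

```python
from collections import defaultdict


def find_duplicates(name_lists: dict) -> dict:
    """Map each name found in more than one list to the keys of those lists."""
    seen = defaultdict(list)
    for list_key, names in name_lists.items():
        for name in set(names):  # count each list at most once per name
            seen[name].append(list_key)
    return {name: keys for name, keys in seen.items() if len(keys) > 1}


def remove_cross_list_duplicates(name_lists: dict) -> dict:
    """Keep each name only in the first list (in dict order) that contains it."""
    claimed = set()
    result = {}
    for list_key, names in name_lists.items():
        kept = [n for n in names if n not in claimed]
        claimed.update(kept)
        result[list_key] = kept
    return result


# Hypothetical list keys and contents, for illustration only.
example_lists = {
    'list_a': ['Yolanda', 'Maria'],
    'list_b': ['Yolanda', 'Keisha'],
}
duplicates = find_duplicates(example_lists)
deduplicated = remove_cross_list_duplicates(example_lists)
```

Which list keeps a shared name is a policy decision. Here it depends on dictionary order, whereas the hard-coded version drops the name from one particular list.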

EricMichaelSmith merged commit b9317bd into main Oct 31, 2022
EricMichaelSmith deleted the names-bias-name-set-code branch October 31, 2022 17:56
3 participants