Merge alpha version of 0.1.0 into master
lmullen committed Feb 22, 2014
2 parents e091786 + 24dae93 commit 9657456
Showing 26 changed files with 705 additions and 17 deletions.
15 changes: 12 additions & 3 deletions DESCRIPTION
@@ -1,13 +1,22 @@
Package: gender
Type: Package
Title: Gender: encode gender based on names and dates of birth
Version: 1.0
Title: Gender: find gender by name and date
Version: 0.1
Author: Lincoln Mullen <lincoln@lincolnmullen.com>
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
Description: Encodes gender based on names and dates of birth, using the Social
Security Administration's data set of first names by year and state. By
using the SSA data instead of lists of male and female names, this package
is able to more accurately guess the gender of a name, and it is able to
report the probability that a name was male or female. Based on an
algorithm devised by Cameron Blevins and Bridget Baird.
algorithm devised by Cameron Blevins.
URL: https://github.com/lmullen/gender
Depends:
R (>= 3.0.0)
Imports:
dplyr (>= 0.1.1)
Suggests:
testthat
LazyData: yes
License: MIT
Roxygen: list(wrap = FALSE)
9 changes: 9 additions & 0 deletions NEWS
@@ -0,0 +1,9 @@
gender 0.1
==========

* function `gender` implements gender lookup for names and data frames

* implemented finding gender by using the Kantrowitz names corpus

* implemented finding gender by using the national Social Security
Administration data for names and dates of birth
74 changes: 74 additions & 0 deletions R/gender-package.r
@@ -0,0 +1,74 @@
#' Gender: find gender by name and date
#'
#' Encodes gender based on names and dates of birth, using the Social
#' Security Administration's data set of first names by year and state. By
#' using the SSA data instead of lists of male and female names, this package
#' is able to more accurately guess the gender of a name, and it is able to
#' report the probability that a name was male or female. Based on a technique
#' devised by Cameron Blevins.
#'
#' @name gender
#' @docType package
#' @title Gender: find gender by name and date
#' @author \email{lincoln@@lincolnmullen.com}
#' @keywords gender
NULL

#' Social Security Administration national names dataset
#'
#' A data set containing the number of instances of male and female names born
#' in the years 1880 to 2012 for people who have received Social Security
#' Numbers. The SSA includes only names that were used at least five times in a
#' given year. The data set contains 91,320 unique names in total.
#'
#' @docType data
#' @keywords datasets
#' @name ssa_national
#' @source Social Security Administration,
#' \url{http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data}
#' @format A data frame with 1,603,026 observations and 4 variables
NULL

#' Social Security Administration state names dataset
#'
#' A data set containing the number of instances of male and female names born
#' in each state for the years 1910 to 2012 for people who have received Social
#' Security Numbers. The SSA includes only names that were used at least five times
#' in a given state in a given year.
#'
#' @docType data
#' @keywords datasets
#' @name ssa_state
#' @source Social Security Administration,
#' \url{http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-data-by-state-and-district-of-}
#' @format A data frame with 5,267,234 observations and 5 variables
NULL

#' Kantrowitz names corpus
#'
#' A data set containing 7,579 unique names compiled into two lists of male and
#' female names by Mark Kantrowitz and Bill Ross in 1991, also used in Python's
#' Natural Language Toolkit.
#'
#' @docType data
#' @keywords datasets
#' @name kantrowitz
#' @source Mark Kantrowitz and Bill Ross,
#' \url{http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html}
#' @format A data frame with 7,579 observations and 2 variables
NULL
70 changes: 70 additions & 0 deletions R/gender.R
@@ -0,0 +1,70 @@
#' Find the gender of first names
#'
#' This function looks up the gender of either a single first name or of a
#' column of first names in a data frame. Optionally it can take a year, a
#' range of years, or a column of years in the data frame to take into account
#' variation in the use of names over time.
#'
#' @param data A character string of a first name or a data frame with a column
#' named \code{name} with a character vector containing first names. The names
#' must all be lowercase.
#' @param years This argument can be either a single year, a range of years in
#' the form \code{c(1880, 1900)}, or the value \code{TRUE}. If no value is
#' specified, then the names will be looked up for the period 1932 to 2012. If
#' a year or range of years is specified, then the names will be looked up for
#' that period. If the value is \code{TRUE}, then the function will look for
#' a column in the data frame named \code{year} containing an integer vector
#' of the year of birth associated with each name. This permits you to do a
#' precise lookup for each person in your data set. Dates may range from 1880
#' to 2012; if earlier or later dates are included in a column in the data
#' frame, they will not be matched.
#' @param method This value can be either \code{"ssa"}, in which case the
#' function will look up names based on Social Security Administration name
#' data, or \code{"kantrowitz"}, in which case the function will use the
#' Kantrowitz corpus of male and female names.
#' @param certainty A boolean value, which determines whether or not to return
#' the proportion of male and female uses of names in addition to determining
#' the gender of names.
#' @keywords gender
#' @export
#' @examples
#' library(dplyr)
#' gender("madison")
#' gender("madison", years = c(1900, 1985))
#' gender("madison", years = 1985)
#' gender(sample_names_data)
#' gender(sample_names_data, years = TRUE)
#' gender(sample_names_data, certainty = FALSE)
#' gender(sample_names_data, method = "kantrowitz")
gender <- function(data, years = c(1932, 2012), method = "ssa",
                   certainty = TRUE) {

  # If data is a character vector, then convert it to a data frame.
  # If the data is not a character vector or a data frame, throw an error.
  # is.character() and is.data.frame() are used instead of comparing class(),
  # which can return more than one value (e.g. for dplyr's tbl_df).
  if (is.character(data)) {
    data <- as.data.frame(data, optional = TRUE, stringsAsFactors = FALSE)
    colnames(data) <- "name"
  } else if (!is.data.frame(data)) {
    stop("Data must be a character vector or a data frame.")
  }

  # Hand off the arguments to functions based on method, and do error checking
  if (method == "ssa") {
    # Check for errors in the year argument
    if (length(years) == 1) years <- c(years, years)
    if (length(years) > 2) {
      stop("Year should be a numeric vector with no more than two values.")
    } else if (years[1] > years[2]) {
      stop("The first value for years should be smaller than the second value.")
    } else {
      gender_ssa(data = data, years = years, certainty = certainty)
    }
  } else if (method == "kantrowitz") {
    if (!missing(years)) {
      warning("The year is not taken into account with the Kantrowitz method.")
    }
    gender_kantrowitz(data = data)
  } else {
    stop("Method ", method, " is not recognized. Try ?gender for help.")
  }
}
13 changes: 13 additions & 0 deletions R/gender_kantrowitz.R
@@ -0,0 +1,13 @@
#' Find the gender of first names using Kantrowitz names corpus
#'
#' This internal function implements the \code{method = "kantrowitz"} option of
#' \code{\link{gender}}. See that function for documentation.
#'
#' @param data A character string of a first name or a data frame with a column
#' named \code{name} with a character vector containing first names. The names
#' must all be lowercase.
gender_kantrowitz <- function(data) {

  left_join(data, gender::kantrowitz, by = "name")

}
62 changes: 62 additions & 0 deletions R/gender_ssa.R
@@ -0,0 +1,62 @@
#' Find the gender of first names using Social Security data
#'
#' This internal function implements the \code{method = "ssa"} option of
#' \code{\link{gender}}. See that function for documentation.
#'
#' @param data A character string of a first name or a data frame with a column
#' named \code{name} with a character vector containing first names. The names
#' must all be lowercase.
#' @param years This argument can be either a single year, a range of years in
#' the form \code{c(1880, 1900)}, or the value \code{TRUE}. If no value is
#' specified, then the names will be looked up for the period 1932 to 2012. If
#' a year or range of years is specified, then the names will be looked up for
#' that period. If the value is \code{TRUE}, then the function will look for
#' a column in the data frame named \code{year} containing an integer vector
#' of the year of birth associated with each name. This permits you to do a
#' precise lookup for each person in your data set. Dates may range from 1880
#' to 2012; if earlier or later dates are included in a column in the data
#' frame, they will not be matched.
#' @param certainty A boolean value, which determines whether or not to return
#' the proportion of male and female uses of names in addition to determining
#' the gender of names.
gender_ssa <- function(data, years, certainty) {

  # is.numeric() also accepts integer years such as 1985L, which
  # class(years) == "numeric" would silently skip.
  if (is.numeric(years)) {

    # Calculate the male and female proportions for the given range of years
    ssa_select <- gender::ssa_national %.%
      filter(year >= years[1], year <= years[2]) %.%
      group_by(name) %.%
      summarise(female = sum(female),
                male = sum(male)) %.%
      mutate(proportion_male = round((male / (male + female)), digits = 4),
             proportion_female = round((female / (male + female)), digits = 4)) %.%
      mutate(gender = ifelse(proportion_female == 0.5, "either",
                             ifelse(proportion_female > 0.5, "female", "male")))

    results <- left_join(data, ssa_select, by = "name")

  } else if (is.logical(years)) {

    # Join the data to SSA data by name and year, then calculate proportions
    results <-
      left_join(data, gender::ssa_national, by = c("name", "year")) %.%
      mutate(proportion_male = round((male / (male + female)), digits = 4),
             proportion_female = round((female / (male + female)), digits = 4)) %.%
      mutate(gender = ifelse(proportion_female == 0.5, "either",
                             ifelse(proportion_female > 0.5, "female", "male")))

  } else {
    # Without this branch, results would be undefined further down
    stop("Years must be numeric or the value TRUE.")
  }

  # Delete the male and female columns since we won't report them to the user
  results$male <- NULL
  results$female <- NULL

  # Delete the certainty columns unless the user wants them
  if (!certainty) {
    results$proportion_male <- NULL
    results$proportion_female <- NULL
  }

  return(results)

}
119 changes: 105 additions & 14 deletions README.md
@@ -1,30 +1,121 @@
# Gender: an R package to encode gender based on names and dates of birth
# Gender

Lincoln A. Mullen | lincoln@lincolnmullen.com | http://lincolnmullen.com

This package encodes gender based on names and dates of birth, using the
Social Security Administration's data set of first names by year and
state. By using the SSA data instead of lists of male and female names,
this package is able to more accurately guess the gender of a name, and
it is able to report the probability that a name was male or female.
Data sets, historical or otherwise, often contain a list of first names
but seldom identify those names by gender. Most techniques for finding
gender programmatically, such as the [Natural Language Toolkit][], rely
on lists of male and female names. However, the gender[\*][] of names
can vary over time. Any data set that covers the normal span of a human
life will require a fundamentally historical method to find gender from
names.

This package is based on a [Python script][] by [Cameron Blevins][] and
[Bridget Baird][], who came up with the original idea and found the data
set to make it possible.
This package, based on collaborative work with [Cameron Blevins][],
encodes gender based on names and dates of birth, using the Social
Security Administration's data set of first names by year since 1880. By
using the SSA data instead of lists of male and female names, this
package is able to more accurately guess the gender of a name;
furthermore it is able to report the proportion of times that a name was
male or female for any given range of years.

See also Cameron's implementation of the same concept in a [Python
script][].

# Installation

To install this package, first install
[devtools](https://github.com/hadley/devtools).
To install this package, first install [devtools][].

Then run the following command:

devtools::install_github("lmullen/gender")

# Using the package

The simplest way to use this package is to pass a single name to the
`gender()` function. You can optionally specify a year or range of years
to the function. If you specify the years option, the function will
calculate the proportion of male and female uses of a name for that time
period; otherwise it will use the time period 1932--2012.

gender("madison")
# returns
# name proportion_female gender proportion_male
# 1 madison 0.9828 female 0.0172

gender("madison", years = c(1900, 1985))
# returns
# name proportion_female gender proportion_male
# 1 madison 0.0972 male 0.9028

gender("madison", years = 1985)
# name proportion_female gender proportion_male
# 1 madison 0.7863 female 0.2137
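
As a rough illustration of why the year range changes the answer, the
sketch below repeats the sum-then-divide calculation in base R on a tiny
invented table. The `toy_ssa` counts and the `proportions_for_range()`
helper are hypothetical and are not part of the package, which runs the
same kind of calculation on the full SSA data with dplyr.

```r
# Invented counts in the documented shape of ssa_national
# (name, year, female, male); the numbers are for illustration only.
toy_ssa <- data.frame(
  name   = c("madison", "madison", "madison", "john"),
  year   = c(1950L, 1985L, 2000L, 1985L),
  female = c(0, 300, 9000, 10),
  male   = c(400, 100, 100, 9990),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum the counts per name inside a year range, then
# derive the proportions and a gender label from the totals.
proportions_for_range <- function(counts, years) {
  in_range <- counts[counts$year >= years[1] & counts$year <= years[2], ]
  totals <- aggregate(cbind(female, male) ~ name, data = in_range, FUN = sum)
  all_uses <- totals$female + totals$male
  totals$proportion_female <- round(totals$female / all_uses, 4)
  totals$proportion_male <- round(totals$male / all_uses, 4)
  totals$gender <- ifelse(totals$proportion_female == 0.5, "either",
                          ifelse(totals$proportion_female > 0.5,
                                 "female", "male"))
  totals
}

early <- proportions_for_range(toy_ssa, c(1900, 1985))
late  <- proportions_for_range(toy_ssa, c(1985, 2012))
```

With these invented counts, "madison" comes out as male for the earlier
range and female for the later one, which is the kind of shift the real
data shows in the output above.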

In most cases, you probably have a data set with many names. For now
this package assumes that you have a data frame with a column `name`
which is a character vector (not a factor) containing all lowercase
names. If this does not match your data set, see [dplyr][] and
[stringr][] for help. You can pass that data frame to the `gender()`
function, which will add columns for gender and the certainty of that
guess to your data frame.

gender(sample_names_data)
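
If your names are stored as a factor or in mixed case, something like
the following base R sketch may be enough to get them into the expected
shape. The `raw` data frame and its `first` column are invented for
illustration; only the target column name `name` comes from this package.

```r
# Hypothetical raw data: names stored as a factor, with mixed case and
# stray whitespace.
raw <- data.frame(first = factor(c("Madison ", "JOHN", "mary")))

# Build the column that gender() expects: a lowercase character vector
# called `name`. trimws() requires R 3.2.0 or later.
prepared <- data.frame(
  name = tolower(trimws(as.character(raw$first))),
  stringsAsFactors = FALSE
)
```

The `prepared` data frame can then be passed to `gender()` in place of
`sample_names_data`.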

Using a data frame you can specify a single year or range of years as in
the example above. But you can also specify a column in your data set
which contains year of birth associated with the name. For now, this
column must be an integer vector (not a numeric vector) named `year`.

gender(sample_names_data, years = TRUE)
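
The per-person lookup amounts to a left join on `name` and `year`. The
sketch below shows the idea in base R with `merge()` on invented data;
`people` and `toy_counts` are hypothetical, and the package itself uses
dplyr's `left_join()` against the full SSA table. Converting `year` with
`as.integer()` is shown because the column is expected to be an integer
vector.

```r
# Hypothetical input: years of birth stored as doubles, one of them
# outside the range covered by the counts table.
people <- data.frame(
  name = c("madison", "madison", "madison"),
  year = c(1850, 1950, 2000),
  stringsAsFactors = FALSE
)
people$year <- as.integer(people$year)

# Invented per-year counts in the documented shape of ssa_national.
toy_counts <- data.frame(
  name   = c("madison", "madison"),
  year   = c(1950L, 2000L),
  female = c(0, 9000),
  male   = c(400, 100),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of `people`, like a left join; rows with
# no matching name and year get NA counts.
matched <- merge(people, toy_counts, by = c("name", "year"), all.x = TRUE)
matched$proportion_female <-
  round(matched$female / (matched$female + matched$male), 4)
```

The 1850 row stays in the result with `NA` in the proportion column,
which mirrors how years outside 1880 to 2012 are left unmatched.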

If you prefer to use the Kantrowitz corpus of male and female names, you can
use the `method` option.

gender(sample_names_data, method = "kantrowitz")

If you prefer a more minimal output, use the option `certainty = FALSE`
to remove the `proportion_male` and `proportion_female` output.

# Data

This package includes cleaned-up versions of several data sets. To see
the available data sets run the following command:

data(package = "gender")
data(ssa_national) # returns a data set with 1.6 million rows

The raw data sets used in this package are available here:

- [Mark Kantrowitz's name corpus][]
- [Social Security Administration's baby names by year and state][]
- [Social Security Administration's baby names by year][]

# License

MIT License <http://lmullen.mit-license.org/>
MIT License, <http://lmullen.mit-license.org/>

[Python script]: https://github.com/cblevins/Gender-ID-By-Time
# Citation

Eventually Cameron and I will publish an article about this method. In
the meantime, you can cite and link to either his [Python
implementation][Python script] or my implementation in this R package.

# Note

<a name="gender-vs-sex"></a>\* Of course in most cases the Social
Security Administration data records something closer to the biological
category sex than to the social category gender, since it mostly
records names given at birth. But since in most cases researchers will
be interested in gender, I've named this package gender, leaving it up
to researchers to interpret exactly what the encoded values mean.

[Natural Language Toolkit]: http://www.nltk.org/
[\*]: #gender-vs-sex
[Cameron Blevins]: http://www.cameronblevins.org/
[Bridget Baird]: http://oak.conncoll.edu/bbbai/
[Python script]: https://github.com/cblevins/Gender-ID-By-Time
[devtools]: https://github.com/hadley/devtools
[dplyr]: https://github.com/hadley/dplyr
[stringr]: https://github.com/hadley/stringr
[Mark Kantrowitz's name corpus]: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html
[Social Security Administration's baby names by year and state]: http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-data-by-state-and-district-of-
[Social Security Administration's baby names by year]: http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
Binary file added data/kantrowitz.rda
Binary file not shown.
Binary file added data/sample_names_data.rda
Binary file not shown.
Binary file added data/ssa_national.rda
Binary file not shown.
Binary file added data/ssa_state.rda
Binary file not shown.