Merge alpha version of 0.1.0 into master

lmullen · Feb 22, 2014 · 9657456 · 9657456
2 parents e091786 + 24dae93
commit 9657456
Showing 26 changed files with 705 additions and 17 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,13 +1,22 @@
 Package: gender
 Type: Package
-Title: Gender: encode gender based on names and dates of birth
-Version: 1.0
+Title: Gender: find gender by name and date
+Version: 0.1
 Author: Lincoln Mullen <lincoln@lincolnmullen.com>
 Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
 Description: Encodes gender based on names and dates of birth, using the Social
     Security Administration's data set of first names by year and state. By
     using the SSA data instead of lists of male and female names, this package
     is able to more accurately guess the gender of a name, and it is able to
     report the probability that a name was male or female. Based on an
-    algorithm devised by Cameron Blevins and Bridget Baird.
+    algorithm devised by Cameron Blevins.
+URL: https://github.com/lmullen/gender
+Depends:
+    R (>= 3.0.0)
+Imports:
+    dplyr (>= 0.1.1)
+Suggests:
+    testthat
+LazyData: yes
 License: MIT
+Roxygen: list(wrap = FALSE)
diff --git a/NEWS b/NEWS
@@ -0,0 +1,9 @@
+gender 0.1
+==========
+
+* function `gender` implements gender lookup for names and data frames
+
+* implemented finding gender by using the Kantrowitz names corpus
+
+* implemented finding gender by using the national Social Security 
+  Administration data for names and dates of birth
diff --git a/R/gender-package.r b/R/gender-package.r
@@ -0,0 +1,74 @@
+#' Gender: find gender by name and date
+#' 
+#' Encodes gender based on names and dates of birth, using the Social
+#' Security Administration's data set of first names by year and state. By
+#' using the SSA data instead of lists of male and female names, this package
+#' is able to more accurately guess the gender of a name, and it is able to
+#' report the probability that a name was male or female. Based on a technique
+#' devised by Cameron Blevins.
+#' 
+#' @name gender
+#' @docType package
+#' @title Gender: find gender by name and date
+#' @author \email{lincoln@@lincolnmullen.com}
+#' @keywords gender
+NULL
+
+#' Social Security Administration national names dataset
+#' 
+#' A data set containing the number of instances of male and female names born
+#' in the years 1880 to 2012 for people who have received Social Security 
+#' Numbers. The SSA includes only names that were used more than five times in a
+#' given year. The data set contains 91,320 unique names in total.
+#' 
+#' @docType data
+#' @keywords datasets
+#' @name ssa_national
+#' @source Social Security Administration, 
+#'   \url{http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data}
+#' @format A data frame with 1,603,026 observations and 4 variables
+NULL
+
+#' Social Security Administration state names dataset
+#' 
+#' A data set containing the number of instances of male and female names born
+#' in each state for the years 1910 to 2012 for people who have received Social
+#' Security Numbers. The SSA includes only names that were used more than five times
+#' in a given state in a given year. 
+#' 
+#' @docType data
+#' @keywords datasets
+#' @name ssa_state
+#' @source Social Security Administration, 
+#'   \url{http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-data-by-state-and-district-of-}
+#' @format A data frame with 5,267,234 observations and 5 variables
+NULL
+
+#' Social Security Administration national names dataset
+#' 
+#' A data set containing the number of instances of male and female names born
+#' in the years 1880 to 2012 for people who have received Social Security 
+#' Numbers. The SSA includes only names that were used more than five times in a
+#' given year. The data set contains 91,320 unique names in total.
+#' 
+#' @docType data
+#' @keywords datasets
+#' @name ssa_national
+#' @source Social Security Administration, 
+#'   \url{http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data}
+#' @format A data frame with 1,603,026 observations and 4 variables
+NULL
+
+#' Kantrowitz names corpus
+#' 
+#' A data set containing 7,579 unique names compiled into two lists of male and
+#' female names by Mark Kantrowitz and Bill Ross in 1991, also used in Python's
+#' Natural Language Toolkit.
+#' 
+#' @docType data
+#' @keywords datasets
+#' @name kantrowitz
+#' @source Mark Kantrowitz and Bill Ross,
+#'   \url{http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html}
+#' @format A data frame with 7,579 observations and 2 variables
+NULL
diff --git a/R/gender.R b/R/gender.R
@@ -0,0 +1,70 @@
+#' Find the gender of first names
+#' 
+#' This function looks up the gender of either a single first name or of a 
+#' column of first names in a data frame. Optionally it can take a year, a 
+#' range of years, or a column of years in the data frame to take into account
+#' variation in the use of names over time.
+#' 
+#' @param data A character string of a first name or a data frame with a column 
+#'   named \code{name} with a character vector containing first names. The names 
+#'   must all be lowercase. 
+#' @param years This argument can be either a single year, a range of years in 
+#'   the form \code{c(1880, 1900)}, or the value \code{TRUE}. If no value is 
+#'   specified, then the names will be looked up for the period 1932 to 2012. If
+#'   a year or range of years is specified, then the names will be looked up for
+#'   that period. If the value is \code{TRUE}, then the function will look for 
+#'   a column in the data frame named \code{year} containing an integer vector
+#'   of the year of birth associated with each name. This permits you to do a 
+#'   precise lookup for each person in your data set. Dates may range from 1880 
+#'   to 2012; if earlier or later dates are included in a column in the data 
+#'   frame, they will not be matched.
+#' @param method This value can be either \code{"ssa"}, in which case the 
+#'   function will look up names based on Social Security Administration name 
+#'   data, or \code{"kantrowitz"}, in which case the function will use the 
+#'   Kantrowitz corpus of male and female names.
+#' @param certainty A boolean value, which determines whether or not to return
+#'   the proportion of male and female uses of names in addition to determining
+#'   the gender of names.
+#' @keywords gender
+#' @export
+#' @examples
+#' library(dplyr)
+#' gender("madison")
+#' gender("madison", years = c(1900, 1985))
+#' gender("madison", years = 1985)
+#' gender(sample_names_data)
+#' gender(sample_names_data, years = TRUE)
+#' gender(sample_names_data, certainty = FALSE)
+#' gender(sample_names_data, method = "kantrowitz")
+gender <- function(data, years = c(1932, 2012), method = "ssa",
+                   certainty = TRUE) {
+
+  # If data is a character vector, then convert it to a data frame. 
+  # If the data is not a character vector or a data frame, throw an error.
+  if (is.character(data)) {
+    data <- as.data.frame(data, optional = TRUE, stringsAsFactors = FALSE)
+    colnames(data) <- "name"
+  } else if (!is.data.frame(data)) {
+    stop("Data must be a character vector or a data frame.")
+  }
+
+  # Hand off the arguments to functions based on method, and do error checking
+  if (method == "ssa") {
+    # Check for errors in the year argument
+    if (length(years) == 1) years <- c(years, years)
+    if (length(years) > 2) {
+      stop("Year should be a numeric vector with no more than two values.")
+    } else if (years[1] > years[2]) {
+      stop("The first value for years should be smaller than the second value.")
+    } else {
+      gender_ssa(data = data, years = years, certainty = certainty)
+    }
+  } else if (method == "kantrowitz") {
+    if (!missing(years)) {
+      warning("The year is not taken into account with the Kantrowitz method.") 
+    }
+    gender_kantrowitz(data = data)
+  } else {
+    stop("Method ", method, " is not recognized. Try ?gender for help.")
+  }
+}
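
Review note on `R/gender.R`: the `years` checks in `gender()` amount to a small normalization step. The sketch below restates them in isolation, assuming a hypothetical helper named `normalize_years()` that is not part of this commit; it only illustrates how a single year, a range, and `TRUE` each pass through the checks.

```r
# Hypothetical helper (not in the package): mirrors the checks in gender()
normalize_years <- function(years) {
  # A single value (a year, or TRUE) is expanded to a two-element vector
  if (length(years) == 1) years <- c(years, years)
  if (length(years) > 2) {
    stop("Year should be a numeric vector with no more than two values.")
  }
  if (years[1] > years[2]) {
    stop("The first value for years should be smaller than the second value.")
  }
  years
}

single <- normalize_years(1985)           # expanded to c(1985, 1985)
ranged <- normalize_years(c(1900, 1985))  # returned unchanged
by_row <- normalize_years(TRUE)           # stays logical: c(TRUE, TRUE)
```

Because a logical input stays logical after expansion, `gender_ssa()` can still tell the per-row lookup (`years = TRUE`) apart from a numeric range.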
diff --git a/R/gender_kantrowitz.R b/R/gender_kantrowitz.R
@@ -0,0 +1,13 @@
+#' Find the gender of first names using the Kantrowitz names corpus
+#' 
+#' This internal function implements the \code{method = "kantrowitz"} option of 
+#' \code{\link{gender}}. See that function for documentation.
+#' 
+#' @param data A character string of a first name or a data frame with a column 
+#'   named \code{name} with a character vector containing first names. The names 
+#'   must all be lowercase. 
+gender_kantrowitz <- function(data) {
+
+  left_join(data, gender::kantrowitz, by = "name")
+
+}
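
Review note on `R/gender_kantrowitz.R`: the left join keeps every input row and fills in `NA` where a name is not in the corpus. A minimal base R sketch of that behaviour, assuming a toy lookup table with `name` and `gender` columns (the real column names of `kantrowitz` are not shown in this diff, so both the table and its contents are hypothetical):

```r
# Toy stand-in for gender::kantrowitz (hypothetical rows and column names)
toy_corpus <- data.frame(
  name   = c("john", "mary"),
  gender = c("male", "female"),
  stringsAsFactors = FALSE
)

people <- data.frame(
  name = c("mary", "zyx", "john"),
  stringsAsFactors = FALSE
)

# match() keeps the input row order and yields NA for names not in the corpus
people$gender <- toy_corpus$gender[match(people$name, toy_corpus$name)]
```

Note that `match()` uses only the first hit per name, whereas a join would return one row per hit; the two agree only when each name appears once in the lookup table.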
diff --git a/R/gender_ssa.R b/R/gender_ssa.R
@@ -0,0 +1,62 @@
+#' Find the gender of first names using Social Security data
+#' 
+#' This internal function implements the \code{method = "ssa"} option of 
+#' \code{\link{gender}}. See that function for documentation.
+#' 
+#' @param data A character string of a first name or a data frame with a column 
+#'   named \code{name} with a character vector containing first names. The names 
+#'   must all be lowercase. 
+#' @param years This argument can be either a single year, a range of years in 
+#'   the form \code{c(1880, 1900)}, or the value \code{TRUE}. If no value is 
+#'   specified, then the names will be looked up for the period 1932 to 2012. If
+#'   a year or range of years is specified, then the names will be looked up for
+#'   that period. If the value is \code{TRUE}, then the function will look for 
+#'   a column in the data frame named \code{year} containing an integer vector
+#'   of the year of birth associated with each name. This permits you to do a 
+#'   precise lookup for each person in your data set. Dates may range from 1880 
+#'   to 2012; if earlier or later dates are included in a column in the data 
+#'   frame, they will not be matched.
+#' @param certainty A boolean value, which determines whether or not to return
+#'   the proportion of male and female uses of names in addition to determining
+#'   the gender of names.
+gender_ssa <- function(data, years, certainty) {
+
+  if (is.numeric(years)) {
+
+    # Calculate the male and female proportions for the given range of years
+    ssa_select <- gender::ssa_national %.%
+      filter(year >= years[1], year <= years[2]) %.%
+      group_by(name) %.%
+      summarise(female = sum(female),
+                male = sum(male)) %.%
+      mutate(proportion_male = round((male / (male + female)), digits = 4),
+             proportion_female = round((female / (male + female)), digits = 4)) %.%
+      mutate(gender = ifelse(proportion_female == 0.5, "either",
+                             ifelse(proportion_female > 0.5, "female", "male")))      
+
+    results <- left_join(data, ssa_select, by = "name")
+
+  } else if (is.logical(years)) {
+
+    # Join the data to SSA data by name and year, then calculate proportions
+    results <- 
+    left_join(data, gender::ssa_national, by = c("name", "year")) %.%
+      mutate(proportion_male = round((male / (male + female)), digits = 4),
+             proportion_female = round((female / (male + female)), digits = 4)) %.%
+      mutate(gender = ifelse(proportion_female == 0.5, "either",
+                             ifelse(proportion_female > 0.5, "female", "male")))  
+  }
+
+  # Delete the male and female columns since we won't report them to the user
+  results$male <- NULL
+  results$female <- NULL
+
+  # Delete the certainty columns unless the user wants them
+  if(!certainty) {
+    results$proportion_male <- NULL
+    results$proportion_female <- NULL
+  }  
+
+  return(results)
+
+}
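
Review note on `R/gender_ssa.R`: the year-range branch sums male and female counts per name over the range, converts them to proportions, and labels the name. The sketch below reproduces that calculation in base R on a toy table. The counts are invented for illustration, and `gender_by_range()` is a hypothetical name, not a function in this package.

```r
# Toy stand-in for gender::ssa_national (invented counts, not real SSA data)
toy_ssa <- data.frame(
  name   = c("madison", "madison", "madison", "john", "john"),
  year   = c(1950L, 1985L, 2000L, 1950L, 2000L),
  female = c(0, 30, 970, 10, 5),
  male   = c(90, 10, 30, 990, 995),
  stringsAsFactors = FALSE
)

# Sum counts per name within a year range, then derive proportions and a label
gender_by_range <- function(counts, years) {
  in_range <- counts[counts$year >= years[1] & counts$year <= years[2], ]
  totals <- aggregate(cbind(female, male) ~ name, data = in_range, FUN = sum)
  total_uses <- totals$female + totals$male
  totals$proportion_female <- round(totals$female / total_uses, digits = 4)
  totals$proportion_male <- round(totals$male / total_uses, digits = 4)
  totals$gender <- ifelse(totals$proportion_female == 0.5, "either",
                          ifelse(totals$proportion_female > 0.5, "female", "male"))
  totals
}

early <- gender_by_range(toy_ssa, c(1900, 1985))
late  <- gender_by_range(toy_ssa, c(1986, 2012))
```

With these invented counts the same name is labelled differently in the two periods, which is the case the year-aware lookup is meant to handle.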
diff --git a/README.md b/README.md
@@ -1,30 +1,121 @@
-# Gender: an R package to encode gender based on names and dates of birth
+# Gender
 
 Lincoln A. Mullen | lincoln@lincolnmullen.com | http://lincolnmullen.com
 
-This package encodes gender based on names and dates of birth, using the
-Social Security Administration's data set of first names by year and
-state. By using the SSA data instead of lists of male and female names,
-this package is able to more accurately guess the gender of a name, and
-it is able to report the probability that a name was male or female.
+Data sets, historical or otherwise, often contain a list of first names
+but seldom identify those names by gender. Most techniques for finding
+gender programmatically, such as the [Natural Language Toolkit][], rely
+on lists of male and female names. However, the gender[\*][] of names
+can vary over time. Any data set that covers the normal span of a human
+life will require a fundamentally historical method to find gender from
+names.
 
-This package is based on a [Python script][] by [Cameron Blevins][] and
-[Bridget Baird][], who came up with the original idea and found the data
-set to make it possible.
+This package, based on collaborative work with [Cameron Blevins][],
+encodes gender based on names and dates of birth, using the Social
+Security Administration's data set of first names by year since 1880. By
+using the SSA data instead of lists of male and female names, this
+package is able to more accurately guess the gender of a name;
+furthermore it is able to report the proportion of times that a name was
+male or female for any given range of years.
+
+See also Cameron's implementation of the same concept in a [Python
+script][].
 
 # Installation
 
-To install this package, first install 
-[devtools](https://github.com/hadley/devtools). 
+To install this package, first install [devtools][].
 
 Then run the following command:
 
     devtools::install_github("lmullen/gender")
 
+# Using the package
+
+The simplest way to use this package is to pass a single name to the
+`gender()` function. You can optionally specify a year or range of years
+to the function. If you specify the years option, the function will
+calculate the proportion of male and female uses of a name for that time
+period; otherwise it will use the time period 1932--2012.
+
+    gender("madison")
+    # returns
+    #      name proportion_female gender proportion_male
+    # 1 madison            0.9828 female          0.0172
+
+    gender("madison", years = c(1900, 1985))
+    # returns
+    #      name proportion_female gender proportion_male
+    # 1 madison            0.0972   male          0.9028
+
+    gender("madison", years = 1985)
+    #      name proportion_female gender proportion_male
+    # 1 madison            0.7863 female          0.2137
+
+In most cases, you probably have a data set with many names. For now
+this package assumes that you have a data frame with a column `name`
+which is a character vector (not a factor) containing all lowercase
+names. If this does not match your data set, see [dplyr][] and
+[stringr][] for help. You can pass that data frame to the `gender()`
+function, which will add columns for gender and the certainty of that
+guess to your data frame.
+
+    gender(sample_names_data)
+
+Using a data frame you can specify a single year or range of years as in
+the example above. But you can also specify a column in your data set
+which contains year of birth associated with the name. For now, this
+column must be an integer vector (not a numeric vector) named `year`.
+
+    gender(sample_names_data, years = TRUE)
+
+If you prefer to use the Kantrowitz corpus of male and female names, you can
+use the `method` option.
+
+    gender(sample_names_data, method = "kantrowitz")
+
+If you prefer a more minimal output, use the option `certainty = FALSE`
+to remove the `proportion_male` and `proportion_female` output.
+
+# Data
+
+This package includes cleaned-up versions of several data sets. To see
+the available data sets run the following command:
+
+    data(package = "gender")
+    data(ssa_national)        # loads a data set with 1.6 million rows
+
+The raw data sets used in this package are available here:
+
+-   [Mark Kantrowitz's name corpus][]
+-   [Social Security Administration's baby names by year and state][]
+-   [Social Security Administration's baby names by year][]
+
 # License
 
-MIT License <http://lmullen.mit-license.org/>
+MIT License, <http://lmullen.mit-license.org/>
 
-  [Python script]: https://github.com/cblevins/Gender-ID-By-Time
+# Citation
+
+Eventually Cameron and I will publish an article about this method. In
+the meantime, you can cite and link to either his [Python
+implementation][Python script] or my implementation in this R package.
+
+# Note
+
+<a name="gender-vs-sex"></a>\* Of course in most cases the Social
+Security Administration data more closely records the biological
+category sex rather than the social category gender, since it mostly
+records names given at birth. But since in most cases researchers will
+be interested in gender, I've named this package gender, leaving it up
+to researchers to interpret exactly what the encoded values mean.
+
+  [Natural Language Toolkit]: http://www.nltk.org/
+  [\*]: #gender-vs-sex
   [Cameron Blevins]: http://www.cameronblevins.org/
-  [Bridget Baird]: http://oak.conncoll.edu/bbbai/
+  [Python script]: https://github.com/cblevins/Gender-ID-By-Time
+  [devtools]: https://github.com/hadley/devtools
+  [dplyr]: https://github.com/hadley/dplyr
+  [stringr]: https://github.com/hadley/stringr
+  [Mark Kantrowitz's name corpus]: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html
+  [Social Security Administration's baby names by year and state]: http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-data-by-state-and-district-of-
+  [Social Security Administration's baby names by year]: http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
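
Review note on `README.md`: the README asks for a `name` column of lowercase character strings (not a factor) and, for `years = TRUE`, an integer `year` column. It points to dplyr and stringr for help; the sketch below shows one way to do the same preparation in base R. The `raw` data frame is hypothetical.

```r
# Hypothetical raw input: mixed-case names stored as a factor, years as doubles
raw <- data.frame(
  name = factor(c("Madison", "JOHN", "hillary")),
  year = c(1985, 1950, 2001)
)

# Convert to the shape the README describes
prepared <- data.frame(
  name = tolower(as.character(raw$name)),  # lowercase character vector
  year = as.integer(raw$year),             # integer, not double
  stringsAsFactors = FALSE
)
```

A data frame shaped like `prepared` is what calls such as `gender(prepared, years = TRUE)` expect, according to the README text in this commit.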
diff --git a/data/kantrowitz.rda b/data/kantrowitz.rda
diff --git a/data/sample_names_data.rda b/data/sample_names_data.rda
diff --git a/data/ssa_national.rda b/data/ssa_national.rda
diff --git a/data/ssa_state.rda b/data/ssa_state.rda