Coding Gender of Nonprofit Leaders

library( pander )
library( gender )
library( dplyr )
library( ggplot2 )


# when first installing, you will be asked to build a local database of names
# > gen <- gender( fn, method="ssa" )
# Install the genderdata package? 
# 
# 1: Yes
# 2: No
# Selection:  <- TYPE "Yes"

Load Compensation Data

This data was generated from Part II of Schedule J on the IRS 990 2014 E-Files.

data.url <- "https://github.com/lecy/coding-gender-of-nonprofit-leaders/raw/master/DATA/CompDat-2014.rds"

dat <- readRDS( gzcon( url( data.url )))

nm <- toupper( dat$PersonNm )

head( nm, 10 )
##  [1] "BRIAN PARKHILL"             "SANDY SZABAT"              
##  [3] "BUD WATSON"                 "CLINT SINGLEY"             
##  [5] "PAT TRUJILLO"               "LCDR RONALD MCCAMPBELL"    
##  [7] "EN1 NATHAN ARMENDARIZ"      "SH2 KIM MORRISON"          
##  [9] "SH1 SAMUEL HERNANDEZMORENO" "VINCENT BENIGNI"

Cleaning Names

The IRS provides no clear formatting guidelines for free-text fields such as names and organizational roles. As a result, this data can be fairly messy.

Here are some examples of using string processing functions in R to clean up some common problems with the data.

nm <- gsub( "SR ", "", nm, ignore.case=FALSE )
nm <- gsub( "DR ", "", nm, ignore.case=FALSE )
nm <- gsub( "REV ", "", nm, ignore.case=FALSE )
nm <- gsub( "MR ", "", nm, ignore.case=FALSE )
nm <- gsub( "SISTER ", "", nm, ignore.case=FALSE )
nm <- gsub( "REVEREND ", "", nm, ignore.case=FALSE )
nm <- gsub( "PROF ", "", nm, ignore.case=FALSE )
nm <- gsub( "RABBI ", "", nm, ignore.case=FALSE )
nm <- gsub( "^.{1} ", "", nm ) # remove a single leading character followed by a space (a stray digit or initial)
nm <- gsub( "^.{1} ", "", nm ) # repeat to catch a second one
nm <- gsub( "[0-9]", "", nm ) # remove all numbers from names
nm <- gsub( "^ ", "", nm ) # remove spaces at the beginning of names
nm <- gsub( "^ ", "", nm )

nm[ nm == "" ] <- NA  # mark empty names as missing
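
Note that the title patterns above match anywhere in the string, so "DR " also matches the end of "LCDR ". That is why "LCDR RONALD MCCAMPBELL" shows up further down with the first name "LCRONALD". A minimal sketch of one alternative, assuming titles should only be removed when they appear as a whole word at the start of a name. The helper name strip_titles and its default title list are illustrative and not part of the original analysis:

```r
# Hypothetical helper: drop a leading title only when it is a whole word.
strip_titles <- function( x, titles = c( "SR", "DR", "REV", "MR", "SISTER",
                                         "REVEREND", "PROF", "RABBI" ) )
{
  # ^ anchors the match to the start of the string, and " +" requires the
  # title to be followed by at least one space, so the "DR" inside "LCDR"
  # is left alone
  pattern <- paste0( "^(", paste( titles, collapse = "|" ), ") +" )
  sub( pattern, "", x )
}

# made-up test names, plus one name from the data shown earlier
test.names <- c( "DR JANE DOE", "LCDR RONALD MCCAMPBELL", "REVEREND JOHN SMITH" )
cleaned <- strip_titles( test.names )
print( cleaned )
```

With this approach, military ranks such as LCDR would stay in the name unless they are added to the title list.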

Split Full Names Into Parts

To use the gender package in R, we need to isolate first names. We do this by splitting each full name on spaces and keeping the first element.

In some cases this approach will fail. For example, if the names are listed in reverse order:

Smith, John

Or perhaps a person uses a title or an abbreviated first name:

Senator Smith

JW Smith

x <- strsplit( nm, " " )

first.names <- unlist( lapply( x, `[[`, 1 ) )

head( first.names, 25 )
##  [1] "BRIAN"    "SANDY"    "BUD"      "CLINT"    "PAT"      "LCRONALD"
##  [7] "EN"       "SH"       "SH"       "VINCENT"  "BETSY"    "JAMES"   
## [13] "PAUL"     "JD"       "JOHN"     "BENJAMIN" "JAMES"    "LOUIS"   
## [19] "MANUEL"   "MICHELLE" "JILL"     "HEATHER"  "ELAINE"   "MERCEDES"
## [25] "DONNA"
dat$FirstName <- tolower( first.names )

fn <- unique( tolower( first.names ) )

There are 78,287 unique first names in this dataset.
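
The reverse-order case mentioned above ("Smith, John") can often be detected by the comma. The following is a sketch only, assuming that a comma always signals "LAST, FIRST" order. The helper get_first_name is hypothetical and was not used to produce the results in this document:

```r
# Hypothetical helper: return the first name, allowing for "LAST, FIRST" order.
get_first_name <- function( x )
{
  has.comma <- grepl( ",", x, fixed = TRUE )
  # for "SMITH, JOHN" keep only the text after the first comma
  x[ has.comma ] <- sub( "^[^,]*, *", "", x[ has.comma ] )
  # treat the first space-delimited token as the first name
  parts <- strsplit( trimws( x ), " +" )
  vapply( parts,
          function( p ) if ( length( p ) == 0 ) NA_character_ else p[ 1 ],
          character( 1 ) )
}

first <- get_first_name( c( "SMITH, JOHN", "BRIAN PARKHILL", "JW SMITH" ) )
print( first )
```

Abbreviated first names such as "JW" still pass through unchanged; the gender package will simply fail to match them.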

The Gender Package in R

Usage

gender( names, years = c(1932, 2012), method = c("ssa", "ipums", "napp",
  "kantrowitz", "genderize", "demo"), countries = c("United States", "Canada",
  "United Kingdom", "Germany", "Iceland", "Norway", "Sweden"))

Description

This function predicts the gender of a first name given a year or range of years in which the person was born. The prediction can use one of several data sets suitable for different time periods or geographical regions. See the package vignette for suggestions on using this function with multiple names and for a discussion of which data set is most suitable for your research question. When using certain methods, the genderdata data package is required; you will be prompted to install it if it is not already available.

Arguments

names

First names as a character vector. Names are case insensitive.

years

The birth year of the name whose gender is to be predicted. This argument can be either a single year or a range of years in the form c(1880, 1900). If no value is specified, then for the “ssa” method it will use the period 1932 to 2012; acceptable years for the SSA method range from 1880 to 2012, but for years before 1930 the IPUMS method is probably more accurate. For the “ipums” method the default range is the period 1789 to 1930, which is also the range of acceptable years. For the “napp” method the default range is the period 1758 to 1910, which is also the range of acceptable years. If a year or range of years is specified, then the names will be looked up for that period.

method

This value determines the data set that is used to predict the gender of the name. The “ssa” method looks up names from the U.S. Social Security Administration baby name data. (This method is based on an implementation by Cameron Blevins.) The “ipums” method looks up names from the U.S. Census data in the Integrated Public Use Microdata Series. (This method was contributed by Ben Schmidt.) The “kantrowitz” method uses the Kantrowitz corpus of male and female names. The “genderize” method uses the Genderize.io (http://genderize.io/) API, which is based on “user profiles across major social networks.” The “demo” method uses the top 100 names in the SSA method; it is provided only for demonstration purposes when the genderdata package is not installed and it is not suitable for research purposes.

countries

The countries for which datasets are being used. For the “ssa” and “ipums” methods, the only valid option is “United States” which will be assumed if no argument is specified. For the “napp” method, you may specify a character vector with any of the following countries: “Canada”, “United Kingdom”, “Germany”, “Iceland”, “Norway”, “Sweden”. For the “kantrowitz” and “genderize” methods, no country should be specified.

Example

Load Some Example Data

first.names <- c("dave", "glen", "beverly", "jennifer", "stacy", "lynn", "betty", 
                 "linda", "laurie", "marilyn", "michelle", "cara", "allison", 
                 "alan", "jerry", "bo", "paul", "jim", "jeff", "chuck", "henry", 
                 NA, "steve", "saddiq", "kim")

Assign Gender to Names

# library( gender )

example.results <- gender(  first.names )

print( example.results, n=10 )
## # A tibble: 24 x 6
##       name proportion_male proportion_female gender year_min year_max
##      <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
##  1    alan          0.9968            0.0032   male     1932     2012
##  2 allison          0.0082            0.9918 female     1932     2012
##  3   betty          0.0039            0.9961 female     1932     2012
##  4 beverly          0.0070            0.9930 female     1932     2012
##  5      bo          0.9348            0.0652   male     1932     2012
##  6    cara          0.0012            0.9988 female     1932     2012
##  7   chuck          0.9997            0.0003   male     1932     2012
##  8    dave          0.9985            0.0015   male     1932     2012
##  9    glen          0.9918            0.0082   male     1932     2012
## 10   henry          0.9935            0.0065   male     1932     2012
## # ... with 14 more rows

Code Names from Compensation Data

We will use method=“ssa” here, which matches names in our data to the Social Security Administration baby name data and returns the proportion of individuals with that first name who belong to each gender.

gen <- gender( fn, method="ssa" )

# Additional Available Methods
# gen <- gender( fn, method="ipums" )
# gen <- gender( fn, method="kantrowitz" ) 


print( gen, n=10 )
## # A tibble: 30,274 x 6
##       name proportion_male proportion_female gender year_min year_max
##      <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
##  1  aakash          1.0000            0.0000   male     1932     2012
##  2   aalap          1.0000            0.0000   male     1932     2012
##  3  aaleah          0.0000            1.0000 female     1932     2012
##  4 aaliyah          0.0011            0.9989 female     1932     2012
##  5   aamer          1.0000            0.0000   male     1932     2012
##  6   aamir          1.0000            0.0000   male     1932     2012
##  7   aamna          0.0000            1.0000 female     1932     2012
##  8    aana          0.0000            1.0000 female     1932     2012
##  9    aara          0.0000            1.0000 female     1932     2012
## 10   aaren          0.6915            0.3085   male     1932     2012
## # ... with 3.026e+04 more rows
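
The proportion columns make it possible to apply a stricter rule than the package's own label. Below is a sketch under stated assumptions: the three rows are copied from the printed results in this document, and the 90 percent cutoff and the gender.strict column are hypothetical choices, not something the gender package provides:

```r
# Three rows copied from the printed results in this document
res <- data.frame(
  name            = c( "alan", "bo", "aaren" ),
  proportion_male = c( 0.9968, 0.9348, 0.6915 ),
  gender          = c( "male", "male", "male" ),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: require at least 90 percent agreement in either direction
cutoff <- 0.90
confident <- pmax( res$proportion_male, 1 - res$proportion_male ) >= cutoff

# Keep the package's label only for confident names; the rest become NA
res$gender.strict <- ifelse( confident, res$gender, NA_character_ )
print( res )
```

Here "aaren" keeps its label in the gender column but is set to NA in gender.strict, because only about 69 percent of people with that name are recorded as male.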

You can add gender to the original dataset by merging results:

gen <- gen[ , 1:4 ]

dat <- merge( dat, gen, by.x="FirstName", by.y="name", all.x=TRUE )

table( dat$gender, useNA="ifany" )
## 
##  either  female    male    <NA> 
##      15  967627 1580613  129031
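
The counts above can be turned into a match rate and a gender split with a few lines of arithmetic. This sketch simply re-enters the four counts from the table by hand; the variable names are illustrative:

```r
# Counts copied from the table above
counts <- c( either = 15, female = 967627, male = 1580613, missing = 129031 )

# Share of all records that did not receive a gender code
unmatched.share <- unname( counts[ "missing" ] / sum( counts ) )

# Share of women among records coded as female or male
coded <- counts[ c( "female", "male" ) ]
female.share <- unname( coded[ "female" ] / sum( coded ) )

print( round( 100 * c( unmatched = unmatched.share, female = female.share ), 1 ) )
```

Roughly 4.8 percent of records are unmatched, and about 38 percent of the coded records are female.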

Failed Matches

Since the R package assigns gender based upon matches to the Social Security Administration baby name data, names that do not appear in that data (or appear too few times and thus are not reported by the SSA for privacy reasons) cannot be properly coded.

# NAMES THAT DO NOT RECEIVE A CLEAR GENDER CODE

ambiguous.cases <- dat$gender == "either"
ambiguous.cases[ is.na(ambiguous.cases)] <- FALSE

dat$PersonNm[ ambiguous.cases ]
##  [1] "LUGENE POWELL"    "LUGENE INZANA"    "Lugene Powell"   
##  [4] "LUGENE LOGAN"     "Lugene Inzana"    "LUGENE INZANA"   
##  [7] "LUGENE INZANA"    "LUGENE INZANA"    "LUGENE GARRETT"  
## [10] "LUGENE INZANA"    "LUGENE POWELL"    "Lugene Inzana"   
## [13] "LUGENE VINCENT"   "LUGENE INZANA"    "LUGENE CALDERONE"
# ONLY ONE NAME HAS EXACTLY 50-50 SPLIT:  LUGENE!



# NAMES NOT FOUND

no.gender <- dat$PersonNm[ is.na(dat$gender) ]

head( no.gender, 10 )
##  [1] "'"                       "'ANAPESI KA'ILI"        
##  [3] "J 'BEN' WARREN"          "C W 'BILL' ENGLUND JR"  
##  [5] "'D'Juana Miller"         "'IOKEPA DESANTOS"       
##  [7] "L 'JACK' WONG"           "'LAINE HEATHCOTE"       
##  [9] "W R 'ROSS' BRIGDEN"      "'VARIOUS' INDIVIDUALS 8"
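
Several of the unmatched names above carry a nickname in single quotes, which could be recovered and coded instead. A minimal sketch, assuming a nickname is two or more letters wrapped in single quotes and followed by a space or the end of the string. The helper get_nickname is hypothetical:

```r
# Hypothetical helper: return a quoted nickname if present, otherwise NA.
get_nickname <- function( x )
{
  # capture two or more letters between single quotes; requiring a space or
  # the end of the string after the closing quote skips cases like 'D'Juana
  pattern <- "^.*'([A-Za-z]{2,})'( .*)?$"
  ifelse( grepl( pattern, x ), sub( pattern, "\\1", x ), NA_character_ )
}

# examples copied from the unmatched names listed above
unmatched <- c( "J 'BEN' WARREN", "C W 'BILL' ENGLUND JR",
                "'IOKEPA DESANTOS", "'D'Juana Miller" )
nick <- get_nickname( unmatched )
print( nick )
```

This rule is not selective enough on its own: an entry such as "'VARIOUS' INDIVIDUALS 8" would also return a "nickname", so the results would need review before being passed to gender().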

Comparing Compensation by Gender

dat$gender[ dat$gender == "either" ] <- NA  # treat the 50-50 case as missing

# density of logged compensation, coded names only
ggplot( dat[!is.na(dat$gender),], aes( x=log(RptCmpOrg), fill=gender )) + 
        geom_density(alpha = 0.5) + xlim(10,15) +
        xlab( "Compensation (logged)" )

# same plot with unmatched names included as an NA group
ggplot( dat, aes( x=log(RptCmpOrg), fill=gender )) + 
        geom_density(alpha = 0.5) + xlim(10,15) +
        xlab( "Compensation (logged)" )

Coding Titles

The IRS forms (Schedule J) contain information about all leaders, board members, and highly-compensated individuals in the nonprofit organizations. We often want to isolate one of these groups. For example, perhaps we want to look at only CEOs or CFOs.

Unfortunately, the titles are not standardized either. But we can apply similar string processing techniques to identify sets of individuals.

dat$TitleTxt <- toupper( dat$TitleTxt )

# d2 <- dat

title <- dat$TitleTxt

head( title, 10 )
##  [1] "PRESIDENT"       "DIRECTOR"        "DIRECTOR"       
##  [4] "PAST PRESIDENT"  "FINANCE MANAGER" "DIRECTOR"       
##  [7] "DIRECTOR"        "BOARD MEMBER"    "DIRECTOR"       
## [10] "DIRECTORS"
length( unique( title ))  # 165,659 different titles used!!!
## [1] 165659
title <- gsub( "\\/", " ", title )
title <- gsub( "\\.", "", title )


sort( table( title ), decreasing=TRUE )[ 1:50 ] %>% names 
##  [1] "DIRECTOR"                 "BOARD MEMBER"            
##  [3] "TRUSTEE"                  "TREASURER"               
##  [5] "PRESIDENT"                "SECRETARY"               
##  [7] "MEMBER"                   "VICE PRESIDENT"          
##  [9] "EXECUTIVE DIRECTOR"       "CHAIRMAN"                
## [11] "VICE CHAIR"               "VICE PRESIDE"            
## [13] "CHAIR"                    "VICE CHAIRMAN"           
## [15] "CFO"                      "EXECUTIVE DI"            
## [17] "CEO"                      "SECRETARY TREASURER"     
## [19] "PHYSICIAN"                "PAST PRESIDENT"          
## [21] "CHAIRPERSON"              "VICE-PRESIDENT"          
## [23] "EXECUTIVE DIR"            "EXECUTIVE BOARD"         
## [25] "BOARD OF DIRECTORS"       "PRESIDENT & CEO"         
## [27] "BOARD CHAIR"              "GOVERNOR"                
## [29] "MEMBER AT LARGE"          "CHIEF FINANCIAL OFFICER" 
## [31] "BOARD"                    "PRESIDENT CEO"           
## [33] "BOARD DIRECTOR"           "OFFICER"                 
## [35] "VICE-CHAIR"               "SECRETARY TR"            
## [37] "EXECUTIVE DIREC"          "CHIEF EXECUTIVE OFFICER" 
## [39] "COO"                      "PRESIDENT ELECT"         
## [41] "VP"                       "PAST PRESIDE"            
## [43] "CLERK"                    "ASSISTANT SECRETARY"     
## [45] "RECORDING SECRETARY"      "PAST CHAIR"              
## [47] "EX-OFFICIO"               "IMMEDIATE PAST PRESIDENT"
## [49] "VICE-PRESIDE"             "ADMINISTRATOR"
# PRESIDENT / CEO

director <- 
c("PRESIDENT","EXECUTIVE DIRECTOR","CEO","EXECUTIVE DI","PRESIDENT & CEO",              
  "EXECUTIVE DIREC","PRESIDENT CEO","CHIEF EXECUTIVE OFFICER","PRESIDENT ELECT",        
  "EXEC DIRECTOR","PRESIDENT-ELECT","DIR","EXEC DIR","PRESIDENT DIRECTOR",
  "NATIONAL DIRECTOR","PRES","CHIEF","MANAGING DIRECTOR","EXEC DIRECTO","EXEC DIRECT",
  "PRESIDENT CE","DIRECTOR PRESIDENT","PRESIDENT &","EX DIRECTOR","PRESIDENT, CEO",
  "PRESIDENT DI","PRESIDENT   CEO","PRES CEO","PRESIDENT, DIRECTOR","EXEC DIRECTOR CEO",
  "PRESIDENT CHAIRMAN" )


# select all with these titles

d2 <- dat[ title %in% director , ]


# select by additional criteria such as org type and minimum comp / hours

# which() drops rows where any condition is NA instead of returning all-NA rows
d3 <- d2[ which( d2$AvgHrs > 1 & d2$RptCmpOrg > 1 & d2$Org501c3 == 1 ) , ]
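
An exact-match list like the one above has to enumerate every truncation and spelling variant. As an alternative sketch, a regular expression can cover whole families of titles at once. The sample titles are copied from the frequency list above; the pattern itself is a hypothetical rule, not the one used in this analysis:

```r
# A few titles copied from the frequency list above
titles <- c( "PRESIDENT", "VICE PRESIDENT", "PAST PRESIDENT", "EXECUTIVE DI",
             "PRESIDENT & CEO", "CEO", "CFO", "EXECUTIVE BOARD", "TREASURER" )

# Hypothetical rule: the title must START with PRESIDENT, CEO, CHIEF EXECUTIVE,
# or EXEC / EXECUTIVE followed by " DI" (DIRECTOR and its truncations)
pattern <- "^(PRESIDENT|CEO|CHIEF EXECUTIVE|EXEC(UTIVE)? DI)"
is.top.exec <- grepl( pattern, titles )

print( titles[ is.top.exec ] )
```

Anchoring at the start of the string keeps "VICE PRESIDENT" and "PAST PRESIDENT" out, and requiring " DI" after EXECUTIVE excludes "EXECUTIVE BOARD". A rule like this is broader than the hand-built list, so its matches should be tabulated and checked before use.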