R Packages

install.packages( "MatchIt" )
install.packages( "rgenoud" )
install.packages( "ggplot2" )
library( MatchIt )    # popular matching package in R
library( rgenoud )    # genetic search used in matching 
library( dplyr )      # data wrangling 
library( ggplot2 )    # nice graphs 
library( stargazer )  # reports descriptive statistics 
library( pander )     # formats tables for printing in RMD files  

Load Data

This file was generated from the data steps.

dat <- read.csv( "https://raw.githubusercontent.com/lecy/political-ideology-of-nonprofits/master/DATA/02-processed-data/CensusPlusVotingAll.csv", stringsAsFactors=F )

dat <- 
  dat %>% 
  mutate( pop.density = 10000*(pop10 / Shape_area),
          vote.dem = Pres_D_08 / (Pres_D_08+Pres_R_08) ) %>% 
  select( vote.dem, white, black, poverty, pop.density, 
          medianage, income, totalpop,  hispanic, 
          vtd.key1, tract.key, pop10, Shape_area,
          Pres_D_08, Pres_R_08 ) 

head( dat )

Restrict Data to Supermajority Voting Districts

Restrict the sample to only districts that voted over 70% in favor of Obama in 2008, or over 70% in favor of McCain.

dat$dem.super <- dat$vote.dem >= 0.70 

dat$repub.super <- dat$vote.dem <= 0.30 

dat.super <- dat[ dat$vote.dem <= 0.30 | dat$vote.dem >= 0.70 , ]

dat.super <- na.omit( dat.super )

dat.super$party <- ifelse( dat.super$dem.super, "DEM", "REP" )
dat.super$party <- factor( dat.super$party )

Match Voting Districts

library( MatchIt )
library( rgenoud )

For details on the matching procedures see:

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1-24.

Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software,42(i08).

Sekhon, Jasjeet S. 2011. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.” Journal of Statistical Software. 42(7): 1-52.

match.dat <- dat.super[ c("vtd.key1", "tract.key", "party", "dem.super",
                       "pop.density","income","white",
                       "poverty","black","hispanic","medianage") ]

table( match.dat$party ) 
## 
## DEM REP 
## 729 900

All supermajority districts in the sample:

match.dat %>%
  select( dem.super, poverty, white, black, pop.density, 
          income, medianage, hispanic ) %>% 
          stargazer( type = "html", digits=2 )
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
dem.super 1,629 0.45 0.50 0 0 1 1
poverty 1,629 18.30 13.85 0.00 7.40 26.83 82.30
white 1,629 70.42 25.00 0.95 56.35 89.89 100.00
black 1,629 16.01 23.29 0.00 0.81 21.55 98.08
pop.density 1,629 9.41 13.45 0.0004 0.36 14.12 302.27
income 1,629 52,812.08 30,786.88 10,000 32,123 61,481 224,097
medianage 1,629 36.83 8.40 14.00 30.40 42.90 66.20
hispanic 1,629 33.53 29.70 0.00 10.48 50.20 100.00
head( match.dat )

The supermajority districts have very different demographic profiles.

ddat <- match.dat

ggplot( ddat, aes( x=poverty, fill=party )) + 
        geom_density(alpha = 0.5) + # xlim(10,15) +
        xlab( "Poverty Rate" )

ggplot( ddat, aes( x=black, fill=party )) + 
        geom_density(alpha = 0.5) + # xlim(10,15) +
        xlab( "Percent Black" )

ggplot( ddat, aes( x=pop.density, fill=party )) + 
        geom_density(alpha = 0.5) + xlim(0,25) +
        xlab( "Population Density" )

Matching Algorithm

We explored several of the matching procedures, but many failed to produce balanced samples.

The approach presented here generates a weighted matched set using a genetic search algorithm available in the rgenoud package.

Depending upon the underlying data, you can refined the matched sample by selecting pairs of voting districts from each set based upon matching weights.

# genetic search is a stochastic algorithm 
# which uses random start points. 
# Use a set seed if you want results to 
# be reproducible. 

set.seed( 1234 )

# reset row names to make subsets easier below

row.names( match.dat ) <- 1:nrow(match.dat) 

m.out <- matchit( dem.super ~ poverty + black + pop.density, 
                  data = match.dat, 
                  method = "genetic", 
                  discard="both", reestimate=TRUE, 
                  replace=FALSE, caliper=.25 )

# discard="both" disregards all cases outside of the common support region 
#   prior to the matching process 
#
# reestimate=TRUE recalculates distance metrics after discarding cases
#   outside of the common support region 
#
# replace=FALSE disallows one "control group" case being matched with multiple 
#   treatment group cases
#
# caliper=0.25 prunes all matches that have a fit worse than a 0.25 sd distance apart

If the matching process is successful you will get the following message (suppressed here for brevity because a bunch of output is printed at each iteration):

No significant improvement in 4 generations.

Solution Lexical Fitness Value:
9.678960e-03  2.358318e-02  4.812874e-02  6.144835e-02  6.619846e-02  6.355658e-01  9.781473e-01  9.998425e-01  

Parameters at the Solution:

 X[ 1] :    9.991076e+02
 X[ 2] :    4.674203e+02
 X[ 3] :    1.078070e+02
 X[ 4] :    2.048367e+01

Solution Found Generation 9
Number of Generations Run 14

Thu Jul 23 20:45:46 2020
Total run time : 0 hours 0 minutes and 21 seconds

The final matched sample is as follows:

m.out$nn %>% pander()
  Control Treated
All 900 729
Matched 109 109
Unmatched 555 610
Discarded 236 10

The treatment group is democratic supermajority districts, and the control group is republican supermajority districts.

There were 236 republican districts discarded because they fell outside of the general region of common support, but only 10 democratic districts. This suggests that republican districts tended to be more extreme (very white, more suburban or rural, and much wealthier.

Of the remaining cases after the discard stage we were able to find “doppenganger” districts for 109 cases or approximately 15% of the remaining candidates: 109 / (109+610) for democrats, and 109 / (109+555) for republicans.

These are the cases that we would interpret as being demographically “identical” except for political ideology (on a small set of covariates, at least).

Examine Matches

We can check to see whether we have achieved balance:

summary( m.out )
## 
## Call:
## matchit(formula = dem.super ~ poverty + black + pop.density, 
##     data = match.dat, method = "genetic", discard = "both", reestimate = TRUE, 
##     replace = FALSE, caliper = 0.25)
## 
## Summary of balance for all data:
##             Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance           0.8500        0.1249     0.1890    0.7251  0.8441   0.7255
## poverty           28.3131       10.1851     7.5186   18.1280 18.6200  18.1547
## black             29.2939        5.2478     7.4964   24.0461 20.0000  24.0784
## pop.density       16.2054        3.9038     5.7350   12.3016 12.5110  12.3180
##              eQQ Max
## distance      0.9299
## poverty      32.0100
## black        68.0600
## pop.density 259.7431
## 
## 
## Summary of balance for matched data:
##             Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance           0.4181        0.4119     0.2811    0.0062   0.009   0.0109
## poverty           19.5198       19.2818     8.0723    0.2380   0.410   0.5565
## black              8.4061        8.8911     9.7407   -0.4850   0.900   1.0183
## pop.density        7.0231        6.7103     6.6653    0.3128   0.362   0.4183
##             eQQ Max
## distance     0.0470
## poverty      2.3300
## black        4.9500
## pop.density  1.3693
## 
## Percent Balance Improvement:
##             Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance       99.1387 98.9323  98.4955 94.9419
## poverty        98.6872 97.7981  96.9346 92.7210
## black          97.9828 95.5000  95.7711 92.7270
## pop.density    97.4574 97.1063  96.6043 99.4728
## 
## Sample sizes:
##           Control Treated
## All           900     729
## Matched       109     109
## Unmatched     555     610
## Discarded     236      10

Graphically:

par( mfrow=c(1,2) )
plot( m.out, type="jitter" )
plot( m.out )

For some more graphical representations of balance see the cobalt package in R.

We can also print the list of matches. Each number represents the row ID of the observation, and an NA in the control group means no match was found for that particular democratic district.

matches <- data.frame( treatment.group.id = row.names( m.out$match.matrix), 
                       control.group.id = m.out$match.matrix )

head( matches, 25 )

When can then examine matched pairs to see how closely they align on the covariates:

# compare quality of matches
# select id (row) of treated and id (row) of comparison,
# compare on model covariates:

# note the nice balance: 

covariates <- c("dem.super","poverty","black","pop.density")

matched.rows <- c(63,889)
match.dat[ matched.rows , covariates ] %>% pander()
  dem.super poverty black pop.density
63 TRUE 19.43 0 10.86
889 FALSE 19.39 4.71 9.181
matched.rows <- c(64,612)
match.dat[ matched.rows , covariates ] %>% pander()
  dem.super poverty black pop.density
64 TRUE 27.69 0 13.84
612 FALSE 27.37 1.37 15.47

Save the Balanced Dataset

To extract our new balanced data frame we can use the match.data() function from MatchIt.

balanced.data <- match.data( m.out )
dim( balanced.data )
## [1] 218  13

We now have 109 rows for democrats, and 109 rows of their republican dopelganger districts.

Save the new data frame:

write.csv( balanced.data, "../DATA/02-processed-data/matched-pairs.csv", row.names=F )

Although not necessary, we might also save the list of which cases are matched:

matches <- na.omit( matches )
write.csv( matches, "../DATA/02-processed-data/matched-pairs-index.csv", row.names=F )

It may come in handy if you want to visualize matches, or do some post-hoc sensitivity analysis.

One other good reason to set the random seed before creating the matched samples.

Check Balance

We no longer have any demographic differences between our “treatment” and “control” group. They differ primarily on political ideology.

table( balanced.data$party )
## 
## DEM REP 
## 109 109
t.test( as.numeric(balanced.data$poverty) ~ balanced.data$party )
## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(balanced.data$poverty) by balanced.data$party
## t = 0.21366, df = 215.72, p-value = 0.831
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.957431  2.433394
## sample estimates:
## mean in group DEM mean in group REP 
##          19.51982          19.28183
t.test( as.numeric(balanced.data$black) ~ balanced.data$party )
## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(balanced.data$black) by balanced.data$party
## t = -0.35699, df = 215.3, p-value = 0.7214
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.163094  2.193002
## sample estimates:
## mean in group DEM mean in group REP 
##          8.406055          8.891101
t.test( balanced.data$pop.density ~ balanced.data$party )
## 
##  Welch Two Sample t-test
## 
## data:  balanced.data$pop.density by balanced.data$party
## t = 0.347, df = 216, p-value = 0.7289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.463864  2.089434
## sample estimates:
## mean in group DEM mean in group REP 
##          7.023111          6.710326
ddat <- balanced.data

ggplot( ddat, aes( x=poverty, fill=party )) + 
        geom_density(alpha = 0.5) + # xlim(10,15) +
        xlab( "Poverty Rate" )

ggplot( ddat, aes( x=black, fill=party )) + 
        geom_density(alpha = 0.5) + # xlim(10,15) +
        xlab( "Percent Black" )

ggplot( ddat, aes( x=pop.density, fill=party )) + 
        geom_density(alpha = 0.5) + xlim(0,50) +
        xlab( "Population Density" )




Citation

Lecy, J. D., Ashley, S. R., & Santamarina, F. J. (2019). Do nonprofit missions vary by the political ideology of supporting communities? Some preliminary results. Public Performance & Management Review, 42(1), 115-141. DOWNLOAD

Matching Packages:

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1-24.

Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software,42(i08).

Sekhon, Jasjeet S. 2011. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.” Journal of Statistical Software. 42(7): 1-52.