This file was generated from the data steps.
dat <- read.csv( "https://raw.githubusercontent.com/lecy/political-ideology-of-nonprofits/master/DATA/02-processed-data/CensusPlusVotingAll.csv", stringsAsFactors=F )
dat <-
dat %>%
mutate( pop.density = 10000*(pop10 / Shape_area),
vote.dem = Pres_D_08 / (Pres_D_08+Pres_R_08) ) %>%
select( vote.dem, white, black, poverty, pop.density,
medianage, income, totalpop, hispanic,
vtd.key1, tract.key, pop10, Shape_area,
Pres_D_08, Pres_R_08 )
head( dat )
Restrict the sample to only districts that voted over 70% in favor of Obama in 2008, or over 70% in favor of McCain.
For details on the matching procedures see:
Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1-24.
Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software,42(i08).
Sekhon, Jasjeet S. 2011. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.” Journal of Statistical Software. 42(7): 1-52.
match.dat <- dat.super[ c("vtd.key1", "tract.key", "party", "dem.super",
"pop.density","income","white",
"poverty","black","hispanic","medianage") ]
table( match.dat$party )
##
## DEM REP
## 729 900
All supermajority districts in the sample:
match.dat %>%
select( dem.super, poverty, white, black, pop.density,
income, medianage, hispanic ) %>%
stargazer( type = "html", digits=2 )
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
dem.super | 1,629 | 0.45 | 0.50 | 0 | 0 | 1 | 1 |
poverty | 1,629 | 18.30 | 13.85 | 0.00 | 7.40 | 26.83 | 82.30 |
white | 1,629 | 70.42 | 25.00 | 0.95 | 56.35 | 89.89 | 100.00 |
black | 1,629 | 16.01 | 23.29 | 0.00 | 0.81 | 21.55 | 98.08 |
pop.density | 1,629 | 9.41 | 13.45 | 0.0004 | 0.36 | 14.12 | 302.27 |
income | 1,629 | 52,812.08 | 30,786.88 | 10,000 | 32,123 | 61,481 | 224,097 |
medianage | 1,629 | 36.83 | 8.40 | 14.00 | 30.40 | 42.90 | 66.20 |
hispanic | 1,629 | 33.53 | 29.70 | 0.00 | 10.48 | 50.20 | 100.00 |
The supermajority districts have very different demographic profiles.
ddat <- match.dat
ggplot( ddat, aes( x=poverty, fill=party )) +
geom_density(alpha = 0.5) + # xlim(10,15) +
xlab( "Poverty Rate" )
ggplot( ddat, aes( x=black, fill=party )) +
geom_density(alpha = 0.5) + # xlim(10,15) +
xlab( "Percent Black" )
ggplot( ddat, aes( x=pop.density, fill=party )) +
geom_density(alpha = 0.5) + xlim(0,25) +
xlab( "Population Density" )
We explored several of the matching procedures, but many failed to produce balanced samples.
The approach presented here generates a weighted matched set using a genetic search algorithm available in the rgenoud package.
Depending upon the underlying data, you can refined the matched sample by selecting pairs of voting districts from each set based upon matching weights.
# genetic search is a stochastic algorithm
# which uses random start points.
# Use a set seed if you want results to
# be reproducible.
set.seed( 1234 )
# reset row names to make subsets easier below
row.names( match.dat ) <- 1:nrow(match.dat)
m.out <- matchit( dem.super ~ poverty + black + pop.density,
data = match.dat,
method = "genetic",
discard="both", reestimate=TRUE,
replace=FALSE, caliper=.25 )
# discard="both" disregards all cases outside of the common support region
# prior to the matching process
#
# reestimate=TRUE recalculates distance metrics after discarding cases
# outside of the common support region
#
# replace=FALSE disallows one "control group" case being matched with multiple
# treatment group cases
#
# caliper=0.25 prunes all matches that have a fit worse than a 0.25 sd distance apart
If the matching process is successful you will get the following message (suppressed here for brevity because a bunch of output is printed at each iteration):
No significant improvement in 4 generations.
Solution Lexical Fitness Value:
9.678960e-03 2.358318e-02 4.812874e-02 6.144835e-02 6.619846e-02 6.355658e-01 9.781473e-01 9.998425e-01
Parameters at the Solution:
X[ 1] : 9.991076e+02
X[ 2] : 4.674203e+02
X[ 3] : 1.078070e+02
X[ 4] : 2.048367e+01
Solution Found Generation 9
Number of Generations Run 14
Thu Jul 23 20:45:46 2020
Total run time : 0 hours 0 minutes and 21 seconds
The final matched sample is as follows:
Control | Treated | |
---|---|---|
All | 900 | 729 |
Matched | 109 | 109 |
Unmatched | 555 | 610 |
Discarded | 236 | 10 |
The treatment group is democratic supermajority districts, and the control group is republican supermajority districts.
There were 236 republican districts discarded because they fell outside of the general region of common support, but only 10 democratic districts. This suggests that republican districts tended to be more extreme (very white, more suburban or rural, and much wealthier.
Of the remaining cases after the discard stage we were able to find “doppenganger” districts for 109 cases or approximately 15% of the remaining candidates: 109 / (109+610) for democrats, and 109 / (109+555) for republicans.
These are the cases that we would interpret as being demographically “identical” except for political ideology (on a small set of covariates, at least).
We can check to see whether we have achieved balance:
##
## Call:
## matchit(formula = dem.super ~ poverty + black + pop.density,
## data = match.dat, method = "genetic", discard = "both", reestimate = TRUE,
## replace = FALSE, caliper = 0.25)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.8500 0.1249 0.1890 0.7251 0.8441 0.7255
## poverty 28.3131 10.1851 7.5186 18.1280 18.6200 18.1547
## black 29.2939 5.2478 7.4964 24.0461 20.0000 24.0784
## pop.density 16.2054 3.9038 5.7350 12.3016 12.5110 12.3180
## eQQ Max
## distance 0.9299
## poverty 32.0100
## black 68.0600
## pop.density 259.7431
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.4181 0.4119 0.2811 0.0062 0.009 0.0109
## poverty 19.5198 19.2818 8.0723 0.2380 0.410 0.5565
## black 8.4061 8.8911 9.7407 -0.4850 0.900 1.0183
## pop.density 7.0231 6.7103 6.6653 0.3128 0.362 0.4183
## eQQ Max
## distance 0.0470
## poverty 2.3300
## black 4.9500
## pop.density 1.3693
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.1387 98.9323 98.4955 94.9419
## poverty 98.6872 97.7981 96.9346 92.7210
## black 97.9828 95.5000 95.7711 92.7270
## pop.density 97.4574 97.1063 96.6043 99.4728
##
## Sample sizes:
## Control Treated
## All 900 729
## Matched 109 109
## Unmatched 555 610
## Discarded 236 10
Graphically:
For some more graphical representations of balance see the cobalt package in R.
We can also print the list of matches. Each number represents the row ID of the observation, and an NA in the control group means no match was found for that particular democratic district.
matches <- data.frame( treatment.group.id = row.names( m.out$match.matrix),
control.group.id = m.out$match.matrix )
head( matches, 25 )
When can then examine matched pairs to see how closely they align on the covariates:
# compare quality of matches
# select id (row) of treated and id (row) of comparison,
# compare on model covariates:
# note the nice balance:
covariates <- c("dem.super","poverty","black","pop.density")
matched.rows <- c(63,889)
match.dat[ matched.rows , covariates ] %>% pander()
dem.super | poverty | black | pop.density | |
---|---|---|---|---|
63 | TRUE | 19.43 | 0 | 10.86 |
889 | FALSE | 19.39 | 4.71 | 9.181 |
dem.super | poverty | black | pop.density | |
---|---|---|---|---|
64 | TRUE | 27.69 | 0 | 13.84 |
612 | FALSE | 27.37 | 1.37 | 15.47 |
To extract our new balanced data frame we can use the match.data() function from MatchIt.
## [1] 218 13
We now have 109 rows for democrats, and 109 rows of their republican dopelganger districts.
Save the new data frame:
Although not necessary, we might also save the list of which cases are matched:
matches <- na.omit( matches )
write.csv( matches, "../DATA/02-processed-data/matched-pairs-index.csv", row.names=F )
It may come in handy if you want to visualize matches, or do some post-hoc sensitivity analysis.
One other good reason to set the random seed before creating the matched samples.
We no longer have any demographic differences between our “treatment” and “control” group. They differ primarily on political ideology.
##
## DEM REP
## 109 109
##
## Welch Two Sample t-test
##
## data: as.numeric(balanced.data$poverty) by balanced.data$party
## t = 0.21366, df = 215.72, p-value = 0.831
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.957431 2.433394
## sample estimates:
## mean in group DEM mean in group REP
## 19.51982 19.28183
##
## Welch Two Sample t-test
##
## data: as.numeric(balanced.data$black) by balanced.data$party
## t = -0.35699, df = 215.3, p-value = 0.7214
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.163094 2.193002
## sample estimates:
## mean in group DEM mean in group REP
## 8.406055 8.891101
##
## Welch Two Sample t-test
##
## data: balanced.data$pop.density by balanced.data$party
## t = 0.347, df = 216, p-value = 0.7289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.463864 2.089434
## sample estimates:
## mean in group DEM mean in group REP
## 7.023111 6.710326
ddat <- balanced.data
ggplot( ddat, aes( x=poverty, fill=party )) +
geom_density(alpha = 0.5) + # xlim(10,15) +
xlab( "Poverty Rate" )
ggplot( ddat, aes( x=black, fill=party )) +
geom_density(alpha = 0.5) + # xlim(10,15) +
xlab( "Percent Black" )
ggplot( ddat, aes( x=pop.density, fill=party )) +
geom_density(alpha = 0.5) + xlim(0,50) +
xlab( "Population Density" )
Lecy, J. D., Ashley, S. R., & Santamarina, F. J. (2019). Do nonprofit missions vary by the political ideology of supporting communities? Some preliminary results. Public Performance & Management Review, 42(1), 115-141. DOWNLOAD
Matching Packages:
Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1-24.
Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software,42(i08).
Sekhon, Jasjeet S. 2011. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.” Journal of Statistical Software. 42(7): 1-52.