Lab 6 引入协变量的RDD效应分析(儿童死亡率案例)

6.1 实验说明

6.1.1 实验内容

实验目标:引入协变量情况下,骤变RDD的局部线性回归估计及分析,复现儿童死亡率案例(包括Table21.1、Figure 21.2a的全部过程及结果(参看:HANSEN B. Econometrics[M].2021(作者手稿). 第21章 Regression Discontinuity. )

主要内容包括:

  • 在引入两个额外协变量情形下,基于骤变RDD(SRDD)模型,进行局部线性回归方法(LLR)分析,估计其断点处置效应ATE(\(\widehat{\theta}\))及其估计标准误。

  • 对比无协变量SRDD模型(上一次实验)和有协变量SRDD,对这两个模型进行综合比较,并得出相关结论

6.1.2 材料准备

  • 软件环境:编译器为Rstudio(要求R 4.1及以上版本)

  • R代码文件:工作项目(R project)根目录下Rscript/文件夹内

# don't run here ! 
# only show you the code file path!
source("../Rscript/hansen21-fig21-2a.R", encoding = "UTF-8")

注意:上述代码运行时间较长(约3-5分钟),为提高渲染效率代码文件仅需运行一次。

6.2 案例描述

6.2.1 死亡率案例(续):背景说明

援助项目与儿童死亡率

我们继续使用前面(Ludwig J 等, 2007)的研究案例,来评估美国联邦政府脱贫援助项目(Head Start)对儿童死亡率的骤变RDD政策效应。现在我们考虑使用两个协变量(covariates):

  • 县级黑人人口占比black pop percentage\(Z_a\)

  • 县级城镇人口占比urban pop percentage\(Z_a\)

]

  • 上述两个协变量,本质上可以视作为收入变量(income)的代理变量(proxy)。

  • 下面我们将使用(Robinson P M, 1988)半参数效率估计方法来评估项目援助的断点处置效应(RDD ATE)。

6.2.2 读取案例数据

##  This file generates Figure 21.2a and Table 21.1 Covariates
##  Head Start RDD with Covariates
##  This file uses the package haven
##  This file uses the data file LM2007.dta

####  The data file LM2007.dta is used ####
library(haven)

dt_path <- here::here("data/LM2007.dta")
dat <- read_dta(dt_path)
y <- dat$mort_age59_related_postHS
x <- dat$povrate60
Za <- dat$census1960_pctblack
Zb <- dat$census1960_pcturban

c <- 59.1984

dt_head <- tibble(X= x,
                  Y = y,
                  Za = Za,
                  Zb = Zb) %>%
  mutate(D= ifelse(X>=c,1,0))

6.2.3 数据呈现

dt_head %>%
  add_column(obs = 1:nrow(.), .before = "X") %>%
  DT::datatable(
    caption = paste0("增加协变量的援助项目数据集(n=2783)"),
    options = list(dom = "tip", pageLength =8))%>%
  formatRound(c(2:5), digits = c(4,4,1,1))
  • 样本数据的描述性统计如下:
summary(dt_head)
       X            Y             Za           Zb     
 Min.   :15   Min.   :  0   Min.   : 0   Min.   :  0  
 1st Qu.:24   1st Qu.:  0   1st Qu.: 0   1st Qu.:  0  
 Median :34   Median :  0   Median : 2   Median : 28  
 Mean   :37   Mean   :  2   Mean   :11   Mean   : 29  
 3rd Qu.:47   3rd Qu.:  3   3rd Qu.:15   3rd Qu.: 48  
 Max.   :82   Max.   :136   Max.   :83   Max.   :100  
       D       
 Min.   :0.00  
 1st Qu.:0.00  
 Median :0.00  
 Mean   :0.11  
 3rd Qu.:0.00  
 Max.   :1.00  

6.2.4 数据分组描述性统计

我们根据处置变量D进行数据分组,并进行描述性统计,统计分析代码如下:

smry_grouped <- dt_head %>%
  group_by(D) %>%
  dplyr::summarize(n = n(),
            x_mean = mean(X, na.rm =T),
            x_min = min(X,na.rm = T),
            x_max = max(X, na.rm = T),
            x_sd = sd(X),
            y_mean = mean(Y, na.rm =T),
            y_min = min(Y,na.rm = T),
            y_max = max(Y, na.rm = T),
            y_sd = sd(Y),
            za_mean = mean(Za, na.rm =T),
            za_min = min(Za,na.rm = T),
            za_max = max(Za, na.rm = T),
            za_sd = sd(Za),
            zb_mean = mean(Zb, na.rm =T),
            zb_min = min(Zb,na.rm = T),
            zb_max = max(Zb, na.rm = T),
            zb_sd = sd(Zb)
            ) %>%
  pivot_longer(names_to = "stat", values_to = "value", -D) %>%
  pivot_wider(names_from = D, values_from = value) %>%
  rename_all(., ~c("stats","D0","D1"))

处置组和控制组描述性统计表如下:

smry_grouped %>% 
  arrange(desc(stats)) %>%
  DT::datatable(
    caption = paste0("处置组和控制组描述性统计(g=",n,")"),
    options = list(dom = "tip", 
                   pageLength =10,
                   scrollX = TRUE))%>%
  formatRound(c(2:3), digits = 2)

6.2.5 数据散点图

#### basic plot ####

lwd <- 0.6
fsize <- 16

p0 <- ggplot() +
  geom_point(aes(X, Y),data = dt_head, pch=21,alpha=0.6) +
  geom_vline(aes(xintercept=c),
             color = "orange", lty = "dotted", lwd=lwd) +
  labs(x= "X贫困线", y ="Y儿童死亡率") +
  scale_y_continuous(breaks = seq(0,80,20), limits = c(0,80)) +
  scale_x_continuous(breaks = seq(10,90,10), limits = c(10,90)) +
  theme_bw() +
  theme(text = element_text(size = fsize))
# show the scatter
p0
样本数据散点图n=`r n`

图 6.1: 样本数据散点图n=r n

6.3 过程0:设定相关参数和辅助计算工具

6.3.1 R运算代码

考虑到对断点两侧的数据,我们要多次进行局部线性估计(LLR)、计算预测误差、计算标准误(从而计算置信区间),因此我们可以事先设计并封装好一些有用的辅助函数(辅助工具)。

  • 为了简化计算,我们这里不再进行LLR的最优谱宽选择过程,而是直接先验设定好一个初始值(\(h=8\))。

  • 我们会多次分别对断点两侧运用LLR估计流程(我们已经在非参数局部回归估计中学习过了!)

#### RDD Helper function ####

# specify parameters and basic calculation
h <- 8
T <- as.numeric(x >= c)
y1 <- y[T==0]
y2 <- y[T==1]
x1 <- x[T==0]
x2 <- x[T==1]
n1 <- length(y1)
n2 <- length(y2)
n <- n1+n2

# specify bins
g1 <- seq(15,59.2,.2)
g2 <- seq(59.2,82,.2)
G1 <- length(g1)
G2 <- length(g2)

# Helper function for  triangle kernel function
TriKernel <- function(x) {
  s6 <- sqrt(6)
  ax <- abs(x)
  f <- (1-ax/s6)*(ax < s6)/s6
  return(f)
}

# Helper function for LLR estimation
LL_EST <- function(y,x,g,h) {
  G <- length(g)
  m <- matrix(0,G,1)
  z <- matrix(1,length(y),1)
  for (j in 1:G){
    xj <- x-g[j]
    K <- TriKernel(xj/h)
    zj <- cbind(z,xj)
    zz <- t(cbind(K,xj*K))
    beta <- solve(zz%*%zj,zz%*%y)
    m[j] <- beta[1]
  }
  return(m)
}

# Helper function for forecast residuals
LL_Residual <- function(y,x,h) {
  n <- length(y)
  e <- matrix(0,n,1)
  z <- matrix(1,n,1)
  for (j in 1:n){
    xj <- x-x[j]
    K <- TriKernel(xj/h)
    K[j] <- 0
    zj <- cbind(z,xj)
    zz <- t(cbind(K,xj*K))
    beta <- solve(zz%*%zj,zz%*%y)
    e[j] <- y[j] - beta[1]
  }
  return(e)
}

# Helper function for standard error
LL_SE <- function(y,x,g,h) {
  G <- length(g)
  s <- matrix(0,G,1)
  z <- matrix(1,length(y),1)
  e <- LL_Residual(y,x,h)
  for (j in 1:G){
    xj <- x-g[j]
    K <- TriKernel(xj/h)
    zj <- cbind(z,xj)
    zz <- cbind(K,xj*K)
    ZKZ <- solve(t(zz)%*%zj)
    Ke <- K*e
    ze <- cbind(Ke,xj*Ke)
    V <- ZKZ%*%crossprod(ze)%*%ZKZ
    s[j] = sqrt(V[1,1])
  }
  return(s)
}

6.3.2 过程步骤解读:协变量RDD分析的规则策略

在进行协变量RDD分析之前,我们设定如下的规则策略:

  • 规则1:我们设定先验谱宽\(h=8\),断点值设定为 \(c =59.1984\%\)

  • 规则2:分别设定断点两边箱组中心点序列值(center of bins)。我们将采用非对称箱组设置方法:

  • 控制组(断点左边)的评估范围为 \([15, 59.2]\),序列间隔为0.2。评估总箱组数为 \(g1=222\),待评估序列值为 \(15.0, 15.2, 15.4, 15.6, 15.8, \cdots,58.6, 58.8, 59.0, 59.2\)
  • 处置组(断点右边)的评估范围为 \([59.2, 82]\),序列间隔为0.2。评估总箱组数为 \(g2=115\),待评估序列值为 \(59.2, 59.4, 59.6, 59.8, 60.0, \cdots,81.4, 81.6, 81.8, 82.0\)
  • 规则3:如果使用局部线性估计法(LL),则采用三角核函数(triangle kenerl)。

  • 规则4:我们将使用(Robinson P M, 1988)半参数效率估计方法来评估断点处置效应(RDD ATE)。

6.4 过程2:协变量SRDD估计

协变量SRDD过程(R代码)如下:

#### RDD Estimation ####

#### step 1: LL estimate Y on X ####
## obtain residual of stage 1 
e1 <- LL_Residual(y1,x1,h)
e2 <- LL_Residual(y2,x2,h)
e <- rbind(e1,e2)

#### step 2: LL estimate Z on X ####
## obtain residual of stage 2 
Ra <- LL_Residual(Za,x,h)
Rb <- LL_Residual(Zb,x,h)
Z <- cbind(Za,Zb)
R <- cbind(Ra,Rb)

# step 3: OLS  ####
## regress estimate residual of stage 1 on residual of stage 2
## obtain the coefficient and standard errors
## no intercept !
XXR <- solve(crossprod(R))     # inverse matrix
beta <- XXR%*%(t(R)%*%e)
u <- e - R%*%beta
Ru <- R*(u%*%matrix(1,1,ncol(R)))
V <- XXR%*%crossprod(Ru)%*%XXR    # robust se
sbeta <- sqrt(diag(V))

tstat <- beta/sbeta
pvalue <- 2*(1-pnorm(abs(tstat)))
betas <- cbind(beta,sbeta,tstat,pvalue)

#### step 4: Construct the residual ####
## obtain constructed residual

zm <- colMeans(Z)%*%beta

Z1 <- Z[T==0,]
Z2 <- Z[T==1,]
yZ <- y - Z%*%beta
yZ1 <- y1 - Z1%*%beta
yZ2 <- y2 - Z2%*%beta

#### step 5: LL ATE estimate ####
## LL estimate of constructed residual on X
## obtain  ATE and standard errors

m1 <- LL_EST(yZ1,x1,g1,h) + matrix(zm,G1,1)
m2 <- LL_EST(yZ2,x2,g2,h) + matrix(zm,G2,1)
s1 <- LL_SE(yZ1,x1,g1,h)
s2 <- LL_SE(yZ2,x2,g2,h)
L1 <- m1-1.96*s1
U1 <- m1+1.96*s1
L2 <- m2-1.96*s2
U2 <- m2+1.96*s2

# rdd effect
theta_lft <- m1[G1]
theta_rgt <- m2[1]
theta <- m2[1]-m1[G1]
setheta <- sqrt(s1[G1]^2 + s2[1]^2)
tstat <- theta/setheta
pvalue <- 2*(1-pnorm(abs(tstat)))

6.5 过程3:将RDD估计结果整理成相关表格

6.5.1 将断点两侧的m(x)估计结果整理成表格(R代码)

#### tibble of mx estimate related ####

tbl_mx1 <- tibble(group = "control",
                  xg = g1,
                  mx=as.vector(m1),
                  s = as.vector(s1),
                  lwr = as.vector(L1), 
                  upr = as.vector(U1))

tbl_mx2 <- tibble(group = "treat",
                  xg = g2,
                  mx=as.vector(m2),
                  s = as.vector(s2),
                  lwr = as.vector(L2), 
                  upr = as.vector(U2))

tb_mxh <- bind_rows(tbl_mx1, tbl_mx2)

6.5.2 将RDD预测误差整理成表格(R代码)

#### tibble of sample result ####
tbl_residual <- dt_head %>%
  mutate(e = as.vector(e),   # residual of LL 1
         Ra = as.vector(Ra),   # residual of LL 2
         Rb = as.vector(Rb),
         RZ = as.vector(yZ) # residual of construct
         )

6.5.3 将最终RDD断点效应整理成表格(R代码)

#### tibble of estimate result ####
tbl_theta02 <- tibble(
  model = "covariate",
  pars = c("theta", "black","urban"),
  est = c(theta, as.vector(beta)),
  se  = c(setheta, as.vector(sbeta))
)

6.6 过程4:绘制RDD分析图(mx估计值)

6.6.1 对绘制底图(R代码)

lwd <- 0.6
lwadd <- 0.2

p00 <- ggplot() +
  geom_point(aes(X, Y),data = dt_head, pch=21,alpha=0.3) +
  geom_vline(aes(xintercept=c),
             color = "red", lty = "dotted", lwd=lwd) +
  labs(x= "X贫困线", y ="Y儿童死亡率") +
  scale_y_continuous(breaks = seq(0,5,1), limits = c(0,5)) +
  scale_x_continuous(breaks = seq(10,85,10), limits = c(10,85)) +
  theme_bw() +
  theme(text = element_text(size = fsize))

6.6.2 对断点左边绘制mx估计图(R代码)

# mx plot for left part
p_mxh1 <- p00 +
  geom_line(aes(x = xg, y = mx, 
                color="m1", lty="m1"),
            lwd = lwd+lwadd,
            data = base::subset(tb_mxh,group=="control")) +
  scale_color_manual(
    name="",
    breaks = c("m1"),
    labels = c(expression(m(x):control)),
    values=c("purple"))+
  scale_linetype_manual(
    name="",
    breaks = c("m1"),
    labels = c(expression(m(x):control)),
    values=c("solid"))+
  theme(legend.position = "right",
        text = element_text(size = fsize))

6.6.3 对断点右边绘制mx估计图(R代码)

# mx plot for right part
p_mxh2 <- p00 +
  geom_line(aes(x = xg, y = mx, 
                color="m2", lty="m2"),
            lwd = lwd+lwadd,
            data = base::subset(tb_mxh,group=="treat")) +
  scale_color_manual(
    name="",
    breaks = c("m2"),
    labels = c(expression(m(x):treat)),
    values=c("blue"))+
  scale_linetype_manual(
    name="",
    breaks = c("m2"),
    labels = c(expression(m(x):treat)),
    values=c("solid"))+
  theme(legend.position = "right",
        text = element_text(size = fsize))

6.6.4 同时对断点两边绘制mx估计图(R代码)

# mx plot for both side
p_mxh <- p00 +
  geom_line(aes(x = xg, y = mx, 
                color="m1", lty="m1"),
            lwd = lwd+lwadd,
            data = base::subset(tb_mxh,group=="control")) +
  geom_line(aes(x = xg, y = mx, 
                color="m2", lty="m2"), 
            lwd = lwd+lwadd,
            data = base::subset(tb_mxh,group=="treat")) +
  scale_color_manual(
    name="",
    breaks = c("m1", "m2"),
    labels = c(expression(m(x):control),expression(m(x):treat)),
    values=c("purple", "blue"))+
  scale_linetype_manual(
    name="",
    breaks = c("m1", "m2"),
    labels = c(expression(m(x):control),expression(m(x):treat)),
    values=c("solid", "solid"))+
  theme(legend.position = "right",
        text = element_text(size = fsize))

6.7 过程5:绘制RDD置信带图

6.7.1 对断点左边绘制置信带(R代码)

#### Plot Confidence Bands ####

# band plot for left side
p_band_left <- p_mxh1 +
  geom_ribbon(aes(x=xg,ymin = lwr, ymax = upr),
              data = base::subset(tb_mxh,group=="control"),
              alpha = 0.2) 

6.7.2 对断点右边绘制置信带(R代码)

# band plot for right side
p_band_right <- p_mxh2 +
  geom_ribbon(aes(x=xg,ymin = lwr, ymax = upr),
              data = base::subset(tb_mxh,group=="treat"),
              alpha = 0.2) 

6.7.3 对断点两边同时绘制置信带(R代码)

# band plot for right side
# band plot for both side
p_band_cov <- p_mxh +
  geom_ribbon(aes(x=xg,ymin = lwr, ymax = upr),
              data = base::subset(tb_mxh,group=="control"),
              alpha = 0.2) +
  geom_ribbon(aes(x=xg,ymin = lwr, ymax = upr),
              data = base::subset(tb_mxh,group=="treat"),
              alpha = 0.2) +
  geom_text(aes(x = c-3, y = theta_lft, 
                label=number(theta_lft, 0.0001)),
            color = "purple")+
  geom_text(aes(x = c+3, y = theta_rgt, 
                label=number(theta_rgt, 0.0001)),
            color = "blue") +
  geom_segment(aes(x=c, xend = c,
                   y= theta_lft, yend = theta_rgt),
               arrow = arrow(length = unit(0.1,"cm"),
                             type = "closed"),
               lty = "solid", color= "orange", lwd=lwd)  +
  geom_text(aes(x = c+5, y=theta_lft+0.5*theta), 
            label = TeX(paste0("\\hat{\\theta}=",number(theta,0.0001))),
            color= "orange", size=4)

6.8 协变量RDD分析的过程步骤解读

6.8.1 第1阶段LLR估计残差

  • 步骤1:直接采用前面的局部线性回归方法(LLR),用 \(Y_i\)\(X_i\)进行LL回归,得到第1阶段的结果变量的拟合值 \(\widehat{m}_i = \widehat{m}_i(X_i)\),并进一步构造留一法残差a \(Y_i - \widehat{m}_i\)
tbl_residual %>%
  add_column(obs = 1:nrow(.), .before = "X") %>%
  select(obs,D,X,Y,Za,Zb,e) %>%
  DT::datatable(
    caption = paste0("RDD LL估计得到的残差(n=",n,")"),
    options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(3:7), digits = c(4,4,1,1,4))

a 这个阶段的残差序列用e命名。

6.8.2 第2阶段LLR估计残差

  • 步骤2:同上,依次做 \(Z_a\)\(X_i\)\(Z_b\)\(X_i\)局部线性回归(LLR),并分别得到协变量的拟合值 \(\widehat{g}_{1i},\widehat{g}_{2i}\),及其对应残差a \((Z_a-\widehat{g}_{1i}),(Z_b-\widehat{g}_{2i})\)
tbl_residual %>%
  add_column(obs = 1:nrow(.), .before = "X") %>%
  select(obs,D,X,Y,Za,Zb,e,Ra,Rb) %>%
  DT::datatable(
    caption = paste0("RDD LL估计得到的残差(n=",n,")"),
    options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(3:9), digits = c(4,4,1,1,4,4,4))

a 这个阶段的两个残差序列分别用RaRb命名。

6.8.3 第3阶段OLS估计(模型)

  • 步骤3:利用前面两个阶段的残差,做 \(Y_i -m_{i}\)\(Z_{i1}-\widehat{g}_{1i},Z_{i2}-\widehat{g}_{2i},\ldots,Z_{ik}-\widehat{g}_{ki}\)无截距的普通最小二乘回归(OLS),并得到估计系数 \(\hat{\beta}\)及其标准误

\[\begin{align} (Y_i -m_{i}) &= \hat{\beta}_1(Z_{ia}-\widehat{g}_{1i})+\hat{\beta}_2(Z_{ib}-\widehat{g}_{2i})\\ e&=\hat{\beta}_{1}R_a + \hat{\beta}_{2}R_a \end{align}\]

6.8.4 第3阶段OLS估计(结果)

  • 上述模型,未矫正标准误下OLS估计的结果如下a
mod_res <- formula(e~-1+Ra+Rb)
lx_est <- xmerit::lx.est(lm.mod = mod_res , 
                         lm.dt = tbl_residual, 
                         lm.n = 5,
                         opt = c("s","t","p"), 
                         inf = c("over"),
                         digits = c(4,4,2,4))

\[\begin{equation} \begin{alignedat}{999} &\widehat{e}=&&+0.0265Ra_i&&-0.0094Rb_i\\ &(s)&&(0.0083)&&(0.0045)\\ &(t)&&(+3.19)&&(-2.08)\\ &(p)&&(0.0014)&&(0.0377)\\ &(over)&&n=2783&&\hat{\sigma}=5.7091 \end{alignedat} \end{equation}\]

  • 上述模型,进行稳健标准误矫正OLS估计的结果如下b
library(robustbase)
library(sandwich)
library(lmtest)
library(modelr)
library(broom)
# regression with robust standard errors using R
## see [url](https://www.brodrigues.co/blog/2018-07-08-rob_stderr/)

lm.out <- lm(formula = mod_res, data =tbl_residual )

broom::tidy(
  lmtest::coeftest(lm.out, 
                   vcov = sandwich::vcovHC(lm.out))
  ) %>%
  DT::datatable(
    caption = paste0("稳健标准误OLS估计(n=",n,")"),
    options = list(dom = "t", pageLength =6))%>%
  formatRound(c(2:5), digits = c(4,4,2,4))

a b 两种OLS估计程序下,回归系数都相同,只是系数对应的标准误不一样。这里我们仅需要用到回归系数,因此不影响后续步骤。

6.8.5 第4阶段构造残差

  • 步骤4:利用前面的OLS估计系数,我们就可以构造得到残差 \(\hat{e}_i=Y_i - Z^{\prime}_i\hat{\beta}\)
tbl_residual %>%
  add_column(obs = 1:nrow(.), .before = "X") %>%
  select(obs,D,X,Y,Za,Zb,e,Ra,Rb,RZ) %>%
  DT::datatable(
    caption = paste0("RDD LL估计得到的残差(n=",n,")"),
    options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(3:10), digits = c(4,4,1,1,4,4,4,4))