Introduction

Education is a crucial factor in the development of a thriving and prosperous society. From increasing economic growth to promoting civic engagement, education remains a key determinant of success in many aspects of life, especially career prospects and income levels[3]. We therefore estimate the Conditional Average Treatment Effect (CATE) of education level on earnings using the “honest” estimator proposed in the paper Recursive Partitioning for Heterogeneous Causal Effects by Susan Athey and Guido Imbens[1]. CATE captures the heterogeneity of a treatment effect across subpopulations when the unconfoundedness assumption holds. The authors developed and benchmarked an unbiased estimator of CATE for subsets of the population whose treatment effects differ, proposing an “honest” approach to estimation.

Research Problems

1. The CATE of college education on yearly income in different years: 1990, 2000, and 2010.

2. The CATE of college education on yearly income for males and females in 2010.

3. The CATE of college education on yearly income for different age groups in 2010: people in their 30s, 40s, and 50s.


[1] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353-7360, July 2016.
[2] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379-423, 1948.
[3] Ulrich Teichler. Higher Education and the World of Work: Conceptual Frameworks, Comparative Perspectives, Empirical Findings. Sense Publishers, 2009.

Methodology

Causal Tree Data EDA

Causal Tree & CATE Estimation

The "honest" approach used to build and validate the Causal Tree extends, and in places departs from, the classification and regression trees (CART) algorithm. A central concern is finding a criterion to evaluate and compare estimators of treatment effects. Because we can never observe both $Y_i(1)$ and $Y_i(0)$ for the same individual, the true treatment effect $\tau_i$ is not observable: each $Y_i^{obs}$ reveals only one of the two potential outcomes. Thus, the "honest" version of $EMSE_{\tau}(\Pi)$: \[EMSE_{\tau}(\Pi)\equiv\mathbb{E}_{S^{te},S^{est}}[MSE_{\tau}(S^{te},S^{est},\Pi)]\] is not feasible, since we have no knowledge of $\tau_i$.

The paper therefore proposes an estimator for $EMSE_{\tau}(\Pi)$. It modifies the $MSE_{\mu}$ used in CART to obtain an unbiased estimator $\widehat{MSE}_{\tau}$ for the treatment effect, and modifies the $EMSE_{\tau}(\Pi)$ of the "honest" algorithm to obtain an unbiased estimator $\widehat{EMSE}_{\tau}(S^{tr},N^{est},\Pi)$ for $EMSE_{\tau}(\Pi)$. Let $p$ denote the proportion of treated individuals in a leaf, $S_{control}^{tr}$ the subsample of the control group in the training sample, and $S_{treat}^{tr}$ the subsample of the treatment group in the training sample. The unbiased estimator of $EMSE_{\tau}(\Pi)$ used for splitting is defined as \[-\widehat{EMSE}_{\tau}(S^{tr},N^{est},\Pi)\equiv\frac{1}{N^{tr}}\sum_{i \in S^{tr}}\hat{\tau}^2(X_i;S^{tr},\Pi)-\left(\frac{1}{N^{tr}}+\frac{1}{N^{est}}\right)\cdot \sum_{l \in \Pi}\left(\frac{S_{S_{treat}^{tr}}^2(l)}{p}+\frac{S_{S_{control}^{tr}}^2(l)}{1-p}\right),\] where $N^{tr}$ and $N^{est}$ are the sizes of the training and estimation samples, $\hat{\tau}(X_i;S^{tr},\Pi)$ is the estimated treatment effect in the leaf containing $X_i$, and $S_{S_{treat}^{tr}}^2(l)$ and $S_{S_{control}^{tr}}^2(l)$ are the within-leaf sample variances of the outcome for treated and control units.

Applying the same expression to the cross-validation sample, the unbiased estimator of $EMSE_{\tau}(\Pi)$ for cross-validation is $-\widehat{EMSE}_{\tau}(S^{tr,cv},N^{est},\Pi)$.
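The splitting criterion can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the project's implementation: the function names `leaf_stats` and `neg_emse_hat` and the toy outcomes are hypothetical, $p$ is computed within each leaf as defined in the text, and every leaf is assumed to hold at least two treated and two control units so that the sample variances exist.

```python
from statistics import mean, variance


def leaf_stats(y, w):
    """Return (tau_hat, var_treated, var_control, p) for one leaf.

    y: outcomes in the leaf, w: 0/1 treatment indicators.
    Assumes at least two treated and two control units in the leaf.
    """
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    tau_hat = mean(treated) - mean(control)  # difference in means
    p = len(treated) / len(y)                # treated share in this leaf
    return tau_hat, variance(treated), variance(control), p


def neg_emse_hat(leaves, n_est):
    """Evaluate -EMSE_hat for a partition given as a list of (y, w) leaves
    drawn from the training sample; n_est is the estimation-sample size."""
    n_tr = sum(len(y) for y, _ in leaves)
    fit_term = 0.0
    var_term = 0.0
    for y, w in leaves:
        tau_hat, s2_treat, s2_control, p = leaf_stats(y, w)
        fit_term += len(y) * tau_hat ** 2    # sum of tau_hat^2 over units in leaf
        var_term += s2_treat / p + s2_control / (1 - p)
    return fit_term / n_tr - (1 / n_tr + 1 / n_est) * var_term


# Toy (made-up) training data: compare a two-leaf split with no split.
leaf_a = ([1.0, 3.0, 5.0, 7.0], [0, 0, 1, 1])
leaf_b = ([2.0, 2.0, 3.0, 3.0], [0, 0, 1, 1])
merged = (leaf_a[0] + leaf_b[0], leaf_a[1] + leaf_b[1])

score_split = neg_emse_hat([leaf_a, leaf_b], n_est=8)
score_merged = neg_emse_hat([merged], n_est=8)
print(score_split, score_merged)
```

A candidate split is preferred when it yields a larger value of the criterion. In this toy example the two leaves have clearly different effects, so the split scores higher than the unsplit sample.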

Data Source & Dataset Description

We queried the IPUMS database to obtain U.S. census microdata from around 1980 to 2020. To begin our investigation, we processed the data to fit our model:
- Treatment Variable (W): EDUC, education level transformed to a binary scale indicating whether the individual finished college.
- Outcome Variable (Y): INCWAGE, natural log of the individual's yearly wage.
- Covariates (X): other extracted features such as AGE, SEX, RACE, etc.
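The variable construction above might be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the threshold `COLLEGE_MIN_CODE`, the set `MISSING_WAGE_CODES`, the helper `prepare`, and the toy records are all assumptions introduced here, and the code values should be checked against the IPUMS codebook for the extract in use.

```python
import math

# Assumption: EDUC codes at or above this value mean "finished college".
COLLEGE_MIN_CODE = 10
# Assumption: INCWAGE values that encode missing / not applicable.
MISSING_WAGE_CODES = {999998, 999999}


def prepare(record):
    """Map one raw record to treatment W, outcome Y, and covariates X.

    Returns None when the wage cannot be log-transformed.
    """
    wage = record["INCWAGE"]
    if wage <= 0 or wage in MISSING_WAGE_CODES:
        return None  # log is undefined for non-positive wages
    covariates = {k: v for k, v in record.items()
                  if k not in ("EDUC", "INCWAGE")}
    return {
        "W": int(record["EDUC"] >= COLLEGE_MIN_CODE),  # binary treatment
        "Y": math.log(wage),                           # natural log of yearly wage
        "X": covariates,
    }


# Made-up records, for illustration only.
raw = [
    {"EDUC": 10, "INCWAGE": 52000, "AGE": 34, "SEX": 1, "RACE": 1},
    {"EDUC": 6, "INCWAGE": 31000, "AGE": 41, "SEX": 2, "RACE": 2},
    {"EDUC": 11, "INCWAGE": 0, "AGE": 55, "SEX": 2, "RACE": 1},
]
rows = [r for r in map(prepare, raw) if r is not None]
print(rows)
```

Dropping records with non-positive wages is one possible choice; it restricts the analysis to wage earners, which affects how the estimated effects should be interpreted.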

Exploratory Data Analysis

We plotted the distributions of the variables relevant to our research questions to get a general sense of the data. The plots show a clear difference in wages across decades and education levels, as well as a difference in education level across ages, which motivates our investigation of the heterogeneous CATE of education.

Wages by decades

Educ-Age-Wage

Complete rate


Timeline

Sept. - Oct.
● Conduct literature review by reading the target paper
Nov. - Dec.
● Reproduce experiments and write report
● Begin planning and drafting project proposal
● Dataset candidates selected
Jan. - Feb.
● Decide on dataset and perform exploratory data analysis (EDA)
● Conduct model tuning
● Create project poster and launch website
● Begin drafting the final report
Mar.
● Implement and fine-tune the random causal forest model
● Finalize models, project report, poster, and website
● Mar. 15: Final Showcase

Results

Conclusion

We found a positive CATE of education on earnings across all three research questions. The CATE is larger in recent decades and for younger people. Unexpectedly, we found no significant difference between the CATE for males and for females.

Trees and discovery under different conditions

Each leaf of a tree represents the CATE of that subgroup, and the average CATE is the average of those values across all leaves.
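As a small illustration of how a tree is read, the sketch below computes a leaf-level CATE as the difference in mean outcomes between treated and control units in the leaf, then takes the simple average over leaves as described above. The function names and toy log-wage values are hypothetical, and the text does not say whether the project weights leaves by size, so the unweighted mean here is an assumption.

```python
from statistics import mean


def leaf_cate(y, w):
    """Difference in mean outcome between treated and control units in a leaf."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(treated) - mean(control)


def average_cate(leaves):
    """Unweighted average of leaf-level CATEs (assumed aggregation rule)."""
    return mean([leaf_cate(y, w) for y, w in leaves])


# Made-up log wages for two leaves, for illustration only.
leaves = [
    ([10.0, 10.4, 10.9, 11.1], [0, 0, 1, 1]),
    ([10.2, 10.6, 10.8], [0, 1, 1]),
]
per_leaf = [leaf_cate(y, w) for y, w in leaves]
overall = average_cate(leaves)
print(per_leaf, overall)
```

Weighting each leaf by its number of observations would give the sample-average effect instead; the two coincide only when leaves are equally sized or have equal effects.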

CATE across Decades

The average CATE estimates across leaves for the 1990s, 2000s, and 2010s are 0.481, 0.547, and 0.771, respectively. We therefore conclude that later decades produce a deeper tree and a larger CATE of education on income.
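Because the outcome is the natural log of yearly wage, these estimates are differences in log wage. One rough way to read them on the wage scale is the conversion $e^{\tau}-1$, sketched below. The helper name `log_points_to_percent` is hypothetical, and the conversion is only approximate when applied to an average of leaf-level effects.

```python
import math


def log_points_to_percent(tau):
    """Convert a difference in log wage into an approximate percent difference."""
    return (math.exp(tau) - 1) * 100


# Average CATE estimates reported above, on the log-wage scale.
reported = {"1990s": 0.481, "2000s": 0.547, "2010s": 0.771}
premiums = {decade: log_points_to_percent(tau)
            for decade, tau in reported.items()}

for decade, pct in premiums.items():
    print(f"{decade}: roughly {pct:.0f}% higher yearly wage")
```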

Tree by 1990

Tree by 2000

Tree by 2010

CATE across Ages

The average CATE estimates across leaves for people in their 30s, 40s, and 50s are 0.74, 0.679, and 0.645, respectively. Hence, as with the previous question, the tree is deeper and the CATE is larger for the younger population, as we expected. However, the differences in tree depth and CATE are not as large as those across decades.

Tree from age 30s

Tree from age 40s

Tree from age 50s

CATE across Genders

The average CATE estimates across leaves for males and females in 2010 are 0.556 and 0.546, respectively. Unlike the previous two questions, there is no obvious difference in tree depth or CATE across genders.