Econ 265: Introduction to Econometrics

Topic 6: Variants on MLR

Moshi Alam

Introduction

  • Data scaling
  • Standardized coefficients
  • Nonlinearities in X
  • Interaction terms
  • Average partial effects
  • Adjusted R-squared
  • Overfitting

Data Scaling

library(wooldridge); library(stargazer);
model1 <- lm(bwght ~ cigs + faminc, data = bwght) # bwght in ounces
model2 <- lm(I(bwght/16) ~ cigs + faminc, data = bwght) # bwght in pounds (16 oz = 1 lb)
bwght$packs <- bwght$cigs/20 # 1 pack = 20 cigs
model3 <- lm(bwght ~ packs + faminc, data = bwght) # use packs instead of cigs
stargazer(model1, model2, model3, type = "text")

=================================================================
                                       Dependent variable:       
                                ---------------------------------
                                  bwght    I(bwght/16)   bwght   
                                   (1)         (2)        (3)    
-----------------------------------------------------------------
cigs                            -0.463***   -0.029***            
                                 (0.092)     (0.006)             
                                                                 
packs                                                  -9.268*** 
                                                        (1.832)  
                                                                 
faminc                           0.093***   0.006***    0.093*** 
                                 (0.029)     (0.002)    (0.029)  
                                                                 
Constant                        116.974***  7.311***   116.974***
                                 (1.049)     (0.066)    (1.049)  
                                                                 
-----------------------------------------------------------------
Observations                      1,388       1,388      1,388   
R2                                0.030       0.030      0.030   
Adjusted R2                       0.028       0.028      0.028   
Residual Std. Error (df = 1385)   20.063      1.254      20.063  
F Statistic (df = 2; 1385)      21.274***   21.274***  21.274*** 
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

Data Scaling

  • Coefficients in a regression model are sensitive to the units of measurement
  • The population model \(bwght_i = \beta_0 + \beta_1 cigs_i + \beta_2 faminc_i + u_i\) satisfies MLR 1-5
  • \(bwght_i/16 = \beta_0/16 + (\beta_1/16) cigs_i + (\beta_2/16) faminc_i + u_i/16\)
    • All coefficients (and the error) are scaled down by the same factor as the dependent variable
  • \(bwght_i = \beta_0 + 20\beta_1 (cigs_i/20) + \beta_2 faminc_i + u_i\)
    • which simplifies to \(bwght_i = \beta_0 + 20\beta_1 packs_i + \beta_2 faminc_i + u_i\)
    • The coefficient is scaled up by the factor by which the independent variable is scaled down
  • What happens to the SEs? To \(R^2\)? (See the check below)
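A minimal check (a sketch reusing model1 and model2 from above): scaling the dependent variable by 1/16 scales every coefficient and its standard error by 1/16, so t-statistics and \(R^2\) are unchanged.

# SEs scale by the same factor as the coefficients ...
summary(model1)$coefficients[, "Std. Error"] / summary(model2)$coefficients[, "Std. Error"] # all = 16
# ... so t-statistics are identical and R^2 is unchanged:
all.equal(summary(model1)$r.squared, summary(model2)$r.squared) # TRUE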

Standardized Coefficients

Instead of asking what happens to \(y\) when \(x\) changes by 1 unit, we ask what happens to \(y\) when \(x\) changes by 1 standard deviation

  • This is useful because:
    • \(x\) may be measured in different units in different datasets
    • \(x\) may be a composite index of several variables
    • \(x\) could be scaled up or down differently in different datasets. E.g. GPA
  • But s.d. effects are only useful when interpreted relative to the mean of \(x\)
    • So the standardized \(x_i\) is \(x^* = (x_i - \bar{x})/s_x\)
  • Thus from the population model: \(y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u\)
    • The standardized model is: \(y = \beta_0^* + \beta_1^* x_1^* + \ldots + \beta_k^* x_k^* + u^*\)
      • where \(\beta_j^* = \frac{s_{x_j}}{s_y}\beta_j\) for \(j = 1, \ldots, k\)
  • How would you derive this? (A sketch follows below)
  • This is easy to implement in R using the scale() function
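One way to derive it (a sketch): average the population model, subtract to center every variable, and divide through by \(s_y\):

\[ \begin{aligned} y - \bar{y} &= \beta_1 (x_1 - \bar{x}_1) + \ldots + \beta_k (x_k - \bar{x}_k) + (u - \bar{u}) \\ \frac{y - \bar{y}}{s_y} &= \frac{s_{x_1}}{s_y}\beta_1 \frac{x_1 - \bar{x}_1}{s_{x_1}} + \ldots + \frac{s_{x_k}}{s_y}\beta_k \frac{x_k - \bar{x}_k}{s_{x_k}} + \frac{u - \bar{u}}{s_y} \end{aligned} \]

Matching terms gives \(\beta_j^* = \frac{s_{x_j}}{s_y}\beta_j\); centering is also why the constant in the standardized regression below is (numerically) zero.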

Standardized Coefficients

Scaled estimation:

model4 <- lm(scale(bwght) ~ scale(cigs) + scale(faminc), bwght)
stargazer(model1, model4, type = "text", omit.stat = c("f", "ser"))

==========================================
                  Dependent variable:     
              ----------------------------
                  bwght      scale(bwght) 
                   (1)            (2)     
------------------------------------------
cigs            -0.463***                 
                 (0.092)                  
                                          
faminc           0.093***                 
                 (0.029)                  
                                          
scale(cigs)                    -0.136***  
                                (0.027)   
                                          
scale(faminc)                  0.085***   
                                (0.027)   
                                          
Constant        116.974***      -0.000    
                 (1.049)        (0.026)   
                                          
------------------------------------------
Observations      1,388          1,388    
R2                0.030          0.030    
Adjusted R2       0.028          0.028    
==========================================
Note:          *p<0.1; **p<0.05; ***p<0.01

Compute manually:

model1$coef["cigs"]* sd(bwght$cigs)/sd(bwght$bwght)
      cigs 
-0.1359828 
model1$coef["faminc"]* sd(bwght$faminc)/sd(bwght$bwght)
    faminc 
0.08540571 

More examples

\(price = \beta_0 + \beta_1 nox + \beta_2 crime + \beta_3 rooms + \beta_4 dist + \beta_5 stratio + u\)

library(wooldridge)
summary(lm(price ~ nox + crime + rooms + dist + stratio, hprice2))

Call:
lm(formula = price ~ nox + crime + rooms + dist + stratio, data = hprice2)

Residuals:
   Min     1Q Median     3Q    Max 
-13914  -3201   -662   2110  38064 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20871.13    5054.60   4.129 4.27e-05 ***
nox         -2706.43     354.09  -7.643 1.09e-13 ***
crime        -153.60      32.93  -4.665 3.97e-06 ***
rooms        6735.50     393.60  17.112  < 2e-16 ***
dist        -1026.81     188.11  -5.459 7.57e-08 ***
stratio     -1149.20     127.43  -9.018  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5586 on 500 degrees of freedom
Multiple R-squared:  0.6357,    Adjusted R-squared:  0.632 
F-statistic: 174.5 on 5 and 500 DF,  p-value: < 2.2e-16
summary(lm(scale(price) ~ scale(nox) + scale(crime) + scale(rooms) + scale(dist) + scale(stratio), hprice2))

Call:
lm(formula = scale(price) ~ scale(nox) + scale(crime) + scale(rooms) + 
    scale(dist) + scale(stratio), data = hprice2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5110 -0.3476 -0.0719  0.2291  4.1334 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.327e-16  2.697e-02   0.000        1    
scale(nox)     -3.404e-01  4.454e-02  -7.643 1.09e-13 ***
scale(crime)   -1.433e-01  3.072e-02  -4.665 3.97e-06 ***
scale(rooms)    5.139e-01  3.003e-02  17.112  < 2e-16 ***
scale(dist)    -2.348e-01  4.302e-02  -5.459 7.57e-08 ***
scale(stratio) -2.703e-01  2.997e-02  -9.018  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6066 on 500 degrees of freedom
Multiple R-squared:  0.6357,    Adjusted R-squared:  0.632 
F-statistic: 174.5 on 5 and 500 DF,  p-value: < 2.2e-16

Functional forms

Recall from lecture 1

Specification   Change in x   Effect on y
-------------   -----------   --------------------------
Level-level     +1 unit       +\(b_1\) units
Level-log       +1%           +\(\frac{b_1}{100}\) units
Log-level       +1 unit       +\((100 \times b_1)\%\)
Log-log         +1%           +\(b_1\%\)
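A quick illustration of the log-level row (a sketch using the wage1 data from the wooldridge package, which reappears later in these slides):

library(wooldridge)
m <- lm(log(wage) ~ educ, data = wage1)
coef(m)["educ"] * 100 # approximate % change in wage per extra year of education (about 8.3%)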

Nonlinearities in X

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \beta_3 x_{2i} + u_i\]

\[\frac{\partial y}{\partial x_{1i}} = \beta_1 + 2\beta_2 x_{1i}\]

Discuss how the price of houses changes with the number of rooms:

lm(log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), hprice2)

Call:
lm(formula = log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), 
    data = hprice2)

Coefficients:
(Intercept)     log(nox)    log(dist)        rooms   I(rooms^2)  
   12.87147     -0.88558     -0.04421     -0.72860      0.08004  
  • Based on the estimates, how does price change as the number of rooms goes up?
  • At an increasing or a decreasing rate?
  • Are there tipping points? (See the computation below)
    • \(|\beta_1/(2\beta_2)|\) is the tipping point for \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \beta_3 x_{2i} + u_i\) when \(x_{1i}\) changes
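A minimal sketch computing the tipping point from the fitted quadratic above:

library(wooldridge)
m <- lm(log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), data = hprice2)
b1 <- coef(m)["rooms"]; b2 <- coef(m)["I(rooms^2)"]
abs(b1 / (2 * b2)) # ~4.55 rooms: log(price) decreases in rooms below this point, increases above it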

Interaction terms

  • \(p = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + \beta_3 sqrft \times bdrms + \beta_4 bthrms +u\)

  • We can formalize this with conditional expectations:

  • \(E(p|sqrft, bdrms, bthrms) = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + \beta_3 sqrft \times bdrms + \beta_4 bthrms\)

  • \[\frac{\partial E(p|sqrft, bdrms, bthrms)}{\partial bdrms} = \beta_2 + \beta_3 sqrft\]

  • So at sqrft = \(\bar{sqrft}\), the effect of bdrms on price is \(\beta_2 + \beta_3 \bar{sqrft}\)

  • Easy to obtain given \(\hat{\beta}_2\) and \(\hat{\beta}_3\) and \(\bar{sqrft}\)

  • Similarly in the previous slide \(\frac{\partial E(y | x_1 = \bar{x_1}, x_2)}{\partial x_1} = \beta_1 + 2\beta_2 \bar{x_1}\)

model <- lm(price ~ sqrft + bdrms + I(sqrft * bdrms), data = hprice1)
model$coefficients["bdrms"] + model$coefficients["I(sqrft * bdrms)"] * mean(hprice1$sqrft)
   bdrms 
11.26181 

Adjusted R-squared

  • Recall Econ 160
  • Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model.
  • It compares the goodness of fit of models with different numbers of X's: \[\text{Adjusted-}R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right)\] where \(R^2\) is the R-squared value, \(n\) is the number of observations, and \(k\) is the number of X's (see the check below)
    • Penalizes the addition of unnecessary predictors
  • The F-test is closely related to the adjusted R-squared.
    • Think about the \(UR\) and \(R\) models in the context of the F-test
    • The F-test is useful when models are nested
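A minimal check of the formula against R's built-in value, reusing the hprice2 model from earlier:

library(wooldridge)
m <- lm(price ~ nox + crime + rooms + dist + stratio, data = hprice2)
s <- summary(m)
n <- nobs(m); k <- length(coef(m)) - 1
1 - (1 - s$r.squared) * (n - 1) / (n - k - 1) # matches s$adj.r.squared (~0.632)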

Overfitting/overcontrolling

  • Including too many controls can distort causal interpretation.
  • Example: Estimating the effect of beer taxes on traffic fatalities in states:

\[ \text{fatalities}_s = \beta_0 + \beta_1 \text{tax}_s + \beta_2 \text{miles}_s + \beta_3 \text{percmale}_s + \beta_4 \text{perc16\_21}_s + \dots \]

  • Should we control for beer consumption (beercons)?
    • No, it mediates the effect of tax on fatalities.
    • Controlling for it absorbs the policy impact.

The Problem of Overcontrolling

  • Overcontrolling can remove meaningful variation in key variables.
  • Example: Estimating the effect of pesticides on health costs:
    • Controlling for doctor visits blocks part of the effect.
  • Another case: School quality → earnings:
    • If quality raises education, controlling for education understates its impact.
  • Takeaway: Control variables should reflect causal logic, not just maximize \(R^2\).

Qualitative data

Binary variables

  • Binary variables are {0,1} variables. E.g. female, married, smoker
  • Example: Estimating the effect of gender on wages:
  • \(wage_i = \beta_0 + \delta_0 female_i + \beta_1 educ_i + u_i\)
  • \(\delta_0=\mathrm{E}( wage_i \mid female_i =1, educ_i )-\mathrm{E}( wage_i \mid female_i =0, educ_i )\)
  • Difference in wages between females and males, holding education constant.
  • If \(\delta_0 < 0\), women earn less than men on average, given education.
    • Males are the base group (\(\beta_0\) is their intercept)
  • Including both male and female leads to perfect collinearity (see the sketch below)
  • Alternatively: \(wage_i = \alpha_0 + \gamma_0 male_i + \beta_1 educ_i + u_i\)
    • Here, females are the base group.
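A minimal sketch of the perfect-collinearity point (the male dummy is constructed here for illustration; wage1 is from the wooldridge package):

library(wooldridge)
wage1$male <- 1 - wage1$female
# female + male = 1 for every observation, so the two dummies are perfectly
# collinear with the intercept; R reports NA for the redundant coefficient.
lm(wage ~ female + male + educ, data = wage1)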

Tests for wage discrimination

\(wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u\)

library(wooldridge)
summary( lm(wage ~ female + educ + exper + tenure, data = wage1))

Call:
lm(formula = wage ~ female + educ + exper + tenure, data = wage1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7675 -1.8080 -0.4229  1.0467 14.0075 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.56794    0.72455  -2.164   0.0309 *  
female      -1.81085    0.26483  -6.838 2.26e-11 ***
educ         0.57150    0.04934  11.584  < 2e-16 ***
exper        0.02540    0.01157   2.195   0.0286 *  
tenure       0.14101    0.02116   6.663 6.83e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.958 on 521 degrees of freedom
Multiple R-squared:  0.3635,    Adjusted R-squared:  0.3587 
F-statistic:  74.4 on 4 and 521 DF,  p-value: < 2.2e-16

Multiple categories

Marital status and gender: \[ log(wage_i) = \beta_0 + \beta_1 marriedmale_i + \beta_2 marriedfemale_i + \beta_3 unmarriedfemale_i \\ + \beta_4 educ_i + \beta_5 exper_i + \beta_6 tenure_i + \beta_7 exper^2_i + \beta_8 tenure^2_i + u_i\]

What is the base group?

summary(lm(log(wage) ~ I(married*(1-female)) + I(married*female) + I((1-married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1) )

Call:
lm(formula = log(wage) ~ I(married * (1 - female)) + I(married * 
    female) + I((1 - married) * female) + educ + exper + tenure + 
    I(exper^2) + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.3213781  0.1000090   3.213 0.001393 ** 
I(married * (1 - female))  0.2126757  0.0553572   3.842 0.000137 ***
I(married * female)       -0.1982676  0.0578355  -3.428 0.000656 ***
I((1 - married) * female) -0.1103502  0.0557421  -1.980 0.048272 *  
educ                       0.0789103  0.0066945  11.787  < 2e-16 ***
exper                      0.0268006  0.0052428   5.112 4.50e-07 ***
tenure                     0.0290875  0.0067620   4.302 2.03e-05 ***
I(exper^2)                -0.0005352  0.0001104  -4.847 1.66e-06 ***
I(tenure^2)               -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,    Adjusted R-squared:  0.4525 
F-statistic: 55.25 on 8 and 517 DF,  p-value: < 2.2e-16
  • What is the wage gap between married females and unmarried females? (See the computation below)
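A sketch of that computation, taking the difference of the two female dummy coefficients from the model above:

library(wooldridge)
m <- lm(log(wage) ~ I(married*(1-female)) + I(married*female) + I((1-married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1)
coef(m)["I(married * female)"] - coef(m)["I((1 - married) * female)"] # ~ -0.088 log points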

Interactions with dummy variables

What is the point of interacting marital status with gender? (This is a reparameterization of the previous model; note the identical fit below.)

\[ log(wage_i) = \beta_0 + \beta_1 female_i + \beta_2 married_i + \beta_3 married_i * female_i \\ + \beta_4 educ_i + \beta_5 exper_i + \beta_6 tenure_i + \beta_7 exper^2_i + \beta_8 tenure^2_i + u_i\]

summary(lm(log(wage) ~ female + married + I((married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1))

Call:
lm(formula = log(wage) ~ female + married + I((married) * female) + 
    educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.3213781  0.1000090   3.213 0.001393 ** 
female                -0.1103502  0.0557421  -1.980 0.048272 *  
married                0.2126757  0.0553572   3.842 0.000137 ***
I((married) * female) -0.3005931  0.0717669  -4.188 3.30e-05 ***
educ                   0.0789103  0.0066945  11.787  < 2e-16 ***
exper                  0.0268006  0.0052428   5.112 4.50e-07 ***
tenure                 0.0290875  0.0067620   4.302 2.03e-05 ***
I(exper^2)            -0.0005352  0.0001104  -4.847 1.66e-06 ***
I(tenure^2)           -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,    Adjusted R-squared:  0.4525 
F-statistic: 55.25 on 8 and 517 DF,  p-value: < 2.2e-16

Different slopes

  • Do men and women have different returns to education?

  • \(wage_i = \beta_0 + \beta_1 educ_i + \beta_2 female_i + \beta_3 educ_i * female_i + u_i\)

  • \(E(wage_i | educ_i, female_i = 1) - E(wage_i | educ_i, female_i = 0) = \beta_2 + \beta_3 educ_i\)

  • Increasing educ by 1 unit increases wage by \(\beta_1 + \beta_3\) for females and by \(\beta_1\) for males.

summary(lm(log(wage) ~ educ + female + I(educ*female) + exper + expersq + tenure + tenursq, data = wage1))

Call:
lm(formula = log(wage) ~ educ + female + I(educ * female) + exper + 
    expersq + tenure + tenursq, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83265 -0.25261 -0.02374  0.25396  1.13584 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.3888060  0.1186871   3.276  0.00112 ** 
educ              0.0823692  0.0084699   9.725  < 2e-16 ***
female           -0.2267886  0.1675394  -1.354  0.17644    
I(educ * female) -0.0055645  0.0130618  -0.426  0.67028    
exper             0.0293366  0.0049842   5.886 7.11e-09 ***
expersq          -0.0005804  0.0001075  -5.398 1.03e-07 ***
tenure            0.0318967  0.0068640   4.647 4.28e-06 ***
tenursq          -0.0005900  0.0002352  -2.509  0.01242 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4001 on 518 degrees of freedom
Multiple R-squared:  0.441, Adjusted R-squared:  0.4334 
F-statistic: 58.37 on 7 and 518 DF,  p-value: < 2.2e-16
  • What is the gender gap at the average education level? (See the computation below)
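A sketch evaluating \(\hat{\beta}_2 + \hat{\beta}_3 \overline{educ}\) from the model above:

library(wooldridge)
m <- lm(log(wage) ~ educ + female + I(educ*female) + exper + expersq + tenure + tenursq, data = wage1)
coef(m)["female"] + coef(m)["I(educ * female)"] * mean(wage1$educ) # gap at mean educ, roughly -0.30 log points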

Different models for different groups

\[ GPA_i = \beta_0 + \beta_1 sat_i + \beta_2 hsperc_i + \beta_3 tothrs_i + u_i \]

  • To allow any slope to depend on gender, we simply interact that variable with \(female_i\) and include the interaction

  • To test whether the model differs between men and women, we need a model in which the intercept and all slopes can differ across the two groups (see the joint F-test after the output below)

\[ \begin{aligned} GPA_i = & \beta_0+\beta_1 sat_i +\beta_2 hsperc_i +\beta_3 tothrs_i \\ & +\delta_0 female_i +\delta_1 female_i * sat_i +\delta_2 female_i * hsperc_i +\delta_3 female_i * tothrs_i \\ & +u_i \end{aligned} \]

summary(lm(cumgpa ~ sat + hsperc + tothrs, data = gpa3))

Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs, data = gpa3)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.01959 -0.42768  0.04212  0.45355  2.90131 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9291105  0.2285515   4.065 5.32e-05 ***
sat          0.0009028  0.0002079   4.343 1.60e-05 ***
hsperc      -0.0063791  0.0015678  -4.069 5.24e-05 ***
tothrs       0.0119779  0.0009314  12.860  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8671 on 728 degrees of freedom
Multiple R-squared:  0.2354,    Adjusted R-squared:  0.2323 
F-statistic: 74.72 on 3 and 728 DF,  p-value: < 2.2e-16
summary(lm(cumgpa ~ sat + hsperc + tothrs + female + I(sat*female) + I(hsperc*female) + I(tothrs*female), data = gpa3)) 

Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs + female + I(sat * 
    female) + I(hsperc * female) + I(tothrs * female), data = gpa3)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.08519 -0.39944  0.05277  0.45862  2.73325 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.214e+00  2.648e-01   4.584 5.37e-06 ***
sat                 6.113e-04  2.350e-04   2.601 0.009484 ** 
hsperc             -5.967e-03  1.776e-03  -3.359 0.000823 ***
tothrs              1.030e-02  1.093e-03   9.425  < 2e-16 ***
female             -1.114e+00  5.285e-01  -2.107 0.035460 *  
I(sat * female)     1.117e-03  5.000e-04   2.233 0.025832 *  
I(hsperc * female)  5.076e-05  4.103e-03   0.012 0.990132    
I(tothrs * female)  5.560e-03  2.070e-03   2.686 0.007386 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8591 on 724 degrees of freedom
Multiple R-squared:  0.2537,    Adjusted R-squared:  0.2464 
F-statistic: 35.15 on 7 and 724 DF,  p-value: < 2.2e-16
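To carry out the test, compare the restricted and unrestricted fits with a joint F-test; anova() on two nested lm objects does exactly this (a Chow-type test of the four gender terms):

library(wooldridge)
r <- lm(cumgpa ~ sat + hsperc + tothrs, data = gpa3)
ur <- lm(cumgpa ~ sat + hsperc + tothrs + female + I(sat*female) + I(hsperc*female) + I(tothrs*female), data = gpa3)
anova(r, ur) # joint F-test that all four gender terms are zero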

Heteroskedasticity to be covered by Zeyi in the lab