Econ 265: Introduction to Econometrics

Topic 6: Variants on MLR

Moshi Alam

Introduction

  • Data scaling
  • Standardized coefficients
  • Nonlinearities in X
  • Interaction terms
  • Average partial effects
  • Adjusted R-squared
  • Overfitting

Data Scaling

library(wooldridge); library(stargazer);
model1 <- lm(bwght ~ cigs + faminc, data = bwght) # bwght in ounces
model2 <- lm(I(bwght/16) ~ cigs + faminc, data = bwght) # bwght in pounds (16 oz = 1 lb)
bwght$packs <- bwght$cigs/20 # 1 pack = 20 cigs
model3 <- lm(bwght ~ packs + faminc, data = bwght) # use packs instead of cigs
stargazer(model1, model2, model3, type = "text")

=================================================================
                                       Dependent variable:       
                                ---------------------------------
                                  bwght    I(bwght/16)   bwght   
                                   (1)         (2)        (3)    
-----------------------------------------------------------------
cigs                            -0.463***   -0.029***            
                                 (0.092)     (0.006)             
                                                                 
packs                                                  -9.268*** 
                                                        (1.832)  
                                                                 
faminc                           0.093***   0.006***    0.093*** 
                                 (0.029)     (0.002)    (0.029)  
                                                                 
Constant                        116.974***  7.311***   116.974***
                                 (1.049)     (0.066)    (1.049)  
                                                                 
-----------------------------------------------------------------
Observations                      1,388       1,388      1,388   
R2                                0.030       0.030      0.030   
Adjusted R2                       0.028       0.028      0.028   
Residual Std. Error (df = 1385)   20.063      1.254      20.063  
F Statistic (df = 2; 1385)      21.274***   21.274***  21.274*** 
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

Data Scaling

  • Coefficients in a regression model are sensitive to the units of measurement
  • The population model \(bwght_i = \beta_0 + \beta_1 cigs_i + \beta_2 faminc_i + u_i\) satisfies MLR 1-5
  • \(bwght_i/16 = \beta_0/16 + (\beta_1/16) cigs_i + (\beta_2/16) faminc_i + u_i/16\)
    • All coefficients (and the error) are scaled down by the same factor as the dependent variable
  • \(bwght_i = \beta_0 + 20\beta_1 (cigs_i/20) + \beta_2 faminc_i + u_i\)
    • which simplifies to \(bwght_i = \beta_0 + 20\beta_1 packs_i + \beta_2 faminc_i + u_i\)
    • The coefficient is scaled up by the factor by which the independent variable is scaled down
  • What happens to the SEs? To \(R^2\)? (See the check below)
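A minimal check (a sketch reusing model1 and model2 from above): scaling the dependent variable by 1/16 scales every coefficient and its standard error by 1/16, so t-statistics and \(R^2\) are unchanged.

# SEs scale by the same factor as the coefficients ...
summary(model1)$coefficients[, "Std. Error"] / summary(model2)$coefficients[, "Std. Error"] # all = 16
# ... so t-statistics are identical and R^2 is unchanged:
all.equal(summary(model1)$r.squared, summary(model2)$r.squared) # TRUE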

Standardized Coefficients

Instead of asking what happens to \(y\) when \(x\) changes by 1 unit, we ask what happens to \(y\) when \(x\) changes by 1 standard deviation

  • This is useful because:
    • \(x\) may be measured in different units in different datasets
    • \(x\) may be a composite index of several variables
    • \(x\) could be scaled up or down differently in different datasets. E.g. GPA
  • But s.d. effects are only useful when interpreted relative to the mean of \(x\)
    • So the standardized \(x_i\) is \(x^* = (x_i - \bar{x})/s_x\)
  • Thus from the population model: \(y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + u\)
    • The standardized model is: \(y = \beta_0^* + \beta_1^* x_1^* + \ldots + \beta_k^* x_k^* + u^*\)
      • where \(\beta_j^* = \frac{s_{x_j}}{s_y}\beta_j\) for \(j = 1, \ldots, k\)
  • How would you derive this? (A sketch follows below)
  • This is easy to implement in R using the scale() function
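One way to derive it (a sketch): average the population model, subtract to center every variable, and divide through by \(s_y\):

\[ \begin{aligned} y - \bar{y} &= \beta_1 (x_1 - \bar{x}_1) + \ldots + \beta_k (x_k - \bar{x}_k) + (u - \bar{u}) \\ \frac{y - \bar{y}}{s_y} &= \frac{s_{x_1}}{s_y}\beta_1 \frac{x_1 - \bar{x}_1}{s_{x_1}} + \ldots + \frac{s_{x_k}}{s_y}\beta_k \frac{x_k - \bar{x}_k}{s_{x_k}} + \frac{u - \bar{u}}{s_y} \end{aligned} \]

Matching terms gives \(\beta_j^* = \frac{s_{x_j}}{s_y}\beta_j\); centering is also why the constant in the standardized regression below is (numerically) zero.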

Standardized Coefficients

Scaled estimation:

model4 <- lm(scale(bwght) ~ scale(cigs) + scale(faminc), bwght)
stargazer(model1, model4, type = "text", omit.stat = c("f", "ser"))

==========================================
                  Dependent variable:     
              ----------------------------
                  bwght      scale(bwght) 
                   (1)            (2)     
------------------------------------------
cigs            -0.463***                 
                 (0.092)                  
                                          
faminc           0.093***                 
                 (0.029)                  
                                          
scale(cigs)                    -0.136***  
                                (0.027)   
                                          
scale(faminc)                  0.085***   
                                (0.027)   
                                          
Constant        116.974***      -0.000    
                 (1.049)        (0.026)   
                                          
------------------------------------------
Observations      1,388          1,388    
R2                0.030          0.030    
Adjusted R2       0.028          0.028    
==========================================
Note:          *p<0.1; **p<0.05; ***p<0.01

Compute manually:

model1$coef["cigs"]* sd(bwght$cigs)/sd(bwght$bwght)
      cigs 
-0.1359828 
model1$coef["faminc"]* sd(bwght$faminc)/sd(bwght$bwght)
    faminc 
0.08540571 

More examples

\(price = \beta_0 + \beta_1 nox + \beta_2 crime + \beta_3 rooms + \beta_4 dist + \beta_5 stratio + u\)

library(wooldridge)
summary(lm(price ~ nox + crime + rooms + dist + stratio, hprice2))

Call:
lm(formula = price ~ nox + crime + rooms + dist + stratio, data = hprice2)

Residuals:
   Min     1Q Median     3Q    Max 
-13914  -3201   -662   2110  38064 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20871.13    5054.60   4.129 4.27e-05 ***
nox         -2706.43     354.09  -7.643 1.09e-13 ***
crime        -153.60      32.93  -4.665 3.97e-06 ***
rooms        6735.50     393.60  17.112  < 2e-16 ***
dist        -1026.81     188.11  -5.459 7.57e-08 ***
stratio     -1149.20     127.43  -9.018  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5586 on 500 degrees of freedom
Multiple R-squared:  0.6357,    Adjusted R-squared:  0.632 
F-statistic: 174.5 on 5 and 500 DF,  p-value: < 2.2e-16
summary(lm(scale(price) ~ scale(nox) + scale(crime) + scale(rooms) + scale(dist) + scale(stratio), hprice2))

Call:
lm(formula = scale(price) ~ scale(nox) + scale(crime) + scale(rooms) + 
    scale(dist) + scale(stratio), data = hprice2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5110 -0.3476 -0.0719  0.2291  4.1334 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.327e-16  2.697e-02   0.000        1    
scale(nox)     -3.404e-01  4.454e-02  -7.643 1.09e-13 ***
scale(crime)   -1.433e-01  3.072e-02  -4.665 3.97e-06 ***
scale(rooms)    5.139e-01  3.003e-02  17.112  < 2e-16 ***
scale(dist)    -2.348e-01  4.302e-02  -5.459 7.57e-08 ***
scale(stratio) -2.703e-01  2.997e-02  -9.018  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6066 on 500 degrees of freedom
Multiple R-squared:  0.6357,    Adjusted R-squared:  0.632 
F-statistic: 174.5 on 5 and 500 DF,  p-value: < 2.2e-16

Functional forms

Recall from lecture 1

Specification   Change in x   Effect on y
-------------   -----------   --------------------------
Level-level     +1 unit       +\(b_1\) units
Level-log       +1%           +\(\frac{b_1}{100}\) units
Log-level       +1 unit       +\((100 \times b_1)\%\)
Log-log         +1%           +\(b_1\%\)
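A quick illustration of the log-level row (a sketch using the wage1 data from the wooldridge package, which reappears later in these slides):

library(wooldridge)
m <- lm(log(wage) ~ educ, data = wage1)
coef(m)["educ"] * 100 # approximate % change in wage per extra year of education (about 8.3%)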

Nonlinearities in X

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \beta_3 x_{2i} + u_i\]

\[\frac{\partial y}{\partial x_{1i}} = \beta_1 + 2\beta_2 x_{1i}\]

Discuss how the price of houses changes with the number of rooms:

lm(log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), hprice2)

Call:
lm(formula = log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), 
    data = hprice2)

Coefficients:
(Intercept)     log(nox)    log(dist)        rooms   I(rooms^2)  
   12.87147     -0.88558     -0.04421     -0.72860      0.08004  
  • Based on the estimates, how does price change as the number of rooms goes up?
  • At an increasing or a decreasing rate?
  • Are there tipping points? (See the computation below)
    • \(|\beta_1/(2\beta_2)|\) is the tipping point for \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^2 + \beta_3 x_{2i} + u_i\) when \(x_{1i}\) changes
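A minimal sketch computing the tipping point from the fitted quadratic above:

library(wooldridge)
m <- lm(log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2), data = hprice2)
b1 <- coef(m)["rooms"]; b2 <- coef(m)["I(rooms^2)"]
abs(b1 / (2 * b2)) # ~4.55 rooms: log(price) decreases in rooms below this point, increases above it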

Interaction terms

  • \(p = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + \beta_3 sqrft \times bdrms + \beta_4 bthrms +u\)

  • We can formalize this with conditional expectations:

  • \(E(p|sqrft, bdrms, bthrms) = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + \beta_3 sqrft \times bdrms + \beta_4 bthrms\)

  • \[\frac{\partial E(p|sqrft, bdrms, bthrms)}{\partial bdrms} = \beta_2 + \beta_3 sqrft\]

  • So at sqrft = \(\bar{sqrft}\), the effect of bdrms on price is \(\beta_2 + \beta_3 \bar{sqrft}\)

  • Easy to obtain given \(\hat{\beta}_2\) and \(\hat{\beta}_3\) and \(\bar{sqrft}\)

  • Similarly in the previous slide \(\frac{\partial E(y | x_1 = \bar{x_1}, x_2)}{\partial x_1} = \beta_1 + 2\beta_2 \bar{x_1}\)

model <- lm(price ~ sqrft + bdrms + I(sqrft * bdrms), data = hprice1)
model$coefficients["bdrms"] + model$coefficients["I(sqrft * bdrms)"] * mean(hprice1$sqrft)
   bdrms 
11.26181 

Adjusted R-squared

  • Recall Econ 160
  • Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model.
  • It compares the goodness of fit of models with different numbers of X's: \[\text{Adjusted-}R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right)\] where \(R^2\) is the R-squared value, \(n\) is the number of observations, and \(k\) is the number of X's (see the check below)
    • Penalizes the addition of unnecessary predictors
  • The F-test is closely related to the adjusted R-squared.
    • Think about the \(UR\) and \(R\) models in the context of the F-test
    • The F-test is useful when models are nested
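A minimal check of the formula against R's built-in value, reusing the hprice2 model from earlier:

library(wooldridge)
m <- lm(price ~ nox + crime + rooms + dist + stratio, data = hprice2)
s <- summary(m)
n <- nobs(m); k <- length(coef(m)) - 1
1 - (1 - s$r.squared) * (n - 1) / (n - k - 1) # matches s$adj.r.squared (~0.632)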

Overfitting/overcontrolling

  • Including too many controls can distort causal interpretation.
  • Example: Estimating the effect of beer taxes on traffic fatalities in states:

\[ \text{fatalities}_s = \beta_0 + \beta_1 \text{tax}_s + \beta_2 \text{miles}_s + \beta_3 \text{percmale}_s + \beta_4 \text{perc16\_21}_s + \dots \]

  • Should we control for beer consumption (beercons)?
    • No, it mediates the effect of tax on fatalities.
    • Controlling for it absorbs the policy impact.

The Problem of Overcontrolling

  • Overcontrolling can remove meaningful variation in key variables.
  • Example: Estimating the effect of pesticides on health costs:
    • Controlling for doctor visits blocks part of the effect.
  • Another case: School quality → earnings:
    • If quality raises education, controlling for education understates its impact.
  • Takeaway: Control variables should reflect causal logic, not just maximize \(R^2\).

Qualitative data

Binary variables

  • Binary variables are {0,1} variables. E.g. female, married, smoker
  • Example: Estimating the effect of gender on wages:
  • \(wage_i = \beta_0 + \delta_0 female_i + \beta_1 educ_i + u_i\)
  • \(\delta_0=\mathrm{E}( wage_i \mid female_i =1, educ_i )-\mathrm{E}( wage_i \mid female_i =0, educ_i )\)
  • Difference in wages between females and males, holding education constant.
  • If \(\delta_0 < 0\), women earn less than men on average, given education.
    • Males are the base group (\(\beta_0\) is their intercept)
  • Including both male and female leads to perfect collinearity (see the sketch below)
  • Alternatively: \(wage_i = \alpha_0 + \gamma_0 male_i + \beta_1 educ_i + u_i\)
    • Here, females are the base group.
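A minimal sketch of the perfect-collinearity point (the male dummy is constructed here for illustration; wage1 is from the wooldridge package):

library(wooldridge)
wage1$male <- 1 - wage1$female
# female + male = 1 for every observation, so the two dummies are perfectly
# collinear with the intercept; R reports NA for the redundant coefficient.
lm(wage ~ female + male + educ, data = wage1)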

Tests for wage discrimination

\(wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u\)

library(wooldridge)
summary( lm(wage ~ female + educ + exper + tenure, data = wage1))

Call:
lm(formula = wage ~ female + educ + exper + tenure, data = wage1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7675 -1.8080 -0.4229  1.0467 14.0075 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.56794    0.72455  -2.164   0.0309 *  
female      -1.81085    0.26483  -6.838 2.26e-11 ***
educ         0.57150    0.04934  11.584  < 2e-16 ***
exper        0.02540    0.01157   2.195   0.0286 *  
tenure       0.14101    0.02116   6.663 6.83e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.958 on 521 degrees of freedom
Multiple R-squared:  0.3635,    Adjusted R-squared:  0.3587 
F-statistic:  74.4 on 4 and 521 DF,  p-value: < 2.2e-16

Multiple categories

Marital status and gender: \[ log(wage_i) = \beta_0 + \beta_1 marriedmale_i + \beta_2 marriedfemale_i + \beta_3 unmarriedfemale_i \\ + \beta_4 educ_i + \beta_5 exper_i + \beta_6 tenure_i + \beta_7 exper^2_i + \beta_8 tenure^2_i + u_i\]

What is the base group?

summary(lm(log(wage) ~ I(married*(1-female)) + I(married*female) + I((1-married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1) )

Call:
lm(formula = log(wage) ~ I(married * (1 - female)) + I(married * 
    female) + I((1 - married) * female) + educ + exper + tenure + 
    I(exper^2) + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.3213781  0.1000090   3.213 0.001393 ** 
I(married * (1 - female))  0.2126757  0.0553572   3.842 0.000137 ***
I(married * female)       -0.1982676  0.0578355  -3.428 0.000656 ***
I((1 - married) * female) -0.1103502  0.0557421  -1.980 0.048272 *  
educ                       0.0789103  0.0066945  11.787  < 2e-16 ***
exper                      0.0268006  0.0052428   5.112 4.50e-07 ***
tenure                     0.0290875  0.0067620   4.302 2.03e-05 ***
I(exper^2)                -0.0005352  0.0001104  -4.847 1.66e-06 ***
I(tenure^2)               -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,    Adjusted R-squared:  0.4525 
F-statistic: 55.25 on 8 and 517 DF,  p-value: < 2.2e-16
  • What is the wage gap between married females and unmarried females? (See the computation below)
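A sketch of that computation, taking the difference of the two female dummy coefficients from the model above:

library(wooldridge)
m <- lm(log(wage) ~ I(married*(1-female)) + I(married*female) + I((1-married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1)
coef(m)["I(married * female)"] - coef(m)["I((1 - married) * female)"] # ~ -0.088 log points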

Interactions with dummy variables

What is the point of interacting marital status with gender? (This is a reparameterization of the previous model; note the identical fit below.)

\[ log(wage_i) = \beta_0 + \beta_1 female_i + \beta_2 married_i + \beta_3 married_i * female_i \\ + \beta_4 educ_i + \beta_5 exper_i + \beta_6 tenure_i + \beta_7 exper^2_i + \beta_8 tenure^2_i + u_i\]

summary(lm(log(wage) ~ female + married + I((married)*female) + educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1))

Call:
lm(formula = log(wage) ~ female + married + I((married) * female) + 
    educ + exper + tenure + I(exper^2) + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.3213781  0.1000090   3.213 0.001393 ** 
female                -0.1103502  0.0557421  -1.980 0.048272 *  
married                0.2126757  0.0553572   3.842 0.000137 ***
I((married) * female) -0.3005931  0.0717669  -4.188 3.30e-05 ***
educ                   0.0789103  0.0066945  11.787  < 2e-16 ***
exper                  0.0268006  0.0052428   5.112 4.50e-07 ***
tenure                 0.0290875  0.0067620   4.302 2.03e-05 ***
I(exper^2)            -0.0005352  0.0001104  -4.847 1.66e-06 ***
I(tenure^2)           -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,    Adjusted R-squared:  0.4525 
F-statistic: 55.25 on 8 and 517 DF,  p-value: < 2.2e-16

Different slopes

  • Do men and women have different returns to education?

  • \(wage_i = \beta_0 + \beta_1 educ_i + \beta_2 female_i + \beta_3 educ_i * female_i + u_i\)

  • \(E(wage_i | educ_i, female_i = 1) - E(wage_i | educ_i, female_i = 0) = \beta_2 + \beta_3 educ_i\)

  • Increasing educ by 1 unit increases wage by \(\beta_1 + \beta_3\) for females and by \(\beta_1\) for males.

summary(lm(log(wage) ~ educ + female + I(educ*female) + exper + expersq + tenure + tenursq, data = wage1))

Call:
lm(formula = log(wage) ~ educ + female + I(educ * female) + exper + 
    expersq + tenure + tenursq, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83265 -0.25261 -0.02374  0.25396  1.13584 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.3888060  0.1186871   3.276  0.00112 ** 
educ              0.0823692  0.0084699   9.725  < 2e-16 ***
female           -0.2267886  0.1675394  -1.354  0.17644    
I(educ * female) -0.0055645  0.0130618  -0.426  0.67028    
exper             0.0293366  0.0049842   5.886 7.11e-09 ***
expersq          -0.0005804  0.0001075  -5.398 1.03e-07 ***
tenure            0.0318967  0.0068640   4.647 4.28e-06 ***
tenursq          -0.0005900  0.0002352  -2.509  0.01242 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4001 on 518 degrees of freedom
Multiple R-squared:  0.441, Adjusted R-squared:  0.4334 
F-statistic: 58.37 on 7 and 518 DF,  p-value: < 2.2e-16
  • What is the gender gap at the average education level? (See the computation below)
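A sketch evaluating \(\hat{\beta}_2 + \hat{\beta}_3 \overline{educ}\) from the model above:

library(wooldridge)
m <- lm(log(wage) ~ educ + female + I(educ*female) + exper + expersq + tenure + tenursq, data = wage1)
coef(m)["female"] + coef(m)["I(educ * female)"] * mean(wage1$educ) # gap at mean educ, roughly -0.30 log points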

Different models for different groups

\[ GPA_i = \beta_0 + \beta_1 sat_i + \beta_2 hsperc_i + \beta_3 tothrs_i + u_i \]

  • To allow any slope to depend on gender, we simply interact that variable with \(female_i\) and include the interaction

  • To test whether the model differs between men and women, we need a model in which the intercept and all slopes can differ across the two groups (see the joint F-test after the output below)

\[ \begin{aligned} GPA_i = & \beta_0+\beta_1 sat_i +\beta_2 hsperc_i +\beta_3 tothrs_i \\ & +\delta_0 female_i +\delta_1 female_i * sat_i +\delta_2 female_i * hsperc_i +\delta_3 female_i * tothrs_i \\ & +u_i \end{aligned} \]

summary(lm(cumgpa ~ sat + hsperc + tothrs, data = gpa3))

Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs, data = gpa3)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.01959 -0.42768  0.04212  0.45355  2.90131 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9291105  0.2285515   4.065 5.32e-05 ***
sat          0.0009028  0.0002079   4.343 1.60e-05 ***
hsperc      -0.0063791  0.0015678  -4.069 5.24e-05 ***
tothrs       0.0119779  0.0009314  12.860  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8671 on 728 degrees of freedom
Multiple R-squared:  0.2354,    Adjusted R-squared:  0.2323 
F-statistic: 74.72 on 3 and 728 DF,  p-value: < 2.2e-16
summary(lm(cumgpa ~ sat + hsperc + tothrs + female + I(sat*female) + I(hsperc*female) + I(tothrs*female), data = gpa3)) 

Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs + female + I(sat * 
    female) + I(hsperc * female) + I(tothrs * female), data = gpa3)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.08519 -0.39944  0.05277  0.45862  2.73325 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.214e+00  2.648e-01   4.584 5.37e-06 ***
sat                 6.113e-04  2.350e-04   2.601 0.009484 ** 
hsperc             -5.967e-03  1.776e-03  -3.359 0.000823 ***
tothrs              1.030e-02  1.093e-03   9.425  < 2e-16 ***
female             -1.114e+00  5.285e-01  -2.107 0.035460 *  
I(sat * female)     1.117e-03  5.000e-04   2.233 0.025832 *  
I(hsperc * female)  5.076e-05  4.103e-03   0.012 0.990132    
I(tothrs * female)  5.560e-03  2.070e-03   2.686 0.007386 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8591 on 724 degrees of freedom
Multiple R-squared:  0.2537,    Adjusted R-squared:  0.2464 
F-statistic: 35.15 on 7 and 724 DF,  p-value: < 2.2e-16
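To carry out the test, compare the restricted and unrestricted fits with a joint F-test; anova() on two nested lm objects does exactly this (a Chow-type test of the four gender terms):

library(wooldridge)
r <- lm(cumgpa ~ sat + hsperc + tothrs, data = gpa3)
ur <- lm(cumgpa ~ sat + hsperc + tothrs + female + I(sat*female) + I(hsperc*female) + I(tothrs*female), data = gpa3)
anova(r, ur) # joint F-test that all four gender terms are zero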

Heteroskedasticity to be covered by Zeyi in the lab