Econometrics PS 1

Author

Prof Alam

Note

Follow the instructions for completing your problem set detailed in the syllabus.

Question 1

Consider the following population regression models of y_1 on x and y_2 on x, where y_{1i} = 2 + 3x_{i} + u_{1i} and y_{2i} = 2 + 3x_{i} + u_{2i}. The code below shows how the error terms u_{1i} and u_{2i} are generated for 10,000 observations.

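set.seed(123)                          # make the random draws reproducible
n <- 10000                             # number of observations
x <- sin(seq(-5, 5, length.out = n))   # regressor
u1 <- rnorm(n, mean = 0, sd = 0.1)     # error term of the first model
u2 <- x + rnorm(n, mean = 0, sd = 0.1) # error term of the second model

y1 <- 2 + 3 * x + u1
y2 <- 2 + 3 * x + u2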

Plot the error terms u_{1i} and u_{2i} against x_i. What do you observe?
Before running the regressions, which of the two regressions will yield unbiased OLS estimates? Prove your answer.
Use the code above to generate the data y_1, y_2 and x. Then run the two regressions and explain the results.
Now consider a wage equation: wage_i = \beta_0 + \beta_1 education_i + u_i. Suppose unobserved ability is positively correlated with education and affects wages positively. Explain intuitively why E(u | education) \ne 0 in this case. Will OLS overestimate or underestimate the true return to education? Justify your answer.
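
As a possible starting point for the plotting and regression items above, the following sketch regenerates the data from the code in the question, plots each error term against x, and fits both regressions with lm(). The two-panel layout and the object names fit1 and fit2 are illustrative choices, not requirements of the problem set.

```r
# Regenerate the data exactly as in the question
set.seed(123)
n <- 10000
x <- sin(seq(-5, 5, length.out = n))
u1 <- rnorm(n, mean = 0, sd = 0.1)
u2 <- x + rnorm(n, mean = 0, sd = 0.1)
y1 <- 2 + 3 * x + u1
y2 <- 2 + 3 * x + u2

# Side-by-side scatter plots of each error term against x
op <- par(mfrow = c(1, 2))
plot(x, u1, pch = ".", main = "u1 against x")
plot(x, u2, pch = ".", main = "u2 against x")
par(op)

# Fit both regressions (fit1 and fit2 are illustrative names)
fit1 <- lm(y1 ~ x)
fit2 <- lm(y2 ~ x)
coef(fit1)
coef(fit2)
```

Comparing the two slope estimates with the population value of 3 is one way to connect the plots to the unbiasedness question.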

Question 2

Consider two researchers studying the same population and estimating the population model y_i = \beta_0 + \beta_1 x_i + u_i with Var(u_i) = 1. Assume SLR.1-4 hold. The two researchers collect different samples:

Researcher A collects a random sample of 100 observations in which x_i turns out to be uniformly distributed on [45, 55]
Researcher B collects a random sample of 100 observations in which x_i turns out to be uniformly distributed on [0, 100]

Without doing exact calculations, explain which researcher will likely obtain a more precise estimate of \beta_1.
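
An intuitive answer can be checked afterwards with a small Monte Carlo experiment. The sketch below is only one way to set this up: the parameter values beta0 = 1 and beta1 = 2, the number of replications, and the helper simulate_slope() are assumptions made for illustration and do not come from the problem set.

```r
# Monte Carlo sketch: sampling spread of the OLS slope under two x designs.
# beta0 = 1 and beta1 = 2 are hypothetical values chosen for illustration.
set.seed(1)
beta0 <- 1
beta1 <- 2
n_obs <- 100
n_reps <- 2000

# Draw one sample with x ~ Uniform(lower, upper) and Var(u) = 1,
# then return the OLS slope estimate
simulate_slope <- function(lower, upper) {
  x <- runif(n_obs, min = lower, max = upper)
  u <- rnorm(n_obs, mean = 0, sd = 1)
  y <- beta0 + beta1 * x + u
  cov(x, y) / var(x)
}

slopes_a <- replicate(n_reps, simulate_slope(45, 55))  # Researcher A's design
slopes_b <- replicate(n_reps, simulate_slope(0, 100))  # Researcher B's design

# Standard deviation of the slope estimates across replications
sd_a <- sd(slopes_a)
sd_b <- sd(slopes_b)
c(A = sd_a, B = sd_b)
```

The two standard deviations can then be compared with what the variance formula for \hat{\beta}_1 under SLR.1-5 would suggest.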

Question 3

For this problem, you will use the wage2 dataset from the wooldridge package. This dataset contains information on monthly earnings and various characteristics of workers.

Q 3.1

Load the data and create a scatter plot of monthly wages (wage) against years of education (educ). What do you observe about the relationship? Are there any concerning patterns?
Calculate and report the sample correlation between wages and education. Then manually calculate \hat{\beta}_1 using the formula: \hat{\beta}_1 = \hat{\rho}_{xy} \cdot \left(\frac{\hat{\sigma}_y}{\hat{\sigma}_x}\right) by computing each component separately. Show that this matches the OLS estimate from lm().
Estimate the simple linear regression:

wage_i = \beta_0 + \beta_1 \, educ_i + u_i

Report and interpret both coefficients. Is the intercept meaningful in this context?

Calculate the fitted values and residuals manually (without using fitted() or residuals()). Verify that:

The mean of residuals equals zero
The point (\bar{educ}, \bar{wage}) lies on the regression line
The sample covariance between education and residuals is zero
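
The manual calculations in Q 3.1 can be sketched as follows. To keep the block runnable without the wooldridge package, it uses simulated stand-in data; the data-generating values are made up for illustration, and x and y would be replaced by wage2$educ and wage2$wage in the actual answer.

```r
# Simulated stand-in data; replace x and y with wage2$educ and wage2$wage
set.seed(42)
x <- round(runif(500, min = 9, max = 18))
y <- 150 + 60 * x + rnorm(500, sd = 300)

# Slope from the sample correlation and the two sample standard deviations
rho_hat <- cor(x, y)
beta1_hat <- rho_hat * sd(y) / sd(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Fitted values and residuals computed by hand
y_hat <- beta0_hat + beta1_hat * x
u_hat <- y - y_hat

# Compare with lm() and check the three algebraic properties
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
mean(u_hat)                                # zero up to rounding error
beta0_hat + beta1_hat * mean(x) - mean(y)  # zero up to rounding error
cov(x, u_hat)                              # zero up to rounding error
```

Because floating-point arithmetic is inexact, the three checks return very small numbers rather than exact zeros.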

Q 3.2

Estimate:

\log(wage_i) = \beta_0 + \beta_1 \, educ_i + u_i

Using the log-level specification:

Calculate the exact percentage return to one additional year of education
Predict the percentage wage difference between workers with 12 and 16 years of education

Create histograms of wage and log(wage). Which estimation would you prefer to run: level-level or log-level? Justify your choice.
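
For the exact percentage calculations in the log-level items above, recall that a change of \Delta educ shifts log(wage) by \beta_1 \Delta educ, so the exact percentage change in wage is 100 \cdot (\exp(\beta_1 \Delta educ) - 1). The helper exact_pct_change() below is a hypothetical name, and the slope 0.05 is a made-up value used only to show the mechanics; it is not the estimate from wage2.

```r
# Exact percentage change in wage implied by a log-level slope:
# 100 * (exp(beta1 * delta_x) - 1)
exact_pct_change <- function(beta1, delta_x = 1) {
  100 * (exp(beta1 * delta_x) - 1)
}

# b1 = 0.05 is a hypothetical slope, not the estimate from wage2
b1 <- 0.05
approx_pct <- 100 * b1                              # usual approximation
exact_pct <- exact_pct_change(b1)                   # exact one-year effect
four_year_pct <- exact_pct_change(b1, delta_x = 4)  # e.g. 12 vs. 16 years
c(approx = approx_pct, exact = exact_pct, four_year = four_year_pct)
```

The gap between the approximate and exact figures widens as \beta_1 \Delta educ grows, which is why the four-year comparison calls for the exact formula.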

Q 3.3

Create a residual plot (residuals vs. fitted values) for your log-level specification. Do you see evidence of heteroskedasticity? Explain what pattern you would look for.
Group the data into three education categories: low (educ < 12), medium (12 \leq educ < 16), and high (educ \geq 16).

wage2$educ_group <- cut(wage2$educ, 
                        breaks = c(-Inf, 12, 16, Inf),
                        labels = c("Low (<12)", "Medium (12-16)", "High (≥16)"),
                        right = FALSE)

Calculate the variance of residuals within each group. What do these variances suggest about the homoskedasticity assumption?

If heteroskedasticity is present, what are the consequences for the unbiasedness of \hat{\beta}_1?
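
The within-group variance calculation described above can be sketched as follows. The block uses a simulated stand-in data frame (sim) whose error spread grows with education by construction, so its output says nothing about wage2 itself; the coefficients 5.9 and 0.06 and the error scale are assumptions made for illustration.

```r
# Simulated stand-in for wage2 with error spread that grows with education
set.seed(7)
sim <- data.frame(educ = sample(9:18, size = 900, replace = TRUE))
sim$lwage <- 5.9 + 0.06 * sim$educ + rnorm(900, sd = 0.02 * sim$educ)

# Log-level regression and residuals computed by hand
fit <- lm(lwage ~ educ, data = sim)
sim$u_hat <- sim$lwage - (coef(fit)[1] + coef(fit)[2] * sim$educ)

# Same grouping rule as in the question: [-Inf, 12), [12, 16), [16, Inf)
sim$educ_group <- cut(sim$educ,
                      breaks = c(-Inf, 12, 16, Inf),
                      labels = c("Low (<12)", "Medium (12-16)", "High (>=16)"),
                      right = FALSE)

# Variance of the residuals within each education group
group_var <- tapply(sim$u_hat, sim$educ_group, var)
group_var
```

With the real data, the same tapply() call applied to the residuals from the log-level regression gives the three variances to compare.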