Problem Set 5: Income Prediction Challenge
Due: 5pm on Friday, Feb 6.
This problem set is connected to the PSID Income Prediction Challenge. Look at that page for data access instructions. Before you start, you should prepare your working directory. This folder on your computer will contain
- your R script
- the file learning.csv (see instructions linked above)
- the file holdout.csv (see instructions linked above)
The two .csv files will already exist in the autograder working directory on Gradescope; it is not necessary to upload them. You will upload your R script.
There will be two submissions for this problem set.
- Problem Set 5: Code
- Problem Set 5: Written Answers
Code part 1: Guided
Your code submission should:
- Read learning.csv from your working directory.
- Create splitted, a sample split object created by the initial_split function in the rsample package. You can use any proportion split that you want.
- Estimate a linear regression linear_regression using training(splitted). This model will predict g3_log_income as a function of any predictors you want.
- Create an object predicted that contains the data testing(splitted) and a new column named yhat containing predictions from linear_regression using the data testing(splitted).
- Create an object mse that summarizes the mean squared error in the testing data. This object can be a tibble with one value or a single numeric value.
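The guided steps above might be sketched as follows. This is only one possible layout, not a reference solution: the predictor g2_log_income is an assumption about which columns learning.csv contains, and the seed and the 80/20 split proportion are arbitrary choices.

```r
library(tidyverse)
library(rsample)

# Arbitrary seed so the random split is reproducible
set.seed(1234)

# Read the learning data from the working directory
learning <- read_csv("learning.csv")

# Split into training and testing sets (0.8 is an arbitrary proportion)
splitted <- initial_split(learning, prop = 0.8)

# Fit a linear regression on the training set.
# ASSUMPTION: g2_log_income is a column in learning.csv; swap in
# whichever predictors you actually want to use.
linear_regression <- lm(
  g3_log_income ~ g2_log_income,
  data = training(splitted)
)

# Attach predictions for the testing set as a new column yhat
predicted <- testing(splitted) |>
  mutate(yhat = predict(linear_regression, newdata = testing(splitted)))

# Mean squared error in the testing set, stored as a one-value tibble
mse <- predicted |>
  summarize(mse = mean((g3_log_income - yhat)^2))
```

Here mse is the average of the squared differences between g3_log_income and yhat over the testing rows. If a chosen predictor has missing values, the result will be NA unless those rows are handled first.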
Code part 2: Creative
Now use any set of predictors you want, and any prediction function you want. It can be linear regression with any set of predictors, penalized linear regression, a tree or a forest.
Estimate your model on the full learning.csv data. Then make predictions for the cases in holdout.csv. Store your predicted values in a column named g3_log_income in the holdout data.
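As a rough illustration of this workflow, the sketch below fits a model on all of learning.csv and writes its predictions into the g3_log_income column of the holdout data. The predictors g1_log_income and g2_log_income are assumptions about the available columns, and the object names model and holdout_predicted are hypothetical. Linear regression is used only as a placeholder for whatever prediction function you choose.

```r
library(tidyverse)

# Read both files from the working directory
learning <- read_csv("learning.csv")
holdout <- read_csv("holdout.csv")

# Fit on the full learning data (no train/test split at this stage).
# ASSUMPTION: g1_log_income and g2_log_income exist in both files.
model <- lm(
  g3_log_income ~ g1_log_income + g2_log_income,
  data = learning
)

# Store predictions for the holdout cases in the column g3_log_income.
# holdout_predicted is a hypothetical name chosen for this sketch.
holdout_predicted <- holdout |>
  mutate(g3_log_income = predict(model, newdata = holdout))
```

A sample split like the one in the guided part can still be used beforehand to compare candidate models. The final model is then re-estimated on the full learning data before predicting for the holdout cases.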
The leaderboard on Gradescope will show real-time MSE scores for the holdout set.
Written answers
The written portion of this assignment on Gradescope asks the following questions:
How did you choose the predictor variables you used? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.
How did you choose your learning algorithm? Correct answers might be entirely conceptual, entirely data-driven, or a mixture of both.