Hi.

I'm Garfield Fisher and today I'm coming to you from Ottawa, Ontario, Canada.

Today I'm asking a question that is an extension from last week's question.

If we recall, last week you used the 2013 data set

to identify the variables that could influence or predict the market value for properties.

This week I'm asking you to take those results

and predict future market value for the same properties.

For this analysis, you'll need the 2011 and the 2013 data sets.

As usual, I look forward to the questions you have for me,

I look forward to the results that you present

and I especially look forward to seeing how you present those results

in a way that anyone can understand.

Good luck.

Thank you.

Bye-bye.

Based on what Garfield said, let me summarize the assignment.

In the last submission,

you all built a regression model

for the market value of housing units using the 2013 data.

You were asked to do some basic cleaning of the data

and then build a regression model for the value variable

using data on single-family housing;

that is, housing units where the type variable was one

and structure-type variable was also one.

The regression model you built establishes a relationship between market value

of the housing unit and the set of X or independent variables used in your analysis.

However, remember the regression you built establishes a relationship

between the concurrent Y and X variables.

That is, you establish a relationship between the market value in year 2013

and the various variables also for the year 2013.

So if you wish to predict the market value of the housing unit two years from now,

this regression may be problematic

because you would then need the corresponding X variables for the same concurrent period.

That is, two years hence.

One possibility is to use the lag of all your X or independent variables.

That is, for example, use the value variable, your Y variable from the 2013 data

and use all the X variables from 2011 data.

This way, you will be building a relationship between the current market value

and the lag values of your X variables.

This will allow you to use the regression model

to make predictions about future market value of housing units.

For example, you could predict the market value of housing units in 2013

using the data from 2011 and so on.

Let us get down to some specifics.

You need to use the dependent variable,

the value variable from the 2013 data and the X variables from the 2011 data.

For this, first you need to merge the value variable from the 2013 data file

into the 2011 data file.

You can do this using the V lookup command

and using the control variable to match up the two data files.

Remember, these are big data files

and the V lookup command may take some time to execute.

Secondly, you need to do some data cleaning steps.

Delete all rows for housing units that are not common across the two years.

Then again, as in the previous exercise, keep only single-housing family units;

that is, keep only those housing units for which the type variable is one

and structure-type variable is also one.

Delete all other housing units.

Also, delete all housing units which have a market value less than $1,000.

That is, delete all housing units for which the value variable is less than $1,000.

Next, you need to estimate a regression model using the 2013 value variable

and the X variables from the 2011 data.

Since you did a similar exercise in the previous assignment,

this should be relatively straightforward.

However, please do look at the values of various X variables to make sure

that there is no missing or so-called suspect data.

So now you have a regression model

which can be used to predict market value of single family housing units

for the year 2013 using data from 2011.

How well does the model predict future market values?

The R-square measure is one such measure

which indicates the goodness of fit of the model.

However, typically in the industry, we tend to do a holdout analysis

for testing the predictions from the model.

The notion is to holdout some data and not include it in the regression model.

Instead, when the regression model is estimated,

then use the estimated regression model to predict the held out data

and see how far or close the predictions are.

To do a holdout analysis, you need to do the following.

From the data that you have created on your Y and X variables,

you need to select 1,000 housing units at random.

This would be your holdout sample.

You will remove all data for these 1,000 housing units and keep them in a separate file.

Your regression model will be estimated on the remaining set of housing units;

that is, housing units remaining after you have removed the 1,000 selected

at random housing units.

How do you select 1,000 units at random?

In Course 1 of the specialization, we had introduced the RAND function in Excel.

This calculates a random value between 0 and 1.

So you can create a column in your data using this RAND function

and then sort the data based on this random value

and select the top 1,000 units to be held out.

You will then estimate the regression model using the remaining data

and obtain various beta coefficients.

You will then use these beta coefficients from the regression model

and using the X variables from the holdout data,

generate predictions for each housing unit in the holdout data.

You can then compare these 1,000 predictions with the actual market value

of those housing units.

You can create a mean absolute deviation measure,

which is the average of the absolute difference between the actual

and the predicted value.

The formula is as shown.

In terms of submission, please submit at the minimum your regression output

and the calculation of the mean absolute deviation between the actual

and predicted values in the held out sample.

For those of you who wish to be more adventurous,

you can extend the analysis to multiple years and in your regression,

include data on market value of housing units across multiple years.

The corresponding X variables will also be across multiple years,

except that they will be lagged by two years.