HW2 - Predicting NYC Property Values


Due Date: Friday 11/1 11:59PM

(Right click and “Save Link As…”)

nyc-property-sales.csv

nyc_properties.ipynb

In this assignment, we’ll be working with a dataset of property sales within New York City boroughs from a 12-month period, roughly 2016-2017. This assignment has two parts:

  1. First, we’ll clean the data and then answer some questions that will give us the insight we need to formulate a predictive model for property prices.
  2. We’ll then use the insights from above to create a regression model to predict prices.

Attached you will find the dataset of NYC property sales (nyc-property-sales.csv). The columns are described in the table below. Additionally, you will find a starter Python Notebook (nyc_properties.ipynb) with cells containing the questions you should answer. Submit your completed notebook on Canvas.

Column Definition
borough The borough the building is located in: Manhattan, Bronx, Brooklyn, Queens, or Staten Island
neighborhood Neighborhood Name
building_class_category Categorization of building, e.g. walk-up, elevator building, condo, etc.
address Street Address
apartment_number Apartment number
zip_code Zip Code
residential_units Number of residential units in the building
commercial_units Number of commercial units in the building
total_units Total units (residential + commercial) in the building
land_square_feet Square feet of the land the building is built on
gross_square_feet The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property.
year_built Year building was built
sale_price Amount of sale. Zero values indicate transfer of deeds, for example parent to child.
sale_date Date of sale

Part 1 - Cleaning and Insights

Cleaning our Data

Do the following within a single cell of your notebook.

We’ll need to clean our data to ensure that the columns are formatted the way we want. First, let’s convert every column that should hold numeric values to a numeric type (hint: how did we do this for the duration column in the UFO lab?). These should be:

residential_units
commercial_units
total_units
land_square_feet
gross_square_feet
year_built
sale_price
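
One possible way to do this conversion is pandas’ `to_numeric` with `errors="coerce"`, which turns anything it cannot parse into NaN. The sketch below runs on a tiny made-up frame, not the real CSV; the " -  " placeholder for missing values is an assumption for illustration, and the approach from the UFO lab may have looked different.

```python
import pandas as pd

# Toy stand-in for the real CSV. All values, including the " -  "
# placeholder for missing entries, are made up for illustration.
df = pd.DataFrame({
    "residential_units": ["1", "2", "0"],
    "gross_square_feet": ["1200", " -  ", "2400"],
    "sale_price": ["450000", "0", " -  "],
})

# Only three of the seven columns are shown here to keep the sketch short.
numeric_columns = ["residential_units", "gross_square_feet", "sale_price"]

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
for column in numeric_columns:
    df[column] = pd.to_numeric(df[column], errors="coerce")

print(df.dtypes)
print(df)
```

Coercing to NaN is convenient here because the later cleaning steps can then drop missing values in one pass.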

Our dataset includes all property sales in NYC within the date range, so there are a lot of things we probably don’t care about, e.g. 36 OUTDOOR RECREATIONAL FACILITIES. Let’s assume that we only want to learn about residential-style properties from the perspective of someone looking for a home. Filter out any rows that do not fall under one of the following building_class_category values:

01 ONE FAMILY DWELLINGS
02 TWO FAMILY DWELLINGS
03 THREE FAMILY DWELLINGS
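
A filter like this can be sketched with `Series.isin`. The frame below is made up, and stripping whitespace before comparing is a precaution based on the assumption that the category strings in the real file might be padded.

```python
import pandas as pd

# Toy stand-in for the real data; the rows are made up for illustration.
df = pd.DataFrame({
    "building_class_category": [
        "01 ONE FAMILY DWELLINGS",
        "36 OUTDOOR RECREATIONAL FACILITIES",
        "02 TWO FAMILY DWELLINGS  ",  # trailing spaces, assumed possible in the real file
        "03 THREE FAMILY DWELLINGS",
    ],
    "sale_price": [500000, 1200000, 650000, 800000],
})

residential_categories = [
    "01 ONE FAMILY DWELLINGS",
    "02 TWO FAMILY DWELLINGS",
    "03 THREE FAMILY DWELLINGS",
]

# Strip padding first so that "02 TWO FAMILY DWELLINGS  " still matches.
category = df["building_class_category"].str.strip()
df = df[category.isin(residential_categories)]

print(df)
```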

Next, let’s drop some rows that don’t make sense for our analysis.

  1. Let’s make sure that the sale was indeed for a home and not a commercial business. Specifically, further filter out any rows that have residential_units equal to 0.
  2. Drop any rows that do not have a sale_price or whose sale_price is less than $65,000. Many sales occur with a nonsensically small dollar amount, most commonly $0. These sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
  3. Drop any rows that do not have a gross_square_feet or land_square_feet.
  4. Drop any rows that have a gross_square_feet or land_square_feet of 0.
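
The four steps above can be sketched with boolean masks and `dropna`. The frame is made up so that each step removes at least one row; it assumes the columns have already been converted to numeric types.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data, already numeric. Values are made up.
df = pd.DataFrame({
    "residential_units": [1, 0, 2, 1, 3, 1],
    "sale_price": [500000, 700000, 0, np.nan, 900000, 650000],
    "gross_square_feet": [1200, 1500, 1800, 1000, 0, np.nan],
    "land_square_feet": [2000, 2500, 2200, 1800, 3000, 2100],
})

# Step 1: keep only rows with at least one residential unit.
df = df[df["residential_units"] != 0]

# Step 2: drop missing sale prices, then prices below $65,000.
df = df.dropna(subset=["sale_price"])
df = df[df["sale_price"] >= 65000]

# Step 3: drop rows missing either square footage column.
df = df.dropna(subset=["gross_square_feet", "land_square_feet"])

# Step 4: drop rows where either square footage is 0.
df = df[(df["gross_square_feet"] != 0) & (df["land_square_feet"] != 0)]

print(len(df))
print(df.isnull().sum())
```

Printing the row count and the per-column null counts at the end gives a quick way to compare against a checkpoint.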

As a checkpoint, you should now have a dataset with no null values and 24416 rows.

Gaining Insights from our Data

Next, we want to validate some hypotheses about the data we have. For example, it’s highly likely that the borough and neighborhood drastically change the price of a property, but we should first make sure that this is the case. Answer each question in the corresponding cell of your notebook by printing the results.

  1. Borough sensitivity: what are the average property prices per borough? What about the standard deviation of property prices per borough?
  2. Square footage sensitivity: what is the average price per gross square foot per borough? What about the standard deviation?
  3. Neighborhood sensitivity: what are the most expensive and least expensive neighborhoods in each borough?
  4. Neighborhood sensitivity 2: what are the most expensive and least expensive neighborhoods in each borough if we consider the per unit price? That is, we instead average the price per residential unit of a building.
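
Questions like these usually come down to a `groupby`. The sketch below uses a tiny made-up frame with placeholder neighborhood names. It reads “average price per square foot” as the mean of each sale’s own price-per-square-foot ratio, which is one possible interpretation and an assumption here.

```python
import pandas as pd

# Toy stand-in for the cleaned data. Neighborhood names and values are made up.
df = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "neighborhood": ["A", "A", "B", "C", "D", "D"],
    "residential_units": [1, 2, 1, 4, 1, 3],
    "gross_square_feet": [1000, 2000, 1500, 2000, 1000, 3000],
    "sale_price": [300000, 500000, 600000, 800000, 450000, 900000],
})

# Questions 1 and 2: mean and standard deviation per borough.
price_stats = df.groupby("borough")["sale_price"].agg(["mean", "std"])
df["price_per_sqft"] = df["sale_price"] / df["gross_square_feet"]
sqft_stats = df.groupby("borough")["price_per_sqft"].agg(["mean", "std"])

# Question 3: average price per neighborhood, then the extremes within each borough.
neighborhood_mean = df.groupby(["borough", "neighborhood"])["sale_price"].mean()
most_expensive = neighborhood_mean.groupby(level="borough").idxmax()
least_expensive = neighborhood_mean.groupby(level="borough").idxmin()

# Question 4: same idea, but on the price per residential unit.
df["price_per_unit"] = df["sale_price"] / df["residential_units"]
unit_mean = df.groupby(["borough", "neighborhood"])["price_per_unit"].mean()
most_expensive_per_unit = unit_mean.groupby(level="borough").idxmax()

print(price_stats)
print(sqft_stats)
print(most_expensive)
print(least_expensive)
print(most_expensive_per_unit)
```

In this toy data the most expensive Queens neighborhood differs between the two rankings, which illustrates why the per unit view is worth checking separately.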

Part 2 - Predicting Property Prices

Predictive Machine Learning

Now that we have some ideas about which parameters are important, let’s fit a regression model to predict property prices. Included is some framework code for a random forest regressor. This is slightly different from the random forest classifier that we used for the Titanic lab: instead of taking something like a majority vote across the trees, it takes something like the average of their outputs. This makes it more suitable for predicting a continuous value, such as a price, than for classification.
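
To make the vote versus average distinction concrete, here is a simplified picture using five hypothetical tree outputs for a single input. The numbers and labels are made up, and real library implementations may combine trees in more refined ways.

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs of five trees for one input (made-up values).
class_votes = ["survived", "died", "survived", "survived", "died"]
price_votes = [410000.0, 395000.0, 450000.0, 420000.0, 400000.0]

# Classifier-style aggregation: pick the most common label.
classifier_prediction = Counter(class_votes).most_common(1)[0][0]

# Regressor-style aggregation: average the numeric outputs.
regressor_prediction = mean(price_votes)

print(classifier_prediction)
print(regressor_prediction)
```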

Your job is to provide the features that you think are important for the model. Simply uncomment the columns in the features list that you wish to train the model on and run the cell. It will output the (log of the) mean squared error (MSE) and the R^2 value. In short, you can interpret them as follows: the lower the MSE, the better, and the closer R^2 is to 1, the better.
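
For intuition about the two metrics, they can be computed by hand with NumPy. The prices below are made up, and using the natural log for the MSE is an assumption; the starter code may use a different base.

```python
import numpy as np

# Made-up true and predicted prices for four properties.
y_true = np.array([300000.0, 450000.0, 600000.0, 750000.0])
y_pred = np.array([320000.0, 430000.0, 610000.0, 700000.0])

# Mean squared error: the average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)
log_mse = np.log(mse)  # natural log (assumed base)

# R^2: 1 minus the ratio of residual variation to total variation.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, log_mse, r2)
```

Because the errors are squared, a single badly mispredicted property can dominate the MSE, which is one reason the log is easier to read.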

For the purposes of this assignment, the training of the model is not randomized. That is, the training set is always the same and the randomness of the forest is always the same. This way, you can gain insights about what features are important. In the real world, you would likely need to train the model many times over many sets of parameters and pick the best one.

It’s recommended to start with as few features as possible, then slowly add more. Otherwise, your model may take too long to train! If your model is taking longer than a minute to train, something might be wrong.

What features do you find to be the most impactful? In particular, what do you notice about including any of the number of unit features (total_units or residential_units) versus the building_class_category feature? Why do you think this is?

Visualization

Finally, let’s visualize our predictions. Add a line to include a scatter plot in the matplotlib code provided. This scatter plot should be points where the x-value is the predicted value and the y-value is the true value. This way, we can visualize how far off our estimates are by seeing how far off they are from the line defined by y=x.
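
A plot of this kind might look like the sketch below. It is standalone and uses made-up values rather than the provided matplotlib code, so the variable names (`y_true`, `y_pred`) are assumptions and will likely differ from those in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up true and predicted prices.
y_true = np.array([300000.0, 450000.0, 600000.0, 750000.0])
y_pred = np.array([320000.0, 430000.0, 610000.0, 700000.0])

fig, ax = plt.subplots()

# x-value is the prediction, y-value is the true price.
points = ax.scatter(y_pred, y_true, alpha=0.6)

# Reference line y = x: a perfect prediction would land exactly on it.
low = min(y_pred.min(), y_true.min())
high = max(y_pred.max(), y_true.max())
ax.plot([low, high], [low, high], color="red", label="y = x")

ax.set_xlabel("Predicted price")
ax.set_ylabel("True price")
ax.legend()
```

Points above the line are properties the model underestimated, and points below it are properties the model overestimated.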

Extra Credit

Use any other model you wish to train a predictor. It does not need to perform better than the random forest regressor we used. You may find it useful to copy the code provided as a starting point.