HW2 - Predicting NYC Property Values
Due Date: Tuesday 3/18 11:59PM
This dataset was taken from Kaggle.
In this assignment, we’ll be working with a dataset of property sales within New York City boroughs from a 12-month period, roughly 2016-2017. This assignment has two parts:
- First, we’ll clean the data and then answer some questions that will give us insight for formulating a predictive model of property prices.
- We’ll then use the insights from above to create a linear regression model to predict prices.
Attached you will find the dataset of NYC property sales (`nyc-property-sales.csv`). Information about the columns is described in the table below. Additionally, you will find a starter Python notebook (`nyc_properties.ipynb`) with cells containing questions you should answer. Submit your completed notebook on Canvas.
Column | Definition |
---|---|
borough | 1:Manhattan, 2:Bronx, 3:Brooklyn, 4:Queens, 5:Staten Island |
neighborhood | Neighborhood Name |
building_class_category | Categorization of building, e.g. walk-up, elevator building, condo, etc. |
address | Street Address |
apartment_number | Apartment number |
zip_code | Zip Code |
residential_units | Number of residential units in the building |
commercial_units | Number of commercial units in the building |
total_units | Total units (residential + commercial) in the building |
land_square_feet | Square feet of the land the building is built on |
gross_square_feet | The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property. |
year_built | Year building was built |
sale_price | Amount of sale. Zero values indicate transfer of deeds, for example parent to child. |
sale_date | Date of sale |
Part 1 - Cleaning and Insights
Cleaning our Data
Do the following within a single cell of your notebook. You may find the UFO and Titanic labs helpful to refer back to.
We’ll need to clean our data to ensure that the columns are formatted the way we want. First, let’s convert every column that we think should be a numeric value. These should be:
- `residential_units`
- `commercial_units`
- `total_units`
- `land_square_feet`
- `gross_square_feet`
- `year_built`
- `sale_price`
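The conversion can be sketched with pandas. This is a minimal example on a toy DataFrame standing in for the real CSV; `errors="coerce"` is one reasonable way to handle unparseable strings, turning them into `NaN` rather than raising an error:

```python
import pandas as pd

# Toy stand-in for the real CSV: numeric columns often arrive as strings,
# sometimes with stray characters or blanks.
df = pd.DataFrame({
    "sale_price": ["550000", " -  ", "1200000"],
    "gross_square_feet": ["1800", "2400", ""],
})

numeric_cols = ["sale_price", "gross_square_feet"]
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into NaN instead of raising
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)                      # both columns are now float64
print(df["sale_price"].isna().sum())  # one unparseable value became NaN
```

In your notebook you would loop over all seven columns listed above instead of these two.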
Our dataset includes all property sales in NYC within the date range, so there are lots of records we probably don’t care about, e.g. `36 OUTDOOR RECREATIONAL FACILITIES`.
Let’s assume that we only want to learn about residential-style properties from the perspective of someone looking for a home. Filter out any rows that do not fall under one of the following `building_class_category` values:
- `01 ONE FAMILY DWELLINGS`
- `02 TWO FAMILY DWELLINGS`
- `03 THREE FAMILY DWELLINGS`
Next, let’s drop some rows that don’t make sense for our analysis.
- Let’s make sure that the sale was indeed for a home and not a commercial business. Specifically, further filter out any rows that have `residential_units` equal to 0.
- Drop any rows that do not have a `sale_price`, or whose `sale_price` is less than $65,000. Many sales occur with a nonsensically small dollar amount, most commonly $0. Most of these sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
- Drop any rows that do not have a `gross_square_feet` or `land_square_feet`, or where either has a value of 0.
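The filtering steps above can be sketched as follows. The toy rows here are invented for illustration; only the column names and category strings come from the assignment:

```python
import pandas as pd

# Toy rows standing in for the full dataset (column names from the assignment).
df = pd.DataFrame({
    "building_class_category": [
        "01 ONE FAMILY DWELLINGS", "36 OUTDOOR RECREATIONAL FACILITIES",
        "02 TWO FAMILY DWELLINGS", "01 ONE FAMILY DWELLINGS",
    ],
    "residential_units": [1, 0, 2, 1],
    "sale_price": [450000.0, 900000.0, 0.0, 700000.0],
    "gross_square_feet": [1500.0, 5000.0, 2200.0, None],
    "land_square_feet": [2000.0, 9000.0, 2500.0, 1800.0],
})

residential = [
    "01 ONE FAMILY DWELLINGS",
    "02 TWO FAMILY DWELLINGS",
    "03 THREE FAMILY DWELLINGS",
]
df = df[df["building_class_category"].isin(residential)]
df = df[df["residential_units"] > 0]
df = df[df["sale_price"] >= 65000]          # also drops the $0 deed transfers
df = df.dropna(subset=["gross_square_feet", "land_square_feet"])
df = df[(df["gross_square_feet"] > 0) & (df["land_square_feet"] > 0)]
print(len(df))  # only the first toy row survives all the filters
```

On the real data, this is where you should hit the 24,416-row checkpoint below.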
As a checkpoint, you should now have a dataset with no null values and 24,416 rows.
Gaining Insights from our Data
Next, we want to validate some hypotheses about the data we have. For example, it’s highly likely that the borough and neighborhood drastically change the price of a property, but we should first confirm that this is the case. Answer each question in the corresponding cell of your notebook by printing the results.
- Borough sensitivity: what are the average property prices per borough? What about the standard deviation of property prices per borough?
- Square footage sensitivity: what is the average price per gross square foot per borough? What about the standard deviation?
- Neighborhood sensitivity: what are the most expensive and least expensive neighborhoods in each borough?
- Neighborhood sensitivity 2: what are the most expensive and least expensive neighborhoods in each borough if we consider the per unit price? That is, we instead average the price per residential unit of a building.
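Each of these questions is a `groupby` aggregation. A minimal sketch on toy data (the borough/neighborhood values and prices here are invented; only the column names match the dataset):

```python
import pandas as pd

# Toy cleaned data; the real df has these columns after Part 1 cleaning.
df = pd.DataFrame({
    "borough": [1, 1, 3, 3, 3],
    "neighborhood": ["CHELSEA", "CHELSEA", "PARK SLOPE", "PARK SLOPE", "CANARSIE"],
    "sale_price": [2_000_000, 3_000_000, 1_200_000, 1_400_000, 500_000],
    "gross_square_feet": [1000, 1200, 1500, 1600, 1100],
    "residential_units": [1, 1, 2, 2, 1],
})

# Borough sensitivity: mean and std of sale_price per borough.
print(df.groupby("borough")["sale_price"].agg(["mean", "std"]))

# Square-footage sensitivity: price per gross square foot, then per borough.
df["price_per_sqft"] = df["sale_price"] / df["gross_square_feet"]
print(df.groupby("borough")["price_per_sqft"].agg(["mean", "std"]))

# Neighborhood sensitivity: most/least expensive neighborhood in each borough.
nbhd = df.groupby(["borough", "neighborhood"])["sale_price"].mean()
print(nbhd.groupby("borough").idxmax())   # most expensive per borough
print(nbhd.groupby("borough").idxmin())   # least expensive per borough
```

The per-unit variant (question 4) is the same pattern with `sale_price / residential_units` in place of the raw price.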
Part 2 - Predicting Property Prices
Now that we have some ideas about what parameters are important, let’s fit a regression model to predict property prices. Here’s a general outline of your job:
- Feature engineering & selection.
- Train your model.
- Explore different models and features to see what gives the best results.
- Plot your results!
Feature Engineering & Feature Selection
Think about which features would be the most informative to train on. You may also engineer some additional features that could be informative to the model. Don’t simply include every feature in your model; this might result in terribly long training times. Ideally, your model shouldn’t take longer than about 30 seconds to train. For reference, the best model I found took only about 9 seconds to train.
A warning: don’t include any features that are derived from price! For example, if we include `price_per_sqft` and `gross_square_feet`, then our model will easily figure out the price and have basically zero error. Since our goal is to determine the price, we should only provide features that are disjoint from it.
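One practical wrinkle: `borough` and `neighborhood` are categorical, so a linear model can’t consume them directly. A common approach (not the only one) is one-hot encoding with `pd.get_dummies`. The toy values below are invented; note that nothing derived from `sale_price` goes into the feature matrix:

```python
import pandas as pd

# Toy cleaned data; categorical columns need encoding before a linear model.
df = pd.DataFrame({
    "borough": [1, 3, 3, 4],
    "neighborhood": ["CHELSEA", "PARK SLOPE", "CANARSIE", "ASTORIA"],
    "gross_square_feet": [1000.0, 1500.0, 1100.0, 1300.0],
    "total_units": [1, 2, 1, 1],
    "year_built": [1930, 1910, 1955, 1940],
    "sale_price": [2_000_000.0, 1_300_000.0, 500_000.0, 900_000.0],
})

# Candidate features -- deliberately excluding anything derived from price.
features = ["borough", "neighborhood", "gross_square_feet",
            "total_units", "year_built"]
X = pd.get_dummies(df[features], columns=["borough", "neighborhood"])
y = df["sale_price"]
print(X.columns.tolist())  # one indicator column per borough/neighborhood value
```

Beware that one-hot encoding every neighborhood adds a lot of columns, which is one reason training times can balloon if you include everything.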
Picking a Model
Once you’ve got an idea of which features are important, let’s train our model! You may find it useful to start with the code we used for the Titanic lab. You should be doing the following:
- Slice your dataset to include the features you want.
- Split your dataset into a training set and labels, and testing set and labels.
- Fit the model!
- Test the model to determine its accuracy. There are two lines commented out that you may find useful.
You will not be graded on how optimal the model is, simply on whether you’ve correctly trained one. Kudos will be given to the best-tuned model :) Here are some models from scikit-learn that you might want to try, but feel free to use anything you want!
Some tips:
- Start with as few features as possible, then slowly add more. Otherwise, your model might take too long to train! In the real world, you might actually want to do the opposite since you might have some beefy machines and more data typically means better fits.
- Use a fixed random state in your models, e.g. `random_state=2`. This will ensure that you can compare different feature sets.
- The provided lines will print out the (log of the) mean squared error (MSE) and the R^2 value. In short, the lower the MSE the better, and the closer R^2 is to 1 the better. Typically, an R^2 value over 0.7 is fairly good in practice. For reference, I was able to get an R^2 of about 0.5 pretty easily, but it took some effort to reach about 0.64.
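The train/fit/evaluate loop can be sketched like this. Synthetic data stands in for your engineered feature matrix, and `LinearRegression` is just one of the models you might try:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix; in the notebook,
# X and y come from the cleaned property data.
rng = np.random.default_rng(2)
X = rng.uniform(500, 4000, size=(200, 3))            # sqft-like features
y = X @ np.array([300.0, 50.0, 20.0]) + rng.normal(0, 10_000, 200)

# Fixed random_state so different feature sets are comparable run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print("log MSE:", np.log(mean_squared_error(y_test, preds)))
print("R^2:", r2_score(y_test, preds))
```

Swapping `LinearRegression` for another regressor leaves the rest of the loop unchanged, which makes comparing models cheap.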
Visualization
Finally, let’s visualize our predictions. Your goal is to use Matplotlib to generate a scatter plot. The x-axis should be your predicted property value and the y-axis should be the true value. Note that if our prediction was perfect, everything would simply be a point on the line y=x! In light of this, include the line defined by y=x in some other color. This way, we can visualize how far off our estimates are by seeing how far they lie from this line. Finally, add an x label, a y label, and a title to your plot.
Note: given some outliers, you will probably want to cap your axes at something like $2.5M, otherwise you won’t see anything meaningful. Take a look at `set_xlim` and `set_ylim`.
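The plot described above can be sketched as follows. The predictions here are fake (true values plus noise) just to have something to draw; in the notebook you would use your model’s test-set predictions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this in a notebook
import matplotlib.pyplot as plt

# Fake predictions standing in for the model's output on the test set.
rng = np.random.default_rng(2)
y_true = rng.uniform(100_000, 3_000_000, 300)
y_pred = y_true + rng.normal(0, 200_000, 300)

fig, ax = plt.subplots()
ax.scatter(y_pred, y_true, s=5, alpha=0.5)
lims = [0, 2_500_000]
ax.plot(lims, lims, color="red")   # the y = x line: where a perfect model lands
ax.set_xlim(lims)                  # cap the axes to keep outliers from
ax.set_ylim(lims)                  # squashing the interesting region
ax.set_xlabel("Predicted price ($)")
ax.set_ylabel("Actual price ($)")
ax.set_title("Predicted vs. actual NYC property prices")
fig.savefig("predictions.png")
```

Points above the red line are under-predictions, points below it are over-predictions, so the vertical distance to the line is the error on each sale.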
Extra Credit
Visualize your results in any other meaningful way using Matplotlib. Plot something other than what we already plotted!