HW2 - Predicting NYC Property Values
Due Date: Tuesday 3/18 11:59PM
This dataset was taken from Kaggle.
In this assignment, we’ll be working with a dataset of property sales within New York City boroughs from a 12-month period, roughly 2016-2017. This assignment has two parts:
- First, we’ll clean the data and then answer some questions that will give us insight for formulating a predictive model of property prices.
- We’ll then use the insights from above to create a linear regression model to predict prices.
Attached you will find the dataset of NYC property sales (`nyc-property-sales.csv`). Information about the columns is described in the table below. Additionally, you will find a starter Python notebook (`nyc_properties.ipynb`) with cells containing questions you should answer. Submit your completed notebook on Canvas.
Column | Definition |
---|---|
borough | 1:Manhattan, 2:Bronx, 3:Brooklyn, 4:Queens, 5:Staten Island |
neighborhood | Neighborhood Name |
building_class_category | Categorization of building, e.g. walk-up, elevator building, condo, etc. |
address | Street Address |
apartment_number | Apartment number |
zip_code | Zip Code |
residential_units | Number of residential units in the building |
commercial_units | Number of commercial units in the building |
total_units | Total units (residential + commercial) in the building |
land_square_feet | Square feet of the land the building is built on |
gross_square_feet | The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property. |
year_built | Year building was built |
sale_price | Amount of sale. Zero values indicate transfer of deeds, for example parent to child. |
sale_date | Date of sale |
Part 1 - Cleaning and Insights
Cleaning our Data
Do the following within a single cell of your notebook. You may find the UFO and Titanic labs helpful to refer back to.
We’ll need to clean our data to ensure that the columns are formatted the way we want. First, let’s convert every column that we think should be a numeric value. These should be:
- `residential_units`
- `commercial_units`
- `total_units`
- `land_square_feet`
- `gross_square_feet`
- `year_built`
- `sale_price`
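The conversion can be sketched with pandas. This is a minimal example on a toy DataFrame standing in for the real CSV; `errors="coerce"` is one reasonable way to handle unparseable strings, turning them into `NaN` rather than raising an error:

```python
import pandas as pd

# Toy stand-in for the real CSV: numeric columns often arrive as strings,
# sometimes with stray characters or blanks.
df = pd.DataFrame({
    "sale_price": ["550000", " -  ", "1200000"],
    "gross_square_feet": ["1800", "2400", ""],
})

numeric_cols = ["sale_price", "gross_square_feet"]
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into NaN instead of raising
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)                      # both columns are now float64
print(df["sale_price"].isna().sum())  # one unparseable value became NaN
```

In your notebook you would loop over all seven columns listed above instead of these two.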
Our dataset includes all property sales in NYC within the date range, so there are lots of records we probably don’t care about, e.g. `36 OUTDOOR RECREATIONAL FACILITIES`.
Let’s assume that we only want to learn about residential-style properties from the perspective of someone looking for a home. Filter out any rows that do not fall under one of the following `building_class_category` values:
- `01 ONE FAMILY DWELLINGS`
- `02 TWO FAMILY DWELLINGS`
- `03 THREE FAMILY DWELLINGS`
Next, let’s drop some rows that don’t make sense for our analysis.
- Let’s make sure that the sale was indeed for a home and not a commercial business. Specifically, further filter out any rows that have `residential_units` equal to 0.
- Drop any rows that do not have a `sale_price`, or whose `sale_price` is less than $65,000. Many sales occur with a nonsensically small dollar amount, most commonly $0. Most of these sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
- Drop any rows that do not have a `gross_square_feet` or `land_square_feet`, or where either has a value of 0.
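The filtering steps above can be sketched as follows. The toy rows here are invented for illustration; only the column names and category strings come from the assignment:

```python
import pandas as pd

# Toy rows standing in for the full dataset (column names from the assignment).
df = pd.DataFrame({
    "building_class_category": [
        "01 ONE FAMILY DWELLINGS", "36 OUTDOOR RECREATIONAL FACILITIES",
        "02 TWO FAMILY DWELLINGS", "01 ONE FAMILY DWELLINGS",
    ],
    "residential_units": [1, 0, 2, 1],
    "sale_price": [450000.0, 900000.0, 0.0, 700000.0],
    "gross_square_feet": [1500.0, 5000.0, 2200.0, None],
    "land_square_feet": [2000.0, 9000.0, 2500.0, 1800.0],
})

residential = [
    "01 ONE FAMILY DWELLINGS",
    "02 TWO FAMILY DWELLINGS",
    "03 THREE FAMILY DWELLINGS",
]
df = df[df["building_class_category"].isin(residential)]
df = df[df["residential_units"] > 0]
df = df[df["sale_price"] >= 65000]          # also drops the $0 deed transfers
df = df.dropna(subset=["gross_square_feet", "land_square_feet"])
df = df[(df["gross_square_feet"] > 0) & (df["land_square_feet"] > 0)]
print(len(df))  # only the first toy row survives all the filters
```

On the real data, this is where you should hit the 24,416-row checkpoint below.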
As a checkpoint, you should now have a dataset with no null values and 24,416 rows.
Gaining Insights from our Data
Next, we want to validate some hypotheses about the data we have. For example, it’s highly likely that the borough and neighborhood drastically change the price of a property, but we should first confirm that this is the case. Answer each question in the corresponding cell of your notebook by printing the results.
- Borough sensitivity: what are the average property prices per borough? What about the standard deviation of property prices per borough?
- Square footage sensitivity: what is the average price per gross square foot per borough? What about the standard deviation?
- Neighborhood sensitivity: what are the most expensive and least expensive neighborhoods in each borough?
- Neighborhood sensitivity 2: what are the most expensive and least expensive neighborhoods in each borough if we consider the per unit price? That is, we instead average the price per residential unit of a building.
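Each of these questions is a `groupby` aggregation. A minimal sketch on toy data (the borough/neighborhood values and prices here are invented; only the column names match the dataset):

```python
import pandas as pd

# Toy cleaned data; the real df has these columns after Part 1 cleaning.
df = pd.DataFrame({
    "borough": [1, 1, 3, 3, 3],
    "neighborhood": ["CHELSEA", "CHELSEA", "PARK SLOPE", "PARK SLOPE", "CANARSIE"],
    "sale_price": [2_000_000, 3_000_000, 1_200_000, 1_400_000, 500_000],
    "gross_square_feet": [1000, 1200, 1500, 1600, 1100],
    "residential_units": [1, 1, 2, 2, 1],
})

# Borough sensitivity: mean and std of sale_price per borough.
print(df.groupby("borough")["sale_price"].agg(["mean", "std"]))

# Square-footage sensitivity: price per gross square foot, then per borough.
df["price_per_sqft"] = df["sale_price"] / df["gross_square_feet"]
print(df.groupby("borough")["price_per_sqft"].agg(["mean", "std"]))

# Neighborhood sensitivity: most/least expensive neighborhood in each borough.
nbhd = df.groupby(["borough", "neighborhood"])["sale_price"].mean()
print(nbhd.groupby("borough").idxmax())   # most expensive per borough
print(nbhd.groupby("borough").idxmin())   # least expensive per borough
```

The per-unit variant (question 4) is the same pattern with `sale_price / residential_units` in place of the raw price.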
Part 2 - Predicting Property Prices
Now that we have some ideas about what parameters are important, let’s fit a regression model to predict property prices. Here’s a general outline of your job:
- Feature engineering & selection.
- Train your model.
- Explore different models and features to see what gives the best results.
- Plot your results!
Feature Engineering & Feature Selection
Think about which features would be the most informative to train on. You may also engineer some additional features that could be informative to the model. Don’t simply include every feature in your model; this might result in terribly long training times. Ideally, your model shouldn’t take longer than about 30 seconds to train. For reference, the best model I found took only about 9 seconds to train.
A warning: don’t include any features that are derived from price! For example, if we include `price_per_sqft` and `gross_square_feet`, then our model will easily figure out the price and have basically zero error. Since our goal is to determine the price, we should only provide features that are disjoint from it.
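One practical wrinkle: `borough` and `neighborhood` are categorical, so a linear model can’t consume them directly. A common approach (not the only one) is one-hot encoding with `pd.get_dummies`. The toy values below are invented; note that nothing derived from `sale_price` goes into the feature matrix:

```python
import pandas as pd

# Toy cleaned data; categorical columns need encoding before a linear model.
df = pd.DataFrame({
    "borough": [1, 3, 3, 4],
    "neighborhood": ["CHELSEA", "PARK SLOPE", "CANARSIE", "ASTORIA"],
    "gross_square_feet": [1000.0, 1500.0, 1100.0, 1300.0],
    "total_units": [1, 2, 1, 1],
    "year_built": [1930, 1910, 1955, 1940],
    "sale_price": [2_000_000.0, 1_300_000.0, 500_000.0, 900_000.0],
})

# Candidate features -- deliberately excluding anything derived from price.
features = ["borough", "neighborhood", "gross_square_feet",
            "total_units", "year_built"]
X = pd.get_dummies(df[features], columns=["borough", "neighborhood"])
y = df["sale_price"]
print(X.columns.tolist())  # one indicator column per borough/neighborhood value
```

Beware that one-hot encoding every neighborhood adds a lot of columns, which is one reason training times can balloon if you include everything.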
Picking a Model
Once you’ve got an idea of which features are important, let’s train our model! You may find it useful to start with the code we used for the Titanic lab. You should be doing the following:
- Slice your dataset to include the features you want.
- Split your dataset into a training set and labels, and testing set and labels.
- Fit the model!
- Test the model to determine its accuracy. There are two lines commented out that you may find useful.
You will not be graded on how optimal the model is, simply on whether you’ve correctly trained one. Kudos will be given to the best-tuned model :) Here are some models from scikit-learn that you might want to try, but feel free to use anything you want!
Some tips:
- Start with as few features as possible, then slowly add more. Otherwise, your model might take too long to train! In the real world, you might actually want to do the opposite since you might have some beefy machines and more data typically means better fits.
- Use a fixed random state in your models, e.g. `random_state=2`. This will ensure that you can compare different feature sets.
- The provided lines will print out the (log of the) mean squared error (MSE) and the R^2 value. In short, the lower the MSE the better, and the closer R^2 is to 1 the better. Typically, an R^2 value over 0.7 is fairly good in practice. For reference, I was able to get an R^2 of about 0.5 pretty easily, but it took some effort to reach about 0.64.
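The train/fit/evaluate loop can be sketched like this. Synthetic data stands in for your engineered feature matrix, and `LinearRegression` is just one of the models you might try:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix; in the notebook,
# X and y come from the cleaned property data.
rng = np.random.default_rng(2)
X = rng.uniform(500, 4000, size=(200, 3))            # sqft-like features
y = X @ np.array([300.0, 50.0, 20.0]) + rng.normal(0, 10_000, 200)

# Fixed random_state so different feature sets are comparable run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print("log MSE:", np.log(mean_squared_error(y_test, preds)))
print("R^2:", r2_score(y_test, preds))
```

Swapping `LinearRegression` for another regressor leaves the rest of the loop unchanged, which makes comparing models cheap.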
Visualization
Finally, let’s visualize our predictions. Your goal is to use Matplotlib to generate a scatter plot. The x-axis should be your predicted property value and the y-axis should be the true value. Note that if our prediction was perfect, everything would simply be a point on the line y=x! In light of this, include the line defined by y=x in some other color. This way, we can visualize how far off our estimates are by seeing how far they lie from this line. Finally, add an x label, a y label, and a title to your plot.
Note: given some outliers, you will probably want to cap your axes at something like $2.5M, otherwise you won’t see anything meaningful. Take a look at `set_xlim` and `set_ylim`.
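The plot described above can be sketched as follows. The predictions here are fake (true values plus noise) just to have something to draw; in the notebook you would use your model’s test-set predictions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this in a notebook
import matplotlib.pyplot as plt

# Fake predictions standing in for the model's output on the test set.
rng = np.random.default_rng(2)
y_true = rng.uniform(100_000, 3_000_000, 300)
y_pred = y_true + rng.normal(0, 200_000, 300)

fig, ax = plt.subplots()
ax.scatter(y_pred, y_true, s=5, alpha=0.5)
lims = [0, 2_500_000]
ax.plot(lims, lims, color="red")   # the y = x line: where a perfect model lands
ax.set_xlim(lims)                  # cap the axes to keep outliers from
ax.set_ylim(lims)                  # squashing the interesting region
ax.set_xlabel("Predicted price ($)")
ax.set_ylabel("Actual price ($)")
ax.set_title("Predicted vs. actual NYC property prices")
fig.savefig("predictions.png")
```

Points above the red line are under-predictions, points below it are over-predictions, so the vertical distance to the line is the error on each sale.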
Extra Credit
Visualize your results in any other meaningful way using Matplotlib. Plot something other than what we already plotted!