Predicting house prices with linear regression
Inspired by Kevin Markham’s post on linear regression, Chapter 3 of An Introduction to Statistical Learning and Andrew Ng’s Machine Learning course.
We’ll be working with a dataset of house prices from Kaggle.
Simple Linear Regression
Simple linear regression is a statistical approach for modelling the relationship between a predictor variable X and a response variable Y. It assumes a linear relationship between these two variables, which we use to predict a quantitative output.
Simple linear regression is a very simple approach to supervised learning. But even though it's probably the simplest and most straightforward method, it's the fundamental starting point for all regression methods, so it's important to fully understand it. It's also widely used and easy to interpret, which makes it useful for getting a better understanding of the relationship between the response and the predictor.
Mathematically, we can write this linear relationship as:
y = β0 + β1x + e; where
- y is the output variable (also called response, target or dependent variable). e.g. house prices
- x is the input variable (also called feature, explanatory or independent variable) e.g. size of a house in square feet
- β0 is the intercept (the value of y when x=0)
- β1 is the coefficient for x and the slope of the regression line (“the average increase in Y associated with a one-unit increase in X”)
- e is the error term
When implementing linear regression, the algorithm finds the line of best fit by choosing the model coefficients β0 and β1 so that the line is as close as possible to the actual data points (it minimises the sum of the squared vertical distances between each data point and the line). Once we find β0 and β1 we can use the model to predict the response.
- Black dots are the observed values of x and y (actual data)
- the blue line — line of best fit — is the line that minimises the sum of squared errors
- the red lines are the errors (or residuals) — the vertical distances between the observed values (the actual data) and the line of best fit
- the slope of the blue line is β1
- β0 is the intercept (the value of y when x=0)
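The least-squares estimates described above can also be computed directly from their closed-form expressions. Here is a minimal sketch on made-up toy data (the sizes and prices below are purely illustrative, not from the Kaggle dataset):

```python
import numpy as np

# Toy data: house sizes (sqft) and prices — illustrative values only
x = np.array([800, 1000, 1200, 1500, 2000], dtype=float)
y = np.array([150_000, 200_000, 230_000, 300_000, 400_000], dtype=float)

# Closed-form least-squares estimates:
#   beta1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   beta0 = y_mean - beta1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
beta1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
beta0 = y_mean - beta1 * x_mean

# Residuals (the red lines in the figure): observed minus fitted values
residuals = y - (beta0 + beta1 * x)
```

A useful sanity check: with an intercept in the model, the residuals of a least-squares fit always sum to zero.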
An Example: Predicting house prices with linear regression using scikit-learn
Setting the environment:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
Read the house prices data:
houses = pd.read_csv("kc_house_data.csv")
houses.head()
0. Data cleansing and exploratory analysis
# check for nulls in the data
houses.isnull().sum()
# check for any correlations between variables
corr = houses.corr()
# sqft_living, grade, sqft_above and sqft_living15 seem to have a
# strong correlation with price
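One convenient way to spot the strongest predictors is to look at the correlation of every numeric column with price, sorted. A hedged sketch of that step is below; since the Kaggle file isn't available here, it builds a small synthetic stand-in frame (in the post, `houses` is the frame read from `kc_house_data.csv`):

```python
import pandas as pd
import numpy as np

# Illustrative stand-in for the Kaggle frame
# (in the real notebook: houses = pd.read_csv("kc_house_data.csv"))
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
houses = pd.DataFrame({
    "sqft_living": sqft,
    "bedrooms": rng.integers(1, 6, size=200),
    "price": sqft * 280 + rng.normal(0, 50_000, size=200),
})

# Correlation of every numeric column with price, strongest first
corr_with_price = houses.corr(numeric_only=True)["price"].sort_values(ascending=False)
print(corr_with_price)
```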
Building a linear regression model
1. Prepare the data: define predictor and response variables
# create x and y
feature_cols = ['sqft_living']
x = houses[feature_cols] # predictor (a DataFrame, so scikit-learn gets the 2-D input it expects)
y = houses.price # response
2. Split data into train and test
The train/test split randomly partitions the data into two subsets: the training set, used to fit our learning algorithm so it learns how to predict, and the test set, which we use to get an idea of how the model would perform on new data.
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.2)
# the test set will be 20% of the whole data set
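If you want the split to be reproducible between runs, you can pass a fixed `random_state` (an assumption on my part; the post's split above is unseeded). A quick sketch on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in predictor and response, just to show the split shapes
x = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# random_state fixes the shuffle, so the same rows land in train/test every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)  # 80/20 split
```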
3. Fit the model on the training set
# instantiate, fit
linreg = LinearRegression()
linreg.fit(x_train, y_train)
4. Print the coefficients
print(linreg.intercept_)
print(linreg.coef_)
-46773.6549892 # β0
[282.29917574] # β1: for an increase of 1 square foot in house size,
# the house price will go up by ~$282, on average
The intercept (β0) is the value of y when x=0. In this case it would be the price of a house when sqft_living is 0. (Note that it does not always make sense to interpret the intercept.) The coefficient β1 is the change in y divided by the change in x (i.e. the derivative, the slope of the line of best fit). An increase of 1 square foot in house size is associated with a price increase of $282.3, on average. Note that association doesn't always imply causation.
5. Predict the price of a 1000 sqft_living house using our model:
# using the model coefficients: price = β0 + 1000·β1
price = -46773.6549892 + 1000*282.29917574
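Rather than copying the coefficients by hand, the same prediction can come from the fitted model's `predict` method. A self-contained sketch (the data here is synthetic, generated to roughly match the coefficients above, so the exact numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data, generated to roughly match the fitted line above
rng = np.random.default_rng(1)
x = rng.uniform(500, 4000, size=(300, 1))
y = 282.3 * x.ravel() - 46_773.65 + rng.normal(0, 10_000, size=300)

linreg = LinearRegression().fit(x, y)

# predict expects a 2-D input: one row per house, one column per feature
price_1000 = linreg.predict([[1000]])[0]

# Equivalent to computing intercept + coef * 1000 by hand
manual = linreg.intercept_ + linreg.coef_[0] * 1000
```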
6. Compute the Root Mean Squared Error (RMSE), which is a commonly used metric to evaluate regression models, on the test set:
mse = mean_squared_error(y_test, linreg.predict(x_test))
np.sqrt(mse)
259163.48398162922 # not great
linreg.score(x_test, y_test) # R² ≈ 0.55
We get a root mean squared error of $259,163.48 when predicting a house price, which is really high. This is somewhat expected since we're only using one feature in our model, and it could be greatly improved by adding more features such as the number of bathrooms or bedrooms. We can also see that we're omitting relevant variables by looking at the R squared coefficient: 0.55. This means our model is only able to explain 55% of the variability in house prices.
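A single train/test split can give a noisy error estimate; the `cross_val_score` import from the setup above lets us average the RMSE over several folds instead. A hedged sketch on synthetic stand-in data (the noise level is chosen for illustration, so the printed RMSE is not the Kaggle figure):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with known noise, just to demonstrate the API
rng = np.random.default_rng(2)
x = rng.uniform(500, 4000, size=(500, 1))
y = 282.3 * x.ravel() + rng.normal(0, 50_000, size=500)

# scikit-learn scorers are "higher is better", so RMSE comes back negated
scores = cross_val_score(LinearRegression(), x, y,
                         scoring="neg_root_mean_squared_error", cv=5)
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```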
When we use multiple predictors (or features), it's called multiple linear regression, and we'll look into it in another post.