Predicting house prices with linear regression

by Sara Gaspar

Adrià Luz
5 min read · Jan 24, 2018

Inspired by Kevin Markham’s post on linear regression, Chapter 3 of An Introduction to Statistical Learning and Andrew Ng’s Machine Learning course.

We’ll be working with a dataset of house prices from Kaggle.

Simple Linear Regression

Simple linear regression is a statistical approach for modelling the relationship between a predictor variable X and a response variable Y. It assumes there is a linear relationship between these two variables, and we use that relationship to predict a quantitative output.

Simple linear regression is the most basic approach to supervised learning. But even though it's probably the simplest and most straightforward method, it's the fundamental starting point for all regression methods, so it's important to fully understand how it works. It's also widely used and easy to interpret, which makes it useful for getting a better understanding of the relationship between the response and the predictor.

Mathematically, we can write this linear relationship as:

y = β0 + β1x + e, where:

  • y is the output variable (also called the response, target or dependent variable), e.g. house price
  • x is the input variable (also called the feature, explanatory or independent variable), e.g. size of a house in square meters
  • β0 is the intercept (the value of y when x=0)
  • β1 is the coefficient for x and the slope of the regression line (“the average increase in Y associated with a one-unit increase in X”)
  • e is the error term
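
To make these symbols concrete, here's a tiny optional sketch (the numbers are made up) that simulates house prices from exactly this model:

import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 50_000, 280             # made-up intercept and slope
x = rng.uniform(40, 200, size=100)     # house sizes in square meters
e = rng.normal(0, 20_000, size=100)    # random error term
y = beta0 + beta1 * x + e              # house prices generated by the model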

When fitting a linear regression, the algorithm finds the model coefficients β0 and β1 that place the line of best fit as close as possible to the actual data points, i.e. the values that minimise the sum of the squared vertical distances (residuals) between each data point and the line. Once we have found β0 and β1, we can use the model to predict the response.
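
Continuing the sketch above: for the single-feature case the least-squares coefficients even have a simple closed form, and fitting them to the simulated data recovers values close to the β0 and β1 we used to generate it:

# closed-form least-squares estimates for simple linear regression:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)  # should come out close to 50,000 and 280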

[Figure: scatter plot of the data with the line of best fit and residuals (adaptation from Kevin Markham’s post)]
  • Black dots are the observed values of x and y (actual data)
  • the blue line — line of best fit — is the line that minimises the sum of squared errors
  • the red lines are the errors (or residuals) — the vertical distances between the observed values (the actual data) and the line of best fit
  • the slope of the blue line is β1
  • β0 is the intercept (the value of y when x=0)
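
If you'd like to reproduce a similar plot, here's a rough sketch that reuses x, y and the estimated coefficients from the snippets above:

import matplotlib.pyplot as plt

y_hat = beta0_hat + beta1_hat * x                  # fitted values on the line

plt.scatter(x, y, color="black", s=10, label="observed data")
order = np.argsort(x)
plt.plot(x[order], y_hat[order], color="blue", label="line of best fit")
for xi, yi, yhi in zip(x, y, y_hat):
    plt.plot([xi, xi], [yi, yhi], color="red", linewidth=0.5)  # residuals
plt.xlabel("size (square meters)")
plt.ylabel("price")
plt.legend()
plt.show()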

An Example: Predicting house prices with linear regression using scikit-learn

Setting the environment:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

Read the house prices data:

houses = pd.read_csv("kc_house_data.csv")
houses.head()
Top 5 rows of the houses data set

Data dictionary

0. Data cleansing and exploratory analysis

# check for nulls in the data
houses.isnull().sum()
The dataset doesn’t contain any nulls
# check for correlations between variables (numeric columns only)
corr = houses.corr(numeric_only=True)
sns.heatmap(corr)
# sqft_living, grade, sqft_above and sqft_living15 seem to have a
# high influence on price
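
To put numbers on that hunch (an optional extra, not in the original post), we can also rank the features by their correlation with price:

# rank features by their correlation with price
corr['price'].drop('price').sort_values(ascending=False)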

Building a linear regression model

  1. Prepare the data: define predictor and response variables
# create x and y
feature_cols = ['sqft_living']
x = houses[feature_cols] # predictor (2-D, as scikit-learn expects)
y = houses.price # response

2. Split data into train and test

The train/test split consists of randomly splitting the data into two subsets: the training set, which is used to fit our learning algorithm so it learns how to predict, and the test set, which we use to get an idea of how the model would perform on new, unseen data.

# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2)
# the test set will be 20% of the whole data set

3. Fit the model on the training set

# instantiate, fit
linreg = LinearRegression()
linreg.fit(x_train, y_train)

4. Print the coefficients

print(linreg.intercept_)
print(linreg.coef_)
-46773.6549892
[282.29917574] # for an increase of 1 square foot in house size,
# the house price will go up by ~$282, on average

The intercept (β0) is the value of y when x=0. In this case it would be the price of a house when sqft_living is 0 (note that it does not always make sense to interpret the intercept). The coefficient β1 is the change in y divided by the change in x (i.e. the derivative, the slope of the line of best fit). An increase of 1 square foot in house size is associated with a price increase of $282.3, on average. Note that association doesn’t necessarily imply causation.

5. Predict the price of a 1000 sqft_living house using our model:

# manually
price = -46773.6549892 + 1000*282.29917574
# using the model (scikit-learn expects a 2-D array)
linreg.predict([[1000]])
array([ 238175.93397914])

6. Compute the Root Mean Squared Error (RMSE), which is a commonly used metric to evaluate regression models, on the test set:

mse = mean_squared_error(y_test, linreg.predict(x_test))
np.sqrt(mse)
259163.48398162922 # not great
linreg.score(x_test, y_test)
0.55433142764860421

We get a root mean squared error of $259,163.48 when predicting a house price, which is really high. This is somewhat expected, since we’re only using one feature in our model, and it could be greatly improved by adding more features such as the number of bathrooms or bedrooms. We can also see that we’re omitting relevant variables by looking at the R squared coefficient, which is 0.55: our model is only able to explain 55% of the variability in house prices.
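
Since we imported cross_val_score at the top, here's an optional sketch of how we could get a cross-validated estimate of the RMSE instead of relying on a single train/test split (the exact numbers will differ from the ones above):

# 5-fold cross-validated RMSE on the full data set
scores = cross_val_score(LinearRegression(), houses[feature_cols], houses.price,
                         scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-scores).mean())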

When we use multiple predictors (or features), we call it multiple linear regression; we’ll be looking into it in another post.
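
As a teaser, here's a minimal sketch of what that could look like on this data set (the extra features chosen here are just an illustrative guess):

# illustrative sketch: the same workflow with a few extra features
feature_cols = ['sqft_living', 'bedrooms', 'bathrooms', 'grade']
x = houses[feature_cols]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
linreg = LinearRegression().fit(x_train, y_train)
print(linreg.score(x_test, y_test))  # R² will typically improve over the single-feature model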
