Predicting house prices with linear regression

Inspired by Kevin Markham’s post on linear regression, Chapter 3 of An Introduction to Statistical Learning, and Andrew Ng’s Machine Learning course.

We’ll be working with a dataset of house prices from Kaggle.

Simple linear regression is a statistical approach for modelling the relationship between a predictor variable X and a response variable Y. It assumes a linear relationship between the two variables, which we use to predict a quantitative output.

Simple linear regression is a very simple approach to supervised learning. But even though it’s probably the simplest and most straightforward method, it’s the fundamental starting point for all regression methods, so it’s important to fully understand it. It’s also widely used and easy to interpret, which makes it useful for understanding the relationship between the response and the predictor.

Mathematically, we can write this linear relationship as:

y = β0 + β1x + e, where:

  • y is the output variable (also called the response, target or dependent variable), e.g. house prices
  • x is the input variable (also called the feature, explanatory or independent variable), e.g. the size of a house in square feet
  • β0 is the intercept (the value of y when x=0)
  • β1 is the coefficient for x and the slope of the regression line (“the average increase in Y associated with a one-unit increase in X”)
  • e is the error term

When implementing linear regression, the algorithm finds the line of best fit by choosing the model coefficients β0 and β1 so that the line is as close as possible to the actual data points, i.e. it minimises the sum of the squared vertical distances between each data point and the line. Once we have estimated β0 and β1, we can use the model to predict the response.
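For simple linear regression, these least-squares coefficients have a closed-form solution, where x̄ and ȳ are the sample means of x and y:

β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
β0 = ȳ − β1x̄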

Adaptation from Kevin Markham’s post
  • the black dots are the observed values of x and y (the actual data)
  • the blue line — the line of best fit — is the line that minimises the sum of squared errors
  • the red lines are the errors (or residuals) — the vertical distances between the observed values and the line of best fit
  • the slope of the blue line is β1
  • β0 is the intercept (the value of y when x=0)

Setting the environment:
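The exact imports aren’t shown in the extracted post, so here is a minimal sketch assuming the standard Python data-science stack:

```python
# Assumed setup: pandas for data handling, matplotlib for plots,
# scikit-learn for the regression model and evaluation metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```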

Read the house prices data:
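Given the sqft_living feature used later, the data is likely the King County house sales dataset from Kaggle; the file name below is an assumption, so adjust the path to wherever you saved the CSV:

```python
# Load the Kaggle house prices data into a DataFrame
df = pd.read_csv("kc_house_data.csv")
df.head()  # inspect the first five rows
```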

Top 5 rows of the houses data set

Data dictionary

0. Data cleansing and exploratory analysis
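A sketch of the checks this step typically involves, assuming the price and sqft_living columns of the King County dataset:

```python
df.info()          # column types and non-null counts
df.isnull().sum()  # missing values per column
df.describe()      # summary statistics for the numeric columns

# Visualise the relationship between living area and price
df.plot(kind="scatter", x="sqft_living", y="price", alpha=0.3)
plt.show()
```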

The dataset doesn’t contain any nulls

Building a linear regression model

1. Prepare the data: define the predictor and response variables
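Assuming the King County column names, this step might look like:

```python
# Predictor (feature) and response
X = df[["sqft_living"]]  # double brackets: scikit-learn expects a 2-D feature matrix
y = df["price"]
```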

2. Split data into train and test

The train/test split consists of randomly dividing the data into two subsets: the training set, used to fit our learning algorithm so it learns how to predict, and the test set, which we use to get an idea of how the model would perform on new data.
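A common way to do this with scikit-learn; the 75/25 split and the random seed are assumptions, not necessarily what the original used:

```python
# Hold out 25% of the rows for testing; random_state fixes the
# shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```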

3. Fit the model on the training set
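A minimal sketch using scikit-learn’s LinearRegression:

```python
# Fit an ordinary least squares model on the training set only
model = LinearRegression()
model.fit(X_train, y_train)
```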

4. Print the coefficients
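With scikit-learn, the fitted coefficients are available as attributes of the model:

```python
print(model.intercept_)  # β0, the intercept
print(model.coef_)       # β1, one coefficient per feature
```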

The intercept (β0) is the value of y when x=0; in this case, the price of a house when sqft_living is 0. (Note that it doesn’t always make sense to interpret the intercept.) The coefficient β1 is the change in y for a one-unit change in x, i.e. the slope of the line of best fit. An increase of 1 square foot in house size is associated with a price increase of $282.3, on average. Note that association doesn’t always imply causation.

5. Predict the price of a house with 1,000 sqft of living area using our model:
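A sketch of how this looks with scikit-learn, which expects 2-D input:

```python
# Wrap the value in a DataFrame so it matches the training input
new_house = pd.DataFrame({"sqft_living": [1000]})
model.predict(new_house)
```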

6. Compute the Root Mean Squared Error (RMSE), which is a commonly used metric to evaluate regression models, on the test set:
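A sketch of computing both the RMSE and the R squared coefficient (discussed below) on the held-out test set:

```python
y_pred = model.predict(X_test)

# RMSE is in the same units as the response (dollars), which makes
# it easy to interpret
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# R squared: the proportion of the variance in price that the
# model explains
r2 = r2_score(y_test, y_pred)

print(f"RMSE: ${rmse:,.2f}")
print(f"R squared: {r2:.2f}")
```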

We get a root mean squared error of $259,163.48 when predicting house prices, which is very high. This is expected, since we’re only using one feature; the model could be greatly improved by adding more features, such as the number of bathrooms or bedrooms. We can also see that we’re omitting relevant variables by looking at the R squared coefficient: 55%. This means our model is only able to explain 55% of the variability in house prices.

When we use multiple predictors (or features), we call it multiple linear regression; we’ll look into it in another post.

Written by Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).
