Predicting house prices with linear regression

Inspired by Kevin Markham’s post on linear regression, Chapter 3 of An Introduction to Statistical Learning and Andrew Ng’s Machine Learning course.

Simple Linear Regression

Simple linear regression is a statistical approach for modelling the relationship between a predictor variable X and a response variable Y. It assumes a linear relationship between the two variables, which we use to predict a quantitative output. The model takes the form Y = β0 + β1X + e, where:

  • x is the input variable (also called feature, explanatory or independent variable), e.g. the size of a house in square metres
  • β0 is the intercept (the value of y when x=0)
  • β1 is the coefficient for x and the slope of the regression line (“the average increase in Y associated with a one-unit increase in X”)
  • e is the error term
[Figure: scatter plot with regression line and residuals, adapted from Kevin Markham’s post]
  • the blue line — line of best fit — is the line that minimises the sum of squared errors
  • the red lines are the errors (or residuals) — the vertical distances between the observed values (the actual data) and the line of best fit
  • the slope of the blue line is β1
  • β0 is the intercept (the value of y when x=0)
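The least-squares estimates behind the line of best fit can be computed directly: β1 is the covariance of x and y divided by the variance of x, and β0 follows from the means. A minimal NumPy sketch, using made-up example data (not the housing data set used below):

```python
import numpy as np

# made-up example data: house sizes (square metres) and prices
x = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
y = np.array([150000.0, 200000.0, 260000.0, 310000.0, 370000.0])

# least-squares estimates:
# beta1 = cov(x, y) / var(x); beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# residuals: the vertical distances between the data and the fitted line
residuals = y - (beta0 + beta1 * x)
print(beta0, beta1)  # intercept and slope of the line of best fit
```

These are exactly the values that minimise the sum of squared residuals; scikit-learn's `LinearRegression` computes the same quantities for us below.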

An Example: Predicting house prices with linear regression using scikit-learn

Setting the environment:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
houses = pd.read_csv("kc_house_data.csv")
houses.head()
Top 5 rows of the houses data set
#check for nulls in the data
houses.isnull().sum()
The dataset doesn’t contain any nulls
# check for any correlations between variables
corr = houses.corr()
sns.heatmap(corr)
# sqft_living, grade, sqft_above and sqft_living15 seem to have a
# strong correlation with price
# create x and y
feature_cols = ['sqft_living']
x = houses[feature_cols] # predictor (a DataFrame, since scikit-learn expects 2D input)
y = houses.price # response
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.2)
# the test set will be 20% of the whole data set
# instantiate, fit
linreg = LinearRegression()
linreg.fit(x_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
-46773.6549892
[282.29917574] # for an increase of 1 square foot of living space,
# the house price goes up by ~$282, on average
# manually
price = -46773.6549892 + 1000*282.29917574
# using the model (predict expects a 2D array)
linreg.predict([[1000]])
array([ 238175.93397914])
mse = mean_squared_error(y_test, linreg.predict(x_test))
np.sqrt(mse)
259163.48398162922 # an RMSE of ~$259k, not great
linreg.score(x_test, y_test) # R² on the test set
0.55433142764860421
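A single train/test split gives a somewhat noisy estimate of model quality. The `cross_val_score` import above can be put to use to average the R² score over several splits; a sketch on synthetic stand-in data (the real `houses` DataFrame isn't available here, so a linear-plus-noise relationship is assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the housing data:
# price roughly linear in living space, plus noise
rng = np.random.RandomState(0)
size = rng.uniform(500, 4000, size=200).reshape(-1, 1)
price = 280 * size.ravel() + 50000 + rng.normal(0, 40000, size=200)

# 5-fold cross-validation: fit on 4/5 of the data, score R² on the
# held-out fifth, and repeat five times
scores = cross_val_score(LinearRegression(), size, price, cv=5, scoring="r2")
print(scores.mean())
```

On the real data this would be `cross_val_score(linreg, x, y, cv=5)`, giving a more stable estimate than the single 0.55 score above.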

Tales about data, statistics, machine learning, visualisation, and much more. By Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).
