# Predicting house prices with linear regression

Inspired by Kevin Markham’s post on linear regression, Chapter 3 of *An Introduction to Statistical Learning*, and Andrew Ng’s Machine Learning course.

We’ll be working with a dataset of house prices from Kaggle.

## Simple Linear Regression

Simple linear regression is a statistical approach for modelling the relationship between a predictor variable X and a response variable Y. It assumes there is a linear relationship between these two variables and we use that to predict a quantitative output.

Simple linear regression is a very simple approach to supervised learning. But even though it’s probably the simplest and most straightforward method, it’s the fundamental starting point for all regression methods, so it’s important to fully understand it. It’s also widely used and easy to interpret, which makes it useful for understanding the relationship between the response and the predictor.

Mathematically, we can write this linear relationship as:

*y = β0 + β1x + e*, where:

- *y* is the output variable (also called the response, target or dependent variable), e.g. house price
- *x* is the input variable (also called the feature, explanatory or independent variable), e.g. the size of a house in square feet
- *β0* is the intercept (the value of y when x = 0)
- *β1* is the coefficient for *x* and the slope of the regression line (“the average increase in Y associated with a one-unit increase in X”)
- *e* is the error term

When implementing linear regression, the algorithm finds the **line of best fit** using the model coefficients *β0* and *β1*, such that the line is as close as possible to the actual data points (it minimises the sum of the squared distances between each data point and the line). Once we find *β0* and *β1* we can use the model to predict the response.

- The black dots are the observed values of *x* and *y* (the actual data)
- The blue line, the **line of best fit**, is the line that minimises the sum of squared errors
- The red lines are the errors (or residuals): the vertical distances between the observed values and the line of best fit
- The slope of the blue line is *β1*, and *β0* is the intercept (the value of y when *x* = 0)
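To make the “line of best fit” idea concrete, the least-squares coefficients have a simple closed form that can be computed directly with NumPy. This is a minimal sketch on synthetic, purely illustrative data (not the Kaggle set):

```python
import numpy as np

# Synthetic data drawn around a known line y = 3 + 2x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)

# Closed-form least-squares estimates:
#   beta1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
#   beta0 = ȳ - beta1 * x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)  # should come out close to the true values 3 and 2
```

This is exactly what scikit-learn’s `LinearRegression` computes under the hood for the one-feature case.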

## An Example: Predicting house prices with linear regression using scikit-learn

Setting the environment:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
```

Read the house prices data:

```python
houses = pd.read_csv("kc_house_data.csv")
houses.head()
```

0. Data cleansing and exploratory analysis

```python
# check for nulls in the data
houses.isnull().sum()

# check for correlations between variables
corr = houses.corr()
sns.heatmap(corr)

# sqft_living, grade, sqft_above and sqft_living15 seem to have a
# high influence on price
```

**Building a linear regression model**

1. Prepare the data: define predictor and response variables

```python
# create x and y
feature_cols = ['sqft_living']
x = houses[feature_cols]  # predictor (a DataFrame, so x stays two-dimensional)
y = houses.price          # response
```

2. Split data into *train* and *test*

The train/test split consists in randomly dividing the data into two subsets: the *training set*, used to fit our learning algorithm so it learns how to predict, and the *test set*, which we use to get an idea of how the model would perform on new data.

```python
# split data into train and test
# the test set will be 20% of the whole data set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
```

3. Fit the model on the *training set*

```python
# instantiate, fit
linreg = LinearRegression()
linreg.fit(x_train, y_train)
```

4. Print the coefficients

```python
print(linreg.intercept_)
print(linreg.coef_)

# -46773.6549892
# [282.29917574]  -> for an increase of 1 square foot in house size,
#                    the house price goes up by ~$282, on average
```

The intercept (*β0*) is the value of y when x = 0. In this case it would be the price of a house whose sqft_living is 0. (Note that it does not always make sense to interpret the intercept.) The coefficient *β1* is the change in y divided by the change in *x* (i.e. the derivative, the slope of the line of best fit). An increase of 1 square foot in house size is **associated with** a price increase of $282.3, on average. Note that association doesn’t always imply causation.
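One way to see this interpretation concretely: for a fitted linear model, the prediction at x + 1 differs from the prediction at x by exactly *β1*. A small sketch on synthetic data (not the Kaggle set, just illustrative numbers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: price roughly 280 * size plus noise
rng = np.random.default_rng(1)
size = rng.uniform(500, 4000, (200, 1))
price = 280 * size.ravel() + rng.normal(0, 20000, 200)

model = LinearRegression().fit(size, price)

# Predict at x and at x + 1: the difference is exactly the slope beta1
p = model.predict([[1000], [1001]])
print(p[1] - p[0], model.coef_[0])  # the two numbers match
```

This holds for any x, which is what “the average increase in Y associated with a one-unit increase in X” means for a linear model.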

5. Predict the price of a 1000 sqft_living house using our model:

```python
# manually
price = -46773.6549892 + 1000 * 282.29917574

# using the model (predict expects a 2-D array)
linreg.predict([[1000]])

# array([ 238175.93397914])
```

6. Compute the Root Mean Squared Error (RMSE), a commonly used metric for evaluating regression models, on the *test set*:

```python
mse = mean_squared_error(y_test, linreg.predict(x_test))
np.sqrt(mse)
# 259163.48398162922  -> not great

linreg.score(x_test, y_test)
# 0.55433142764860421
```

We get a root mean squared error of $259,163.48 when predicting a price for a house, which is really high. This is kind of expected since we’re only using one feature in our model, and it could be greatly improved by adding more features such as number of bathrooms or bedrooms. We can also see that we’re omitting relevant variables by looking at the R squared coefficient: 55%. This means that our model is only able to explain 55% of the variability in house prices.
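A single train/test split can also give a noisy error estimate, which is why we imported `cross_val_score` earlier. A minimal sketch of k-fold cross-validated RMSE, on synthetic stand-in data (the real run would pass the `x` and `y` defined above); note that scikit-learn’s scoring convention returns *negated* MSE, so we flip the sign before taking the square root:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the house data
rng = np.random.default_rng(2)
X = rng.uniform(500, 4000, (300, 1))
y = 280 * X.ravel() + rng.normal(0, 50000, 300)

# 5-fold cross-validation; each fold yields a negated MSE
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores).mean()
print(rmse)  # roughly the noise level of the data, ~50,000 here
```

Averaging the error over several folds gives a more stable picture of out-of-sample performance than one 80/20 split.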

When we use multiple predictors (or features) we call it **Multiple linear regression** and we’ll be looking into it in another post.