Visualising football players in two dimensions with PCA

As Trevor Hastie and his co-authors explain in the book An Introduction to Statistical Learning, when we are faced with a large set of correlated variables, principal components allow us to summarise this set with a smaller number of representative variables that collectively explain most of the variability in the original set. Principal Components Analysis (PCA) is an unsupervised machine learning algorithm that performs dimensionality reduction and allows us to do just that.

To illustrate how PCA works, let’s look at this scatter plot of two variables (x1 and x2). PCA finds the vector u1, whose direction is the one with the greatest variability. The vector is placed in a way that the projection error (the thin lines between the blue and red dots) is minimised. In this case, PCA takes data (blue dots) that exists in a 2-dimensional space and projects it onto a 1-dimensional space. You could imagine the vector u1 as the x axis of this new space, which represents the original dataset in just 1 dimension through the red dots.
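As a rough sketch of this projection (scikit-learn on made-up data; the variables x1 and x2 mirror the ones in the chart):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two correlated variables, like x1 and x2 in the scatter plot
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=100)
X = np.column_stack([x1, x2])

# u1 is the direction of greatest variability; projecting onto it
# turns the 2-dimensional blue dots into 1-dimensional red dots
pca = PCA(n_components=1)
scores = pca.fit_transform(X)

print(scores.shape)  # (100, 1)
```

Almost all of the original variability survives the projection; `pca.explained_variance_ratio_` reports exactly how much.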

In this post, I won’t go into any more details about the inner workings of PCA, nor important things you must keep in mind when using it (like the need to scale your features beforehand!). If you’re interested, go read the book!

FIFA 2018 dataset

This dataset contains 74 player attributes for the 17,981 football players that are in the game. You can get it from Kaggle.

First 5 rows of the data set. Note this only shows the first 8 columns (out of 74).

For this application, I’m only interested in looking at the 34 columns that contain information about the abilities of each player, such as acceleration, ball control, or dribbling.
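In code, that selection is just a column slice. A minimal sketch on a tiny stand-in frame (the real dataset has 17,981 rows and 74 columns; the three ability columns shown here stand in for the full list of 34):

```python
import pandas as pd

# Tiny stand-in for the Kaggle data (hypothetical values)
df = pd.DataFrame({
    'Name': ['Cristiano Ronaldo', 'L. Messi'],
    'Age': [32, 30],
    'Acceleration': [89, 92],
    'Ball control': [93, 95],
    'Dribbling': [91, 97],
})

# Keep only the ability columns
ability_cols = ['Acceleration', 'Ball control', 'Dribbling']
abilities = df[ability_cols]

print(abilities.shape)  # (2, 3)
```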


It would be very nice if we could visualise this data and see how players are similar to or different from each other. The problem is that it is impossible to plot data in a 34-dimensional space: the human eye can only interpret charts in 2D or, at most, 3D. By using PCA, we can reduce the dimensionality of the data to just 2 dimensions, which allows us to do exactly what we want: visualise the data.
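A minimal sketch of that reduction (random numbers stand in for the real 34 ability columns; note the features are scaled first, as mentioned earlier):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the abilities of 50 players across 34 attributes
rng = np.random.default_rng(1)
abilities = rng.integers(40, 99, size=(50, 34)).astype(float)

# Scale each feature, then project onto the first two principal components
X = StandardScaler().fit_transform(abilities)
components = PCA(n_components=2).fit_transform(X)

print(components.shape)  # (50, 2): one (PC1, PC2) point per player
```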

First two principal components for the top 50 players.

Looking at this chart, it seems like the first component is basically separating the goalkeepers from the field players, while the second component is differentiating within field players themselves. If you know some of the players in the chart, you might have already spotted that defenders are at the top (e.g. Chiellini, Ramos, Boateng, etc.), whereas more technical attacking players are located at the bottom (e.g. Neymar, Dybala, Hazard, etc.). It’s nice to see the best player in history at the very bottom-left corner.

One problem with this chart is that, because goalkeepers are so different from field players, we can’t really see much variation of the first principal component for the field players. Excluding the goalkeepers fixes that:

First two principal components for the top 50 players, excluding the goalkeepers.
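The exclusion itself is just a filter on the Preferred Positions column before refitting PCA (a sketch on a toy frame with hypothetical rows):

```python
import pandas as pd

# Toy frame mirroring the Preferred Positions column
df = pd.DataFrame({
    'Name': ['De Gea', 'Sergio Ramos', 'Neymar'],
    'Preferred Positions': ['GK', 'CB', 'LW'],
})

# Drop the goalkeepers, then refit PCA on the remaining players
outfield = df[~df['Preferred Positions'].str.contains('GK')]

print(list(outfield['Name']))  # ['Sergio Ramos', 'Neymar']
```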

In this chart there’s a lot more space between players, which helps us see the differences more clearly. The pattern still holds: defenders sit in the upper-right quadrant, the region that was closest to the goalkeepers in the previous chart. Marcelo is almost closer to the midfielders than he is to the defenders, despite being a defender himself. This makes sense because he is a very attacking left back, and his offensive efforts definitely add more value to Real Madrid than his defensive actions. It’s also interesting to see how pure target men like Lewandowski and Suárez are clustered together, slightly separated from attacking players of a very different profile like Neymar and Hazard.

Of course, we could visualise as many players as we wanted:

First two principal components for a larger set of players.

However, with so many players, it becomes impractical to add labels with the players’ names. One thing we could do is colour each circle based on the player’s position on the pitch. Luckily, we have this information in our dataset (although we’ll need to do a few things first).

There’s a column called Preferred Positions that tells us which positions each player likes to play in, as a space-separated string (e.g. “ST LW” for a player who plays striker or left wing).

The problem is that some players have more than one preferred position. We need a way to assign one and only one position to each player.

The first approach I thought of was the simple one: use the first position listed for each player (e.g. Ronaldo would get ST, Alexis Sánchez would get RM, Bale would get RW, and so on). If you know who Alexis Sánchez is, you will see what the problem with this approach is. If we assigned RM to him, he would get labelled as a midfielder, which is clearly wrong.
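For reference, that naive first approach is a one-liner in pandas (toy frame; the position strings are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Cristiano Ronaldo', 'Alexis Sánchez', 'G. Bale'],
    'Preferred Positions': ['ST LW', 'RM LM ST', 'RW ST'],
})

# Take the first listed position for each player
df['First Position'] = df['Preferred Positions'].str.split().str[0]

print(list(df['First Position']))  # ['ST', 'RM', 'RW']
```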

The second and more complex approach I came up with was the following: manually label the top 50 players, going one by one and writing down what their position really is, then train a logistic regression on that data and use it to predict the positions of all 17,981 players. So that’s what I did.

def assign_position(df, name, pos):
    """Assigns a position (pos) to a player (name)"""
    df.loc[df.Name == name, 'Preferred Position'] = pos
    return df

assign_position(df, 'Cristiano Ronaldo', 'A')
assign_position(df, 'L. Messi', 'A')
assign_position(df, 'Neymar', 'A')
assign_position(df, 'L. Suárez', 'A')
assign_position(df, 'M. Neuer', 'GK')
assign_position(df, 'R. Lewandowski', 'A')
assign_position(df, 'De Gea', 'GK')
assign_position(df, 'E. Hazard', 'A')
assign_position(df, 'T. Kroos', 'DM')
assign_position(df, 'G. Higuaín', 'A')
assign_position(df, 'Sergio Ramos', 'D')
assign_position(df, 'K. De Bruyne', 'AM')
# etc.
# etc.
# etc.

These labels are what I used as the response variable for my model. As features I used the Preferred Positions column we’ve seen before. I encoded it in a way that the model could understand:

# get a list of all the unique positions
positions = df['Preferred Positions'].str.split()\
    .apply(lambda x: x[0]).value_counts().index.values

# for each position...
for pos in positions:
    # create an empty column with the name of the position
    df[pos] = np.nan
    # if the position is in the preferred positions column, put 1
    df.loc[df['Preferred Positions'].str.contains(pos), pos] = 1
    # if not, put 0
    df[pos].fillna(0, inplace=True)

Which gave me this:

Preferred positions column and the new encoding columns that were used to train the model.
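As a quick sanity check, here is the same encoding logic run on a toy frame of three hypothetical players:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Preferred Positions': ['ST', 'RM ST', 'GK']})

positions = df['Preferred Positions'].str.split()\
    .apply(lambda x: x[0]).value_counts().index.values

for pos in positions:
    df[pos] = np.nan
    df.loc[df['Preferred Positions'].str.contains(pos), pos] = 1
    df[pos] = df[pos].fillna(0)

print(df[['ST', 'RM', 'GK']].values.tolist())
# [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Note that a player with “RM ST” gets a 1 in both the RM and ST columns, which is exactly the multi-label information the model needs.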

I performed a nested cross-validation and got ~88% accuracy. Happy with that!
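A nested cross-validation of that kind can be sketched like this (synthetic data stands in for the 50 hand-labelled players; the inner loop tunes the regularisation strength C, the outer loop estimates accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the 50 labelled players
X, y = make_classification(n_samples=50, n_features=14, n_informative=6,
                           n_classes=3, random_state=0)

# Inner loop: pick C on each training split; outer loop: score held-out folds
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {'C': [0.1, 1.0, 10.0]}, cv=3)
scores = cross_val_score(inner, X, y, cv=5)

print(round(scores.mean(), 2))
```

Because the hyperparameter search never sees the outer test folds, the resulting accuracy estimate is less optimistically biased than a plain cross-validation.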

At this point, I had my clean version of the Preferred Positions column in the form of predictions from my logistic regression model, so I could plot the data and colour it by position:
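The plot itself can be sketched with matplotlib (random component scores and position labels stand in for the real predictions):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical PCA scores and predicted positions
rng = np.random.default_rng(2)
components = rng.normal(size=(200, 2))
positions = rng.choice(['GK', 'D', 'DM', 'AM', 'A'], size=200)

# One colour per predicted position
fig, ax = plt.subplots()
for pos in ['GK', 'D', 'DM', 'AM', 'A']:
    mask = positions == pos
    ax.scatter(components[mask, 0], components[mask, 1], label=pos, s=15)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.legend()
fig.savefig('players_by_position.png')
```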

First two principal components, coloured by position.

And, as you would expect, the defenders are still at the top, followed by the defensive midfielders, attacking midfielders, and attacking players at the very bottom. There’s certainly some overlap, especially in the middle and the bottom, but this is normal since players are not just either defensive or attacking midfielders, for example. Some players are more versatile than others and feel comfortable playing in different positions, which will be reflected in their abilities and ultimately in this plot.

You can find the notebook with all the code here.

Tales about data, statistics, machine learning, visualisation, and much more. By Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).
