How do we make sense of the shape and distribution of data in a 2D space? In this two-part series, we look at how kernel density plots can be used to visualize football shot data, and how a random forest predictor can help us predict the probability of a goal based on where the shot was taken on the field.

The Problem

Association football (or soccer) is the world’s most popular sport. Two teams of 11 players compete to put the ball into the opposing goal, and the objective of the game is to outscore the opponent. Players may use any part of their bodies except their arms to play the ball (goalkeepers may use their arms within the penalty area). Each game lasts 90 minutes, and each team typically takes 5 to 20 shots at the goal (“shots”), scoring a fraction of them.

Using data from Wyscout, a company that compiles match videos and tags match events, we depict all shots made by Liverpool Football Club in the 2017-18 English Premier League (38 games, 638 shots). The accompanying code in R is on our GitHub page.

Scatterplot depicting all shots made by Liverpool in the 2017-18 English Premier League (638 shots). Each shot is represented by a grey dot, which marks the position on the field from which the shot was taken towards the goal on the right.

From the scatterplot, we can see that most shots were taken within the penalty area, where players have the highest chance of scoring. In fact, the chance of scoring is affected by the angle and distance from goal; we will get to that in part two of this series. Beyond this, however, there seem to be no other observable characteristics or patterns in the data.

How can we then make sense of the shape and distribution of these shots?

An Illustration

We could perhaps differentiate the shots based on which players made them; this may be useful if we wish to analyze the performance of individual players. For example, in Liverpool’s case, we may depict the shots made by their top three attackers:

Scatterplot depicting all shots made by Liverpool in the 2017-18 English Premier League (638 shots), with the three players who took the most shots highlighted: Mohamed Salah (144 shots), Roberto Firmino (84 shots), Sadio Mane (70 shots).

While individual visual representations are helpful, how can we summarize these insights to enable comparisons with other teams?

A kernel density plot is a good way to achieve this.

Kernel density plot of all shots made by Liverpool in the 2017-18 English Premier League (638 shots). The plot is divided into ten areas, with each area containing about a tenth of all shots. The red-most area is where shots are most concentrated.

Kernel density plots reveal several insights:

First, we can identify the centers of the data distribution, i.e. the areas where players made the most shots. In the plot above, we see three centers: one outside the penalty box and two inside it, to the left and right of the goal.

Second, we can identify the concentration of data points from the centers, i.e. how close or far apart the data points are. The plot above is divided into ten areas, with each area containing about a tenth of all shots. Areas (or lines) that are closer together depict a concentrated number of shots made within those areas. We can see that shots tend to be concentrated inside the box, and more spread out outside the box.

In fact, the centers and concentration of this kernel density plot correspond directly to the shot patterns of the three Liverpool players depicted in the earlier scatterplot:

Firmino, who makes infrequent shots but tends to shoot more from outside the box as compared to other attackers, is represented by the lighter-colored center outside the box;
Mane, who mostly shoots inside the box at the left of goal, is represented by the dark-colored small center; and
Salah, who mostly shoots inside the box at the right of goal and who made twice as many shots as Mane, is represented by the dark-colored large center.

See bonus graphs at the end for an analysis of the top teams in Europe.

Technical Explanation

Think of kernel density plots as smoothed histograms. Going back to our analysis of Liverpool’s play, we can summarize the number of shots they made in each of their 38 games in the 2017-18 Premier League season:

*Histogram and kernel density plot of Liverpool’s shots across 38 games in the 2017-18 Premier League season.*

To visualize the underlying probability distribution of shots per game, this histogram can be smoothed into a kernel density plot. This is done using a kernel, which is a function that transforms each data point into a curve; these individual curves are then summed to produce an overall probability density plot. Typically, a Gaussian kernel is used (i.e. a Gaussian bell curve).
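The smoothing step can be sketched in a few lines. The accompanying code for this post is in R; the Python below is only an illustrative sketch, not that code. The function names (`gaussian_kernel`, `kde_1d`), the bandwidth of 2.5, and the shots-per-game values are all assumptions made up for the example.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: the bell curve placed on each data point."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(data, x, bandwidth):
    """Density estimate at x: the average of one scaled bell curve per data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Hypothetical shots-per-game counts (made up; not the Wyscout data).
shots_per_game = [9, 12, 14, 15, 15, 17, 18, 18, 20, 23, 27]

# Evaluate the smoothed curve on a grid from 0 to 40 in steps of 0.5.
step = 0.5
grid = [i * step for i in range(81)]
density = [kde_1d(shots_per_game, x, bandwidth=2.5) for x in grid]

# A probability density should enclose an area of roughly 1.
area = sum(d * step for d in density)
print(round(area, 3))
```

Dividing by `n * bandwidth` is what keeps the total area under the summed curves close to 1, so the result can be read as a probability density rather than a raw count.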

One advantage of kernel density plots over histograms is that two density plots can easily be compared. For example, the figure below compares Liverpool and Manchester City. The latter had a shot distribution with a higher average and a lower variance:

*Comparing shot frequency between Liverpool and Manchester City, across 38 games in the 2017-18 Premier League season.*

Histograms and kernel density plots also work with one additional dimension. Going back to the shot pattern analysis earlier, we can think of it as forming a histogram by counting the number of shots made in each part of the football field (a two-dimensional space). A kernel density plot can then be rendered in either 3D or 2D.
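As a rough sketch of the two-dimensional case, the Python below counts shots in square cells (a 2D histogram) and evaluates a 2D Gaussian kernel density at chosen locations. It is not the authors' R code: the helper names, the 5-metre cell size, the bandwidths, and the shot coordinates (metres on an assumed 105 x 68 pitch, attacking the goal on the right) are all hypothetical.

```python
import math

def histogram_2d(points, cell):
    """Count shots in square cells of side `cell`, keyed by (column, row)."""
    counts = {}
    for px, py in points:
        key = (int(px // cell), int(py // cell))
        counts[key] = counts.get(key, 0) + 1
    return counts

def kde_2d(points, x, y, hx, hy):
    """Density at (x, y): the average of one 2D Gaussian bump per shot location."""
    total = 0.0
    for px, py in points:
        ux = (x - px) / hx
        uy = (y - py) / hy
        total += math.exp(-0.5 * (ux * ux + uy * uy))
    return total / (len(points) * 2.0 * math.pi * hx * hy)

# Hypothetical shot coordinates (made up; not the Wyscout data).
shots = [(95, 30), (96, 36), (93, 34), (90, 40), (88, 28), (97, 33), (80, 35), (92, 31)]

counts = histogram_2d(shots, cell=5)
near_goal = kde_2d(shots, 94, 33, hx=4.0, hy=4.0)
midfield = kde_2d(shots, 52, 34, hx=4.0, hy=4.0)
print(counts)
print(near_goal > midfield)
```

Evaluating `kde_2d` over a grid of (x, y) positions gives the surface that can be drawn either as 3D hills or as 2D contours.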

*Renderings of a 3D histogram and kernel density plot of all shots made by Liverpool in the 2017-18 English Premier League*. The bottom two images are equivalent – the left image depicts a 3D hills-like rendering of the kernel density plot, whereas the right image depicts a 2D contour-like image of the same kernel density plot.
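One possible way to obtain contour levels like those in the ten-area plot from the illustration section, where each area holds about a tenth of the shots, is to evaluate the density at every shot and use the deciles of those values as thresholds. This is an assumed approach, sketched in Python with hypothetical names and randomly generated shot positions; the post does not say how its contours were computed.

```python
import math
import random

def kde_2d(points, x, y, h):
    """2D Gaussian kernel density at (x, y) with the same bandwidth h on both axes."""
    total = 0.0
    for px, py in points:
        total += math.exp(-0.5 * (((x - px) / h) ** 2 + ((y - py) / h) ** 2))
    return total / (len(points) * 2.0 * math.pi * h * h)

def decile_levels(points, h):
    """Thresholds such that roughly k/10 of the shots have density at or above the k-th level."""
    densities = sorted((kde_2d(points, px, py, h) for px, py in points), reverse=True)
    n = len(densities)
    return [densities[max(0, math.ceil(n * k / 10) - 1)] for k in range(1, 11)]

def band_of(points, x, y, h, levels):
    """Index of the band containing (x, y): 0 is the densest, 9 the sparsest."""
    d = kde_2d(points, x, y, h)
    for i, level in enumerate(levels):
        if d >= level:
            return i
    return len(levels)  # outside the outermost contour

# Hypothetical shot positions drawn at random around the penalty area (made up).
rng = random.Random(0)
shots = [(rng.gauss(94, 6), rng.gauss(34, 8)) for _ in range(20)]

levels = decile_levels(shots, h=4.0)
bands = [band_of(shots, px, py, 4.0, levels) for px, py in shots]
print(bands)
```

Drawing a contour line at each value in `levels` would then split the field into nested areas, with the innermost area covering the densest tenth of shots.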

Limitations

Hyperparameter tuning. Histograms can be rendered differently depending on the bin width, which specifies how big each bin will be. For example, in our first histogram, we used a bin width of 5, so the first bin holds all data points valued from 7.5 to 12.5. Smaller bin widths may make it difficult to see the overall trend, while larger bin widths may wash out features of the data. In kernel density plots, we must tune a bandwidth hyperparameter that works similarly to bin widths in histograms.
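The effect of these two hyperparameters can be sketched as follows. The Python below bins the same values with two different bin widths, then computes a starting bandwidth using Silverman's rule of thumb, a common default that was not necessarily used for the plots in this post. The function names and the shots-per-game values are hypothetical.

```python
import math
import statistics

def histogram_counts(data, bin_width, first_edge):
    """Count values in bins [edge, edge + bin_width), with edges starting at first_edge."""
    counts = {}
    for value in data:
        index = math.floor((value - first_edge) / bin_width)
        left = first_edge + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * len(data) ** (-0.2)

# Hypothetical shots-per-game counts (made up; not the Wyscout data).
shots_per_game = [9, 12, 14, 15, 15, 17, 18, 18, 20, 23, 27]

# Bin width 1 scatters the values over many near-empty bins;
# bin width 5 (first bin from 7.5 to 12.5) gives a coarser summary.
narrow = histogram_counts(shots_per_game, bin_width=1, first_edge=7.5)
wide = histogram_counts(shots_per_game, bin_width=5, first_edge=7.5)
h = silverman_bandwidth(shots_per_game)
print(wide)
print(round(h, 2))
```

A rule of thumb like this is only a starting point; in practice the bandwidth is usually adjusted by eye, much like the bin width of a histogram.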

Interested in how we can use these results to predict the probability of a goal based on where the shot was taken on the field? That is the subject of part two of this series.

Bonus Graphs

**Manchester City**’s title-winning campaign was powered by key offensive players S. Agüero, R. Sterling, and K. De Bruyne, and the team made a whopping 665 shots in 38 games, scoring 106 goals. This set a record for most goals scored by a team in a single Premier League season, which still stands today.

**Manchester United** ranked 6th in total number of shots made in the season. Nonetheless, their shots tended to be closer to goal (the highest proportion of shots within the six-yard box among top six teams) and tended to convert into goals more often (2nd highest shots-to-goal ratio among top six teams), which drove them to runners-up of the 2017-18 Premier League.

Of **Tottenham Hotspur**’s trifecta, H. Kane was the most influential, making 184 shots compared to H. M. Son (75) and C. Eriksen (97), as represented by the over-sized centre in their offensive plot. H. Kane went on to win the Golden Boot for most goals at the 2018 FIFA World Cup, the most-viewed competition in the world.

On top of **Liverpool**’s powerful offensive capabilities, defender V. van Dijk’s arrival at Liverpool in Jan 2018 for £75m, a world-record fee for a defender, also marked a key point for the club. With his presence as the central-left defender, shots made against Liverpool unsurprisingly came more from the right (as depicted). In the subsequent season, Van Dijk won several Player of the Year awards.

Despite managing 606 shots in the 2017-18 campaign, **Chelsea** scored only 62 of them (winners Manchester City made 665 shots and scored 106 goals). Striker A. Morata was rather ineffective, scoring only 11 goals through the season.

While **Arsenal**’s offense was decent during the 2017-18 season, their defense was woeful – they let in 51 goals (compared to an average of 33 among the top five teams). As depicted in the right plot, opponents made a large number of shots within the penalty box, from both left and right of the goal.

Once again, **Bayern Munich**’s prolific striker R. Lewandowski had an amazing season, scoring 29 goals to lead his team to a record sixth consecutive German title. A high concentration of his shots came from within the penalty area, and he scored almost twice as many goals as the second-placed goalscorer in the league.

L. Messi’s and L. Suarez’s combined 318 shots and 59 goals catapulted **Barcelona** to the top of the Spanish league. They were also defensively strong, limiting many of their opponents’ shots to outside the penalty box.

Star-striker C. Ronaldo averaged 7 shots and 1 goal every 90 minutes, the highest in the Spanish league. While **Real Madrid** did not win the Spanish league, they did win Europe’s premier competition, the Champions League. It was their 4th title in 5 years, a feat that we may never see again.

**Atletico Madrid** is a rather defensive team. They made only 406 shots, scoring 58 goals (in comparison, Barcelona and Real Madrid scored 99 and 94 goals respectively). As depicted in the left plot, their offense did not have a focal point, with center-forward A. Griezmann preferring to roam in different areas of the penalty box. The defensive prowess of F. Luis, Saul, S. Savic, and D. Godin meant that the team shipped only 22 goals in 38 games.

Data Source

Pappalardo et al., (2019) A public data set of spatio-temporal match events in soccer competitions, Nature Scientific Data 6:236, https://www.nature.com/articles/s41597-019-0247-7

We used data from the 2017/2018 season of five national first division football competitions in Europe: England, France, Germany, Italy, and Spain. Over 1,800 matches, 3 million events, and 4,000 players were analyzed.

Matches were first recorded on video. Operators then tagged events (e.g. shots, goals) and their attributes (e.g. position of shot, player who made the shot) using proprietary software.
