Fußball Quantitativ: May 2020

A dynamic zone evaluation model

The embedded presentation uses animated GIFs to show which zones Liverpool FC use before scoring goals. The purpose is to show how Jürgen Klopp's team move the ball into more dangerous zones of the pitch and towards the opponent's goal in order to increase their chance of scoring rapidly. The tracking data for the presentation, covering a total of nineteen goals (nine of which are displayed in the slides), were provided by Friends of Tracking. Before analysing the positional data, however, there was some preparatory work to be done, using event data and some football ideas.

The starting point for my analysis was to divide the football pitch into zones. Several approaches exist, and different clubs use different ones. From a technical standpoint, there was, as so often, a trade-off between precision (using as many zones as possible to make the model as accurate as possible) on the one hand and robustness and feasibility on the other (my device only has so much computing power). In addition, the choice had to make sense from a pure football point of view. After some attempts with rather small zones (one for each square metre), which my laptop could not handle, I decided to distinguish between twenty-five different zones.

Vertically (seen from one goal to the other), the pitch is divided into five strips: the centre, the two wings and the two half-spaces in between. You can see a similar approach used, for instance, by Liverpool's title rivals on their training ground. Horizontally, I also used five layers, but unlike the vertical strips, these are not symmetrical. The pitch is divided into four fourths, and the final fourth in the attacking direction is halved again. This is mainly due to the xG model which I will describe later, because shots from outside the box should not be directly compared with tap-ins inside the six-yard box. All in all, this leaves us with twenty-five zones, which are displayed in the following graph.
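A minimal sketch of how such a zone lookup could work, written in Python for illustration (the original analysis was done in R). The normalised coordinates, the equal strip widths and the names `LAYER_EDGES`, `STRIP_EDGES` and `zone_of` are assumptions; the post only fixes the five-by-five layout and the halved final fourth.

```python
from bisect import bisect_right

# Assumed normalised pitch: x in [0, 1] runs from the own goal line (0) to the
# opponent's goal line (1); y in [0, 1] runs across the pitch.
# Layers follow the text: four fourths, with the last one halved again.
LAYER_EDGES = [0.25, 0.5, 0.75, 0.875]
# Equal-width strips are an assumption; the post does not state the widths.
STRIP_EDGES = [0.2, 0.4, 0.6, 0.8]


def zone_of(x: float, y: float) -> int:
    """Return a zone id in 0..24, computed as layer * 5 + strip."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("coordinates must lie in [0, 1]")
    layer = bisect_right(LAYER_EDGES, x)  # 0 = most defensive, 4 = most attacking
    strip = bisect_right(STRIP_EDGES, y)  # 0 and 4 = wings, 2 = centre
    return layer * 5 + strip


print(zone_of(0.5, 0.5))   # centre spot -> 12
print(zone_of(0.95, 0.5))  # central zone in front of goal -> 22
```

With this numbering, the central strip in the most attacking layer is zone 22, and zones 0 to 9 lie in the team's own half.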

The next step was to assign a numeric value to each of these zones. The values could be assigned manually with some football knowledge and gut instinct, but a slightly more sophisticated model should do a better job. Additionally, I wanted the value to be dynamic, changing as the game evolves and the ball is moved to other zones. Based on the assumption that teams in general try to keep the ball when they have it, score goals themselves and prevent the opponent from scoring, I opted for three parameters to determine the overall danger value of a zone at a given moment:

pass probability (how easily can we get the ball there?)
expected goal value (if we shoot from there, how likely are we to score?)
closeness to the opponent's goal (even if the chance of scoring from there is not very high, it might be a better location to have the ball)

Pass probability

Based on historical data, we can estimate how likely it is that a pass will end within reach of a teammate for each start and end zone of the pitch. The data for this are provided by StatsBomb, who offer free event data including x,y coordinates for a number of competitions. For my model, I only used data from league games (NWSL, FA Women's Super League and La Liga), which left me with 677,277 passes.

As a first step, I looked at the historical completion rates of passes from each zone to every zone (25*25, so 625 possible combinations). Some of these combinations are naturally more likely to happen than others. Shorter passes are far more frequent than longer ones (the six most frequent combinations are passes that end in the same zone they are played from), and passes within the midfield zones (second and third fourth of the pitch) are the most common.
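The counting step described above can be sketched as follows. This is an illustrative Python version, not the code used for the post; the function name `completion_table`, the tuple layout and the toy passes are made up for the example.

```python
from collections import defaultdict


def completion_table(passes):
    """Aggregate passes given as (start_zone, end_zone, completed) tuples.

    Returns {(start, end): (attempts, completion_rate)} for observed pairs only.
    """
    attempts = defaultdict(int)
    completed = defaultdict(int)
    for start, end, ok in passes:
        attempts[(start, end)] += 1
        completed[(start, end)] += int(ok)
    return {pair: (n, completed[pair] / n) for pair, n in attempts.items()}


# Made-up toy passes, not StatsBomb data.
toy = [(12, 12, True), (12, 12, True), (12, 12, False), (12, 13, True), (2, 22, False)]
table = completion_table(toy)
print(table[(12, 12)])     # three attempts, two of them completed
print((0, 24) in table)    # False: this combination never occurred
```

Zone pairs that never occur are simply absent from the table, and pairs with a single attempt get a rate of exactly 0 or 1, which is why the raw rates alone are not enough.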

Some combinations, however, did not occur at all. As one could expect from a football point of view, teams don't tend to play passes across the pitch from one wing to the other, towards their own goal. Passes from the most defensive fourth to the most offensive one are also relatively infrequent in the dataset (maybe not so surprising, given that a lot of the games in it are actually FC Barcelona matches). In order to overcome the scarcity of some combinations of interzonal passes, I also fitted a very basic completion model (a logistic regression) in which pass length is the only independent predictor. I use both the historical distributions and the model to predict pass success from one zone to another, giving the model more weight the fewer observations there are in the dataset. With this prediction we can estimate the probability of a successful pass from any point of the pitch to any zone. A pass from the centre spot (not necessarily a kick-off, since the game situation is not included in the model) would, for instance, have the following chances of being completed (a lighter shade of blue indicates a higher chance of a successful pass). Please note that information concerning the positions of the players is not directly included in the model, only historic events.

We can see that keeping the ball in the same zone or playing it to a neighbouring one would be a good choice if we want to keep possession; a move out towards the wing might also be a possibility. Nothing really surprising, so model and viz are working.
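A hedged sketch of how the historic rate and a length-only logistic model could be combined. The coefficients `INTERCEPT` and `SLOPE_PER_METRE`, the weighting scheme n / (n + k) and the constant `k` are all assumptions for illustration; the post only says that the model gets more weight when fewer passes were observed.

```python
import math

# Hypothetical coefficients for illustration only; the fitted values of the
# logistic regression used in the post are not given.
INTERCEPT = 2.0
SLOPE_PER_METRE = -0.05


def model_completion(pass_length_m: float) -> float:
    """Logistic model with pass length as the only predictor."""
    linear = INTERCEPT + SLOPE_PER_METRE * pass_length_m
    return 1.0 / (1.0 + math.exp(-linear))


def blended_completion(n_obs: int, hist_rate: float, pass_length_m: float,
                       k: float = 50.0) -> float:
    """Blend the historic rate and the model prediction.

    The historic rate gets weight n / (n + k), the model gets k / (n + k).
    Both the scheme and k are assumptions, not the author's exact method.
    """
    weight_hist = n_obs / (n_obs + k)
    return weight_hist * hist_rate + (1.0 - weight_hist) * model_completion(pass_length_m)


# A zone pair that was never observed falls back entirely on the model.
print(blended_completion(0, 0.0, 40.0))
# A frequently observed pair is dominated by its historic completion rate.
print(blended_completion(5000, 0.9, 40.0))
```

The appeal of this kind of weighting is that it needs no special case for unobserved combinations: with zero observations the historic rate simply drops out.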

Expected goal value

Next up are shots and expected goals. The basic idea was the same as with passes. I looked at the shots in my dataset (n = 17,331) and determined, for each zone, the share of shots that went in. The results were again rather unsurprising, with the zone right in front of the opponent's goal being the most dangerous one (~24% chance of scoring) and also the one with the most shots. Yet there are again difficulties with relying only on historic averages. In my case, there are some zones from which very few shots are taken, which leaves more room for random variation (just as in the case of the pass zone combinations). For instance, from the right wing (highest zone), there are 34 shots in the dataset, with five of them ending in the back of the net (~15%). In the equivalent zone on the left side, things are quite different: the same number of shots, but only two goals (~6%). This could have something to do with set pieces (maybe free kicks from the right taken directly by left-footed players are more dangerous), but I would rather suggest that this phenomenon is not systematic.

Fortunately for me, StatsBomb also includes an xG value for each of their shots, which allows me to overcome the small-sample problem by calculating the average expected goal value for every shot zone. Doing this, we can see that shots from the right wing are actually not so different from those from the left, so the different scoring rates are probably an anomaly. Shots from the central danger zone, on the other hand, are scored almost exactly as often as expected, showing that the more shots there are from a zone, the more trustworthy its historic scoring rate becomes.

As in the case of the passing model, I assigned the expected goal value for the zones based on the historic rates and the results of the model (in this case, the one by StatsBomb), giving more weight to the historic rates the more shots there are from a zone. Expected goal values are not assigned to zones in a team's own half, assuming that shooting from there is basically useless. The average expected goal value is displayed in the following graph, with brighter colour shades again indicating higher values.
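The wing example from above can be worked through in a short Python sketch. The shot and goal counts (34 shots, five and two goals) come from the text; the mean xG of 0.10 per shot, the constant `k` and the function name `zone_xg_value` are hypothetical, as is the exact weighting formula.

```python
def zone_xg_value(n_shots: int, n_goals: int, mean_xg: float,
                  in_own_half: bool = False, k: float = 100.0) -> float:
    """Blend a zone's historic scoring rate with its mean per-shot xG.

    Zones in the own half, and zones without any shots, get no xG value (0.0).
    The weight n / (n + k) on the historic rate is an assumption.
    """
    if in_own_half or n_shots == 0:
        return 0.0
    hist_rate = n_goals / n_shots
    weight_hist = n_shots / (n_shots + k)
    return weight_hist * hist_rate + (1.0 - weight_hist) * mean_xg


# Counts from the text; the mean xG of 0.10 is a made-up placeholder.
right = zone_xg_value(34, 5, mean_xg=0.10)
left = zone_xg_value(34, 2, mean_xg=0.10)

print(round(5 / 34, 2), round(2 / 34, 2))  # raw scoring rates: 0.15 0.06
print(right - left < 5 / 34 - 2 / 34)      # True: the blended gap is smaller
```

With only 34 shots per zone, most of the weight goes to the mean xG, so the two wings end up much closer together than their raw scoring rates suggest.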

Distance to goal

Given that not all zones receive an xG value, there is still a need to capture the fact that some zones are more likely to produce dangerous attacks than others. Therefore, the final factor used to assess the danger of a zone is simply its distance to the opponent's goal (measured from the midpoint of the zone), compared with the distance of the ball's current position.

With these three measures, we are able to calculate an index by simply adding up their values. In order to be dynamic, the index has to take the position of the ball into account, hence the danger for each zone is calculated for each frame in the tracking dataset. In the case of xG and distance, I did not use absolute values but rather the ratio of the zone in question compared to the position of the ball. We therefore have an index which ranges (at least theoretically) from 0 (no chance of getting the ball there, no point in shooting from there, and the zone is much further away from the opposition goal) to 3 (100% pass completion, a sure goal if a shot happens, and the zone is much closer to the opposition goal). Of course, in reality these extreme values will not occur.
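One way such an index could be computed for a single zone and frame is sketched below in Python. The post does not spell out the exact ratio, so the form value / (value + reference) is an assumption, chosen because it keeps each term between 0 and 1; the names `relative_share` and `danger_index` are likewise made up.

```python
def relative_share(value: float, reference: float, if_both_zero: float = 0.0) -> float:
    """Map two non-negative numbers to [0, 1] as value / (value + reference)."""
    total = value + reference
    return if_both_zero if total == 0 else value / total


def danger_index(pass_prob: float, zone_xg: float, ball_xg: float,
                 zone_dist: float, ball_dist: float) -> float:
    """Sum of three terms, each in [0, 1], so the index lies in [0, 3].

    The ratio form is an assumption, not necessarily the formula used in the post.
    """
    xg_term = relative_share(zone_xg, ball_xg)
    # A smaller distance is better, so the ball's distance is the numerator:
    # a zone much closer to goal than the ball scores close to 1.
    dist_term = relative_share(ball_dist, zone_dist, if_both_zero=0.5)
    return pass_prob + xg_term + dist_term


# Theoretical extremes described in the text.
print(danger_index(1.0, 1.0, 0.0, 0.0, 50.0))  # 3.0
print(danger_index(0.0, 0.0, 0.3, 50.0, 0.0))  # 0.0
```

To make the value dynamic, this function would be evaluated for all twenty-five zones in every tracking frame, with the ball's xG and distance taken from the zone the ball is currently in.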

Visualization

There were some minor issues to be solved in order for the visualization to work. Unlike in the videos, the attacking direction is always from left to right, given that the model itself is not symmetrical. The ball is always black and a bit smaller than the players, who are drawn in the colour of the bgcolor variable. The gganimate package in RStudio did not allow me to make the colour of the zones slightly transparent, hence all the green of the pitch is lost in the GIFs. The dynamic change of colour does, however, work for each frame, even while the ball is being passed. Given that in these instances no decision can be made as to where to move the ball next, a next step would be to calculate danger only in moments when there is actually a decision to be taken and to fall back on a default value otherwise.

As a general result, it is striking how Liverpool almost always try to move the ball immediately to the closest more dangerous area, either by playing passes or with the ball at their feet. Only seldom do they play the ball out wide, at least in the videos I had data for. The pass to the wing is usually only played against set defences, and most of the time the ball is immediately crossed back into the centre, with many goals stemming from almost completely free shots from somewhere between the six-yard box and the penalty spot. Although they are not necessarily known for their passing game and some technical errors are inevitable when playing such a high-tempo style, their passes behind the opponent's last line are often perfectly timed, giving the receiver the one moment he needs to either score at once or go around the keeper and finish the play.

For the presentation above, I had to cherry-pick some of the goals, since it was limited to five slides. I therefore only used nine of the nineteen goals in the dataset for the slides. These goals are divided broadly into three sub-categories, based on watching the videos and my eye test. A more robust way might be to classify them using some form of clustering (based on factors like speed of attack, number of players involved, passes played, usage of central or outside channels, etc.) or even ML procedures. Finally, including tracking data and concepts such as pitch control might help to evaluate the danger of different zones on the pitch even better.

To close this entry, I add the one goal which I personally liked best, the 1:0 against Wolves in the last game of the 2018/19 season, scored by Sadio Mané. The video can be found here.

The code for my analysis as well as the GIFs for all goals can be found on GitHub.

Sunday, 10 May 2020

Liverpool F.C. Goal Analysis
