Fußball Quantitativ

Monday, 21 December 2020

Do supporters in the grounds affect results? Some evidence from Austria.

Remember when we (myself included) declared the end of home-ground advantage after football restarted? Well, not so fast. At least in the Austrian Bundesliga, things are pretty much back to normal, although we are back to closed gates after some games with fans in September and October.

We can see that away teams were actually slightly more successful even before the pandemic halted football for almost three months. They obtained around 0.1 points per game more than home teams in the 22 games before lockdown. This difference skyrocketed after restart to almost 0.5, which is generally why home-ground advantage was supposed to be at least severely damaged. Things have however changed since the start of the current season. Home teams obtain around one point more every five games than their opponents. Roughly the same is true for the number of goals scored.

Interestingly, the pattern is almost identical for home and away teams if we look only at goals scored in the first half: a massive drop after restart and comparable levels in the 2020-21 games to the ones with full attendance. This also means that away teams completely outperformed home teams after the break in the ten matchdays between restart and the end of the season. Whether this is due to the absence of supporters and psychological effects thereof or just some random variation is unclear.

Underlying statistics show a similar, although less stark, picture. Home teams suffered a significant drop in performance levels after football restarted, both in terms of shots and expected goals. Games were almost completely level between home and away teams; hence the absence of home-ground advantage can be confirmed for this period. Interestingly, away teams also suffered a drop in expected goals after restart, yet it was much smaller than the one for home teams.

Since the start of the current season, things are however back to comparable levels to before the pandemic. Both home and away teams actually create around 0.1 xG per game more than before football was halted, hence the difference between them is basically the same. The question why all teams create more or better chances now than at the same time last year would be an interesting question for a separate thread or blog entry. In line with what we saw when looking at actual goals scored, analyzing expected goals shows that away teams massively outperformed their opponents especially after half-time.

One decisive reason why away teams performed substantially better after restart was discipline. Referees seemed to punish them much less than before Covid, both in absolute terms and in comparison to home teams.

This pattern was somewhat reversed after the start of the 2020/21 season. Not so much in terms of penalties, where away teams are still treated preferentially, although the difference between home and away teams is minimal at the moment. They still get around one penalty more every eight games; this difference was around three times higher in the final games of last season. Both get more penalties now than last season; the pressure on referees to award penalties is, at least in this case, inversely proportional to the number of fans in the stands.

Likewise, the conversion rate of penalties is much higher this season than last and also way above the long-term average of around 0.8. More than nine in ten penalties are converted this season, and the difference between home and away teams is again small. As in the case of penalties awarded, home teams suffered greatly after restart and are now at higher levels than they were before the pandemic.

Other disciplinary topics show more diverse patterns. Offside calls have been on the rise ever since, something I would not directly link to the number of spectators in stadia. These are (despite some confusing VAR cases) obviously more objective, binary decisions than fouls, cards or penalties.

Fouls and personal punishments were in my opinion the main drivers of away teams' improved results after lockdown. In all three categories (fouls, yellow cards, send-offs), home teams were treated worse than their opponents in the ten rounds after football restarted, unlike before Covid. This pattern has since been reversed, except for players sent off (both straight reds and second yellows), where home teams are still worse off. In all cases, the difference is however smaller than it was when fans were still allowed without restrictions, which would indicate that having full grounds does influence refereeing decisions.

So, do supporters influence results? To answer this question, the current scenario offers us something close to a natural experiment. Until matchday 22 of the past season, they were allowed without restrictions. After that, games were played completely behind closed doors until the end of the season. Then in summer, there was an initial limit of up to 10,000 visitors, which was however only in place for matchday 1 due to growing infection numbers. This limit was subsequently lowered to 3,000 and then 1,500, before the league had to return to closed gates. A couple of games were played under each scenario, although the sample size is limited, so conclusions should be taken with a grain of salt.

We can see that the points average for home and away teams was relatively stable until a few games before football was stopped, with away teams outperforming their counterparts for a long time early in the season. They then started to perform worse, but immediately after restart rose again to previously unseen levels and maintained them for the rest of the season. With the return of fans, their performance levels dropped again and reached a low point a few games into the 3,000-supporter period. They then started to rise again and have lately reached parity with home teams, as one would expect if the presence of fans influences results for the teams they support. The evidence is not really robust, but we can definitely show that away teams performed better with lower attendance figures or no fans at all in the stadia.

Sunday, 21 June 2020

Just a results crisis?

From almost two and a half points per game down to less than one, fewer goals scored and almost double the number of goals conceded: former leaders LASK have clearly not made a good start to post-Covid football. Given that points were halved after the regular season and they had six points deducted from their tally due to misbehaviour during lockdown, they went from first in the table, six points ahead of Salzburg, to third position, twelve points behind. One can easily conclude that the club is in crisis, despite winning their first game after the Corona break last Wednesday, against arguably the weakest team in the upper play-off group.

The question is whether this downward trend in results is due to some unlucky factors or actually caused by worse performances. They are in any case still the second-best team in the league in terms of expected goals scored and conceded even after the league restarted, a position they held also before Corona. Furthermore, they still have the realistic chance to finish second (which means the chance to qualify for Champions League football next year). On the other hand, there are some worrying trends in the underlying numbers as well.

Table 1 shows the change in expected goals for and against as well as the difference between the two for all clubs in the champions group (a.k.a. upper play-off) in the Austrian Bundesliga before and after football's lockdown. Before this break, the twelve teams of the league played each other twice in the regular season. In order to make numbers comparable, I only included those from games against direct opponents (those which made it into the upper play-off) from games before the break.

We can see that the former league leaders are the team with the second highest drop in xG created. They score almost half an expected goal per game less, which in turn can explain why their goals tally went down by almost the same amount.

Their increased defensive vulnerability is, however, not backed up by the numbers. They remain almost unchanged in comparison. The only problem there is that relatively, they became worse, given that almost all the other teams (with the exception of Wolfsberg, obviously an interesting case in itself) improved their defensive performance much more.

Team	Expected Goals for	Expected Goals against	Expected Goals difference
Salzburg	-0.23	-0.77	0.54
Rapid	-0.23	-0.70	0.47
LASK	-0.41	-0.01	-0.40
Wolfsberg	-0.88	0.36	-1.24
Hartberg	-0.03	-0.67	0.64
Sturm	-0.05	-0.04	-0.01

Table 1: Difference between xG-values per game between regular season and play-off. Only games against direct opponents.

This relative decline is mirrored in the change of xG-difference, where they are again the second worst team of the upper play-off group. We can therefore conclude that they clearly got worse in attack and did not improve enough defensively to make up for it.

The next question to look at would be to gauge whether the decline is due to structural reasons or related to more individual ones. On a structural level, it is hard to identify any significant differences. One of their most important strengths during the season so far, set pieces, works almost as well as before. They create chances from dead balls as well as ever, but concede some more. This can however not explain their drop in expected goals created. Likewise, chances created from counters and crosses have not changed notably.

This is where the individual level and one of the most popular excuses in football come in, i.e. injuries. They lost two of their key players, Marvin Potzmann and Thomas Goiginger, to ACL injuries within one week at the beginning of March. Potzmann had accounted for a combined 0.41 of xG and xA per 90 minutes played and Goiginger for 0.67, the latter being their second most productive offensive player in this regard during the regular season.

The gap left by them could not be filled by the remaining players. Out of players with at least two xG or xA during regular season, only one improved his numbers considerably after restart, central defender Gernot Trauner. Direct replacements for Goiginger such as Dominik Frieser and Samuel Tetteh saw their numbers decline. In the case of Potzmann, younger players occupying his position such as Andrés Andrade and David Schnegg are not yet up to meeting demands and produce offensive output. The only one getting near to him, René Renner, had lower numbers even before lockdown (especially assisting his teammates less than Potzmann), plus declined slightly after restart.

The crisis that LASK are going through seems to be real, and down to losing two key players who could be replaced neither internally nor externally, given that the transfer window is shut. Not the best prospects for the remainder of the season.

Tuesday, 2 June 2020

How could additional substitutions affect the title race?

Like their German counterparts, the clubs of the highest Austrian league have decided to make use of FIFA's decision to allow two additional substitutions for the remainder of the season. While it has been argued that this rule change will especially benefit the bigger teams due to their stronger squads, we can use historic data (i.e. the season before lockdown) to deduce which effects the new rule can have, especially on the most important decisions (league winner, European starters and relegation). I also propose a new measurement of squad rotation, which will be affected by the rule change.

A first look at the data shows that most of the teams in the league are quite happy to use the substitutions they are allowed to make (see Graph 1); three of the twelve clubs of Austria's highest league used all of their 66 (22 league games so far). Another seven teams missed very few possibilities to make a change during their games. Only one team (Altach) is more reluctant to make changes within games, rejecting one substitution almost every other game.

Graph 1

There is, however, no relationship between the use of substitutions and teams' position in the table. Among the three clubs at the top of the substitution usage table, there are both high-ranked (LASK) and low-ranked (Wattens, St. Pölten) clubs.

Pure usage figures alone don't tell us a lot about how the rule changes might influence clubs' behaviour, although we can suppose that teams which are very reluctant to even use all three substitutions will also use the additional ones more cautiously.

Likewise, the patterns of substitution timing might be affected. Right now, clubs tend to make their first substitution between minutes 50 and 60 (see Graph 2), although there are some significant differences between them. You would suppose that the first, second and third change will come earlier, so the lines on Graph 2 will be lower on the y-axis. Knowing that they have two more opportunities, coaches might even take advantage of tactical changes in the first half more often.

Graph 2

You can also expect the number of substitutions at half-time to go up, due to the restriction to three slots in which teams can take off players during the second half. Only seven percent of substitutions so far were made at the break, with newly promoted Wattens the most active team at this time (nine). At the other end of the scale, substitution-resistant Altach and Sturm have only used this opportunity twice, with all the other teams resorting to it three, four or five times.

On a club level, teams which tend to make their first changes later also tend to make their second and third later. My hypothesis would be that these patterns won't change all that much; rather, we will see more double or even triple subs later on in the game by teams such as Sturm, Austria and Wolfsberg (Admira is a bit more difficult to assess given that they have their third head coach this season already).

So much for the historic data. Beyond these, we are more interested in the way the new rule might affect teams and their playing performance during the rest of the season. We therefore need a way to assess the quality of those players who might benefit from more substitutions, i.e. the players who have not been regular starters and are not regularly subbed on when not in the starting eleven.

One possibility to assess their quality is to simply look at the number of minutes they have played (league only), assuming that higher quality players tend to get more game time. We therefore rank the players of each team by the number of minutes played and compare afterwards those players with the 12th to 14th most minutes (regular subs) to those with the 15th and 16th most. The latter are those who should benefit the most from the upcoming rule change.
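The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the squad minutes are invented, and averaging the minutes within each subgroup is an assumption, since the post does not say whether sums or averages were compared.

```python
from statistics import mean

def bench_minutes(minutes_by_player):
    """Average league minutes of the players ranked 12-14 and 15-16 by minutes played.

    Averaging within each subgroup is an assumption made for this sketch.
    """
    ranked = sorted(minutes_by_player.values(), reverse=True)
    if len(ranked) < 16:
        raise ValueError("need at least 16 players to form both subgroups")
    regular_subs = ranked[11:14]  # 12th to 14th most minutes
    fringe = ranked[14:16]        # 15th and 16th most minutes
    return mean(regular_subs), mean(fringe)

# Hypothetical squad with invented minutes, not real data
minutes = [1980, 1950, 1900, 1850, 1800, 1700, 1650, 1600, 1500, 1400, 1300,
           900, 800, 700, 400, 200, 90, 45]
squad = {f"player_{i}": m for i, m in enumerate(minutes, start=1)}

subs_avg, fringe_avg = bench_minutes(squad)
print(subs_avg, fringe_avg)
```

Plotting the two averages against each other for every club would give a picture along the lines of Graph 3.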

Graph 3

Higher levels of minutes played by either subgroup indicate in theory higher levels of quality, hence the club should benefit more from the new rule. As we can see in Graph 3, there is a mostly linear relationship between both indicators, with the notable exceptions of probably-not title candidate LASK and Hartberg. Both of them rely on a strong core of 14 more or less regular players, but lack depth afterwards. The latter are an interesting case, as they have reached the upper play-off group, which means that they cannot be relegated anymore but might reach the Europa League, which would be a huge success given their stature.

It is no surprise that Wolfsberg were the only club decidedly against the rule change, given their location in Graph 3. They neither had nor will have a strong impact coming from the bench, which is all the more interesting as they also competed at the European level before Christmas and are still in contention for another year of international football.

Leaders Salzburg should definitely be able to exploit the new rules given the large resources at their disposal, giving them an edge in the title race (which also includes third-placed Rapid). At the other end of the table, Mattersburg might have a little advantage and Admira a small drawback in a relegation battle which will be fierce, with the six teams in the lower play-off separated by four points only.

Minutes given to regular substitution players are in any case not the only way to measure squad involvement and depth. One could also analyse the total number of players used and compare teams by the amount of footballers they field during the course of a season, either total or above a certain threshold of minutes. These measurements are fine yet can be improved to get a more detailed picture. I propose therefore a methodology, borrowed from Political Science, which counts not the actual but the effective number of units (in this case, players) in a given system.

Consider the case of two fictitious parliaments. In the first one, Party A has 42% of the seats, Party B has 38% and Party C the remaining 20%. In the second one, there is a majority Party D with 60% of the seats, while the rest of the mandates are distributed among parties E (25%), F (10%) and G (5%). Which one of them has more parties?

The answer is that it depends. Surely, you could simply count parties in each chamber without further consideration of their numerical strength, which would show that there are more parties in the second example (four) than in the first one (three). This is simple, but also misleading. Our hypothetical Party D would in reality be able to govern quite easily without bothering much about the other parties, meanwhile in the first example, no party alone has enough seats to reach a majority. It is therefore more accurate to count the parties not equally, but rather weighting their seats (or vote share, if you are interested more in the electorate than in the parliament) according to their own strength and relative to the strength of the other represented parties.

That is where the idea of the effective (rather than the actual) number of parties comes into play. The concept is described well on Wikipedia, from where I also copied the formula, although I use a slightly different one.

N = \frac{1}{\sum_{i=1}^n p_i^2}

Formula to calculate the effective number of players

It looks more complicated than it actually is. You basically take 1 (if you work with percentages) and divide it by the sum of the squares of the shares. In the case of our two exemplary parliaments, the number of effective parties in the first one would be 2.8 and in the second one 2.3, so the difference is actually reversed in comparison to the simple count. This difference is however much closer to reality than following the naive assumption and counting all the units the same way.
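As a quick check of the two parliaments, the calculation can be written as a short Python sketch. The function name is chosen for illustration only; the seat shares are the ones from the example above.

```python
def effective_number(shares):
    """Effective number of units: 1 divided by the sum of squared shares.

    `shares` are fractions that add up to 1 (e.g. 0.42 for 42% of the seats).
    """
    return 1.0 / sum(p ** 2 for p in shares)

parliament_1 = [0.42, 0.38, 0.20]        # parties A, B, C
parliament_2 = [0.60, 0.25, 0.10, 0.05]  # parties D, E, F, G

print(round(effective_number(parliament_1), 1))  # 2.8
print(round(effective_number(parliament_2), 1))  # 2.3
```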

We can apply this concept to football easily; we simply have to adjust the formula a bit and of course use minutes or starts instead of seats or vote shares. Given that the total minutes played by all the players in a team vary (due to injuries and sendings-off, a team's players do not always play the full 11*90 minutes), we have to make the values comparable. We do this by not dividing 1 by the sum of squares but using the square of the sum as the dividend.

To make this a little clearer, we'll apply this to a very simple example, a team which in a single match makes three substitutions at half time and compare it to a team which plays with the same eleven players during the whole game. The first team would have

(90+90+90+90+90+90+90+90+45+45+45+45+45+45)^2

divided by

(8100+8100+8100+8100+8100+8100+8100+8100+2025+2025+2025+2025+2025+2025)

effective players, i.e. 12.74. The team without substitutions accordingly would have 11 effective players.
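The same arithmetic as a minimal Python sketch, using the square of the sum as the dividend as described above. The function name is illustrative; the minutes are those of the two example teams.

```python
def effective_players(minutes):
    """Square of the summed minutes divided by the sum of squared minutes."""
    total = sum(minutes)
    if total == 0:
        raise ValueError("no minutes played")
    return total ** 2 / sum(m ** 2 for m in minutes)

# Three substitutions at half-time: 8 players with 90 minutes, 6 with 45
with_subs = [90] * 8 + [45] * 6
# Same eleven players for the whole game
no_subs = [90] * 11

print(round(effective_players(with_subs), 2))  # 12.74
print(effective_players(no_subs))              # 11.0
```

Feeding in season totals per player instead of single-match minutes gives the effective number of players for a team; using starts instead of minutes gives the effective number of starters.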

By applying this formula to the sum of minutes played and starts throughout the season so far, we can calculate the effective number of players and starters for each team.

Graph 4

By comparing the actual and effective numbers of players used, we can see how many players teams give minutes to and also how evenly they distribute minutes among them. A quick look at Graph 4 shows that teams cluster to a certain degree. St. Pölten stand out in absolute terms, a strategy not really backed up by results so far (they will start the remaining games from last position). Rapid have given a lot of young players a limited amount of minutes without giving them starts, while the rest of the teams have used between 24 and 28 players, of whom 21 to 27 have been starters.

The picture clearly changes when we look at the effective numbers of players used and started. St. Pölten still lead the way in terms of the number of players used, but Salzburg have more effective starters, indicating the highest level of rotation in the league. This might be a decisive advantage over title rivals LASK, who used the same number of players and just one starter fewer in absolute terms, but whose substitutes received far fewer minutes among them.

In the middle, there are a lot of teams in a quite congested area of the graph. Although I put the team names with an angle, there was no way to prevent this overplotting, because teams are quite close to each other, hence there are no big differences to expect for the rest of the season. Interestingly, the closely packed teams are all from the lower play-offs, with Austria Vienna having a slightly deeper and more balanced squad than their direct rivals. This might favour them to finish first in the lower play-off, a position that would give them the chance to qualify for Europe in spite of a rather underwhelming campaign so far.

At the end of the scale, there are those three teams which also used their players 15 and 16 the least (Graph 3). Wolfsberg, Hartberg and Sturm will battle for European qualification, but based on their squad depth and quality will not be able to join the title race, even with the point difference halved.

Sunday, 10 May 2020

Liverpool F.C. Goal Analysis

A dynamic zone evaluation model

The embedded presentation shows the usage of zones by Liverpool FC before scoring goals in the form of animated gifs. The purpose is to show how Jürgen Klopp's team move the ball into more dangerous zones of the pitch and towards the opponent's goal in order to increase their chance of scoring rapidly. Tracking data for the presentation with a total of nineteen goals (nine displayed in the slides) were provided by Friends of Tracking. Before analysing the positional data, there was however some preparatory work to be done, using event data and some football ideas.

The starting point for my analysis was to divide the football pitch into zones. There are several approaches out there, and various ones are used at different clubs. From a technical standpoint, there was, as so often, a trade-off between precision (using as many zones as possible to make the model as accurate as possible) on the one hand and robustness and feasibility on the other (my device only has so much computing power). Plus, the choice also had to make sense from a pure football point of view. After some attempts with rather small zones (one for each square metre) which my laptop could not handle, I decided to distinguish between twenty-five different zones.

Vertically (seen from one goal to the other), the pitch is divided into five strips; the centre, the wings and the half-spaces in between. You can see a similar approach used for instance by Liverpool's title rivals on their training ground. Horizontally, I also used five layers, but unlike vertically, these are not symmetrical. The pitch is divided into four fourths, but the final fourth in the attacking direction is further halved. This is mainly due to the xG-model which I will describe later, because shots from outside the box should not be directly compared with tap-ins inside the six-yard box. All in all, this leaves us with twenty-five zones, which are displayed in the following graph.
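A minimal sketch of how a ball position could be mapped to one of the twenty-five zones. The layer boundaries follow the description above (quarters, with the final quarter halved). The pitch dimensions of 105 x 68 metres and the equal width of the five strips are assumptions of this sketch, since the post does not give the exact strip widths.

```python
def zone_of(x, y, length=105.0, width=68.0):
    """Return (layer, strip), each from 0 to 4, for a position on the pitch.

    x runs along the attacking direction (0 = own goal line),
    y runs across the pitch (0 = one touchline).
    Pitch size and equal strip widths are ASSUMED for illustration.
    """
    # Layers: four quarters, with the attacking quarter split in half
    layer_bounds = [0.25, 0.50, 0.75, 0.875]
    fx = min(max(x / length, 0.0), 1.0)
    layer = sum(fx >= b for b in layer_bounds)

    # Strips: wing, half-space, centre, half-space, wing (assumed equal fifths)
    fy = min(max(y / width, 0.0), 1.0)
    strip = min(int(fy * 5), 4)
    return layer, strip

def zone_id(x, y, length=105.0, width=68.0):
    """Single index from 0 to 24 for the zone containing (x, y)."""
    layer, strip = zone_of(x, y, length, width)
    return layer * 5 + strip

print(zone_of(52.5, 34.0))   # centre spot
print(zone_id(100.0, 34.0))  # central zone right in front of goal
```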

The next step was to put a numeric value to each of these zones. These could be assigned manually with some football knowledge and some gut instincts, but a little more sophisticated model should do a better job. Additionally, I also wanted the value to be dynamic, changing as the game evolves and the ball is moved to other zones. Based on the assumption that teams in general try to keep the ball when they have it, score goals themselves and prevent the opponent from scoring, I opted for three parameters to assign the overall danger value of a zone in a given moment:

pass probability (how easily can we get the ball there?)
expected goal value (if we shoot from there, how likely are we to score?)
closeness to opponent goal (even if the chance of scoring from there is not very high, it might be a better location to have the ball)

Pass probability

Based on historical data, we can estimate how likely a pass is to end within the reach of a teammate for each start and end zone of the pitch. The data for this are provided by StatsBomb, who offer free event data including x,y coordinates for a number of competitions. For my model, I only used data from league games (NWSL, FA Women's Super League and La Liga), which left me with 677,277 passes.

As a first step, I looked at the historical completion rates of passes from each zone to every other zone (25*25, so 625 possible combinations). Some of these combinations are naturally more likely to happen than others. Furthermore, shorter passes are far more frequent than longer ones (the six most frequent combinations are passes that end in the same zone they are played from) and those within the midfield zone (second and third fourth of the pitch) are most common.

Some combinations, however, did not occur at all. As one could expect from a football point of view, teams don't tend to play passes across the pitch from one wing to the other, towards their own goal. Passes from the most defensive fourth to the most offensive one are also relatively infrequent in the dataset (maybe not so surprising, given that a lot of games in it are actually FC Barcelona matches). In order to overcome the scarcity of some combinations of interzonal passes, I also calculated a very basic completion model (logistic regression), in which I only used pass length as an independent predictor. I use both the historical distributions and the model to predict pass success from one zone to the other, giving the model more weight the fewer observations there are in the dataset. With this prediction we can estimate the probability of a successful pass from any point of the pitch to any zone. A pass from the centre point (not necessarily a kick-off, since game situation is not included in the model) would for instance have the following chances of being completed (a lighter shade of blue indicates a higher chance of a successful pass). Please note that information concerning player dispersion is not directly included in the model, only historic events.

We can see that keeping the ball in the same zone or playing it to a neighbouring one would be a good choice if we want to keep possession; a move out towards the wing might also be a possibility. Nothing really surprising, so model and viz are working.
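One way the combination of historical rates and model could look is sketched below. The post does not state the exact weighting scheme, so the weight n / (n + k) and the constant k are assumptions, and the logistic coefficients are hypothetical placeholders rather than fitted values.

```python
import math

def model_completion(pass_length, intercept=2.0, slope=-0.05):
    """Logistic curve with pass length as the only predictor.

    The coefficients are HYPOTHETICAL placeholders, not the fitted ones.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * pass_length)))

def blended_completion(completed, attempted, pass_length, k=30.0):
    """Mix the historical rate of a zone pair with the model prediction.

    The historical rate gets weight n / (n + k), so zone pairs with few
    observations lean on the model. This scheme and k are ASSUMED.
    """
    prior = model_completion(pass_length)
    if attempted == 0:
        return prior  # combination never observed: model only
    weight = attempted / (attempted + k)
    return weight * (completed / attempted) + (1.0 - weight) * prior

# Frequent short pass: the result stays close to the historical rate
print(blended_completion(9000, 10000, pass_length=10.0))
# Rare long pass: the result stays close to the model
print(blended_completion(2, 2, pass_length=40.0))
```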

Expected goal value

Next, shots and expected goals. The basic idea was the same as with passes. I looked at the shots in my dataset (n=17,331) and determined the share of those going in, based on the zone they came from. The results were again rather unsurprising, with the zone right in front of the opponent's goal being the most dangerous one (~24% chance of scoring) and also the one with the most shots. Yet, there are again difficulties in using only historic averages. In my case, there are some zones from which very few shots are fired, which leaves more space for random variation (just as in the case of the pass zone combinations). For instance, from the right wing (highest zone), there are 34 shots in the dataset, with five of them ending in the back of the net (~15%). In the equivalent zone on the left side, things are quite different: same number of shots, only two goals (~6%). This could have something to do with set pieces (maybe free kicks from the right taken directly by left-footed players are more dangerous), but I would rather suggest this phenomenon is not systematic.

Fortunately for me, StatsBomb also includes an xG value for each of their shots, which allows me to overcome the small subsample size problem by calculating the average expected goal value for every shot zone. Doing this, we can see that shots from the right wing are actually not so different from those from the left, so the different scoring rates are probably an anomaly. Shots from the central danger zone are on the other hand scored almost exactly as expected, showing that the more shots there are from one zone, the more trustworthy are also historic scoring rates.

As in the case of the passing model, I assigned the expected goal value for the zones based on the historic rates and the results of the model (in this case, the one by StatsBomb), giving more weight to more frequent shots. Expected goal values are not assigned to zones in the own half, assuming that shooting from there is basically useless. The average expected goal value is displayed in the following graph, with brighter colour shades again indicating higher values.
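To illustrate why mixing the two sources helps in sparse zones, here is a small sketch using the two wing zones mentioned earlier (34 shots each, five and two goals). The weighting scheme, the constant k and the mean model xG of 0.08 are assumptions made for illustration; the post does not report them.

```python
def zone_xg(goals, shots, mean_model_xg, k=50.0):
    """Blend a zone's historic scoring rate with its average model xG.

    Zones with more shots trust their own scoring rate more.
    The weight shots / (shots + k) and k itself are ASSUMED.
    """
    if shots == 0:
        return mean_model_xg
    weight = shots / (shots + k)
    return weight * (goals / shots) + (1.0 - weight) * mean_model_xg

# Shot and goal counts from the text; 0.08 is a HYPOTHETICAL mean xG
right_wing = zone_xg(goals=5, shots=34, mean_model_xg=0.08)
left_wing = zone_xg(goals=2, shots=34, mean_model_xg=0.08)

raw_gap = 5 / 34 - 2 / 34
blended_gap = right_wing - left_wing
print(round(raw_gap, 3), round(blended_gap, 3))
```

Under these assumptions the gap between the two wings shrinks considerably, which is the intended effect when the raw difference is most likely noise.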

Distance to goal

Given that not all zones receive an xG-value, there is still the need to capture the fact that some are more prone to create dangerous attacks than others. Therefore, the final factor to assess the danger of a zone is simply its distance to goal (measured from the midpoint of the zone), set against the distance to goal of the ball's current position.

With these three measures, we are able to calculate an index by simply adding the values of each of them. In order to be dynamic, the position of the ball has to be taken into account, hence the danger for each zone is calculated for each frame in the tracking data set. In the case of xG and distance, I did not use absolute values but rather the ratio of the zone in question, compared to the position of the ball. We have therefore an index which ranges (at least theoretically) from 0 (no chance of getting the ball there, no point in shooting from there, and the point is much further away from the opposition goal) to 3 (100% pass completion, a sure goal if a shot happens, and it is much closer to the opposition goal). Of course, in reality these extreme values will not occur.
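A sketch of how the three components might be added up for one zone in one frame. The post only says that xG and distance enter as ratios relative to the ball position; the concrete form used here, value / (value + reference), is an assumption chosen because it keeps each component between 0 and 1.

```python
def ratio_share(zone_value, ball_value):
    """Compare a zone with the ball position on a 0-1 scale (0.5 = equal).

    This particular form of the ratio is ASSUMED, not taken from the post.
    """
    total = zone_value + ball_value
    return 0.5 if total == 0 else zone_value / total

def danger_index(pass_prob, xg_zone, xg_ball, dist_zone, dist_ball):
    """Sum of pass probability, relative xG and relative closeness to goal."""
    xg_term = ratio_share(xg_zone, xg_ball)
    # Closer to goal is better, so the ball's distance is the "zone value" here
    dist_term = ratio_share(dist_ball, dist_zone)
    return pass_prob + xg_term + dist_term

# Zone that is as good as the ball position in xG and distance
print(danger_index(0.8, 0.1, 0.1, 30.0, 30.0))
# Theoretical extremes of the index
print(danger_index(1.0, 0.3, 0.0, 0.0, 50.0))
print(danger_index(0.0, 0.0, 0.2, 80.0, 0.0))
```

Evaluating this function for all twenty-five zones in every frame of the tracking data would give the values that drive the colouring of the zones.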

Visualization

There were some minor issues to be solved in order for the visualization to work. Unlike in the videos, the attacking direction is always from left to right, given that the model itself is not symmetrical. The ball is always black and a bit smaller than the players, which are depicted in the colour of the bgcolor variable. The gganimate package in RStudio did not allow me to make the colour of the zones slightly transparent, hence all the green of the pitch is lost in the gifs. The dynamic change of colour does however work for each frame, even while the ball is being passed. Given that in these instances no decision can be made as to where to move the ball next, a next step would be to calculate danger only in moments when there is actually a decision to be taken, and to return a default value otherwise.

As a general result, it is striking how Liverpool almost always try to move the ball immediately to the closest more dangerous area, either by playing passes or with the ball at their feet. Only seldom do they play the ball out wide, at least in the videos I had data for. The pass to the wing is usually only played against set defences, and most of the time the ball is immediately crossed back into the centre, with many goals stemming from almost completely free shots from somewhere between the six-yard box and the penalty spot. Although they are not necessarily known for their passing game and some technical errors are inevitable when playing such a high-tempo style, their passes behind the opponent's last line are often perfectly timed, giving the receiver the one moment he needs to either score at once or go around the keeper and finish the play.

For the presentation above, I had to cherry-pick some of the goals, since it was reduced to five slides only. I therefore only used nine of the nineteen goals in the dataset for the slides. These goals are divided broadly into three sub-categories, based on watching the videos and my eye test. A more robust way might be to classify them using some form of clustering (based on factors like speed of attacks, number of players involved, passes played, usage of central or outside channels, etc.) or even ML procedures. Finally, including tracking data and concepts such as pitch control might help to evaluate the danger of different zones on the pitch even better.

To close this entry, I add the one goal which I personally liked best, the 1:0 against Wolves in the last game of the 2018/19 season, scored by Sadio Mané. The video can be found here.

The code for my analysis as well as the gifs for all goals can be found on GitHub.