Statistical Analysis Q&A

Shwafta

Alright, so I figured I'm not the only statistical layman who has questions about how the statistical analysis done by some people on this forum (i.e. dummyrun and the rest of y'all) is actually calculated, or who just doesn't understand how/why it works. Therefore, a thread dedicated to Statistical Analysis-related things feels appropriate.

For example, let me start off with a question I've had for a while now:
There is a lot of talk using xGF and xGA, but what exactly determines how "likely" a chance is? We saw in an earlier post (I forget by whom) that xGF/xGA differ depending on who calculates them, so what is the actual determiner? And when determining these, do they also take into account "unluckiness", or how well prepared the other team's gk/defense is for a shot of that type?
For example: Take a shot like that awful Wallace miss from earlier this year. Does the calculation take into account that 1) Wallace might flub it by missing/skying the ball; 2) the goalkeeper might be standing directly in front of the ball; etc.?
 
Here are three good places to start on understanding ASA's expected goals model:

These are kind of dense but should answer your questions if you spend a little time with them. In short: (1) Yes. (2) Sort of (this is what the keeper model is for). Good topic for happy hour.
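To make the "what determines how likely a chance is" question concrete, here is a toy sketch. It is not ASA's actual model. Many public xG models are logistic regressions on shot features such as distance and angle to goal, and that is all this shows. The function name `toy_xg` is hypothetical and the coefficients are invented purely for illustration; a real model fits them to thousands of past shots, so the times a shooter skies it are already averaged into the probability.

```python
import math

GOAL_WIDTH_M = 7.32  # distance between the posts

# Made-up coefficients for illustration only; a real model fits these to shot data.
INTERCEPT = -1.0
COEF_DISTANCE = -0.10  # per metre: farther shots score less often
COEF_ANGLE = 1.5       # per radian: a wider view of the goal scores more often


def toy_xg(x, y):
    """Toy scoring probability for a shot.

    x: metres out from the goal line, y: metres sideways from the centre of goal.
    """
    half = GOAL_WIDTH_M / 2
    distance = math.hypot(x, y)
    # Angle between the lines from the ball to each post.
    angle = math.atan2(GOAL_WIDTH_M * x, x * x + y * y - half * half)
    z = INTERCEPT + COEF_DISTANCE * distance + COEF_ANGLE * angle
    return 1 / (1 + math.exp(-z))  # logistic function keeps the result in (0, 1)


penalty_spot = toy_xg(11.0, 0.0)
long_range = toy_xg(25.0, 0.0)
wide_angle = toy_xg(3.0, 10.58)  # about as far as the penalty spot, but from out wide
print(penalty_spot, long_range, wide_angle)
```

Note that nothing in this sketch knows where the keeper actually is. Any adjustment for the keeper would have to come from extra inputs or a separate model.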
 
So for ASA's model, they say they base it on how far the goalkeeper would have to dive had he placed himself in the optimal location. Now, is that an assumption you can make in MLS? I feel like a lot of goalkeepers in the league wouldn't be able to position themselves there.
[Image: O3OMijK.png]

For example, take this crudely made model (with like 20% mathematical accuracy for bisection and perpendicularity) based on a cross to Wallace and a classic Wallace-like miss. The black squares are the posts and the black circle is the ball/Wallace. Normally I'd assume the goalkeeper would be somewhere in the green area, because after the cross he would have been positioned on the other side of the goal and would be running back across to where Wallace is standing with the ball... So wouldn't that increase the xG in actuality? And the converse applies as well.
Or is there a key point I missed, such as "xG doesn't care about the actual positioning of the keeper, etc."?
 
People always talk about "strength of schedule" regarding teams already played and our own ease of schedule up until that point based on opponent's ppg. One glaring issue I see, which I've finally been able to pinpoint after two years of it bothering me, is the ppg itself. Ppg is a useful measure of a team's prowess, yes, but what we currently use is the team's aggregate ppg over the whole season. Say we face a team during their hot streak: they lost their first 15 games, but have won their last 5 in a row. Their overall ppg is crap, but their rolling ppg is perfect.

Shouldn't we be calculating ease of schedule etc. based on rolling ppg instead of aggregate ppg? It better reflects how strong an opponent actually was when we played them.
Tl;dr: why don't we use rolling ppg to determine ease and strength of schedule?
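Here is a small sketch of the difference using the hypothetical team above (15 losses, then 5 wins). The helper names `ppg` and `rolling_ppg` and the 5-game window are assumptions made for illustration, not anything the forum's usual tables use.

```python
POINTS = {"W": 3, "D": 1, "L": 0}


def ppg(results):
    """Aggregate points per game over a list of 'W'/'D'/'L' results."""
    if not results:
        raise ValueError("need at least one result")
    return sum(POINTS[r] for r in results) / len(results)


def rolling_ppg(results, window=5):
    """Points per game over only the most recent `window` results."""
    return ppg(results[-window:])


# The hypothetical hot-streak team: lost the first 15, won the last 5.
results = ["L"] * 15 + ["W"] * 5

print(ppg(results))          # 0.75
print(rolling_ppg(results))  # 3.0
```

Same team, same 20 games: 0.75 on aggregate, 3.0 over the last five.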
 
I agree strength of schedule based on PPG is flawed, but I'm not sure your suggestion helps, because it's turtles all the way down. I don't think we really have a better option than what is being used.

First problem with opponent PPG is it almost always includes the games you played against them. So, for example, Houston at 6-1-1 has played opponents whose records collectively include a record of 1-6-1 against Houston, which makes their records worse but pretty clearly does not reflect how good they are when not having to play Houston. And those 8 games probably represent about 10% of their PPG score which is enough to make some real difference. So has Houston played an easy schedule because those clubs are weak or do those clubs seem weaker than they are because they all played Houston? Then repeat this for every team, with reverse effects for the worst teams. You could exclude those games, but it's either a ton of manual work or writing an algorithm that probably needs special data feeds.

For the issue you describe, it can be the same thing. Atlanta had no wins in the first 4 games, but 3 wins in the last 4. So you were lucky if you played Atlanta in the first month, right? Maybe. But also in the last 4 games Atlanta played teams with a combined H/A-adjusted PPG of just 0.95. Which brings us right back to chicken and egg. Did Atlanta get better or is it just playing easier opposition? If it is the latter, it doesn't make sense to feel sorry for their next opponent, Toronto. Nor should Toronto's strength of schedule be made to seem more difficult because of when they played Atlanta. So I don't think you should adjust strength of schedule for streaks. Plus doing so just slices the data into smaller chunks more likely to be affected by weird anomalous results. In the last 9 games of 2015 NYCFC lost 3 straight, then won 3 straight, then lost the final 3 straight. The team didn't actually get substantially better for 3 games when it won, and even if it did, that obviously told us nothing about how well it would play the final 3 games.
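As a rough sketch of the "exclude those games" idea, assuming you already have every result in hand as a simple list (which is the hard part): the function below averages the PPG of a team's opponents, with or without the games those opponents played against that team. The teams, scores, and the name `schedule_strength` are all made up for illustration.

```python
def points(goals_for, goals_against):
    """League points from one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    return 1 if goals_for == goals_against else 0


def schedule_strength(team, matches, exclude_head_to_head=True):
    """Average PPG of `team`'s opponents, counted once per game played against them.

    matches: list of (home, away, home_goals, away_goals) tuples.
    """
    opponents = []
    for home, away, _, _ in matches:
        if home == team:
            opponents.append(away)
        elif away == team:
            opponents.append(home)

    opponent_ppgs = []
    for opp in opponents:
        pts = games = 0
        for home, away, hg, ag in matches:
            if opp not in (home, away):
                continue
            if exclude_head_to_head and team in (home, away):
                continue  # skip games the opponent played against `team`
            pts += points(hg, ag) if home == opp else points(ag, hg)
            games += 1
        if games:
            opponent_ppgs.append(pts / games)
    return sum(opponent_ppgs) / len(opponent_ppgs) if opponent_ppgs else None


# Made-up results: A has beaten both B and C.
matches = [("A", "B", 2, 0), ("A", "C", 1, 0), ("B", "C", 1, 1), ("C", "B", 0, 2)]

with_h2h = schedule_strength("A", matches, exclude_head_to_head=False)
without_h2h = schedule_strength("A", matches)
print(with_h2h, without_h2h)
```

In this toy data, A's opponents look weaker when their losses to A are counted than when they are left out, which is the Houston effect described above in miniature. It does nothing about the chicken-and-egg problem with streaks.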
 
PPG is useful, but as with all stats one has to be careful when using it. Who's the better team, the one who has 10ppg but they lose their 10 games all by a score of 11-10, or the one who wins 10 games by a score of 1-0, earning a PPG of 1? Yes, farfetched, of course, but just pointing out that PPG isn't perfect. Your team could win the Shield, own the league, have double the PPG of anyone else, and then still end up getting knocked out of the playoffs in the first round. Analysis, planning, and strategy are great, but sometimes your whole season comes down to Vincent Kompany from out of nowhere putting that one shot in the upper 90.
 
Who's the better team, the one who has 10ppg but they lose their 10 games all by a score of 11-10, or the one who wins 10 games by a score of 1-0, earning a PPG of 1?

This doesn't address your main point, which I don't argue with, but I am pretty sure it is impossible to have 10 PPG, and if you win 10 games 1-0 your PPG is 3.0.
Goals ≠ points.

Which I well know you know well. But I think you had a moment of confusion there. :confused:
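For anyone following along, the arithmetic sketched out, using the standard 3 points for a win, 1 for a draw, 0 for a loss. The helper name `ppg_from_scores` is just for illustration.

```python
def ppg_from_scores(scores):
    """Points per game from a list of (goals_for, goals_against) scorelines."""
    total = 0
    for goals_for, goals_against in scores:
        if goals_for > goals_against:
            total += 3  # win
        elif goals_for == goals_against:
            total += 1  # draw
    return total / len(scores)


ten_narrow_wins = [(1, 0)] * 10       # ten 1-0 wins
ten_wild_losses = [(10, 11)] * 10     # ten 11-10 losses

print(ppg_from_scores(ten_narrow_wins))  # 3.0
print(ppg_from_scores(ten_wild_losses))  # 0.0
```

Since a win is worth 3 points, 3.0 is the ceiling. Scoring 100 goals in losses still earns nothing.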
 
NYC’s luck with recent back-2-back 100-year storms confirms that Kompany is about to score another screamer from 40m out.
 
Ugh, you're right, I'm sorry. Points ≠ goals of course. Is it time to go home yet? LOL.
 
I hear what you're saying, and that makes the entire thing way harder to calculate. Especially given that the best team in the league will always appear to have the easiest schedule, because everyone below them has a worse ppg?
 
Correct, though that overstates it a bit, especially as the season goes on and the larger game sample minimizes that effect. But yes, to some extent it stays true. Like last year, it seemed that Atlanta and the Red Bulls had easy schedules compared to the rest of the East because everyone else played 4 or 5 games against those 2 teams, while they played each other just twice (which sort of flips the analysis upside-down but comes to the same result).
 
In that case, all that ppg analysis is pretty pointless, imo. It "tells you something" but also really doesn't. At the end of the day, the eye test is what I prefer. Thanks for your help, I think I've learned a bit more from this.
 
How do you do an eye test on strength of schedule?
 
I look at the remaining games and determine what I think the "strength" of each team would be. Basically the same way everyone determines "X stretch of games is gonna be brutal" vs "X stretch of games should be easy bc they're all against weak opponents."