5 insights from academic research on predicting world soccer/football matches

Image from Flickr account of Antony Pranata

You want to know which country will win an international soccer match.

In particular, which rankings make the best predictions? Should you stick to the ubiquitous FIFA rankings or switch to the calculations of an upcoming number cruncher?

Recent academic research from the Netherlands sheds light on this question. Jan Lasek and coworkers looked at a variety of world rankings in soccer and asked how well they predicted the results of 979 test matches, a huge sample set.

To test the rankings, they developed a method so each ranking gave a “win probability” for a match. Then they looked at how far this probability deviated from the actual result of the match.

For example, suppose the United States is predicted to have a 54% chance to beat Mexico. If the match ends as a draw, the deviation of the prediction (0.54) from the result (0.5 for a draw) is 0.04. Taking the square of 0.04 gives a measure of the error. A win for the United States gives an error of (1.0 – 0.54) squared, while a loss results in an error of 0.54 squared.
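The arithmetic in this example can be sketched in a few lines of Python. The function name `squared_error` is made up for illustration; the only inputs are the numbers quoted above.

```python
def squared_error(win_probability: float, result: float) -> float:
    """Squared deviation of a predicted win probability from a match result.

    result is 1.0 for a win, 0.5 for a draw and 0.0 for a loss, from the
    point of view of the team the probability refers to.
    """
    return (result - win_probability) ** 2


# United States predicted to have a 54% chance to beat Mexico.
p_usa = 0.54
errors = {
    "win": squared_error(p_usa, 1.0),   # (1.0 - 0.54) squared
    "draw": squared_error(p_usa, 0.5),  # 0.04 squared
    "loss": squared_error(p_usa, 0.0),  # 0.54 squared
}
for outcome, err in errors.items():
    print(f"{outcome}: {err:.4f}")
```

Averaging this quantity over every match in the test set gives the mean squared error for a ranking.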

Jan asked me to take part in the study with The Power Rank. I directly provided him with the win probability for the 979 matches in the test set.

The visual shows the results for the mean squared error. A smaller error implies a better predictor.

[Figure: mean squared error of each ranking system, with horizontal uncertainty bars]

The horizontal bar gives a measure of the uncertainty in the error estimate. There is a 2 in 3 chance the true error is within the range of the bar.

The authors also looked at a different error measure called the binomial deviance. However, the results are similar to the mean squared error.

For the curious soccer fan, the paper draws the following conclusions.

The FIFA rankings

FIFA, the international governing body for soccer, publishes the most popular international rankings. However, it’s just a table (3 points for a win, 1 for a draw, 0 for a loss) that attempts to account for strength of opponent and importance of the match.

The FIFA rankings do poorly at predicting the outcome of matches.

What did you expect from such a simple method? They account for strength of schedule by taking the rank of an opponent and subtracting it from 200. That might have been novel in 1863.

While FIFA fails in ranking nations in men’s soccer, they do a better job for the women. The FIFA Women’s ranking uses an Elo-type rating system that accounts for margin of victory. This information is critical in predicting match outcomes.

Margin of victory

The top 5 rankings for predicting matches use margin of victory in their calculations. Only one of the remaining rankings in the study (not shown in the visual) uses this information.

Two of the top rankings, the FIFA women’s rankings and EloRatings.net, do not use margin of victory in any kind of sophisticated way.

For example, a typical Elo ranking uses a 1, 0.5, or 0 for a win, draw or loss in a match respectively. Instead, the FIFA women’s rankings use a number between 0 and 1 for a match outcome based on the score. These numbers, which Lasek and coworkers show in Table 2 of their paper, appear to have no mathematical justification. However, the rankings perform well in prediction.
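To make the difference concrete, here is a minimal sketch of a textbook Elo update in Python. The scale of 400 and the K factor of 20 are conventional chess-style assumptions, not values taken from the paper. The function `hypothetical_outcome_from_score` is invented for illustration and does not reproduce the numbers in Table 2.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Textbook Elo expectation for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, outcome_a: float, k: float = 20.0):
    """Move both ratings toward the observed outcome (k is an assumed value)."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def hypothetical_outcome_from_score(goals_a: int, goals_b: int) -> float:
    """Illustrative only: map a score to a number between 0 and 1.

    This is NOT the FIFA women's table. It only shows the idea that a
    bigger margin pushes the outcome closer to 1 for the winner.
    """
    if goals_a == goals_b:
        return 0.5
    margin = abs(goals_a - goals_b)
    winner_share = 1.0 - 0.15 / margin
    return winner_share if goals_a > goals_b else 1.0 - winner_share


# Typical Elo: a win counts as 1 no matter the score.
print(update(1500.0, 1500.0, 1.0))
# Score-based outcome: a 3-0 win counts for more than a 1-0 win.
print(update(1500.0, 1500.0, hypothetical_outcome_from_score(1, 0)))
print(update(1500.0, 1500.0, hypothetical_outcome_from_score(3, 0)))
```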

The Least Squares rankings and The Power Rank, two methods that naturally use margin of victory, were two of the other top systems.

The Elo++ rankings show the critical importance of margin of victory. This system won a Kaggle competition for ranking chess players. It has advanced features, such as giving less importance to matches in the distant past and using a sophisticated regression method in its calculation.

However, it does not account for margin of victory. While its performance in predicting matches isn’t as bad as the FIFA rankings, it does not perform as well as the top 4 rankings.

The wisdom of crowds

The best method for predicting football matches was the Ensemble, which combined the predictions of the FIFA women’s rankings, EloRatings.net, The Power Rank and Least Squares.

The improvement from aggregation was significant. The ensemble of 4 rankings had an error 4.3% lower than the average error of the 4 systems.
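The sketch below shows why averaging can help. It assumes the ensemble is a plain average of the four win probabilities, which may differ from how the paper combines them. All numbers are made up for illustration and are not from the study.

```python
def ensemble_probability(probabilities):
    """Combine several win probabilities by simple averaging (an assumption)."""
    return sum(probabilities) / len(probabilities)


# Made-up data: four systems' win probabilities for a match, then the result
# (1.0 win, 0.5 draw, 0.0 loss).
predictions = [
    ([0.54, 0.60, 0.48, 0.57], 0.5),
    ([0.70, 0.62, 0.75, 0.66], 1.0),
    ([0.35, 0.45, 0.30, 0.42], 0.0),
]
n_systems = 4

# Mean squared error of each system on its own.
system_mse = [
    sum((result - probs[i]) ** 2 for probs, result in predictions) / len(predictions)
    for i in range(n_systems)
]
average_system_mse = sum(system_mse) / n_systems

# Mean squared error of the averaged prediction.
ensemble_mse = sum(
    (result - ensemble_probability(probs)) ** 2 for probs, result in predictions
) / len(predictions)

print(f"average error of the 4 systems: {average_system_mse:.4f}")
print(f"error of the ensemble:          {ensemble_mse:.4f}")
```

With squared error, the averaged prediction can never do worse than the average of the individual errors. How much better it does depends on how much the systems disagree.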

Others have aggregated the wisdom of many computers, a type of ensemble learning, to make predictions. Nate Silver uses 4 different college basketball rankings in his NCAA tourney predictions. I aggregated 7 preseason baseball predictions to forecast the 2014 season.

You’ll see a lot more of this from The Power Rank heading into football season.

More games or only recent games?

The FIFA rankings use a four year window to calculate rankings. With the turnover in players and coaches on national teams, this seems like a reasonable time span over which to evaluate a team.

But maybe a team just gets lucky over that time span. Four years means fewer than 80 games for most countries. Maybe an underachieving country like Argentina has had bad luck in world competition recently.

When Jan Lasek asked me to be a part of his study, I did two separate calculations. For each match, I used these sets of games in predicting the outcome.

  • Every match from July 15, 2006 until the day before the match
  • Every match from January 4, 2002 through March 29, 2011 (a few days before matches in the test set)

Even though the first set contains fewer and more recent games than the second set, the two calculations had about the same predictive accuracy. The first appeared in the paper, but the second had a slightly smaller mean squared deviation.

Soccer teams don’t change much over time. Simon Kuper and Stefan Szymanski found the same result for England in the book Soccernomics. From 1980 through 2001, they found that the sequence of wins for the national team was indistinguishable from the random flipping of a coin.

Network research in rankings

Lasek and coworkers also studied the rankings from a paper by Park and Newman. They developed a ranking method based on their research in networks. The nodes in the network represent teams, and edges that connect nodes are games between the teams. The Power Rank uses the same concept.
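As a rough picture of that data structure, here is a small Python sketch that stores teams as nodes and games as edges. The matches are made up, and this is only the network representation, not the ranking calculation of Park and Newman or The Power Rank.

```python
from collections import defaultdict

# Made-up results for illustration: (team_a, team_b, goals_a, goals_b).
matches = [
    ("Brazil", "Argentina", 2, 1),
    ("Argentina", "Mexico", 0, 0),
    ("Mexico", "United States", 1, 3),
]

# Each team is a node. Each match adds an edge in both directions,
# labeled with the goal difference from that team's point of view.
network = defaultdict(list)
for team_a, team_b, goals_a, goals_b in matches:
    network[team_a].append((team_b, goals_a - goals_b))
    network[team_b].append((team_a, goals_b - goals_a))

for team, edges in network.items():
    print(team, edges)
```

A method that ignores margin of victory would read only the sign of each edge label, while a method that uses it would read the full goal difference.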

I’m not sure why, but the Park Newman method has a cult following. Maybe it’s because the paper is available for free on an archive, or that Mark Newman has a prestigious professorship in physics at the University of Michigan. But these rankings pop up everywhere. I even get random emails asking me about it.

However, the method does not use margin of victory, and it’s terrible at predicting football matches. It performs much worse than the FIFA rankings.

Check out the best international rankings

Lasek and coworkers highlight important aspects in ranking world soccer teams. However, their paper is not the last word on predicting matches.

The biggest problem with their method is using one win probability for a match. While this works for testing the predictive power of rankings, it does not get to the heart of football prediction: the probability for a win, loss and draw.

But the paper does give some simple advice for following world football. Check out EloRatings.net and The Power Rank.

Comments

  1. Hi Ed,

    I’m quite new to this but nonetheless found the idea behind the article interesting. I have some questions and they probably relate to entry level stats in all reality..

    From the article

    “For example, suppose the United States is predicted to have a 54% chance to beat Mexico. If the match ends as a draw, the deviation of the prediction (0.54) from the result (0.5 for a draw) is 0.04. Taking the square of 0.04 gives a measure of the error.”

    – Why is the deviation squared ? My initial thoughts were squaring was used in the case where deviations could be negatives, but I had thought probabilities could only be positives so there possibly wouldn’t be a need to square them . Or is it just that it is a standard process to square them to get the deviation ?

    “A win for the United States gives an error of (1.0 – 0.54) squared, while a loss results in an error of 0.54 squared.”

    – Is this saying that the 3 outcomes in terms of probability are (1 for win, 0.5 for a draw and 0.0 for a loss ?) ahhhhh…. this is where my previous statement falls over ..is it ? If the US had lost,their outcome would be 0.0, so when we subtract the .54 we get a negative (i.e -0.54), but we need to square it to make it positive and this gives us our error ?

    Thanks for reading this, apologies for the noob questions, wanted to make sure I understood before I took the whole article in.

    • Thanks for the clarification. You pretty much answered your own questions. The deviation of a prediction from the result can be negative, so taking the square makes it positive.

      There are some papers out there that show the goodness of a mean square estimator for error.

  2. I looked at the ELO rankings and while many are similar to the Power Rank, some are quite different. For example, what accounts for the difference between ELO’s ranking of 22 for Ivory Coast and your ranking of 13?

    • Not sure about the Ivory Coast; they might have been really strong early in the period of games I look at. These games would be weighted less in an Elo system.

      But the diversity between the two polls is a good thing. It makes the ensemble predictor that much more powerful.

  3. Hi Ed,

    It’s great to read about more rankings systems, because they’re starting to grow in popularity. We’ve developed rankings at We Global Football that have performed very well in predictive analysis, with lower Standard Deviation than even the SPI.

    We definitely agree that MOV needs to be accounted for in predictive rankings. But we disagree on the length of time that needs to be used for the rankings. To us, it’s the same as doing a ranking of another professional sport and using multiple seasons to rank teams, when there is vastly different roster composition.

    Lastly, we use a unique SOS metric by using a benchmark of our own rankings. Unlike FIFA, it’s not calculated as of the time of those rankings, but in aggregate. We think you’ll enjoy.

    Great article.

    http://www.weglobalfootball.com/2014/03/25/the-we-global-football-rankings-explained/

    • Thanks for stopping by. Good stuff over on your site. The rankings make sense.

      I agree that you shouldn’t use different seasons to rank teams in professional leagues. But international soccer is different. There’s no transfer market. You can’t trade players.

      As much as the United States would like to be a world powerhouse, there’s no amount of money that would make them into Brazil. A country’s tradition matters in the development of their youth. That’s why we see the same accuracy for our rankings over a 4 year window as a 12 year window.

      You may have different results with your rankings.

