As a smart football fan, you would like to identify overrated college football teams. This is a difficult task, as half of the top 5 teams in the preseason AP poll have made the College Football Playoff the past 4 seasons.

However, analytics has a nifty trick for identifying teams that don’t belong near the top of the preseason polls.

In addition, this trick lets you look at the statistics on any major media site and identify teams playing above their skill level. In a similar fashion, you can find teams that are better than their record.

The trick relies on regression to the mean.

When you hear the word regression, you probably think of how extreme performance during an earlier period most likely gets closer to average during a later period. It’s difficult to sustain an outlier performance.

This intuitive idea of reversion to the mean is based on linear regression, a simple yet powerful data science method. It powers my preseason college football model that has predicted almost 70% of game winners the past 3 seasons.

The regression model also powers my preseason analysis over on SB Nation. In the past 3 years, I haven’t been wrong about any of 9 overrated teams (7 correct, 2 pushes).

Linear regression might seem scary, as quants throw around terms like “R squared value,” not the most interesting conversation at cocktail parties. However, you can understand linear regression through pictures.

Let me explain.

## 1. The 4 minute data scientist

To understand the basics behind regression, consider a simple question: how does a quantity measured during an earlier period predict the same quantity measured during a later period?

In football, this quantity could measure team strength, the holy grail for computer team rankings. It could also be turnover margin or win percentage in one score games.

Again, consider this question:

How does a quantity in an earlier period predict the same quantity during a later period?

Some quantities persist from the early to later period, which makes a prediction possible. For other quantities, measurements during the earlier period have no relationship to the later period. You might as well guess the mean, which corresponds to our intuitive idea of regression.

To show this in pictures, let’s look at 3 data points from a football example. I plot the quantity during the 2016 season on the x-axis, while the quantity during the 2017 season appears as the y value.

If the quantity during the earlier period were a perfect predictor of the later period, the data points would lie along a line. The visual shows the diagonal line along which x and y values are equal.

In this example, the points do not line up along the diagonal line or any other line. There is an error in predicting the 2017 quantity by guessing the 2016 value. This error is the distance of the vertical line from a data point to the diagonal line.

For the error, it should not matter whether the point lies above or below the line. It makes sense to multiply the error by itself, or take the square of the error. This square is always a positive number, and its value is the area of the blue boxes in this next picture.

The total area of the blue boxes is the sum of the squared errors. Divide by the number of points and you get the mean squared error.
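As a minimal sketch of this arithmetic, the code below computes the squared errors for the guess "next year equals last year." The three points are made up for illustration and are not the data in the figure; the names `points` and `squared_errors_vs_diagonal` are likewise hypothetical.

```python
# Hypothetical (2016 value, 2017 value) pairs; NOT the data from the figure.
points = [(-5.0, 1.0), (3.0, -2.0), (10.0, 6.0)]

def squared_errors_vs_diagonal(pairs):
    """Squared vertical distance from each point to the diagonal line y = x."""
    return [(y - x) ** 2 for x, y in pairs]

errors = squared_errors_vs_diagonal(points)
total_area = sum(errors)                       # total area of the boxes
mean_squared_error = total_area / len(errors)  # average area per box

print(errors)
print(total_area, mean_squared_error)
```

Squaring makes a miss of 6 above the line count the same as a miss of 6 below it, which is the point of the boxes.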

In the previous example, we looked at the mean squared error for guessing the early period as the perfect predictor of the later period. Now let’s look at the opposite extreme: the early period has zero predictive ability. For each data point, the later period is predicted by the mean of all values in the later period.

This prediction corresponds to a horizontal line with the y value at the mean. This visual shows the prediction, and the blue boxes correspond to the mean squared error.

The area of these boxes is a visual representation of the variance of the y values of the data points. Also, this horizontal line with its y value at the mean gives the minimum area of the boxes. You can show that every other choice of horizontal line would give three boxes with a larger total area.
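To check the claim about horizontal lines numerically, here is a small sketch that compares the box area at the mean with the area at other heights. The values in `y_values` are hypothetical, and `box_area` is an illustrative helper name.

```python
# Hypothetical later-period values; NOT the data from the figure.
y_values = [1.0, -2.0, 6.0]

def box_area(level, values):
    """Total squared error when a horizontal line at `level` predicts every point."""
    return sum((y - level) ** 2 for y in values)

mean_y = sum(y_values) / len(y_values)
area_at_mean = box_area(mean_y, y_values)
variance = area_at_mean / len(y_values)   # population variance of the y values

# Try other horizontal lines on a grid around the mean, in steps of 0.5.
candidates = [mean_y + 0.5 * k for k in range(-10, 11)]
best = min(candidates, key=lambda level: box_area(level, y_values))

print(mean_y, area_at_mean, variance)
print(best == mean_y)
```

The reason the mean wins is that the area at any height c equals the area at the mean plus n times (c minus the mean) squared, and that extra term is zero only when c is the mean.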

Regression requires finding the line that minimizes the squared error, or the area of the boxes. This line is called the best fit line, and the next visual shows the best fit line along with the corresponding minimum mean squared error.
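For a single predictor, the best fit line has a closed form, sketched below with the standard least-squares formulas. The data points are hypothetical, and the function names are chosen only for this example.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Total box area for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (earlier, later) values; NOT the data from the figure.
xs = [-5.0, 3.0, 10.0]
ys = [1.0, -2.0, 6.0]

slope, intercept = best_fit_line(xs, ys)
fit_error = squared_error(xs, ys, slope, intercept)
diagonal_error = squared_error(xs, ys, 1.0, 0.0)               # guess y = x
flat_error = squared_error(xs, ys, 0.0, sum(ys) / len(ys))     # guess the mean

print(slope, intercept)
print(fit_error, diagonal_error, flat_error)
```

By construction, the best fit line's error can be no larger than the error of the diagonal line or the horizontal line, since both are candidates it was chosen over.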

In trying to scare off normal people, the quants will turn up their noses and say things like “the best fit line explains 70% of the variance.” Even worse, they might call this the “R squared” value.

You can understand this statement through the pictures. “The best fit line explains 70% of the variance” means that the total area of the red boxes is 70% less than the area of the original blue boxes for the horizontal line.

In this example, the best fit line causes a significant reduction in the area of the box corresponding to the left data point. The box gets larger for the middle point (the blue box is obscured by the red box corresponding to the best fit line). But overall, the area of the red boxes is 70% less than that of the blue boxes.

You can also think about the error the best fit line doesn’t explain. The area of the red boxes is 30% of the area of the blue boxes. This 30% is the remaining variance after the best fit line removes 70% of the original variance.

The higher the R squared value (70% in the example above), the smaller the red boxes of the best fit line. The line explains the data very well.

In contrast, the lower the R squared value, the larger the red boxes of the best fit line, which will be more horizontal. It doesn’t do much better than the horizontal line of the average.
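In code, R squared is one minus the ratio of the red area (error around the best fit line) to the blue area (error around the mean). The sketch below assumes a single predictor and uses hypothetical data; `r_squared` is an illustrative name.

```python
def r_squared(xs, ys):
    """Fraction of the variance in ys explained by the least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    red = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    blue = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - red / blue

# Hypothetical data; NOT the points from the figures.
r2 = r_squared([-5.0, 3.0, 10.0], [1.0, -2.0, 6.0])
perfect = r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])   # points on a line
useless = r_squared([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0])            # flat best fit line

print(r2, perfect, useless)
```

Points that sit exactly on a line leave no red area, so R squared is 1. When the best fit line is flat, the red and blue boxes coincide and R squared is 0.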

## 2. Persistence versus regression to the mean

The data above come from my team ratings in college football. To develop these numbers, I take margin of victory in games over a season and adjust for strength of schedule through my ranking algorithm. The rating gives an expected margin of victory against an average team on a neutral site.

For the 2014 through 2016 seasons, here is how the team rating for one season predicted the rating for the next season.

The data hug the best fit line, and ratings from the previous season explain more than half of the variance in ratings the following season (54.1%).

Compare this to the same plot with turnover margin, or takeaways minus giveaways.

For turnover margin, the best fit line is almost flat. Turnover margin in one season explains 2.6% of the variance in turnover margin the following season.

From these two plots, we can make two statements:

- Team strength in college football as measured by adjusted margin of victory tends to persist from year to year. This will be useful in making a preseason college football model.
- Turnover margin regresses to a mean of zero from year to year. This implies that turnover margin last season has almost zero ability to predict turnover margin this season.
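The contrast between these two statements can be reproduced with a toy simulation. The sketch below is not the data behind the plots: it assumes each team has a fixed skill plus independent luck each season, and the standard deviations are arbitrary choices for illustration.

```python
import random

def r_squared(xs, ys):
    """Squared correlation, which equals R squared for a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def simulate_two_seasons(n_teams, skill_sd, luck_sd, rng):
    """Each team keeps its skill across seasons; luck is redrawn every season."""
    first, second = [], []
    for _ in range(n_teams):
        skill = rng.gauss(0.0, skill_sd)
        first.append(skill + rng.gauss(0.0, luck_sd))
        second.append(skill + rng.gauss(0.0, luck_sd))
    return first, second

rng = random.Random(0)
# Mostly skill: behaves like a team rating that persists.
r2_persistent = r_squared(*simulate_two_seasons(2000, 10.0, 5.0, rng))
# All luck: behaves like a quantity that regresses fully to the mean.
r2_luck = r_squared(*simulate_two_seasons(2000, 0.0, 5.0, rng))

print(r2_persistent, r2_luck)
```

With these assumed spreads, theory puts the first R squared near 0.64 and the second near zero. The larger the share of luck in a quantity, the flatter its year-to-year best fit line.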

When most people talk about regression, they usually mean the strong type we see in turnover margin. To see this from a different perspective, let’s consider the wins and losses of Coach Average. His results come from the flip of a coin with an equal chance of a win or a loss.

For this visual, I generated this data once with 8 lines of Python code. In no way did I search for a sequence with streaks. However, streaks almost always appear in the sequence.

In this simple experiment, the flipping of any one coin has no impact on the outcome of the next coin. The code makes each game for Coach Average independent of all other games.

Regression to the mean implies that despite a hot 8-2 start for Coach Average, he should still expect to win half of the next 10 games. In fact, he wins 6 of the next 10 games.

Coach Average also expects to regress to .500 after 9 straight losses starting on game 19. However, he loses 6 of the next 8 games.
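The original 8 lines of Python are not shown here, so the following is an independent sketch of the same experiment. The season length of 40 games, the seed, and the function names are assumptions.

```python
import random

def simulate_coach_average(n_games, rng):
    """Independent fair coin flips: True is a win, False is a loss."""
    return [rng.random() < 0.5 for _ in range(n_games)]

def longest_streak(results):
    """Length of the longest run of identical results."""
    longest = current = 0
    previous = None
    for result in results:
        current = current + 1 if result == previous else 1
        previous = result
        longest = max(longest, current)
    return longest

rng = random.Random(2018)   # fixed seed so the sequence is reproducible
games = simulate_coach_average(40, rng)

print("".join("W" if won else "L" for won in games))
print("wins:", sum(games), "longest streak:", longest_streak(games))
```

Change the seed and the streaks move around, but they rarely disappear. No game in the sequence knows anything about the games before it.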

## 3. Skill versus luck

Phenomena in the real world are not as simple as this coin flipping experiment, and we need to be cautious in making statements about sports. When a quantity like turnover margin has no ability to predict future turnover margin, it doesn’t imply a lack of skill in preventing or forcing turnovers.

While the analytics community doesn’t have a complete picture of turnovers, a few key insights have started to emerge. The game situation matters in turnovers. Over the past decade in college football, teams in the lead have committed a third of all turnovers. For teams ahead by a touchdown or more, the rate drops to 14% of all turnovers.

This would help explain why Alabama has posted a +68 turnover margin over the past 8 seasons. The Crimson Tide failed to have a positive turnover margin only in 2014, when they gave away the ball 2 more times than they took it away.

The dependence of turnovers on game situation makes sense. Teams in the lead tend to run the ball, especially later in the game. Turnovers happen at a lower rate on running than passing plays. If a team faces a deficit, they need to throw the ball to get back into the game.

For another example of how regression doesn’t necessarily imply a lack of skill, let’s turn to college basketball. Ken Pomeroy wondered how much control teams have over three point shots. He asked how a team’s 3 point shooting percentage in the first half of the conference season predicted the second half.

The visual shows his results.

The left panel shows 3 point percentage on *offense*. The first half of the conference season has almost no ability to predict the same quantity later in the season. Does that mean shooting is not a skill? Tell that to Steph Curry.

The visual also shows the strong regression for 3 point field goal defense. This suggests a lack of skill in defending the 3. To confirm this, Pomeroy performed a more detailed study on 3 point defense for teams over a 5 year period. He concluded there is some skill, but randomness plays a bigger role than anyone expects.

When a quantity regresses to the mean like turnover margin or 3 point shooting percentage, it doesn’t necessarily imply a lack of skill. Fumbles may regress to the mean, but that doesn’t mean a running back isn’t fumble prone if he palms the ball with one hand while running through the line of scrimmage.

However, it is safe to say that randomness plays a large role in turnovers and 3 point shots.

## 4. How to make preseason college football predictions

USC had high expectations heading into 2017. Sam Darnold took over the starting QB job the previous season and led the Trojans to a 9-0 finish, which included a dramatic win over Penn State in the Rose Bowl.

At the start of the 2017 season, the pollsters put USC 4th in the preseason poll (both AP and Coaches). This made Clay Helton’s team the favorite to make the College Football Playoff out of the Pac-12 conference.

In contrast, no one knew quite what to expect from Georgia. Just like USC, 2017 was their second season under a young coach, Kirby Smart. But unlike USC, they struggled in 2016.

Georgia started true freshman Jacob Eason at quarterback, who delivered a mediocre 55% completion rate. They ranked 81st in my adjusted yards per attempt.

Georgia went 8-5 in 2016, a record acceptable only for new coaches in Athens. To start 2017, they landed at 15th in the preseason AP poll.

So what does regression say about these two teams? Each year, I put together a preseason college football model that uses regression on many variables.

In college football, team performance tends to persist from year to year. Programs like Alabama have financial resources and traditions that Rice will never have. These teams will not swap places in the college football hierarchy.

Hence, my preseason model uses the past 4 years of team performance to predict the next season. This part of the model says that a team is most likely to perform as some combination of its last 4 years, with recent years weighted more. This makes the model cautious about an outlier season or an outlier run of 9 games.

The preseason model also considers turnover margin. Turnovers can impact the scoreboard, as a key fumble halts a game winning drive, or an interception returned for a touchdown turns a tight game into a laugher.

However, turnover margin regresses to the mean of zero from year to year in college football. Hence, the model uses turnover margin in each of the past 4 seasons. This holds back the excitement for a team that made a huge jump in rating with a +25 in turnover margin.

Last, the model considers returning starters. More experience implies better performance for college football teams.
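To make the idea of regression on many variables concrete, here is a toy sketch that fits next season's rating on past ratings, turnover margin, and returning starters. It is not the actual preseason model: the data are synthetic, every number in the setup is an assumption, and it uses only last season's turnover margin to stay short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # synthetic team-seasons, far more than real college football offers

# Toy world: each team has a persistent strength; everything else is noise.
strength = rng.normal(0.0, 10.0, size=n)
turnover_margin = rng.normal(0.0, 8.0, size=n)            # last season, pure luck here
returning_starters = rng.integers(8, 20, size=n).astype(float)

# Ratings for the past 4 seasons (column 0 is the most recent season).
past_ratings = strength[:, None] + rng.normal(0.0, 5.0, size=(n, 4))
past_ratings[:, 0] += 0.5 * turnover_margin               # luck inflated last year's rating

# Next season's rating: strength, a small experience effect, and fresh noise.
next_rating = (strength
               + 0.5 * (returning_starters - returning_starters.mean())
               + rng.normal(0.0, 5.0, size=n))

# Least-squares fit of next season on all predictors plus an intercept.
X = np.column_stack([np.ones(n), past_ratings, turnover_margin, returning_starters])
coef, *_ = np.linalg.lstsq(X, next_rating, rcond=None)

names = ["intercept", "rating_1yr_ago", "rating_2yr_ago", "rating_3yr_ago",
         "rating_4yr_ago", "turnover_margin", "returning_starters"]
for name, value in zip(names, coef):
    print(f"{name:>20}: {value: .3f}")
```

In this toy world the fitted coefficient on turnover margin should come out negative: a rating pumped up by turnover luck gets discounted. The four past ratings get roughly equal weights here because the toy strength never changes; with real teams, whose strength drifts, recent seasons would be expected to earn more weight.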

Over the past 3 seasons (2015-17), this regression model for college football has predicted 69.8% of game winners. This rate doesn’t include the easier-to-predict cupcake games in which FBS teams face inferior FCS teams. The model only makes predictions for games between two FBS teams.

Heading into 2017, the preseason college football model had USC 16th. In the previous 4 seasons, USC had never finished the season higher than 14th in my college football rankings. Despite their impressive 9-0 finish in 2016, they only rose to 14th because of a poor start.

The model doesn’t distinguish between returning starters at different positions. The quarterback has an outsized impact on a football team, and Darnold’s status as a top NFL prospect could convince you of USC as higher than 16th. However, 4th in the AP poll seemed too optimistic.

In contrast, the regression model agreed with the AP poll on Georgia. The model had the Bulldogs at 18th while the polls had them at 15th. Analytics and polls agreed on Georgia as a solid top 25 team but not a playoff contender.

During the actual 2017 season, USC did not live up to their top 4 ranking. They dropped an early road game at Washington State. Then their playoff hopes ended when Notre Dame stomped their defense for 8.4 yards per carry in a 49-14 win.

Georgia’s season got off to a distressing start as Eason sustained an injury in the first game. However, this turned into a blessing, as true freshman Jake Fromm took over and had a brilliant season. Georgia’s pass offense finished 10th in my adjusted yards per attempt.

The defense also improved, as they jumped from a solid 28th in my adjusted yards per play in 2016 to an elite 3rd in 2017. This unit had only one bad game when they allowed 40 points at Auburn. However, they atoned for this blip by holding Auburn to 7 points in the SEC title game win a few weeks later.

Georgia made the College Football Playoff, and they took Alabama to overtime in the championship game before losing.

Here’s the take home message about college football preseason predictions: It’s much easier to predict regression for a team like USC than a sudden rise for Georgia.

In each of the past 3 seasons, I’ve written about 3 overrated college football teams in the preseason polls on Football Study Hall, an SB Nation site (2015, 2016, 2017). This analysis combines my regression model with knowledge of programs.

Looking back on these predictions, I’ve been right about 7 of 9 teams. The other two teams, Penn State and Oklahoma State, finished lower in the final poll than the preseason poll in 2017. However, I’ll call these predictions a push as they both performed better than I predicted.

There’s a good reason I have never written a corresponding 3 underrated teams article. A regression model is unlikely to identify breakout teams like Georgia in 2017. Their true freshman quarterback worked out, and their defense made a leap.

There is a random element to teams that make a sudden rise. These predictions are more difficult than finding an overrated team by a regression model.

## 5. How to make accurate predictions with regression

I want you to take home two main points.

First, quantities in football like turnover margin show very little persistence from an earlier period to a later period. These quantities regress to the mean, as your best guess for the later period is the average.

To find football teams that might not be as strong as their record suggests, look for teams with large, positive turnover margin. In contrast, teams with large, negative turnover margin might be better than their record. Use these links for data on college football and the NFL.

As another example, consider an NFL team’s record in close games from year to year. Based on data from the 2012 through 2017 seasons, the visual shows that one year explains 1.3% of the variance the next year.

A team like Oakland, who went 9-2 in one score games in 2016, should not expect the same good fortune again. The Raiders went 4-3 in one score games in 2017. However, regression to the mean isn’t a perfect predictor, as Cleveland has a 2-16 record in one score games the past 3 seasons.
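A rough way to turn that 1.3% into a number is to shrink last year's record toward the mean. The sketch below assumes the spread of close-game records is the same in both years and that the year-to-year correlation is positive, so the slope of the best fit line equals the square root of R squared. It also assumes a league mean of .500; the function name is illustrative.

```python
import math

def regress_to_mean(observed, mean, r_squared):
    """Predict next period's value by shrinking this period's value toward the mean.

    Assumes equal spread in both periods and a positive correlation,
    so the regression slope is sqrt(r_squared).
    """
    slope = math.sqrt(r_squared)
    return mean + slope * (observed - mean)

oakland_2016 = 9 / 11   # 9-2 in one score games
prediction = regress_to_mean(oakland_2016, mean=0.5, r_squared=0.013)

print(round(prediction, 3))
```

Under these assumptions, a 9-2 record shrinks to an expected win rate of about .536 in close games the next year, far closer to .500 than to .818.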

Second, some quantities like team strength in college football tend to persist from year to year. This allows for predictive models based on linear regression.

Even with this persistence, the models still predict regression for outlier performances, both good and poor. The 9-0 stretch for USC to end 2016 serves as an example. However, regression models cannot predict teams that jump from ordinary to the outlier, like Georgia in 2017.

These ideas apply for both my preseason regression model at The Power Rank and Bill Connelly’s S&P+ numbers. Use these rankings as a guide to find overrated teams near the top of the polls.

Dear Ed Feng,

For school I am writing (something like) a thesis about the predictability of sport. I am trying to predict soccer, basketball, baseball, American football and ice hockey (the 5 most popular team sports in the US). Afterwards I will compare the results and try to find out what causes the difference in predictability. I still have to learn a lot about predictive modeling, and this article and the other articles on this site have helped me enormously. The article is very well written and explains everything very clearly. What I have struggled with most until now is choosing which statistics to use for my predictive models. Just to be sure: if you want to find out which statistics will be useful for your model, you have to run a regression between two different years? Consequently, if there is a strong correlation between the two, it would be a good predictor? I am asking this question because I was under the impression that a correlation between the value you want to predict and an independent variable would imply that it would be useful for predicting.

I recommend looking at variables such as margin of victory and turnovers and seeing whether there is a correlation, both year to year and early to late season. Then you get a baseline for whether to use the variable.

You can also test your models with partial F-tests, meaning you run the full model and then selectively remove variables. The smaller models are known as nested models. A nested model drops the variables you wish to test, and comparing its residual sum of squares with that of the full model shows whether those variables matter. This indicates which models reject or accept your hypothesis. It is a great way to quickly test variables, especially with R or some other software. The SSR (sum of squared residuals) basically shows you which variables improve or degrade your model performance.

Example: You are testing a full model prediction on an NFL team. Suppose you have selected a multiple regression model with PPR fantasy points scored by a QB as your dependent variable, and you choose DVOA, PPR given up by the defensive team, Division/Non Division, Home/Away, etc. as predictors. There is a veritable cornucopia of stats you could add to your full model and never get anything more accurate than watching ESPN Fantasy Live…BUT…by building your model and seeing the impact of variables based on partial F-tests, you can see how removing variables from your full model improves or degrades it.

This is how I approach any new concept for model building and I further enhance them based on new information or new stats introduced by the geeks.
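For readers who want to try the partial F-test described above, here is a minimal sketch in Python with NumPy. The data are synthetic, the variable names are hypothetical, and the statistic is the standard one: the drop in residual sum of squares per removed variable, divided by the full model's residual variance.

```python
import numpy as np

def residual_sum_of_squares(X, y):
    """Residual sum of squares from a least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(residuals @ residuals)

def partial_f_statistic(X_full, X_reduced, y):
    """F statistic for whether the extra columns in X_full improve the fit.

    X_reduced must use a subset of the columns of X_full (nested models).
    """
    n, p_full = X_full.shape
    p_reduced = X_reduced.shape[1]
    rss_full = residual_sum_of_squares(X_full, y)
    rss_reduced = residual_sum_of_squares(X_reduced, y)
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n - p_full)
    return numerator / denominator

# Synthetic example: y depends on `useful` but not on `useless`.
rng = np.random.default_rng(1)
n = 200
useful = rng.normal(size=n)
useless = rng.normal(size=n)
y = 2.0 * useful + rng.normal(size=n)

ones = np.ones(n)
full = np.column_stack([ones, useful, useless])
f_drop_useless = partial_f_statistic(full, np.column_stack([ones, useful]), y)
f_drop_useful = partial_f_statistic(full, np.column_stack([ones, useless]), y)

print(f_drop_useless, f_drop_useful)
```

A large F statistic means the removed variable was carrying real weight; a small one means the simpler model does about as well. To get a p-value, compare the statistic with an F distribution with (number of removed variables, n minus number of full-model columns) degrees of freedom, for example with `scipy.stats.f.sf`.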

Great stuff, Ed. I really enjoyed the article and will spend more time with it.

CR