College football fans like us hate the BCS.

Unless you work for ESPN and the BCS contributes to your paycheck, you probably find the idea of allowing only two teams to play for the national championship criminal.

And if you’re reading this numbers-based blog, you probably know about the problems with the computer polls used in the BCS rankings.

First, only one of the six ranking systems gives enough details so that others can reproduce the results. The other five black boxes are shrouded in mystery.

Second, the BCS forbids the computers from using margin of victory in their calculations. It does not matter that a 33-point loss says something much different about a team than a 1-point loss. In the name of sportsmanship, the BCS will not give teams the incentive to run up the score.

Last, you may have even heard that Richard Billingsley, the man behind one computer poll, is not a mathematician. As he admitted to the authors of *Death to the BCS*, “I don’t even have a degree. I have a high school education. I never had calculus. I don’t even remember much about algebra.”

But it gets worse.

### Why Strength of Schedule and Margin of Victory Matter

I wasn’t looking for a flaw in a BCS computer poll.

I was thinking about strength of schedule and margin of victory. In college football debates, most people agree that a ranking system should account for these factors. The intuition is obvious. Northern Illinois does not play the quality schedule that Alabama does. Oregon’s 24-point win over a solid Oregon State team says something much different about the Ducks than a 1-point win. However, no one has provided any quantitative evidence to support accounting for schedule strength and victory margin in rankings.

Bowl games at neutral sites provide a simple quantitative test for ranking systems: how often does the higher ranked team win each game? For a system that incorporates neither schedule strength nor victory margin, rank teams by winning percentage. For a system that accounts for strength of schedule but not margin of victory, rank teams with the Colley Matrix of the BCS. Last, my rankings account for both.
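The test itself is easy to sketch in code. The snippet below is only an illustration: the function name, the team labels and the ranks are all made up, and a real check would use actual bowl results and the pre-bowl rankings of each system.

```python
def higher_ranked_win_rate(games, rank):
    """Fraction of games won by the higher ranked team.

    games: list of (winner, loser) pairs.
    rank:  dict mapping team -> rank, where 1 is the best.
    Games between teams with equal rank are ignored.
    """
    decided = [(w, l) for w, l in games if rank[w] != rank[l]]
    if not decided:
        raise ValueError("no games between differently ranked teams")
    correct = sum(1 for w, l in decided if rank[w] < rank[l])
    return correct / len(decided)


# Hypothetical bowl results and ranks, for illustration only.
bowl_games = [("A", "B"), ("C", "D"), ("F", "E"), ("G", "H")]
ranks = {"A": 1, "B": 5, "C": 2, "D": 9, "E": 3, "F": 12, "G": 20, "H": 4}

print(higher_ranked_win_rate(bowl_games, ranks))
```

Running the same function with each system's ranks (winning percentage, Colley, a margin-of-victory system) gives directly comparable hit rates on the same set of games.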

I spent some time digging into the details of the Colley Matrix.

### Colley Matrix does not consider the results of each game

It was easy to get mesmerized by the beautiful mathematics behind Colley’s method. His paper discusses Laplace’s die problem, a symmetric positive definite matrix and solving a linear system of equations. I spent the weekend telling my wife that if college football games had only winners and losers, this would be a dandy little ranking algorithm.

Then thinking back on the equations, it hit me.

The method does not care who a team loses to when ranking it. It considers the win-loss record of each team and the number of games played between each pair of teams. However, the specifics of who won each game are not an input to Colley’s method.

**Omitting specific game results in a ranking system is like disabling the guidance system on a missile. The technology will do its job, but it will not be that accurate.**

You can check this yourself by reading the descriptions of equations 18 and 19 in Colley’s paper. It’s possible to solve for the rankings (equation 17) knowing only each team’s record and how many games each pair of teams played.
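To make this concrete, here is a minimal sketch in Python with NumPy. It assumes the usual form of Colley's linear system: a diagonal entry of 2 plus games played, an off-diagonal entry of minus the number of games between the pair, and a right-hand side of 1 + (wins - losses)/2. The function name and the three-team league are made up for illustration. Note that the inputs are records and pair counts only; no list of who beat whom is ever passed in.

```python
import numpy as np


def colley_ratings(wins, losses, games_between):
    """Solve the Colley system C r = b from records and pair counts only.

    wins, losses:  length-n sequences with each team's record.
    games_between: symmetric n x n array; entry (i, j) is the number of
                   games teams i and j played against each other.
    """
    wins = np.asarray(wins, dtype=float)
    losses = np.asarray(losses, dtype=float)
    C = -np.asarray(games_between, dtype=float)    # off-diagonal: -n_ij
    np.fill_diagonal(C, 2.0 + wins + losses)       # diagonal: 2 + games played
    b = 1.0 + (wins - losses) / 2.0
    return np.linalg.solve(C, b)


# Hypothetical round robin: team 0 went 2-0, team 1 went 1-1, team 2 went 0-2.
pair_counts = [[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]]
ratings = colley_ratings([2, 1, 0], [0, 1, 2], pair_counts)
print(ratings)
```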

As a mathematician, I find this omission appalling. To see why, take Alabama in 2012 as an example. The Crimson Tide lost to Texas A&M, a respectable loss to another top 10 team. But the Colley Matrix does not account for this. Suppose Alabama beat Texas A&M but lost to a bad Florida Atlantic team. Since a top team almost never loses to a bad team, this bad loss should lower Alabama’s rank. It doesn’t.

You can check this with your own example. Wesley Colley has set up a page on which you can add and remove games and recalculate the rankings.

When I first discovered this omission in 2012, Stewart Mandel of Sports Illustrated suggested looking into whether this flawed computer poll was helping Kent State. The Golden Flashes were 11-1 heading into the MAC championship game and ranked 17th in the BCS. If they moved up to 16th or better, they would earn a BCS bowl bid.

Sure enough, the Colley Matrix had Kent State ranked 15th, the highest rank in any computer poll. It did not consider that their lone loss came at Kentucky, a 2-10 team that won zero SEC games that year. This flawed computer poll played a small role in placing Kent State 17th overall in the BCS rankings. I wrote about this on SI.com.

### “Sam Feng’s article is a perfect example of anti-science”

In response to my article, I got this tweet the next day. Amidst a flurry of four-letter words, a blogger blasted the mathematics behind my analysis of Colley’s method.

“I disapprove of Feng because I don’t know what the f*$% he was doing, and I don’t think he knows what the f*$% he was doing either.”

I guess that can happen when your writing jumps from a small blog to SI.com. At least he could get my name right.

I promptly replied to his post, and a conversation ensued about the details of the mathematics. In the end, the blogger verified my main conclusion that Colley omits the results of each game. The post started with a rant about “anti-science”. It progressed with a dense mathematical discussion in the comments. It ended with this in the last comment.

“I’d rather have Ed’s rankings making the decisions than Colley’s or a roomful of NCAA bureaucrats.”

The blogger still had a problem with the example I used in the article, a criticism with some merit. The example is equivalent to the Alabama scenario above. In exchanging the loss to Texas A&M for a loss to Florida Atlantic, the records of these two opponents change. Since the Colley Matrix does consider each team’s record, the rankings do change. However, Alabama’s rank does not change. This makes no sense when a team loses to a cupcake.

To be precise, one can show the rankings remain exactly the same under certain changes of wins and losses. For an example in 2012,

- Stanford beat Oregon
- Oregon beat Washington
- Washington beat Stanford

Suppose we change the result in each game.

- Oregon beat Stanford
- Washington beat Oregon
- Stanford beat Washington

Since all teams have the same record, the rankings stay **exactly** the same. Oregon would remain 7th despite losing to an average Washington team. It just doesn’t make sense.
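Here is a small numerical check of this claim, again assuming the usual Colley setup (diagonal 2 + games played, off-diagonal minus games between the pair, right-hand side 1 + (wins - losses)/2). The league is a toy one: the three-game cycle above plus two made-up games against a hypothetical "Opponent X" so that the ratings are not all equal. It is not the real 2012 schedule.

```python
import numpy as np


def colley_from_games(teams, games):
    """Build and solve the Colley system from a list of (winner, loser) games."""
    idx = {t: k for k, t in enumerate(teams)}
    n = len(teams)
    C = 2.0 * np.eye(n)
    b = np.ones(n)
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        C[w, w] += 1.0          # one more game played by each team
        C[l, l] += 1.0
        C[w, l] -= 1.0          # one more game between this pair
        C[l, w] -= 1.0
        b[w] += 0.5             # only the records enter the right-hand side
        b[l] -= 0.5
    return np.linalg.solve(C, b)


teams = ["Stanford", "Oregon", "Washington", "Opponent X"]
extra = [("Stanford", "Opponent X"), ("Oregon", "Opponent X")]  # hypothetical games

actual = [("Stanford", "Oregon"), ("Oregon", "Washington"),
          ("Washington", "Stanford")]
flipped = [(loser, winner) for winner, loser in actual]

r_actual = colley_from_games(teams, actual + extra)
r_flipped = colley_from_games(teams, flipped + extra)
print(np.allclose(r_actual, r_flipped))
```

Reversing the cycle changes who beat whom in all three games, yet every team keeps the same record and the same opponents, so the matrix and right-hand side are identical and the ratings match to machine precision.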

However, for the sake of simplicity, I went with the example in which one team traded a loss for a win. At the end of the day, Colley’s method disregards massive amounts of useful information.

### Northern Illinois busts the BCS.

Before the MAC championship game in 2012, Kent State threatened to bust the BCS with their ranking of 17th. However, their opponent, Northern Illinois, wasn’t too far behind at 21st.

After winning the championship game, Northern Illinois jumped to 15th in the final rankings to earn a BCS bowl game against Florida State. In the computers, the Huskies made massive jumps in the polls of Richard Billingsley and Peter Wolfe.

Billingsley does not describe his ranking method, so no one knows why he bumped Northern Illinois from 19th to 12th.

However, Peter Wolfe describes his method and even offers a few references for his Bradley-Terry model. This nice academic article by Keener describes the model in some detail, making it possible to reproduce the results. After carefully reading this paper, I didn’t find any problems with the ranking method. It **does matter** that Kent State lost to Kentucky. The math seems to favor teams that play an extra game, which most likely helped Northern Illinois jump from 23rd to 12th after their win over Kent State.

### A Playoff on the Horizon

I didn’t think anything could make me hate the BCS more. I was wrong.

At least the current computer polls will be banished when a four-team playoff arrives in 2014. A selection committee similar to the group that determines the field for the NCAA men’s basketball tournament will pick the four teams. I only hope their debates will be aided by better algorithms for ranking teams.

Thanks for reading.

Thanks for the interesting article. I appreciate the analysis of Colley’s system. I’ll take mild exception to a couple of points.

First, it’s a bit of hyperbole to call this the “Shocking Truth”. Colley’s algorithm is at least publicly available and well understood. And (to my knowledge, anyway) he has never claimed any particular superiority for it. But to the extent that most people don’t understand the implications, I guess this is “shocking.”

Second, you should be cautious in deriding ranking systems as “black boxes.” Few people (yourself included from what I can see) publish their ranking algorithms in enough detail to allow others to re-create them. Likewise, whether or not Billingsley is a mathematician is largely irrelevant; what matters (it seems to me) is how well his algorithm performs.

Scott,

Thanks for the comments. I have no problem that you don’t find this finding “Shocking”. SI didn’t put that word in their title either. It’s just my opinion, and I’m perfectly happy if others feel a milder form of discomfort over this analysis. Also, it is true that I don’t reveal all the details of my ranking algorithm. However, my system is not helping to determine the national champion in college football. Moreover, I don’t think my algorithm should be a part of any official selection procedure unless I was ready to reveal those details. Of course, this is not a problem that I have

Ed

On Twitter, I got into a discussion with Brian Fremeau of Football Outsiders and Matthew Smith of cfn.scout.com about how Colley’s method ignores the identity of the teams a team loses to. While Brian feels the same outrage at this omission of information, Matthew is not convinced. He does not accept the premise that, for a one-loss team, losing to a cupcake is worse than losing to a top team. He also brings up the good point that there is no data to back up my assertion that this matters.

I completely agree that this argument needs to be supported by data. When I get around to checking the accuracy of algorithms over bowl games, we will finally have some data to look at. Because of the small number of bowl games, I don’t think the results will be conclusive. Colley’s method probably will not be as accurate as other non-margin-of-victory systems (e.g. Wolfe’s rankings). But the margin will probably not be large enough to say that with any statistical certainty. However, I do think there will be a large discrepancy between Colley and a good algorithm that uses margin of victory. We shall see.

Ed, are you a part of Bill Connelly’s “footballs study hall” google group? I posted a few more comments about this topic there as well.

I’m of the opinion, like Ed, that the specific results do matter and that beating a good team and losing to a poor one is not the same as beating the poor team and losing to the good one. Not precisely the same, anyway. As Matthew Smith pointed out, and I have discussed at FO, my FEI rating system includes a ‘relevance’ factor as part of the opponent-adjustment. Generally speaking, this means that games against teams that are of similar strength receive more weight in my formula, but I also add more weight to bad losses.

Brian, yes, I’m on that email list. In fact, anyone interested in crunching college football numbers should get on that list. Find Bill on Twitter @SBN_BillC.

Brian also pointed out an interesting article by Peter Keating in the Dec 24th issue of ESPN Magazine on the Simple Rating System. In Brian’s words:

“In the current issue of ESPN the Magazine, Peter Keating wrote a college football bit about the SRS (simple rating system), trumping it up as the best rating system in existence. It’s absurd to consider any individual system the best without a lot of data to back it up, and Keating’s column is pretty thin in that regard. Aside from the hyperbole, the article did get me to thinking about a key aspect to SRS that Keating valued — the output is represented as the “points better than average” of each team. i.e., Alabama is 30 points better than average team, Oregon is 28 points better, Notre Dame is only 22 points better, etc.

Most systems represent their data this way. Maybe not in terms of points, but in terms of relationship to average. FEI is represented this way. The thing that jumped out at me from Keating’s article was that we take the transitive property for granted when we represent our data this way. Alabama is 30 points better than average, Notre Dame is 22 points better than average, ergo, Alabama is 8 points better than Notre Dame. It makes intuitive sense to draw that conclusion, but I wonder if that’s actually what the numbers themselves mean. Does a great team’s relationship to an average team have a linear relationship to its relationship to another great team?”

Great questions…

My personal thought is that who specifically you beat/lost to, with schedule held as a given, can help indicate consistency, but I’m still skeptical that it’s a true indicator of quality. To reverse the example, let’s say that (ignoring scores, just using W/L for simplicity) Western Kentucky (still 7-5) had beaten Alabama but lost to FAU. Does this raise or lower your assessment of them?

How about UMass (1-11), if they’d gotten their win over Michigan instead of Akron? Does it raise their level of esteem to say “well, at least they’re capable of getting a REALLY big win even if everything else sucks”?

Personally I tend to decouple consistency from quality, in part because if you’re going to essentially reward consistency on the upper end (which seems to be your approach) then you probably need to punish it on the lower end… and I’m skeptical that either is really appropriate.

Thanks for your thoughts, Matthew. The UMass example is easy. They definitely get credit for beating Michigan and should move up. It’s the flip example of Alabama. As for Western Kentucky, or any average team, I think swapping losses here depends on the strength of the “good” and “bad” team. If you beat #1 and lose to #124 instead of beating #124 and losing to #1, that keeps you at #62. That’s my intuition on things.

Great article. It’s nice to see some thought going into the ranking problem. I wonder though, how much clearer a ratings system would be if it was averaged with other systems? Reading up on many articles on the subject, everyone seems to be supporting a single metric over another. Why not look for a few ‘good’ performers and then average them out? Just a thought.

Also, seeing the pdf files got me really excited to try out some math, especially the James P. Keener paper. Unfortunately my (lack of) intellect is a restraint…I’ve never formally studied linear algebra so I am stuck on populating the matrix for rj. He says that A = aij / ni, so for a league of just 2 teams and 1 game played I am expecting a 2 x 2 matrix. So I assume the entries then are ai / ni and aj / ni. so I will have 1 / 1 for the win, and 0 / 1 (assuming i won the game). So would the first matrix look like this? [1,1 ; 0,1]. ‘;’ is meant to be a line break. Any help on this would be wonderful, thanks.

Well, Travis, thanks for making public my plans for the offseason. Yes, I think averaging a few polls is a great idea. The machine learning community has a similar concept called ensemble learning.

http://en.wikipedia.org/wiki/Ensemble_learning

And if you look over at The Prediction Tracker, Todd has a system average predictor that does pretty well. And I bet he’s including some crummy ranking systems that don’t use margin of victory.
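As a rough sketch of what averaging could look like, the snippet below takes the mean rank of each team across several systems and sorts by it. The function name and the three toy "systems" are invented for illustration; this is not how The Prediction Tracker computes its average, and a real ensemble might weight systems by past accuracy instead.

```python
def average_ranking(systems):
    """Order teams by their mean rank across several ranking systems.

    systems: list of dicts, each mapping team -> rank (1 is best).
    Only teams ranked by every system are included; ties in mean rank
    are broken alphabetically.
    """
    teams = set.intersection(*(set(s) for s in systems))
    mean_rank = {t: sum(s[t] for s in systems) / len(systems) for t in teams}
    return sorted(mean_rank, key=lambda t: (mean_rank[t], t))


# Three hypothetical ranking systems over the same three teams.
systems = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
print(average_ranking(systems))
```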

For the example you’re looking at, I think there are problems with using the two team example. It will not be irreducible. However, you can still construct the matrix, and it looks like:

1 0

0 0
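The irreducibility problem can be checked numerically. The sketch below uses a standard test for non-negative matrices: an n x n matrix A is irreducible exactly when (I + A)^(n-1) has no zero entries. The function name is made up here, and the three-team cycle is a hypothetical example for contrast, not something taken from Keener's paper.

```python
import numpy as np


def is_irreducible(A):
    """True if the non-negative square matrix A is irreducible."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Only the zero/non-zero pattern matters, so work with a 0/1 matrix.
    pattern = np.eye(n) + (A > 0)
    reach = np.linalg.matrix_power(pattern, n - 1)
    return bool(np.all(reach > 0))


two_team = [[1, 0],
            [0, 0]]          # the two-team matrix above
three_cycle = [[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]]    # hypothetical cycle: each team beat exactly one other

print(is_irreducible(two_team), is_irreducible(three_cycle))
```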

Hope this helps.

Glad you will take a look at it. I will be sure to pop in for the future post.

I just read up on the Perron-Frobenius theorem, and two teams would not be a good idea; can’t have a winless team either, it seems. Think I will read up on it more before I start. Thanks for the feedback.

Hi Ed, Nice article (the ‘shocking’ title got my attention). I’m pretty sure the substance (math) of your argument is not correct. Who a team wins and loses to is absolutely important to solving for the Colley equations. You can try it right now (October 1 2013) with Georgia and their win over North Texas and loss to Clemson. Go to the add-remove games at will and add “Georgia beats Clemson” and “North Texas beats Georgia”, then remove “Clemson beats Georgia” and “Georgia beats North Texas”. Georgia’s rank goes from 8 to 7 in that scenario. Furthermore, the rating of *every* single team changes. With the Georgia-Clemson-N Texas scenario above, Oregon State goes from 0.69094 to 0.69027.

Now, in your specific example about Alabama in 2012, sure, you give Alabama over Texas A&M and Florida Atlantic over Alabama. Yes, the overall rank of Alabama may stay #1 in that scenario, but the rating does change (as you point out). So yes, the formula does consider who a team lost to / beat and produces a result. Furthermore, if Alabama had beaten Texas A&M and lost to Florida Atlantic in 2012, where would you (and human voters for that matter) rate Alabama? Look it over, and you will probably still rate them #1. After all, they would have in that scenario completed a season with wins over Georgia, LSU, Texas A&M and Michigan, a total of 4 top 10 teams. Would you put that resume behind, say, a Kansas State, with only 1 comparable top 10 win (Oklahoma) and a pretty bad loss to Baylor?

Of all the computer based systems for ranking college football, I am fairly convinced that CM is the best one out there. I do wish that he would not complicate things by adding the FCS teams into the equation, but I understand why he has to do it since it is a computer that has consequences and not just an academic exercise. It would be very nice if the other computers make their formulas public and we could play out a season in September and know with 100% certainty what the computers will say in December. That transparency would have helped the system.

Neville, thanks for the well thought out comment.

However, the math is correct. The Colley Matrix does not consider who a team loses to.

Now how does this manifest itself? Your Georgia with Clemson and North Texas example is correct. Everyone’s rating changes since the **records** of Clemson and North Texas change. I address this in the section “perfect example of anti-science”. The rankings are invariant to a different changing of game results (see the 3 team cyclic permutation).

But at the end of the day, the math behind Colley only depends on the records of each team and the identity of a team’s opponents. There is no input for which team a team loses to.

This is true, but I don’t think is as large a problem as you think it is. If a team loses a game in an upset, the only way for them to attain the record that they “should” get is to perform an upset themselves, which creates a kind of symmetry. I like this system because of its simplicity, and I think the argument about its flaws boils down to the question should similar teams, one losing the games it should and winning the games it should, and the other pulling and falling victim to upsets but still attaining the same record, be favored in the ranking?

Coleby, that’s a good question, and not one that mathematics can answer.

I like the elegance of the Colley ranking solution because it behaves largely as intended. It is a system designed to add strength of schedule to the win/loss rating. Because it discards all data except for wins and losses, it’s not very good at predicting results. If the goal is to give BCS berths to teams who “earn” it based on their win/loss record, this system is one of the best.

I appreciate systems that rank teams based strictly on performance against common opponents. I have seen a lot of systems using score differentials in a linear or logarithmic scale, but most of them seem to use some arbitrary normalization factor. I’m also not sure that using a linear or logarithmic best-fit for score differential makes sense. The game is played differently year to year, and differently throughout the year.

For example, let’s say Team A and Team B both play Team C. Based on the final scores of these games, what is the probability that Team A wins against Team B? It would be interesting to see some charts relating scores against common opponents to probability of winning head-to-head.

Another interesting thing would be to see how probability of winning changes based on how far apart these common opponent games take place. Location of games could be factored in as well. Probabilities could be combined into a ranking algorithm that hopefully doesn’t have the deficiencies of Colley’s. I know that there are similar systems that try to predict final scores, but I prefer percentage-based rankings based strictly on who would be expected to win.

Err… perhaps you have heard of such a system?

Ed,

Came across the article after doing a bunch of research on the Colley Matrix method. I was looking for some details on what the rating actually means (e.g. what does 0.79 mean?) and saw the shocker title. After reading it I feel like you must have skimmed through the Colley Matrix method and are missing some major points mostly around Strength of Schedule. I find it sad that you missed such an important part to the method and somehow got your article published.

I agree with Neville that the CM does consider SOS, and I agree with the comments in general that it is the best algorithm that focuses only on W, L, and SOS, which is what a computer poll was supposed to focus on. I will also say that considering other factors like margin of victory is useful, but the poll was told not to use that in the now defunct BCS ratings.

First, to explain how the poll uses SOS you have to understand the math. Colley shows in his detailed explanation how his initial math considers who you beat and then walks the reader through how that initial math must be iterated on and finally how the matrix method automates the iteration work. Yes, the matrix itself doesn’t take a list of opponents as a direct input but that’s only because the opponent is applied indirectly within the matrix. It’s all a part of network theory and is why the method is so genius. The SOS is there without having to be directly inserted. If you know the math, the matrix can be decomposed in a way that just by you telling me how many wins each team has and the number of games each team played there is only one possible network of games that contribute to that final matrix. Because of that math it’s pointless to enter specific opponents because it is implied.

Second, your Oregon, Stanford, and Washington example explains why human intuition is no match for good math. You seem to miss the significance of changing 2012 Oregon’s loss to a win over a real good Stanford team. You said “It just doesn’t make sense” that the Washington loss doesn’t hurt them but you forgot to mention the Stanford win. Let me try and help – with Colley Matrix beating a really good team has the exact opposite effect of losing to a really bad team which actually does make sense if SOS matters. Additionally, beating a really bad team has the opposite effect of losing to a really good team – that’s to say not a big deal. Now if a team like Oregon beats a really bad Washington team and loses to a really good Stanford team then for those 2 games Oregon is ok. Similarly if they beat a really good Stanford team and lose to a really bad Washington team they are still ok. We shouldn’t expect the rating to change as one result change had a positive impact and another result change had a negative impact and they cancel out. It actually does make sense and this is the symmetry that Coleby spoke to.

Finally the best way to check if SOS is a factor is to just add a game to anyone’s schedule and have them beat a good team and then compare that to if they had beat a bad team instead. If that’s the only change you make you will see that the rating change is better when they beat the good team. That in and of itself proves that through the network theory foundation of the Colley Matrix that SOS is considered.

Let me know if you have any questions or comments on my thoughts. I realize this article is old but just felt I had to say something so that more people aren’t misled.

Oh and for fun, I think I figured out what the CM rating actually means by doing a lot of Monte Carlo simulations. I may be wrong but based on simulations, a rating of 0.79 means that team has a 79% chance of beating the perfectly average team. Furthermore, if Team A has a 0.79 rating and Team B has a 0.69 rating then Team A has a 60% chance of winning (0.5 + 0.79-0.69). Lastly, if the gap between two teams is more than 0.5 it is assumed that the better team has a >100% chance of winning and what is surprising but true (this is a bigger shocker to me) is that if the gap between two teams is greater than 0.5 and the better team wins as expected then their rating will drop. After pondering on this test it makes sense though and is basically saying the teams are so far apart in skill they shouldn’t be playing and the CM actually penalizes the better team for such a poor SOS choice even in the win.

Ryan,

Thanks for the comments. There’s nothing misleading in the article. Colley only requires a record and list of opponents in its calculation. It doesn’t care which teams a team loses to. If you don’t think that’s important, that’s fine by me.

The biggest problem is that Colley doesn’t consider margin of victory, and hence it’s not a very good predictive model. However, that no longer matters with the death of the BCS.

Ed,

Thanks for the reply. I guess we disagree on whether who a team loses to is inferred by the algorithm. My point is that the matrix doesn’t require direct entry of who a team loses to for it to be implied in the matrix equations. Who a team loses to is important and it does matter to the matrix. By providing who a team has played and each team’s overall record, the matrix automatically deduces who the team lost to using network principles. That’s what makes it a beautiful algorithm in my opinion.

For the margin of victory, I agree it would make the algorithm better to consider it, but Colley was told directly not to use it so he purposely did not. I don’t think it’s a fault of the model since it was part of the constraints provided by the BCS.

Finally, glad the BCS is dead so that college football can be much more exciting. The playoff should be very engaging and the selection committee will surely create more debate than you have here. I do enjoy researching the Colley method as I may use it in some other work I do so I appreciate you allowing me to challenge some concerns on your site.

Thanks for the conversation.

-Ryan

What I see after reviewing Colley’s math is that Colley uses a comparison of overall schedule strength, not the individual games themselves. What I see is that Colley uses a team’s record and the success of the other teams played. Let’s say that you set up the results of the season in a matrix form using Colley’s method, the inputs into the matrix are as follows: Cii = total games played by team i plus 2, Cij = minus the total games played between teams i and j, and bi = (number of wins by team i – number of losses by team i)/2 + 1. Therefore, the only factors used are the games that are played, wins, and losses, not a comparison of individual games. Keep in mind that my mathematics education is limited to a minor in mathematics, so there might be some misunderstanding on my part.

One thing I don’t understand is why don’t these guys use an efficiency based model (like Brian Burke’s over at AdvancedNFLStats)? Efficiency models don’t need margin of victory since they don’t take W/L into consideration at all. SOS is still definitely needed though. Efficiency models are super easy to set up as well. I put one together in a couple of days a few weeks ago. (By the way, my model shows Washington is anything but average).

I’m all in favor of efficiency models. In fact, I think my offense and defense rankings by yards per play define this site. Check out the links at the bottom of this page.

Well my question was, why don’t any of the BCS guys use an efficiency model?

LOL, that’s way too intelligent for those guys. Plus, there’s a huge barrier in getting the data. It’s not as easy as getting final scores, which are almost ubiquitous.

I disagree with both of the key points in this article:

1. Strength of schedule: The author does not provide a solid argument for why each outcome should be considered individually. The two 3-cycles of wins are considered for Stanford (S), Oregon (O), and Washington (W). The author claims that O should be ranked lower if losing to W instead of losing to S. However, this case implies that O beat S, which adds to O’s strength of victory. The only argument provided by the author is “[Otherwise], it just doesn’t make sense.” This offers no insight for the case of assigning greater negative weight to losses versus bad teams than the positive weight assigned to wins versus good teams.

2. Points consideration: There is nothing “shocking” about Colley not considering points scored/allowed; it simply wasn’t allowed. Regardless, Colley’s method offers an analogue for ranking teams by points. Simply consider points for as wins and points allowed as losses.

Let PF(i,j) denote the points scored by team i in games against team j, and let TPF(i) be the total points scored by team i over the entire season. Define PA(i,j) and TPA(i) similarly for points allowed. The analogous Colley matrix C is defined by

C(i,i) = 2 + TPF(i) + TPA(i)

C(i,j) = – (PF(i,j) + PA(i,j))

and the vector b is given by

b(i) = 1 + (TPF(i) – TPA(i)) / 2

Solve Cr = b to find the points-based ratings r.
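A minimal NumPy sketch of this points-based analogue follows. It assumes PA(i,j) = PF(j,i), meaning the points team i allowed against team j are the points team j scored against team i. The function name and the three-team score totals are made up for illustration.

```python
import numpy as np


def points_colley(PF):
    """Points-based Colley ratings from a matrix of points scored.

    PF: n x n array; PF[i][j] is the total points team i scored against
        team j over the season (zero diagonal).
    """
    PF = np.asarray(PF, dtype=float)
    PA = PF.T                         # assumption: PA(i,j) = PF(j,i)
    TPF = PF.sum(axis=1)              # total points for
    TPA = PA.sum(axis=1)              # total points allowed
    C = -(PF + PA)                    # C(i,j) = -(PF(i,j) + PA(i,j))
    np.fill_diagonal(C, 2.0 + TPF + TPA)
    b = 1.0 + (TPF - TPA) / 2.0
    return np.linalg.solve(C, b)


# Hypothetical season totals: e.g. team 0 scored 30 on team 1 and allowed 10.
PF = [[0, 30, 20],
      [10, 0, 14],
      [7, 21, 0]]
r = points_colley(PF)
print(r)
```

As with the win-loss version, the ratings average 0.5, since every point scored by one team is a point allowed by another.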

The same idea can be applied to yards, turnovers, etc. It can even be applied to non-integer data, as long as the data is non-negative.

The author is correct in that Colley’s matrix is limited, but that is a result of BCS rules. The method simply wasn’t designed for predicting individual games; it was designed to measure the whole of each team’s season. The core concept of Colley’s approach is that he considers all depths of strength of schedule: strength of opponents, strength of opponents’ opponents, and so on. That is an important feature, a mathematically sound feature, and it should be utilized in other ranking methods by way of the analogues noted above.