More about the Ratings

or
 
Why These Ratings Are The Best There Are And You Might As Well Forget All Those Other Ratings


Well, not really, but of course, I do calculate and present these ratings by choice, and it's not because I think any old numbers I throw up here are worth something. I think there's something special about this system.

History

I began trying my hand at calculating ratings right before one year's NCAA Division I Men's Basketball tournament, giving in to a long-standing whim of mine to see what statistics could come up with, in comparison and contrast to the well-known college basketball polls. (Of course, quite possibly there was an NCAA pool staring me in the face at the time. For fun, of course.) I remembered from college that "least squares" is how to come up with fits that minimize errors, and thought I'd apply this valuable knowledge to sports rating. But what "error" do you minimize? And what does that even mean? Exactly what is it that you must "fit"?

What I came up with was this: when a game is played, you expect the winner's rating to be higher than the loser's, and to the degree that it isn't, you have an error, one that you can quantify. In fact, you could say a win suggests a certain expected margin between the ratings of the winner and the loser. I decided to assume such a margin, and since I had yet to calibrate these ratings-in-the-making, I called that margin "1" and proceeded from there. (It could have been called "10" or "100"; the choice is entirely arbitrary, and the resulting calculations are equivalent.) The idea is that a rating can be tested by seeing what the sum of the squares of these errors is when you try to "fit" the rating to the game results. And a rating can be found that minimizes this sum.

(A "zero margin" is an interesting idea: that only games in which the higher rated team loses to the lower rated team are considered to have "errors". However, such errors are easily minimized by giving all the teams exactly the same rating. A huge first-place tie is mathematically consistent, but not overly interesting. And if instead, we adopt a fixed minimal finite margin, we might as well call it "1".)

Doing a little bit of algebra, I discovered that the sum of the squares of the errors is minimized when a team's rating fits a pretty simple relationship with those of its opponents. Given that no error is seen if the winner's rating exceeds the loser's by my chosen margin, any games that fit that criterion have no effect on the ratings. If those games hadn't been played, the ratings could be the same. In other words, if there is a large enough differential in the ratings, and the game goes as expected, the game hasn't told you anything about the teams you didn't already know.

Once you have set aside those game-results (which we might term "obvious"), and look at the games that remain (the "interesting" or "relevant" game-results), the team's rating has a simple relationship with those of its opponents, one that thankfully does not require squaring numbers to calculate. For each such game you won, take the opponent's rating plus one; for each such game you lost, take the opponent's rating minus one; average all those numbers, and that should be your rating.
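In Python, one pass of that relationship might look like this sketch (my own illustration; the function and variable names are hypothetical):

    def updated_rating(team, ratings, games, margin=1.0):
        # average of (opponent's rating + 1) over relevant wins and
        # (opponent's rating - 1) over relevant losses
        terms = []
        for winner, loser in games:
            if ratings[winner] - ratings[loser] >= margin:
                continue  # "obvious" result: set aside
            if winner == team:
                terms.append(ratings[loser] + margin)
            elif loser == team:
                terms.append(ratings[winner] - margin)
        return sum(terms) / len(terms) if terms else ratings[team]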

Long after I got this far with a system, my friend Darryl Marsee pointed out that this was David Wilson's system. David took a number other than "1" to be what I might term the "expected ratings margin," but my system was exactly equivalent to David's. Both David and I have added tweaks to this basic system, which I shall describe, and others have "invented" the same system as well.

But back to my discoveries: I made the calculations and compared my results to what the polls were saying at the end of that regular season of NCAA Division I Men's basketball. The similarity between my new ratings and the polls was striking to me, given that the rankings were derived from such wildly different methods; in fact, my first impression was that poll voters must be looking at computer ratings. (While some dwell on the differences between the results of formula rankings and human polls, I marvel that two such disparate methods come up with something so similar, often the same number one team, and so on.) However, despite the similarity, there were also striking differences, teams here and there that were ranked radically differently.

There were also some issues with the formulation. First of all, early in a season, when few games have been played, the types of calculations one generally uses to find the ratings are "unstable". I used "damping" in my calculations to help with that, probably very naively: each time I recalculated a team's rating, I knocked off a percentage, e.g. 5%, of the difference between the rating and the average of all ratings. In other words, I pushed all the ratings just a little toward the average.
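As a sketch (the 5% figure being just the example mentioned above), the damping step amounted to something like:

    DAMPING = 0.05  # e.g. 5%

    def damp(new_rating, ratings):
        # push the freshly computed rating a little toward the mean
        mean = sum(ratings.values()) / len(ratings)
        return new_rating - DAMPING * (new_rating - mean)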

This also helped with another issue with David's and my formulation, which is illustrated by undefeated teams. If you have a group of teams that have all played, won, and lost games to each other, very often the rating I described is unambiguous and calculating is just a means of finding it. However, if a team has won every game, its rating is not determined. As long as its rating is at least 1 more than that of the best team it has beaten, its rating has no effect on that of any other team, nor does that of any other team affect its rating. Thus, if the best team it has beaten has a rating of 2, its rating could be 3, or 3.1, or 1001: no value is too high. This reflects the real-life fact that you don't quite know how good a team is when it is undefeated.

A natural thing to do is to rate it at its minimum "proven" rating. In other words, if an undefeated team's best opponent had a rating of 2, then give that undefeated team a 3. There is an analogous problem for teams that haven't won a game, and similar problems affect any two groups of teams where one group has won all the games between them. This is the same solution David Wilson had come up with; in my own case, the "damping" in the calculations helped me produce such results.

However, this got me thinking about something I observed in the ratings. One curiosity was that you could have two undefeated teams, both with a rating of 3, both of which were there because they'd beaten teams rated "2" (which we'll assume is a pretty good rating). But it could be that one of these two undefeated teams had won four such games (i.e., against teams rated "2"), while the other had won only one. The system gave these two undefeated teams the very same rating, failing to reflect the fact that one of the two had beaten more of these "pretty good" teams than the other had.

Yet this difference between the teams, which the ratings were hiding but the game-results clearly held, could become visible in the ratings as more games are played. For example, if both of these two undefeated teams lost their next game, again let's say to opponents with identical ratings, the previously-undefeated team which had beaten only one "2" would be pulled down more than the other. Imagine a "3" that had beaten a bunch of "2"s (or teams close to "2"): its rating would be pulled down hardly at all by its first loss. If another had beaten one "2" along with a bunch of bad teams (e.g. "-1"s), its rating would drop like a stone with its first loss to a mediocre team. The problem is that this rating system gives these two undefeated teams equal ratings of "3", failing to reflect something about the teams' records that will come out as soon as they lose a game. Should these two teams' ratings be equal? I don't think you or I would call them equal if we were inspecting their records. We'd say: "Hey, Team A has beaten four top-twenty teams and Team B has beaten just one." We instinctively know there is value in looking at more than simply the rating of the best opponent your team has beaten.

So one inspiration I had was: let's carry out this "what if". Let's pretend all the teams have one more loss (or a first loss, if the team was undefeated), rate them according to the formula, and use that as a rating. It's "fair" because you apply it to all teams, and it shows the difference between the undefeated team that has had a bunch of quality wins versus the one that's had fewer.

Of course, for the bottom of the ratings, there is a perfectly analogous problem for teams that haven't won any games. So I also threw in a win for every team. In fact, I threw in a loss to a "phantom opponent" rated "1" and a win against another "phantom opponent" rated "-1".
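Here is a sketch of how the two phantom games might enter the averaging, subject to the same relevance rule as real games (my own illustration; the next paragraph explains why at most one of them counts for a given team):

    def phantom_terms(team_rating, margin=1.0):
        # a loss to a phantom rated +1 and a win over a phantom rated -1,
        # each subject to the same "set aside if obvious" rule
        terms = []
        if 1.0 - team_rating < margin:    # the phantom loss is relevant
            terms.append(1.0 - margin)    # phantom's rating minus margin
        if team_rating + 1.0 < margin:    # the phantom win is relevant
            terms.append(-1.0 + margin)   # phantom's rating plus margin
        return terms

Note that with the margin at 1, a relevant phantom game contributes a zero-valued term while adding one to the game count, which is where the "1 +" in the denominator below comes from.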

The interesting thing about the experiment is that the resulting rankings much more reflected conventional wisdom, as reflected in the polls. Initially, I calculated this as a variation on my rankings, but soon I simply adopted this revision.

From a calculation point of view, only the "phantom win" or the "phantom loss" affects a specific team's rating, not both, because one or the other (or both, if your rating is exactly "0") is out of range for you. If your rating is high, say "2" or more, then a win against a "-1" team means nothing; in fact, we set it aside, as described above. And vice versa for losses.

Eventually (years later, because I'm not always too quick on these things), I realized that the rating equation for a team actually matched its rating to this formula:


( SUM OF RATINGS OF EACH GAME'S OPPONENT + WINS - LOSSES )
-----------------------------------------------------------
                  ( 1 + NUMBER OF GAMES )

provided all teams' ratings average to 0, and where we are only talking about the wins, losses, opponents, and games that "count", i.e., we've set aside games where the winner's rating was 1 or more greater than the loser's. I also removed the "damping" from my calculations, because with the 1 added to the number of games in the denominator, problems with indeterminate ratings and unstable calculations arise in fewer situations.
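Putting the pieces together, here is a minimal sketch of the whole calculation as a fixed-point iteration (Python; the names are my own, and real code would want a convergence test rather than a fixed iteration count):

    def solve_ratings(teams, games, margin=1.0, iterations=500):
        # iterate T = (sum of relevant opponents' ratings + W - L) / (1 + G),
        # normalizing each pass so that all ratings average to 0
        ratings = {t: 0.0 for t in teams}
        for _ in range(iterations):
            new = {}
            for team in teams:
                total, wins, losses = 0.0, 0, 0
                for winner, loser in games:
                    if team not in (winner, loser):
                        continue
                    if ratings[winner] - ratings[loser] >= margin:
                        continue  # "obvious" result: set aside
                    if winner == team:
                        wins += 1
                        total += ratings[loser]
                    else:
                        losses += 1
                        total += ratings[winner]
                new[team] = (total + wins - losses) / (1 + wins + losses)
            mean = sum(new.values()) / len(new)
            ratings = {t: r - mean for t, r in new.items()}
        return ratings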

Testing this rating system

By the test of conventionality, i.e., how close the results are to generally-accepted rankings, I observed (with the help of Kenneth Massey's comparison pages) that this system did reasonably well when compared to other systems that use only wins and losses (e.g. the NCAA's RPI), but that a considerable number of the systems that use game scores, dates of games, home/away information, etc., did "better". Sometimes my system was at the top of the heap among the W/L systems, but if my head grew too large, another season would come along (or even just one more week of games) and it would be at the bottom. A typical season doesn't have enough data to really keep the effects of mere happenstance out of such ratings, and many W/L systems were close enough to each other that there was not going to be one that continually stayed at the top of the heap by this criterion.

But I also used another test on these ratings. Besides rating teams, I do "comparison correlations" between early-week rankings and late-week rankings in seasons for lots of folks' ratings/rankings systems. I get the data to do this from Kenneth Massey's Ranking Comparison pages. Among other things, Kenneth produces a weekly "mean-derived" ranking (which I referred to as "a generally-accepted ranking" above), a ranking based upon the mean of all the rankings he displays. This mean-derived ranking is a relatively conventional "consensus" ranking for the time.

So, for my test, I picked a past season, took Kenneth's mean-derived ranking from the end of the season, and then ran my system using only early-season records to see how closely its results matched it. In a way, it was testing the system's ability to predict the future.
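For illustration, one way to compute such a correlation number is Spearman's rank correlation; this is just one plausible choice of statistic, sketched here with ties ignored for simplicity:

    def spearman(early, final):
        # early, final: the same teams, each listed in ranked order
        n = len(early)
        final_pos = {team: i for i, team in enumerate(final)}
        d2 = sum((i - final_pos[t]) ** 2 for i, t in enumerate(early))
        return 1 - 6 * d2 / (n * (n * n - 1))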

Such a test yields a correlation number, but what I was really after were ways to improve the system, so I tried variations to see how the corresponding correlation number was affected. I tried a lot of systems: variations on my basic formula, totally different systems that use game scores, systems applying functions such as logarithms and exponents, and other tweaks. Curiously, I found it quite easy to produce systems that compete with the best systems on Kenneth's comparison pages in terms of correlating to the current mean, but these systems did not predict future mean rankings very well at all, compared to some of the systems on Kenneth's page (such as Kenneth's own ratings/rankings).

Regarding my own W/L system, I tested a number of parameter adjustments and variations on the formula.

A result that is very interesting to me is that none of these changes made the system predict better. All of them made it worse, and the further I adjusted the parameters away from my existing formula, the worse the results were.

I was amazed. I also suspected that there are statisticians who would look at my formulation and tell me about some elementary standard method that I'd blindly stumbled into, and, yes, of course it would work better than variations on the theme. My calculation looked something like an average, but using (N+1) in the denominator instead of (N) just looked weird.

Another item I tested was a type of tweak that David Wilson used to stabilize calculations. David's variation was to not set aside some game results (in the case where the winner's rating was too far above the loser's), but to count such games at only a fraction the weight. Specifically, a game in which the winner's rating was above a fixed margin more than the loser's, was weighted only 1/20 as much in his averaging. This was actually a case where the winner was hurt by winning the game, but the rule applies to all teams, and such negative impacts were given a very small weight. I tried that system, along with variations using other weighting fractions, and my own observations were not what I expected. I expected to see very small weighting fractions match the "untweaked" David Wilson system pretty closely, and larger fractions to show greater difference. What I observed was far more erratic, slight changes in the fractions making big differences in the resulting rankings. Perhaps I made mistakes, or perhaps there's something chaotic about this class of tweaks.
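As a sketch of that class of tweak, as I understand David's description (names mine), "obvious" games are given a small weight rather than being set aside:

    def weighted_rating(team, ratings, games, margin=1.0, small=1/20):
        # "obvious" games count at a small weight instead of being set aside
        num, den = 0.0, 0.0
        for winner, loser in games:
            if team not in (winner, loser):
                continue
            obvious = ratings[winner] - ratings[loser] >= margin
            weight = small if obvious else 1.0
            if winner == team:
                num += weight * (ratings[loser] + margin)
            else:
                num += weight * (ratings[winner] - margin)
            den += weight
        return num / den if den else ratings[team]

You can see how a lopsided win hurts the winner here: the term (loser's rating + 1) may sit well below the winner's current rating, dragging the weighted average down slightly.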

Analysis

Lately it occurred to me that my system consists of an average multiplied by another factor, N/(N+1), where N is the number of (relevant) games. In other words, you "squeeze" a team's rating toward the average rating, and the fewer ("relevant") games it has played, the more you do so. Once again, this is something a human analyst might do: if a team is winning games, you have more confidence that it is good when that's over a lot of games than over just a few. With just a few, you figure there's an excellent chance that the team is closer to an average team than its current won/loss record indicates. With enough games under its belt, you believe what the wins and losses indicate.

(All this talk of games and won/loss is after setting aside games with "obvious" results, stronger teams beating teams weaker than themselves by more than a specific rating margin.)

To put it another way, you "begin" with the assumption that the team is average, then factor in what its wins and losses tell you, weighting this factor according to how many "non-obvious" games the team has played. With lots of such games, you believe what the wins and losses say. With just a few, you give the team some credit, but figure it is actually closer to average than that. You trust a record of 6-3 against top-twenty teams more than a record of 2-1 against such teams, even though the won/loss percentages are obviously identical. If all teams have played the same number of such "non-obvious" games, their ratings are all affected equally, and this weighting has no effect on the comparisons between teams. What this extra factor gives you is a way to compare teams where one has played a bunch of "non-obvious" games and the other only a few.

So here is how I see the system.

For a team, where:

T - is its rating.
W - is its number of "relevant" wins.
L - is its number of "relevant" losses.
G - is W+L
R - is the average of all its "relevant" games' opponents' ratings.
A - is the average of all teams' ratings. 

and "relevant" means the winner's rating was not more than 1 greater
than the loser's,

this equation holds:
                               W - L              G
( T - A ) =   ( ( R - A ) +  --------- )   x   --------
                                 G              1 + G

or if all ratings are "normalized" so that A = 0:

                      W - L         G
        T =   ( R +   ----- )  x  -----
                        G         1 + G

The initial part (without the final factor) is equivalent to David Wilson's (untweaked) system. The final factor, "G / ( 1+G )", is a factor that says "the fewer relevant games the team has played, the less you use this rating, and the more you consider the team to be average". This extra factor "fades away to 1" as more games are played, and its most visible role appears to be contributing to the determination of ratings of undefeated teams. However, as I pointed out, I first noted that including this factor makes the end-of-season ratings look more conventional.
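For a concrete illustration of the factor at work: a team that has gone 6-3 in relevant games against opponents averaging 0 gets ( 0 + 3/9 ) x 9/10 = 0.30, while a team that has gone 2-1 against the same caliber of opposition gets ( 0 + 1/3 ) x 3/4 = 0.25. The winning percentages are identical, but the larger body of evidence is squeezed less toward average.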

(Since I first wrote this, I finally did track down a reference for this kind of factor: Laplace's Rule of Succession. See the Wikipedia article on Pierre-Simon Laplace. From what I've read, there are experts who reject this rule as "incorrect", and I believe any controversy on this point must stem from the slippery relationship between the mathematics of probability and reality, especially as it relates to what you "know". Use of the rule implies there's something you know about what it is you are rating.)

So is this the best of all possible ratings for W/L systems? Is there a theoretical best? I'll let experts answer that.

Potential Improvements

And what could I use to improve my ratings? For football, I use Darryl Marsee's game-results data, and rate only the top divisions. I use the mixed-division games as follows: I rate the lower division opponents based only upon the games they played with the top division. My understanding is that the RPI (in the sports for which it is used) excludes games with lower-division teams.

Other data that various formula ratings use include game scores, dates, home/away data, and more lower-division games. The latter I could do without throwing away a W/L philosophy. Whether to incorporate the other data brings you into the "predictive versus retrodictive" debates. I sometimes go into a funk when attempting to fathom exactly what the term "retrodictive" implies, but undoubtedly people would place me in that camp. I've experimented with systems that use game scores, but, of course, they simply make your calculations wrong. They help quite a bit when insufficient games have been played to determine who's best, but as more games are incorporated, eventually the fact that some teams run up scores while others don't will tell, and make your ratings just plain wrong. And the longer the season, the more wrong they'll be: they converge to the wrong values. If only all teams would continue to play their best throughout every game, even through blowouts, then game scores could tell a true story.

I've experimented with incorporating home/away/neutral data. One thing I tried was to treat each team as two teams, a "home" team and an "away" team, and calculate a separate rating for each. But then I had two numbers for each team and didn't know how to come up with a single number. One idea was simply to average the two. Adopting such an averaged rating might "improve" the result by helping you compare a team that's played a ton of away games with another that's played a ton of home games. In theory. However, I gave up on the idea, since splitting the teams lowered the number of games per rated entity, making happenstance a bigger factor.

But for the time being, I'm seduced by simplicity. I just display what my fairly simple system produces, and it's been fun to watch how it fares. Lately, I've been intrigued with the KRACH system applied to College Hockey, that appears to have at least some parallels with mine, and has a long and storied history involving knowledgeable statisticians. (I'm amazed no one posts a KRACF.) The fact that my system totally ignores "obvious" game results doesn't strike me as correct: you could make the case that "obvious" game results should count (positively for the winner) a tiny amount (assuming you figured out the appropriate amount) and that there would be a "smooth" change in this weighting as you look at smaller and smaller differentials in a pair of teams' ratings. KRACH seems to have a way to do that. Perhaps my system isn't the world's best after all.

Something I've experimented with (but haven't put up on the web) is a kind of Strength Of Schedule measure based upon my ratings. The procedure (which could be applied to any ratings system) is this: for the purposes of calculating a specific team's SOS, use the actual game results except assume that team's games to be all wins. Then calculate all the ratings and see what that team's rating would be. Such numbers, calculated for each team in analogous fashion, would reveal the strength (but not the weakness!) of the schedules of each team. My inclination is to subtract 1 from the resulting number, since a team is expected to be 1 better than those it beats in my system. I haven't tried deriving teams' "weakness of schedule", very important for achieving a last place ranking, but the procedure would be analogous.
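In sketch form, reusing the solve_ratings sketch from earlier:

    def strength_of_schedule(team, teams, games):
        # re-run the ratings pretending this team won every game it played
        altered = [(team, winner) if loser == team else (winner, loser)
                   for winner, loser in games]
        # subtract 1, since a team is expected to rate 1 above those it beats
        return solve_ratings(teams, altered)[team] - 1.0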

But... how did this system I've come up with actually fare with NCAA pools? The first year that I used it to pick winners, I won an early round of the pool, and the second year, I won the final round. With that, I gained a local rep (amongst those who paid any attention) as a definite sports stats geek. But alas, it was all a fluke. Never has the system fared so well since.

John Wobus, 10/11/05

 


Wobus Sports: www.vaporia.com/sports