23 Jul 2012
Baseball and Eigenvectors
Win-loss records are the most common way to rank sports teams. They’re simple to compute, and occasionally lead to some dramatic season finishes. They don’t really measure the quality of a team, though. A win against a team that ends up losing 100 games should not be worth as much as a win against a team that wins 100. With that in mind, I was determined to find a better way.
My best guess was that some modified version of PageRank would work best. In short, each team gives up points to the teams that beat it, and receives points for the teams it beats. This flow of points-in-points-out eventually finds some stable state, and every team is left with some stable number of points. These points determine their rank.
I used the wonderful numpy package for all of the linear algebra involved. Rather than analytically solve for the eigenvector, I can use an iterative approach of repeatedly multiplying the matrix and a vector. Doing this programmatically is far easier.
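Here is a minimal sketch of that iterative approach. The function name, the tolerance, and the sample matrix are all made up for illustration and are not the code behind this post. It assumes a non-negative matrix where entry [i, j] counts how often team i beat team j.

```python
import numpy as np

def power_iteration(matrix, tol=1e-12, max_iter=10_000):
    """Approximate the dominant eigenvector of a non-negative square matrix."""
    n = matrix.shape[0]
    scores = np.full(n, 1.0 / n)        # start with every team rated equally
    for _ in range(max_iter):
        updated = matrix @ scores       # one round of points flowing between teams
        updated /= updated.sum()        # rescale so the scores always sum to 1
        if np.abs(updated - scores).max() < tol:
            return updated              # the scores have stopped changing
        scores = updated
    return scores

# Hypothetical data: entry [i, j] is the number of times team i beat team j.
example = np.array([[0.0, 4.0, 5.0],
                    [2.0, 0.0, 3.0],
                    [1.0, 3.0, 0.0]])
scores = power_iteration(example)
```

Rescaling on every pass keeps the numbers from blowing up, and it means the result can be read directly as each team's share of the total points.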
What’s the data look like?
Let’s use a few teams as our example dataset: the Milwaukee Brewers, the Arizona Diamondbacks, and the San Francisco Giants.
If we were to draw a grid where rows contain wins and columns contain losses, we’d get the following:
Using that as our matrix, we can find an eigenvector that ranks the teams. As expected, this ranks all three teams evenly.
MIL = 0.333 ARI = 0.333 SFG = 0.333
This format for the matrix (rows representing wins) eventually falls apart. Arizona and San Francisco are in the same division, so they will actually play 18 times. What might that matrix look like?
MIL = 0.22 ARI = 0.39 SFG = 0.39
This is a problem. In both of the above examples, the Giants and the Diamondbacks evenly split wins and losses. There should be no impact on the relative rankings just because they played more games against one another. It seems that a matrix based solely on wins will not work.
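Both sets of numbers can be reproduced under one assumption about the schedule: every pair splits its games evenly, with 6 games between teams in different divisions and 18 between Arizona and San Francisco. Those game counts are my assumption for this sketch, as is the helper name `rank`.

```python
import numpy as np

def rank(matrix, iterations=2000):
    """Repeatedly multiply and rescale until the scores settle."""
    scores = np.full(matrix.shape[0], 1.0 / matrix.shape[0])
    for _ in range(iterations):
        scores = matrix @ scores
        scores /= scores.sum()
    return scores

# Rows and columns are MIL, ARI, SFG; entry [i, j] is wins by team i over team j.
# Assumed schedule: every pair plays 6 games and splits them 3-3.
six_games_each = np.array([[0, 3, 3],
                           [3, 0, 3],
                           [3, 3, 0]], dtype=float)

# Assumed schedule: ARI and SFG play 18 games and split them 9-9.
division_heavy = np.array([[0, 3, 3],
                           [3, 0, 9],
                           [3, 9, 0]], dtype=float)

even = rank(six_games_each)     # roughly [0.333, 0.333, 0.333]
skewed = rank(division_heavy)   # roughly [0.22, 0.39, 0.39]
```

Nobody has a winning record in either matrix, yet Milwaukee drops simply because it played fewer games.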
In later research, I learned that this approach was used by Stanford professor Joseph Keller in 1978. He used the technique to rank the 1984 National League. This technique was only one of four that he presented in a 1993 paper on the topic, and he acknowledges that each approach has its own unintended consequences. Repeated games between the same opponents are a particular concern in baseball, and his original paper did not account for them. I think they deserve special accommodation in our approach.
If we want our matrix to be invariant based on the number of games played between any two teams, we should use a win-loss ratio as the value in each cell.
MIL = 0.333 ARI = 0.333 SFG = 0.333
Just as expected. This will be our model.
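As a sketch of that model, I'm assuming here that "win-loss ratio" means the fraction of games won against each opponent (wins divided by games played between the two teams). The helper names and the two schedules are hypothetical.

```python
import numpy as np

def win_fraction_matrix(wins):
    """Entry [i, j] becomes the share of i-vs-j games that team i won."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T                       # total games between each pair
    fractions = np.zeros_like(wins)
    np.divide(wins, games, out=fractions, where=games > 0)  # leave 0 where no games
    return fractions

def rank(matrix, iterations=2000):
    """Repeatedly multiply and rescale until the scores settle."""
    scores = np.full(matrix.shape[0], 1.0 / matrix.shape[0])
    for _ in range(iterations):
        scores = matrix @ scores
        scores /= scores.sum()
    return scores

# Rows and columns are MIL, ARI, SFG, with every pair splitting its games evenly.
six_games_each = [[0, 3, 3], [3, 0, 3], [3, 3, 0]]
division_heavy = [[0, 3, 3], [3, 0, 9], [3, 9, 0]]

balanced = rank(win_fraction_matrix(six_games_each))
still_balanced = rank(win_fraction_matrix(division_heavy))
```

Both schedules collapse to the same matrix of 0.5s, so the extra Arizona-San Francisco games no longer move anyone.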
Now with real data
I’ve turned the previous fractional scores into an expected win count, which is far more useful.
We rank each team by my calculated score, which represents the total number of wins a team of that quality could expect against all of its opponents. Beside that value are the true number of wins the team ended with and the difference between my calculated win count and the true win count. A green difference means they were a better team than their record indicates, and red means they were worse.
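One simple way to turn fractional scores into win counts is to scale them so they add up to the total number of wins available. I'm presenting this as a guess at a reasonable conversion, not necessarily the exact one behind the numbers here. The function name and the four-team league are hypothetical.

```python
import numpy as np

def expected_wins(scores, total_wins):
    """Scale fractional scores so they sum to the wins available in the league."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum() * total_wins

# Hypothetical 4-team league where each team plays 162 games, all within the league.
# Every game produces exactly one win, so there are 4 * 162 / 2 wins to hand out.
scores = [0.30, 0.27, 0.23, 0.20]
wins = expected_wins(scores, total_wins=4 * 162 / 2)
```

With this scaling, a team holding exactly an average share of the points lands on 81 wins.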
As it turns out, compared to rankings based on win-loss record, my method preserves the order of the top 5 teams, 4 of which would advance to the postseason in 2011. That’s probably a good thing. We do see some more interesting results clustered in the middle, where 3 teams are all within one game of “middling” (81 wins, 81 losses) but are shuffled relative to their win-loss rankings. The San Francisco Giants feature the biggest discrepancy between their record and their calculated win expectancy. Two big factors are the league-worst performance against the last-place Houston Astros (3/7) and a miserable showing against the Atlanta Braves (1/7).
Since the results are not radically different from win-loss ranking, this technique might be useful for predicting future performance.
Can it predict?
Let’s pick July 25th, 2011 as our reference point. Given all of the data up to and including July 25th, would this technique be able to predict winners well? What about “well” compared to win-loss record based predictions?
First, let’s dispel the idea that win-loss records are useful for prediction. Team schedules differ based on the division they’re in, and the timing of games is completely different (one team might have a really easy schedule in the month of August compared to another team).
Let’s take a look at the standings on July 25th.
If we were to extrapolate wins based on this data, Philadelphia, San Francisco, and Pittsburgh would all win their divisions. Atlanta would win the wild card. Of those teams, only Philadelphia actually made it into the postseason. Pittsburgh would collapse in historic fashion, San Francisco would fall out of the wild card chase, and Atlanta was eliminated on the last day of the season.
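For reference, extrapolating from a win-loss record just means assuming a team keeps winning at its current pace for all 162 games. A tiny sketch, with a made-up record and a hypothetical function name:

```python
def extrapolate_wins(wins, losses, season_games=162):
    """Project a full-season win total from the current winning percentage."""
    played = wins + losses
    return wins / played * season_games

# Hypothetical team sitting at 60-42 partway through the season.
projected = extrapolate_wins(60, 42)
```

Nothing in that calculation knows who the 60 wins came against, or who is left on the schedule.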
How does my eigenvector approach rank the teams?
Milwaukee, Philadelphia, and Arizona are all correctly predicted to be division winners. Only the New York Mets are out of place. Their high ranking is due to the fact that up until this point, they had won 100% of their games against Arizona. The next time the two teams met, Arizona won all three games, and the Mets’ predicted outcome dropped to 86 wins.
Another anomaly here is the Florida Marlins. They’d suffered a disastrous 5-win, 23-loss June, but still managed to rank relatively high in this list. The only thing I can imagine is that they’re being lifted by the presence of the Mets in the top four (against whom they’d won 60% of their games at this point of the season). A potential improvement to my approach would be to add Laplace smoothing for matchups that have not registered some threshold number of games.
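Here is a rough sketch of what that smoothing could look like, assuming each matrix cell holds the fraction of games won against an opponent. The function name, the `alpha` parameter, and the two-team example are all hypothetical. For simplicity it smooths every matchup instead of only those below a threshold.

```python
import numpy as np

def smoothed_win_fractions(wins, alpha=1.0):
    """Add `alpha` imaginary wins to both sides of every matchup."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T                        # real games played between each pair
    fractions = (wins + alpha) / (games + 2 * alpha)
    np.fill_diagonal(fractions, 0.0)             # a team never plays itself
    return fractions

# Hypothetical: team 0 has swept its only three games against team 1.
wins = np.array([[0, 3],
                 [0, 0]])
smoothed = smoothed_win_fractions(wins)
# The 3-0 sweep counts as (3 + 1) / (3 + 2) = 0.8 instead of a perfect 1.0.
```

The imaginary games matter a lot after three meetings and very little after eighteen, which is the behavior we want. One side effect to be aware of: a pair that has not played at all is treated as an even 0.5 split.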
The July 25th date is actually significant, in that it is the last day before the Pittsburgh Pirates fell apart in the second half of the season. It seems that this outcome wasn’t totally out of character for the team. Even though they were in first place in their division, according to my approach’s rankings, they were still performing at the level of a team that would lose more games than it would win.
All five NL West teams (Arizona, Los Angeles, San Francisco, San Diego, Colorado) are predicted to within 2 games of their season’s end scores.
This model is far from perfect, but it’s certainly an improvement over what has been the norm for over a century. Let’s see how it fares against current data.
2012 season predictions
Let’s use the data available as of the night of July 22nd, 2012. This is slightly less data than the 2011 predictions had, but let’s give it a shot.
So, what does this tell us? Atlanta is a 100 win team (they’re currently 5th overall in the league). The Washington Nationals, despite having the best record that night, will not advance to the playoffs. The Los Angeles Dodgers will barely take the division over the San Francisco Giants, who end up with the first wild card spot. The Cincinnati Reds will win the NL Central, and the Pittsburgh Pirates will be the first team in MLB history to win the second wild card spot.
As an aside, this algorithm really hates the Padres, and predicts that they’ll only win eight more of their remaining 65 games. I don’t get it.
Let’s check on this in a couple months to see if it’s at all correct.