23 Jul 2012

Baseball and Eigenvectors

Win-loss records are the most common way to rank sports teams. They’re simple to compute, and they occasionally lead to some dramatic season finishes. They don’t really measure the quality of a team, though: a win against a team that ends up losing 100 games should not be worth as much as a win against a team that wins 100. With that in mind, I was determined to find a better way.

My best guess was that some modified version of PageRank would work best. In short, each team gives up points to the teams that beat it and receives points from the teams it beats. This flow of points in and out eventually reaches a stable state, leaving every team with a steady share of points. Those points determine the rankings.

I used the wonderful numpy package for all of the linear algebra involved. Rather than solve for the eigenvector analytically, I use an iterative approach: repeatedly multiply the matrix by a vector until the result stops changing. Doing this programmatically is far easier.
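
A minimal sketch of that iteration in numpy (the function name and iteration count here are illustrative, not the original script):

    import numpy as np

    def rank_teams(matrix, iterations=1000):
        # Power iteration: repeatedly multiply the matrix by a score vector.
        # For a non-negative matrix this converges to the dominant
        # eigenvector, which serves as the ranking.
        scores = np.ones(matrix.shape[0])
        for _ in range(iterations):
            scores = matrix.dot(scores)
            scores /= scores.sum()  # normalize so the scores sum to 1
        return scores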

What’s the data look like?

Let’s use a few teams as our example dataset: the Milwaukee Brewers, the Arizona Diamondbacks, and the San Francisco Giants.

If we were to draw a grid where each cell holds the number of times the row team beat the column team, we’d get the following:

      MIL  ARI  SFG
MIL    0    3    3
ARI    3    0    3
SFG    3    3    0

Using that as our matrix, we can find an eigenvector that ranks the teams. As expected, this ranks all three teams evenly.

MIL = 0.333
ARI = 0.333
SFG = 0.333
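
Running the matrix above through the rank_teams sketch reproduces that even split:

    wins = np.array([[0., 3., 3.],
                     [3., 0., 3.],
                     [3., 3., 0.]])

    print(rank_teams(wins))  # roughly [0.333, 0.333, 0.333]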

This format for the matrix (rows representing wins) eventually falls apart. Arizona and San Francisco are in the same division, so they will actually play 18 times. What might that matrix look like?

      MIL  ARI  SFG
MIL    0    3    3
ARI    3    0    9
SFG    3    9    0

MIL = 0.22
ARI = 0.39
SFG = 0.39

This is a problem. In both of the above examples, the Giants and the Diamondbacks evenly split wins and losses. There should be no impact on the relative rankings just because they played more games against one another. It seems that a matrix based solely on wins will not work.

In later research, I learned that Stanford professor Joseph Keller had used this approach as far back as 1978, and that he later applied it to rank the 1984 National League. It was only one of four techniques he presented in a 1993 paper on the topic, and he acknowledges that each has its own unintended consequences. The problem of opponents playing different numbers of games against one another is especially pronounced in baseball; it was not factored into his original paper, and I think it deserves special accommodation in our approach.

If we want our matrix to be invariant to the number of games played between any two teams, we should use the fraction of games won (wins divided by games played between the pair) as the value in each cell.

      MIL   ARI    SFG
MIL    0    3/6    3/6
ARI   3/6    0    9/18
SFG   3/6  9/18     0

MIL = 0.333
ARI = 0.333
SFG = 0.333

Just as expected. This will be our model.
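
Continuing the earlier sketch, here is one way the ratio matrix might be built from raw head-to-head win counts (the helper name is my own):

    def ratio_matrix(wins):
        # Each cell becomes the row team's share of the games it played
        # against the column team, so the number of games played drops out.
        n = wins.shape[0]
        ratios = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                games = wins[i, j] + wins[j, i]
                if i != j and games > 0:
                    ratios[i, j] = wins[i, j] / games
        return ratios

    wins = np.array([[0., 3., 3.],
                     [3., 0., 9.],
                     [3., 9., 0.]])

    print(rank_teams(ratio_matrix(wins)))  # roughly [0.333, 0.333, 0.333]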

Now with real data

team   score   real   diff
PHI    104.6   102    +2.6
MIL     95.8    96     -0.2
ARI     92.1    94     -1.9
STL     90.9    90     +0.9
ATL     88.5    89     -0.5
CIN     81.5    79     +2.5
WSN     81.2    80     +1.2
LAD     81.2    82     -0.8
SFG     79.8    86     -6.2
NYM     78.2    77     +1.2
SDP     75.4    71     +4.4
CHC     75.2    71     +4.2
COL     74.3    73     +1.3
FLA     71.9    72     -0.1
PIT     67.2    72     -4.8
HOU     58.1    59     -0.9

I’ve turned the previous fractional scores into an expected win count, which is far more useful.

Each team is ranked by my calculated score, which represents the total number of games a team of their quality could expect to win against all of their opponents. Beside that value is the true number of wins the team ended with, and the difference between the two. A positive difference means the team was better than its record indicates; a negative one means it was worse.
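
Roughly speaking, the conversion rescales the eigenvector: normalize it to sum to one, then multiply by half the league’s total team-games (16 teams × 162 games ÷ 2 = 1296) so that expected wins across the league add up to the number of games played. A sketch of that scaling (the details here are my reconstruction, not pulled straight from the original script):

    def expected_wins(scores, n_teams=16, games_per_team=162):
        # Assumes every game is decided inside the league, which ignores
        # interleague play but is close enough for a rough win count.
        total_games = n_teams * games_per_team / 2.0
        return scores / scores.sum() * total_games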

As it turns out, compared to rankings based on win-loss record, my method preserves the order of the top five teams, four of which would advance to the postseason in 2011. That’s probably a good thing. We do see some more interesting results clustered in the middle, where three teams are all within one game of middling (81 wins, 81 losses) but are shuffled relative to their win-loss rankings. The San Francisco Giants show the biggest discrepancy between their record and their calculated win expectancy. Two big factors are a league-worst performance against the last-place Houston Astros (3/7) and a miserable showing against the Atlanta Braves (1/7).

Since the results are not radically different from win-loss ranking, this technique might be useful for predicting future performance.

Can it predict?

Let’s pick July 25th, 2011 as our reference point. Given all of the data up to and including July 25th, would this technique be able to predict winners well? And how well compared to predictions based on win-loss records alone?

First, let’s dispel the idea that win-loss records are useful for prediction. Team schedules differ depending on the division a team is in, and the timing of games varies wildly (one team might have a much easier August schedule than another).

Let’s take a look at the standings on July 25th.

team   wins   losses   pct.
PHI     64      37     .634
SFG     59      43     .578
ATL     59      44     .573
ARI     55      47     .539
PIT     53      47     .530
STL     54      48     .529
MIL     54      49     .524
NYM     51      51     .500
CIN     50      52     .490
WSN     49      52     .485
FLA     49      53     .480
COL     48      55     .466
LAD     46      56     .451
SDP     45      58     .437
CHC     42      60     .412
HOU     33      69     .324

If we were to extrapolate wins from this data, Philadelphia, San Francisco, and Pittsburgh would all win their divisions, and Atlanta would win the wild card. Of those teams, only Philadelphia actually made it into the postseason: Pittsburgh would collapse in historic fashion, San Francisco would fall out of the wild card chase, and Atlanta was eliminated on the last day of the season.

How does my eigenvector approach rank the teams?

team   wins
MIL    98.7808026452
PHI    94.1012472417
ARI    92.83591851
NYM    92.1415627452
STL    91.0907384264
FLA    85.6495626848
ATL    85.4732881801
LAD    81.2782903122
PIT    79.853920851
WSN    78.3922069834
SFG    77.9103226938
SDP    73.8149861818
COL    72.7936815204
CHC    71.7226149492
CIN    69.8043513893
HOU    50.3565046855

Milwaukee, Philadelphia, and Arizona are all correctly predicted to be division winners. Only the New York Mets are out of place. Their high ranking is due to the fact that up until this point, they had won 100% of their games against Arizona. The next time the two teams met, Arizona won all three games, and the Mets’ predicted outcome dropped to 86 wins.

Another anomaly here is the Florida Marlins. They’d suffered through a disastrous 5-win, 23-loss June, but still manage to rank relatively high in this list. The only explanation I can imagine is that they’re being lifted by the presence of the Mets in the top four (against whom they’d won 60% of their games at this point in the season). A potential improvement to my approach would be to add Laplace smoothing for matchups that have not registered some threshold number of games.
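
A rough sketch of what that smoothing might look like (the alpha value here is illustrative, not tuned):

    def smoothed_ratio_matrix(wins, alpha=1.0):
        # Add alpha phantom wins to each side of every matchup, so a 3-0
        # head-to-head record reads as 4/5 rather than a perfect 1.0, and
        # matchups with no games default to an even 0.5. Full 18-game
        # matchups are barely affected.
        n = wins.shape[0]
        ratios = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j:
                    games = wins[i, j] + wins[j, i]
                    ratios[i, j] = (wins[i, j] + alpha) / (games + 2 * alpha)
        return ratios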

The July 25th date is actually significant, in that it is the last day before the Pittsburgh Pirates fell apart in the second half of the season. It seems that this outcome wasn’t totally out of character for the team. Even though they were in first place in their division, according to my approach’s rankings, they were still performing at the level of a team that would lose more games than it would win.

Every team in the NL West (Arizona, Los Angeles, San Francisco, San Diego, Colorado) is predicted to within two games of the score my method gives them at season’s end.

This model is far from perfect, but it’s certainly an improvement over what has been the norm for over a century. Let’s see how it fares against current data.

2012 season predictions

Let’s use the data available as of the night of July 22nd, 2012. This is a bit less data than the 2011 exercise had to work with, but let’s give it a shot.

team   wins
ATL    100.543681086
LAD    99.815436171
SFG    99.1144014954
CIN    97.6521052193
PIT    93.0410442969
WSN    91.0015853079
NYM    85.3310838798
STL    80.9079547281
ARI    80.4118499289
MIA    77.927648015
PHI    75.5229907201
CHC    69.409273466
MIL    69.2133944994
COL    66.1678982613
HOU    60.4149854959
SDP    49.5246674291

So, what does this tell us? Atlanta is a 100-win team (they’re currently fifth overall in the league). The Washington Nationals, despite having the best record that night, will not advance to the playoffs. The Los Angeles Dodgers will barely take the division over the San Francisco Giants, who end up with the first wild card spot. The Cincinnati Reds will win the NL Central, and the Pittsburgh Pirates will be the first team in MLB history to win the second wild card spot.

As an aside, this algorithm really hates the Padres, and predicts that they’ll win only eight of their remaining 65 games. I don’t get it.

Let’s check on this in a couple months to see if it’s at all correct.

Comments? Thoughts? Contact me on twitter or via email.