Sunday, October 27, 2013

What wins hockey games?

Check out this post on the new version of Rink Stats.
What wins a hockey game? We often hear from announcers things like, "The team that wins in the faceoff circle will win this game" or "They've got to get the puck on the net if they want to win tonight." But which of these statements are true? More generally, which team stats are best at predicting who wins hockey games?

I take a first stab at that question here by looking at what correlates with winning hockey games. I use seven years of NHL play-by-play data to generate statistics like shot-differential, hit-differential, faceoff win-differential, etc. and then use those statistics as variables in logistic regression in order to evaluate who wins games and why.

What surprised me is that the statistic I would have guessed correlates most strongly with winning (shots on goal) is indeed highly correlated with winning, but in the wrong direction. That is to say, the team that takes more shots in a game is, on average, less likely to win the game. More predictably, I also find that winning faceoffs and being the beneficiary of turnovers both positively contribute to a team's chances of winning.


To do my analysis, I used my cleaned dataset of more than 2 million plays from every season from 07-08 through 12-13. For each game I counted how many times the home and away teams registered each of several types of events. Specifically, I considered shots on goal, blocked shot attempts, missed shots, hits, faceoff wins, and turnovers. I then took the difference between the two teams for each of these statistics (e.g., home SOG minus away SOG).
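To make the differential step concrete, here is a minimal Python sketch. It is not the code used for this post: the event labels, the `(team_side, event_type)` tuple layout, and the function name `game_differentials` are all assumptions made for illustration, and each event is assumed to be credited to the team that registered (or benefited from) it.

```python
from collections import Counter

# Event types considered in the post. The string labels are assumed names,
# not the NHL's own play-by-play codes.
EVENT_TYPES = [
    "shot_on_goal", "blocked_shot", "missed_shot",
    "hit", "faceoff_win", "turnover",
]

def game_differentials(plays):
    """Return home-minus-away counts for each event type in one game.

    `plays` is an iterable of (team_side, event_type) tuples, where
    team_side is "home" or "away" (hypothetical layout).
    """
    counts = Counter(plays)  # missing keys count as zero
    return {ev: counts[("home", ev)] - counts[("away", ev)] for ev in EVENT_TYPES}

# A tiny made-up game, for illustration only.
plays = [
    ("home", "shot_on_goal"), ("away", "shot_on_goal"), ("home", "shot_on_goal"),
    ("away", "hit"), ("away", "hit"),
    ("home", "faceoff_win"), ("away", "turnover"),
]
diffs = game_differentials(plays)
print(diffs)
```

Each game then contributes one row of six differentials plus the win/loss outcome for the home team.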

From there, I used logistic regression to evaluate how each of these six factors played a role in whether the home team won or lost the game. Logistic regression is similar to linear regression except that, instead of having a dependent (response) variable that can range from negative infinity to infinity, the dependent variable takes a value of zero (home team loses) or one (home team wins). One advantage of this approach is that it allows me to consider the effect of one variable (say, shots on goal) on winning probability, while "controlling" for the other five variables.
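For readers who have not seen logistic regression before, the sketch below fits one from scratch in plain Python using batch gradient ascent on the log-likelihood. It is only an illustration of the technique, not the analysis behind this post: the data are simulated, the "true" coefficients are made up, and `x1` and `x2` stand in for any two of the differentials.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit an intercept and slopes by batch gradient ascent on the log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            err = yi - p  # observed outcome minus predicted probability
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Simulated games. TRUE_BETA is made up purely to generate data.
random.seed(0)
TRUE_BETA = [0.2, 0.4, -0.3]  # intercept, slope on x1, slope on x2
X, y = [], []
for _ in range(300):
    x1, x2 = random.randint(-5, 5), random.randint(-5, 5)
    p_home_win = sigmoid(TRUE_BETA[0] + TRUE_BETA[1] * x1 + TRUE_BETA[2] * x2)
    X.append((x1, x2))
    y.append(1 if random.random() < p_home_win else 0)  # 1 = home team wins

beta = fit_logistic(X, y)
print([round(b, 2) for b in beta])
```

Because both variables enter the model together, each fitted slope describes the effect of its variable with the other held fixed, which is what "controlling" means here. In practice you would use a tested routine such as R's `glm` rather than a hand-rolled optimizer.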

Instead of presenting a complicated table of results from the regression (which even trained statisticians would struggle to easily and effectively interpret), I turned the results into easily interpretable graphs.
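The idea behind such graphs can be sketched in a few lines of Python: vary one differential, hold the others at zero, and convert the model's linear predictor into a probability. The intercept and coefficients below are placeholders chosen for illustration, not the estimates from my regression, and `win_probability` is a hypothetical helper name.

```python
import math

def win_probability(intercept, coefs, diffs):
    """Predicted home win probability from a fitted logistic model.

    Any differential missing from `diffs` is held at zero.
    """
    z = intercept + sum(coefs[name] * diffs.get(name, 0.0) for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder values, NOT the post's estimates.
intercept = 0.2
coefs = {"turnover_diff": 0.08, "faceoff_diff": 0.02}

# Trace out the curve for one variable while the other stays at zero.
curve = [
    (d, win_probability(intercept, coefs, {"turnover_diff": d}))
    for d in range(-10, 11, 5)
]
for d, p in curve:
    print(f"turnover_diff={d:+d}  P(home win)={p:.3f}")
```

Plotting the `(d, p)` pairs for each variable in turn gives one curve per statistic.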




The first set of graphs shows what I think are the most surprising results. The first graph shows that as the home team takes more and more shots (relative to the away team), the home team's probability of winning decreases. Similarly, big positive differences in blocked shot and missed shot attempts also correspond with lower win probabilities for the home team.

These findings are particularly interesting because of what they suggest about the increasingly popular Corsi and Fenwick statistics. Corsi can be thought of as the plus-minus for a player or a team, except instead of measuring goal plus-minus it measures plus-minus for all shot attempts (shots on goal, missed shots, and blocked shots). Fenwick is similar, except it omits blocked shots from its calculation. The results of my analysis imply that high values of Corsi and Fenwick should actually correlate with lower probabilities of winning games.
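As a quick illustration of those two definitions at the team level, here is a small Python sketch. The `ShotAttempts` class and the numbers are made up for the example; `blocked` is assumed to mean attempts by this team that the opponent blocked.

```python
from dataclasses import dataclass

@dataclass
class ShotAttempts:
    on_goal: int
    missed: int
    blocked: int  # attempts by this team that the opponent blocked (assumed meaning)

def corsi(team, opponent):
    """Plus-minus over all shot attempts: on goal, missed, and blocked."""
    return (team.on_goal + team.missed + team.blocked) - (
        opponent.on_goal + opponent.missed + opponent.blocked
    )

def fenwick(team, opponent):
    """Same as Corsi, but leaving blocked shots out."""
    return (team.on_goal + team.missed) - (opponent.on_goal + opponent.missed)

# Made-up single-game totals.
home = ShotAttempts(on_goal=30, missed=12, blocked=15)
away = ShotAttempts(on_goal=25, missed=10, blocked=9)
print(corsi(home, away), fenwick(home, away))  # prints "13 7"
```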

In addition to the results related to shooting the puck, the regression tells us some things about a few other common statistics. First, big differences in the number of hits appear to favor the team receiving the hits rather than the one dishing them out. This result confirms what I found in an earlier post.

Second, winning the turnover battle pays big dividends. Generating five extra turnovers during the course of the game increases a team's probability of winning by about 10 percentage points. Third, winning faceoffs also makes a team more likely to win a game, although this effect isn't as big as somebody might expect.

Third Period Effects?

One possible explanation for why shot attempts have the opposite effect on winning probabilities from what we'd expect is that desperate teams trailing on the scoreboard might tend to throw the puck at the net a lot more. To account for this possibility, I did the same set of analyses as above, except this time I restricted the data so that I was only looking at third period stats in games that were tied entering the third.
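The restriction itself can be sketched as follows. This is a minimal Python illustration under assumed data layouts, not the code used for this post: goals are `(period, side)` tuples, plays are `(period, side, event_type)` tuples, overtime is ignored, and the two games are made up.

```python
from collections import Counter

def tied_entering_third(goals):
    """True if home and away have scored equally through two periods."""
    home = sum(1 for period, side in goals if period <= 2 and side == "home")
    away = sum(1 for period, side in goals if period <= 2 and side == "away")
    return home == away

def third_period_diffs(plays, event_types):
    """Home-minus-away counts using only third-period plays."""
    counts = Counter((side, ev) for period, side, ev in plays if period == 3)
    return {ev: counts[("home", ev)] - counts[("away", ev)] for ev in event_types}

# Two made-up games, for illustration only.
games = {
    "game_a": {  # 1-1 after two periods, so it is kept
        "goals": [(1, "home"), (2, "away"), (3, "home")],
        "plays": [(1, "home", "turnover"), (3, "home", "turnover"),
                  (3, "away", "faceoff_win")],
    },
    "game_b": {  # 2-0 after two periods, so it is dropped
        "goals": [(1, "home"), (2, "home"), (3, "away")],
        "plays": [(3, "away", "turnover")],
    },
}

event_types = ["turnover", "faceoff_win"]
tied_games = {
    name: third_period_diffs(g["plays"], event_types)
    for name, g in games.items()
    if tied_entering_third(g["goals"])
}
print(tied_games)
```

Only the third-period differentials of the retained games would then go into the regression.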

Games that are tied entering the third shouldn't have the problem of the data being distorted by the desperate strategies of teams trailing on the scoreboard. Luckily, because the dataset is so big, even restricting the analysis to these games leaves me with over 900 games to analyze. The graphs below show the results from this analysis.




The patterns in close third period games are very similar to those in games overall. Shooting more often diminishes your chance of winning the game. I think the most interesting thing to point out is how large the effects of turnovers and faceoffs are. Generating an extra turnover or two in the third period can cause a team's chances of winning to jump by 5 or 10 percentage points. Likewise, winning an extra faceoff or two gives a team a nice bump in its winning probability.


So what have we learned? Shooting the puck more often might not be all it's cracked up to be. Same with hitting. And if you want to win games, then win the faceoff battle and create turnovers in your favor.

4 comments:

  1. Cool analysis. Curious what your data source is. Haven't found a source with downloadable data in a relatively easy to manage format.

  2. Thanks! I used Python to scrape and clean the HTML code of play-by-play data from the NHL website. And then I used R to do the analysis and make the graphs.

  3. Thanks. Don't know if you plan to expand on this at all, but since I came across your page and thought it was a neat analysis, I figure I might as well give you my two cents, and you can ignore or proceed as you wish:
    1. Try coding the outcome as goal differential rather than win vs. lose. You might find some neat patterns there, as I would think (i.e. speculate) the shooting behaviors depend somewhat on the score: e.g., in close games you might observe the pattern you saw, but in less close games it might be opposite, or less strong, or perhaps even stronger. Whichever it might be, there's more information there and you don't lose the win/loss information, as a positive score differential is a win for the home team and a negative score differential is a loss for the home team.
    2. I also think the team trailing is likely to "throw the puck at the net," but I don't know how well the 3rd period analysis captures that. I can imagine (again, speculate) that the effect is stronger in close games, which might actually magnify the effect you found in the overall analysis. Perhaps try adding score differential at each intermission (or just the 2nd) as a control rather than restricting the analysis. You could check interactions with shot differentials, as well.
    3. There might be some interesting team effects. It might be interesting to use a multilevel model and let the intercept vary by team. You could add interesting team-level variables, calculate team specific “effects,” etc.
    Cheers,
    Dan

  4. Appreciate the feedback, Dan! I'd given a little bit of thought to some of these, and am really happy to hear that somebody else thinks they'd be interesting. When I originally wrote the post, I tried using goal differential as an outcome variable. The takeaways were virtually the same as using winning/losing as the outcome.

    Also, you're totally right about the third period being an arbitrary cutoff. I'm thinking of going through and doing this same analysis for every 10 or 15 second interval in the game, conditioning on the game being tied at that point. Not sure if what I just said makes sense, but hopefully when I write the post (maybe next week) it'll make more sense.

    I also really like the idea of team effects. I'll lose some degrees of freedom, since I'll only have 82 observations for each team-year unit. Actually maybe only 41, depending on how I specify the model and run the analysis.
