What wins a hockey game? We often hear from announcers
things like, "The team that wins in the faceoff circle will win this game"
or "They've got to get the puck on the net if they want to win tonight."
But which of these statements are true? More
generally, which team stats are best at predicting who wins hockey games?
I take a first stab at that question here by looking at what
correlates with winning hockey games. I use seven years of NHL play-by-play
data to generate statistics like shot-differential, hit-differential, faceoff
win-differential, etc. and then use those statistics as variables in logistic
regression in order to evaluate who wins games and why.
What surprised me is that the statistic that I would have
guessed correlates most strongly with winning (shots on goal) is highly
correlated with winning, but in the wrong direction. That's to say, the team that takes more shots in a game is, on average,
less likely to win the game. More
predictably, I also find that winning faceoffs and being the beneficiary of
turnovers both positively contribute to a team's chances of winning.
To do my analysis, I used my cleaned dataset of more
than 2 million plays covering every season from 07-08 through 12-13. For each game I
calculated the number of times the home and away teams registered several
different types of statistics. Specifically I considered shots on goal, blocked
shot attempts, missed shots, hits, faceoff wins, and turnovers. I then took the
difference between the two teams for each of these statistics (e.g.: home SOG
minus away SOG).
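As a rough illustration of this step, here is a minimal Python sketch of computing the home-minus-away differentials for one game. It is not the code used for this post: the event labels, the `(side, event)` tuple layout, and the function name `game_differentials` are all assumptions made for the example, and the toy plays are invented.

```python
from collections import Counter

# Hypothetical event labels; the real play-by-play data may use different names.
STAT_TYPES = [
    "shot_on_goal", "blocked_shot", "missed_shot",
    "hit", "faceoff_win", "turnover",
]

def game_differentials(plays):
    """Return home-minus-away counts for each stat type in one game.

    `plays` is an iterable of (side, event) tuples, where side is "home"
    or "away" and names the team credited with the event. For turnovers,
    this sketch assumes the credited team is the one gaining possession.
    """
    counts = {"home": Counter(), "away": Counter()}
    for side, event in plays:
        if event in STAT_TYPES:
            counts[side][event] += 1
    return {s: counts["home"][s] - counts["away"][s] for s in STAT_TYPES}

# Toy game, made up for illustration only.
toy_plays = [
    ("home", "shot_on_goal"),
    ("away", "shot_on_goal"),
    ("home", "shot_on_goal"),
    ("away", "hit"),
    ("home", "faceoff_win"),
    ("away", "faceoff_win"),
    ("away", "turnover"),
]
diffs = game_differentials(toy_plays)
```

Each game then contributes one row of six differentials plus the win/loss outcome for the home team.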
From there, I used logistic regression to evaluate how each
of these six factors played a role in whether the home team won or lost the
game. Logistic regression is similar to linear regression except, instead of having a dependent (response) variable that
possibly ranges from negative infinity to infinity, the dependent variable
takes a value of zero (home team loses) or one (home team wins). One advantage
of this approach is that it allows me to consider the effect of one variable (say,
shots on goal) on winning probability, while "controlling" for the
other five variables.
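To make the idea concrete, here is a small, self-contained Python sketch of logistic regression fit by gradient ascent on simulated data. It is only an illustration, not the analysis from this post (which was done in R): the two predictors, the coefficients in `TRUE_BETA`, and the helper names are arbitrary assumptions, and the data are randomly generated rather than taken from NHL games.

```python
import math
import random

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=1000):
    """Fit intercept and coefficients by batch gradient ascent on the log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            err = target - p
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Simulated data: two made-up, standardized differentials per game.
random.seed(0)
TRUE_BETA = [0.2, -0.8, 1.0]  # arbitrary values, used only to simulate outcomes
X, y = [], []
for _ in range(400):
    row = [random.gauss(0, 1), random.gauss(0, 1)]
    p_home_win = sigmoid(TRUE_BETA[0] + TRUE_BETA[1] * row[0] + TRUE_BETA[2] * row[1])
    X.append(row)
    y.append(1 if random.random() < p_home_win else 0)  # 1 = home team wins

beta = fit_logistic(X, y)

# "Controlling" for the second predictor: vary the first, hold the second at 0.
curve = [(x1, sigmoid(beta[0] + beta[1] * x1)) for x1 in (-2, -1, 0, 1, 2)]
```

The `curve` list is the kind of thing a win-probability graph is drawn from: predicted probabilities as one differential changes while the others are held fixed.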
Instead of presenting a complicated table of results from
the regression (which even trained statisticians would struggle to easily and effectively
interpret), I turned the results into easily interpretable graphs.
The first set of graphs shows what I think are the most
surprising results. The first graph shows that as the home team takes more and more shots (relative to the away
team) the home team's probability of winning decreases. Similarly, big
positive differences in blocked shot and missed shot attempts also correspond
with lower win probabilities for the home team.
These findings are particularly interesting because of what
they suggest about the increasingly-popular Corsi and Fenwick statistics. Corsi
can be thought of as the plus-minus for a player or a team, except instead of
measuring goal plus-minus it measures plus-minus for all shot attempts (shots
on goal, missed shots, and blocked shots). Fenwick is similar, except it omits
blocked shots from its calculation. The results of my analysis imply that high
values of Corsi and Fenwick should actually correlate with lower probabilities
of winning games.
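For readers who want the definitions spelled out, here is a short Python sketch of team-level Corsi and Fenwick as described above. The function and argument names are assumptions made for the example, "blocked" here means a team's own shot attempts that the opponent blocked, and the numbers are invented.

```python
def corsi(sog_for, missed_for, blocked_for,
          sog_against, missed_against, blocked_against):
    """Shot-attempt plus-minus: all attempts for minus all attempts against."""
    attempts_for = sog_for + missed_for + blocked_for
    attempts_against = sog_against + missed_against + blocked_against
    return attempts_for - attempts_against

def fenwick(sog_for, missed_for, sog_against, missed_against):
    """Same idea as Corsi, but blocked shot attempts are left out."""
    return (sog_for + missed_for) - (sog_against + missed_against)

# Made-up single-game totals for illustration.
team_corsi = corsi(30, 10, 12, 25, 8, 15)   # (30+10+12) - (25+8+15)
team_fenwick = fenwick(30, 10, 25, 8)       # (30+10) - (25+8)
```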
In addition to the results related to shooting the puck, the
regression tells us some things about a few other common statistics. First, big differences in the number of
hits appear to favor the team receiving the hits rather than the one dishing
them out. This result confirms what I found in an earlier post.
Second, winning the turnover battle pays big dividends. Generating five extra turnovers during the
course of the game increases a team's probability of winning by about 10%. Third,
winning faceoffs also makes a team more
likely to win a game, although this effect isn't as big as somebody might
expect.
Third Period Effects?
One possible explanation for why shot attempts have the
opposite effect on winning probabilities from what we'd expect is that desperate teams
trailing on the scoreboard might tend to throw the puck at the net a lot more.
To account for this possibility I did the same set of analyses as above, except
this time I restricted the data so that
I was only looking at third period stats in games that were tied entering the
third.
Games that are tied entering the third shouldn't have
the problem of the data being distorted by desperate strategies by teams
trailing on the scoreboard. Luckily, because the dataset is so big, even
restricting the analysis to these games leaves me with over 900 games to
analyze. The graphs below show the results from this analysis.
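The restriction itself is simple to express. Below is a minimal Python sketch of picking out games that were tied after two periods; the record layout (`home_goals_by_period`, `away_goals_by_period`) and the function name are hypothetical, and the three games are invented.

```python
def tied_entering_third(game):
    """True if both teams had the same number of goals through two periods."""
    home = sum(game["home_goals_by_period"][:2])
    away = sum(game["away_goals_by_period"][:2])
    return home == away

# Made-up games: goals listed per period (1st, 2nd, 3rd).
games = [
    {"id": "A", "home_goals_by_period": [1, 0, 2], "away_goals_by_period": [0, 1, 0]},
    {"id": "B", "home_goals_by_period": [2, 1, 0], "away_goals_by_period": [0, 1, 1]},
    {"id": "C", "home_goals_by_period": [0, 0, 1], "away_goals_by_period": [0, 0, 0]},
]
tied_ids = [g["id"] for g in games if tied_entering_third(g)]
```

For the games that pass this filter, the differentials would then be computed from third-period plays only before rerunning the regression.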
The patterns in close
third period games are very similar to those in games overall. Shooting
more often diminishes your chance of winning the game. I think the most
interesting thing to point out is how large the effects of turnovers and faceoffs
are. Generating an extra turnover or two
in the third period can cause a team's chances of winning to jump by 5 or 10
percentage points. Likewise, winning an extra faceoff or two gives a team a
nice bump in their winning probability.
So what have we learned? Shooting the puck more often might
not be all it's cracked up to be. Same with hitting. And if you want to win
games, then win the faceoff battle and create turnovers in your favor.
Cool analysis. Curious what your data source is. Haven't found a source with downloadable data in a relatively easy to manage format.
Thanks! I used Python to scrape and clean the HTML code of play-by-play data from the NHL website. And then I used R to do the analysis and make the graphs.
Thanks. Don’t know if you plan to expand on this at all, but since I came across your page and thought it was a neat analysis I figure I might as well give you my two cents, and you can ignore or proceed as you wish:
1. Try coding the outcome as goal differential rather than win vs. lose. You might find some neat patterns there as I would think (i.e. speculate) the shooting-behaviors depend somewhat on the score – e.g. in close games you might observe the pattern you saw, but in less close games it might be opposite, or less strong, or perhaps even stronger. Whichever it might be, there’s more information there and you don’t lose the win/loss information as a + score diff is a win for the home team and a – score diff is a loss for the home team.
2. I also think the team trailing is likely to “throw the puck at the net,” but I don’t know how well the 3rd period analysis captures that. I can imagine (again, speculate) that the effect is stronger in close games, which might actually magnify the effect you found in the overall analysis. Perhaps try adding score differential at each intermission (or just the 2nd) as a control rather than restrict the analysis. You could check interactions with shot differentials, as well.
3. There might be some interesting team effects. It might be interesting to use a multilevel model and let the intercept vary by team. You could add interesting team-level variables, calculate team specific “effects,” etc.
Cheers,
Dan
Appreciate the feedback Dan! I'd given a little bit of thought to some of these, and am really happy to hear that somebody else thinks they'd be interesting. When I originally wrote the post, I tried using goal differential as an outcome variable. The takeaways were virtually the same as using winning/losing as the outcome.
Also, you're totally right about the third period being an arbitrary cutoff. I'm thinking of going through and doing this same analysis for every 10 or 15 second interval in the game, conditioning on the game being tied at that point. Not sure if what I just said makes sense, but hopefully when I write the post (maybe next week) it'll make more sense.
I also really like the idea of team effects. I'll lose some degrees of freedom, since I'll only have 82 observations for each team-year unit. Actually maybe only 41, depending on how I specify the model and run the analysis.