The Path to WAR*

*Wins-Above-Replacement-Like Algorithm-Based Rating

Dream On

The single metric dream has existed in hockey analytics for some time now. The most relevant metric, WAR or Wins Above Replacement, represents an individual player’s contribution to the success of their team by attempting to quantify the number of goals the add over a ‘replacement-level’ player. More widely known in baseball, WAR in hockey is much tougher to delineate, but has been attempted, most notably at the excellent, but now defunct, The pursuit of a single, comprehensive metric has been attempted by Ryder, Awad, Macdonald, Schuckers and Curro, and Gramacy, Taddy, and Jensen.

Their desires and effort are justified: a single metric, when properly used, can be used to analyze salaries, trades, roster composition, draft strategy, etc. Though it should be noted that WAR, or any single number rating, is not a magic elixir since it can fail to pick up important differences in skill sets or role, particularly in hockey. There is also a risk that it is used as a crutch, which may be the case with any metric.

Targeting the Head

Prior explorations into answering the question have been detailed and involved, and rightfully so, aggregating and adjusting an incredible amount of data to create a single player-season value.[1] However, I will attempt to reverse engineer a single metric based on in-season data from a project.

For the 2015-16 season, the CrowdScout project aggregated the opinions of individual users. The platform uses the Elo formula, a memoryless algorithm that constantly adjusts each player’s score with new information. In this case, information is the user’s opinion that is hopefully guided by the relevant on-ice metrics (provided to the user, see below). Hopefully, the validity of this project is closer to Superforecasting than the NHL awards, and it should be: the ‘best’ users or scouts are given increasing more influence over the ratings, while the worst are marginalized.[2]

The CrowdScout platform ran throughout the season with over 100 users making over 32,000 judgments on players, creating a population of player ratings ranging from Sidney Crosby to Tanner Glass. The system has largely worked as intended, but needs to continue to acquire an active, smart, and diverse user base – this will always be the case when trying to harness the ‘wisdom of the crowd.’ Hopefully, as more users sign-up and smarter algorithms emphasize the opinions of the best, the Elo rating will come closer to answering the question posed to scouts as they are prompted to rank two players – if the season started today, which player would you choose if the goal were to win a championship.

Let’s put our head’s together

Each player’s Elo is adjusted by the range of ratings within the population. The result, ranging from 0 to 100, generally passes the sniff test, at times missing on players due to too few or poor ratings. However, this player-level rating provides something more interesting – a target variable to create an empirical model from. Whereas in theory, WAR is cumulative metric representing incremental wins added by a player, the CrowdScout Score, in theory, represents a player’s value to a team pursuing a championship. Both are desirable outcomes, and will not work perfectly in practice, but this is hockey analytics: we can’t let perfect get in the way of good.

Why is this analysis useful or interesting?

  1. Improve the CrowdScout Score – a predicted CrowdScout Score based on-ice data could help identify misvalued players and reinforce properly valued players. In sum, a proper model would be superior to the rankings sourced from the inaugural season with a small group of scouts.
  2. Validate the CrowdScout Score – Is there a proper relationship between CrowdScout Score and on-ice metrics? How large are the residuals between the predicted score and actual score? Can the CrowdScout Score or predicted score be reliably used in other advanced analyses? A properly constructed model that reveals a solid relationship between crowdsourced ratings and on-ice metrics would help validate the project. Can we go back in time to create a predicted score for past player seasons?
  3. Evaluate Scouts – The ability to reliably predict the CrowdScout Score based on on-ice metrics can be used to measure the accuracy of the scout’s ratings in real-time. The current algorithm can only infer correctness in the future – time needs to pass to determine whether the scout has chosen a player preferred by the rest of the crowd. This could be the most powerful result, constantly increasing the influence of users whose ratings agree with the on-ice results. This is, in turn, would increase the accuracy of the CrowdScout Score, leading a stronger model, continuing a virtuous circle.
  4. Fun – Every sports fan likes a good top 10 list or something you can argue over.

Reverse Engineering the Crowd

We are lucky enough to have a shortcut to a desirable target variable, the end of season CrowdScout Score for each NHL player. We can then merge on over 100 player-level micro stats and rate metrics for the 2015-16 season, courtesy of There are 539 skaters that have at least 50 CrowdScout games and complete metrics. This dataset can then be used to fit a model using on-ice data to explain CrowdScout Score, then we use the model output to predict the CrowdScout Score, using the same player-level on-ice data. Where the crowd may have failed to accurately gauge a player’s contribution to winning, the model can use additional information to create a better prediction.

The strength of any model is proper feature selection and prevention of overfitting. Hell, with over 100 variables and over 500 players, you could explain the number of playoff beard follicles with spurious statistical significance. To prevent this, I performed couple operations using the caret package in R.

  1. Find Linear Combination of Variables – using the findLinearCombos function in caret, variables that were mathematically identical to a linear combination of another set of variables were dropped. For example, you don’t need to include goals, assists, and points, since points are simply assists plus goals.
  2. Recursive Feature Elimination – using the rfe function in caret and a 10-fold cross-validation control (10 subsets of data were considered when making the decision, all decision were made on the models performance on unseen, or holdout, data) the remaining 80-some skater variables were considered from most powerful to least powerful. The RFE plot below shows a maximum strength of model at 46 features, but most of the gains are achieve by about the 8 to 11 most important variables.
  3. Correlation Matrix – create a matrix to identify and remove features that are highly correlated with each other. The final model had 11 variables listed below.RFEcorr.matrix

The remaining variables were placed into a Random Forest models targeting the skaters CrowdScout Score. Random Forest is a popular ensemble model[3]: it randomly subsets variables and observations (random) and creates many decision-trees to explain the target variable (forest).  Each observation or player is assigned a predicted score based on the aggregate results of the many decision-trees.

Using the caret package in R,  I created Random Forest model controlled by a 10-fold cross-validation, not necessarily to prevent overfitting which is not a large concern with Random Forest, but to cycle through all data and create predicted scores for each player. I gave the model the flexibility to try 5 different tuning combinations, allowing it to test the ideal number of variables randomly sampled at each split and number of trees to use. The result was a very good fitting model, explaining over 95% of the CrowdScout Score out of sample. Note the variation explained, rather than the variance explained was closer to 70%.


Note the slope of the best-fit relationship between actual and predicted scores is a little less than 1. The model doesn’t want to credit the best players too much for their on-ice metrics, or penalize the worst players too much, but otherwise do a very good job.


Capped Flexibility

Let’s return to the original intent of the analysis. We can predict about 95% of CrowdScout Score using vetted on-ice metrics. This suggests the score is reliable, but that doesn’t necessarily mean the CrowdScout Score is right. In fact, we can assume that the actual score is often wrong. How does a simpler model do? Using the same on-ice metrics in a Generalized Linear Model (GLM) performs fairly well out of sample, explaining about 70% of the variation. The larger error terms of the GLM model represent larger deviations of the predicted score from the actual. While these larger deviations result in a poorer fitting model fit, they may also contain some truth. The worse fitting linear model has more flexibility to be wrong, perhaps allowing a more accurate prediction.



Note the potential interaction between TOI.GM and position

Residual Compare

How do the player-level residuals between the two models compare? They are largely the same directionally, but the GLM residuals are about double in magnitude. So, for example, the Random Forest model predicts Sean Monahan’s CrowdScout Score to be 64 instead of his current 60, giving a residual of +4 (residual = predicted – actual). Not to be outdone, the Generalized Linear Model doubles that residual predicting a 68 score (+8 residual). It appears that both models generally agree, with the GLM being more likely to make a bold correction to the actual score.



The development of an accurate single comprehensive metric to measure player impact will be an iterative process. However, it seems the framework exists to fuse human input and on-ice performance into something that can lend itself to more complex analysis. Our target variable was not perfect, but it provided a solid baseline for this analysis and will be improved. To recap the original intent of the analysis:

  1. Both models generally agree when a player is being overrated or underrated by the crowd, though by different magnitudes. In either case, the predicted score is directionally likely to be more accurate than the current score. This makes sense since we have more information (on-ice data). If it wasn’t obvious, it appears on-ice metrics can help improve the CrowdScout Score.
  2. Fortunate, because our models fail to explain between 5% and 30% of the score and vary more from the true ability. Some of the error will be justified, but often it will signal that the CrowdScout Score needs to adjust. Conversely, a beta project with relatively few users was able to create a comprehensive metric that can be mostly engineered and validated using on-ice metrics.
  3. Being able to calculate a predicted CrowdScout Score more accurate than the actual score gives the platform an enhanced ability to evaluate scouting performance in real-time. This will strengthen the virtuous circle of giving the best scouts more influence over Elo ratings, which will help create a better prediction model.
  4. Your opinion will now be held up against people, models, and your own human biases. Fun.


Huge thanks to asmean to contributing to this study, specifically advising on machine learning methods.

[1] The Wins Above Replacement problem is not unlike the attribution problem my Data Science marketing colleagues deal with. We know the was a positive event (a win or conversion) but how do we attribute that event to the input actions between hockey players or marketing channels. It’s definitely a problem I would love to circle back to.

[2] What determines the ‘best’ scout? Activity is one component, but picking players that continue to ascend is another. I actually have plans to make this algorithm ‘smarter’ and is a long overdue explanation on my end.

[3] The CrowdScout platform and ensemble models have similar philosophies – they synthesize the results of models or opinions of users into a single score in order to improve their accuracy.

Leave a Reply

Your email address will not be published. Required fields are marked *