Can the NFL Combine Predict Future Success?


Abstract

Each year, college football players hoping to be drafted into the NFL are invited to the NFL Combine. There, players perform certain drills, like the 40-yard dash, vertical jump, and bench press, generating a set of statistics on each player. Basic measurements such as height and weight are also recorded. Our hypothesis was that players with better results would be drafted earlier and have more success in the NFL. Thus, we hoped to be able to predict draft pick number and future success in the NFL based on these metrics. However, we were not able to find any significant relationships between NFL Combine results and either draft position or future career success. Yet there was a significant relationship between future NFL career success and draft position, indicating that NFL teams do indeed draft better players earlier based on factors other than a prospect’s Combine results.

Dataset

We compiled two datasets for our research: one on NFL Combine data from 1999-2015 and another on NFL Draft data from 1999-2015.

For the first dataset, of 1999-2015 Combine results, we took the 2000-2015 results from http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi. This involved copying and pasting the tables from the website, which luckily yielded tab-separated values. We used R to clean this data: we appended every year’s results into one master file, converted the resulting file into comma-separated values, deleted unnecessary headers (the header row was repeated multiple times throughout the tables), converted height from “feet-inches” string format into inches, deleted an unnecessary column of hyperlinks to “College Stats”, split the single column of “Team/Round/Pick Number/Year” into four separate columns, and converted the “Round” and “Pick Number” columns from string format into integer format (e.g. “4th” into 4).
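The two string conversions described above can be sketched in a few lines. This is an illustrative Python version, not the R code we actually ran; the function names are hypothetical, and the " / " separator and the "102nd pick" wording in the combined draft column are assumptions about how the pasted text looked.

```python
import re


def height_to_inches(height: str) -> int:
    """Convert a 'feet-inches' string such as '6-2' into total inches."""
    feet, inches = height.strip().split("-")
    return int(feet) * 12 + int(inches)


def ordinal_to_int(text: str) -> int:
    """Keep only the leading digits of an ordinal, e.g. '4th' -> 4."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no leading digits in {text!r}")
    return int(match.group(1))


def split_draft_field(field: str) -> dict:
    """Split a combined 'Team / Round / Pick / Year' cell into four columns.

    Assumes (hypothetically) that the four parts are separated by '/'.
    """
    team, round_text, pick_text, year_text = [p.strip() for p in field.split("/")]
    return {
        "team": team,
        "round": ordinal_to_int(round_text),
        "pick": ordinal_to_int(pick_text),
        "year": int(year_text),
    }
```

For example, `height_to_inches("6-2")` gives 74 and `ordinal_to_int("4th")` gives 4.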

To add the 1999 NFL Combine data, the oldest we could find, we used Daren Willman’s dataset from his website, http://nflsavant.com/index.php. Luckily, his data was already formatted well. Using R, we simply extracted the relevant columns that matched up with our 2000-2015 dataset and appended them to our master .csv to complete our first full dataset of NFL Combine data from 1999-2015.

Our second dataset covers all players who were drafted in the 1999-2015 NFL Drafts, with their draft position (team, round, pick number) and the NFL career statistics that help us measure how successful a player was in the league. We retrieved this data from http://www.pro-football-reference.com/draft/, year-by-year from 1999-2015. We downloaded the comma-separated value file for each year’s draft results and appended all 17 years’ worth into one master file. We then cleaned the data by deleting unnecessary headers (the header row was repeated multiple times throughout the tables), renaming some headers for clarification purposes (e.g. differentiating Passing Touchdowns from Rushing Touchdowns), and deleting an unnecessary column of hyperlinks to “College Stats”.

However, not all players who were drafted in those years participated in the Combine, and not all players who participated in the Combine were drafted into the NFL. Thus, we had to match the players’ combine statistics to their Draft and career data.

In order to do this, we imported the data into Microsoft Excel and wrote a macro to integrate the two datasets. Essentially, the macro performed a left outer join of the NFL Combine dataset with the NFL Draft dataset, matching on draft year and overall draft pick number. Specifically, we used two nested for loops: the outer loop went through all the players in the Combine dataset and selected their draft year and overall draft pick number, while the inner loop went through all the draft data looking for an entry with matching draft year and overall draft pick number. When a match was found, the macro added the data from the draft table to the Combine entry for the player and broke out of the inner loop, moving on to the next Combine entry.
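For readers who want to reproduce the join outside Excel, here is a rough Python sketch of the same left outer join. It is not our macro: it replaces the inner loop with a dictionary lookup, and the row layout (dictionaries with hypothetical "year" and "pick" keys, with "pick" set to None for undrafted players) is an assumption made for illustration.

```python
def left_join_on_pick(combine_rows, draft_rows):
    """Left outer join of Combine rows with draft rows on (year, pick).

    Every Combine row is kept; draft columns are added only when a
    matching draft entry exists.
    """
    # Index the draft table once. setdefault keeps the first entry per key,
    # which mirrors breaking out of the inner loop at the first match.
    draft_index = {}
    for row in draft_rows:
        draft_index.setdefault((row["year"], row["pick"]), row)

    joined = []
    for player in combine_rows:
        merged = dict(player)
        pick = player.get("pick")
        match = None
        if pick is not None:  # undrafted players have no pick to match on
            match = draft_index.get((player.get("year"), pick))
        if match is not None:
            for column, value in match.items():
                merged.setdefault(column, value)  # never overwrite Combine data
        joined.append(merged)
    return joined
```

The lookup table makes the join roughly linear in the number of rows, whereas the nested loops scan the draft table once per Combine player. With only a few thousand rows either approach runs quickly.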

The result was a dataset with 3,795 players who participated in the Combine and were subsequently drafted, along with 1,806 players with only Combine data who were never drafted. In all, our final integrated dataset included 5,601 players with 39 variables. Notable variables include all 6 NFL Combine drills (40 Yard Dash, Bench Press, Vertical Jump, Broad Jump, Three Cone Drill, Shuttle Run), position-specific career metrics (e.g. Rushing Yards, Receiving Touchdowns, Tackles), miscellaneous variables (e.g. Position, Team Drafted By), and our “success” metric, Career Approximate Value (Career AV).

Approximate Value is a widely-used advanced football metric, independent of position, that measures the value a player contributes to their team during a season. Career Approximate Value (Career AV) is a weighted sum of a player’s season-by-season Approximate Values, where their best season is unweighted, the second best season is weighted at 0.95, their third best weighted at 0.9, and so on. This weighting scale values players who have had at least one exceptional season over those who have consistently average seasons. We specifically defined “success” in the NFL to be Career AV divided by seasons in the league. This avoids penalizing currently active players, who have had fewer seasons to accumulate value, and allows us to make cross-generational comparisons.
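As a concrete illustration of the weighting, the following Python sketch computes Career AV and our per-season success metric from a list of season AVs. The function names are hypothetical, and flooring the weight at zero for very long careers is our assumption, since the description above only says the weights continue “and so on”.

```python
def career_av(season_avs):
    """Weighted sum of season AVs: best season x1.00, next x0.95, then x0.90, ...

    Assumption: the weight is floored at 0 once it would turn negative.
    """
    ranked = sorted(season_avs, reverse=True)
    return sum(av * max(0.0, 1.0 - 0.05 * rank) for rank, av in enumerate(ranked))


def career_av_per_season(season_avs):
    """Our 'success' metric: Career AV divided by seasons in the league."""
    if not season_avs:
        return 0.0
    return career_av(season_avs) / len(season_avs)
```

A player with season AVs of 10, 8, and 6 would have a Career AV of 10 + 0.95 × 8 + 0.90 × 6 = 23.0, or about 7.67 per season.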

Screen Shot 2016-04-26 at 9.46.01 PM

 

We explored the NFL Combine data by finding the “shape” of NFL prospects, grouped by position. In the visualization we sought to determine how different positions varied in their performance on Combine drills. The six points of each hexagon represent six different Combine measurements. The six points of the blue hexagon represent the average performance of players from a given position on each drill. Better performance, also interpreted as greater athleticism, is indicated by a more expansive hexagon.

There are several interesting patterns we noticed from this exploration. Wide receivers and cornerbacks have very similar shapes, confirming the notion that they have similar athletic builds given that cornerbacks’ primary role is to guard wide receivers. Free safeties also have similar shapes to cornerbacks and wide receivers, given that they often serve as backup to cornerbacks for guarding wide receivers. Strong safeties and free safeties are similar as they play near-identical roles, except for the fact that strong safeties often assist in run defense – explaining their slightly higher bench performance as an indicator of the strength necessary to stop the run. Some positions are extremely specialized, namely offensive linemen and defensive tackles who primarily need to have extraordinary strength, seen in their bench press, and little else as far as other Combine measurements which leads to their “pointy” hexagonal shape. Lastly, other positions are less specialized, i.e. shaped closer to a regular hexagon. Fullbacks, tight ends, and outside/inside linebackers all have fairly balanced measurements. This agrees with their versatile usage on the field, where they must assume the roles of several more specialized positions (e.g. fullbacks must be able to proficiently block, run, and catch the ball).

Screen Shot 2016-05-03 at 11.53.56 PM

 

Additionally, we compared the “shape” of the top players at each position (specifically, the top 10 based on Career AV per Season) against the average for their respective position. The red areas represent the average player’s Combine measurements and the blue areas represent the top players.

There are noteworthy differences for several positions. Top quarterbacks appear to be exceptionally more athletic than average, which can be influenced by mobile “dual-threat” quarterbacks who are skilled runners in addition to passers. Top offensive guards appear to be more athletic and less one-dimensional than average guards, which can be explained by the athletic expectations of an offensive guard to be able to “pull” from one side of the line to another on run plays – requiring lateral quickness and overall greater athleticism. The same could be said for outside linebackers, defensive ends, and defensive tackles where exceptional athleticism is present in the top players, differentiating them from the crowd. Lastly, top tight ends are also more athletic than average, which aligns with the fact that pass-catching tight ends require greater physical versatility and provide far greater value to their team beyond blocking.

Methodology and Results

We explored the pairwise relationships between three different quantities: Combine results, success in the NFL as measured by Career AV per season, and draft pick position. First, by determining if there is a strong or weak correlation between Combine results and career success, we could determine how teams should factor in the Combine results when drafting players. Second, if Combine results are related to draft pick, teams would be able to predict how early players they are interested in would be drafted, allowing them to strategize accordingly. Lastly, by exploring the relationship between career success and draft pick position, we would be able to compare the value of different draft picks; for example, how much better it is for a team to have the first overall draft pick versus the fifth overall pick.

Combine Results and NFL Success

Due to the importance of athleticism in football, we hypothesized that NFL success could be predicted by combine results. First, to determine the feasibility of building a predictive model, we began by investigating correlations between each individual statistic from the Combine and Career AV per season. Unfortunately, the scatterplots below revealed that there were no clear relationships. Correlation coefficients ranged from -0.15 to 0.12, suggesting that there were no linear associations.
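The correlation coefficients quoted above are ordinary Pearson correlations. As a minimal Python sketch of that calculation (our analysis was done in R, and skipping players with a missing value is an assumption about how gaps were handled):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Pairs where either value is None (e.g. a skipped drill) are dropped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Values near 0, like the range we observed, indicate essentially no linear association, while values near +1 or -1 indicate a strong one.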

Rplot01

Despite our scatterplot and correlation results suggesting that the relationship between individual statistics and career success was weak, we trained a linear regression model using all Combine results to predict Career AV per season. We thought that perhaps this model would capture dependencies between different Combine drills that interact to predict success, even though individual drills were not highly correlated. The results are shown in the following graphic.

Rplot02

To our disappointment, the low R-squared value indicates that our regression model was a very poor fit. It explained only 4% of the variation in Career AV per season. Additionally, although some of our coefficient estimates were statistically significant, we doubt their accuracy, since some suggested that worse Combine performance (i.e. slower 40-yard dash and 3-cone drill times) would lead to better Career AV per season. Therefore, we concluded that Career AV per season could not be predicted purely by Combine results.

However, based on our exploratory visualization of different positions’ Combine results, we knew that different Combine drills are more important for different positions. Thus, we suspected that splitting up players by position before predicting success would yield better results. Unfortunately, splitting by position yielded no significant relationship between Combine results and Career AV per season. We also tried using different measures of success for different positions, such as tackles per season for linebackers and passing yards per season for quarterbacks. We again investigated the correlation between individual Combine drill results and position-specific success metrics, but found no significant relationship. For example, the scatter plot below shows the lack of correlation between cornerbacks’ Combine results and interceptions per season.

Rplot

Based on these poor results, we concluded that NFL success could not be predicted by Combine results. Thus, teams should not factor in the Combine when determining which players to draft, but should instead focus on other measures of ability such as college performance or traditional qualitative player scouting.

Combine Results and Draft Pick Position

We also hypothesized that players that did better in the Combine would be drafted earlier, since the Combine is supposed to be a tool for teams to evaluate potential picks. However, the following scatter plot of individual combine results versus draft pick number shows no correlation.

Rplot04

This visualization shows that draft pick positions are spread relatively uniformly across all levels of combine performance. Splitting up by position also yielded similarly uncorrelated results. This is unsurprising given our previous conclusion that Combine results do not predict future success in the NFL. If coaching staffs and front offices know that the Combine does not predict success, then there is no reason for them to use it to choose when to draft players either.

We further explored this result by examining the subset of players who attended the Combine but did not get drafted, comparing their results on the Combine drills to participants who did get drafted. There were no significant differences between groups, as can be seen in the following box and whisker plot. The distribution of results in each drill appears to be the same between drafted and non-drafted combine participants. This further supported our conclusion that coaches do not take into account the Combine when determining who to draft.

Rplot

Draft Pick Position and NFL Success

Based on background knowledge of the NFL, we hypothesized that better players do get drafted earlier. Looking at a basic scatter plot of pick number versus Career AV per season, this hypothesis was supported.

Rplot01

We also examined whether players drafted earlier would have better career performance measured not only by Career AV per season, but also by position-specific metrics. We created scatterplots for multiple position-specific metrics vs. pick number and found that earlier picks did tend to have stronger NFL performances. We include two scatterplots as example evidence:

This first graph plots cornerbacks’ average tackles per season vs. pick number.

Rplot02

This second graph plots linebackers’ average sacks per season vs. pick number.

Rplot03

We decided to formalize the correlation between Career AV per season and draft pick number further by assigning each pick number a “value” based on how well players drafted in that position have historically done compared to those drafted before and after them.

The most famous draft pick valuation was done by Cowboys’ coach Jimmy Johnson in the 1990s to help his team’s management decide how to properly evaluate trades involving draft picks. For example, he valued the #1 overall pick at 3000. This means that under his system, trading away the #1 pick requires a return of draft picks whose total value is greater than or equal to 3000 (for example, the third overall pick valued at 2200 grouped with the twentieth overall pick valued at 850, a trade with a net value of +50).
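The trade arithmetic in that example can be written out as a small Python sketch. Only the three chart values quoted above are included, and the function name is hypothetical.

```python
# Jimmy Johnson chart values for the three picks mentioned in the text.
JOHNSON_VALUES = {1: 3000, 3: 2200, 20: 850}


def trade_net(picks_received, picks_given, chart=JOHNSON_VALUES):
    """Net chart value of a trade: value received minus value given up."""
    received = sum(chart[pick] for pick in picks_received)
    given = sum(chart[pick] for pick in picks_given)
    return received - given


# Trading away #1 for #3 and #20: 2200 + 850 - 3000
net = trade_net(picks_received=[3, 20], picks_given=[1])
```

Here `net` is 50, so under this chart the team giving up the first overall pick comes out slightly ahead.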

It must be noted that Coach Johnson did not use any rigorous statistics to support his pick valuation. Indeed, his valuation has been heavily critiqued since then, and thus we decided to compare our results to his and see if we agree with the critics.

First, we grouped together players by draft pick number, and averaged the Career AV per season across all players in each group. We did not include players drafted in 2015 or 2016, since there is not yet enough data to infer how well they will truly perform in the NFL. Thus, for most pick numbers, we had 16 players drafted in that position in the years between 1999 and 2014.

When analyzing any type of sports data, sampling bias is very important to consider. Better players are given more playing time and thus accumulate better metrics to measure the value they add to their team. This leads to a non-representative sample from the population of NFL players, as worse players who are given little to no playing time are under-represented. In order to take into account players who did not last an entire season in the NFL (an important group of draft “busts” that must be represented), we give them a Career AV per season of 0 to indicate the absence of value added to their team.
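The grouping and the treatment of busts can be sketched together in Python. The field names ("year", "pick", "av_per_season") and the `last_year` parameter are hypothetical, and we assume a missing AV value is how a player who never completed a season shows up in the data.

```python
from collections import defaultdict


def average_av_by_pick(players, last_year=2014):
    """Average Career AV per season for each overall pick number.

    Players drafted after `last_year` are excluded. A missing AV (None)
    is counted as 0 so that busts still pull the average down.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for player in players:
        if player["year"] > last_year:
            continue
        av = player.get("av_per_season")
        totals[player["pick"]] += av if av is not None else 0.0
        counts[player["pick"]] += 1
    return {pick: totals[pick] / counts[pick] for pick in counts}
```

Dropping the busts instead of counting them as 0 would inflate the average for every pick number, and most of all for the later picks where busts are more common.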

draftvalue_unscaled

Above, we plotted the Average Career AV per Season for each individual pick number across the 16 drafts in our dataset. A non-linear inverse relationship is evident in the scatter plot. Using a log transformation of the dependent variable, we ran a linear regression to find the best-fit line through these points. This yielded a statistically significant model in which a substantial 81.2% of the variation in log(Average Career AV per Season) was explained by Pick Number.
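A log-linear fit of this kind can be reproduced with ordinary least squares on the log-transformed averages. The Python sketch below is illustrative rather than our R code; the function names are hypothetical, and it assumes every average is strictly positive, since the logarithm of 0 is undefined.

```python
import math


def fit_log_linear(picks, avg_avs):
    """Least-squares fit of log(avg AV) = intercept + slope * pick.

    Returns (intercept, slope, r_squared). Assumes all averages are > 0.
    """
    logs = [math.log(value) for value in avg_avs]
    n = len(picks)
    mean_x = sum(picks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in picks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(picks, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(picks, logs))
    return intercept, slope, 1.0 - ss_res / ss_tot


def predicted_value(pick, intercept, slope):
    """Back-transform the fitted line into a (raw, unscaled) pick value."""
    return math.exp(intercept + slope * pick)
```

A negative slope on the log scale corresponds to an exponential decay in value as the pick number increases, which matches the curved shape of the scatter plot.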

We then used this model to generate a new valuation of each pick position in a draft and compared this to Jimmy Johnson’s original values in the following graph for easy visual comparison. To facilitate comparison, we scaled our values so that the sum of picks 1 to 224 was equal between the two systems. For exact numbers in our and Jimmy Johnson’s valuations, see Appendix A.
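The rescaling step amounts to multiplying every one of our values by a single constant. A minimal Python sketch, assuming both valuations are stored as dictionaries keyed by pick number (the function name is hypothetical):

```python
def rescale_to_match(our_values, reference_values):
    """Scale our pick values so their total equals the reference total.

    Only picks present in both systems are summed, so both totals cover
    the same set of picks (picks 1 to 224 in our comparison).
    """
    shared = sorted(set(our_values) & set(reference_values))
    factor = sum(reference_values[p] for p in shared) / sum(our_values[p] for p in shared)
    return {pick: value * factor for pick, value in our_values.items()}
```

Because every value is multiplied by the same factor, the ratio between any two picks in our valuation is unchanged. Only the units change, which is what makes the two curves directly comparable.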

value_compare

Compared to our valuation, Jimmy Johnson severely overvalued early picks and also undervalued later picks. However, while our valuation is based in quantitative analysis unlike Coach Johnson’s, we neglect to take into account factors like a specific team’s desire for players of a specific position. For example, if two teams are both weak at the quarterback position and there is one standout quarterback in the draft, having the higher draft pick of the two is far more important than pure Career AV per season would indicate. Thus, it is likely that the true valuation is somewhere between the two systems.

Conclusion

We concluded that, contrary to our hypothesis, Combine results alone cannot accurately predict either draft pick or eventual success in the NFL. This result was frustrating, because we had hoped to create a predictive model to grade future NFL drafts. However, better players do get drafted earlier, implying that coaches look at other indicators, such as college statistics or traditional qualitative scouting, to identify better players. While pure physical ability is important in football, it seems that all players who are capable of playing at the highest levels have sufficient athleticism, and what really matters are intangibles like situational awareness and on-field decision making – actual football skills. Thus, according to our analysis, the Combine alone is not a useful event for evaluating talent.

There are several opportunities for further exploration and analysis to be done on evaluating the NFL Draft. Although we failed to find a relationship between the Combine and career success, it is worth exploring the integration of college performance and other qualitative analyses such as sentiment analysis of traditional scouting reports into a predictive model. Both this and a similar model to predict when a player will be drafted would be of great use to an NFL franchise when preparing and participating in the entire NFL Draft process.

Additionally, draft pick valuation could be extended to better reflect reality. There are inequalities in how different positions are valued in the league, such as quarterbacks being valued more than usual due to their important leadership role in an offense, and these deviations could be integrated into the valuation system. This valuation system could also be extended to reflect the different year-to-year positional needs of each team – a dynamic model that could be tailored to a specific year that could help a team more accurately value draft picks when considering trades.

 

Appendix A

In the following chart, for each round of draft picks, the first column is the pick number, the second is Jimmy Johnson’s valuation of that pick, and the third is our valuation.

draftvalues_smaller

Further Exploration: Can NFL career success be predicted based on Combine results?

Last week, we failed to find any strong correlations between Combine results and CareerAV (Career Approximate Value). We realized that CareerAV may not be the best way to quantify “success,” especially since it compares athletes across different positions. Thus, this week we investigated whether we could use Combine results to predict position-specific career metrics (e.g. tackles per season for linebackers, passing yards per season for quarterbacks).

We failed to find notable correlations for any position. Below is one example set of scatterplots. It plots cornerbacks’ interceptions per season vs. Combine results.

Rplot

Is there a significant difference in career success between NFL players who participated in the Combine and NFL players who did not?

Each year, approximately 250 athletes are selected in the NFL draft. Out of these 250, approximately 215 attended the Combine a few months beforehand. We hypothesized that the Combine participants would get picked earlier and go on to have greater success in the NFL.

From our Combine and draft data from 1999-2015, there were a total of 3620 drafted players who attended the Combine and 712 drafted players who did not attend.

Rplot04

Note: in all the charts above, Combine participants are represented by the beige boxplots, while non-participants are represented by the green boxplots.

From these plots, we could see that Combine participants were generally picked earlier in the draft and went on to have better career success, measured both in terms of CareerAV and position-specific metrics. Another notable finding was that all of the NFL “superstars” (represented by the highest outliers) were Combine participants.

These trends suggest that for an individual player, simply getting invited to the Combine, rather than performance at the actual event, is a good predictor of draft pick and career achievement. However, it is also important to note that Combine attendance does not guarantee success, since each year, there are approximately 125 athletes who attend the Combine but do not get drafted.

Can NFL career success be predicted based on draft results?

In last week’s exploration, we also found that Combine results did not seem to be correlated with draft pick or CareerAV. However, draft pick did seem to be correlated with CareerAV:

Rplot01

From this scatterplot, we could see that generally, players drafted earlier tend to have better performance in the NFL.

This week we wanted to determine whether players drafted earlier would have better career performance measured not only by CareerAV, but also by position-specific metrics.

After creating scatterplots for multiple position-specific metrics vs. pick number, we found that earlier picks did tend to have stronger NFL stats. We include two scatterplots as example evidence:

This first graph plots cornerbacks’ average tackles per season vs. pick number.

Rplot02

This second graph plots linebackers’ average sacks per season vs. pick number.

Rplot03

From this exploration, we found that draft pick could be used to predict career success (measured by either CareerAV or position-specific career metrics). However, neither CareerAV nor position-specific career metrics correlate with Combine results. These findings suggest that although teams pick players strategically during the draft, the pick order is not highly influenced by Combine results. We expect that when teams draft players, college performance is a more important factor.

 

Draft Pick Valuation

We decided to explore the correlation between CareerAV per season and draft pick number further by assigning each pick number a “value” based on how well players drafted in that position have historically done compared to those drafted before and after them.

The most famous draft pick valuation was done by Cowboys’ coach Jimmy Johnson in the 1990s to help his team decide how to properly evaluate trades involving draft picks. For example, he valued the #1 overall pick at 3000. This means that under his system, trading away the #1 pick requires a return of draft picks whose total value is greater than or equal to 3000 (for example, the third overall pick valued at 2200 grouped with the twentieth overall pick valued at 850, a trade with a net value of +50).

It must be noted that Coach Johnson did not use any rigorous statistics to support his pick valuation. Indeed, his valuation has been heavily critiqued since then, and thus we decided to compare our results to his and see if we agree with the critics.

First, we grouped together players by draft pick number, and averaged the CareerAV per season across all players in each group. We did not include players drafted in 2015 or 2016, since there is not yet enough data to infer how well they will perform in the NFL. Thus, for most pick numbers, we had 16 players drafted in that position in the years between 1999 and 2014.

When analyzing any type of sports data, sampling bias is very important to consider. Better players are given more playing time and thus accumulate better metrics to measure the value they add to their team. This leads to a non-representative sample from the population of NFL players, as worse players who are given little to no playing time are under-represented. In order to take into account players who did not last an entire season in the NFL (an important group of draft “busts” that must be represented), we give them a CareerAV per season of 0 to indicate the absence of value added to their team.

draftvalue_unscaled

Here we plotted the Average Career AV per Season for each individual pick number across the 16 drafts in our dataset. A non-linear inverse relationship is evident in the scatter plot. Using a log transformation of the dependent variable, we ran a linear regression to find the best-fit line through these points. This yielded a statistically significant model in which a substantial 81.2% of the variation in log(Average Career AV per Season) was explained by Pick Number.

We then used this model to generate a new valuation of each pick position in a draft and compared this to Jimmy Johnson’s original values. In the following chart, for each round of draft picks, the first column is the pick number, the second is Jimmy Johnson’s valuation of that pick, and the third is our valuation. To facilitate comparison, we scaled our values so that the sum of picks 1 to 224 was equal between the two systems.

draftvalues_smaller

value_compare

It is evident that Jimmy Johnson severely overvalued early picks and also undervalued later picks. In addition, our findings confirm the hypothesis that teams do draft better players earlier, perhaps based on college performance or other traditional scouting methods.

 

Bonus: An added twist to the last visualization

Screen Shot 2016-05-03 at 11.53.56 PM

See our last post for details on how to read the above visualization. In the graphic above, blue shapes represent average player stats while red shapes indicate top player stats.

 

Group Members: Caroline Malin-Mayor (cmalinma), Monica-Ann Mendoza (momendoz), Tyler Devlin (tddevlin), Victor Li (vcli)

Question 1: Can NFL career success be predicted based on Combine results?

The primary goal of our project is to determine whether NFL career success can be predicted based on Combine results. To quantify success, we are using a player’s Weighted Career Approximate Value divided by number of seasons played – we will refer to this metric as CareerAV. CareerAV is a continuous variable, and higher values correspond to greater success.

The Combine includes the following information about a player: height, weight, 40-yard dash time, vertical jump, bench press, broad jump, 3-cone drill time, and shuttle run time. We hoped to use these features (either all or a subset) in linear regression in order to predict a player’s future CareerAV.

To determine whether we could build an accurate prediction model, we began by investigating correlations between each attribute and CareerAV. The scatterplots below reveal that there were no clear relationships. Numerical measures of correlations ranged from -0.15 to 0.12, suggesting that there were no linear associations.

Rplot01

Despite our scatterplot and correlation results suggesting that there were no predictable trends, we investigated what would result if we built a multiple regression model nevertheless.
This is the output of our R code to create a model involving all physical metrics.

Rplot02

The low R-squared value indicates that our regression model was a poor fit, as expected. It explained only 4% of the variation in CareerAV. Additionally, although some of our coefficient estimates were statistically significant, we doubt their accuracy, since some suggested that worse Combine performance (i.e. slower 40-yard dash and 3-cone drill times) would lead to better CareerAV.

Overall, we did not find evidence that NFL career success could accurately be predicted by Combine results using linear regression.

Question 1a: After grouping players by position, can NFL career success be predicted based on Combine results?

We hypothesized that we might be able to build more accurate regression models once we grouped athletes by position, since the importance of certain physical traits differs across roles (e.g. agility more crucial for wide receivers, strength more crucial for linemen).

This investigation also yielded negative results: there were no strong linear trends between Combine results and CareerAV, so regression models were inappropriate.

Question 2: Can NFL draft pick be predicted based on Combine results?

Our other primary question was whether we could predict NFL draft pick using Combine results. A scatterplot of CareerAV vs. Pick suggested that there was potentially a weak inverse relationship.

Rplot03.jpeg

Because the Combine serves as a scouting camp for NFL draft prospects, we presumed that stronger performers in the Combine would get drafted earlier. However, the data did not support our expectation:

Rplot04

These scatterplots suggested that Combine results did not have a strong impact on draft picks. There was a relatively uniform distribution of pick numbers across all levels of Combine performance.

Question 2a: After grouping players by position, can Combine results predict when a player will be drafted?

We again tried to see if analyzing positions separately would allow us to make better predictions. However, Combine results and pick numbers remained uncorrelated.

Question 3: What is the “shape” of an NFL player? (D3 Visualization)

Screen Shot 2016-04-26 at 9.46.01 PM

In the visualization above, we sought to determine how different positions varied in their performance on Combine drills. The six points of each hexagon represent six different Combine measurements. The six points of the blue polygon represent the average performance of players from a given position on each drill. Better performance is indicated by a more expansive polygon. Here are some interesting things we noticed:

  • Wide receivers and cornerbacks have very similar shapes. This is interesting because a cornerback’s primary role is to guard wide receivers.
  • Free safeties also have similar shapes because their primary role is to guard receivers as backup for cornerbacks.
  • Strong safeties and free safeties play similar roles, but strong safeties have the additional task of helping out with run defense which could explain their slightly higher bench performance.
  • Some positions are less specialized, i.e. closer to a regular hexagon. Fullbacks, tight ends, and outside/inside linebackers all have fairly balanced measurements. This agrees with their roles on the field, since they all have to be good at multiple things.
  • Finally, offensive linemen and defensive tackles mostly need to be really strong, which explains their “pointy” shape.

Question 4: In the NFL draft, are some positions valued more than others?

We hoped to determine whether certain positions were in higher demand from teams during the draft. If that hypothesis were true, we would expect to see certain positions having earlier pick numbers overall.

These boxplots revealed that teams tended to pick long snappers (LS), kickers (K), and punters (P) in later rounds of the draft. This makes sense given the specialized nature of the positions and how few roster spots are given to them. Otherwise, there was little notable variation in Pick number across positions. There was not a sizable difference between skill positions (QB-quarterback, RB-running back, WR-wide receiver) and non-skill positions.

Rplot05

Question 5: Is there a difference in Combine performance between athletes who were drafted and those who were not?

Not all athletes who participate in the Combine get drafted later in the year. Thus, we wanted to determine whether poor performers in the Combine would be less likely to get drafted. However, once we split our dataset into drafted vs. non-drafted players, we did not see large differences in performance between the two groups:

Rplot.jpeg

Thus, we believe that other factors played a larger role in determining whether a player was drafted. For example, an athlete’s college performance could well be a more influential predictor.

Next week, we plan to investigate whether a machine learning algorithm can better differentiate between drafted and non-drafted athletes.

Overall Conclusions

This initial exploration casts some doubt on how useful the NFL Combine actually is for evaluating college prospects. More analysis is needed to examine the standalone utility of the NFL Combine.

Next Steps

We failed to predict CareerAV from Combine performance. We realize, though, that CareerAV may not be the best measure of a player’s “success.” Thus, we would like to investigate whether we can predict position-specific career metrics (e.g. career sacks, career interceptions, career touchdowns, etc.).

We also plan to determine whether some teams value the Combine more than others. If so, we would expect to see correlations between Combine performance and Pick number when examining those teams individually.

Additionally, we hope to explore whether there is a significant difference between NFL players who attended the Combine and those who did not (each year, only a fraction of drafted players were invited to the Combine beforehand).

Group Members

Caroline Malin-Mayor (cmalinma)

Monica-Ann Mendoza (momendoz)

Tyler Devlin (tddevlin)

Victor Li (vcli)

Data Scraping, Cleaning, Integration

In this first week of the project, we focused on scraping, cleaning, and integrating relevant data from our sources. We then defined the scope of our investigation, namely by deciding which metric best measures an NFL player’s career “success”, regardless of position. Lastly, we began some basic exploratory analysis of our data.

The first dataset we retrieved contains data from the NFL Combine workouts of every player who participated from 1999-2015. The specific variables include:

  • Player Name
  • Position
  • School
  • Height (inches)
  • Weight (pounds)
  • 40 Yard Dash Time (seconds)
  • Vertical Jump (inches)
  • Bench Press Repetitions at 225 pounds (repetitions)
  • Broad Jump (inches)
  • 3 Cone Drill (seconds)
  • Shuttle Run (seconds)
  • Team Drafted by
  • Draft Round
  • Overall Pick Number
  • Draft Year

NFL Combine data from 2000-2015 was scraped from http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi. This involved copy/pasting each individual year’s Combine results into a text file, which luckily yielded tab-separated values.

Using R, we then cleaned the 2000-2015 data in the following manner. For a given year, we:

  • Converted the file into comma-separated values
  • Deleted unnecessary headers (header row was repeated a few times in the values)
  • Converted height from “feet-inches” string format into an integer of total height in inches
  • Deleted an unnecessary column of hyperlinks for “College Stats”
  • Converted the single column of “Team / Round / Pick Number / Year” into four separate columns
  • Cleaned the “Round” column from string format into integer format (e.g. “4th” into “4”)
  • Cleaned the “Pick Number” column in a similar fashion (e.g. “204th pick” into “204”)
  • Appended this clean year-specific .csv to the master .csv that eventually contained clean Combine data from 2000-2015
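The string conversions in the list above can be sketched as follows. This is an illustrative Python version, not our actual cleaning script (which was written in R). The function names, the assumption that undrafted players have an empty draft cell, and the example team name are all hypothetical.

```python
import re


def height_to_inches(text):
    """Convert a "feet-inches" string such as "6-2" into total inches."""
    feet, inches = text.strip().split("-")
    return int(feet) * 12 + int(inches)


def leading_int(text):
    """Extract the leading integer from strings like "4th" or "204th pick"."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no leading integer in {text!r}")
    return int(match.group(1))


def split_draft_column(text):
    """Split a "Team / Round / Pick Number / Year" cell into four typed fields.

    Assumption: undrafted players have an empty cell, which maps to None.
    """
    if not text.strip():
        return None
    team, round_text, pick_text, year_text = [p.strip() for p in text.split("/")]
    return team, leading_int(round_text), leading_int(pick_text), int(year_text)


def drop_repeated_headers(rows, header):
    """Keep only data rows, discarding any row identical to the header row."""
    return [row for row in rows if row != header]


# Example usage with made-up values.
height = height_to_inches("6-2")
draft_fields = split_draft_column("Example Team / 6th / 204th pick / 2000")
data_rows = drop_repeated_headers(
    [["Player", "Pos"], ["Player A", "WR"], ["Player", "Pos"], ["Player B", "QB"]],
    header=["Player", "Pos"],
)
```

Keeping each conversion in its own small function makes it easy to spot-check a handful of rows by hand before appending a year to the master file.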

To add the 1999 NFL Combine data (the oldest we could find), we used Daren Willman’s dataset from his website, http://nflsavant.com/index.php. Luckily, his data was already formatted well. Using R, we simply extracted the relevant columns that matched up with our 2000-2015 dataset and appended it to our master .csv to complete our first full dataset of NFL Combine data from 1999-2015.

Our second dataset contains all players who were drafted in the 1999-2015 NFL Drafts, along with their draft position (team, round, pick number) and the NFL career statistics that help us measure how successful a player was in the league. We retrieved this data from http://www.pro-football-reference.com/draft/, year-by-year from 1999-2015. For a given year, we downloaded the .csv of draft results and cleaned it in R as follows:

  • Deleted unnecessary headers (header row was repeated a few times in the values)
  • Renamed some headers for clarification purposes (e.g. Passing Yards vs. Rushing Yards)
  • Deleted an unnecessary column of hyperlinks for “College Stats”
  • Appended this clean year-specific .csv to the master .csv that eventually contained NFL Draft data from 1999-2015 for our complete second dataset

Our second dataset of NFL Draft data includes the following variables:

  • Draft Year
  • Round
  • Overall Pick Number
  • Team Drafted by
  • Player Name
  • Position
  • Age
  • Most Recent Year Played in NFL
  • All-Pro 1st Team Selections
  • Pro Bowl Selections
  • Number of Years as Primary Team Starter
  • Weighted Career Approximate Value
  • Cumulative Approximate Value for Drafting Team
  • Career Games Played
  • Career Passing Completions
  • Career Passing Attempts
  • Career Passing Yards
  • Career Passing Touchdowns
  • Career Passing Interceptions
  • Career Rushing Attempts
  • Career Rushing Yards
  • Career Rushing Touchdowns
  • Career Receptions
  • Career Reception Yards
  • Career Reception Touchdowns
  • Career Tackles
  • Career Defensive Interceptions
  • Career Sacks
  • School

However, not all players who were drafted in those years participated in the Combine, and not all players who participated in the Combine were drafted into the NFL. Thus, we had to match the players’ Combine statistics to their Draft and career data.

To do this, we imported the data into Microsoft Excel and wrote a macro to integrate the two datasets. Essentially, the macro performed a left outer join of the NFL Combine dataset with the NFL Draft dataset, matching on draft year and overall draft pick number. Specifically, we used two nested for loops: the outer loop went through all the players in the Combine dataset and selected their draft year and overall draft pick number, while the inner loop went through all the draft data looking for an entry with a matching draft year and overall draft pick number. When a match was found, the macro added the data from the draft table to the player’s Combine entry and broke out of the inner loop, moving on to the next Combine entry. The result was a table with 3,795 players who participated in the Combine and were subsequently drafted, along with 1,806 players with only Combine data who were never drafted.
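For readers who want to reproduce the join outside Excel, here is a minimal Python sketch of the same left outer join. It is not our macro: instead of nested loops, it indexes the draft rows in a dictionary keyed by (draft year, pick number), so each Combine row needs only one lookup. The field names `draft_year`, `pick`, `forty`, and `career_av`, the added `drafted` flag, and the toy rows are hypothetical.

```python
def left_join_on_pick(combine_rows, draft_rows):
    """Left outer join of Combine rows with draft rows on (draft_year, pick).

    Every Combine row is kept. Rows with a matching draft entry gain the
    draft columns; rows without one are flagged as undrafted.
    """
    # Build the lookup table once, replacing the inner loop of a nested scan.
    draft_by_key = {(d["draft_year"], d["pick"]): d for d in draft_rows}

    joined = []
    for row in combine_rows:
        merged = dict(row)  # copy so the input rows are not modified
        key = (row.get("draft_year"), row.get("pick"))
        match = draft_by_key.get(key) if None not in key else None
        merged["drafted"] = match is not None
        if match is not None:
            for name, value in match.items():
                # Keep the Combine value when both tables share a column.
                merged.setdefault(name, value)
        joined.append(merged)
    return joined


# Toy rows (made up, not taken from our dataset).
combine = [
    {"name": "Player A", "forty": 4.45, "draft_year": 2005, "pick": 12},
    {"name": "Player B", "forty": 4.80, "draft_year": None, "pick": None},
]
draft = [
    {"draft_year": 2005, "pick": 12, "career_av": 40},
    {"draft_year": 2005, "pick": 13, "career_av": 8},
]
result = left_join_on_pick(combine, draft)
```

Matching on draft year and pick number rather than on player name avoids problems with duplicate or inconsistently spelled names, since each (year, pick) pair identifies exactly one drafted player.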

 

Preliminary Exploration of “Success” Metric

As a metric for an NFL player’s career “success”, we tentatively plan to use Weighted Career Approximate Value divided by Number of Seasons Played. Approximate Value (AV) is an advanced metric that measures the value a specific player adds to their team. It is meant to be used as a quick approximation of a player’s skill level and contributions when on the field. The higher a player’s AV, the better. The following scale provides a rough sketch of what AV translates into:

  • Bust: AV < -25
  • Poor: -24 < AV < -15
  • Disappointing: -14 < AV < -5
  • Satisfactory: -4 < AV < +4
  • Good: +5 < AV < +14
  • Very Good: +15 < AV < +24
  • Excellent: AV > +25

Source: https://codeandfootball.wordpress.com/tag/approximate-value/

Weighted Career Approximate Value is a weighted sum of a player’s AV by season over their career: 100% of their best season, 95% of their second-best season, 90% of their third-best, and so on. This weighting favors players with a few outstanding seasons among some mediocre ones over players with consistent performance across multiple seasons. Dividing by seasons played then allows players who have played different numbers of seasons to be compared easily. We will continue to research this metric’s pros and cons and evaluate whether it is a suitable measure of success to attempt to predict with NFL Combine results. We may also need to set some threshold of games played or seasons played to ensure reasonable results.
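To make the weighting concrete, here is a small Python sketch of the calculation as described above. The function names and the season values are made up for illustration, and flooring the weight at zero for careers longer than 20 seasons is our assumption rather than something stated in the metric’s definition.

```python
def weighted_career_av(season_avs):
    """Weighted sum of season AVs: 100% of the best, 95% of the next, and so on."""
    ranked = sorted(season_avs, reverse=True)
    # Assumption: the weight is floored at 0 once it would turn negative.
    return sum(av * max(0.0, 1.0 - 0.05 * i) for i, av in enumerate(ranked))


def weighted_av_per_season(season_avs):
    """Tentative success metric: weighted career AV divided by seasons played."""
    if not season_avs:
        raise ValueError("player has no seasons")
    return weighted_career_av(season_avs) / len(season_avs)


# Two made-up careers with the same raw AV total of 24.
peaky = weighted_career_av([16, 4, 4])      # 16*1.00 + 4*0.95 + 4*0.90
steady = weighted_career_av([8, 8, 8])      # 8*1.00 + 8*0.95 + 8*0.90
per_season = weighted_av_per_season([10, 8, 6])
```

Both toy careers add up to the same unweighted AV, but the career with one standout season ends up with the slightly higher weighted value, which illustrates the bias toward peak seasons.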

 

Preliminary Exploratory Analysis

To wrap up the week, we began some exploratory analysis on our full dataset. One of the more interesting findings is shown in the bar plot below: the 25 schools that produced the most drafted players from 1999-2015.

 

top25schools.png
