April 2016 – Valuing the NFL Draft

Question 1: Can NFL career success be predicted based on Combine results?

The primary goal of our project is to determine whether NFL career success can be predicted based on Combine results. To quantify success, we are using a player’s Weighted Career Approximate Value divided by number of seasons played – we will refer to this metric as CareerAV. CareerAV is a continuous variable, and higher values correspond to greater success.

The Combine includes the following information about a player: height, weight, 40-yard dash time, vertical jump, bench press, broad jump, 3-cone drill time, and shuttle run time. We hoped to use these features (either all or a subset) in linear regression in order to predict a player’s future CareerAV.

To determine whether we could build an accurate prediction model, we began by investigating correlations between each attribute and CareerAV. The scatterplots below reveal no clear relationships. Correlation coefficients ranged from -0.15 to 0.12, indicating at most very weak linear associations.
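As a rough illustration of the correlation check described above, here is a minimal Python sketch of a Pearson correlation that skips players with a missing measurement. Our actual analysis was done in R; the function name `pearson_r` and the toy values are made up for this example and are not our data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, ignoring pairs where either value is missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: 40-yard dash times and CareerAV for five players.
forty_times = [4.4, 4.5, None, 4.7, 5.1]
career_av = [3.0, 1.5, 2.0, 2.5, 1.0]
r = pearson_r(forty_times, career_av)
print(f"r = {r:.2f}")
```

A value near 0, like the -0.15 to 0.12 range we observed, means that knowing the Combine measurement tells you very little about CareerAV, at least in a linear sense.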

Rplot01

Although the scatterplots and correlations suggested no predictable trends, we still fit a multiple regression model to see what would result. Below is the R output for a model that includes all of the physical metrics.

Rplot02

The low R-squared value indicates that our regression model was a poor fit, as expected: it explained only 4% of the variation in CareerAV. Additionally, although some of our coefficient estimates were statistically significant, we doubt that they are meaningful, since some implied that worse Combine performance (i.e. slower 40-yard dash and 3-cone drill times) would lead to a higher CareerAV.
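For readers unfamiliar with R-squared, the sketch below shows how it is computed from actual and predicted values. This is a generic Python illustration with made-up numbers, not our R model; the function name `r_squared` is chosen for this example.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after using the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error from always guessing the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when actual values are constant")
    return 1 - ss_res / ss_tot


# Hypothetical CareerAV values and two sets of predictions.
actual = [1.0, 2.0, 3.0]
print(r_squared(actual, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(actual, [2.0, 2.0, 2.0]))  # always guessing the mean
```

An R-squared of about 0.04 therefore means the model does only slightly better than guessing the average CareerAV for every player.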

Overall, we did not find evidence that NFL career success could be accurately predicted from Combine results using linear regression.

Question 1a: After grouping players by position, can NFL career success be predicted based on Combine results?

We hypothesized that we might be able to build more accurate regression models once we grouped athletes by position, since the importance of certain physical traits differs across roles (e.g. agility more crucial for wide receivers, strength more crucial for linemen).

This investigation also yielded negative results: there were no strong linear trends between Combine results and CareerAV, so regression models were inappropriate.

Question 2: Can NFL draft pick be predicted based on Combine results?

Our other primary question was whether we could predict NFL draft pick using Combine results. A scatterplot of CareerAV vs. Pick suggested that there was potentially a weak inverse relationship.

Because the Combine serves as a scouting camp for NFL draft prospects, we presumed that stronger performers in the Combine would get drafted earlier. However, the data did not support our expectation:

Rplot04

These scatterplots suggested that Combine results did not have a strong impact on draft picks. There was a relatively uniform distribution of pick numbers across all levels of Combine performance.

Question 2a: After grouping players by position, can Combine results predict when a player will be drafted?

We again tried to see if analyzing positions separately would allow us to make better predictions. However, Combine results and pick numbers remained uncorrelated.

Question 3: What is the “shape” of an NFL player? (D3 Visualization)

In the visualization above, we sought to show how positions differ in their performance on Combine drills. The six points of each hexagon represent six different Combine measurements. The six points of the blue polygon represent the average performance of players from a given position on each drill. Better performance is indicated by a more expansive polygon. Here are some interesting things we noticed:

Wide receivers and cornerbacks have very similar shapes. This is interesting because a cornerback’s primary role is to guard wide receivers.
Free safeties have a similar shape as well, since their primary role is to back up cornerbacks in guarding receivers.
Strong safeties and free safeties play similar roles, but strong safeties have the additional task of helping out with run defense, which could explain their slightly higher bench performance.
Some positions are less specialized, i.e. closer to a regular hexagon. Fullbacks, tight ends, and outside/inside linebackers all have fairly balanced measurements. This agrees with their roles on the field, since they all have to be good at multiple things.
Finally, offensive linemen and defensive tackles mostly need to be really strong, which explains their “pointy” shape.

Question 4: In the NFL draft, are some positions valued more than others?

We hoped to determine whether certain positions were in higher demand from teams during the draft. If that hypothesis were true, we would expect to see certain positions having earlier pick numbers overall.

Rplot05

These boxplots revealed that teams tended to pick long snappers (LS), kickers (K), and punters (P) in later rounds of the draft. This makes sense given the specialized nature of the positions and how few roster spots are given to them. Otherwise, there was little notable variation in Pick number across positions. There was not a sizable difference between skill positions (QB-quarterback, RB-running back, WR-wide receiver) and non-skill positions.

Question 5: Is there a difference in Combine performance between athletes who were drafted and those who were not?

Not all athletes who participate in the Combine get drafted later in the year. Thus, we wanted to determine whether poor performers in the Combine would be less likely to get drafted. However, once we split our dataset into drafted vs. non-drafted players, we did not see large differences in performance between the two groups:

Thus, we believe that other factors played a larger role in determining whether a player was drafted. For example, an athlete’s college performance may well be a more influential predictor.

Next week, we plan to investigate whether a machine learning algorithm can better differentiate between drafted and non-drafted athletes.

Overall Conclusions

This initial exploration casts some doubt on how useful the NFL Combine actually is for evaluating college prospects. More analysis is needed to examine the standalone utility of the NFL Combine.

Next Steps

We failed to predict CareerAV from Combine performance. We realize, though, that CareerAV may not be the best measure of a player’s “success.” Thus, we would like to investigate whether we can predict position-specific career metrics (e.g. career sacks, career interceptions, career touchdowns, etc.).

We also plan to determine whether some teams value the Combine more than others. If so, we would expect to see correlations between Combine performance and Pick number when examining those teams individually.

Additionally, we hope to explore whether there is a significant difference between NFL players who attended the Combine and those who did not (each year, only a fraction of drafted players had been invited to the Combine beforehand).

Group Members

Caroline Malin-Mayor (cmalinma)

Monica-Ann Mendoza (momendoz)

Tyler Devlin (tddevlin)

Victor Li (vcli)

Data Scraping, Cleaning, Integration

In this first week of the project, we focused on scraping, cleaning, and integrating relevant data from our sources. We also defined the scope of our investigation, namely by deciding on the best metric for measuring an NFL player’s career “success”, regardless of position. Lastly, we began some basic exploratory analysis on our data.

The first dataset we retrieved contains NFL Combine workout data for every player who participated from 1999 to 2015. The specific variables include:

Player Name
Position
School
Height (inches)
Weight (pounds)
40 Yard Dash Time (seconds)
Vertical Jump (inches)
Bench Press Repetitions at 225 pounds (repetitions)
Broad Jump (inches)
3 Cone Drill (seconds)
Shuttle Run (seconds)
Team Drafted by
Draft Round
Overall Pick Number
Draft Year

NFL Combine data from 2000-2015 was scraped from http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi. This involved copy/pasting each individual year’s Combine results into a text file, which luckily yielded tab-separated values.

Using R, we then cleaned the 2000-2015 data in the following manner. For a given year, we:
Converted the file into comma-separated values
Deleted unnecessary headers (header row was repeated a few times in the values)
Converted height from “feet-inches” string format into an integer number of total inches
Deleted an unnecessary column of hyperlinks for “College Stats”
Converted the single column of “Team / Round / Pick Number / Year” into four separate columns
Cleaned the “Round” column from string format into integer format (e.g. “4th” into “4”)
Cleaned the “Pick Number” column in a similar fashion (e.g. “204th pick” into “204”)
Appended this clean year-specific .csv to the master .csv that eventually contained clean Combine data from 2000-2015
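The string-cleaning steps above can be sketched as follows. We did this work in R, so the Python below is only an illustration; the helper names (`height_to_inches`, `leading_int`, `split_draft_field`) are hypothetical, and the exact layout of the combined draft column (for example, "Team / 4th / 204th pick / 2000") is an assumption based on the description above.

```python
import re


def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = height.strip().split("-")
    return int(feet) * 12 + int(inches)


def leading_int(text):
    """Pull the leading integer out of strings like '4th' or '204th pick'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no leading integer in {text!r}")
    return int(match.group(1))


def split_draft_field(field):
    """Split the combined draft column into team, round, pick, and year.

    Assumes the four parts are separated by '/' characters. An empty field
    (an undrafted player) yields None for every part.
    """
    if not field.strip():
        return {"team": None, "round": None, "pick": None, "year": None}
    parts = [part.strip() for part in field.split("/")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {field!r}")
    team, round_text, pick_text, year_text = parts
    return {
        "team": team,
        "round": leading_int(round_text),
        "pick": leading_int(pick_text),
        "year": int(year_text),
    }


# Hypothetical example row, not taken from our dataset.
print(height_to_inches("6-2"))
print(split_draft_field("Example Team / 4th / 204th pick / 2010"))
```

Returning None for undrafted players keeps those rows in the table, which matters later when Combine data is matched against draft data.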

To add the 1999 NFL Combine data (the oldest we could find), we used Daren Willman’s dataset from his website, http://nflsavant.com/index.php. Luckily, his data was already formatted well. Using R, we simply extracted the relevant columns that matched up with our 2000-2015 dataset and appended it to our master .csv to complete our first full dataset of NFL Combine data from 1999-2015.

Our second dataset covers all players who were drafted in the 1999-2015 NFL Drafts, along with their draft position (team, round, pick number) and the NFL career statistics that can help us measure how successful a player was in the league. We retrieved this data from http://www.pro-football-reference.com/draft/, year-by-year from 1999-2015. For a given year, we downloaded the .csv of draft results and cleaned it in R as follows:

Deleted unnecessary headers (header row was repeated a few times in the values)
Renamed some headers for clarification purposes (e.g. Passing Yards vs. Rushing Yards)
Deleted an unnecessary column of hyperlinks for “College Stats”
Appended this clean year-specific .csv to the master .csv that eventually contained NFL Draft data from 1999-2015 for our complete second dataset

Our second dataset of NFL Draft data includes the following variables:

Draft Year
Round
Overall Pick Number
Team Drafted by
Player Name
Position
Age
Most Recent Year Played in NFL
All-Pro 1st Team Selections
Pro Bowl Selections
Number of Years as Primary Team Starter
Weighted Career Approximate Value
Cumulative Approximate Value for Drafting Team
Career Games Played
Career Passing Completions
Career Passing Attempts
Career Passing Yards
Career Passing Touchdowns
Career Passing Interceptions
Career Rushing Attempts
Career Rushing Yards
Career Rushing Touchdowns
Career Receptions
Career Reception Yards
Career Reception Touchdowns
Career Tackles
Career Defensive Interceptions
Career Sacks
School

However, not all players who were drafted in those years participated in the Combine, and not all players who participated in the Combine were drafted into the NFL. Thus, we had to match the players’ combine statistics to their Draft and career data.

To do this, we imported the data into Microsoft Excel and wrote a macro to integrate the two datasets. Essentially, the macro performed a left outer join of the NFL Combine dataset with the NFL Draft dataset, matching on draft year and overall draft pick number. Specifically, we used two nested for loops: the outer loop went through all the players in the Combine data and selected their draft year and overall draft pick number, while the inner loop went through all the draft data looking for an entry with a matching draft year and overall draft pick number. When a match was found, the macro added the data from the draft table to the Combine entry for the player and broke out of the inner loop, moving on to the next Combine entry. The result was a table with 3,795 players who participated in the Combine and were subsequently drafted, along with 1,806 players with only Combine data who were never drafted.
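For reference, the same left outer join can be sketched in a few lines of Python. This is not our Excel macro: it swaps the inner loop for a dictionary lookup keyed on (draft year, pick number), and the column names (`draft_year`, `pick`, `career_av`) and toy rows are assumptions made for the example.

```python
def left_join(combine_rows, draft_rows):
    """Attach draft data to each Combine row, keeping undrafted players."""
    # Index the draft table once by (draft year, overall pick number).
    draft_by_key = {}
    for row in draft_rows:
        key = (row["draft_year"], row["pick"])
        # Keep the first entry per key, mirroring the macro's early exit.
        draft_by_key.setdefault(key, row)

    joined = []
    for row in combine_rows:
        key = (row.get("draft_year"), row.get("pick"))
        # Undrafted players have no year or pick, so they never match.
        match = None if None in key else draft_by_key.get(key)
        merged = dict(row)
        if match is not None:
            for column, value in match.items():
                # Add draft columns without overwriting Combine columns.
                merged.setdefault(column, value)
        merged["drafted"] = match is not None
        joined.append(merged)
    return joined


# Hypothetical toy rows, not taken from our dataset.
combine = [
    {"name": "Player A", "draft_year": 2010, "pick": 5},
    {"name": "Player B", "draft_year": None, "pick": None},
]
draft = [
    {"draft_year": 2010, "pick": 5, "career_av": 40},
    {"draft_year": 2010, "pick": 6, "career_av": 12},
]
for merged_row in left_join(combine, draft):
    print(merged_row)
```

The dictionary lookup visits each table once, whereas the nested loops scan the draft table for every Combine player. On a few thousand rows either approach finishes quickly, so the difference is mostly one of simplicity.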

Preliminary Exploration of “Success” Metric

To define a metric for an NFL player’s career “success”, we tentatively plan to use Weighted Career Approximate Value divided by Number of Seasons Played. Approximate Value (AV) is an advanced metric that measures the value a specific player adds to their team. It is meant to be used as a quick approximation of a player’s skill level and contributions when on the field. The higher a player’s AV, the better. The following scale provides a rough sketch of what AV translates into:

Bust: AV < -25
Poor: -24 < AV < -15
Disappointing: -14 < AV < -5
Satisfactory: -4 < AV < +4
Good: +5 < AV < +14
Very Good: +15 < AV < +24
Excellent: AV > +25

Source: https://codeandfootball.wordpress.com/tag/approximate-value/

Weighted Career Approximate Value is a weighted sum of a player’s season-by-season AV over their career: 100% of their best season, plus 95% of their second-best season, 90% of their third-best, and so on. This weighting favors players with a few outstanding seasons among mediocre ones over players with consistent performance across many seasons. Dividing by seasons played then allows players with careers of different lengths to be compared easily. We will continue to research this metric’s pros and cons and evaluate whether it is a suitable measure of success to attempt to predict with NFL Combine results. We may also need to set a threshold of games played or seasons played to ensure reasonable results.
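To make the weighting concrete, here is a minimal Python sketch of the calculation as described above. The function names and the per-season AV values are invented for illustration, and clamping the weight at zero for very long careers is our assumption, since the description only says "etc."

```python
def weighted_career_av(season_avs):
    """Weighted sum of season AVs: 100% of the best, 95% of the next, and so on."""
    ranked = sorted(season_avs, reverse=True)
    total = 0.0
    for rank, av in enumerate(ranked):
        # Weight drops 5 points per rank; clamped at 0 (an assumption).
        weight = max(0, 100 - 5 * rank) / 100
        total += weight * av
    return total


def career_av_per_season(season_avs):
    """Our CareerAV metric: weighted career AV divided by seasons played."""
    if not season_avs:
        raise ValueError("player has no recorded seasons")
    return weighted_career_av(season_avs) / len(season_avs)


# Two hypothetical players with the same unweighted total AV of 22.
streaky = [16, 2, 2, 2]
steady = [5.5, 5.5, 5.5, 5.5]
print(career_av_per_season(streaky))
print(career_av_per_season(steady))
```

In this toy comparison the streaky player ends up with the higher CareerAV even though both players have the same raw total, which is the bias toward peak seasons noted above.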

Preliminary Exploratory Analysis

To wrap up the week, we began some exploratory analysis on our full dataset. One of the more interesting finds is detailed in the bar plot below, which shows the 25 schools that produced the most drafted players from 1999-2015.

Group members: Caroline Malin-Mayor (cmalinma), Monica-Ann Mendoza (momendoz), Victor Li (vcli), Tyler Devlin (tddevlin)