The Optimal NBA Roster Updated

4 Sep


The Optimal Roster


As shown in the movie Moneyball, data analytics plays an ever-growing role in the decisions made by professional sports teams.

My concept of the “Optimal Roster” was developed after an Optimization class at Carnegie Mellon’s Tepper School of Business.  In a particular class discussion, my professor attempted to maximize the return of a bond portfolio by choosing an ideal group of five bonds, out of a possible ten, given a specified dollar amount to invest.   In order to do this, he used Excel Risk Solver to help identify the optimal solution/portfolio of bonds to choose.

This discussion led me to the optimal roster concept.  What if I replaced bonds with NBA players?  What if instead of an investment portfolio, I optimized the selection of players within the constraints of a team’s Salary Cap?  What if instead of considering bond coupons (interest), maturity and risk, we looked at a player’s points, rebounds and assists?  Lastly, instead of having to choose five bonds out of a possible ten, what if we had to choose the 15 players that comprise the ‘Optimal Roster’ from a pool of every player currently in the NBA?

My concept of the “Optimal Roster” uses the Excel Risk Solver software to create an optimal NBA roster of 15 players while staying within the confines of the salary cap, or a potential luxury-tax level, as defined by the NBA’s 2011 collective bargaining agreement (CBA).

Using John Hollinger’s GameScore calculation, averaged over an 82-game regular season and adjusted to a per-48-minutes basis, I was able to rank players by position.  Then, adding in their salary information, I set constraints in Risk Solver stating that the total money spent on the 15 players cannot exceed a given limit; in this case, I used a ceiling based on the average NBA team salary in the 2012-13 season.  An additional constraint requires that exactly one player at each position is chosen for the starting lineup and that each roster carries two backup players at each position.  Given that each team may carry 15 players but can only have 12 active for a game, it made sense to build the model to accommodate all 15.
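To make the constraint structure concrete, here is a minimal Python sketch of the search, using a tiny invented pool (four candidates per position; all names, per-48 GameScore figures, and salaries below are hypothetical). The real model runs in Excel Risk Solver over the full NBA player pool:

```python
from itertools import combinations, product

# Hypothetical pool: (name, per-48 GameScore, salary in $M) by position.
pool = {
    "PG": [("PG1", 22.1, 14.0), ("PG2", 18.5, 6.5), ("PG3", 14.2, 2.0), ("PG4", 11.0, 1.0)],
    "SG": [("SG1", 21.0, 13.0), ("SG2", 16.8, 5.0), ("SG3", 13.5, 2.5), ("SG4", 10.2, 1.0)],
    "SF": [("SF1", 23.4, 16.0), ("SF2", 17.1, 7.0), ("SF3", 12.9, 2.0), ("SF4", 10.8, 1.0)],
    "PF": [("PF1", 20.6, 12.0), ("PF2", 16.0, 6.0), ("PF3", 13.1, 3.0), ("PF4", 9.7, 1.0)],
    "C":  [("C1", 19.9, 11.0), ("C2", 15.4, 5.5), ("C3", 12.0, 2.5), ("C4", 9.1, 1.0)],
}
CAP = 58.0  # salary ceiling in $M (stand-in for the 2012-13 league-average payroll)

def best_roster(pool, cap):
    """Pick 3 players per position (1 starter + 2 backups, 15 in all),
    maximizing total per-48 GameScore subject to the salary ceiling."""
    best_players, best_score = None, float("-inf")
    per_pos = [list(combinations(players, 3)) for players in pool.values()]
    for trios in product(*per_pos):                  # every 3-per-position roster
        roster = [p for trio in trios for p in trio]
        if sum(sal for _, _, sal in roster) <= cap:  # budget constraint
            score = sum(gs for _, gs, _ in roster)   # objective: total GameScore
            if score > best_score:
                best_players, best_score = roster, score
    return best_players, best_score

roster, score = best_roster(pool, CAP)
```

With a realistically sized pool this exhaustive search explodes combinatorially, which is exactly why the post reaches for an integer-programming solver like Risk Solver instead.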

The main focus of this research is to understand small variances in spending and to place a framework around how a team can statistically maximize its roster to meet given expectations.  In other words, if half a percent more of the salary cap were dedicated to the starting five players, how much greater does total team PER become?  Does it make sense to potentially overspend and dip into the luxury tax just to gain more productivity from players?  Or can you find ways to save money while still getting productive output from your lineup?  Lastly, if a team were to lose a few players in the offseason to free agency, could we take a pool of undrafted collegiate prospects, international NBA-eligible players and current NBA free agents and use this model to find the ideal solutions to the vacated roster holes?  These are the ideas being explored with this model.

The last question in the preceding paragraph depends largely on the ability to standardize collegiate and international players against current NBA players in order to compare apples to apples.  This would require a projection model to predict the output of college players at the NBA level, and a similar conversion model to translate international player productivity to the NBA game.  In the current optimization model, we are only using a pool of current NBA players to build the ideal roster because the framework for the projection models is not yet in place.  I wanted to take a moment to acknowledge these next steps, which I have yet to complete but certainly plan to pursue.

If you have any questions or thoughts on this model and research, please contact me at jjhabval@tepper.cmu.edu.

Jordan Jhabvala 

The Optimal NBA Roster

1 Apr


Following the movie Moneyball, data analytics has played an ever-growing role in the decisions made by professional sports teams.

My concept of the “Optimal Roster” was developed after an Optimization class at Carnegie Mellon’s Tepper School of Business.  In a particular class discussion, my professor attempted to maximize the return of a bond portfolio by choosing an ideal group of five bonds, out of a possible ten, given a specified dollar amount to invest.   In order to do this, he used Excel Risk Solver to help identify the optimal solution/portfolio of bonds to choose.

This discussion led me to the optimal roster concept.  What if I replaced bonds with NBA players?  What if instead of an investment portfolio, I optimized the selection of players within the constraints of a team’s Salary Cap?  What if instead of considering bond coupons (interest), maturity and risk, we looked at a player’s points, rebounds and assists?  Lastly, instead of having to choose five bonds out of a possible ten, what if we had to choose five players, one at each position, out of all players currently in the NBA?

My concept of the “Optimal Roster” uses the Excel Risk Solver software to create an optimal starting lineup for an NBA team while staying within the confines of a team’s current salary cap.

Using a simplified version of John Hollinger’s PER calculation, I was able to rank players by position.  Then, adding in their salary information, I set constraints in Risk Solver stating that the total money spent on the five players cannot exceed a specified percentage of the salary cap.  Also, only one player at each position may be chosen, with the objective of maximizing total team PER.
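As a rough sketch of this five-player formulation (one starter per position under a salary budget), the snippet below does the same thing by exhaustive search; the names, PER values, and salary figures are invented for illustration, and the real model uses Risk Solver over the full player pool:

```python
from itertools import product

# Hypothetical pool: (name, PER, salary in $M) by position.
pool = {
    "PG": [("PG-a", 24.0, 16.0), ("PG-b", 18.0, 7.0), ("PG-c", 13.0, 2.0)],
    "SG": [("SG-a", 22.5, 14.0), ("SG-b", 17.0, 6.0), ("SG-c", 12.5, 1.5)],
    "SF": [("SF-a", 25.0, 17.0), ("SF-b", 19.0, 8.0), ("SF-c", 14.0, 3.0)],
    "PF": [("PF-a", 21.0, 13.0), ("PF-b", 16.5, 5.5), ("PF-c", 12.0, 2.0)],
    "C":  [("C-a", 23.0, 15.0), ("C-b", 18.5, 7.5), ("C-c", 13.5, 2.5)],
}
BUDGET = 0.55 * 58.0  # e.g. 55% of a $58M cap devoted to the starting five

# One player per position, total salary within budget, maximize total PER.
feasible = (
    lineup for lineup in product(*pool.values())
    if sum(sal for _, _, sal in lineup) <= BUDGET
)
best = max(feasible, key=lambda lineup: sum(per for _, per, _ in lineup))
```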

The main focus of this research is to be able to understand small variances in spending.  That is, if half a percent more of the salary cap were dedicated to the starting five players, how much greater would total team PER become?  Does it make sense to potentially overspend and dip into the luxury tax just to gain more productivity from players?  Or can you find ways to save money while still getting productive output from your lineup?  These are the ideas being explored with this model.

The ideal scenario in this case is to be able to build a team of 15 players; however, the student version of Risk Solver limits us to 200 variables, so this isn’t possible yet.  Also, the next step in this research is to account for rookie salaries versus veteran deals, which certainly skew the results.  Nonetheless, the framework of the model is in place, and I will continue to post more once I am able to do more of the variance research that comes from having more variables to work with.

If you have any questions or thoughts on this model and research, please contact me at jjhabval@tepper.cmu.edu.

Jordan Jhabvala 

Red Sox Roster Optimization

27 Jan

Now that San Francisco has won the World Series, the offseason has officially started.  Call me crazy, but I enjoy watching my team, the Red Sox, come together in the offseason almost as much as I enjoy watching the season.  This is especially true for me given the long season that the Red Sox had.

With the offseason comes speculation.  Who goes where?  Who signs the big contract?  The speculation has become more quantitative in recent years.  However, there is much less conversation around building the best team.  Of the team conversations that do happen, they tend to be very qualitative in nature and come spring training it’s hard to tell if offseason goals have been accomplished.

Since I’ve been learning all sorts of optimization methods, I thought it would be fun to see if I could use these skills to optimize a baseball team’s offseason.  So, I made an effort to optimize a lineup for the Red Sox based on their current roster and the available free agents.  To start, I have only optimized the offensive side of the roster.  There are a couple of reasons for this that I will get into later, but for now I have an optimization model for the lineup and the bench.  The optimization is based on two things: OPS and salary.  OPS, while not universal or comprehensive, is a common measuring stick for a player’s offensive value, and salary is the common limiting factor for every team.  To illustrate the correlation between team OPS and runs, I ran a regression based on last year’s team data.  It should come as no surprise that the R^2 of OPS on runs is 97%, making OPS a good measure of offensive output.
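For anyone who wants to reproduce that kind of check, here is a small Python sketch of the R^2 calculation; the team OPS and run totals below are invented stand-ins, not the actual 2012 team data behind the 97% figure:

```python
# Invented team-level data for illustration only.
ops  = [0.780, 0.745, 0.712, 0.699, 0.670, 0.734, 0.758]
runs = [875,   800,   735,   710,   640,   765,   820]

def r_squared(x, y):
    """Squared Pearson correlation: the share of variation in y
    explained by a linear relationship with x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

r2 = r_squared(ops, runs)  # close to 1.0 when OPS tracks runs tightly
```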

So, I’ve put together a model that optimized the team’s OPS.  This model adds free agents to the current roster to determine the optimal lineup.  Taking every free agent into account would take too long to do for free, so my model is limited to the free agents listed below:

Name | OPS | 2012 Salary | Name | OPS | 2012 Salary
Ryan Lavarnway | 0.459 | $500,000 | Maicer Izturis (32) | 0.634 | $3,966,667
Guillermo Quiroz | 0.625 | $500,000 | Kelly Johnson (31) | 0.687 | $6,375,000
Jarrod Saltalamacchia | 0.742 | $2,500,000 | Jason Bartlett (33) | 0.433 | $5,500,000
Pedro Ciriaco | 0.705 | $500,000 | Yuniesky Betancourt (31) | 0.656 | $2,000,000
Mauro Gomez | 0.746 | $500,000 | Brian Bixler (30) | 0.583 | $500,000
Jose Iglesias | 0.391 | $500,000 | Ronny Cedeno (30) | 0.741 | $1,150,000
Dustin Pedroia | 0.797 | $8,250,000 | Marco Scutaro (37) | 0.753 | $6,000,000
Jacoby Ellsbury | 0.682 | $8,050,000 | Eric Chavez (35) | 0.845 | $900,000
Ryan Kalish | 0.625 | $500,000 | Brandon Inge (36) | 0.658 | $5,500,000
Che-Hsuan Lin | 0.500 | $500,000 | Jose Lopez (29) | 0.626 | $800,000
Daniel Nava | 0.742 | $500,000 | Scott Rolen (38) | 0.716 | $8,166,667
Will Middlebrooks | 0.835 | $500,000 | Travis Buck (29) | 0.595 | $580,000
David Ortiz | 1.026 | $14,575,000 | Melky Cabrera (28) | 0.906 | $6,000,000
Cody Ross | 0.807 | $3,000,000 | Jonny Gomes (32) | 0.868 | $1,000,000
Ivan DeJesus | 0.000 | $480,500 | Josh Hamilton (32) | 0.930 | $15,250,000
James Loney | 0.630 | $6,375,000 | Andruw Jones (36) | 0.701 | $2,000,000
Danny Valencia | 0.388 | $515,000 | Michael Bourn (30) | 0.739 | $6,845,000
Russell Martin (30) | 0.713 | $7,500,000 | Scott Hairston (33) | 0.803 | $1,100,000
Mike Napoli (31) | 0.812 | $9,400,000 | B.J. Upton (28) | 0.752 | $7,000,000
A.J. Pierzynski (36) | 0.827 | $6,000,000 | Shane Victorino (32) | 0.704 | $9,500,000
Kelly Shoppach (33) | 0.798 | $1,350,000 | Brian Bixler (30) | 0.583 | $500,000
Eric Hinske (35) | 0.583 | $1,600,000 | Travis Buck (29) | 0.595 | $580,000
Casey Kotchman (30) | 0.612 | $3,000,000 | Torii Hunter (37) | 0.817 | $18,500,000
Carlos Lee (37) | 0.697 | $19,000,000 | Nick Swisher (32) | 0.837 | $10,250,000
Carlos Pena (35) | 0.684 | $7,250,000 | | |

The use of these particular free agents is completely arbitrary; no free agent is left off for any particular reason.   Also, to simplify things for right now, I’m using last year’s OPS and salary for each free agent.  I understand that this is not ideal, but I will add these pieces into the equation at a later point.  To begin, I simply want to test the theory.  So, using this information I get the following offensive roster for the 2013 Red Sox.

Name | Position | OPS | 2012 Salary
Mike Napoli (31) | C | 0.812 | $9,400,000
Carlos Lee (37) | 1B | 0.697 | $19,000,000
Dustin Pedroia* | 2B | 0.797 | $8,250,000
Will Middlebrooks* | 3B | 0.835 | $500,000
Marco Scutaro (37) | SS | 0.753 | $6,000,000
Melky Cabrera (28) | OF | 0.906 | $6,000,000
Jonny Gomes (32) | OF | 0.868 | $1,000,000
Josh Hamilton (32) | OF | 0.930 | $15,250,000
David Ortiz | DH | 1.026 | $14,575,000
A.J. Pierzynski (36) | C | 0.827 | $6,000,000
Ronny Cedeno (30) | SS | 0.741 | $1,150,000
Eric Chavez (35) | 3B | 0.845 | $900,000
Scott Hairston (33) | OF | 0.803 | $1,100,000

Total Offensive Salary: $89,125,000
Average OPS: 0.834

*Under contract, not part of optimization.

There are a couple of things to note with this offensive group.  First, by design, there are no salary constraints in this first run.  That explains the $6 million backup catcher and the $90 million offensive payroll.  This is simply a first-pass sanity check to prove that the model will in fact pick out the optimal combination of players.  Also, only two current Red Sox are on this roster.  That is mostly because Pedroia and Middlebrooks are the only two currently signed Red Sox players whom you can write into the 2013 plan in ink.  While there are others under contract, there is at least some speculation around everyone else.  Plus, the model is more interesting with more moving parts.  I’ll replace “interesting” with “accurate” as I go through the steps of refining the model, but for a first pass I give preference to interesting.

OK, now that I’ve shown that the model works, let’s start refining it so that it is useful.  The first thing I will tackle is salary constraints.  There are two types built into the model: individual position salary constraints and a total salary constraint.  Since Pedroia and Middlebrooks are set, we don’t have to worry about them.  I’m going to cap every other position at $10 million, except for DH, because Ortiz has been identified as a priority.  I’m also going to cap each bench position at $5 million.  Also, given the trade during the season, I’m going to assume that the Red Sox are not going to break the bank this offseason.  Taking this into consideration, I’m going to set the total offensive salary at $60 million.  Using these constraints, we get the following results:

Name | Position | OPS | Salary
A.J. Pierzynski (36) | C | 0.827 | $6,000,000
Carlos Pena (35) | 1B | 0.684 | $7,250,000
Dustin Pedroia | 2B | 0.797 | $8,250,000
Will Middlebrooks | 3B | 0.835 | $500,000
Marco Scutaro (37) | SS | 0.753 | $6,000,000
David Ortiz | DH | 1.026 | $14,575,000
Melky Cabrera (28) | OF | 0.906 | $6,000,000
Jonny Gomes (32) | OF | 0.868 | $1,000,000
Cody Ross | RF/LF | 0.807 | $3,000,000
Kelly Shoppach (33) | C | 0.798 | $1,350,000
Ronny Cedeno (30) | SS | 0.741 | $1,150,000
Eric Chavez (35) | 3B | 0.845 | $900,000
Scott Hairston (33) | OF | 0.803 | $1,100,000

Total Salary: $57,075,000
Average OPS: 0.822

This looks more like a potential 2013 roster than the first one did.  Obviously, there are a lot of different iterations that can be run by playing with the salary constraints, each of which can change the roster.
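To see the per-position caps at work on a single slot, the sketch below applies the $10 million starter cap and $5 million bench cap to the catcher candidates from the tables above (positions are labeled here for illustration; the actual model optimizes all thirteen slots jointly in Risk Solver):

```python
# Catcher candidates: (name, OPS, 2012 salary in $M), values from the tables above.
catchers = [
    ("Mike Napoli (31)",      0.812, 9.40),
    ("A.J. Pierzynski (36)",  0.827, 6.00),
    ("Jarrod Saltalamacchia", 0.742, 2.50),
    ("Kelly Shoppach (33)",   0.798, 1.35),
]
STARTER_CAP, BENCH_CAP = 10.0, 5.0  # $M, the per-position caps described above

# Highest-OPS catcher under the starter cap, then the best remaining one
# under the bench cap.
starter = max((c for c in catchers if c[2] <= STARTER_CAP), key=lambda c: c[1])
backup = max((c for c in catchers if c is not starter and c[2] <= BENCH_CAP),
             key=lambda c: c[1])
print(starter[0], "/", backup[0])  # -> A.J. Pierzynski (36) / Kelly Shoppach (33)
```

This greedy per-slot pick happens to match the constrained run above, where Pierzynski starts and Shoppach backs up; in general the total-salary constraint forces trade-offs across positions that only the joint optimization captures.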

So, to some extent I’ve proven that an optimization model can be used as a tool for developing a roster, which means that I will continue to develop the model.  At this point there are two roads this analysis can go down.  One is to do all the research ahead of the transactions (this would involve predicting 2013 OPS and salary) and use this as a predictive model.  The other is to see what the Red Sox and other teams actually do this offseason and use that information to analyze the Red Sox offseason.  I’m choosing the latter for now.  So, I will continue to refine and add data as decisions are made and post updates at interesting points during the offseason.

-Erik Clark

Sources:

www.espn.com/mlb

www.mlbtraderumors.com

http://www.baseballprospectus.com/compensation/cots/

Weekly Sports Analytics Round-Up

16 Jul

I hesitated to include the “weekly” moniker on this post since it has been a good 11 weeks since the last sports analytics round-up.  But as you probably guessed, we have all been busy impressing potential future employers at our summer internships.  So, without further ado, here are this edition’s links:

 

Fascinating read about applying network theory to analyzing ball movement during the 2010 World Cup and 2012 Euro Cup.  http://www.technologyreview.com/view/428399/pagerank-algorithm-reveals-soccer-teams/

 

Interesting article comparing NFL Draft and NBA Draft pick values accounting for the difference in the number of starters between the two sports: http://www.thebiglead.com/index.php/2012/06/27/comparing-the-nba-draft-to-the-nfl-draft/

I, for one, am wary of people who still hate on LeBron because of the Decision or because he had to partner with a “Big Three” to win a title.  As for the latter, this article places Miami’s “other two” in perspective with other championship teams.  As for the former, I would only ask: how many stupid decisions did you make when you were 25?

http://www.thebiglead.com/index.php/2012/06/22/the-miami-heat-still-have-decisions-to-make-if-they-want-to-repeat-or-threepeat/

 

-DaveCaughman

Weekly Sports Analytics Round-up

29 Apr

In light of the recent NFL draft, Grantland ran a nice article postulating a new method for determining the “net value” of draft picks over or below what is expected from a player drafted in that position.

http://www.grantland.com/story/_/id/7849206/how-tell-which-draft-picks-truly-valuable

 

Not recent, but a very interesting sabermetrician profile that just came across my radar.

http://www.thepostgame.com/features/201101/sabermetrician-exile

 

The guys over at HSAC find out that April wins in baseball mean more than you might think (sorry Phils fans)

http://harvardsportsanalysis.wordpress.com/2012/04/20/how-important-is-a-good-april/#more-3111

 

-DaveCaughman

 

MLB Closer Analysis – Understanding Statistical Components

12 Apr

With baseball back in season, I started to think of my team (the Red Sox) and their closer situation.  They are transitioning to a new closer which made me wonder how they came to this decision.  To get an idea of what makes a good closer I did some research on closer statistics.  Ultimately, your ability as a closer comes down to how many save opportunities are converted.  So, I analyzed save percentage (saves/save opportunities) as a function of various independent variables with the following results:

Table 1

Variable | Probability | R-squared
BAA | 0.0311 | 0.137126
BABIP | 0.7301 | 0.003770
BB/9 | 0.4358 | 0.019089
ERA | 0.0001 | 0.285438
K/9 | 0.2874 | 0.035279
K/BB | 0.0976 | 0.083421
OBP | 0.0155 | 0.159534
OPS | 0.0005 | 0.315516
SLG | 0.0017 | 0.268218
WHIP | 0.0174 | 0.164343

Interpreting this information first requires an explanation of the table.  Probability here is the p-value: roughly, the likelihood of seeing a relationship this strong by chance alone, so lower is better; generally, you want to see a number less than 0.05.  R-squared can be roughly described as the percentage of the total outcome that can be attributed to the independent variable, and it ranges from 0.0 to 1.0.  For example, OBP has a probability of 0.0155, which means OBP is significant, and an R-squared value of 0.159534, which means that OBP accounts for 15.95% of the total variability in save percentage.  Below, I removed all insignificant variables and ordered the rest by R-squared.

Table 2

Variable | Probability | R-squared
OPS | 0.0005 | 0.315516
ERA | 0.0001 | 0.285438
SLG | 0.0017 | 0.268218
WHIP | 0.0174 | 0.164343
OBP | 0.0155 | 0.159534
BAA | 0.0311 | 0.137126
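The step from Table 1 to Table 2 is a mechanical filter-and-sort, which can be verified directly from the (Probability, R-squared) pairs in Table 1:

```python
# (p-value, R-squared) pairs taken straight from Table 1.
results = {
    "OPS": (0.0005, 0.315516), "ERA": (0.0001, 0.285438),
    "SLG": (0.0017, 0.268218), "WHIP": (0.0174, 0.164343),
    "OBP": (0.0155, 0.159534), "BAA": (0.0311, 0.137126),
    "K/BB": (0.0976, 0.083421), "K/9": (0.2874, 0.035279),
    "BB/9": (0.4358, 0.019089), "BABIP": (0.7301, 0.003770),
}
ALPHA = 0.05  # the significance threshold used in the text

# Keep significant variables, then rank by explanatory power (R-squared).
significant = {var: r2 for var, (p, r2) in results.items() if p < ALPHA}
ranked = sorted(significant, key=significant.get, reverse=True)
print(ranked)  # -> ['OPS', 'ERA', 'SLG', 'WHIP', 'OBP', 'BAA'], matching Table 2
```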

There are a couple of interesting takeaways from this information.  The first takeaway is to recognize what stats are missing.  K/9 appears to have no significance in converting saves.  This is kind of a strange outcome because closers are generally thought of as big strikeout guys.  BB/9 and K/BB are also missing.  This is a little surprising because, even though closers aren’t necessarily control guys, you would expect a closer’s results to be somewhat dependent on these two variables.  The final missing stat is BABIP (batting average on balls in play).  This is something that is starting to be talked about more in the baseball world, but apparently has little effect on save percentage.

The next takeaway is the relative importance of each variable.  According to my research, OPS is the most significant stat with respect to save percentage.  This is somewhat surprising because I don’t think I’ve ever heard this stat with respect to pitchers.  However, after absorbing this information for a minute, it should not be that surprising.  OPS has grown in popularity over recent years as a way of measuring a hitter’s performance, so it would stand to reason that you could measure a pitcher’s value by the OPS that hitters have against him.  The next most important variable is an old-time statistic, ERA.  As it turns out, the old guard in baseball are bigger stat geeks than they may care to admit.  But, again, this makes sense, because if ERA didn’t matter then it wouldn’t have been such a popular stat for so long.  One stat that I was surprised by is WHIP.  This is my favorite newer pitcher stat because I thought it encompassed most of the areas where ERA fell short.  As it turns out, WHIP is only the 4th most significant stat that I analyzed, far less important than I would have anticipated.

So, now that we have all this information we should apply some of it.  First, let’s look at the closers who might be over their heads in their 2011 roles.  Here are the highest OPS values for closers with more than 10 saves:

Name | OPS
Jon Rauch | 0.799
Huston Street | 0.781
Kevin Gregg | 0.773
Matt Capps | 0.726
Frank Francisco | 0.721
Joakim Soria | 0.709

Of these six, Frank Francisco is the only one with a save so far this year and the only clear-cut closer.  However, Francisco changed teams and leagues and also signed a very reasonable $5.5 million contract, all of which may contribute to him continuing as a closer.  To be fair, Soria is out for the season with an elbow injury (perhaps his uncharacteristically poor 2011 could have been an indication of the arm injury), otherwise he’d be closing for the Royals.  The other 4 are either setup guys or in a situation where it’s not completely clear.  So, for the most part, the empirical evidence supports the theoretical research.

That’s it for my analysis this time around, but I intend to do some research on multi-variable analysis in the near future.

-Erik Clark

The 2012 Kentucky Wildcats – They Are Who We Thought They Were

6 Apr

Well, my regression model correctly predicted Kentucky winning it all.  So what?  Sometimes the data just confirms what we already believe.  While not as exciting as correctly predicting a 4 seed to win it all, the results nonetheless speak for themselves.  Overall, my model-based bracket placed 6th out of 27 in the first annual Tepper Sports Fanalytics Club Tourney Pick Em.  This equated to the 83rd percentile among Yahoo! brackets.  While not perfect, it beats the heck out of my usual bottom-feeder brackets.

But how should we really define “success” with an analytical-based bracket?  Picking the winner is nice, but I’m going to hold out and say success should be finishing “in the money”, which typically means 3rd place and higher in your pool.  That being said, you usually have to pick the winner to cash out, so we’re on the right track.

So it’s time to put a bow on the 2011-12 college basketball season.  Now we have baseball and the NBA playoffs to look forward to.  Stay tuned for more blogs along those lines in the near future!

-DaveCaughman

March Madness 2012: Trying to Bring Some Sanity to the Madness

14 Mar

Nothing gets me fired up like hearing the CBS March Madness intro.  It’s been a mainstay on my GATA (workout) playlist for years, and whenever it comes on I think back to the Laettners and George Masons of the world while trying to forget how many times Vanderbilt has broken my heart.  The seemingly unpredictable nature of the tournament is its biggest asset, but it also got me thinking: is there any way to better predict who is going to win at each stage of the tournament?  Having never won a tournament pool in my life, I have a financial motivation to get to the bottom of this as well.  Using the handy tool of regression analysis and a weird fondness for drinking coffee, listening to Pandora and plugging numbers into spreadsheets, I set about answering that very question.

Data Collection

Dependent Variable:  Since we are trying to predict tournament wins, I went back 5 tournaments (to the 2006-07 season) and for all 64 teams (leaving out the recent play-in games) gathered data for how many games each team won in each tournament.  Some quick math will tell you there are 320 observations in total.

Independent Variables:  As badly as I wanted to discover some overlooked statistical category right off the bat, starting with some basic stat lines seemed like the best way to peel back the onion.  As a disciple of Dave Berri and Dean Oliver, I used offensive and defensive efficiency as a starting point.  To keep things simple, I threw in RPI to level the playing field.

Sources:

1) Teamrankings.com for offensive efficiency, defensive efficiency and RPI.  For these stats I went back and made sure I captured them immediately before the tournament began.

2) Wikipedia for tournament wins each year.  Now that I routinely see my professors citing Wikipedia for class slides, I think we’ve removed the final hurdle to Wikipedia legitimacy.

Results/Insights:

Variable | Coefficient | Std Error | P-value
Constant | -7.493745 | 2.090886 | 0.0004
OFF_EFF | 4.949186 | 1.733313 | 0.0046
DEF_EFF | -5.416243 | 1.764366 | 0.0023
RPI | 14.092079 | 2.126283 | 0.0000

Above are the results of the regression.  As you can see, all variables are significant predictors of tournament wins.  Together they account for 36% of the variation in tournament wins (adjusted R-squared).

To put these numbers in perspective, the range of offensive efficiencies for these 5 tournaments is [1.17,.909].  This is the difference between the 2007 Florida national champs and the #224 RPI-ranked Mississippi Valley State Delta Devils of 2008.  With a 5 year average offensive efficiency of 1.065, the difference between tournament average and greatness is only .106.  Thus, using the coefficients from the above regression, a marginal increase by this amount results in roughly .5 additional tournament victories.  An analysis of defensive efficiency yields similar results.

The range in RPI over these 5 tournaments is [.688,.463].  This is the difference between the 2010 Kansas Jayhawks (who lost in the second round) and, once again, the 2008 Mississippi Valley State Delta Devils.  With a 5 year average of .592, moving from an average RPI to the best results in an estimated 1.4 additional tournament wins.  More on RPI later.
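Since the fitted model is a straight linear combination, any team’s predicted wins, and the marginal effects quoted above, can be recomputed from the coefficients (listed in full in the 2012 bracket section below):

```python
# Regression coefficients from the fitted model.
INTERCEPT, B_OFF, B_DEF, B_RPI = -7.49374487, 4.949185836, -5.416243101, 14.09207946

def predicted_wins(off_eff, def_eff, rpi):
    """Predicted tournament wins as a linear function of the three inputs."""
    return INTERCEPT + B_OFF * off_eff + B_DEF * def_eff + B_RPI * rpi

# 2012 Kentucky's pre-tournament numbers reproduce the table below.
print(round(predicted_wins(1.135, 0.873, 0.665), 4))  # -> 2.7664

# Marginal effects quoted in the text:
print(round(B_OFF * 0.106, 2))  # -> 0.52: +0.106 off. efficiency adds ~0.5 wins
print(round(B_RPI * (0.688 - 0.592), 2))  # -> 1.35: avg-to-best RPI adds ~1.4 wins
```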

Underdog/Overseeded profiles

Regression results aside, what we really want to do is pick March Madness teams.  Before we do that, let’s take a look at what the analysis tells us about future Cinderellas and upset-prone teams.  The table below can help shed some light on this.

Observations Avg Off_Eff Avg Def_Eff Avg RPI
First Round Upsets
10-16 seed winners (underdogs who won) 30 1.055 0.944 0.578
1-7 seed losers (favorites who lost) 30 1.065 0.949 0.608
Second Round Upsets
6-16 seed winners (underdogs who won) 18 1.075 0.951 0.592
1-3 seed losers (favorites who lost) 13 1.095 0.933 0.636
Overall Avg 320 1.065 0.947 0.592
Standard Deviation 0.046 0.041 0.041
Best 1.17 0.816 0.688
Worst 0.909 1.068 0.463
Range 0.261 0.252 0.225

For First Round Upsets, I think it’s easier to look at the favorites who lost – that is, teams seeded 1-7 that lost in the first round.  Their offensive and defensive efficiencies are around the Overall Average.

For Second Round Upsets, looking at the profile of the underdogs sheds some interesting light.  Teams seeded 6-16 that advanced to the Sweet 16 had offensive efficiencies well above average while only slightly below average defensive efficiencies.

Based on this segmentation, it might be beneficial to profile each first and second round matchup as described above in order to wisely predict upsets.

Offense wins Championships

It appears as though offense is rewarded more in the tournament.  There are a number of observations that support this.

Most importantly, in four of the five years under observation, the tournament winner has had a higher rated offense than defense.  Between 2007-11, the winners had the number 1, 1, 2, 6 and 34 rated offenses in that tournament and the 9, 3, 28, 5 and 40 rated defenses respectively.  In other words, teams with a relative strength in offense won the tournament in 4 of the last 5 years.  The only team that did not was a well-balanced Duke in 2010, who had the #6 offense and #5 defense in that tournament.

Second, in a single-variable regression, offensive efficiency has a higher R^2 value than defensive efficiency (15% versus 10%).

Finally, as noted above, it seems that if you want to be a good Cinderella story, offensive efficiency is more important than defensive efficiency.

The 2012 Bracket

Now the fun begins.  The results of my regression to predict tournament wins for the 2012 field are summarized below.

Coefficients
C -7.49374487
Off_Eff 4.949185836
Def_Eff -5.416243101
RPI% 14.09207946
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.766433671
2012 Syracuse 1 1.103 0.894 0.667 2.522502778
2012 North Carolina 1 1.1 0.898 0.658 2.359161533
2012 Michigan State 1 1.079 0.884 0.652 2.246503557
2012 Ohio State 2 1.104 0.872 0.638 2.237939008
2012 Kansas 2 1.09 0.9 0.642 2.073363917
2012 Missouri 2 1.184 0.969 0.629 1.981669579
2012 Duke 2 1.109 0.979 0.651 1.866343958
2012 Wichita State 5 1.112 0.907 0.622 1.862490714
2012 Memphis 8 1.081 0.906 0.62 1.686298038
2012 Marquette 3 1.056 0.921 0.633 1.664521778
2012 Baylor 3 1.08 0.949 0.634 1.645739511
2012 Indiana 4 1.131 0.96 0.618 1.613096043
2012 Wisconsin 4 1.055 0.873 0.61 1.595434434
2012 New Mexico 5 1.067 0.877 0.607 1.590883453
2012 Murray State 6 1.087 0.913 0.612 1.565342815
2012 Georgetown 3 1.04 0.898 0.62 1.526711363
2012 Gonzaga 7 1.079 0.922 0.609 1.434726902
2012 Florida State 3 1 0.895 0.626 1.429545136
2012 Saint Louis 9 1.061 0.881 0.599 1.42678673
2012 Saint Mary’s 7 1.124 0.957 0.606 1.425595518
2012 UNLV 6 1.061 0.927 0.616 1.417204898
2012 Louisville 4 0.988 0.878 0.621 1.391770641
2012 Creighton 8 1.159 1.007 0.608 1.356189026
2012 Vanderbilt 5 1.077 0.959 0.616 1.323072092
2012 Florida 7 1.127 0.976 0.603 1.295258218
2012 Harvard 12 1.046 0.885 0.59 1.204055254
2012 Michigan 4 1.062 0.991 0.621 1.145974923
2012 Temple 5 1.079 0.992 0.615 1.140142362
2012 Belmont 14 1.155 0.958 0.572 1.094473334
2012 California 12 1.062 0.916 0.586 1.058970374
2012 San Diego State 6 1.023 0.938 0.607 1.042728447
2012 Long Beach State 12 1.071 0.938 0.589 1.026631937
2012 S Dakota St 14 1.125 0.981 0.585 1.004621201
2012 Southern Miss 9 1.052 0.985 0.611 0.988059728
2012 Virginia 10 1.01 0.86 0.577 0.978093609
2012 VCU 12 1.024 0.895 0.584 0.956458258
2012 BYU 14 1.056 0.911 0.578 0.943619839
2012 Iona 14 1.14 0.994 0.579 0.923895351
2012 Alabama 9 0.999 0.894 0.586 0.866329014
2012 Iowa State 8 1.069 0.974 0.592 0.864025052
2012 Kansas State 8 1.026 0.915 0.579 0.787571371
2012 Cincinnati 6 1.031 0.921 0.579 0.779819841
2012 Connecticut 9 1.036 0.965 0.594 0.777632266
2012 New Mexico State 13 1.072 0.944 0.573 0.773610392
2012 Ohio 13 1.024 0.914 0.578 0.768997163
2012 Davidson 13 1.094 0.96 0.57 0.753556353
2012 Notre Dame 7 1.04 0.963 0.586 0.69552486
2012 Purdue 10 1.091 0.998 0.579 0.659720273
2012 Colorado State 11 1.07 1.03 0.598 0.650217101
2012 Montana 13 1.049 0.916 0.561 0.642328971
2012 Texas 11 1.061 0.97 0.577 0.634715345
2012 North Carolina St 11 1.058 0.981 0.577 0.560289114
2012 Lehigh 15 1.071 0.926 0.547 0.499759516
2012 Xavier 10 1.008 0.96 0.582 0.497031324
2012 West Virginia 10 1.047 0.967 0.57 0.483030917
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.365921937
2012 South Florida 12 0.961 0.937 0.578 0.332624864
2012 Colorado 11 0.994 0.953 0.572 0.32473563
2012 Loyola Maryland 15 1.016 0.959 0.557 0.189739068
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.176751633
2012 Lamar 16 1.033 0.937 0.536 0.097098906
2012 UNC Asheville 16 1.099 1.012 0.541 0.087987336
2012 Vermont 16 1.034 0.931 0.516 -0.14729604
2012 Detroit 15 1.037 0.994 0.524 -0.36093516
2012 Norfolk State 15 0.983 0.943 0.522 -0.38014696
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.74136547
2012 Western Kentucky 16 0.93 0.972 0.487 -1.29274764

First observation: lots of chalk.  The top 4 teams are the one seeds and the next 4 are the two seeds, so filling out a bracket based strictly on my predicted wins will yield no surprises in the Elite 8.  In fact, if I only use the model’s predicted wins to determine who wins each game, I will have only 3 upsets in the entire tournament.

If we loosen the rules a little to account for some of the upset-prone and Cinderella profiles described above, we see the following.

First-Round Upset-Prone Teams:

Notre Dame

Marquette

UNLV

Michigan

San Diego State

Cinderella Stories (potential to advance to the Sweet 16):

Wichita State

Memphis

Murray State

Gonzaga

Saint Mary’s

Belmont

Long Beach State

New Mexico State

My advice: look at the first- and second-round matchups involving the teams above to get a better sense of each matchup, and then pick your upsets wisely.  For the record, my Final Four is Kentucky, Missouri, Ohio State and UNC.  As much as I dislike him as a person, I have Calipari’s Wildcats cutting down the nets in New Orleans on April 2.

A Brief Note about RPI

The RPI has come under a lot of criticism recently, and some of the data seem to support this criticism.  First, RPI and Seed are highly correlated: RPI alone accounts for 81% of the variation in tournament seeds, which suggests the selection committee leans heavily on RPI when seeding the bracket.  However, when we include Seed in our regression model to predict tournament wins, the adjusted R^2 of the model only increases to .41 (from .36 before).  Interestingly, the highest-ranked RPI team heading into the tournament over the last five years was the 2010 Kansas squad, which lost in the second round as a 1 seed.
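For reference, the adjusted R^2 comparison above uses the standard correction that penalizes plain R^2 for the number of predictors.  A minimal sketch in Python (the sample size n below is a placeholder for illustration, since this post does not report it):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2: penalizes plain R^2 for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative call only; n = 100 is a placeholder, not this study's sample size.
print(adjusted_r2(0.5, n=100, p=3))  # 0.484375
```

The takeaway is that adding a predictor (like Seed) only raises adjusted R^2 if it explains enough extra variance to offset the penalty for the added term.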

I have included a table at the end of this post that shows predicted wins from a regression EXCLUDING RPI.

Next Steps

My March Madness analysis is just beginning.  I plan to peel back the onion further to understand the critical components of offensive and defensive efficiency that help determine tournament wins.  I also plan to investigate alternative rankings besides RPI that level the playing field.

But before I do all that, I’m headed to the casino tomorrow to do some of my favorite things: watch the opening of March Madness, gamble and drink free beer.

-DaveCaughman

P.S.  In case you’re interested, below is the table showing the results from a regression that EXCLUDED RPI.

Coefficients
C -1.094
Off_Eff 12.505
Def_Eff -11.869
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.737538
2012 Ohio State 2 1.104 0.872 0.638 2.361752
2012 Missouri 2 1.184 0.969 0.629 2.210859
2012 Syracuse 1 1.103 0.894 0.667 2.088129
2012 Wichita State 5 1.112 0.907 0.622 2.046377
2012 North Carolina 1 1.1 0.898 0.658 2.003138
2012 Belmont 14 1.155 0.958 0.572 1.978773
2012 Michigan State 1 1.079 0.884 0.652 1.906699
2012 Kansas 2 1.09 0.9 0.642 1.85435
2012 New Mexico 5 1.067 0.877 0.607 1.839722
2012 Wisconsin 4 1.055 0.873 0.61 1.737138
2012 Saint Louis 9 1.061 0.881 0.599 1.717216
2012 Memphis 8 1.081 0.906 0.62 1.670591
2012 Murray State 6 1.087 0.913 0.612 1.662538
2012 Indiana 4 1.131 0.96 0.618 1.654915
2012 Saint Mary’s 7 1.124 0.957 0.606 1.602987
2012 Harvard 12 1.046 0.885 0.59 1.482165
2012 Gonzaga 7 1.079 0.922 0.609 1.455677
2012 Creighton 8 1.159 1.007 0.608 1.447212
2012 Florida 7 1.127 0.976 0.603 1.414991
2012 Iona 14 1.14 0.994 0.579 1.363914
2012 S Dakota St 14 1.125 0.981 0.585 1.330636
2012 Virginia 10 1.01 0.86 0.577 1.32871
2012 California 12 1.062 0.916 0.586 1.314306
2012 Lehigh 15 1.071 0.926 0.547 1.308161
2012 BYU 14 1.056 0.911 0.578 1.298621
2012 Georgetown 3 1.04 0.898 0.62 1.252838
2012 Davidson 13 1.094 0.96 0.57 1.19223
2012 Marquette 3 1.056 0.921 0.633 1.179931
2012 UNLV 6 1.061 0.927 0.616 1.171242
2012 Long Beach State 12 1.071 0.938 0.589 1.165733
2012 Duke 2 1.109 0.979 0.651 1.154294
2012 Montana 13 1.049 0.916 0.561 1.151741
2012 Baylor 3 1.08 0.949 0.634 1.147719
2012 New Mexico State 13 1.072 0.944 0.573 1.107024
2012 VCU 12 1.024 0.895 0.584 1.088365
2012 Vanderbilt 5 1.077 0.959 0.616 0.991514
2012 Kansas State 8 1.026 0.915 0.579 0.875995
2012 Cincinnati 6 1.031 0.921 0.579 0.867306
2012 Ohio 13 1.024 0.914 0.578 0.862854
2012 Louisville 4 0.988 0.878 0.621 0.839958
2012 Florida State 3 1 0.895 0.626 0.788245
2012 Alabama 9 0.999 0.894 0.586 0.787609
2012 Vermont 16 1.034 0.931 0.516 0.786131
2012 Iowa State 8 1.069 0.974 0.592 0.713439
2012 Purdue 10 1.091 0.998 0.579 0.703693
2012 Lamar 16 1.033 0.937 0.536 0.702412
2012 Texas 11 1.061 0.97 0.577 0.660875
2012 UNC Asheville 16 1.099 1.012 0.541 0.637567
2012 Temple 5 1.079 0.992 0.615 0.624847
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.585209
2012 San Diego State 6 1.023 0.938 0.607 0.565493
2012 West Virginia 10 1.047 0.967 0.57 0.521412
2012 North Carolina St 11 1.058 0.981 0.577 0.492801
2012 Notre Dame 7 1.04 0.963 0.586 0.481353
2012 Michigan 4 1.062 0.991 0.621 0.424131
2012 Connecticut 9 1.036 0.965 0.594 0.407595
2012 Southern Miss 9 1.052 0.985 0.611 0.370295
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.284883
2012 Loyola Maryland 15 1.016 0.959 0.557 0.228709
2012 Xavier 10 1.008 0.96 0.582 0.1168
2012 Detroit 15 1.037 0.994 0.524 0.075899
2012 Colorado State 11 1.07 1.03 0.598 0.06128
2012 Colorado 11 0.994 0.953 0.572 0.024813
2012 Norfolk State 15 0.983 0.943 0.522 0.005948
2012 South Florida 12 0.961 0.937 0.578 -0.197948
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.540877
2012 Western Kentucky 16 0.93 0.972 0.487 -1.001018
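As a sanity check, the Pred Wins column above can be reproduced directly from the published coefficients (C, Off_Eff, Def_Eff).  A quick sketch in Python:

```python
# Coefficients from the regression that excluded RPI (see the table above).
INTERCEPT = -1.094   # C
B_OFF = 12.505       # Off_Eff
B_DEF = -11.869      # Def_Eff

def predicted_wins(off_eff: float, def_eff: float) -> float:
    """Predicted tournament wins from offensive and defensive efficiency."""
    return INTERCEPT + B_OFF * off_eff + B_DEF * def_eff

# Kentucky's 2012 profile (Off_Eff 1.135, Def_Eff 0.873)
print(round(predicted_wins(1.135, 0.873), 6))  # 2.737538, matching the table
```

Note the RPI% column still appears in the table for reference, but it plays no role in these predictions.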



Welcome!

14 Mar

Welcome from the Tepper Sports Fanalytics Club.  We are a student-run organization at Carnegie Mellon’s Tepper School of Business whose mission is to bring quantitative rigor to the enjoyment and interpretation of sports.  We are excited to bring Carnegie Mellon’s analytical prowess to this flourishing field.

The purpose of this blog is to serve as a forum for everyone in our community to publish original research or humorous anecdotes on anything sports-related.  Whether you are a former all-state athlete or an armchair mathlete, this blog has something to offer.

We encourage you to make comments and provide feedback.

Enjoy!

-DaveCaughman