The Optimal NBA Roster Updated

4 Sep


The Optimal Roster


As shown in the movie Moneyball, data analytics plays an ever-growing role in the decisions made by professional sports teams.

My concept of the “Optimal Roster” was developed after an Optimization class at Carnegie Mellon’s Tepper School of Business.  In a particular class discussion, my professor attempted to maximize the return of a bond portfolio by choosing an ideal group of five bonds, out of a possible ten, given a specified dollar amount to invest.   In order to do this, he used Excel Risk Solver to help identify the optimal solution/portfolio of bonds to choose.

This discussion led me to the optimal roster concept.  What if I replaced bonds with NBA players?  What if instead of an investment portfolio, I optimized the selection of players within the constraints of a team’s Salary Cap?  What if instead of considering bond coupons (interest), maturity and risk, we looked at a player’s points, rebounds and assists?  Lastly, instead of having to choose five bonds out of a possible ten, what if we had to choose the 15 players that comprise the ‘Optimal Roster’ from a pool of every player currently in the NBA?

My concept of the “Optimal Roster” uses the Excel Risk Solver software to create an optimal NBA roster of 15 players while staying within the confines of the salary cap, or a potential luxury-tax level, as defined by the NBA’s 2011 collective bargaining agreement (CBA).

Using John Hollinger’s GameScore calculation, averaged over an 82-game regular season and adjusted to a per-48-minutes basis, I was able to rank players by position.  Then, adding in their salary information, I set constraints in Risk Solver stating that the total money spent on the 15 players cannot exceed a given limit; in this case, I used a ceiling based on the average NBA team salary in the 2012-13 season.  An additional constraint requires that exactly one player at each position is chosen for the starting lineup and that each roster carries two backup players at each position.  Given that each team may carry 15 players but can only have 12 active for a game, it made sense to build the model to accommodate all 15.
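To make the constraint structure concrete, here is a minimal Python sketch of the search, using a tiny invented pool (four candidates per position; all names, per-48 GameScore figures, and salaries below are hypothetical). The real model runs in Excel Risk Solver over the full NBA player pool:

```python
from itertools import combinations, product

# Hypothetical pool: (name, per-48 GameScore, salary in $M) by position.
pool = {
    "PG": [("PG1", 22.1, 14.0), ("PG2", 18.5, 6.5), ("PG3", 14.2, 2.0), ("PG4", 11.0, 1.0)],
    "SG": [("SG1", 21.0, 13.0), ("SG2", 16.8, 5.0), ("SG3", 13.5, 2.5), ("SG4", 10.2, 1.0)],
    "SF": [("SF1", 23.4, 16.0), ("SF2", 17.1, 7.0), ("SF3", 12.9, 2.0), ("SF4", 10.8, 1.0)],
    "PF": [("PF1", 20.6, 12.0), ("PF2", 16.0, 6.0), ("PF3", 13.1, 3.0), ("PF4", 9.7, 1.0)],
    "C":  [("C1", 19.9, 11.0), ("C2", 15.4, 5.5), ("C3", 12.0, 2.5), ("C4", 9.1, 1.0)],
}
CAP = 58.0  # salary ceiling in $M (stand-in for the 2012-13 league-average payroll)

def best_roster(pool, cap):
    """Pick 3 players per position (1 starter + 2 backups, 15 in all),
    maximizing total per-48 GameScore subject to the salary ceiling."""
    best_players, best_score = None, float("-inf")
    per_pos = [list(combinations(players, 3)) for players in pool.values()]
    for trios in product(*per_pos):                  # every 3-per-position roster
        roster = [p for trio in trios for p in trio]
        if sum(sal for _, _, sal in roster) <= cap:  # budget constraint
            score = sum(gs for _, gs, _ in roster)   # objective: total GameScore
            if score > best_score:
                best_players, best_score = roster, score
    return best_players, best_score

roster, score = best_roster(pool, CAP)
```

With a realistically sized pool this exhaustive search explodes combinatorially, which is exactly why the post reaches for an integer-programming solver like Risk Solver instead.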

The main focus of this research is to understand small variances in spending and to place a framework around how a team can statistically maximize its roster to meet given expectations.  In other words, if half a percent more of the salary cap were dedicated to the starting five players, how much greater does total team PER become?  Does it make sense to potentially overspend and dip into the luxury tax just to gain more productivity from players?  Or can you find ways to save money while still getting productive output from your lineup?  Lastly, if a team were to lose a few players in the offseason to free agency, could we take a pool of undrafted collegiate prospects, international NBA-eligible players and current NBA free agents and use this model to find the ideal solutions to the vacated roster holes?  These are the ideas being explored with this model.

The last question in the preceding paragraph depends largely on the ability to standardize collegiate and international players against current NBA players in order to compare apples to apples.  This would require a projection model to predict the output of college players at the NBA level, and a similar conversion model to translate international player productivity to the NBA game.  In the current optimization model, we are only using a pool of current NBA players to build the ideal roster because the framework for the projection models is not yet in place.  I wanted to take a moment to acknowledge these next steps, which I have yet to complete but certainly plan to pursue.

If you have any questions or thoughts on this model and research, please contact me at jjhabval@tepper.cmu.edu.

Jordan Jhabvala 

The Optimal NBA Roster

1 Apr


Following the movie Moneyball, data analytics has played an ever-growing role in the decisions made by professional sports teams.

My concept of the “Optimal Roster” was developed after an Optimization class at Carnegie Mellon’s Tepper School of Business.  In a particular class discussion, my professor attempted to maximize the return of a bond portfolio by choosing an ideal group of five bonds, out of a possible ten, given a specified dollar amount to invest.   In order to do this, he used Excel Risk Solver to help identify the optimal solution/portfolio of bonds to choose.

This discussion led me to the optimal roster concept.  What if I replaced bonds with NBA players?  What if instead of an investment portfolio, I optimized the selection of players within the constraints of a team’s Salary Cap?  What if instead of considering bond coupons (interest), maturity and risk, we looked at a player’s points, rebounds and assists?  Lastly, instead of having to choose five bonds out of a possible ten, what if we had to choose five players, one at each position, out of all players currently in the NBA?

My concept of the “Optimal Roster” uses the Excel Risk Solver software to create an optimal starting lineup for an NBA team while staying within the confines of a team’s current salary cap.

Using a simplified version of John Hollinger’s PER calculation, I was able to rank players by position.  Then, adding in their salary information, I set constraints in Risk Solver stating that the total money spent on the five players cannot exceed a specified percentage of the salary cap.  Also, only one player at each position may be chosen, with the objective of maximizing total team PER.
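As a rough sketch of this five-player formulation (one starter per position under a salary budget), the snippet below does the same thing by exhaustive search; the names, PER values, and salary figures are invented for illustration, and the real model uses Risk Solver over the full player pool:

```python
from itertools import product

# Hypothetical pool: (name, PER, salary in $M) by position.
pool = {
    "PG": [("PG-a", 24.0, 16.0), ("PG-b", 18.0, 7.0), ("PG-c", 13.0, 2.0)],
    "SG": [("SG-a", 22.5, 14.0), ("SG-b", 17.0, 6.0), ("SG-c", 12.5, 1.5)],
    "SF": [("SF-a", 25.0, 17.0), ("SF-b", 19.0, 8.0), ("SF-c", 14.0, 3.0)],
    "PF": [("PF-a", 21.0, 13.0), ("PF-b", 16.5, 5.5), ("PF-c", 12.0, 2.0)],
    "C":  [("C-a", 23.0, 15.0), ("C-b", 18.5, 7.5), ("C-c", 13.5, 2.5)],
}
BUDGET = 0.55 * 58.0  # e.g. 55% of a $58M cap devoted to the starting five

# One player per position, total salary within budget, maximize total PER.
feasible = (
    lineup for lineup in product(*pool.values())
    if sum(sal for _, _, sal in lineup) <= BUDGET
)
best = max(feasible, key=lambda lineup: sum(per for _, per, _ in lineup))
```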

The main focus of this research is to be able to understand small variances in spending.  That is, if half a percent more of the salary cap were dedicated to the starting five players, how much greater would total team PER become?  Does it make sense to potentially overspend and dip into the luxury tax just to gain more productivity from players?  Or can you find ways to save money while still getting productive output from your lineup?  These are the ideas being explored with this model.

The ideal scenario in this case is to be able to build a team of 15 players; however, the student version of Risk Solver limits us to 200 variables, so this isn’t possible yet.  Also, the next step in this research is to account for rookie salaries versus veteran deals, which certainly skew the results.  Nonetheless, the framework of the model is in place, and I will continue to post more once I am able to do more of the variance research that comes from having more variables to work with.

If you have any questions or thoughts on this model and research, please contact me at jjhabval@tepper.cmu.edu.

Jordan Jhabvala 

Red Sox Roster Optimization

27 Jan

Now that San Francisco has won the World Series, the offseason has officially started.  Call me crazy, but I enjoy watching my team, the Red Sox, come together in the offseason almost as much as I enjoy watching the season.  This is especially true for me given the long season that the Red Sox had.

With the offseason comes speculation.  Who goes where?  Who signs the big contract?  The speculation has become more quantitative in recent years.  However, there is much less conversation around building the best team.  Of the team conversations that do happen, they tend to be very qualitative in nature and come spring training it’s hard to tell if offseason goals have been accomplished.

Since I’ve been learning all sorts of optimization methods, I thought it would be fun to see if I could use these skills to optimize a baseball team’s offseason.  So, I made an effort to optimize a lineup for the Red Sox based on their current roster and the available free agents.  To start, I have only optimized the offensive side of the roster.  There are a couple of reasons for this that I will get into later, but for now I have an optimization model for the lineup and the bench.  The optimization is based on two things: OPS and salary.  OPS, while not universal or comprehensive, is a common measuring stick for a player’s offensive value, and salary is the common limiting factor for every team.  To illustrate the correlation between team OPS and runs, I ran a regression based on last year’s team data.  It should come as no surprise that the R^2 of OPS on runs is 97%, making OPS a good measure of offensive output.
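For anyone who wants to reproduce that kind of check, here is a small Python sketch of the R^2 calculation; the team OPS and run totals below are invented stand-ins, not the actual 2012 team data behind the 97% figure:

```python
# Invented team-level data for illustration only.
ops  = [0.780, 0.745, 0.712, 0.699, 0.670, 0.734, 0.758]
runs = [875,   800,   735,   710,   640,   765,   820]

def r_squared(x, y):
    """Squared Pearson correlation: the share of variation in y
    explained by a linear relationship with x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

r2 = r_squared(ops, runs)  # close to 1.0 when OPS tracks runs tightly
```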

So, I’ve put together a model that optimized the team’s OPS.  This model adds free agents to the current roster to determine the optimal lineup.  Taking every free agent into account would take too long to do for free, so my model is limited to the free agents listed below:

Name | OPS | 2012 Salary | Name | OPS | 2012 Salary
Ryan Lavarnway | 0.459 | $500,000 | Maicer Izturis (32) | 0.634 | $3,966,667
Guillermo Quiroz | 0.625 | $500,000 | Kelly Johnson (31) | 0.687 | $6,375,000
Jarrod Saltalamacchia | 0.742 | $2,500,000 | Jason Bartlett (33) | 0.433 | $5,500,000
Pedro Ciriaco | 0.705 | $500,000 | Yuniesky Betancourt (31) | 0.656 | $2,000,000
Mauro Gomez | 0.746 | $500,000 | Brian Bixler (30) | 0.583 | $500,000
Jose Iglesias | 0.391 | $500,000 | Ronny Cedeno (30) | 0.741 | $1,150,000
Dustin Pedroia | 0.797 | $8,250,000 | Marco Scutaro (37) | 0.753 | $6,000,000
Jacoby Ellsbury | 0.682 | $8,050,000 | Eric Chavez (35) | 0.845 | $900,000
Ryan Kalish | 0.625 | $500,000 | Brandon Inge (36) | 0.658 | $5,500,000
Che-Hsuan Lin | 0.500 | $500,000 | Jose Lopez (29) | 0.626 | $800,000
Daniel Nava | 0.742 | $500,000 | Scott Rolen (38) | 0.716 | $8,166,667
Will Middlebrooks | 0.835 | $500,000 | Travis Buck (29) | 0.595 | $580,000
David Ortiz | 1.026 | $14,575,000 | Melky Cabrera (28) | 0.906 | $6,000,000
Cody Ross | 0.807 | $3,000,000 | Jonny Gomes (32) | 0.868 | $1,000,000
Ivan DeJesus | 0.000 | $480,500 | Josh Hamilton (32) | 0.930 | $15,250,000
James Loney | 0.630 | $6,375,000 | Andruw Jones (36) | 0.701 | $2,000,000
Danny Valencia | 0.388 | $515,000 | Michael Bourn (30) | 0.739 | $6,845,000
Russell Martin (30) | 0.713 | $7,500,000 | Scott Hairston (33) | 0.803 | $1,100,000
Mike Napoli (31) | 0.812 | $9,400,000 | B.J. Upton (28) | 0.752 | $7,000,000
A.J. Pierzynski (36) | 0.827 | $6,000,000 | Shane Victorino (32) | 0.704 | $9,500,000
Kelly Shoppach (33) | 0.798 | $1,350,000 | Brian Bixler (30) | 0.583 | $500,000
Eric Hinske (35) | 0.583 | $1,600,000 | Travis Buck (29) | 0.595 | $580,000
Casey Kotchman (30) | 0.612 | $3,000,000 | Torii Hunter (37) | 0.817 | $18,500,000
Carlos Lee (37) | 0.697 | $19,000,000 | Nick Swisher (32) | 0.837 | $10,250,000
Carlos Pena (35) | 0.684 | $7,250,000 | | |

The use of these particular free agents is completely arbitrary; no free agent is left off for any particular reason.   Also, to simplify things for right now, I’m using last year’s OPS and salary for each free agent.  I understand that this is not ideal, but I will add these pieces into the equation at a later point.  To begin, I simply want to test the theory.  So, using this information I get the following offensive roster for the 2013 Red Sox.

Name | Position | OPS | 2012 Salary
Mike Napoli (31) | C | 0.812 | $9,400,000
Carlos Lee (37) | 1B | 0.697 | $19,000,000
Dustin Pedroia* | 2B | 0.797 | $8,250,000
Will Middlebrooks* | 3B | 0.835 | $500,000
Marco Scutaro (37) | SS | 0.753 | $6,000,000
Melky Cabrera (28) | OF | 0.906 | $6,000,000
Jonny Gomes (32) | OF | 0.868 | $1,000,000
Josh Hamilton (32) | OF | 0.930 | $15,250,000
David Ortiz | DH | 1.026 | $14,575,000
A.J. Pierzynski (36) | C | 0.827 | $6,000,000
Ronny Cedeno (30) | SS | 0.741 | $1,150,000
Eric Chavez (35) | 3B | 0.845 | $900,000
Scott Hairston (33) | OF | 0.803 | $1,100,000

Total Offensive Salary: $89,125,000
Average OPS: 0.834

*Under contract, not part of optimization.

There are a couple of things to note with this offensive group.  First, by design, there are no salary constraints in this first run.  That explains the $6 million backup catcher and the $90 million offensive payroll.  This is simply a first-pass sanity check to prove that the model will in fact pick out the optimal combination of players.  Also, only two current Red Sox are on this roster.  That is mostly because Pedroia and Middlebrooks are the only two currently signed Red Sox players whom you can write into the 2013 plan in ink.  While there are others under contract, there is at least some speculation around everyone else.  Plus, the model is more interesting with more moving parts.  I’ll replace “interesting” with “accurate” as I go through the steps of refining the model, but for a first pass I give preference to interesting.

OK, now that I’ve shown that the model works, let’s start refining it so that it is useful.  The first thing I will tackle is salary constraints.  There are two types built into the model: individual position salary constraints and a total salary constraint.  Since Pedroia and Middlebrooks are set, we don’t have to worry about them.  I’m going to cap every other position at $10 million, except for DH, because Ortiz has been identified as a priority.  I’m also going to cap each bench position at $5 million.  Also, given the trade during the season, I’m going to assume that the Red Sox are not going to break the bank this offseason.  Taking this into consideration, I’m going to set the total offensive salary at $60 million.  Using these constraints, we get the following results:

Name | Position | OPS | Salary
A.J. Pierzynski (36) | C | 0.827 | $6,000,000
Carlos Pena (35) | 1B | 0.684 | $7,250,000
Dustin Pedroia | 2B | 0.797 | $8,250,000
Will Middlebrooks | 3B | 0.835 | $500,000
Marco Scutaro (37) | SS | 0.753 | $6,000,000
David Ortiz | DH | 1.026 | $14,575,000
Melky Cabrera (28) | OF | 0.906 | $6,000,000
Jonny Gomes (32) | OF | 0.868 | $1,000,000
Cody Ross | RF/LF | 0.807 | $3,000,000
Kelly Shoppach (33) | C | 0.798 | $1,350,000
Ronny Cedeno (30) | SS | 0.741 | $1,150,000
Eric Chavez (35) | 3B | 0.845 | $900,000
Scott Hairston (33) | OF | 0.803 | $1,100,000

Total Salary: $57,075,000
Average OPS: 0.822

This looks more like a potential 2013 roster than the first one did.  Obviously, there are a lot of different iterations that can be run by playing with the salary constraints, each of which can change the roster.
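To see the per-position caps at work on a single slot, the sketch below applies the $10 million starter cap and $5 million bench cap to the catcher candidates from the tables above (positions are labeled here for illustration; the actual model optimizes all thirteen slots jointly in Risk Solver):

```python
# Catcher candidates: (name, OPS, 2012 salary in $M), values from the tables above.
catchers = [
    ("Mike Napoli (31)",      0.812, 9.40),
    ("A.J. Pierzynski (36)",  0.827, 6.00),
    ("Jarrod Saltalamacchia", 0.742, 2.50),
    ("Kelly Shoppach (33)",   0.798, 1.35),
]
STARTER_CAP, BENCH_CAP = 10.0, 5.0  # $M, the per-position caps described above

# Highest-OPS catcher under the starter cap, then the best remaining one
# under the bench cap.
starter = max((c for c in catchers if c[2] <= STARTER_CAP), key=lambda c: c[1])
backup = max((c for c in catchers if c is not starter and c[2] <= BENCH_CAP),
             key=lambda c: c[1])
print(starter[0], "/", backup[0])  # -> A.J. Pierzynski (36) / Kelly Shoppach (33)
```

This greedy per-slot pick happens to match the constrained run above, where Pierzynski starts and Shoppach backs up; in general the total-salary constraint forces trade-offs across positions that only the joint optimization captures.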

So, to some extent I’ve proven that an optimization model can be used as a tool for developing a roster, which means that I will continue to develop the model.  At this point there are two roads this analysis can go down.  One is to do all the research ahead of the transactions (this would involve predicting 2013 OPS and salary) and use this as a predictive model.  The other is to see what the Red Sox and other teams actually do this offseason and use that information to analyze the Red Sox offseason.  I’m choosing the latter for now.  So, I will continue to refine and add data as decisions are made and post updates at interesting points during the offseason.

-Erik Clark

Sources:

www.espn.com/mlb

www.mlbtraderumors.com

http://www.baseballprospectus.com/compensation/cots/

Weekly Sports Analytics Round-Up

16 Jul

I hesitated to include the “weekly” moniker on this post since it has been a good 11 weeks since the last sports analytics round-up.  But as you probably guessed, we have all been busy impressing potential future employers at our summer internships.  So, without further ado, here are this edition’s links:

 

Fascinating read about applying network theory to analyzing ball movement during the 2010 World Cup and 2012 Euro Cup.  http://www.technologyreview.com/view/428399/pagerank-algorithm-reveals-soccer-teams/

 

Interesting article comparing NFL Draft and NBA Draft pick values accounting for the difference in the number of starters between the two sports: http://www.thebiglead.com/index.php/2012/06/27/comparing-the-nba-draft-to-the-nfl-draft/

I, for one, am wary of people who still hate on LeBron because of the Decision or because he had to partner with a “Big Three” to win a title.  As for the latter, this article places Miami’s “other two” in perspective with other championship teams.  As for the former, I would only ask: how many stupid decisions did you make when you were 25?

http://www.thebiglead.com/index.php/2012/06/22/the-miami-heat-still-have-decisions-to-make-if-they-want-to-repeat-or-threepeat/

 

-DaveCaughman

Weekly Sports Analytics Round-up

29 Apr

In light of the recent NFL draft, Grantland ran a nice article postulating a new method for determining the “net value” of draft picks over or below what is expected from a player drafted in that position.

http://www.grantland.com/story/_/id/7849206/how-tell-which-draft-picks-truly-valuable

 

Not recent, but a very interesting sabermetrician profile that just came across my radar.

http://www.thepostgame.com/features/201101/sabermetrician-exile

 

The guys over at HSAC find out that April wins in baseball mean more than you might think (sorry Phils fans)

http://harvardsportsanalysis.wordpress.com/2012/04/20/how-important-is-a-good-april/#more-3111

 

-DaveCaughman

 

MLB Closer Analysis – Understanding Statistical Components

12 Apr

With baseball back in season, I started to think of my team (the Red Sox) and their closer situation.  They are transitioning to a new closer which made me wonder how they came to this decision.  To get an idea of what makes a good closer I did some research on closer statistics.  Ultimately, your ability as a closer comes down to how many save opportunities are converted.  So, I analyzed save percentage (saves/save opportunities) as a function of various independent variables with the following results:

Table 1

Variable | Probability | R-squared
BAA | 0.0311 | 0.137126
BABIP | 0.7301 | 0.003770
BB/9 | 0.4358 | 0.019089
ERA | 0.0001 | 0.285438
K/9 | 0.2874 | 0.035279
K/BB | 0.0976 | 0.083421
OBP | 0.0155 | 0.159534
OPS | 0.0005 | 0.315516
SLG | 0.0017 | 0.268218
WHIP | 0.0174 | 0.164343

Interpreting this information first requires an explanation of the table.  Probability here is the p-value: roughly, the likelihood of seeing a relationship this strong by chance alone, so lower is better; generally, you want to see a number less than 0.05.  R-squared can be roughly described as the percentage of the total outcome that can be attributed to the independent variable, and it ranges from 0.0 to 1.0.  For example, OBP has a probability of 0.0155, which means OBP is significant, and an R-squared value of 0.159534, which means that OBP accounts for 15.95% of the total variability in save percentage.  Below, I removed all insignificant variables and ordered the rest by R-squared.

Table 2

Variable | Probability | R-squared
OPS | 0.0005 | 0.315516
ERA | 0.0001 | 0.285438
SLG | 0.0017 | 0.268218
WHIP | 0.0174 | 0.164343
OBP | 0.0155 | 0.159534
BAA | 0.0311 | 0.137126
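The step from Table 1 to Table 2 is a mechanical filter-and-sort, which can be verified directly from the (Probability, R-squared) pairs in Table 1:

```python
# (p-value, R-squared) pairs taken straight from Table 1.
results = {
    "OPS": (0.0005, 0.315516), "ERA": (0.0001, 0.285438),
    "SLG": (0.0017, 0.268218), "WHIP": (0.0174, 0.164343),
    "OBP": (0.0155, 0.159534), "BAA": (0.0311, 0.137126),
    "K/BB": (0.0976, 0.083421), "K/9": (0.2874, 0.035279),
    "BB/9": (0.4358, 0.019089), "BABIP": (0.7301, 0.003770),
}
ALPHA = 0.05  # the significance threshold used in the text

# Keep significant variables, then rank by explanatory power (R-squared).
significant = {var: r2 for var, (p, r2) in results.items() if p < ALPHA}
ranked = sorted(significant, key=significant.get, reverse=True)
print(ranked)  # -> ['OPS', 'ERA', 'SLG', 'WHIP', 'OBP', 'BAA'], matching Table 2
```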

There are a couple of interesting takeaways from this information.  The first takeaway is to recognize what stats are missing.  K/9 appears to have no significance in converting saves.  This is kind of a strange outcome because closers are generally thought of as big strikeout guys.  BB/9 and K/BB are also missing.  This is a little surprising because, even though closers aren’t necessarily control guys, you would expect a closer’s results to be somewhat dependent on these two variables.  The final missing stat is BABIP (batting average on balls in play).  This is something that is starting to be talked about more in the baseball world, but apparently has little effect on save percentage.

The next takeaway is the relative importance of each variable.  According to my research, OPS is the most significant stat with respect to save percentage.  This is somewhat surprising because I don’t think I’ve ever heard this stat with respect to pitchers.  However, after absorbing this information for a minute, it should not be that surprising.  OPS has grown in popularity over recent years as a way of measuring a hitter’s performance, so it would stand to reason that you could measure a pitcher’s value by the OPS that hitters have against him.  The next most important variable is an old-time statistic, ERA.  As it turns out, the old guard in baseball are bigger stat geeks than they may care to admit.  But, again, this makes sense, because if ERA didn’t matter then it wouldn’t have been such a popular stat for so long.  One stat that I was surprised by is WHIP.  This is my favorite newer pitcher stat because I thought it encompassed most of the areas where ERA fell short.  As it turns out, WHIP is only the 4th most significant stat that I analyzed, far less important than I would have anticipated.

So, now that we have all this information we should apply some of it.  First, let’s look at the closers who might be over their heads in their 2011 roles.  Here are the highest OPS values for closers with more than 10 saves:

Name | OPS
Jon Rauch | 0.799
Huston Street | 0.781
Kevin Gregg | 0.773
Matt Capps | 0.726
Frank Francisco | 0.721
Joakim Soria | 0.709

Of these six, Frank Francisco is the only one with a save so far this year and the only clear-cut closer.  However, Francisco changed teams and leagues and also signed a very reasonable $5.5 million contract, all of which may contribute to him continuing as a closer.  To be fair, Soria is out for the season with an elbow injury (perhaps his uncharacteristically poor 2011 could have been an indication of the arm injury), otherwise he’d be closing for the Royals.  The other 4 are either setup guys or in a situation where it’s not completely clear.  So, for the most part, the empirical evidence supports the theoretical research.

That’s it for my analysis this time around, but I intend to do some research on multi-variable analysis in the near future.

-Erik Clark

The 2012 Kentucky Wildcats – They Are Who We Thought They Were

6 Apr

Well, my regression model correctly predicted Kentucky winning it all.  So what?  Sometimes the data just confirms what we already believe.  While not as exciting as correctly predicting a 4 seed to win it all, the results nonetheless speak for themselves.  Overall, my model-based bracket placed 6th out of 27 in the first annual Tepper Sports Fanalytics Club Tourney Pick Em.  This equated to the 83rd percentile among Yahoo! brackets.  While not perfect, it beats the heck out of my usual bottom-feeder brackets.

But how should we really define “success” with an analytical-based bracket?  Picking the winner is nice, but I’m going to hold out and say success should be finishing “in the money”, which typically means 3rd place and higher in your pool.  That being said, you usually have to pick the winner to cash out, so we’re on the right track.

So it’s time to put a bow on the 2011-12 college basketball season.  Now we have baseball and the NBA playoffs to look forward to.  Stay tuned for more blogs along those lines in the near future!

-DaveCaughman

March Madness 2012: Trying to Bring Some Sanity to the Madness

14 Mar

Nothing gets me fired up like hearing the CBS March Madness intro.  It’s been a mainstay on my GATA (workout) playlist for years, and whenever it comes on I think back to the Laettners and George Masons of the world while trying to forget how many times Vanderbilt has broken my heart.  The seemingly unpredictable nature of the tournament is its biggest asset, but it also got me thinking: is there any way to better predict who is going to win at each stage of the tournament?  Having never won a tournament pool in my life, I have a financial motivation to get to the bottom of this as well.  Using the handy tool of regression analysis and a weird fondness for drinking coffee, listening to Pandora and plugging numbers into spreadsheets, I set about answering that very question.

Data Collection

Dependent Variable:  Since we are trying to predict tournament wins, I went back 5 tournaments (to the 2006-07 season) and for all 64 teams (leaving out the recent play-in games) gathered data for how many games each team won in each tournament.  Some quick math will tell you there are 320 observations in total.

Independent Variables:  As badly as I wanted to discover some overlooked statistical category right off the bat, starting with some basic stat lines seemed like the best way to peel back the onion.  As a disciple of Dave Berri and Dean Oliver, I used offensive and defensive efficiency as a starting point.  To keep things simple, I threw in RPI to level the playing field.

Sources:

1) Teamrankings.com for offensive efficiency, defensive efficiency and RPI.  For these stats I went back and made sure I captured them immediately before the tournament began.

2) Wikipedia for tournament wins each year.  Now that I routinely see my professors citing Wikipedia for class slides, I think we’ve removed the final hurdle to Wikipedia legitimacy.

Results/Insights:

Variable | Coefficient | Std Error | P-value
Constant | -7.493745 | 2.090886 | 0.0004
OFF_EFF | 4.949186 | 1.733313 | 0.0046
DEF_EFF | -5.416243 | 1.764366 | 0.0023
RPI | 14.092079 | 2.126283 | 0.0000

Above are the results of the regression.  As you can see, all variables are significant predictors of tournament wins.  Together they account for 36% of the variation in tournament wins (adjusted R-squared).

To put these numbers in perspective, the range of offensive efficiencies for these 5 tournaments is [1.17,.909].  This is the difference between the 2007 Florida national champs and the #224 RPI-ranked Mississippi Valley State Delta Devils of 2008.  With a 5 year average offensive efficiency of 1.065, the difference between tournament average and greatness is only .106.  Thus, using the coefficients from the above regression, a marginal increase by this amount results in roughly .5 additional tournament victories.  An analysis of defensive efficiency yields similar results.

The range in RPI over these 5 tournaments is [.688,.463].  This is the difference between the 2010 Kansas Jayhawks (who lost in the second round) and, once again, the 2008 Mississippi Valley State Delta Devils.  With a 5 year average of .592, moving from an average RPI to the best results in an estimated 1.4 additional tournament wins.  More on RPI later.
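Since the fitted model is a straight linear combination, any team’s predicted wins, and the marginal effects quoted above, can be recomputed from the coefficients (listed in full in the 2012 bracket section below):

```python
# Regression coefficients from the fitted model.
INTERCEPT, B_OFF, B_DEF, B_RPI = -7.49374487, 4.949185836, -5.416243101, 14.09207946

def predicted_wins(off_eff, def_eff, rpi):
    """Predicted tournament wins as a linear function of the three inputs."""
    return INTERCEPT + B_OFF * off_eff + B_DEF * def_eff + B_RPI * rpi

# 2012 Kentucky's pre-tournament numbers reproduce the table below.
print(round(predicted_wins(1.135, 0.873, 0.665), 4))  # -> 2.7664

# Marginal effects quoted in the text:
print(round(B_OFF * 0.106, 2))  # -> 0.52: +0.106 off. efficiency adds ~0.5 wins
print(round(B_RPI * (0.688 - 0.592), 2))  # -> 1.35: avg-to-best RPI adds ~1.4 wins
```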

Underdog/Overseeded profiles

Regression results aside, what we really want to do is pick March Madness teams.  Before we do that, let’s take a look at what the analysis tells us about future Cinderellas and upset-prone teams.  The table below can help shed some light on this.

Observations Avg Off_Eff Avg Def_Eff Avg RPI
First Round Upsets
10-16 seed winners (underdogs who won) 30 1.055 0.944 0.578
1-7 seed losers (favorites who lost) 30 1.065 0.949 0.608
Second Round Upsets
6-16 seed winners (underdogs who won) 18 1.075 0.951 0.592
1-3 seed losers (favorites who lost) 13 1.095 0.933 0.636
Overall Avg 320 1.065 0.947 0.592
Standard Deviation 0.046 0.041 0.041
Best 1.17 0.816 0.688
Worst 0.909 1.068 0.463
Range 0.261 0.252 0.225

For First Round Upsets, I think it’s easier to look at the favorites who lost – that is, teams seeded 1-7 that lost in the first round.  Their offensive and defensive efficiencies are around the Overall Average.

For Second Round Upsets, looking at the profile of the underdogs sheds some interesting light.  Teams seeded 6-16 that advanced to the Sweet 16 had offensive efficiencies well above average while only slightly below average defensive efficiencies.

Based on this segmentation, it might be beneficial to profile each first and second round matchup as described above in order to wisely predict upsets.

Offense wins Championships

It appears as though offense is rewarded more in the tournament.  There are a number of observations that support this.

Most importantly, in four of the five years under observation, the tournament winner has had a higher rated offense than defense.  Between 2007-11, the winners had the number 1, 1, 2, 6 and 34 rated offenses in that tournament and the 9, 3, 28, 5 and 40 rated defenses respectively.  In other words, teams with a relative strength in offense won the tournament in 4 of the last 5 years.  The only team that did not was a well-balanced Duke in 2010, who had the #6 offense and #5 defense in that tournament.

Second, in a single-variable regression, offensive efficiency has a higher R^2 value than defensive efficiency (15% versus 10%).

Finally, as noted above, it seems that if you want to be a good Cinderella story, offensive efficiency is more important than defensive efficiency.

The 2012 Bracket

Now the fun begins.  The results of my regression to predict tournament wins for the 2012 field are summarized below.

Coefficients
C -7.49374487
Off_Eff 4.949185836
Def_Eff -5.416243101
RPI% 14.09207946
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.766433671
2012 Syracuse 1 1.103 0.894 0.667 2.522502778
2012 North Carolina 1 1.1 0.898 0.658 2.359161533
2012 Michigan State 1 1.079 0.884 0.652 2.246503557
2012 Ohio State 2 1.104 0.872 0.638 2.237939008
2012 Kansas 2 1.09 0.9 0.642 2.073363917
2012 Missouri 2 1.184 0.969 0.629 1.981669579
2012 Duke 2 1.109 0.979 0.651 1.866343958
2012 Wichita State 5 1.112 0.907 0.622 1.862490714
2012 Memphis 8 1.081 0.906 0.62 1.686298038
2012 Marquette 3 1.056 0.921 0.633 1.664521778
2012 Baylor 3 1.08 0.949 0.634 1.645739511
2012 Indiana 4 1.131 0.96 0.618 1.613096043
2012 Wisconsin 4 1.055 0.873 0.61 1.595434434
2012 New Mexico 5 1.067 0.877 0.607 1.590883453
2012 Murray State 6 1.087 0.913 0.612 1.565342815
2012 Georgetown 3 1.04 0.898 0.62 1.526711363
2012 Gonzaga 7 1.079 0.922 0.609 1.434726902
2012 Florida State 3 1 0.895 0.626 1.429545136
2012 Saint Louis 9 1.061 0.881 0.599 1.42678673
2012 Saint Mary’s 7 1.124 0.957 0.606 1.425595518
2012 UNLV 6 1.061 0.927 0.616 1.417204898
2012 Louisville 4 0.988 0.878 0.621 1.391770641
2012 Creighton 8 1.159 1.007 0.608 1.356189026
2012 Vanderbilt 5 1.077 0.959 0.616 1.323072092
2012 Florida 7 1.127 0.976 0.603 1.295258218
2012 Harvard 12 1.046 0.885 0.59 1.204055254
2012 Michigan 4 1.062 0.991 0.621 1.145974923
2012 Temple 5 1.079 0.992 0.615 1.140142362
2012 Belmont 14 1.155 0.958 0.572 1.094473334
2012 California 12 1.062 0.916 0.586 1.058970374
2012 San Diego State 6 1.023 0.938 0.607 1.042728447
2012 Long Beach State 12 1.071 0.938 0.589 1.026631937
2012 S Dakota St 14 1.125 0.981 0.585 1.004621201
2012 Southern Miss 9 1.052 0.985 0.611 0.988059728
2012 Virginia 10 1.01 0.86 0.577 0.978093609
2012 VCU 12 1.024 0.895 0.584 0.956458258
2012 BYU 14 1.056 0.911 0.578 0.943619839
2012 Iona 14 1.14 0.994 0.579 0.923895351
2012 Alabama 9 0.999 0.894 0.586 0.866329014
2012 Iowa State 8 1.069 0.974 0.592 0.864025052
2012 Kansas State 8 1.026 0.915 0.579 0.787571371
2012 Cincinnati 6 1.031 0.921 0.579 0.779819841
2012 Connecticut 9 1.036 0.965 0.594 0.777632266
2012 New Mexico State 13 1.072 0.944 0.573 0.773610392
2012 Ohio 13 1.024 0.914 0.578 0.768997163
2012 Davidson 13 1.094 0.96 0.57 0.753556353
2012 Notre Dame 7 1.04 0.963 0.586 0.69552486
2012 Purdue 10 1.091 0.998 0.579 0.659720273
2012 Colorado State 11 1.07 1.03 0.598 0.650217101
2012 Montana 13 1.049 0.916 0.561 0.642328971
2012 Texas 11 1.061 0.97 0.577 0.634715345
2012 North Carolina St 11 1.058 0.981 0.577 0.560289114
2012 Lehigh 15 1.071 0.926 0.547 0.499759516
2012 Xavier 10 1.008 0.96 0.582 0.497031324
2012 West Virginia 10 1.047 0.967 0.57 0.483030917
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.365921937
2012 South Florida 12 0.961 0.937 0.578 0.332624864
2012 Colorado 11 0.994 0.953 0.572 0.32473563
2012 Loyola Maryland 15 1.016 0.959 0.557 0.189739068
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.176751633
2012 Lamar 16 1.033 0.937 0.536 0.097098906
2012 UNC Asheville 16 1.099 1.012 0.541 0.087987336
2012 Vermont 16 1.034 0.931 0.516 -0.14729604
2012 Detroit 15 1.037 0.994 0.524 -0.36093516
2012 Norfolk State 15 0.983 0.943 0.522 -0.38014696
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.74136547
2012 Western Kentucky 16 0.93 0.972 0.487 -1.29274764

First observation: lots of chalk.  The top 4 teams are the one seeds and the next 4 are the two seeds, so filling out a bracket based strictly on my predicted wins will yield no surprises in the Elite 8.  In fact, if I only use the model’s predicted wins to determine who wins each game, I will have only 3 upsets in the entire tournament.

If we loosen the rules a little to account for some of the upset-prone and Cinderella profiles described above, we see the following.

First-Round Upset-Prone Teams:

Notre Dame

Marquette

UNLV

Michigan

San Diego State

Cinderella Stories (potential to advance to the Sweet 16):

Wichita State

Memphis

Murray State

Gonzaga

Saint Mary’s

Belmont

Long Beach State

New Mexico State

My advice: look at the first- and second-round matchups involving the teams above to get a better sense of each matchup, and then pick your upsets wisely.  For the record, my Final Four is Kentucky, Missouri, Ohio State and UNC.  As much as I dislike him as a person, I have Calipari’s Wildcats cutting down the nets in New Orleans on April 2.

A Brief Note about RPI

The RPI has come under a lot of criticism recently, and some of the data seem to support this criticism.  First, RPI and Seed are highly correlated: RPI alone accounts for 81% of the variation in tournament seeds, which suggests the selection committee leans heavily on RPI when seeding the bracket.  However, when we include Seed in our regression model to predict tournament wins, the adjusted R^2 of the model only increases to .41 (from .36 before).  Interestingly, the highest-ranked RPI team heading into the tournament over the last five years was the 2010 Kansas squad, which lost in the second round as a 1 seed.
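For reference, the adjusted R^2 comparison above uses the standard correction that penalizes plain R^2 for the number of predictors.  A minimal sketch in Python (the sample size n below is a placeholder for illustration, since this post does not report it):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2: penalizes plain R^2 for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative call only; n = 100 is a placeholder, not this study's sample size.
print(adjusted_r2(0.5, n=100, p=3))  # 0.484375
```

The takeaway is that adding a predictor (like Seed) only raises adjusted R^2 if it explains enough extra variance to offset the penalty for the added term.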

I have included a table at the end of this post that shows predicted wins from a regression EXCLUDING RPI.

Next Steps

My March Madness analysis is just beginning.  I plan to peel back the onion further to understand the critical components of offensive and defensive efficiency that help determine tournament wins.  I also plan to investigate alternative rankings besides RPI that level the playing field.

But before I do all that, I’m headed to the casino tomorrow to do some of my favorite things: watch the opening of March Madness, gamble and drink free beer.

-DaveCaughman

P.S.  In case you’re interested, below is the table showing the results from a regression that EXCLUDED RPI.

Coefficients
C -1.094
Off_Eff 12.505
Def_Eff -11.869
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.737538
2012 Ohio State 2 1.104 0.872 0.638 2.361752
2012 Missouri 2 1.184 0.969 0.629 2.210859
2012 Syracuse 1 1.103 0.894 0.667 2.088129
2012 Wichita State 5 1.112 0.907 0.622 2.046377
2012 North Carolina 1 1.1 0.898 0.658 2.003138
2012 Belmont 14 1.155 0.958 0.572 1.978773
2012 Michigan State 1 1.079 0.884 0.652 1.906699
2012 Kansas 2 1.09 0.9 0.642 1.85435
2012 New Mexico 5 1.067 0.877 0.607 1.839722
2012 Wisconsin 4 1.055 0.873 0.61 1.737138
2012 Saint Louis 9 1.061 0.881 0.599 1.717216
2012 Memphis 8 1.081 0.906 0.62 1.670591
2012 Murray State 6 1.087 0.913 0.612 1.662538
2012 Indiana 4 1.131 0.96 0.618 1.654915
2012 Saint Mary’s 7 1.124 0.957 0.606 1.602987
2012 Harvard 12 1.046 0.885 0.59 1.482165
2012 Gonzaga 7 1.079 0.922 0.609 1.455677
2012 Creighton 8 1.159 1.007 0.608 1.447212
2012 Florida 7 1.127 0.976 0.603 1.414991
2012 Iona 14 1.14 0.994 0.579 1.363914
2012 S Dakota St 14 1.125 0.981 0.585 1.330636
2012 Virginia 10 1.01 0.86 0.577 1.32871
2012 California 12 1.062 0.916 0.586 1.314306
2012 Lehigh 15 1.071 0.926 0.547 1.308161
2012 BYU 14 1.056 0.911 0.578 1.298621
2012 Georgetown 3 1.04 0.898 0.62 1.252838
2012 Davidson 13 1.094 0.96 0.57 1.19223
2012 Marquette 3 1.056 0.921 0.633 1.179931
2012 UNLV 6 1.061 0.927 0.616 1.171242
2012 Long Beach State 12 1.071 0.938 0.589 1.165733
2012 Duke 2 1.109 0.979 0.651 1.154294
2012 Montana 13 1.049 0.916 0.561 1.151741
2012 Baylor 3 1.08 0.949 0.634 1.147719
2012 New Mexico State 13 1.072 0.944 0.573 1.107024
2012 VCU 12 1.024 0.895 0.584 1.088365
2012 Vanderbilt 5 1.077 0.959 0.616 0.991514
2012 Kansas State 8 1.026 0.915 0.579 0.875995
2012 Cincinnati 6 1.031 0.921 0.579 0.867306
2012 Ohio 13 1.024 0.914 0.578 0.862854
2012 Louisville 4 0.988 0.878 0.621 0.839958
2012 Florida State 3 1 0.895 0.626 0.788245
2012 Alabama 9 0.999 0.894 0.586 0.787609
2012 Vermont 16 1.034 0.931 0.516 0.786131
2012 Iowa State 8 1.069 0.974 0.592 0.713439
2012 Purdue 10 1.091 0.998 0.579 0.703693
2012 Lamar 16 1.033 0.937 0.536 0.702412
2012 Texas 11 1.061 0.97 0.577 0.660875
2012 UNC Asheville 16 1.099 1.012 0.541 0.637567
2012 Temple 5 1.079 0.992 0.615 0.624847
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.585209
2012 San Diego State 6 1.023 0.938 0.607 0.565493
2012 West Virginia 10 1.047 0.967 0.57 0.521412
2012 North Carolina St 11 1.058 0.981 0.577 0.492801
2012 Notre Dame 7 1.04 0.963 0.586 0.481353
2012 Michigan 4 1.062 0.991 0.621 0.424131
2012 Connecticut 9 1.036 0.965 0.594 0.407595
2012 Southern Miss 9 1.052 0.985 0.611 0.370295
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.284883
2012 Loyola Maryland 15 1.016 0.959 0.557 0.228709
2012 Xavier 10 1.008 0.96 0.582 0.1168
2012 Detroit 15 1.037 0.994 0.524 0.075899
2012 Colorado State 11 1.07 1.03 0.598 0.06128
2012 Colorado 11 0.994 0.953 0.572 0.024813
2012 Norfolk State 15 0.983 0.943 0.522 0.005948
2012 South Florida 12 0.961 0.937 0.578 -0.197948
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.540877
2012 Western Kentucky 16 0.93 0.972 0.487 -1.001018
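As a sanity check, the Pred Wins column above can be reproduced directly from the published coefficients (C, Off_Eff, Def_Eff).  A quick sketch in Python:

```python
# Coefficients from the regression that excluded RPI (see the table above).
INTERCEPT = -1.094   # C
B_OFF = 12.505       # Off_Eff
B_DEF = -11.869      # Def_Eff

def predicted_wins(off_eff: float, def_eff: float) -> float:
    """Predicted tournament wins from offensive and defensive efficiency."""
    return INTERCEPT + B_OFF * off_eff + B_DEF * def_eff

# Kentucky's 2012 profile (Off_Eff 1.135, Def_Eff 0.873)
print(round(predicted_wins(1.135, 0.873), 6))  # 2.737538, matching the table
```

Note the RPI% column still appears in the table for reference, but it plays no role in these predictions.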



Welcome!

14 Mar

Welcome from the Tepper Sports Fanalytics Club.  We are a student-run organization at Carnegie Mellon’s Tepper School of Business whose mission is to bring quantitative rigor to the enjoyment and interpretation of sports.  We are excited to bring Carnegie Mellon’s analytical prowess to this flourishing field.

The purpose of this blog is to serve as a forum for everyone in our community to publish original research or humorous anecdotes on anything sports-related.  Whether you are a former all-state athlete or an armchair mathlete, this blog has something to offer.

We encourage you to make comments and provide feedback.

Enjoy!

-DaveCaughman