
MLB Closer Analysis – Understanding Statistical Components

12 Apr

With baseball back in season, I started thinking about my team (the Red Sox) and their closer situation.  They are transitioning to a new closer, which made me wonder how they came to this decision.  To get an idea of what makes a good closer, I did some research on closer statistics.  Ultimately, a closer’s ability comes down to how many save opportunities he converts.  So I analyzed save percentage (saves/save opportunities) as a function of various independent variables, one variable at a time, with the following results:

Table 1

Variable    Probability    R-squared
BAA         0.0311         0.137126
BABIP       0.7301         0.003770
BB/9        0.4358         0.019089
ERA         0.0001         0.285438
K/9         0.2874         0.035279
K/BB        0.0976         0.083421
OBP         0.0155         0.159534
OPS         0.0005         0.315516
SLG         0.0017         0.268218
WHIP        0.0174         0.164343

Interpreting this information first requires an explanation of the table.  Probability is the p-value: the chance of seeing a relationship at least this strong in the data if the statistic actually had no relationship with save percentage.  Generally, you want to see a number less than 0.05, in which case the statistic is considered significant.  R-squared is the share of the variation in save percentage that the independent variable explains, and it ranges from 0.0 to 1.0.  For example, OBP has a probability of 0.0155, which means OBP is significant, and an R-squared value of 0.159534, which means that OBP accounts for 15.95% of the total variability in save percentage.  Below, I removed all insignificant variables and sorted the rest by R-squared value, highest first.

Table 2

Variable    Probability    R-squared
OPS         0.0005         0.315516
ERA         0.0001         0.285438
SLG         0.0017         0.268218
WHIP        0.0174         0.164343
OBP         0.0155         0.159534
BAA         0.0311         0.137126
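
The step from Table 1 to Table 2 (drop anything with a probability of 0.05 or more, then sort by R-squared) can be sketched in a few lines of Python.  This is only an illustration built from the numbers in Table 1; the names TABLE_1, ALPHA and significant_by_r_squared are placeholders chosen for this example and are not part of the original analysis.

```python
# Values copied from Table 1: variable -> (probability, R-squared).
TABLE_1 = {
    "BAA": (0.0311, 0.137126),
    "BABIP": (0.7301, 0.003770),
    "BB/9": (0.4358, 0.019089),
    "ERA": (0.0001, 0.285438),
    "K/9": (0.2874, 0.035279),
    "K/BB": (0.0976, 0.083421),
    "OBP": (0.0155, 0.159534),
    "OPS": (0.0005, 0.315516),
    "SLG": (0.0017, 0.268218),
    "WHIP": (0.0174, 0.164343),
}

ALPHA = 0.05  # significance cutoff mentioned in the text


def significant_by_r_squared(table, alpha=ALPHA):
    """Keep variables with probability below alpha, highest R-squared first."""
    kept = [(name, p, r2) for name, (p, r2) in table.items() if p < alpha]
    return sorted(kept, key=lambda row: row[2], reverse=True)


for name, p, r2 in significant_by_r_squared(TABLE_1):
    print(f"{name:<5} {p:.4f} {r2:.6f}")
```

Note that sorting by R-squared and sorting by probability give different orders: ERA has the smallest probability, while OPS has the largest R-squared.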

There are a couple of interesting takeaways from this information.  The first takeaway is to recognize which stats are missing.  K/9 appears to have no significant relationship with converting saves.  This is a strange outcome because closers are generally thought of as big strikeout guys.  BB/9 and K/BB are also missing.  This is a little surprising because, even though closers aren’t necessarily control guys, you would expect a closer’s results to be somewhat dependent on these two variables.  The final missing stat is BABIP (batting average on balls in play).  This is something that is starting to be talked about more in the baseball world, but it apparently has little effect on save percentage.

The next takeaway is the relative importance of each variable.  According to my research, OPS explains more of the variation in save percentage than any other stat (it has the highest R-squared).  This is somewhat surprising because I don’t think I’ve ever heard this stat used with respect to pitchers.  However, after absorbing this information for a minute, it should not be that surprising.  OPS has grown in popularity over recent years as a measure of a hitter’s performance.  So, it would stand to reason that you could measure a pitcher’s value based on the OPS that hitters have against him.  The next most important variable is an old-time statistic, ERA.  As it turns out, the old guard in baseball are bigger stat geeks than they may care to admit.  But, again, this makes sense because if ERA didn’t matter then it wouldn’t have been such a popular stat for so long.  One stat that I was surprised by is WHIP.  This is my favorite new pitcher stat because I thought it covered most of the areas where ERA fell short.  As it turns out, WHIP ranks only 4th by R-squared among the stats I analyzed, which is far less important than I would have anticipated.

So, now that we have all this information we should apply some of it.  First, let’s look at the closers who might be over their heads in their 2011 roles.  Here are the highest OPS values for closers with more than 10 saves:

Name               OPS
Jon Rauch          0.799
Huston Street      0.781
Kevin Gregg        0.773
Matt Capps         0.726
Frank Francisco    0.721
Joakim Soria       0.709

Of these six, Frank Francisco is the only one with a save so far this year and the only clear-cut closer.  However, Francisco changed teams and leagues and also signed a very reasonable $5.5 million contract, all of which may contribute to him continuing as a closer.  To be fair, Soria is out for the season with an elbow injury (perhaps his uncharacteristically poor 2011 could have been an indication of the arm injury), otherwise he’d be closing for the Royals.  The other 4 are either setup guys or in a situation where it’s not completely clear.  So, for the most part, the empirical evidence supports the theoretical research.

That’s it for my analysis this time around, but I intend to do some research on multi-variable analysis in the near future.

-Erik Clark


March Madness 2012: Trying to Bring Some Sanity to the Madness

14 Mar

Nothing gets me fired up like hearing the CBS March Madness intro.  It’s been a mainstay on my GATA (workout) playlist for years, and whenever it comes on I think back to the Laettners and George Masons of the world while trying to forget about how many times Vanderbilt has broken my heart.  The seemingly unpredictable nature of the tournament is its biggest asset, but it also got me thinking.  Is there any way to better predict who is going to win at each stage of the tournament?  Having never won a tournament pool in my life, I have a financial motivation to get to the bottom of this as well.  Armed with the handy tool of regression analysis and a weird fondness for drinking coffee, listening to Pandora and plugging numbers into spreadsheets, I set about answering that very question.

Data Collection

Dependent Variable:  Since we are trying to predict tournament wins, I went back 5 tournaments (to the 2006-07 season) and for all 64 teams (leaving out the recent play-in games) gathered data for how many games each team won in each tournament.  Some quick math will tell you there are 320 observations in total.

Independent Variables:  As badly as I wanted to discover some overlooked statistical category right off the bat, starting with some basic stat lines first seemed like the best way to peel back the onion.  As a disciple of Dave Berri and Dean Oliver, I used offensive and defensive efficiency as a starting point.  To keep things simple, I threw in RPI to level out the playing field.

Sources:

1)      Teamrankings.com for offensive efficiency, defensive efficiency and RPI.  For these stats I went back and made sure I captured them immediately before the tournament began.

2)      Wikipedia for tournament wins each year.  Now that I routinely see my professors citing Wikipedia for class slides, I think we’ve removed the final hurdle to Wikipedia legitimacy.

Results/Insights:

Variable    Coefficient    Std Error    P-value
Constant    -7.493745      2.090886     0.0004
OFF_EFF     4.949186       1.733313     0.0046
DEF_EFF     -5.416243      1.764366     0.0023
RPI         14.09208       2.126283     0.0000

Above are the results of the regression.  As you can see, all variables are significant predictors of tournament wins.  Together they account for 36% of the variation in tournament wins (adjusted R-squared).

To put these numbers in perspective, offensive efficiency across these 5 tournaments ranges from .909 to 1.17.  This is the difference between the #224 RPI-ranked Mississippi Valley State Delta Devils of 2008 and the 2007 Florida national champs.  With a 5 year average offensive efficiency of 1.065, the difference between tournament average and greatness is only .106.  Thus, using the coefficients from the above regression, an increase of this amount results in roughly .5 additional tournament victories.  An analysis of defensive efficiency yields similar results.

RPI over these 5 tournaments ranges from .463 to .688.  This is the difference between, once again, the 2008 Mississippi Valley State Delta Devils and the 2010 Kansas Jayhawks (who lost in the second round).  With a 5 year average of .592, moving from an average RPI to the best results in an estimated 1.4 additional tournament wins.  More on RPI later.
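
The arithmetic behind these "extra wins" figures is just coefficient times the gap between the best and the average value.  Here is a rough Python sketch of it, using the rounded coefficients from the regression results and the rounded averages and best values quoted in this post (the defensive figures come from the summary table further down).  The dictionary and function names are placeholders for this example, and because the inputs are rounded the output can differ slightly from the figures in the text.

```python
# Rounded coefficients from the regression results.
COEF = {"off_eff": 4.949186, "def_eff": -5.416243, "rpi": 14.09208}

# Five-tournament averages and best values quoted in the post.
# For defensive efficiency, lower is better, so "best" is the minimum.
AVERAGE = {"off_eff": 1.065, "def_eff": 0.947, "rpi": 0.592}
BEST = {"off_eff": 1.17, "def_eff": 0.816, "rpi": 0.688}


def extra_wins(stat):
    """Predicted wins gained by moving from the average to the best value of one stat."""
    return COEF[stat] * (BEST[stat] - AVERAGE[stat])


for stat in COEF:
    print(f"{stat}: {extra_wins(stat):+.2f} predicted wins")
```

The point of writing it this way is that all three stats end up on the same scale (predicted wins), which makes it easier to see that the RPI gap is worth more than either efficiency gap.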

Underdog/Overseeded profiles

Regression results aside, what we really want to do is pick March Madness teams.  Before we do that, let’s take a look at what the analysis tells us about future Cinderellas and upset-prone teams.  The table below can help shed some light on this.

Group                                     Observations  Avg Off_Eff  Avg Def_Eff  Avg RPI
First Round Upsets
10-16 seed winners (underdogs who won)    30            1.055        0.944        0.578
1-7 seed losers (favorites who lost)      30            1.065        0.949        0.608
Second Round Upsets
6-16 seed winners (underdogs who won)     18            1.075        0.951        0.592
1-3 seed losers (favorites who lost)      13            1.095        0.933        0.636
Overall Avg                               320           1.065        0.947        0.592
Standard Deviation                                      0.046        0.041        0.041
Best                                                    1.17         0.816        0.688
Worst                                                   0.909        1.068        0.463
Range                                                   0.261        0.252        0.225

For First Round Upsets, I think it’s easier to look at favorites who lost – that is, teams seeded 1-7 that lost in the first round.  Their offensive and defensive efficiencies are around the Overall Average.

For Second Round Upsets, looking at the profile of the underdogs sheds some interesting light.  Teams seeded 6-16 that advanced to the Sweet 16 had above-average offensive efficiencies (1.075 versus 1.065 overall) while their defensive efficiencies were only slightly worse than average (0.951 versus 0.947).

Based on this segmentation, it might be beneficial to profile each first and second round matchup as described above in order to wisely predict upsets.
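
One way to compare these profiles on a common scale is to express each group average as a number of standard deviations away from the overall average.  The sketch below does that with the figures from the table above.  It is a descriptive gauge only, not a significance test, and the names GROUPS and standardized_gap are placeholders for this example.

```python
# Overall averages and standard deviations from the table (320 team-seasons).
OVERALL_AVG = {"off_eff": 1.065, "def_eff": 0.947, "rpi": 0.592}
STD_DEV = {"off_eff": 0.046, "def_eff": 0.041, "rpi": 0.041}

# Group averages copied from the table.
GROUPS = {
    "R1 underdogs who won": {"off_eff": 1.055, "def_eff": 0.944, "rpi": 0.578},
    "R1 favorites who lost": {"off_eff": 1.065, "def_eff": 0.949, "rpi": 0.608},
    "R2 underdogs who won": {"off_eff": 1.075, "def_eff": 0.951, "rpi": 0.592},
    "R2 favorites who lost": {"off_eff": 1.095, "def_eff": 0.933, "rpi": 0.636},
}


def standardized_gap(group_avg):
    """Gap between a group average and the overall average, in standard deviations.

    For def_eff a positive gap means a worse defense, since lower is better.
    """
    return {
        stat: (group_avg[stat] - OVERALL_AVG[stat]) / STD_DEV[stat]
        for stat in OVERALL_AVG
    }


for name, averages in GROUPS.items():
    gaps = standardized_gap(averages)
    print(name, {stat: round(gap, 2) for stat, gap in gaps.items()})
```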

Offense wins Championships

It appears as though offense is rewarded more in the tournament.  There are a number of observations that support this.

Most importantly, in four of the five years under observation, the tournament winner has had a higher rated offense than defense.  From 2007 to 2011, the winners had the number 1, 1, 2, 6 and 34 rated offenses in that tournament and the 9, 3, 28, 5 and 40 rated defenses respectively.  The only exception was a well-balanced Duke in 2010, who had the #6 offense and #5 defense in that tournament.

Second, using single-variable regression models, offensive efficiency has a higher R^2 value than defensive efficiency (15% versus 10%).

Finally, as noted above, it seems that if you want to be a good Cinderella story, offensive efficiency is more important than defensive efficiency.

The 2012 Bracket

Now the fun begins.  The results of my regression to predict tournament wins for the 2012 field are summarized below.

Coefficients
C -7.49374487
Off_Eff 4.949185836
Def_Eff -5.416243101
RPI% 14.09207946
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.766433671
2012 Syracuse 1 1.103 0.894 0.667 2.522502778
2012 North Carolina 1 1.1 0.898 0.658 2.359161533
2012 Michigan State 1 1.079 0.884 0.652 2.246503557
2012 Ohio State 2 1.104 0.872 0.638 2.237939008
2012 Kansas 2 1.09 0.9 0.642 2.073363917
2012 Missouri 2 1.184 0.969 0.629 1.981669579
2012 Duke 2 1.109 0.979 0.651 1.866343958
2012 Wichita State 5 1.112 0.907 0.622 1.862490714
2012 Memphis 8 1.081 0.906 0.62 1.686298038
2012 Marquette 3 1.056 0.921 0.633 1.664521778
2012 Baylor 3 1.08 0.949 0.634 1.645739511
2012 Indiana 4 1.131 0.96 0.618 1.613096043
2012 Wisconsin 4 1.055 0.873 0.61 1.595434434
2012 New Mexico 5 1.067 0.877 0.607 1.590883453
2012 Murray State 6 1.087 0.913 0.612 1.565342815
2012 Georgetown 3 1.04 0.898 0.62 1.526711363
2012 Gonzaga 7 1.079 0.922 0.609 1.434726902
2012 Florida State 3 1 0.895 0.626 1.429545136
2012 Saint Louis 9 1.061 0.881 0.599 1.42678673
2012 Saint Mary’s 7 1.124 0.957 0.606 1.425595518
2012 UNLV 6 1.061 0.927 0.616 1.417204898
2012 Louisville 4 0.988 0.878 0.621 1.391770641
2012 Creighton 8 1.159 1.007 0.608 1.356189026
2012 Vanderbilt 5 1.077 0.959 0.616 1.323072092
2012 Florida 7 1.127 0.976 0.603 1.295258218
2012 Harvard 12 1.046 0.885 0.59 1.204055254
2012 Michigan 4 1.062 0.991 0.621 1.145974923
2012 Temple 5 1.079 0.992 0.615 1.140142362
2012 Belmont 14 1.155 0.958 0.572 1.094473334
2012 California 12 1.062 0.916 0.586 1.058970374
2012 San Diego State 6 1.023 0.938 0.607 1.042728447
2012 Long Beach State 12 1.071 0.938 0.589 1.026631937
2012 S Dakota St 14 1.125 0.981 0.585 1.004621201
2012 Southern Miss 9 1.052 0.985 0.611 0.988059728
2012 Virginia 10 1.01 0.86 0.577 0.978093609
2012 VCU 12 1.024 0.895 0.584 0.956458258
2012 BYU 14 1.056 0.911 0.578 0.943619839
2012 Iona 14 1.14 0.994 0.579 0.923895351
2012 Alabama 9 0.999 0.894 0.586 0.866329014
2012 Iowa State 8 1.069 0.974 0.592 0.864025052
2012 Kansas State 8 1.026 0.915 0.579 0.787571371
2012 Cincinnati 6 1.031 0.921 0.579 0.779819841
2012 Connecticut 9 1.036 0.965 0.594 0.777632266
2012 New Mexico State 13 1.072 0.944 0.573 0.773610392
2012 Ohio 13 1.024 0.914 0.578 0.768997163
2012 Davidson 13 1.094 0.96 0.57 0.753556353
2012 Notre Dame 7 1.04 0.963 0.586 0.69552486
2012 Purdue 10 1.091 0.998 0.579 0.659720273
2012 Colorado State 11 1.07 1.03 0.598 0.650217101
2012 Montana 13 1.049 0.916 0.561 0.642328971
2012 Texas 11 1.061 0.97 0.577 0.634715345
2012 North Carolina St 11 1.058 0.981 0.577 0.560289114
2012 Lehigh 15 1.071 0.926 0.547 0.499759516
2012 Xavier 10 1.008 0.96 0.582 0.497031324
2012 West Virginia 10 1.047 0.967 0.57 0.483030917
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.365921937
2012 South Florida 12 0.961 0.937 0.578 0.332624864
2012 Colorado 11 0.994 0.953 0.572 0.32473563
2012 Loyola Maryland 15 1.016 0.959 0.557 0.189739068
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.176751633
2012 Lamar 16 1.033 0.937 0.536 0.097098906
2012 UNC Asheville 16 1.099 1.012 0.541 0.087987336
2012 Vermont 16 1.034 0.931 0.516 -0.14729604
2012 Detroit 15 1.037 0.994 0.524 -0.36093516
2012 Norfolk State 15 0.983 0.943 0.522 -0.38014696
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.74136547
2012 Western Kentucky 16 0.93 0.972 0.487 -1.29274764
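
For anyone who wants to check the Pred Wins column, each value is just the constant plus each coefficient times the team’s stat.  Here is a minimal Python sketch using the coefficients listed above the table; the function name predicted_wins and the constant names are placeholders for this example.

```python
# Coefficients exactly as listed above the table.
INTERCEPT = -7.49374487
COEF_OFF_EFF = 4.949185836
COEF_DEF_EFF = -5.416243101
COEF_RPI = 14.09207946


def predicted_wins(off_eff, def_eff, rpi):
    """Predicted tournament wins from the three-variable regression."""
    return (
        INTERCEPT
        + COEF_OFF_EFF * off_eff
        + COEF_DEF_EFF * def_eff
        + COEF_RPI * rpi
    )


# Inputs taken from the Kentucky and Western Kentucky rows of the table.
print(round(predicted_wins(1.135, 0.873, 0.665), 4))  # Kentucky
print(round(predicted_wins(0.930, 0.972, 0.487), 4))  # Western Kentucky
```

Because this is a plain linear model, nothing stops it from predicting negative wins for the weakest teams, which is why the bottom of the table dips below zero.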

First observation – lots of chalk.  The top 4 teams are the one seeds and the next 4 are the two seeds.  So, filling out a bracket based strictly on my predicted wins will yield no surprises in the Elite 8.  In fact, if I only use the model’s predicted wins to determine who wins each game, I will have only 3 upsets all tournament.

If we loosen the rules a little to account for some of the upset-prone and Cinderella profiles described above, we see the following.

First Round Upset Prone Teams:

Notre Dame

Marquette

UNLV

Michigan

San Diego State

Cinderella Stories (potential to advance to the Sweet 16):

Wichita State

Memphis

Murray State

Gonzaga

Saint Mary’s

Belmont

Long Beach State

New Mexico State

What I would advise is to look at the first and second round matchups involving the above teams to get a better sense of each matchup, and then pick your upsets wisely.  For the record, my Final Four is Kentucky, Missouri, Ohio State and UNC.  As much as I dislike him as a person, I have Calipari’s Wildcats cutting down the nets in New Orleans on April 2.

A Brief Note about RPI

The RPI has come under a lot of criticism recently, and some of the data seem to support that criticism.  First, RPI and Seed are highly correlated.  In fact, RPI accounts for 81% of the variation in tournament seedings.  The data certainly seem to indicate that the selection committee leans heavily on RPI when seeding the bracket.  However, when we include Seed in our regression model to predict tournament wins, the adjusted R^2 of the model only increases to .41 (from .36 before).  Interestingly, the highest-ranked RPI team heading into the tournament over the last five years was the 2010 Kansas squad, who lost in the second round as a 1 seed.

I have included a table at the end of this post that shows predicted wins from a regression EXCLUDING RPI.

Next Steps

My March Madness analysis is just beginning.  I plan to peel back the onion further to understand the critical components of offensive and defensive efficiency that help determine tournament wins.  I also plan to investigate alternative rankings besides RPI that level the playing field.

But, before I do all that, I’m headed to the casino tomorrow to do some of my favorite things:  watch the opening of March Madness, gamble and drink free beer.

-DaveCaughman

P.S.  In case you’re interested, below is the table showing the results from a regression that EXCLUDED RPI

Coefficients
C -1.094
Off_Eff 12.505
Def_Eff -11.869
Year Team Seed Off_Eff Def_Eff RPI% Pred Wins
2012 Kentucky 1 1.135 0.873 0.665 2.737538
2012 Ohio State 2 1.104 0.872 0.638 2.361752
2012 Missouri 2 1.184 0.969 0.629 2.210859
2012 Syracuse 1 1.103 0.894 0.667 2.088129
2012 Wichita State 5 1.112 0.907 0.622 2.046377
2012 North Carolina 1 1.1 0.898 0.658 2.003138
2012 Belmont 14 1.155 0.958 0.572 1.978773
2012 Michigan State 1 1.079 0.884 0.652 1.906699
2012 Kansas 2 1.09 0.9 0.642 1.85435
2012 New Mexico 5 1.067 0.877 0.607 1.839722
2012 Wisconsin 4 1.055 0.873 0.61 1.737138
2012 Saint Louis 9 1.061 0.881 0.599 1.717216
2012 Memphis 8 1.081 0.906 0.62 1.670591
2012 Murray State 6 1.087 0.913 0.612 1.662538
2012 Indiana 4 1.131 0.96 0.618 1.654915
2012 Saint Mary’s 7 1.124 0.957 0.606 1.602987
2012 Harvard 12 1.046 0.885 0.59 1.482165
2012 Gonzaga 7 1.079 0.922 0.609 1.455677
2012 Creighton 8 1.159 1.007 0.608 1.447212
2012 Florida 7 1.127 0.976 0.603 1.414991
2012 Iona 14 1.14 0.994 0.579 1.363914
2012 S Dakota St 14 1.125 0.981 0.585 1.330636
2012 Virginia 10 1.01 0.86 0.577 1.32871
2012 California 12 1.062 0.916 0.586 1.314306
2012 Lehigh 15 1.071 0.926 0.547 1.308161
2012 BYU 14 1.056 0.911 0.578 1.298621
2012 Georgetown 3 1.04 0.898 0.62 1.252838
2012 Davidson 13 1.094 0.96 0.57 1.19223
2012 Marquette 3 1.056 0.921 0.633 1.179931
2012 UNLV 6 1.061 0.927 0.616 1.171242
2012 Long Beach State 12 1.071 0.938 0.589 1.165733
2012 Duke 2 1.109 0.979 0.651 1.154294
2012 Montana 13 1.049 0.916 0.561 1.151741
2012 Baylor 3 1.08 0.949 0.634 1.147719
2012 New Mexico State 13 1.072 0.944 0.573 1.107024
2012 VCU 12 1.024 0.895 0.584 1.088365
2012 Vanderbilt 5 1.077 0.959 0.616 0.991514
2012 Kansas State 8 1.026 0.915 0.579 0.875995
2012 Cincinnati 6 1.031 0.921 0.579 0.867306
2012 Ohio 13 1.024 0.914 0.578 0.862854
2012 Louisville 4 0.988 0.878 0.621 0.839958
2012 Florida State 3 1 0.895 0.626 0.788245
2012 Alabama 9 0.999 0.894 0.586 0.787609
2012 Vermont 16 1.034 0.931 0.516 0.786131
2012 Iowa State 8 1.069 0.974 0.592 0.713439
2012 Purdue 10 1.091 0.998 0.579 0.703693
2012 Lamar 16 1.033 0.937 0.536 0.702412
2012 Texas 11 1.061 0.97 0.577 0.660875
2012 UNC Asheville 16 1.099 1.012 0.541 0.637567
2012 Temple 5 1.079 0.992 0.615 0.624847
2012 Saint Bonaventure 14 1.054 0.969 0.56 0.585209
2012 San Diego State 6 1.023 0.938 0.607 0.565493
2012 West Virginia 10 1.047 0.967 0.57 0.521412
2012 North Carolina St 11 1.058 0.981 0.577 0.492801
2012 Notre Dame 7 1.04 0.963 0.586 0.481353
2012 Michigan 4 1.062 0.991 0.621 0.424131
2012 Connecticut 9 1.036 0.965 0.594 0.407595
2012 Southern Miss 9 1.052 0.985 0.611 0.370295
2012 LIU Brooklyn 16 1.067 1.008 0.557 0.284883
2012 Loyola Maryland 15 1.016 0.959 0.557 0.228709
2012 Xavier 10 1.008 0.96 0.582 0.1168
2012 Detroit 15 1.037 0.994 0.524 0.075899
2012 Colorado State 11 1.07 1.03 0.598 0.06128
2012 Colorado 11 0.994 0.953 0.572 0.024813
2012 Norfolk State 15 0.983 0.943 0.522 0.005948
2012 South Florida 12 0.961 0.937 0.578 -0.197948
2012 Mississippi Valley St 16 0.963 0.968 0.513 -0.540877
2012 Western Kentucky 16 0.93 0.972 0.487 -1.001018
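
The same kind of check works for the regression that excludes RPI.  The sketch below applies the three coefficients listed above this table to a few rows and sorts them, which shows how a 14 seed like Belmont can land ahead of a 2 seed like Duke once RPI is dropped.  The names TEAMS and predicted_wins_no_rpi are placeholders for this example.

```python
# Coefficients from the regression that excludes RPI.
INTERCEPT = -1.094
COEF_OFF_EFF = 12.505
COEF_DEF_EFF = -11.869


def predicted_wins_no_rpi(off_eff, def_eff):
    """Predicted tournament wins using only offensive and defensive efficiency."""
    return INTERCEPT + COEF_OFF_EFF * off_eff + COEF_DEF_EFF * def_eff


# (Off_Eff, Def_Eff) copied from three rows of the table.
TEAMS = {
    "Kentucky": (1.135, 0.873),
    "Duke": (1.109, 0.979),
    "Belmont": (1.155, 0.958),
}

ranked = sorted(
    TEAMS, key=lambda team: predicted_wins_no_rpi(*TEAMS[team]), reverse=True
)
for team in ranked:
    print(f"{team:<10} {predicted_wins_no_rpi(*TEAMS[team]):.6f}")
```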