Archive for the ‘Batting’ Category

If you haven’t read it, here’s the link.

For MY ball tests, the difference I found in COR was 2.6 standard deviations, as indicated in the article. The difference in seam height is around 1.5 SD. The difference in circumference is around 1.8 SD.

For those of you a little rusty on your statistics, the SD of the difference between two sample means is the square root of the sum of their respective variances.
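As a quick sketch of that formula (the measurement numbers here are hypothetical, not my actual ball-test data):

```python
import math

def sd_of_difference(sd1, n1, sd2, n2):
    """SD of the difference between two sample means: the square root of
    the sum of the variances of each mean (sd^2 / n for each sample)."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical COR test: 36 balls per group, per-ball SD of .003
se_diff = sd_of_difference(0.003, 36, 0.003, 36)
diff = 0.0018  # hypothetical observed difference in mean COR
print(round(diff / se_diff, 2))  # roughly 2.5 SDs apart
```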

The use of statistical significance is one of the most misunderstood and abused concepts in science. You can read about this on the internet if you want to know why. It has a bit to do with frequentist versus Bayesian statistics/inference.

For example, when you have a non-null hypothesis going into an experiment, such as, “The data suggest an altered baseball,” then ANY positive result supports that hypothesis and increases the probability of it being true, regardless of the “statistical significance of those results.”

Of course, the more significant the result, the more it shifts the prior probability. However, the classic case of using 2 or 2.5 SD to define “statistical significance” really only applies when you start out with the null hypothesis. In this case, for example, that would be if you had no reason to suspect a juiced ball and merely tested balls to see if perhaps there were differences. In reality, you almost always have a prior probability, which is why the traditional practice of accepting or rejecting the null hypothesis based on the statistical significance of your experimental results is an obsolete concept.

In any case, from the results of MLB’s own tests, in which they tested something like 180 balls a year, the seam height reduction we found was something like 6 or 7 SD and the COR increase was something like 3 or 4 SD. We can also add to the mix Ben’s original test, in which he found an increase in COR of .003, or around 60% of what I found.

So yes, the combined results of all three tests are almost unequivocal evidence that the ball was altered. There’s not much else you can do other than to test balls. Of course the ball testing would mean almost nothing if we didn’t have the batted ball data to back it up. We do.

I don’t think this “ball change” was intentional by MLB, although it could be.

In my extensive research for this project, I have uncovered two things:

One, there is quite a large actual year to year difference in the construction of the ball which can and does have a significant impact on HR and offensive rates in general. The concept of a “juiced” (or “de-juiced”) ball doesn’t really mean anything unless it is compared to some other ball – for example, in our case, 2014 to 2016/2017.

Two, we now know, because of Statcast and lots of great work and insight by Alan Nathan and others, that very small changes in things like COR, seam height, and size can have a dramatic impact on offense. My (wild) guess is that we probably have something like a 2 or 3 foot variation (one SD) in batted ball distance for a typical HR trajectory from year to year, based on the (randomly) fluctuating composition and construction of the ball. And from 2014 to 2016 (and so far this year), we just happened to have seen a 2 or 3 standard deviation variation.

We’ve seen it before, most notably in 1987, and we’ll probably see it again. I have also altered my thinking about the “steroid era.” Now that I know that balls can fluctuate from year to year, sometimes greatly, it is entirely possible that balls were constructed differently starting in 1993 or so – perhaps in combination with burgeoning and rampant PED use.

Finally, it is true that there are many things that can influence run scoring and HR rates, some more than others. Weather and parks are very minor. Even a big change in one or two parks, or a very hot or cold year, will have very small effects overall. And of course we can easily test or account for these things.

Change in talent can surprisingly have a large effect on overall offense. For example, this year, the AL lost a lot of offensive talent which is one reason why the NL and the AL have almost equal scoring despite the AL having the DH.

The only other thing that can fairly drastically change offense is the strike zone. Obviously it depends on the magnitude of the change. In the pitch f/x era we can measure it, as Joe Roegele and others do every year. It had not changed much in recent years, until this year: it is smaller now, which is causing an uptick in offense from last year. I also believe, as others have said, that the uptick since last year is due to batters realizing that they are playing with a livelier ball and thus hitting more air balls. They may be hitting more air balls even without thinking that the ball is juiced – they may just be jumping on the “fly-ball bandwagon.” Either way, hitting more fly balls compounds the effect of a juiced ball, because it is correct to hit more fly balls.

Then there is the bat, which I know nothing about. I have not heard anything about the bats being different or what you can do to a bat to increase or decrease offense, within allowable MLB limits.

Do I think that the “juiced ball” (in combination with players taking advantage of it) is the only reason for the HR/scoring surge? I think it’s the primary driver, by far.


There’s been much research and many articles over the years with respect to hitter (and other) aging curves. (I even came across in a Google search a fascinating component aging curve for PGA golfers!) I’ve publicly and privately been doing aging curves for 20 years. So has Tango Tiger. Jeff Zimmerman has also been prolific in this regard. Others have contributed as well. You can Google them if you want.

Most of the credible aging curves use some form of the delta method, which is described in this excellent series on aging by the crafty ne’er-do-well, MGL. If you’re too lazy to look it up, the delta method is basically this, from the article:

The “delta method” looks at all players who have played in back-to-back years. Many players have several back-to-back year “couplets,” obviously. For every player, it takes the difference between their rate of performance in Year I and Year II and puts that difference into a “bucket,” which is defined by the age of the player in those two years….

When we tally all the differences in each bucket and divide by the number of players, we get the average change from one age to the next for every player who ever played in at least one pair of back-to-back seasons. So, for example, for all players who played in their age 29 and 30 seasons, we get the simple average of the rate of change in offensive performance between 29 and 30.

That’s really the only way to do an aging curve, as far as I know, unless you want to use an opaque statistical method like J.C. Bradbury did back in 2009 (you can look that up too). One of the problems with aging curves, which I also discuss in the aforementioned article, and one that comes up a lot in baseball research, is survivorship bias. I’ll get to that in a paragraph or two.

Let’s say we want to use the delta method to compute the average change in wOBA performance from age 29 to 30. To do that, we look at all players who played in their age 29 and age 30 years, record each player’s difference, weight it by some number of PA (maybe the lesser of the two – either year 1 or year 2, maybe the harmonic mean of the two, or maybe weight them all equally – it’s hard to say), and then take the simple weighted average of all the differences. For example, say we have two players. Player A has a .300 wOBA in his age 29 season in 100 PA and a .290 wOBA in his age 30 season in 150 PA. Player B is .320 in year one in 200 PA and .300 in year two in 300 PA. Using the delta method we get a difference of -.010 (a decline) for player A weighted by, say, 100 PA (the lesser of 100 and 150), and a difference of -.020 for Player B in 200 PA (also the lesser of the two PA). So we have an average decline in our sample of (10 * 100 + 20 * 200) / (300), or 16.67 points of wOBA decline. We would do the same for all age intervals and all players and if we chain them together we get an aging curve for the average MLB player.
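The two-player example above can be written out as a short sketch, using the “lesser of the two PA” weighting (one of the several reasonable weighting choices mentioned):

```python
# Delta-method calculation for one age bucket (29 -> 30)
players = [
    # (wOBA age 29, PA age 29, wOBA age 30, PA age 30)
    (0.300, 100, 0.290, 150),  # Player A
    (0.320, 200, 0.300, 300),  # Player B
]

weighted_sum = 0.0
total_weight = 0
for woba1, pa1, woba2, pa2 in players:
    weight = min(pa1, pa2)  # lesser of the two PA
    weighted_sum += (woba2 - woba1) * weight
    total_weight += weight

avg_change = weighted_sum / total_weight
print(round(avg_change * 1000, 2))  # -16.67 points of wOBA
```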

There are issues with that calculation, such as our choice to weight each player’s difference by the “lesser of the two PA,” what it means to compute “an average decline” for that age interval (since it includes all types of players, part-time, full-time, etc.) and especially what it means when we chain every age interval together to come up with an aging curve for the average major league player when it’s really a compendium of a whole bunch of players all with different career lengths at different age intervals.

Typically when we construct an aging curve, we’re not at all looking at the careers of any individual players. If we do that, we end up with severe selective sampling and survivorship problems. I’m going to ignore all of these issues and focus on survivorship bias only. It has the potential to be extremely problematic, even when using the delta method.

Let’s say that a player is becoming a marginal player for whatever reason, perhaps because he is at the end of his career. Let’s also say that we have a bunch of players like that and their true talent is a wOBA of .280. If we give them 300 PA, half will randomly perform better than that and half will randomly perform worse, simply because 300 PA is just a random sample of their talent. In fact, we know that the random standard deviation of wOBA in 300 trials is around 25 points, such that 5% of our players, whom we know have a true talent of .280, will actually hit .230 or less by chance alone. That’s a fact. There’s nothing they or anyone else can do about it. No player has an “ability” to fluctuate less than random variance dictates in any specific number of PA. There might be something about a player that creates more variance on average, but it is mathematically impossible to have less (actually the floor is a bit higher than that because of varying opponents and conditions).

Let’s assume that all players who hit less than .230 will retire or be cut – they’ll never play again, at least not in the following season. That is not unlike what happens in real life when a marginal player has a bad season. He almost always gets fewer PA the following season than he would have gotten had he not had an unlucky season. In fact, not playing at all is just a subset of playing less – both are examples of survivorship bias and create problems with aging curves. Let’s see what happens to our aging interval with these marginal players when 5% of them don’t play the next season.

We know that this entire group of players are .280 hitters because we said so. If 5% of them hit, on average, .210, then the other 95% must have hit .284 since the whole group must hit .280 – that’s their true talent. This is just a typical season for a bunch of .280 hitters. Nothing special going on here. We could have split them up any way we wanted, as long as in the aggregate they hit at their true talent level.

Now let’s say that these hitters are in their age 30 season and they are supposed to decline by 10 points in their age 31 season. If we do an aging calculation on these players in a typical pair of seasons we absolutely should see .280 in the first year and .270 in the second. In fact, if we let all our players play a random or a fixed number of PA in season two, that is exactly what we would see. It has to be. It is a mathematical certainty, given everything we stated. However survivorship bias screws up our numbers and results in an incorrect aging value from age 30 to age 31. Let’s try it.

Only 95% of our players play in season two, so 5% drop out of our sample, at least from age 30 to age 31. There’s nothing we can do about that. When we compute a traditional aging curve using the delta method, we only use numbers from pairs of years. We can never use the last year of a player’s career as the first year in a year pairing. We don’t have any information about that player’s next season. We can use a player’s last year, say, at age 30 in an age 29 to 30 pairing but not in a 30 to 31 pairing. Remember that the delta method always uses age pairings for each player in the sample.

What do those 95% hit in season one? Remember they are true .280 hitters. Well, they don’t hit .280. I already said that they hit .284. That is because they got a little lucky. The ones that got really unlucky, balancing out the lucky ones, are not playing in season two and thus dropped out of our aging curve sample. What do these true .280 players (who hit .284) hit in season two? Season two is an unbiased sample of their true talent. We know that their true talent was .280 in season one and we know that from age 30 to age 31 all players will lose 10 points in true talent because we said so. So they will naturally hit .270 in year two.

What does our delta method calculation tell us about how players age from age 30 to age 31? It tells us they lose 14 points in wOBA and not 10! It’s giving us a wrong answer because of survivorship bias. Had those other 5% of players played, they would have also hit .270 in year two and when we add everyone up, including the unlucky players, we would come up with the correct answer of a 10-point loss from age 30 to age 31 (the unlucky players would have improved in year two by 60 points).
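A quick Monte Carlo sketch of this effect, assuming for simplicity that observed wOBA over ~300 PA is normally distributed around true talent with the 25-point SD mentioned above (the numbers come out slightly different from the stylized 5%/.210 split in the text, but the direction and the bias are the same):

```python
import random

random.seed(1)
TRUE_Y1, TRUE_Y2 = 0.280, 0.270  # true talent drops 10 points from 30 to 31
SD = 0.025                       # random SD of wOBA over ~300 PA
N = 500_000

deltas = []
for _ in range(N):
    y1 = random.gauss(TRUE_Y1, SD)   # observed age-30 season
    if y1 <= 0.230:                  # unlucky player is cut...
        continue                     # ...so he never enters the 30-to-31 sample
    y2 = random.gauss(TRUE_Y2, SD)   # age-31 season, an unbiased sample
    deltas.append(y1 - y2)

observed_decline = sum(deltas) / len(deltas) * 1000  # points of wOBA
print(round(observed_decline, 1))  # more than the true 10-point decline
```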

One way to avoid this problem (survivorship bias will always make it look like players lose more or gain less as they age because the players that drop out from season to season always, on the average, got unlucky in season one) is to ignore the last season of a player’s career in our calculations. That’s fine and dandy, but survivorship bias exists in every year of a player’s career. As I wrote earlier, dropping out is just a small subset of this bias. Every player that gets unlucky in one season will see fewer PA in his next season, which creates the same kind of erroneous results. For example, if the 5% of unlucky players did play in season two, but only got 50 PA whereas the other 95% of slightly lucky players got 500 PA, we would still come up with a decline of more than 10 points of wOBA – again an incorrect answer.

To correct for this survivorship bias, which really wreaks havoc with aging curves, a number of years ago, I decided to add a phantom year for players after their last season of action. For that year, I used a projection – our best estimate of what they would have done had they been allowed to play another year. That reduced the survivorship bias but it didn’t nearly eliminate it because, as I said, every player suffers from it in reduced PA for unlucky players and increased PA for lucky ones, in their subsequent seasons.

Not only that, but we get the same effect within years. If two players have .300 wOBA true talents, but player A hits worse than .250 by luck alone in his first month (which will happen more than 16% of the time) and player B hits .350 or more, who do you think will get more playing time for the remainder of the season even though we know that they have the same talent, and that both, on the average, will hit exactly .300 for the remainder of the season?

I finally came up with a comprehensive solution based on the following thought process: If we were conducting an experiment, how would we approach the question of computing aging intervals? We would record every player’s season one (which would be an unbiased sample of his talent, so no problem so far) and then we would guarantee that every player would get X number of PA the next season, preferably something like 500 or 600 to create large samples of seasonal data. We would also give everyone a large number of PA in all season ones too, but it’s not really necessary.

How do we do that? We merely extend season two data using projections, just as I did in adding phantom seasons after a player’s career was over (or he missed a season in the middle of his career). Basically I’m doing the same thing, whether I’m adding 600 PA to a player who didn’t play (the phantom season) or I’m adding 300 PA to a player who only had 300 PA in season two. By doing this I am completely eliminating survivorship bias. Of course this correction method lives or dies with how accurate the projections are but even a simple projection system like Marcel will suffice when dealing with a large number of players of different talent levels. Now let’s get to the results.

I looked at all players from 1977 to 2016 and park- and league-adjusted their wOBA for each season. Essentially I am using wOBA+. I also only looked at seasonal pairs (with a minimum of 10 PA in each season) where the player played on the same team. I didn’t have to do that, but my sample was large enough that I felt the reduction in sample size was worth getting rid of any park biases, even though I was dealing with park-adjusted numbers.

Using the delta method with no survivorship bias other than ignoring the last year of every player’s career, this is the aging curve I arrived at after chaining all of the deltas. This is the typical curve you will see in most of the prior research.

1977-2016 Aging Curve using Delta Method Without Correcting for Survivorship Bias

[figure: curve1]

Here is the same curve after completing all season two’s with projections. For example, let’s say that a player is projected to hit .300 in his age 30 season and he hits .250 in only 150 PA (his manager benches him because he’s hit so poorly). His in-season projection would change because of the .250. It might now be .290. So I complete a 600 PA season by adding 450 PA of .290 hitting to the 150 PA of .250 hitting for a complete season of .280 in 600 PA.

If that same player hits .320 in season two in 620 PA then I add nothing to his season two data. Only players with less than 600 PA have their seasons completed with projections. How do I weight the season pairs? Without any completion correction, as in the first curve above, I weighted each season pair by the harmonic mean of the two PA. With correction, as in the second curve above, I weighted each pair by the number of PA in season one. This corrects for intra-season survivorship bias in season one as well.
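A minimal sketch of that completion step (the .250-in-150-PA example above works out to .280 over a completed 600 PA):

```python
def complete_season(obs_woba, obs_pa, projection, target_pa=600):
    """Blend observed season-two performance with an in-season projection
    so every player gets a full season-two sample; seasons at or above
    the target PA are left untouched."""
    if obs_pa >= target_pa:
        return obs_woba, obs_pa
    fill_pa = target_pa - obs_pa
    completed = (obs_woba * obs_pa + projection * fill_pa) / target_pa
    return completed, target_pa

woba, pa = complete_season(0.250, 150, 0.290)
print(round(woba, 3), pa)  # 0.28 600
```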

1977-2016 Aging Curve using Delta Method and Correcting for Survivorship Bias

[figure: curve2]

You can see that in the first curve, uncorrected for survivorship bias, players gain around 40 points in wOBA from age 21 to age 27, seven points per year, plateau from age 27 to 28, then decline by around seven points a year after that. In the second curve, after we correct for survivorship bias, we have a slightly quicker ascension from age 21 to 26, more than eight points per year, a plateau from age 26 to age 27, then a much slower decline of around 3 points per year.

Keep in mind that these curves represent all players from 1977 to 2016. It is likely that aging has changed significantly from era to era due to medical advances, PED use and the like. In fact, if we limit our data to 2003 and later, after the so-called steroid era, we get an uncorrected curve that plateaus between ages 24-28 and then declines by an average of 9 points a year from age 28 to 41.

In my next installment I’ll do some survivorship corrections for components like strikeout and walk percentage.

In Game 7 of the World Series, anyone who was watching the top of the 9th inning probably remembers Javier Baez attempting a bunt (presumably a safety squeeze) on a 3-2 count with 1 out and Jason Heyward on 3rd base. You also remember that Baez struck out on a foul ball, much to the consternation of Cubs fans.

There was plenty of noise on social media criticizing Maddon (or Baez, if he did that on his own) for such an unusual play (you rarely see position players bunt on 2-strike counts, let alone with a 3-2 count and let alone with a runner on 3rd) and of course because it failed and eventually led to a scoreless inning. I was among those screaming bloody murder on Twitter and continuing my long-running criticism of Maddon’s dubious (in my opinion) post-season in-game tactics dating back to his Tampa days. I did, however, point out that I didn’t know off the top of my head (and it was anything but obvious or trivial to figure out) what the “numbers” were but that I was pretty sure it was a bad strategy.

Some prima facie evidence that it might be a bad play, as I also tweeted, was: “When have you ever seen a play like that in a baseball game?” That doesn’t automatically mean that it’s a bad play, but it is evidence nonetheless. And the fact that it was a critical post-season game meant nothing. If it was correct to do it in that game, it would be correct to do it in any game – at least in the late innings of a tie or 1-run game.

Anyway, I decided to look at some numbers although it’s not an easy task to ascertain whether in fact this was a good, bad, or roughly neutral (or we just don’t know) play. I turned to Retrosheet as I often do, and looked at what happens when a generic batter (who isn’t walked, which probably eliminates lots of good batters) does not bunt (which is almost all of the time of course) on a 3-2 count with 1 out, runner on third base and no runner on first, in a tie game or one in which the batting team was ahead, in the late innings, when the infield would likely be playing in to prevent a run from scoring on a ground ball. This is what I found:

The runner scores around 28% of the time overall. There were 33% walks (pitchers are presumably pitching around the batter a bit in this situation), 25% strikeouts, and 25% BIP outs. When the ball is put in play, which occurs 42% of the time, the runner scores 63% of the time.

Now let’s look at what happens when a pitcher simply bunts the ball on a 3-2 count in a sacrifice situation. We’ll use that as a proxy for what Baez might do when trying to bunt in this situation. Pitchers are decent bunters overall (although they don’t run well on a bunt) and Baez is probably an average bunter at best for a position player. In fact, Baez has a grand total of one sacrifice hit in his entire minor and major league career, so he may be a poor bunter – but to give him and Maddon the benefit of the doubt, we’ll assume that he is as good at bunting as your typical NL pitcher.

On a 3-2 count in a sac situation when the pitcher is still bunting, he strikes out 40% of the time and walks 22% of the time. Compare that to the hitter who swings away at 3-2 with a runner on 3rd and 1 out, who K’s 25% of the time and walks 33% of the time. Of those 40% strikeouts, many are bunt fouls. In fact, pitchers strike out on a foul bunt with a 3-2 count 25% of the time. The rest, 15%, are called strikes and missed bunt attempts. It’s very easy to strike out on a foul bunt when you have two strikes, even when there are 3 balls (and you can take some close pitches).

How often does the run score on a 3-2 bunt attempt with a runner on 3rd, as in the Baez situation? From that data we can’t tell, because we’re only looking at 3-2 bunts from pitchers with no runner on 3rd, so we have to make some inferences.

The pitcher puts the ball in play 36% of the time when bunting on a 3-2 count. How often would a runner score if there were a runner on 3rd? We’ll have to make some more inferences. In situations where a batter attempts a squeeze (either a suicide or safety – for the most part, we can’t tell from the Retrosheet data), the runner scores 80% of the time when the ball is bunted in play. So let’s assume the same for our pitchers/Baez. So 36% of the time the ball is put in play on a 3-2 bunt, and 80% of that time the run scores. That’s a score rate of 29% – around the same as when swinging away.
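Chaining those two inferences together:

```python
# P(run scores on a 3-2 bunt attempt)
#   = P(ball bunted in play) * P(run scores | bunt in play),
# using pitcher 3-2 bunt data and the squeeze-play score rate as proxies.
p_bunt_in_play = 0.36          # pitchers bunting on a 3-2 count
p_score_given_bip = 0.80       # squeeze plays, ball bunted in play
p_score_bunt = p_bunt_in_play * p_score_given_bip
print(round(p_score_bunt, 2))  # ~0.29, vs. ~0.28 when swinging away
```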

So swinging away, the run scores 28% of the time. With a bunt attempt the run scores 29% of the time, so it would appear to be a tie with no particular strategy a clear winner. But wait….

When the run doesn’t score, the batter who is swinging away at 3-2 walks 33% of the time, while the pitcher who is attempting a bunt on a 3-2 pitch walks only 25% of the time. But we won’t count that as an advantage for the batter swinging away. The BB difference is likely due to the fact that pitchers are pitching around batters in that situation and going right after pitchers on 3-2 counts in sacrifice situations. In a situation like Baez’s, the pitcher is going to issue more than 25% walks, since he doesn’t mind the free pass and he is not going to groove one. So we’ll ignore the difference in walks. But wait again….

When a run scores on a squeeze play, the batter is out 72% of the time and ends up safe, mostly on first, 28% of the time (a single, error, or fielder’s choice). When a run scores with a batter swinging away on a 3-2 count, the batter is out only 36% of the time; 21% are singles and errors and 15% are extra-base hits, including 10% triples and 5% HR.

So even though the run scores around the same percentage of the time whether bunting or hitting away on a 3-2 count, the batter is safe (including walks, hits, errors, and fielder’s choices) only 26% of the time when bunting and 50% when swinging away. Additionally, when the batter swinging away gets a hit, 20% are triples and 6% are HR. So when the run does score, the batter who is swinging away reaches base safely (with some extra-base hits, including HR) more than twice as often as the batter who is bunting.

I’m going to say that the conclusion is that while the bunt attempt was probably not a terrible play, it was still the wrong strategy given that it was the top of the inning. The runner from third will probably score around the same percentage of the time whether Baez is bunting or swinging away, but when the run does score, Baez is going to be safe a much higher percentage of the time, including via the double, triple or HR, leading to an additional run scoring significantly more often than with the squeeze attempt.

I’m not giving a pass to Maddon on this one. That would be true regardless of whether the bunt worked or not – of course.

Addendum: A quick estimate is that an additional run (or more) will score around 12% more often when swinging away. An extra run in the top of the 9th, going from a 1-run lead to a 2-run lead, increases a team’s chances of winning by 10% (after that, every additional run is worth half the value of the preceding run). So we get an extra 1.2% (10% times 12%) in win expectancy from swinging away rather than bunting, via the extra hits that occur when the ball is put into play.
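The addendum's back-of-envelope arithmetic, written out:

```python
p_extra_run = 0.12  # an additional run scores ~12% more often swinging away
we_per_run = 0.10   # WE gain from a 1-run lead becoming a 2-run lead, top 9th
we_gain = p_extra_run * we_per_run
print(round(we_gain, 3))  # 0.012, i.e. ~1.2% of win expectancy
```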

 

 

Let me explain game theory wrt sac bunting using tonight’s CLE game as an example. Bottom of the 10th, leadoff batter on first, Gimenez is up. He is a very weak batter with little power or on-base skills, and the announcers say, “You would expect him to be bunting.” He clearly is.

Now, in general, to determine whether to bunt or not, you estimate the win expectancies (WE) based on the frequencies of the various outcomes of the bunt, versus the frequencies of the various outcomes of swinging away. Since, for a position player, those two final numbers are usually close, even in late tied-game situations, the correct decision usually hinges on: On the swing side, whether the batter is a good hitter or not, and his expected GDP rate. On the bunt side, how good of a sac bunter is he and how fast is he (which affect the single and ROE frequencies, which are an important part of the bunt WE)?

Gimenez is a terrible hitter which favors the bunt attempt but he is also not a good bunter and slow which favors hitting away. So the WE’s are probably somewhat close.

One thing that affects the WE for both bunting and swinging, of course, is where the third baseman plays before the pitch is thrown. Now, in this game, it was obvious that Gimenez was bunting all the way and everyone seemed fine with that. I think the announcers and probably everyone would have been shocked if he didn’t (we’ll ignore the count completely for this discussion – the decision to bunt or not clearly can change with it).

The announcers also said, “Sano is playing pretty far back for a bunt.” He was playing just on the dirt I think, which is pretty much “in between when expecting a bunt.” So it did seem like he was not playing up enough.

So what happens if he moves up a little? Maybe now it is correct to NOT bunt because the more he plays in, the lower the WE for a bunt and the higher the WE for hitting away! So maybe he shouldn’t play up more (the assumption is that if he is bunting, then the closer he plays, the better). Maybe then the batter will hit away and correctly so, which is now better for the offense than bunting with the third baseman playing only half way. Or maybe if he plays up more, the bunt is still correct but less so than with him playing back, in which case he SHOULD play up more.

So what is supposed to happen? Where is the third baseman supposed to play and what does the batter do? There is one answer and one answer only. How many managers and coaches do you think know the answer (they should)?

The third baseman is supposed to play all the way back “for starters” in his own mind, such that it is clearly correct for the batter to bunt. Now he knows he should play in a little more. So in his mind again, he plays up just a tad bit.

Now is it still correct for the batter to bunt? IOW, is the bunt WE higher than the swing WE given where the third baseman is playing? If it is, of course he is supposed to move up just a little more (in his head).

When does he stop? He stops of course when the WE from bunting is exactly the same as the WE from swinging. Where that is completely depends on those things I talked about before, like the hitting and bunting prowess of the batter, his speed, and even the pitcher himself.

What if he keeps moving up in his mind and the WE from bunting is always higher than hitting, like with most pitchers at the plate with no outs? Then the 3B simply plays in as far as he can, assuming that the batter is bunting 100%.
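The "move up until the win expectancies are equal" reasoning above can be sketched as a simple fixed-point search. The two WE functions here are hypothetical stand-ins with invented numbers, not real estimates; they are chosen only so that bunting gets relatively worse as the third baseman plays in:

```python
def we_bunt(depth):
    # depth: 0.0 = all the way in, 1.0 = all the way back
    return 0.42 + 0.10 * depth  # hypothetical: bunt WE rises as 3B plays back

def we_swing(depth):
    return 0.50 - 0.06 * depth  # hypothetical: swing WE falls as 3B plays back

def optimal_depth(lo=0.0, hi=1.0, iters=60):
    """Bisect for the depth where the two WEs are equal; if no interior
    equilibrium exists, clip to all the way in or all the way back."""
    if we_bunt(hi) <= we_swing(hi):   # swinging always >= bunting: play back
        return hi
    if we_bunt(lo) >= we_swing(lo):   # bunting always >= swinging: play in
        return lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if we_bunt(mid) > we_swing(mid):
            hi = mid                  # bunt still better: move in more
        else:
            lo = mid
    return (lo + hi) / 2

d = optimal_depth()
print(round(d, 3), round(we_bunt(d) - we_swing(d), 6))  # equal WEs at equilibrium
```

With these stand-in functions the equilibrium lands at an interior depth; making the batter bad enough at hitting (or good enough at bunting) pushes the answer to the all-the-way-in clip, matching the exceptions discussed below.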

So in our example, if Sano is indeed playing at the correct depth which maybe he was and maybe he wasn’t, then the WE from bunting and hitting must be exactly the same, in which case, what does the batter do? It doesn’t matter, obviously! He can do whatever he wants, as long as the 3B is playing correctly.

So in a bunt situation like this, assuming that the 3B (and other fielders if applicable) is playing reasonably correctly, it NEVER matters what the batter does. That should be the case in every single potential sac bunt situation you see in a baseball game. It NEVER matters what the batter does. Either bunting or not are equally “correct.” They result in exactly the same WE.

The only exceptions (which do occur) are when the WE from bunting is always higher than swinging when the 3B is playing all the way up (a poor hitter and/or exceptional bunter) OR the WE from swinging is always higher even when the 3B is playing completely back (a good or great hitter and/or poor bunter).

So unless you see the 3B playing all the way in or all the way back and they are playing reasonably optimally it NEVER matters what the batter does. Bunt or not bunt and the win expectancy is exactly the same! And if the 3rd baseman plays all the way in or all the way back and is playing optimally, then it is always correct for the batter to bunt or not bunt 100% of the time.

I won’t go into this too much because the post assumed that the defense was playing optimally, i.e., it was in a “Nash equilibrium” (as I explained, playing at a depth such that the WE for bunting and swinging are exactly equal), or it was correctly playing all the way in (the WE for bunting is still greater than or equal to that of swinging) or all the way back (the WE for swinging is >= that of bunting). But if the defense is NOT playing optimally, then the batter MUST bunt or swing away 100% of the time.

This is critical, and amazingly there is not ONE manager or coach in MLB who understands it and thus employs a correct bunt strategy or bunt defense.

Recently there has been some discussion about the use of WAR in determining or at least discussing an MVP candidate for position players (pitchers are eligible too for MVP, obviously, and WAR includes defense and base running, but I am restricting my argument to position players and offensive WAR). Judging from the comments and questions coming my way, many people don’t understand exactly what WAR measures, how it is constructed, and what it can or should be used for.

In a nutshell, offensive WAR takes each of a player’s offensive events in a vacuum, without regard to the timing and context of the event or whether that event actually produced or contributed to any runs or wins, and assigns a run value to it, based on the theoretical run value of that event (linear weights), adds up all the run values, converts them to theoretical “wins” by dividing by some number around 10, and then subtracts the approximate runs/wins that a replacement player would have in that many PA. A replacement player produces around 20 runs less than average for every 650 PA, by definition. This can vary a little by defensive position and by era. And of course a replacement player is defined as the talent/value of a player who can be signed for the league minimum even if he is not protected (a so-called “freely available player”).

For example, let’s say that a player had 20 singles, 5 doubles, 1 triple, 4 HR, 10 non-intentional BB+HP, and 60 outs in 100 PA. The approximate run values for these events are .47, .78, 1.04, 1.40, .31, and -.30. These values are marginal run values and by definition are above or below a league average position player. So, for example, if a player steps up to the plate and gets a single, on the average he will generate .47 more runs than 1 generic PA of a league average player. These run values and the zero run value of a PA for a league average player assume the player bats in a random slot in the lineup, on a league average team, in a league average park, against a league-average opponent, etc.

If you were to add up all those run values for our hypothetical player, you would get +5 runs. That means that theoretically this player would produce 5 more runs than a league-average player on a league average team, etc. A replacement player would generate around 3 fewer runs than a league average player in 100 PA (remember I said that replacement level was around -20 runs per 650 PA), so our hypothetical player is 8 runs above replacement in those 100 PA.
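The arithmetic above can be sketched in a few lines. This is a minimal illustration of the linear-weights method, not any site's actual WAR implementation; I use -0.30 as the out value here so the totals land near the +5 figure, and the events, run values, and runs-to-wins divisor of 10 are the approximate ones from the text.

```python
# Sketch of the offensive-WAR arithmetic: sum context-free run values,
# then add back what a replacement player (-20 runs per 650 PA) would
# have produced in the same number of PA. Approximate values only.

EVENT_RUN_VALUES = {
    "1B": 0.47, "2B": 0.78, "3B": 1.04, "HR": 1.40,
    "BB+HP": 0.31, "OUT": -0.30,   # out value ~ -.30 so totals match +5
}

def runs_above_average(events):
    """Sum marginal run values over a player's events (context ignored)."""
    return sum(EVENT_RUN_VALUES[e] * n for e, n in events.items())

def runs_above_replacement(events, pa, repl_per_650=-20.0):
    """Replacement level is ~20 runs below average per 650 PA."""
    return runs_above_average(events) - repl_per_650 * pa / 650.0

player = {"1B": 20, "2B": 5, "3B": 1, "HR": 4, "BB+HP": 10, "OUT": 60}
raa = runs_above_average(player)           # ~ +5 runs above average
rar = runs_above_replacement(player, 100)  # ~ +8 runs above replacement
wins = rar / 10.0                          # crude runs-to-wins conversion
```

Running this gives roughly +5 runs above average and +8 above replacement for the hypothetical player, matching the totals in the text.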

The key here is that these are hypothetical runs. If that player produced those offensive events while in a league average context an infinite number of times he would produce exactly 5 runs more than an average player would produce in 100 PA and his team would win around .5 more games (per 100 PA) than an average player and .8 more games (and 8 runs) than a replacement player.

In reality, for those 100 PA, we have no idea how many runs or wins our player contributed to. On the average, or after an infinite number of 100 PA trials, his results would have produced an extra 5 runs and 1/2 win, but in one 100 PA trial, that exact result is unlikely, just like in 100 flips of a coin, exactly 50 heads and 50 tails is an unlikely though “mean” or “average” event. Perhaps 15 of those 20 singles didn’t result in a single run being produced. Perhaps all 4 of his HR were hit after his team was down by 5 or 10 runs and they were meaningless. On the other hand, maybe 10 of those hits were game winning hits in the 9th inning. Similarly, of those 60 outs, what if 10 times there was a runner on third and 0 or 1 out, and our player struck out every single time? Alternatively, what if he drove in the runner 8 out of 10 times with an out, and half the time that run amounted to the game winning run? WAR would value those 10 outs exactly the same in either case.

You see where I’m going here? Context is ignored in WAR (for a good reason, which I’ll get to in a minute), yet context is everything in an MVP discussion. Let me repeat that: Context is everything in an MVP discussion. An MVP is about the “hero” nature of a player’s seasonal performance. How much did he contribute to his team’s wins and to a lesser extent, what did those wins mean or produce (hence, the “must be on a contending team” argument). Few rational people are going to consider a player MVP-quality if little of his performance contributed to runs and wins no matter how “good” that performance was in a vacuum. No one is going to remember a 4 walk game when a team loses in a 10-1 blowout. 25 HR with most of them occurring in losing games, likely through no fault of the player? Ho-hum. 20 HR, where 10 of them were in the latter stages of a close game and directly led to 8 wins? Now we’re talking possible MVP! .250 wOBA in clutch situations but .350 overall? Choker and bum, hardly an MVP.

I hope you are getting the picture. While there are probably several reasonable ways to define an MVP and reasonable and smart people can legitimately debate about whether it is Trout, Miggy, Kershaw or Goldy, I think that most reasonable people will agree that an MVP has to have had some – no a lot – of articulable performance contributing to actual, real-life runs and wins, otherwise that “empty WAR” is merely a tree falling in the forest with no one to hear it.

So what is WAR good for and why was it “invented?” Mostly it was invented as a way to combine all aspects of a player’s performance – offense, defense, base running, etc. – on a common scale. It was also invented to be able to estimate player talent and to project future performance. For that it is nearly perfect. The reason it ignores context is because we know that context is not part of a player’s skill set to any significant degree. Which also means that context-non-neutral performance is not predictive – if we want to project future performance, we need a metric that strips out context – hence WAR.

But, for MVP discussions? It is a terrible metric for the aforementioned reasons. Again, regardless of how you define MVP caliber performance, almost everyone is in agreement that it includes and needs context, precisely that which WAR disdains and ignores. Now, obviously WAR will correlate very highly with non-context-neutral performance. That goes without saying. It would be unlikely that a player who is a legitimate MVP candidate does not have a high WAR. It would be equally unlikely that a player with a high WAR did not specifically contribute to lots of runs and wins and to his team’s success in general. But that doesn’t mean that WAR is a good metric to use for MVP considerations. Batting average correlates well with overall offensive performance and pitcher wins correlate well with good pitching performance, but we would hardly use those two stats to determine who was the better overall batter or pitcher. And to say, for example, that Trout is the proper MVP and not Cabrera because Trout was 1 or 2 WAR better than Miggy, without looking at context, is an absurd and disingenuous argument.

So, is there a good or at least a better metric than WAR for MVP discussions? I don’t know. WPA perhaps. WPA in winning games only? WPA with more weight for winning games? RE27? RE27, again, adjusted for whether the team won or lost or scored a run or not? What is really important is not which metric you use for these discussions but why you use it. It is not so much that WAR is a poor metric for determining an MVP; the mistake is using WAR without understanding what it means and why, in and of itself, it is a poor choice for an MVP discussion. As long as you understand what each metric means (including traditional mundane ones like RBI, runs, etc.) and how it relates to the player in question and the team’s success, feel free to use whatever you like (hopefully a combination of metrics and statistics) – just make sure you can justify your position in a rational, logical, and accurate fashion.


In The Book: Playing the Percentages in Baseball, we found that when a batter pinch hits against right-handed relief pitchers (so there are no familiarity or platoon issues), his wOBA is 34 points (10%) worse than when he starts and bats against relievers, after adjusting for the quality of the pitchers in each pool (PH or starter). We called this the pinch hitting penalty.

We postulated that the reason for this was that a player coming off the bench in the middle or towards the end of a game is not as physically or mentally prepared to hit as a starter who has been hitting and playing the field for two or three hours. In addition, some of these pinch hitters are not starting because they are tired or slightly injured.

We also found no evidence that there is a “pinch hitting skill.” In other words, there is no such thing as a “good pinch hitter.” If a hitter has had exceptionally good (or bad) pinch hitting stats, it is likely that that was due to chance alone, and thus it has no predictive value. The best predictor of a batter’s pinch-hitting performance is his regular projection with the appropriate penalty added.

We found a similar situation with designated hitters. However, their penalty was around half that of a pinch hitter, or 17 points (5%) of wOBA. Similar to the pinch hitter, the most likely explanation for this is that the DH is not as physically (and perhaps mentally) prepared for each PA as a player who is constantly engaged in the game. As well, the DH may be slightly injured or tired, especially if he is normally a position player. It makes sense that the DH penalty would be less than the PH penalty, as the DH is more involved in a game than a PH. Pinch hitting is often considered “the hardest job in baseball.” The numbers suggest that that is true. Interestingly, we found a small “DH skill” such that different players seem to have more or less of a true DH penalty.

Andy Dolphin (one of the authors of The Book) revisited the PH penalty issue in this Baseball Prospectus article from 2006. In it, he found a PH penalty of 21 points in wOBA, or 6%, significantly less than what was presented in The Book (34 points).

Tom Thress, on his web site, reports a PH penalty of .009 in “player won-loss records” (offensive performance translated into a “w/l record”), which he says is similar to that found in The Book (34 points). However, he finds an even larger DH penalty of .011 wins, more than twice the 17 points we presented in The Book. Presumably, .011 wins corresponds to slightly more than 34 points of wOBA.

So, everyone seems to be in agreement that there is a significant PH and DH penalty, however, there is some disagreement as to the magnitude of each (with empirical data, we can never be sure anyway). I am going to revisit this issue by looking at data from 1998 to 2012. The method I am going to use is the “delta method,” which is common when doing this kind of “either/or” research with many player seasons in which the number of opportunities (in this case, PA) in each “bucket” can vary greatly for each player (for example, a player may have 300 PA in the “either” bucket and only 3 PA in the “or” bucket) and from player to player.

The “delta method” looks something like this: Let’s say that we have 4 players (or player seasons) in our sample, and each player has a certain wOBA and number of PA in bucket A and in bucket B, say, DH and non-DH – the number of PA are in parentheses.

Player      wOBA as DH    wOBA as Non-DH
Player 1    .320 (150)    .330 (350)
Player 2    .350 (300)    .355 (20)
Player 3    .310 (350)    .325 (50)
Player 4    .335 (100)    .350 (150)

In order to compute the DH penalty (difference between when DH’ing and playing the field) using the “delta method,” we compute the difference for each player separately and take a weighted average of the differences, using the lesser of the two PA (or the harmonic mean) as the weight for each player. In the above example, we have:

((.330 – .320) * 150 + (.355 – .350) * 20 + (.325 – .310) * 50 + (.350 – .335) * 100) / (150 + 20 + 50 + 100)

If you didn’t follow that, that’s fine. You’ll just have to trust me that this is a good way to figure the “average difference” when you have a bunch of different player seasons, each with a different number of opportunities (e.g. PA) in each bucket.
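For those who want to see it spelled out, the calculation above amounts to a few lines of code. This is a minimal sketch of the delta method as described here (weighting by the lesser of the two PA; one could substitute the harmonic mean), using the four hypothetical players from the table.

```python
# Minimal "delta method": weight each player's bucket-to-bucket wOBA
# difference by the lesser of his two PA totals, then divide by the
# sum of the weights.

def delta_method(players):
    """players: list of (woba_a, pa_a, woba_b, pa_b) tuples.
    Returns the weighted average of (woba_a - woba_b)."""
    num = 0.0
    den = 0.0
    for woba_a, pa_a, woba_b, pa_b in players:
        w = min(pa_a, pa_b)          # the lesser of the two PA
        num += (woba_a - woba_b) * w
        den += w
    return num / den

# The four hypothetical players above, as (non-DH wOBA, PA, DH wOBA, PA):
sample = [
    (0.330, 350, 0.320, 150),
    (0.355, 20,  0.350, 300),
    (0.325, 50,  0.310, 350),
    (0.350, 150, 0.335, 100),
]
penalty = delta_method(sample)   # ~ .012, i.e. about a 12-point penalty
```

Plugging in the table's numbers, the weighted average difference works out to about 12 points of wOBA, which is what the explicit formula above produces.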

In addition to figuring the PH and DH penalties (in various scenarios, as you will see), I am also going to look at some other interesting “penalty situations” like playing in a day game after a night game, or both games of a double header.

In my calculations, I adjust for the quality of the pitchers faced, the percentage of home and road PA, and the platoon advantage between the batter and pitcher. If I don’t do that, it is possible for one bucket to be inherently more hitter-friendly than the other bucket, either by chance alone or due to some selection bias, or both.

First let’s look at the DH penalty. Remember that in The Book, we found a roughly 17-point penalty, and Tom Thress found a penalty that was greater than that of a PH, presumably more than 34 points in wOBA.

Again, my data was from 1998 to 2012, and I excluded all inter-league games. I split the DH samples into two groups: One group had more DH PA than non-DH PA in each season (they were primarily DH’s), and vice versa in the other group (primarily position players).

The DH penalty was the same in both groups – 14 points in wOBA.

The total sample sizes were 10,222 PA for the primarily DH group and 32,797 for the mostly non-DH group. If we combine the two groups, we get a total of 43,019 PA. That number represents the total of the “lesser of the PA” for each player season. One standard deviation in wOBA for that many PA is around 2.5 wOBA points. For the difference between two groups of 43,000 each, it is 3.5 points (the square root of the sum of the variances). So we can say with 95% confidence that the true DH penalty is between 7 and 21 points with the most likely value being 14. This is very close to the 17 point value we presented in The Book.
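The interval arithmetic above is simple enough to verify directly. This sketch just re-derives the numbers in the paragraph: each group's wOBA has an SD of about 2.5 points at ~43,000 PA, the SD of the difference is the square root of the sum of the two variances, and the 95% interval is roughly two SDs on either side of the observed 14-point penalty.

```python
import math

# SD of the difference between two sample means is the square root of
# the sum of their variances (here both groups have SD ~2.5 points).
sd_group = 0.0025                          # ~2.5 wOBA points per group
sd_diff = math.sqrt(2 * sd_group ** 2)     # ~3.5 points

observed = 0.0140                          # observed DH penalty (14 pts)
low = observed - 2 * sd_diff               # ~ .007  (7 points)
high = observed + 2 * sd_diff              # ~ .021  (21 points)
```

This reproduces the "true DH penalty is between 7 and 21 points" statement, with 14 as the point estimate.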

I expected that the penalty would be greater for position players who occasionally DH’d rather than DH’s who occasionally played in the field. That turned out not to be the case, but given the relatively small sample sizes, the true values could very well be different.

Now let’s move on to pinch hitter penalties. I split those into two groups as well: One, against starting pitchers and the other versus relievers. We would expect the former to show a greater penalty since a “double whammy” would be in effect – first, the “first time through the order” penalty, and second, the “sitting on the bench” penalty. In the reliever group, we would only have the “coming in cold” penalty. I excluded all ninth innings or later.

Versus starting pitchers only, the PH penalty was 19.5 points in 8,523 PA. One SD is 7.9 points, so the 95% confidence interval is a 4 to 35 point penalty.

Versus relievers only, the PH penalty was 12.8 points in 17,634 PA. One SD is 5.5 points – the 95% confidence interval is a 2 to 24 point penalty.

As expected, the penalty versus relievers, where batters typically only face the pitcher for the first and only time in the game, whether they are in the starting lineup or are pinch hitting, is less than that versus the starting pitcher, by around 7 points. Again, keep in mind that the sample sizes are small enough such that the true difference between the starter PH penalty and reliever PH penalty could be the same or could even be reversed. Of course, our prior when applying a Bayesian scheme is that there is a strong likelihood that the true penalty is larger against starting pitchers for the reason explained above. So it is likely that the true difference is similar to the one observed (a 7-point greater penalty versus starters).

Notice that my numbers indicate penalties of a similar magnitude for pinch hitters and designated hitters. The PH penalty is a little higher than the DH penalty when pinch hitters face a starter, and a little lower than the DH penalty when they face a reliever. I expected the PH penalty to be greater than the DH penalty, as we found in The Book. Again, these numbers are based on relatively small sample sizes, so the true PH and DH penalties could be quite different.

Role                Penalty (wOBA)
DH                  14 points
PH vs. Starters     20 points
PH vs. Relievers    13 points

Now let’s look at some other potential “penalty” situations, such as the second game of a double-header and a day game following a night game.

In a day game following a night game, batters hit 6.2 wOBA points worse than in day games after day games or day games after not playing at all the previous day. The sample size was 95,789 PA. The 95% certainty interval is 1.5 to 11 points.

What about when a player plays both ends of a double-header (no PH or designated hitters)? Obviously many regulars sit out one game or the other – certainly the catchers.

Batters in the second game of a twin bill lose 8.1 points of wOBA compared to all other games. Unfortunately, the sample is only 9,055 PA, so the 2 SD interval is -7.5 to 23.5. If 8.1 wOBA points (or more) is indeed reflective of the true double-header penalty, it would be wise for teams to sit some of their regulars in one of the two games – which they do of course. It would also behoove teams to make sure that their two starters in a twin bill pitch with the same hand in order to discourage fortuitous platooning by the opposing team.

Finally, I looked at games in which a player and his team (in order to exclude times when the player sat because he wasn’t 100% healthy) did not play the previous day, versus games in which the player had played at least 8 days in a row. I am looking for a “consecutive-game fatigue” penalty and those are the two extremes. I excluded all games in April and all pinch-hitting appearances.

The “penalty” for playing at least 8 days in a row is 4.0 wOBA points in 92,287 PA. One SD is 2.4 so that is not a statistically significant difference. However, with a Bayesian prior such that we expect there to be a “consecutive-game fatigue” penalty, I think we can be fairly confident with the empirical results (although obviously there is not much certainty as to the magnitude).
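The Bayesian point here can be made concrete with a toy calculation (all numbers illustrative, not a real model). Suppose the two hypotheses are "no fatigue effect" and "a true 4-point fatigue penalty," and we observe a 4.0-point penalty with a standard error of 2.4. The likelihood ratio tells us how much the data shift whatever prior odds we held, even though the result falls short of 2 SD "significance."

```python
import math

def normal_pdf(x, mu, sd):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

observed, se = 4.0, 2.4   # observed penalty (wOBA points) and its SE

# How much more likely is the observation under a true 4-point effect
# than under no effect at all?
lr = normal_pdf(observed, 4.0, se) / normal_pdf(observed, 0.0, se)
# lr ~ 4: the data multiply the prior odds of a fatigue effect by ~4,
# regardless of "statistical significance."
```

So a result that is "only" 1.7 SD from zero still moves a reasonable prior substantially toward the fatigue hypothesis, which is the sense in which we can be fairly confident in the empirical result while remaining uncertain about its magnitude.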

To see whether the consecutive day result is a “penalty” or the day off result is a bonus, I compared them to all other games.

When a player and his team have had a day off the previous day, the player hits .1 points better than otherwise, in 115,471 PA (-4.5 to +4.5). Without running the “consecutive days off” scenario, we can infer that there is an observed penalty of around 4 points for playing at least 8 days in a row, compared to all other games (the same as compared to after an off-day).

So having a day off is not really a “bonus,” but playing too many days in a row creates a penalty. It probably behooves all players to take an occasional day off. Players like Cal Ripken, Steve Garvey, and Miguel Tejada (and others) may have had substantially better careers, at least rate-wise, had they been rested more.

I also looked at players who played in fewer days in a row (5, 6, and 7) and found penalties of less than 4 points, suggesting that the more days in a row a player plays, the more his offense is penalized. It would be interesting to see if a day off after several days in a row restores a player to his normal offensive levels.

There are many other situations where batters and pitchers may suffer penalties (or bonuses), such as game(s) after coming back from the DL, getaway (where the home team leaves for another venue) games, Sunday night games, etc.

Unfortunately, I don’t have the time to run all of these potentially interesting scenarios – and I have to leave something for aspiring saberists to do!

Addendum: Tango Tiger suggested I split the DH results into “versus relievers and starters.” I did not expect there to be a difference in penalties since, unlike a PH, a DH faces the starter the same number of times as when he isn’t DH’ing. However, I found a penalty difference of 8 points – the DH penalty versus starters was 16.3 and versus relievers, it was 8.3. Maybe the DH becomes “warmer” towards the end of the game, or maybe the difference is a random, statistical blip. I don’t know. We are often faced with these conundrums (what to conclude) when dealing with limited empirical data (relatively small sample sizes). Even if we are statistically confident that an effect exists (or doesn’t), we are usually quite uncertain as to the magnitude of that effect.

I also looked at getaway (where the home team goes on the road after this game) night games. It has long been postulated that the home team does not perform as well in these games. Indeed, the home team batter penalty in these games was 1.6 wOBA points – again, not a statistically significant difference, but consistent with the Bayesian prior. Interestingly, the road team batters performed .6 points better, suggesting that home team pitchers in getaway games might have a small penalty as well.