In a recent tweet, the esteemed sabermetrician and current MLB Statcast honcho, Tom Tango, aka Tango, suggested that an increase in Statcast speed (a runner’s speed, in feet per second, on plays that require a max effort, such as running from home to first on a batted ball) from one year (year x) to the next (x+1), concomitant with an increase in offensive performance (from x to x+1), might portend an increase in expected offensive performance in the following year (x+2). He put out a call to saberists and aspiring saberists to look at the data and see if there is any evidence of such an effect. I took up the challenge.

I looked at all batters who had a recorded Statcast speed in 2016 and 2017, as well as at least 100 PA in each of those years and in 2018, and separated them into 3 groups: one, an increase in speed greater than .5 feet per second; two, a decrease in speed of at least .46 feet per second; three, all the rest. I also separated each of those groups into 3 sub-groups: an increase in context-neutral wOBA of at least 21 points, a decrease of at least 21 points, and all the rest.
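
A minimal sketch of that bucketing in Python. The thresholds are the ones described above; the function names and group labels are made up for illustration and are not part of any projection system.

```python
def speed_group(speed_change):
    """Bucket a player by his 2016-to-2017 change in Statcast speed."""
    if speed_change > 0.5:
        return "gained speed"
    if speed_change <= -0.46:
        return "lost speed"
    return "all the rest"


def woba_group(woba_change):
    """Bucket a player by his 2016-to-2017 change in context-neutral wOBA."""
    if woba_change >= 0.021:   # up at least 21 points
        return "wOBA up"
    if woba_change <= -0.021:  # down at least 21 points
        return "wOBA down"
    return "wOBA about the same"


# Each player lands in one of 3 x 3 = 9 cells.
print(speed_group(0.88), "/", woba_group(0.031))
```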

Then I looked at their 2018 performance compared to their 2018 projections. I used my own proprietary projections, which are not publicly available but are often referred to on social media and on this blog. They are probably more or less as good as any of the other credible projection systems out there, like Steamer and ZiPS. For the record, I don’t use Statcast speed data in my projection models – in fact I don’t use any Statcast data at all (such as exit velocity, launch angle, expected wOBA, etc.).

The hypothesis I suppose is that if a player sees an increase in Statcast speed and an increase in performance from 2016 to 2017, we will underestimate his expected performance in 2018. This makes sense, as any independent data which is correlated directly or indirectly with performance can and should be used to help with the accuracy of the projections. An increase or decrease in speed might suggest increased or decreased health, fitness, or injury status, which would in turn affect the integrity of our projections.

Let’s look at the data.

If we look at the players who gain speed, lose speed, and all the rest from 2016 to 2017, with no regard to how their offensive performance changed, we see this:

| 2018 PA | 16-17 Speed Change | 17-18 Speed Change | 16-17 wOBA Change | 18 Projected wOBA | 18 Actual wOBA |
| --- | --- | --- | --- | --- | --- |
| 9,374 | .88 | -.28 | .031 | .328 | .342 |
| 30,593 | -.72 | .07 | -.004 | .332 | .330 |
| 136,030 | 0.00 | -.17 | -.002 | .332 | .331 |

Lots of good stuff in the above chart! We can clearly see that increased Statcast speed is, on the average, accompanied by a substantial increase in batting performance – in fact a .88 feet-per-second increase goes along with a 31 point increase in wOBA. Presumably, either these players were suffering from some malady affecting their speed and offense in 2016 but not in 2017 or they did something in terms of health or fitness in 2017 in order to increase their speed and wOBA. And yes, we see a substantial underestimation of their predicted 2018 performance – 14 points. Interestingly, in 2018 these players lose back around a third of the speed they gained in 2017. Keep in mind that all players, on the average, lose speed from year x to x+1 because of the natural process of aging, probably starting from a very young age.

Also of interest is the fact that for all the other players, those that lose speed and those that have no change, our projections are right on the money. It is surprising that we aren’t overestimating the 2018 performance of players who lose speed. Also of note is the fact that players who lose speed from 2016 to 2017 gain back a little, even though all players are expected to lose speed from one year to another. That suggests that at least a small portion of the loss is transient, due to some injury or other health issue. We also don’t see a substantial loss in offensive performance accompanying a large loss in speed – only 4 points in wOBA, which, again, is close to normal for all players because of aging.
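
For anyone who wants to check the arithmetic, here is a small Python sketch that computes the projection miss for each of the three rows in the table above. The group labels are my reading of the rows (by the sign of the 16-17 speed change), and the helper name is hypothetical.

```python
# (group, 2018 projected wOBA, 2018 actual wOBA), copied from the table above
rows = [
    ("gained speed", 0.328, 0.342),
    ("lost speed", 0.332, 0.330),
    ("all the rest", 0.332, 0.331),
]


def miss_in_points(projected, actual):
    """Actual minus projected, in points of wOBA (1 point = .001)."""
    return round((actual - projected) * 1000)


for group, projected, actual in rows:
    # A positive number means the projection was too low.
    print(f"{group}: {miss_in_points(projected, actual):+d} points")
```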

Overall, the data suggest that losing speed may be a combination of injury and “bulking up” resulting in no net gain or loss in offensive performance, whereas a gain in speed suggests better fitness, at least relative to the previous year (which may have been injury-plagued), resulting in a substantial gain in wOBA.

What if we further break up these groups into those that did indeed gain or lose wOBA from 2016 to 2017? How does that affect the accuracy of our projections? For example, if a player gains speed but his offense doesn’t show a substantial uptick, are our projections more solid? Similarly, if a player loses speed and his wOBA substantially decreases, are our projections too pessimistic?

First let’s see how our projections do in general when players gain or lose a substantial amount of offense from 2016 to 2017. Maybe we aren’t weighting recent performance properly or otherwise handling drastic changes in performance well. Good forecasters should always be checking and calibrating these kinds of things.

| 2018 PA | 16-17 wOBA Change | 18 Projected wOBA | 18 Actual wOBA |
| --- | --- | --- | --- |
| 48,120 | .000 | .331 | .334 |
| 42,720 | .045 | .334 | .330 |
| 45,190 | -.041 | .330 | .330 |

We don’t see any tremendous biases here. Maybe a bit too much weight on a recent uptick in performance. So let’s break those “speed increase and decrease” groups into increases and decreases in wOBA and see how it affects our projections.

Players whose speed increases from 2016 to 2017

| 2018 PA | 16-17 Speed Change | 17-18 Speed Change | 16-17 wOBA Change | 18 Projected wOBA | 18 Actual wOBA |
| --- | --- | --- | --- | --- | --- |
| 3,633 | .85 | -.15 | .006 | .326 | .358 |
| 5,285 | .92 | -.38 | .054 | .332 | .337 |
| 456 | .67 | -.14 | -.035 | .309 | .292 |

So, the substantial under-projections seem to occur when a player gains speed but his wOBA remains about the same. When his speed and his offensive performance both go up, our projections don’t miss the mark by all that much. I don’t know why that is. Maybe it’s noise. We’re dealing with fairly small sample sizes here. In 4,000 PA, one standard deviation in wOBA is around 10 points. One standard deviation of the difference between a projected and an actual wOBA is even greater, as there are random and other uncertainties in both measures. Interestingly, it appears to be quite unlikely that a player can gain substantial speed while his wOBA decreases.

What about for players who lose speed?

Players whose speed decreases from 2016 to 2017

| 2018 PA | 16-17 Speed Change | 17-18 Speed Change | 16-17 wOBA Change | 18 Projected wOBA | 18 Actual wOBA |
| --- | --- | --- | --- | --- | --- |
| 12,444 | -.70 | .12 | .000 | .337 | .341 |
| 8,188 | -.67 | .05 | .040 | .333 | .327 |
| 9,961 | -.79 | .03 | -.045 | .326 | .320 |

We see a similar pattern here in reverse, although none of the differences between projected and actual are large. When a player loses speed and his offense decreases, we tend to overestimate his 2018 wOBA – our projections are too optimistic – by only around 6 points. Remember when a player gains speed and offense, we underestimate his 2018 performance by around the same amount. However, unlike with the “speed gain players,” here we see an overestimation regardless of whether the “speed losers” had an accompanying increase or decrease in offensive performance, and we don’t see a large error in our projections for players who lose speed but don’t gain or lose offense.

I think overall, we see that our initial hypothesis is likely true – namely that when players see an increase in Statcast speed from one year to the next (2016-2017 in this study), we tend to underestimate their year 3 performance (2018 in this case). Similarly, when player speed decreases from year x to x+1, we tend to over-project their year x+2 wOBA, although at a level less than half that of the “speed gainers.” Keep in mind that our samples are relatively small, so more robust work needs to be done in order to gain more certainty in our conclusions, especially when you further break down the two main groups into those that increase, decrease, and remain the same in offensive performance. As well, age might be a significant factor here, as each of the groups might be different in average age and our projection algorithm might not be handling the aging process well once the speed scores are included in the data sets.

I created a bit of controversy on Twitter a few days ago (imagine that) when I tweeted my top 10 to-date 2018 projections for the total value of position players, including batting, base running, and defense, including positional adjustments. Four of my top 10 were catchers, Posey, Flowers (WTF?), Grandal, and Barnes. How can that be? Framing, my son, framing. All of those catchers in addition to being good hitters, are excellent framers, according to Baseball Prospectus catcher framing numbers. I use their season numbers to craft a framing projection for each catcher, using a basic Marcel methodology – 4 years’ weighted and regressed toward a population mean, zero in this case.
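
To make "a basic Marcel methodology" concrete, here is a rough Python sketch of a weighted, regressed-toward-zero framing projection. The season weights (5, 4, 3, 2) and the amount of regression (200 weighted games) are placeholder assumptions chosen only for illustration; they are not the values used in the actual projections.

```python
def marcel_framing_projection(seasons, weights=(5, 4, 3, 2), regression_games=200):
    """Project framing runs per 130 games.

    seasons: list of (framing_runs, games), most recent season first.
    weights and regression_games are illustrative assumptions.
    """
    weighted_runs = sum(w * runs for w, (runs, _) in zip(weights, seasons))
    weighted_games = sum(w * games for w, (_, games) in zip(weights, seasons))
    # Regress toward a population mean of zero by adding "phantom" games of
    # exactly average (zero-run) framing to the denominator.
    runs_per_game = weighted_runs / (weighted_games + regression_games)
    return runs_per_game * 130


# Made-up catcher: +15, +12, +10, +8 framing runs over the last four seasons.
print(marcel_framing_projection([(15, 110), (12, 100), (10, 90), (8, 80)]))
```

The more regression you add, the narrower the spread of projected framing talent becomes, which is exactly the knob in question if the projections turn out to be too strong.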

When doing this, the spread of purported framing talent is quite large. Among the 30 catchers going into 2018 with the most playing time (minors and majors), the standard deviation of talent (my projection) is 7.6 runs. That’s a lot. Among the leaders in projected runs per 130 games are Barnes at +18 runs, and Grandal and Flowers at +21. Some of the poor framers include such luminaries as Anthony Recker, Ramon Cabrera, and Tomas Telis (who are these guys?) at -18, -15, and -18, respectively. Most of your everyday catchers these days are decent (or a little on the bad side, like Kurt Suzuki) or very good framers. Gone are the days when Ryan Doumit (terrible framer) was a full-timer and Jose Molina (great framer) a backup.

Anyway, the beef on Twitter was that surely framing can’t be worth so much that 4 of the top 10 all-around players in baseball are catchers. To be honest, that makes little sense to me either. If that were true, then catchers are underrepresented in baseball. In other words, there must be catchers in the minor leagues who should be in the majors, presumably because they are good framers though not necessarily good hitters or good in other arenas like throwing, blocking pitches, and calling games. If this beef is valid, then either my projection methodology for framing is too strong, i.e., not enough regression, or BP’s numbers lack some integrity.

As a good sabermetrician should be wont to do, I set out to find out the truth. Or at least find evidence supporting the truth. Here’s what I did:

I did a WOWY (without and with you – invented by the illustrious Tom Tango) to compare every catcher’s walk and strikeout rate with each pitcher they worked with to that of the same pitchers working with other catchers – the without. I did not adjust for the framing value of the other catchers. Presumably for a good framing catcher the other catchers should be slightly bad, framing-wise, and vice versa for bad-framing catchers, so that there will be a slight double counting. I did this for each projected season 2014-2017, or 4 seasons.
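
A minimal Python sketch of a WOWY comparison for walk rate. The input layout, the function name, and the choice to weight each pitcher’s “without” rate by his PA with the catcher are assumptions made for illustration; the text above does not spell out the exact weighting.

```python
from collections import defaultdict


def wowy_bb_rate(rows, catcher):
    """rows: iterable of (pitcher, catcher, pa, bb) totals.

    Returns (with_rate, without_rate) for the given catcher.
    """
    with_c = defaultdict(lambda: [0, 0])     # pitcher -> [pa, bb] with catcher
    without_c = defaultdict(lambda: [0, 0])  # pitcher -> [pa, bb] with others
    for pitcher, c, pa, bb in rows:
        bucket = with_c if c == catcher else without_c
        bucket[pitcher][0] += pa
        bucket[pitcher][1] += bb

    pa_with = bb_with = 0
    bb_without_weighted = 0.0
    for pitcher, (pa, bb) in with_c.items():
        if pitcher not in without_c or without_c[pitcher][0] == 0:
            continue  # no "without" sample for this pitcher, so skip him
        wo_pa, wo_bb = without_c[pitcher]
        pa_with += pa
        bb_with += bb
        # Weight this pitcher's "without" rate by his PA with the catcher.
        bb_without_weighted += pa * wo_bb / wo_pa
    if pa_with == 0:
        raise ValueError("catcher shares no pitchers with other catchers")
    return bb_with / pa_with, bb_without_weighted / pa_with


# Toy data, not real numbers.
toy = [
    ("P1", "C1", 100, 10), ("P1", "C2", 200, 16),
    ("P2", "C1", 50, 3), ("P2", "C3", 100, 8),
    ("P3", "C1", 30, 3),  # P3 never threw to anyone else, so he is ignored
]
print(wowy_bb_rate(toy, "C1"))
```

The same function works for strikeouts, hits, or outs by swapping in a different count for `bb`.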

I split the projected catchers into 3 groups, Group I were projected at greater than 10 runs per 150 games (8.67 per 130), Group II at less than -10 runs, and Group III, all the rest. Here is the data for 2014-2017 combined. Remember I am using, for example, 2017 pre-season projections, and then comparing that to a WOWY for that same year.

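| Total PA | Mean Proj per 130 g | W/ BB rate | WO/ BB rate | Diff | W/ SO rate | WO/ SO rate | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 74,221 | -12.6 | .082 | .077 | .005 | .197 | .206 | -.009 |
| 107,535 | +13.3 | .073 | .078 | -.005 | .215 | .212 | .003 |
| 227,842 | -.2 | .078 | .078 | 0 | .213 | .212 | .001 |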

We can clearly see that we’re on the right track. The catchers projected to be bad framers had more BB and fewer SO than average and the good framers had more SO and fewer BB. That shouldn’t be surprising. The question is how accurate are our projections in terms of runs. To answer that, we need to convert those BB and SO rates into runs. There are around 38 PA per game, so for 130 games, we have 4,940 PA. Let’s turn those rate differences into runs per 130 games by multiplying them by 4,940 and then by .57 runs which is the value of a walk plus an out, which assumes that every other component stays the same, other than outs. My presumption is that an out is turned into a walk or a walk is turned into an out. A walk as compared to a neutral PA is worth around .31 runs and an out around .26 runs.
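
The conversion just described, written out as a short Python sketch. The constants are the ones in the paragraph above; the names are hypothetical.

```python
PA_PER_GAME = 38
GAMES = 130
RUNS_WALK_VS_OUT = 0.57  # .31 for the walk plus .26 for the out


def bb_diff_to_runs(bb_rate_with, bb_rate_without):
    """Runs per 130 games implied by a difference in walk rate."""
    pa = PA_PER_GAME * GAMES  # 4,940 PA
    return (bb_rate_with - bb_rate_without) * pa * RUNS_WALK_VS_OUT


# Bad framers: .082 with, .077 without. Good framers: .073 with, .078 without.
print(bb_diff_to_runs(0.082, 0.077))
print(bb_diff_to_runs(0.073, 0.078))
```

Both groups come out at roughly 14 runs per 130 games, in opposite directions.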

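| Total PA | Mean Proj per 130 g | W/ BB rate | WO/ BB rate | Diff in runs/130 | W/ SO rate | WO/ SO rate | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 74,221 | -12.6 | .082 | .077 | +14.0 | .197 | .206 | -.009 |
| 107,535 | +13.3 | .073 | .078 | -14.0 | .215 | .212 | .003 |
| 227,842 | -.2 | .078 | .078 | 0 | .213 | .212 | .001 |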

Let’s make sure that my presumption is correct before we get too excited with those numbers. Namely that an out really is turning into a walk and vice versa due to framing. Changes in strikeout rate are mostly irrelevant in terms of translating into runs, assuming that the only other changes are in outs and walks (strikeouts are worth about the same as a batted ball out).

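| Total PA | Mean Proj | W/ HR | WO/ HR | Diff | W/ Hits | WO/ Hits | Diff | W/ Outs | WO/ Outs | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 74,221 | -12.6 | .028 | .028 | 0 | .204 | .203 | .001 | .675 | .681 | -.006 |
| 107,535 | +13.3 | .029 | .029 | 0 | .200 | .198 | .002 | .689 | .685 | .004 |
| 227,842 | -.2 | .029 | .029 | 0 | .199 | .200 | -.001 | .685 | .683 | .002 |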

So, HR is not affected at all. Interestingly, both good and bad framers give up slightly more non-HR hits. This is likely just noise. As I presumed, the bad framers are not only allowing more walks and fewer strikeouts, but they’re also allowing fewer outs. The good framers are producing more outs. So this does in fact suggest that the walks are being converted into outs, strikeouts and/or batted ball outs and vice versa.

If we chalk up the difference in hits between the with and the without to noise (if you want to include that, that’s fine – both the good and bad framers lose a little, the good framers losing more), we’re left with outs and walks. Let’s translate each one into runs separately using .31 runs for the walks and .26 runs for the outs. Those are the run values compared to a neutral PA.

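| Total PA | Mean Proj per 130 g | W/ BB rate | WO/ BB rate | BB diff in runs/130 | W/ Outs | WO/ Outs | Outs diff in runs/130 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 74,221 | -12.6 | .082 | .077 | +7.7 | .675 | .681 | +7.7 |
| 107,535 | +13.3 | .073 | .078 | -7.7 | .689 | .685 | -5.1 |
| 227,842 | -.2 | .078 | .078 | 0 | .685 | .683 | -2.6 |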

So our bad framers are allowing 15.4 more runs per 130 games than the average catcher, or at least than the other catchers who worked with the same pitchers, in terms of fewer outs and more BB. The good framers are allowing 12.8 fewer runs per 130 games. Compare that to our projections, and I think we’re in the same ballpark.
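
Here is the same bookkeeping as a Python sketch, valuing walks and outs separately with the .31 and .26 run values from the text. The function name and sign convention (positive means more runs allowed) are my own choices for the example.

```python
PA = 38 * 130          # 4,940 PA in 130 games
RUNS_PER_WALK = 0.31   # a walk versus a neutral PA
RUNS_PER_OUT = 0.26    # an out versus a neutral PA


def framing_runs(bb_with, bb_without, outs_with, outs_without):
    """Runs allowed per 130 games relative to the same pitchers' other
    catchers. Positive means more runs allowed."""
    walk_runs = (bb_with - bb_without) * PA * RUNS_PER_WALK
    # Fewer outs means more runs allowed, hence the minus sign.
    out_runs = -(outs_with - outs_without) * PA * RUNS_PER_OUT
    return walk_runs + out_runs


bad = framing_runs(0.082, 0.077, 0.675, 0.681)
good = framing_runs(0.073, 0.078, 0.689, 0.685)
print(round(bad, 1), round(good, 1))
```

These land within rounding of the 15.4 and 12.8 runs quoted above, which were built from the already-rounded table entries.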

It appears from this data that we have pretty strong evidence that framing is worth a lot and our four catchers should be in the top 10 players in all of baseball.

Let’s face it. Most of you just can’t process the notion that a pitcher who’s had 10 or 15 starts at mid-season can have an ERA of 5+ and still be expected to pitch well for the remainder of the season. Maybe, if they’re a Kershaw or Verlander or a known ace, but not some run of the mill hurler. Similarly, if a previously unheralded and perhaps terrible starter were to be sporting a 2.50 ERA in July after 12 solid starts, the notion that he’s still a bad pitcher, although not quite as bad as we previously estimated, is antithetical to one of the strongest biases that human beings have when it comes to sports, gambling, and in fact, many other aspects of life in general – recency bias. According to the online skeptics dictionary, recency bias is, “the tendency to think that trends and patterns we observe in the recent past will continue in the future.”

I looked at all starting pitchers in the last 3 years who either:

  1. In the first week of July, had a RA9 (runs allowed per 9 innings) adjusted for park, weather, and opponent, that was at least 1 run higher than their mid-season (as of June 30) projection. In addition, these pitchers had to have a projected context-neutral RA9 of less than 4.00 (good pitchers).
  2. In the first week of July, had an adjusted RA9 at least 1 run lower than their mid-season projection. They also had to have a projection greater than 4.50 (bad pitchers).

Basically, group I pitchers above were projected to be good pitchers but had very poor results for around 3 months. Group II pitchers were projected to be bad pitchers despite having very good results in the first half of the season.
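
The two selection rules can be written as a small Python function. This is only a sketch of the criteria as stated above; the function name and return labels are hypothetical.

```python
def july_group(ra9_to_date, projected_ra9):
    """Return "I", "II", or None, following the two criteria above.

    ra9_to_date is the adjusted season-to-date RA9; projected_ra9 is the
    mid-season, context-neutral projection.
    """
    if projected_ra9 < 4.00 and ra9_to_date - projected_ra9 >= 1.0:
        return "I"   # projected to be good, cold first half
    if projected_ra9 > 4.50 and projected_ra9 - ra9_to_date >= 1.0:
        return "II"  # projected to be bad, hot first half
    return None      # not part of the study


print(july_group(5.45, 3.76), july_group(3.33, 4.95), july_group(4.20, 4.10))
```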

A projection is equivalent to estimating a player’s most likely performance for the next game or for the remainder of the season (not accounting for aging). So in order to test a projection, we usually look at that player’s or a group of players’ performance in the future. In order to mimic the real-time question, “How do we expect this pitcher to pitch today?” I looked at each pitcher’s performance, in RA9, over his next 3 starts.

Here are the aggregate results:

The average RA9 from 2015-2017 was around 4.39.

Group I pitchers (cold first half) N=36 starts after first week in July

| Season-to-date RA9 | Projected RA9 | Next 3 starts RA9 |
| --- | --- | --- |
| 5.45 | 3.76 | 3.71 |

Group II Pitchers (hot first half) N=84 starts after first week in July

| Season-to-date RA9 | Projected RA9 | Next 3 starts RA9 |
| --- | --- | --- |
| 3.33 | 4.95 | 4.81 |

As you can see, the season-to-date context neutral (adjusted for park, weather and opponent) RA9 tells us almost nothing about how these pitchers are expected to pitch, independent of our projection. Keep in mind that the projection has the current season performance baked into the model, so it’s not that the projection is ignoring the “anomalous” performance, and somehow magically the pitcher reverts to somewhere around his prior performance.

Actually, two things are happening here to create these dissonant (within the context of recency bias) results: One, these projections are using 3 or 4 years of prior performance (including the minor leagues), if available, such that another 3 months, even the most recent 3 months (which gets more weight in our projection model), often doesn’t have much effect on the projection (depending on how much prior data there is). As well, even if there isn’t that much prior data the very bad or good 3-month performance is going to get regressed towards league average anyway.

Two, how much integrity is there in a very bad RA9 for a pitcher who was and is considered a very good pitcher, and vice versa? By that, I mean does it really reflect how well the pitcher has pitched in terms of the components allowed or was he just lucky or unlucky in terms of the timing of those events? We can attempt to answer that question by looking at our same pitchers above and seeing how their season-to-date RA9 looks compared to a component RA9, which is an RA9-looking number constructed from a pitcher’s component stats (using a BaseRuns formula). Let’s add that to the charts above.
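
For readers who haven’t seen one, here is a Python sketch of a component RA9 built on one common, simple version of BaseRuns. This is not necessarily the exact formula or set of coefficients behind the numbers in this post, and the stat line in the example is invented.

```python
def base_runs(ab, h, tb, hr, bb):
    """A simple, widely circulated BaseRuns variant (an assumption here)."""
    a = h + bb - hr                                        # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02    # advancement
    c = ab - h                                             # outs
    d = hr                                                 # automatic runs
    return a * b / (b + c) + d


def component_ra9(ab, h, tb, hr, bb, innings):
    """Estimated runs allowed per 9 innings from the components only."""
    return base_runs(ab, h, tb, hr, bb) / innings * 9


# Invented line allowed by a pitcher over 100 innings.
print(component_ra9(ab=380, h=95, tb=150, hr=12, bb=30, innings=100))
```

Because the estimate ignores the order in which the hits, walks, and homers occurred, the gap between it and the actual RA9 is a rough measure of timing luck.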

Group I

| Season-to-date RA9 | To-date component RA9 | Projected RA9 | Next 3 starts RA9 |
| --- | --- | --- | --- |
| 5.45 | 4.40 | 3.76 | 3.71 |

Group II

| Season-to-date RA9 | To-date component RA9 | Projected RA9 | Next 3 starts RA9 |
| --- | --- | --- | --- |
| 3.33 | 4.25 | 4.84 | 4.81 |

These pitchers’ component results were not nearly as bad or good as their RA9 suggests.

So, if a pitcher is still projected to be a good pitcher, even after a terrible first half (or vice versa), RA9-wise (and presumably ERA-wise), two things are going on to justify that projection: One, the first half may be a relatively small sample compared to 3 or 4 years prior performance – remember, everything counts (albeit recent performance is given more weight)! Two, and more importantly, that RA9 is mostly timing-driven luck. The to-date components suggest that both the hot and cold pitchers have not pitched nearly as badly or as well as their RA9 suggests. The to-date component RA9’s are around league-average for both groups.

The takeaway here is that your recency bias will cause you to reject these projections in favor of to-date performance as reflected in RA9 or ERA, when in fact the projections are still the best predictor of future performance.

When a team wins the World Series (or even a game), the winning manager is typically forgiven for all his ‘sins.’ His mistakes, large and small, are relegated to the scrap heap marked, “Here lie the sins of our manager and all managers before him, dutifully forgotten or forgiven by elated and grateful fans and media pundits and critics alike.”

But should they be forgotten or forgiven simply because his team won the game or series? I’m not going to answer that. I suppose that’s up to those fans and the media. What I can say is this: As with many things in life that require a decision or a strategy, the outcome in sports rarely has anything to do with the efficacy of that decision. In baseball, when a manager has a choice between, say, strategy A or strategy B, how it turns out in terms of the immediate outcome of the play or that of the game, has virtually nothing to do with which strategy increased or decreased each team’s win expectancy (their theoretical chance of winning the game, or how often they would win the game if it were played from that point forward an infinite number of times).

Of course, regardless of how much information we have or how good our analysis is, we can’t know with pinpoint accuracy what those win expectancies are; however, with a decent analysis and reasonably accurate and reliable information, we can usually do a pretty good job.

It’s important to understand that the absolute magnitude of those win percentages is not what’s important, but their relative values. For example, if we’re going to evaluate the difference between, say, issuing an intentional walk to player A versus allowing him to hit, it doesn’t matter much how accurate our pitcher projections are or even those of the rest of the lineup, other than the batter who may be walked and the following batter or two. It won’t greatly affect the result we’re looking for – the difference in win expectancy between issuing the IBB or not.

The other thing to keep in mind – and this is particularly important – is that if we find that the win expectancy of one alternative is close to that of another, we can’t be even remotely certain that the strategy with the higher win expectancy is the “better one.” In fact, it is a custom of mine that when I find a small difference in WE I call it a toss-up.

The flip side of that is this: When we find a large difference in WE, even with incomplete information and an imperfect model, there is a very good chance that the alternative that our model says has the higher win expectancy does in fact yield a higher win percentage if we had perfect information and a perfect model.

How small is “close” and how big is “a large difference?” There is no cut-off point above which we can say with certainty that, “Strategy A is better,” or below which we have to conclude, “It’s a toss-up.” It’s not a binary thing. Basically the larger the difference, the more confident we are in our estimate (that one decision is “better” than the other from the standpoint of win expectancy). In addition, the larger the difference, the more confident we are that choosing the “wrong strategy” is a big mistake.

To answer the question of specifically what constitutes a toss-up and what magnitude of difference suggests a big mistake (if the wrong strategy is chosen), the only thing I can offer is this: I’ve been doing simulations and analyses of managerial decisions for over 20 years. I’ve looked at pinch hitting, base running, bunting, relievers, starters, IBB’s, you name it. As a very rough rule of thumb, any difference below .5% in win expectancy could be considered a toss-up, although it depends on the exact nature of the decisions – some have more uncertainty than others. From .5% to 1%, I would consider it a moderate difference with some degree of uncertainty. 1-2% I consider fairly large and I’m usually quite certain that the alternative with the larger WE is indeed the better strategy. Anything over 2% is pretty much a no-brainer – strategy A is much better than strategy B and we are 95% or more certain that that is true and that the true difference is large.
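
That rule of thumb, expressed as a Python sketch. The cut-offs are the rough ones given above; as noted, they are not hard boundaries, and the function name and labels are invented for the example.

```python
def classify_we_difference(diff_pct):
    """Label a win expectancy difference given in percentage points."""
    diff = abs(diff_pct)
    if diff < 0.5:
        return "toss-up"
    if diff < 1.0:
        return "moderate, some uncertainty"
    if diff <= 2.0:
        return "fairly large"
    return "no-brainer"


for d in (0.3, 0.8, 1.4, 6.6):
    print(d, classify_we_difference(d))
```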

With all that in mind, I want to revisit Game 6 of the World Series. In the top of the 5th inning, the Astros were up 1-0 with runners on second and third, one out, and Justin Verlander, arguably their best starting pitcher (although Morton, McCullers and Keuchel are probably not too far behind, if at all), due to bat. I’m pretty sure that the Astros manager, Hinch, or anyone else for that matter, didn’t even think twice about whether Verlander was going to bat or not. The “reasoning” I suppose was that he’d only pitched 4 innings, was pitching well, and the Astros were already up 1-0.

Of course, reasoning in “words” like that rarely gets you anywhere in terms of making the “right” decision. The question, at least as a starting point, is, “What is the Astros’ win expectancy with Verlander batting versus with a pinch hitter?” You can argue all you want about how much removing Verlander, burning a pinch hitter, using your bullpen in the 5th, and perhaps damaging Verlander’s ego or affecting the morale of the team, affects the outcome of the game and the one after that (if there is a 7th game) and perhaps even the following season; however, that argument can only be responsibly made in the context of how much win expectancy is lost by letting Verlander hit. As it turns out, that’s relatively easy to find out with a simple game simulator.  We know approximately how good or bad of a hitter Verlander is, or at least we can estimate it, and we know the same about a pinch hitter like Gattis, Fisher, or Maybin. It doesn’t even matter how good those estimates are. It’s not going to change the numbers much.

Even without using a simulator, we can get a pretty good idea as to the impact of a pinch hitter in that situation: The run expectancy with a typical hitter at the plate is around 1.39 runs. With an automatic out, the run expectancy decreases to .59 runs, a loss of .80 runs or, at roughly 10 runs per win, about 8% in win expectancy. That’s enormous. Now, Verlander is obviously not an automatic out, although he is apparently not a good hitting pitcher, having spent his entire career in the AL prior to a few months ago. If we assume a loss of only .6 runs, we still get a 6% difference in win expectancy between Verlander and a pinch hitter. These are only very rough estimates, however, since translating run expectancy to win expectancy depends on the score and inning. The best thing we can do is to run a game simulator.

I did just that, using the approximate offensive line for a poor hitting pitcher, and that of Evan Gattis as pinch hitter. The difference after simulating 100,000 games for each alternative was 6.6%, not too far off from our basic estimate using run expectancies. This is a gigantic difference. I can’t emphasize enough how large a difference that is. Decisions such as whether to IBB a batter, bunt, replace a batter or pitcher to get a platoon advantage, remove a starter for a reliever, replace a reliever with a better reliever, etc. typically involve differences in win expectancy of 1% or less. As I said earlier, anything greater than 1% is considered significant and anything above 2% is considered large. 6.6% is almost unheard of. About the only time you’ll encounter that kind of difference is exactly in this situation – a pitcher batting versus a pinch hitter, in a close game with runners on base, and especially with 1 or 2 outs, when the consequences of an out are devastating.

To give you an idea of how large a 6.6% win expectancy advantage is, imagine that your manager decided to remove Mike Trout and Joey Votto, perhaps the two best hitters in baseball, from a lineup and replace them with two of the worst hitters in baseball for game 6 of the World Series. How much do you think that would be worth to the opposing team? Well, that’s worth about 6.6%, the same as letting Verlander hit in that spot rather than a pinch hitter. What would you think of a manager who did that?

Now, as I said, there are probably other countervailing reasons for allowing him to hit. At least I hope there were, for Hinch’s and the Astros’ sake. I’m not here to discuss or debate those though. I’m simply here to tell you that I am quite certain that the difference between strategy A and B was enormous – likely on the order of 6-7%. Could those other considerations argue towards giving up that 6.6% at the moment? Again, I won’t discuss that. I’ll leave that up to you to ponder. I will say this, however: If you think that leaving Verlander in the game for another 2-3 innings or so (he ended up pitching another 2 innings) was worth that 6.6%, it’s likely that you’re sadly mistaken.

Let’s say that Verlander is better than any bullpen alternative (or at least the net result, including the extra pressure on the pen for the rest of game 6 and a potential game 7, was that Verlander was the best choice) by ½ run a game. It’s really difficult to argue that it could be much more than that, and if it were up to me, I’d argue that taking him out doesn’t hurt the Astros’ pitching at all. What is the win impact of ½ run a game, for 2.5 innings? Let’s call the average leverage in the 5th-7th innings 1.5 since it was a close game in the 5th. That comes out to 2.1%. So, if letting Verlander pitch through the middle of the 7th inning on the average was better than an alternative reliever by ½ run a game, the impact of removing Verlander for a pinch hitter would be 4.5% rather than 6.6%. 4.5% is still enormous. It’s worth more than the impact of replacing George Springer with Derek Fisher for an entire game because Springer didn’t say, “Good morning” to you – a lot more. Again, I’ll leave it to you to mull the impact of any other countervailing reasons for not removing Verlander.
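
The back-of-the-envelope math in that paragraph can be sketched in Python. The 10-runs-per-win conversion is an assumption on my part for the example; it is the rough rate that reproduces the 2.1% figure. The function name is hypothetical.

```python
def pitcher_edge_we_pct(runs_per_9, innings, leverage, runs_per_win=10):
    """Win expectancy, in percentage points, gained from a pitcher who is
    runs_per_9 better than the alternative over the given innings."""
    runs_saved = runs_per_9 / 9 * innings
    return runs_saved * leverage / runs_per_win * 100


# Verlander half a run per 9 better than the pen, for 2.5 innings at LI 1.5.
verlander_edge = pitcher_edge_we_pct(0.5, 2.5, 1.5)
# Net cost of letting him bat: the simulated 6.6% minus what his pitching buys.
net_cost = 6.6 - verlander_edge
print(round(verlander_edge, 1), round(net_cost, 1))
```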

Before we go, I want to also quickly address Roberts’ decision to walk Springer and pitch to Bregman after Verlander struck out. There were 2 outs, runners on second and third, and the Astros were still up 1-0. Of course Roberts brought in Morrow to pitch to the right-handed Bregman, although Morrow could have pitched to Springer, also a righty. What was the difference in win expectancies between walking and not walking Springer? That is also easy to simulate, although a basic simulator will undervalue the run and win expectancy when the bases are loaded because it’s difficult to pitch in that situation. In any case, the simulator says that not walking Springer is worth around 1.4% in win expectancy. That is considered a pretty large difference, and thus a pretty significant mistake by Roberts, although it was dwarfed by Hinch’s decision to let Verlander bat. It is interesting that one batter earlier Hinch gratuitously handed Roberts 6.6% in win expectancy and then Roberts promptly handed him back 1.4%! At least he returned the generosity!

Now, if you asked Hinch what his reasons were for not pinch hitting for Verlander, regardless of his answer – maybe it was a good one and maybe it wasn’t – you would expect that at the very least he must know what the ‘naked’ cost of that decision was. That’s critical to his decision-making process even if he had other good reasons for keeping Verlander in the game. The overall decision cannot be based on those reasons in isolation. It must be made with the knowledge that he has to “make up” the lost 6.6%. If he doesn’t know that, he’s stabbing in the dark. Did he have some idea as to the lost win expectancy in letting his pitcher bat, and how important and significant a number like 6.6% is? I have no idea. The fact that they won game 7 and “all is forgiven” has nothing to do with this discussion though. That I do know.

Last night in game 4 of the 2017 World Series, the Astros manager, A.J. Hinch, sort of a sabermetric wunderkind, at least as far as managers go (the Astros are one of the more, if not the most, analytically oriented teams), brought in their closer, Ken Giles, to pitch the 9th in a tie game. This is standard operating procedure for the sabermetrically inclined team – bring in your best pitcher in a tie game in the 9th inning or later, especially if you’re the home team, where you’ll never have the opportunity to protect a lead. The reasoning is simple: You want to guarantee that you’ll use your best pitcher in the 9th or later inning, in a high leverage situation (in the 9th inning or later of a tie game, the LI is always at least 1.73 to start the inning).

So what’s the problem? Hinch did exactly what he was supposed to do. It is more or less the optimal move, although it depends a bit on the quality of that closer against the batters he’s going to face, as opposed to the alternative (as well as other bullpen considerations). In this case, it was Giles versus, say, Devenski. Let’s look at their (my) normalized (4.00 is average) runs allowed per 9 inning projections:

Devenski: 3.37

That’s a very good reliever. That’s closer quality although not elite closer quality.

Giles: 2.71

That is an elite closer. In fact, I have Giles as the 6th best closer in baseball. The gap between the two pitchers is pretty substantial, .66 runs per 9 innings. For one inning with a leverage index (LI) of 2.0, that translates to a 1.5% win expectancy (WE) advantage for Giles over Devenski. As one-decision “swings” (the difference between the optimal and a sub-optimal move) go, that’s considered huge. Of course, if you are going to use Giles later in the game anyway if you stay with Devenski for another inning or two, if the game goes that long, you get some of that WE back. Not all of it (because he may not get to pitch), but some of it. Anyway, that’s not really the issue I want to discuss.
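The arithmetic behind that 1.5% figure can be sketched in a few lines of Python. This is only a back-of-the-envelope version: the conversion of roughly 10 runs per win is an assumed rule of thumb that the text does not state, and the function name is made up for illustration.

```python
def win_expectancy_swing(ra9_gap, innings, leverage_index, runs_per_win=10.0):
    """Approximate WE gained by using the better pitcher.

    runs_per_win=10 is an ASSUMED rule of thumb, not a number from the article.
    """
    runs_saved = ra9_gap / 9.0 * innings           # runs saved at average leverage
    return runs_saved * leverage_index / runs_per_win


giles, devenski = 2.71, 3.37                       # projections quoted in the text
swing = win_expectancy_swing(devenski - giles, innings=1, leverage_index=2.0)
print(f"{swing:.1%}")
```

Under those assumptions the swing comes out at about a percent and a half for a single inning at an LI of 2.0.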

Why were many of the so-called sabermetric writers (they often know just enough about sabermetrics or mathematical/logical thinking in general to be “dangerous,” although that’s a bit unfair on my part – let’s just say they know enough to be “right” much of the time, but “wrong” some of the time) aghast, or at least critical, of this seemingly correct move?

First, it was due to the result of course, which belies the fact that these are sabermetric writers. The first thing they teach you in sabermetrics 101 is not to be results oriented. For the most part, the results of a decision have virtually no correlation with the “correctness” of the decision itself. Sure, some of them will claim that they thought or even publicly said beforehand that it was the wrong move, and some of them are not lying – but it doesn’t really matter. That’s only one reason why lots of people were complaining of this move – maybe even the secondary reason (or not the reason at all), especially for the saber-writers.

The primary reason (again, at least stated – I’m 100% certain that the result strongly influenced nearly all of the detractors) was that these naysayers had little or no confidence in Giles going into this game. He must have had a bad season, right, despite my stellar projection? After all, good projection systems use 3, 4 or more years of data along with a healthy dose of regression, especially with relievers who never have a large sample size of innings pitched or batters faced. Occasionally you can have a great projection for a player who had a mediocre or poor season, and that projection will be just as reliable as any other (because the projection model accurately includes the current season, but doesn’t give it as much weight as nearly all fans and media do). So what were Giles’ 2017 numbers?

Only a 2.30 ERA and 2.39 FIP in a league where the average ERA was 4.37! His career ERA and FIP are 2.43 and 2.25, and he throws 98 mph. He’s a great pitcher. One of the best. There’s little doubt that’s true. But….

He’s thrown terribly thus far in the post-season. That is, his results have been poor. In 7.2 IP his ERA is 11.74. Of course he’s also struck out 10 and has a BABIP of .409. But he “looked terrible” these naysayers keep saying. Well, no shit. When you give up 10 runs in 7.2 innings on the biggest stage in sports, you’re pretty much going to “look bad.” Is there any indication, other than having poor results, that there’s “something wrong with Giles?” Given that his velocity is fine (97.9 so far) and that Hinch saw fit to remove Devenski who was “pitching well” and insert Giles in a critical situation, I think we can say with some certainty that there is no indication that anything is wrong with him. In fact, the data, such as his 12 K/9 rate, normal velocity, and an “unlucky” .409 BABIP, all suggest that there is nothing “wrong with him.” But honestly, I’m not here to discuss that kind of thing. I think it’s a futile and silly discussion. I’ve written many times how the notion that you can just tell (or that a manager can tell – which is not the case here, since Hinch was the one who decided to use him!) when a player is hot or cold by observing him is one of the more silly myths in sports, at least in baseball, and I have reams of data-driven evidence to support that assertion.
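For anyone checking the small-sample numbers, here is a minimal sketch of the rate arithmetic. It assumes that “7.2 IP” is box-score notation for 7 and 2/3 innings and that all 10 runs were earned (which is what the quoted 11.74 ERA implies); the helper names are illustrative only.

```python
def innings_from_box_notation(ip):
    """Convert box-score innings such as '7.2' (7 and 2/3) to a float."""
    whole, _, thirds = str(ip).partition(".")
    return int(whole) + (int(thirds) if thirds else 0) / 3.0


def per_nine(count, ip):
    """Rate per 9 innings for any counting stat (runs, strikeouts, ...)."""
    return 9.0 * count / innings_from_box_notation(ip)


era = per_nine(10, "7.2")   # 10 runs, assumed earned
k9 = per_nine(10, "7.2")    # 10 strikeouts
print(round(era, 2), round(k9, 1))
```

Ten runs and ten strikeouts over the same innings give the same rate per nine, which is why a double-digit ERA and a K rate of about 12 per 9 can sit side by side in a sample this small.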

What I’m interested in discussing right now, is, “What do the data say?” How do we expect a reliever to pitch after 6 or 7 innings or appearances in which he’s gotten shelled? It doesn’t have to be 7 IP of course, but for research like this, it doesn’t matter. Whatever you find in 7 IP you’re going to find in 5 IP or in 12 IP, assuming you have large enough sample sizes and you don’t get really unlucky with a Type I or II error. The same goes for what constitutes getting shelled compared to how you perceive or define “getting shelled.” With research like this, it doesn’t matter. Again, you’re going to get the same answer whether you define getting shelled (or pitching brilliantly) by wOBA against, runs allowed, hard hit balls, FIP, etc. It also doesn’t matter what thresholds you set – you’ll also likely get the same answer.

Here’s what I did to answer this question – or at least to shed some light on it. I looked at all relievers over the last 10 years and split them up into three groups, depending on how they pitched in all 6-game sequences. Group I pitched brilliantly over a 6-game span. The criterion I set was a wOBA against of less than .175. Group III were pitchers who got hammered over a 6-game stretch, at least as far as wOBA was concerned (of course in large samples you will get equivalent RA for these wOBA). They allowed a wOBA of at least .450. Group II was all the rest. Here is what the groups looked like:

Group   Average wOBA against   Equivalent RA9
I       .130                   Around 0
II      .308                   Around 3
III     .496                   Around 10

 

Then I looked at their very next appearance. Again, I could have looked at their next 2 or 3 appearances but it wouldn’t make any difference (other than increasing the sample size – at the risk of the “hot” or “cold” state wearing off).

 

Group   Average wOBA against   wOBA next appearance
I       .130                   .307
II      .308                   .312
III     .496                   .317
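The grouping procedure described above can be sketched roughly as follows. This is not the actual research code; the input layout (one chronological list of `(woba_numerator, pa)` tuples per reliever) and the function names are assumptions made for illustration, while the .175 and .450 cutoffs are the ones given in the text.

```python
def label_next_appearances(appearances, window=6, low=0.175, high=0.450):
    """appearances: chronological (woba_numerator, pa) tuples for ONE reliever.

    Returns (group, next_appearance) pairs, one for every 6-game window
    that is followed by at least one more appearance.
    """
    labeled = []
    for start in range(len(appearances) - window):
        span = appearances[start:start + window]
        numerator = sum(n for n, _ in span)
        pa = sum(p for _, p in span)
        if pa == 0:
            continue                      # no batters faced, nothing to grade
        woba = numerator / pa
        if woba < low:
            group = "I"                   # pitched brilliantly
        elif woba >= high:
            group = "III"                 # got hammered
        else:
            group = "II"                  # all the rest
        labeled.append((group, appearances[start + window]))
    return labeled


def pooled_woba(labeled, group):
    """wOBA against in the 'next appearance', pooled over one group."""
    rows = [nxt for g, nxt in labeled if g == group]
    pa = sum(p for _, p in rows)
    return sum(n for n, _ in rows) / pa if pa else None
```

Pooling by total PA rather than averaging per-appearance wOBA keeps a one-batter outing from counting as much as a six-batter one.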

 

While we certainly don’t see a large carryover effect, we do appear to see some effect. The relievers who have been throwing brilliantly continue to pitch 10 points better than the ones who have been getting hammered. 10 points in wOBA is equivalent to about .3 runs per 9 innings, so that would make a pitcher like Giles closer to Devenski, but still not quite there. But wait! Are these groups of pitchers of the same quality? No. The ones who were pitching brilliantly belong to a much better pool of pitchers than the ones who were getting hammered. Much better. This should not be surprising. I already assumed that when doing the research. How much better? Let’s look at their seasonal numbers (those will be a little biased because we already established that these groups pitched brilliantly or terribly for some period of time in the same season).

Group   Average wOBA against   wOBA next appearance   Season wOBA
I       .130                   .307                   .295
II      .308                   .312                   .313
III     .496                   .317                   .330

 

As you can see our brilliant pitchers are much better than our terrible ones. Even if we were able to back out the bias (say, by looking at last year’s wOBA), we still get .305 for the brilliant relievers and .315 for the hammered ones, based on the previous season’s numbers. In fact, we’ll use those instead.

Group   Average wOBA against   wOBA next appearance   Prior season wOBA
I       .130                   .307                   .305
II      .308                   .312                   .314
III     .496                   .317                   .315

 

Now that’s brilliant. We do have some sample error. The number of PA in the “next appearance” for groups I and III is around 40,000 each (SD of wOBA = 2 points). However, look at the “expected” wOBA against, which is essentially the pitcher talent (like Giles’ and Devenski’s projections), compared to their actual. They are almost identical. Regardless of how a reliever has pitched in his last 6 appearances, he pitches almost exactly as his normal projection would suggest on that 7th appearance. The last 6 appearances have virtually no predictive value even at the extremes. I don’t want to hear, “Well, he’s really (really, really) been getting hammered – what about that big shot?” Allowing a .496 wOBA is getting really, really, really hammered, and .130 is throwing almost no-hit baseball, so we’ve already looked at the extremes!
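To get a feel for the sampling error mentioned above, here is a rough sketch. The per-PA standard deviation of about 0.5 wOBA points is an assumption (it is not given in the text), so treat the output as an order-of-magnitude check rather than a reproduction of the quoted figure.

```python
import math


def woba_standard_error(pa, per_pa_sd=0.5):
    """SD of a group's average wOBA; per_pa_sd=0.5 is an ASSUMED value."""
    return per_pa_sd / math.sqrt(pa)


se = woba_standard_error(40_000)       # one group of ~40,000 PA
se_gap = math.sqrt(2) * se             # SD of the gap between two such groups
print(round(se * 1000, 1), round(se_gap * 1000, 1))   # in wOBA "points"
```

With that assumption each group’s average carries two to three points of noise, the same ballpark as the 2 points quoted, and the gap between two groups carries a bit more.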

So, as you can clearly see, and exactly as you should have expected if you really knew about sabermetrics (unlike some of these so-called saber-oriented writers and pundits who like to cherry pick the sabermetric principles that suit their narratives and biases), 7 IP of pitching, compared to 150 or more, is almost worthless information. The data don’t lie.

But you just know that something is wrong with Giles, right? You can just tell. You are absolutely certain that he’ll continue to pitch badly. You just knew that he was going to implode again last night (and you haven’t been wrong about that 90% of the time in your previous feelings). It’s all bullshit folks. But if it makes you feel smart or happy, it’s fine by me. I have nothing invested in all of this. I’m just trying to find the truth. It’s the nature of my personality. That makes me happy.

There’s an article up on Fangraphs by Eno Sarris that talks about whether the pitch to Justin Turner in the bottom of the 9th inning in Game 2 of the 2017 NLCS was the “wrong” pitch to throw in that count (1-0) and situation (tie game, runners on 1st and 2nd, 2 outs) given Turner’s proclivities at that count. I won’t go into the details of the article – you can read it yourself – but I do want to talk about what it means or doesn’t mean to criticize a pitcher’s pitch selection on one particular pitch, and how pitch selection even works, in general.

Let’s start with this – the basic tenet of pitching and pitch selection: Every single situation calls for a pitch frequency matrix. One pitch is chosen randomly from that matrix according to the “correct” frequencies. The “correct” frequencies are those at which every pitch in the matrix yields the exact same “result” (where result is measured by the win expectancy impact of all the possible outcomes combined).

Now, obviously, most pitchers “think” they’re choosing one specific pitch for some specific reason, but in reality since the batter doesn’t know the pitcher’s reasoning, it is essentially a random selection as far as he is concerned. For example, a pitcher throws an inside fastball to go 0-1 on the batter. He might think to himself, “OK, I just threw the inside fastball so I’ll throw a low and away off-speed to give him a ‘different look.’ But wait, he might be expecting that. I’ll double up with the fastball! Nah, he’s a pretty good fastball hitter. I’ll throw the off-speed! But I really don’t want to hang one on an 0-1 count. I’m not feeling that confident in my curve ball yet. OK, I’ll throw the fastball, but I’ll throw it low and away. He’ll probably think it’s an off-speed and lay off of it and I’ll get a called strike, or he’ll be late if he swings.”

As you can imagine, there are an infinite number of permutations of ‘reasoning’ that a pitcher can use to make his selection. The backdrop to his thinking is that he knows what tends to be effective at 0-1 counts in that situation (score, inning, runners, outs, etc.) given his repertoire, and he knows the batter’s strengths and weaknesses. The result is a roughly game theory optimal (GTO) approach which cannot be exploited by the batter and is maximally effective against a batter who is thinking roughly GTO too.

The optimal pitch selection frequency matrix is dependent on the pitcher, the batter, the count and the game situation. In that situation with Lackey on the mound and Turner at the plate, it might be something like 50% 4-seam, 20% sinker, 20% slider, and 10% cutter. The numbers are irrelevant. Then a random pitch is selected according to those frequencies, where, for example, the 4-seamer is chosen two and a half times as often as the sinker or the slider, the sinker and slider twice as often as the cutter, etc.
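Choosing “randomly according to the frequencies” is just a weighted random draw. A minimal sketch using the illustrative 50/20/20/10 mix from the paragraph above (the numbers are the article’s made-up example, and `choose_pitch` is a hypothetical helper name):

```python
import random
from collections import Counter

# Illustrative mix from the text, not a real scouting report.
PITCH_MIX = {"4-seam": 0.50, "sinker": 0.20, "slider": 0.20, "cutter": 0.10}


def choose_pitch(mix, rng=random):
    """Draw one pitch type with probability equal to its share of the mix."""
    pitches = list(mix)
    weights = list(mix.values())
    return rng.choices(pitches, weights=weights, k=1)[0]


rng = random.Random(2017)                      # seeded so the run is repeatable
counts = Counter(choose_pitch(PITCH_MIX, rng) for _ in range(100_000))
shares = {pitch: counts[pitch] / 100_000 for pitch in PITCH_MIX}
print(shares)
```

Over many draws the observed shares settle near the intended ones, but any single draw tells you almost nothing about the mix it came from, which is the point the rest of this piece leans on.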

Obviously doing that even close to accurately is impossible, but that’s essentially what happens and what is supposed to happen. Miraculously, pitchers and catchers do a pretty good job (really you just have to have a pretty good idea as to what pitches to throw, adjusted a little for the batter). At least I presume they do. It is likely that some pitchers and batters are better than others at employing these GTO strategies as well as exploiting opponents who don’t.

The more a batter likes a certain pitch (in that count or overall), the less that pitch will be thrown, and the more he dislikes it, the more it will be thrown. In order to understand why, you must understand that a batter’s result against a pitch is directly related to the frequency at which it is thrown in a particular situation: the more often he sees it, the better he does against it. For example, if Turner is particularly good against a sinker in that situation or in general, it might be thrown 10% rather than 20% of the time. The same is true for locations of course, which makes everything quite complex.

Remember that you cannot tell what types and locations of pitches a batter likes or dislikes in a certain count and game situation from his results! This is a very important concept to understand. The results of every pitch type and location in each count, game situation, and versus each pitcher (you would have to do a “delta method” to figure this) are and should be exactly the same! Any differences you see are noise – random differences (or the result of wholesale exploitative play or externalities as I explain below). We can easily prove this with an example.

Imagine that in all 1-0 counts, early in a game with no runners on base and 0 outs (we’re just choosing a ‘particular situation’ – which situation doesn’t matter), we see that Turner gets a FB 80% of the time and a slider 20% of the time (again, the actual numbers are irrelevant). And we see that Turner’s results (we have to add up the run or win value of all the results – strike, ball, batted ball out, single, double, etc.) are much better against those 80% FB than the 20% SL. Can we conclude that Turner is better against the FB in that situation?

No! Why is that? Because if we did, we would HAVE TO also conclude that the pitchers were throwing him too many FB, right? They would then reduce the frequency of the fastball. Why throw a certain pitch 80% of the time (or at all, for that matter) when you know that another pitch is better?

You would obviously throw it less often than 80% of the time. How much less? Well, say you throw it 79% and the slider 21%. You must be better off with that ratio (rather than 80/20) since the slider is the better pitch, as we just said for this thought exercise. Now what if the FB still yields better results for Turner (and it’s not just noise – he’s still better versus the FB when he knows it’s coming 79% of the time)? Well, again obviously, you should throw the FB even less often and the slider more often.

Where does this end? Every time we decrease the frequency of the FB, the batter gets worse at it since it’s more of a surprise. Remember the relationship between the frequency of a pitch and its effectiveness. At the same time, he gets better and better at the slider since we throw it more and more frequently. It ends at the point in which the results of both pitches are exactly equal. It HAS to. If it “ends” anywhere else, the pitcher will continue to make adjustments until an equilibrium point is reached. This is called a Nash equilibrium in game theory parlance, at which point the batter can look for either pitch (or any pitch if the GTO mixed strategy includes more than two pitches) and it won’t make any difference in terms of the results. (If the batter doesn’t employ his own GTO strategy, then the pitcher can exploit him by throwing one particular pitch – in which case he then becomes exploitable, which is why it behooves both players to always employ a GTO strategy or risk being exploited.) As neutral observers, unless we see evidence otherwise, we must assume that all actors (batters and pitchers) are indeed using a roughly GTO strategy and that we are always in equilibrium. Whether they are or they aren’t, to whatever degree and in whichever situations, it certainly is instructive for us and for them to understand these concepts.
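The “keep shaving a percent off the fastball until the two pitches are equal” process can be sketched with a toy model. Everything numeric here is hypothetical: the two linear response functions (the batter does better against a pitch the more often he sees it) and their coefficients are invented purely to show the mechanics.

```python
def batter_value_vs_fb(fb_share, base=-0.02, slope=0.10):
    """Hypothetical batter run value against the FB; rises with FB frequency."""
    return base + slope * fb_share


def batter_value_vs_sl(fb_share, base=-0.03, slope=0.15):
    """Hypothetical batter run value against the SL; rises with SL frequency."""
    return base + slope * (1.0 - fb_share)


def find_equilibrium_pct(start_pct=80, tol=1e-9):
    """Lower the FB share 1% at a time while the batter still does better vs the FB.

    Only handles the case in the text, where we start with too many fastballs.
    """
    pct = start_pct
    while pct > 0 and batter_value_vs_fb(pct / 100) > batter_value_vs_sl(pct / 100) + tol:
        pct -= 1
    return pct


eq_pct = find_equilibrium_pct()
print(eq_pct, batter_value_vs_fb(eq_pct / 100), batter_value_vs_sl(eq_pct / 100))
```

In this toy model the adjustments stop at the fastball share where the batter’s value against both pitches is the same, which is the equilibrium condition described above; with different made-up coefficients the stopping point moves, but the equal-results property does not.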

Assuming an equilibrium, this is what you MUST understand: Any differences you see in either a batter’s or a pitcher’s results across different pitches MUST be noise – an artifact of random chance. Keep in mind that it’s only true for each subset of identical circumstances – the same opponent, count, and game situation (even umpire, weather, park, etc.). If you look at the results across all situations you will see legitimate differences across pitch types. That’s because they are thrown with different frequencies in different situations. For example, you will likely see better results for a pitcher with his secondary pitches overall simply because he throws them more frequently in pitcher’s counts (although this is somewhat offset by the fact that he throws them more often against better batters).

Is it possible that there are some externalities that throw this Nash equilibrium out of whack? Sure. Perhaps a pitcher must throw more FB than off-speed in order to prevent injury. That might cause his numbers for the FB to be slightly worse than for other pitches. Or the slider may be particularly risky, injury-wise, such that pitchers throw it less than GTO (game theory optimally), which makes its results better (from the pitcher’s standpoint) than those of the other pitches.

Any other deviations you see among pitch types and locations, by definition, must be random noise, or, perhaps exploitative strategies by either batters or pitchers (one is making a mistake and the other is capitalizing on it). It would be difficult to distinguish the two without some statistical analysis of large samples of pitches (and then we would still only have limited certainty with respect to our conclusions).

So, given all that is true, which it is (more or less), how can we criticize a particular pitch that a pitcher throws in one particular situation? We can’t. We can’t say that one pitch is “wrong” and one pitch is “right” in ANY particular situation. That’s impossible to do. We cannot evaluate the “correctness” of a single pitch. Maybe the pitch that we observe is the one that is only supposed to be thrown 5 or 10% of the time, and the pitcher knew that (and the batter was presumably surprised by it whether he hit it well or not)! The only way to evaluate a pitcher’s pitch selection strategy is by knowing the frequency at which he throws his various pitches against the various batters in the various counts and game situations. And that requires an enormous sample size of course.

There is an exception.

The one time we can say that a particular pitch is “wrong” is when that pitch is not part of the correct frequency matrix at all – i.e., the GTO solution says that it should never be thrown. That rarely occurs. About the only time that occurs is on 3-0 counts where a fastball might be the only pitch thrown (for example, 3-0 count with a 5 run lead, or even a 3-1 or 2-0 count with any big lead, late in the game – or a 3-0 count on an opposing pitcher who is taking 100% of the time).

Now that being said, let’s say that Lackey is supposed to throw his cutter away only 5% of the time against Turner. If we observe only that one pitch and it is a cutter, Bayes tells us that there is an inference that Lackey was intending to throw that pitch MORE than 5% of the time and we can indeed say with some small level of certainty that he “threw the wrong pitch.” We don’t really mean he “threw the wrong pitch.” We mean that we think (with some low degree of certainty) he had the wrong frequency matrix in his head to some significant degree (maybe he intended to throw that pitch 10% or 20% rather than 5%).*

So, the next time you hear anyone say what a pitcher should be throwing on any particular pitch or that the pitch he threw was “right” or “wrong,” it’s a good bet that they don’t really know what they’re talking about, even if they are or were a successful major league pitcher.

* Technically, we can only say something like, “We are 10% sure he was thinking 5%, 12% sure he was thinking 7%, 13% sure he was thinking 8%, etc.” – numbers for illustration purposes only.
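The Bayesian inference in the cutter example can be made concrete with a small sketch. The candidate frequencies and the prior weights below are hypothetical, chosen only to illustrate the direction of the update after seeing a single cutter; like the footnote’s numbers, they are for illustration purposes only.

```python
# Candidate "intended" cutter frequencies and a HYPOTHETICAL prior over them.
candidates = [0.05, 0.07, 0.08, 0.10, 0.20]
prior = [0.40, 0.20, 0.15, 0.15, 0.10]


def posterior_after_one_cutter(freqs, prior_weights):
    """P(intended freq | one cutter seen) is proportional to prior * freq."""
    unnormalized = [p * f for p, f in zip(prior_weights, freqs)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


posterior = posterior_after_one_cutter(candidates, prior)
prior_mean = sum(p * f for p, f in zip(prior, candidates))
posterior_mean = sum(p * f for p, f in zip(posterior, candidates))
print([round(p, 3) for p in posterior], round(prior_mean, 3), round(posterior_mean, 3))
```

Seeing the cutter shifts some weight away from “he intended 5%” toward the higher frequencies, but one pitch moves the estimate only modestly, which is why the certainty is described as low.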

It’s quite simple actually.

Apropos of the myriad articles and discussions about the run scoring and HR surge starting in late 2015 and continuing through 2017 to date, I want to go over what can cause league run scoring to increase or decrease from one year to the next:

  1. Changes in equipment, such as the ball or bat.
  2. Changes to the strike zone, either the overall size or the shape.
  3. Rule changes.
  4. Changes in batter strength, conditioning, etc.
  5. Changes in batter or pitcher approaches.
  6. Random variation.
  7. Weather and park changes.
  8. Natural variation in player talent.

I’m going to focus on the last one, variation in player talent from year to year. How does the league “replenish” its talent from one year to the next? Poorer players get less playing time, including those who get no playing time at all (retired, injured, or switched to another league). Better players get more playing time and new players enter the league. Much of that is because of the aging curve. Younger players generally get better and thus amass more playing time and older players get worse, playing less – eventually retiring or being released. All these moves can lead to each league having a little more or less overall talent and run scoring than in the previous year. How can we measure that change in talent/scoring?

One good method is to look at how a player’s league normalized stats change from year X to year X+1. First we have to establish a base line. To do that, we track the average change in some league normalized stat like Linear Weights, RC+ or wOBA+ over many years. It is best to confine it to players in a narrow age range, like 25 to 29, so that we minimize the problem of average league age being different from one year to the next, and thus the amount of decline with age also being different.

We’ll start with batting. The stat I’m using is linear weights, which is generally zeroed out at the league level. In other words, the average player in each league, NL and AL separately, has a linear weights of exactly zero. If we look at the average change from 2000 to 2017 for all batters from 25 to 29 years old, we get -.12 runs per team per game in the NL and -.10 in the AL. That means that these players decline with age and/or the quality of the league’s batting gets better every year. We’ll assume that most of that -.12 runs is due to aging (and that peak age is close to 25 or 26, which it probably is in the modern era), but it doesn’t matter for our purposes.

So, for example, if in year X to X+1 in the NL, all batters age 25-29 lost .2 runs per game per team, what would that tell us? It would tell us that league batting in year X+1 was better than in year X by .1 runs per team per game. Why is that? If players should lose only .1 runs but they lost .2 runs, and thus they look worse than they should relative to the league as a whole, that means that the league got better.
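As a minimal sketch of that logic: the implied change in league batting talent is the long-run baseline change for the 25-29 cohort minus the change actually observed. The function name is made up, and the values are the rounded ones from the example above.

```python
def league_batting_change(observed_change, baseline_change):
    """Implied change in league batting talent, in runs per team per game.

    Both inputs are the cohort's change in league-relative linear weights
    (negative = the cohort looked worse relative to the league).
    """
    return baseline_change - observed_change


# Cohort normally loses .1 runs; this year it lost .2 runs.
implied = league_batting_change(observed_change=-0.2, baseline_change=-0.1)
print(round(implied, 2))
```

A positive result means the league’s batting got better; if the cohort had lost less than its usual amount, the sign would flip and the league would have gotten worse.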

Keep in mind that the quality of the pitching has no effect on this method. Whether the overall pitching talent changes from year 1 to year 2 has no bearing on these calculations. Nor do changes in parks, differences in weather, or any other variable that might change from year to year and affect run scoring and raw offensive stats. We’re using linear weights, which is always relative to other batters in the league. The sum of everyone’s offensive linear weights in any given year and league is always zero.

Using this method, here is the change in batting talent from year to year, in the NL and AL, from 2000 to 2017. Plus means the league got better in batting talent. Minus means it got worse. In other words, a plus value means that run scoring should increase, everything else being the same. Notice the decline in offense in both leagues from 2016 to 2017 even though we see increased run scoring. Either pitching got much worse or something else is going on. We’ll see about the pitching.

Table I

Change in batting linear weights, in runs per game per team

Years NL AL
00-01 .09 -.07
01-02 -.12 -.23
02-03 -.15 -.11
03-04 .09 -.11
04-05 -.10 -.14
05-06 .15 .05
06-07 .09 .08
07-08 -.05 .08
08-09 -.13 .08
09-10 .17 -.12
10-11 -.18 .04
11-12 .12 0
12-13 -.03 -.05
13-14 .01 .07
14-15 .06 .09
15-16 .01 .05
16-17 -.03 -.12

 

Here is the same chart for league pitching. The stat I am using is ERC, or component ERA. Component ERA takes a pitcher’s raw rate stats, singles, doubles, triples, home runs, walks, and outs, per PA, park and defense adjusted, and converts them into a theoretical runs per 9 inning, using a BaseRuns formula. Like linear weights, it is scaled to league average. A plus number means that league pitching got worse, and hence run scoring should go up.

Table II

Change in pitching, in runs per game per team

Years NL AL
00-01 .02 .21
01-02 .03 .00
02-03 -.04 -.23
03-04 .07 .11
04-05 .00 .07
05-06 -.14 -.12
06-07 .10 .06
07-08 -.15 -.10
08-09 -.13 -.17
09-10 .01 .04
10-11 .03 .16
11-12 .03 -.06
12-13 -.02 .26
13-14 -.02 -.04
14-15 .06 -.02
15-16 .03 .04
16-17 .04 -.01

 

Notice that pitching in the NL got a little worse. Overall, when you combine pitching and batting, the NL has worse talent in 2017 compared to 2016, by .07 runs per team per game. NL teams should score .01 runs per game more than in 2016, again, all other things being equal (they usually are not).

In the AL, while we’ve seen a decrease in batting of -.12 runs per team per game (which is a lot), we’ve also seen a slight increase in pitching talent, .01 runs per game per team. We would expect the AL to score .13 runs per team per game less in 2017 than in 2016, assuming nothing else has changed. The overall talent in the AL, pitching plus batting, decreased by .11 runs.

The gap in talent between the NL and AL, at least with respect to pitching and batting only (not including base running and defense, which can also vary from year to year) has presumably moved in favor of the NL by .04 runs a game per team, despite the AL’s .600 record in inter-league play so far this year compared to .550 last year (one standard deviation of the difference between this year’s and last year’s inter-league w/l record is over .05, so the difference is not even close to being statistically significant – less than one SD).

Let’s complete the analysis by doing the same thing for UZR (defense) and UBR (base running). A plus defensive change means that the defense got worse (thus more runs scored). For base running, plus means better (more runs) and minus means worse.

Table III

Change in defense (UZR), in runs per game per team

Years NL AL
00-01 .01 -.07
01-02 -.01 .05
02-03 .18 -.07
03-04 .10 .03
04-05 .12 .00
05-06 -.08 -.07
06-07 .02 .03
07-08 .04 .01
08-09 -.02 -.02
09-10 -.01 -.02
10-11 .15 -.04
11-12 -.10 -.07
12-13 -.02 .03
13-14 -.10 .03
14-15 -.02 -.02
15-16 -.07 -.05
16-17 -.06 .05

 

From last year to this year, defense in the NL got better by .06 runs per team per game, signifying a decrease in run scoring. In the AL, the defense appears to have gotten worse, by .05 runs a game. By the way, since 2012, you’ll notice that teams have gotten much better on defense in general, likely due to an increased awareness of the value of defense, and the move away from the slow, defensively-challenged power hitter.

Let’s finish by looking at base running and then we can add everything up.

Table IV

Change in base running (UBR), in runs per game per team

Years NL AL
00-01 -.02 -.01
01-02 -.02 -.01
02-03 -.01 .00
03-04 .00 -.04
04-05 .02 .02
05-06 .00 -.01
06-07 -.01 -.01
07-08 .00 .00
08-09 .02 .02
09-10 -.02 -.02
10-11 .04 -.01
11-12 .00 -.02
12-13 -.01 -.01
13-14 .01 -.01
14-15 .01 .05
15-16 .01 -.03
16-17 .01 .01

 

Remember that the batting and pitching talent in the AL presumably decreased by .11 runs per team per game and they were expected to score .13 fewer runs per game per team, in 2017, as compared to 2016. Adding in defense and base running, those numbers are a decrease in AL talent by .15 runs and a decrease in run scoring of only .07 runs per team per game.

In the NL, when we add defense and base running to batting and pitching, we get no overall change in talent, from 2016 to 2017, and a decrease in run scoring of .04.
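The way the four components combine into “talent” and “run scoring” can be sketched as below, using the 16-17 rows from Tables I through IV. The sign conventions are the ones stated in the text (plus means better for batting and base running, but worse for pitching and defense); the function itself is just an illustration.

```python
def combine(batting, pitching, defense, baserunning):
    """Return (talent change, run scoring change) in runs per team per game.

    Plus pitching/defense means WORSE pitching/defense: it lowers talent
    but raises run scoring. Batting and base running raise both.
    """
    talent = batting - pitching - defense + baserunning
    runs = batting + pitching + defense + baserunning
    return round(talent, 2), round(runs, 2)


nl_talent, nl_runs = combine(batting=-0.03, pitching=0.04, defense=-0.06, baserunning=0.01)
al_talent, al_runs = combine(batting=-0.12, pitching=-0.01, defense=0.05, baserunning=0.01)
print(nl_talent, nl_runs, al_talent, al_runs)
```

Those reproduce the 2016 to 2017 figures quoted above: no change in NL talent with .04 fewer runs, and .15 less AL talent with .07 fewer runs.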

We also see a slight trend towards better base running since 2011, which should naturally occur with better defense.

Here is everything combined into one table.

Table V

Change in talent and run scoring, in runs per game per team. Plus means a gain in talent or more runs scored.

Years NL Talent AL Talent NL Runs AL Runs
00-01 .04 -.22 .09 .06
01-02 -.16 -.29 -.12 -.19
02-03 -.30 .19 -.02 -.41
03-04 -.08 -.29 .26 -.01
04-05 -.20 -.19 .04 -.05
05-06 .37 .23 -.07 -.15
06-07 -.02 -.02 .23 .16
07-08 .06 .17 -.16 -.01
08-09 .04 .29 -.26 -.09
09-10 .15 -.16 .05 -.12
10-11 -.31 -.09 .04 .15
11-12 .19 .11 .05 -.15
12-13 0 -.35 -.08 .23
13-14 .14 .07 -.10 .05
14-15 .03 .18 .11 .10
15-16 .06 .03 -.02 .03
16-17 0 -.15 -.04 -.07

If you haven’t read it, here’s the link.

For MY ball tests, the difference I found in COR was 2.6 standard deviations, as indicated in the article. The difference in seam height is around 1.5 SD. The difference in circumference is around 1.8 SD.

For those of you a little rusty on your statistics, the SD of the difference between two sample means is the square root of the sum of their respective variances.
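In code, that formula looks like the following sketch. The function names are illustrative, and the inputs in the example call are placeholder numbers chosen for easy arithmetic, not the ball-test data.

```python
import math


def sd_of_difference(sd1, n1, sd2, n2):
    """SD of (mean1 - mean2): sqrt of the sum of the two means' variances."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def difference_in_sds(mean1, mean2, sd1, n1, sd2, n2):
    """How many SDs apart two sample means are."""
    return (mean1 - mean2) / sd_of_difference(sd1, n1, sd2, n2)


# Placeholder inputs only: each sample mean has a variance of 1 here.
print(difference_in_sds(2.0, 0.0, sd1=3.0, n1=9, sd2=4.0, n2=16))
```

Note that the variance of each sample mean is the sample variance divided by the sample size, so small samples of balls widen the SD of the difference quickly.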

The use of statistical significance is one of the most misunderstood and abused concepts in science. You can read about this on the internet if you want to know why. It has a bit to do with frequentist versus Bayesian statistics/inference.

For example, when you have a non-null hypothesis going into an experiment, such as, “The data suggest an altered baseball,” then ANY positive result supports that hypothesis and increases the probability of it being true, regardless of the “statistical significance of those results.”

Of course the more significant the result, the more we increase that probability over our prior. However, the classic case of using 2 or 2.5 SD to define “statistical significance” really only applies when you start out with the null hypothesis. In this case, for example, that would be if you had no reason to suspect a juiced ball, and you merely tested balls just to see if perhaps there were differences. In reality, you almost always have a prior P, which is why the traditional concept of accepting or rejecting the null hypothesis based on the statistical significance of the results of your experiment is an obsolete concept.
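A minimal sketch of that Bayesian point, in odds form. The prior and the likelihood ratios are hypothetical; the only thing the example is meant to show is that any result more likely under “altered ball” than under “same ball” (a likelihood ratio above 1) raises the probability, whether or not it clears a 2 SD bar.

```python
def update(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio (odds form)."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


prior = 0.5                                  # HYPOTHETICAL prior for "altered ball"
weak = update(prior, likelihood_ratio=1.5)   # modest, "non-significant" result
strong = update(prior, likelihood_ratio=3.0) # stronger result
print(weak, strong)
```

Independent tests can be chained by feeding one posterior in as the next prior, which is the sense in which several separate ball tests add up.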

In any case, from the results of MLB’s own tests, in which they tested something like 180 balls a year, the seam height reduction we found was something like 6 or 7 SD and the COR increase was something like 3 or 4 SD. We also can add to the mix, Ben’s original test whereby he found an increase in COR of .003 or around 60% of what I found.

So yes, the combined results of all three tests are almost unequivocal evidence that the ball was altered. There’s not much else you can do other than to test balls. Of course the ball testing would mean almost nothing if we didn’t have the batted ball data to back it up. We do.

I don’t think this “ball change” was intentional by MLB, although it could be.

In my extensive research for this project, I have uncovered two things:

One, there is quite a large actual year to year difference in the construction of the ball which can and does have a significant impact on HR and offensive rates in general. The concept of a “juiced” (or “de-juiced”) ball doesn’t really mean anything unless it is compared to some other ball – for example, in our case, 2014 to 2016/2017.

Two, we now know, because of Statcast and lots of great work and insight by Alan Nathan and others, that very small changes in things like COR, seam height, and size can have a dramatic impact on offense. My (wild) guess is that we probably have something like 2 or 3 feet (in batted ball distance for a typical HR trajectory) of variation (one SD) from year to year based on the (random) fluctuating composition and construction of the ball. And from 2014 to 2016 (and so far this year), we just happened to have seen a 2 or 3 standard deviation variation.

We’ve seen it before, most notably in 1987, and we’ll probably see it again. I have also altered my thinking about the “steroid era.” Now that I know that balls can fluctuate from year to year, sometimes greatly, it is entirely possible that balls were constructed differently starting in 1993 or so – perhaps in combination with burgeoning and rampant PED use.

Finally, it is true that there are many things that can influence run scoring and HR rates, some more than others. Weather and parks are very minor. Even a big change in one park or two or a very hot or cold year will have very small effects overall. And of course we can easily test or account for these things.

Change in talent can surprisingly have a large effect on overall offense. For example, this year, the AL lost a lot of offensive talent which is one reason why the NL and the AL have almost equal scoring despite the AL having the DH.

The only other thing that can fairly drastically change offense is the strike zone. Obviously it depends on the magnitude of the change. In the pitch f/x era we can measure that, as Jon Roegele and others do every year. It has not changed much the last few years until this year. It is smaller now, which is causing an uptick in offense from last year. I also believe, as others have said, that the uptick since last year is due to batters realizing that they are playing with a livelier ball and thus are hitting more air balls. They may be hitting more air balls even without thinking that the ball is juiced – they may be just jumping on the “fly-ball bandwagon.” Either way, hitting more fly balls compounds the effect of a juiced ball because it is correct to hit more fly balls.

Then there is the bat, which I know nothing about. I have not heard anything about the bats being different or what you can do to a bat to increase or decrease offense, within allowable MLB limits.

Do I think that the “juiced ball” (in combination with players taking advantage of it) is the only reason for the HR/scoring surge? I think it’s the primary driver, by far.

There’s been some discussion lately on Twitter about the sacrifice bunt. Of course it is used very little anymore in MLB other than with pitchers at the plate. I’ll spare you the numbers. If you want to verify that, you can look it up on the interweb. The reason it’s not used anymore is not because it was or is a bad strategy. It’s simply because there is no point in sac bunting in most cases. I’ve written about why before on this blog and on other sabermetric sites. It has to do with game theory. I’ll briefly explain it again along with some other things. This is mostly a copy and paste from my recent tweets on the subject.

First, the notion that you can analyze the efficacy of a sac bunt attempt (or anything else about it, really) by looking at what happens (say, the RE or WE) after an out and a runner advance is ridiculous. For some reason sabermetricians did that reflexively for a long time ever since Palmer and Thorn wrote The Hidden Game and concluded (wrongly) that the sac bunt was a terrible strategy in most cases. What they meant was that advancing the runner in exchange for an out is a terrible strategy in most cases, which it is. But again, EVERYONE knows that that isn’t the only thing that happens when a batter attempts to bunt. That’s not a shock. We all know that the batter can reach base on a single or an error, he can strike out, hit into a force or DP, pop out, or even walk. We obviously have to know how often those things occur on a bunt attempt to have any chance to figure out whether a bunt might increase, decrease or not change the RE or WE, compared to hitting away. Why Palmer and Thorn or anyone else ever thought that looking at the RE or WE after something that occurs less than half the time on a bunt attempt (yeah, on the average an out and runner advance occurs around 47% of the time) could answer the question of whether a sac bunt might be a good play or not, is a mystery to me. Then again, there are probably plenty of stupid things we’re saying and doing now with respect to baseball analysis that we’ll be laughing or crying about in the future, so I don’t mean that literally.
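Here is a rough Python sketch of what "an amalgam of all the possible results" means in practice: the RE of a bunt attempt is the probability-weighted average of the RE after each outcome. Only the 47% figure comes from the text above. Every other probability and every RE value in the example is a made-up placeholder, there only to show the mechanics, and the names are hypothetical.

```python
def expected_re(outcomes):
    """Probability-weighted run expectancy over every outcome of a play.

    outcomes: list of (probability, run_expectancy_after_outcome) pairs.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * re for p, re in outcomes)


# PLACEHOLDER inputs, not measured data (only 0.47 is from the text).
bunt_attempt = [
    (0.47, 0.70),  # out and runner advance
    (0.15, 1.50),  # batter reaches on a single or an error
    (0.20, 0.55),  # force out, lead runner retired
    (0.13, 0.55),  # strikeout or pop out, runner holds
    (0.03, 0.10),  # double play
    (0.02, 1.50),  # walk
]
bunt_re = expected_re(bunt_attempt)
```

The comparison that matters is `bunt_re` against the same weighted average for hitting away, not the RE of the single out-and-advance row against the starting state.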

What I am truly in disbelief about is that there are STILL saber-oriented writers and pundits who talk about the sac bunt attempt as if all that ever happens is an out and a runner advance. That’s indefensible. For cripes sake I wrote all about this in The Book 12 years ago. I have thoroughly debunked the idea that “bunts are bad because they considerably reduce the RE or WE.” They don’t. This is not controversial. It never was. It was kind of a, “Shit I don’t know why I didn’t realize that,” moment. If you still look at bunt attempts as an out and a runner advance instead of as an amalgam of all kinds of different results, you have no excuse. You are either profoundly ignorant, stubborn, or both. (I’ll give the casual fan a pass).

Anyway, without further ado, here is a summary of some of what I wrote in The Book 12 years ago about the sac bunt, and what I just obnoxiously tweeted in 36 or so separate tweets:

Someone asked me to post my 2017 W/L projections for each team. I basically added up the run values of my individual projections, using Fangraphs projected playing time for every player, as of around March 15.
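For the curious, the adding-up step looks roughly like the Python sketch below. This is not my actual code. The 10 runs per win conversion is a common rule of thumb that I am assuming here for illustration, and the function name and input layout are hypothetical.

```python
def projected_wins(players, games=162, runs_per_win=10.0):
    """Rough team win total from individual run-value projections.

    players: list of (runs_above_average_per_unit, projected_units) pairs,
        where a unit is a PA for hitters or a batter faced for pitchers.
    runs_per_win: ASSUMED conversion rate (rule of thumb, not from the post).
    """
    runs_above_average = sum(rate * units for rate, units in players)
    # An average team wins half its games; add wins for the net runs.
    return games / 2 + runs_above_average / runs_per_win
```

A team whose players net out to zero runs above average comes out at 81 wins, and every 10 net runs moves the total by about one win under that assumption.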

I did use the actual schedule for a “strength of opponent” adjustment. I didn’t add anything additional for injuries, chances of each team making roster adjustments at the trade deadline or otherwise, managerial skill, etc. I didn’t try to simulate lineups or anything like that. Plus, these are based on my preliminary projections without incorporating any Statcast or pitch F/X data. Also, these kinds of projections tend to regress toward a mean of .500 for all teams. That’s because bad teams tend to weed out bad players and otherwise improve, and injuries don’t hurt them much – in some cases improving them. And good teams tend to be hurt more by injuries (and I don’t think the depth charts I use account enough for chance of injury). As well, if good teams are not contending at the deadline, they tend to trade their good players.

So take these for what they are worth.

team  wins  div    wc     div+wc  ds     lcs    ws

NL EAST

was   89    0.499  0.097  0.597   0.257  0.117  0.048
nyn   88    0.437  0.114  0.55    0.239  0.106  0.044
mia   78    0.046  0.02   0.066   0.024  0.01   0.004
phi   72    0.007  0.002  0.009   0.003  0.001  0
atl   72    0.011  0.004  0.014   0.006  0.002  0.001

NL CENTRAL

chn   100   0.934  0.044  0.978   0.56   0.303  0.146
sln   86    0.049  0.273  0.322   0.137  0.059  0.022
pit   82    0.017  0.129  0.146   0.056  0.023  0.008
cin   67    0      0.001  0.001   0      0      0
mil   61    0      0      0       0      0      0

NL WEST

lan   102   0.961  0.025  0.987   0.591  0.327  0.164
sfn   85    0.03   0.214  0.245   0.098  0.041  0.016
col   78    0.005  0.047  0.052   0.018  0.007  0.003
ari   77    0.003  0.03   0.033   0.011  0.004  0.002
sdn   66    0      0      0       0      0      0

AL EAST

tor   87    0.34   0.114  0.455   0.229  0.118  0.061
bos   87    0.359  0.129  0.487   0.238  0.117  0.064
tba   83    0.15   0.077  0.227   0.105  0.051  0.027
bal   81    0.099  0.056  0.155   0.071  0.032  0.014
nya   79    0.053  0.035  0.088   0.038  0.018  0.008

AL CENTRAL

cle   93    0.861  0.027  0.888   0.471  0.254  0.146
det   82    0.097  0.077  0.174   0.076  0.033  0.016
min   76    0.021  0.015  0.036   0.014  0.005  0.002
kca   75    0.02   0.014  0.033   0.014  0.005  0.003
cha   68    0.001  0.001  0.002   0      0      0

AL WEST

hou   91    0.541  0.13   0.671   0.362  0.188  0.11
sea   86    0.228  0.155  0.383   0.192  0.09   0.047
ala   84    0.181  0.12   0.301   0.146  0.071  0.036
tex   80    0.044  0.042  0.086   0.038  0.017  0.008
oak   73    0.006  0.007  0.014   0.006  0.002  0.001