Archive for the ‘In-game strategy’ Category

In Game 7 of the World Series anyone who was watching the top of the 9th inning probably remembers Javier Baez attempting a (safety squeeze – presumably) bunt on a 3-2 count with 1 out and Jason Heyward on 3rd base. You also remember that Baez struck out on a foul ball, much to the consternation of Cubs fans.

There was plenty of noise on social media criticizing Maddon (or Baez, if he did that on his own) for such an unusual play (you rarely see position players bunt on 2-strike counts, let alone with a 3-2 count and let alone with a runner on 3rd) and of course because it failed and eventually led to a scoreless inning. I was among those screaming bloody murder on Twitter and continuing my long-running criticism of Maddon’s dubious (in my opinion) post-season in-game tactics dating back to his Tampa days. I did, however, point out that I didn’t know off the top of my head (and it was anything but obvious or trivial to figure out) what the “numbers” were but that I was pretty sure it was a bad strategy.

Some “prima facie” evidence that it might be a bad play, as I also tweeted, was, “When have you ever seen a play like that in a baseball game?” That doesn’t automatically mean that it’s a bad play, but it is evidence nonetheless. And the fact that it was a critical post-season game meant nothing. If it was correct to do it in that game it would be correct to do it in any game – at least in the late innings of a tie or 1-run game.

Anyway, I decided to look at some numbers although it’s not an easy task to ascertain whether in fact this was a good, bad, or roughly neutral (or we just don’t know) play. I turned to Retrosheet as I often do, and looked at what happens when a generic batter (who isn’t walked, which probably eliminates lots of good batters) does not bunt (which is almost all of the time of course) on a 3-2 count with 1 out, runner on third base and no runner on first, in a tie game or one in which the batting team was ahead, in the late innings, when the infield would likely be playing in to prevent a run from scoring on a ground ball. This is what I found:

The runner scores around 28% of the time overall. There were 33% walks (pitcher should be pitching a bit around the batter in this situation), 25% strikeouts and 25% BIP outs. When the ball is put in play, which occurs 42% of the time, the runner scores 63% of the time.

Now let’s look at what happens when a pitcher simply bunts the ball on a 3-2 count in a sacrifice situation. We’ll use that as a proxy for what Baez might do when trying to bunt in this situation. Pitchers are decent bunters overall (although they don’t run well on a bunt) and Baez is probably an average bunter at best for a position player. In fact, Baez has a grand total of one sacrifice hit in his entire minor and major league career so he may be a poor bunter – but to give him and Maddon the benefit of the doubt we’ll assume that he is as good at bunting as your typical NL pitcher.

On a 3-2 count in a sac situation when the pitcher is still bunting, he strikes out 40% of the time and walks 22% of the time. Compare that to the hitter who swings away at 3-2, runner on 3rd and 1 out where he K’s 25% of the time and walks 33% of the time. Of those 40% strikeouts, lots are bunt fouls. In fact, pitchers strike out on a foul bunt with a 3-2 count 25% of the time. The rest, 15%, are called strikes and missed bunt attempts. It’s very easy to strike out on a foul bunt when you have two strikes, even when there are 3 balls (and you can take some close pitches).

How often does the run score on a 3-2 bunt attempt with a runner on 3rd such as in the Baez situation? From that data we can’t tell because we’re only looking at 3-2 bunts from pitchers with no runner on 3rd, so we have to make some inferences.

The pitcher puts the ball in play 36% of the time when bunting on a 3-2 count. How often would a runner score if there were a runner on 3rd? We’ll have to make some more inferences. In situations where a batter attempts a squeeze (either a suicide or safety – for the most part, we can’t tell from the Retrosheet data), the runner scores 80% of the time when the ball is bunted in play. So let’s assume the same with our pitchers/Baez. So 36% of the time the ball is put in play on a 3-2 bunt, and 80% of those times the run scores. That’s a score rate of 29% – around the same as when swinging away.

So swinging away, the run scores 28% of the time. With a bunt attempt the run scores 29% of the time, so it would appear to be a tie with no particular strategy a clear winner. But wait….
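The arithmetic behind that comparison can be laid out in a few lines of Python. This is only a sketch using the rounded percentages quoted above; the variable names are mine, and the 80% squeeze rate is the assumption already made in the text (that a pitcher-style 3-2 bunt in play scores the runner as often as a typical squeeze does).

```python
# Rounded rates quoted in the text (3-2 count, 1 out, runner on 3rd).
swing_score_rate = 0.28      # run scores when the batter swings away (overall)

bunt_in_play_rate = 0.36     # pitcher gets a 3-2 bunt down in play
squeeze_score_rate = 0.80    # run scores when a squeeze bunt is in play (assumed to carry over)

# Chance the run scores on the bunt attempt = P(in play) * P(score | in play).
bunt_score_rate = bunt_in_play_rate * squeeze_score_rate

print(f"swing away: {swing_score_rate:.1%}")
print(f"bunt:       {bunt_score_rate:.1%}")
print(f"difference: {bunt_score_rate - swing_score_rate:+.1%}")
```

The two rates land within about a percentage point of each other, which is well inside the noise of these estimates.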

When the run doesn’t score, the batter who is swinging away at 3-2 walks 33% of the time while the pitcher who is attempting a bunt on a 3-2 pitch walks only 25% of the time. But, we won’t count that as an advantage for the batter swinging away. The BB difference is likely due to the fact that pitchers are pitching around batters in that situation and they are going right after pitchers on 3-2 counts in sacrifice situations. In a situation like Baez’ the pitcher is going to issue more than 25% walks since he doesn’t mind the free pass and he is not going to groove one. So we’ll ignore the difference in walks. But wait again….

When a run scores on a squeeze play the batter is out 72% of the time and ends up mostly on first 28% of the time (a single, error, or fielder’s choice). When a run scores with a batter swinging away on a 3-2 count, the batter is out only 36% of the time. 21% of those are singles and errors and 15% are extra base hits including 10% triples and 5% HR.

So even though the run scores with both bunting and hitting away on a 3-2 count around the same percentage of the time, the batter is safe, including walks, hits, errors and fielder’s choices, only 26% of the time when bunting and 50% when swinging away. Additionally, when the batter swinging away gets a hit, 20% are triples and 6% are HR. So even though the runner on third scores around the same percentage of time whether swinging away or bunting on that 3-2 count, when the run does score, the batter who is swinging away reaches base safely (with some extra-base hits including HR) more than twice as often as the batter who is bunting.

I’m going to say that the conclusion is that while the bunt attempt was probably not a terrible play, it was still the wrong strategy given that it was the top of the inning. The runner from third will probably score around the same percentage of the time whether Baez is bunting or swinging away, but when the run does score, Baez is going to be safe a much higher percentage of the time, including via the double, triple or HR, leading to an additional run scoring significantly more often than with the squeeze attempt.

I’m not giving a pass to Maddon on this one. That would be true regardless of whether the bunt worked or not – of course.

Addendum: A quick estimate is that an additional run (or more) will score around 12% more often when swinging away. An extra run in the top of the 9th, going from a 1-run lead to a 2-run lead, increases a team’s chances of winning by 10% (after that every additional run is worth half the value of the preceding run). So we get an extra 1.2% (10% times 12%) in win expectancy from swinging away rather than bunting via the extra hits that occur when the ball is put into play.
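For anyone who wants to plug in their own numbers, here is that back-of-the-envelope estimate as a small Python sketch. The two inputs are the rough figures from the addendum; the `marginal_we` helper is my own illustration of the "each additional run is worth half the previous one" rule of thumb, not a real win expectancy table.

```python
extra_run_freq = 0.12     # an extra run scores ~12% more often when swinging away
we_per_extra_run = 0.10   # WE added by going from a 1-run to a 2-run lead, top 9th

# WE gained by swinging away = how often the extra run shows up * what it is worth.
we_gain = extra_run_freq * we_per_extra_run
print(f"WE gain from swinging away: {we_gain:.1%}")


def marginal_we(n_extra_runs, first_run_value=we_per_extra_run):
    """WE added by the n-th extra run, halving with each additional run."""
    return first_run_value / 2 ** (n_extra_runs - 1)


for n in (1, 2, 3):
    print(f"extra run #{n}: {marginal_we(n):.2%}")
```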

 

 

Let me explain game theory wrt sac bunting using tonight’s CLE game as an example. Bottom of the 10th, leadoff batter on first, Gimenez is up. He is a very weak batter with little power or on-base skills, and the announcers say, “You would expect him to be bunting.” He clearly is.

Now, in general, to determine whether to bunt or not, you estimate the win expectancies (WE) based on the frequencies of the various outcomes of the bunt, versus the frequencies of the various outcomes of swinging away. Since, for a position player, those two final numbers are usually close, even in late tied-game situations, the correct decision usually hinges on: On the swing side, whether the batter is a good hitter or not, and his expected GDP rate. On the bunt side, how good of a sac bunter is he and how fast is he (which affect the single and ROE frequencies, which are an important part of the bunt WE)?

Gimenez is a terrible hitter which favors the bunt attempt but he is also not a good bunter and slow which favors hitting away. So the WE’s are probably somewhat close.

One thing that affects the WE for both bunting and swinging, of course, is where the third baseman plays before the pitch is thrown. Now, in this game, it was obvious that Gimenez was bunting all the way and everyone seemed fine with that. I think the announcers and probably everyone would have been shocked if he didn’t (we’ll ignore the count completely for this discussion – the decision to bunt or not clearly can change with it).

The announcers also said, “Sano is playing pretty far back for a bunt.” He was playing just on the dirt I think, which is pretty much “in between when expecting a bunt.” So it did seem like he was not playing up enough.

So what happens if he moves up a little? Maybe now it is correct to NOT bunt because the more he plays in, the lower the WE for a bunt and the higher the WE for hitting away! So maybe he shouldn’t play up more (the assumption is that if he is bunting, then the closer he plays, the better). Maybe then the batter will hit away and correctly so, which is now better for the offense than bunting with the third baseman playing only half way. Or maybe if he plays up more, the bunt is still correct but less so than with him playing back, in which case he SHOULD play up more.

So what is supposed to happen? Where is the third baseman supposed to play and what does the batter do? There is one answer and one answer only. How many managers and coaches do you think know the answer (they should)?

The third baseman is supposed to play all the way back “for starters” in his own mind, such that it is clearly correct for the batter to bunt. Now he knows he should play in a little more. So in his mind again, he plays up just a tad bit.

Now is it still correct for the batter to bunt? IOW, is the bunt WE higher than the swing WE given where the third baseman is playing? If it is, of course he is supposed to move up just a little more (in his head).

When does he stop? He stops of course when the WE from bunting is exactly the same as the WE from swinging. Where that is completely depends on those things I talked about before, like the hitting and bunting prowess of the batter, his speed, and even the pitcher himself.

What if he keeps moving up in his mind and the WE from bunting is always higher than hitting, like with most pitchers at the plate with no outs? Then the 3B simply plays in as far as he can, assuming that the batter is bunting 100%.

So in our example, if Sano is indeed playing at the correct depth which maybe he was and maybe he wasn’t, then the WE from bunting and hitting must be exactly the same, in which case, what does the batter do? It doesn’t matter, obviously! He can do whatever he wants, as long as the 3B is playing correctly.

So in a bunt situation like this, assuming that the 3B (and other fielders if applicable) is playing reasonably correctly, it NEVER matters what the batter does. That should be the case in every single potential sac bunt situation you see in a baseball game. It NEVER matters what the batter does. Either bunting or not are equally “correct.” They result in exactly the same WE.

The only exceptions (which do occur) are when the WE from bunting is always higher than swinging when the 3B is playing all the way up (a poor hitter and/or exceptional bunter) OR the WE from swinging is always higher even when the 3B is playing completely back (a good or great hitter and/or poor bunter).
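The "move up a little in his head until the two WEs are equal" process is just a root-finding problem, and it can be sketched in Python. Everything numeric here is made up: `we_bunt` and `we_swing` are hypothetical stand-ins for whatever model gives the offense's WE as a function of third baseman depth, and I am assuming the bunt WE only falls and the swing WE only rises as he moves in.

```python
def optimal_depth(we_bunt, we_swing, tol=1e-6):
    """Return (depth, batter_choice) for the third baseman.

    depth runs from 0.0 (all the way back) to 1.0 (all the way in).
    we_bunt(depth) is assumed to fall as the 3B moves in and
    we_swing(depth) is assumed to rise; both are hypothetical models.
    """
    # Exception 1: bunting is still at least as good with the 3B all the way in.
    if we_bunt(1.0) >= we_swing(1.0):
        return 1.0, "bunt"
    # Exception 2: swinging is at least as good even with the 3B all the way back.
    if we_swing(0.0) >= we_bunt(0.0):
        return 0.0, "swing"
    # Otherwise creep in until the two WEs meet (bisection on the difference).
    lo, hi = 0.0, 1.0  # bunt is better at lo, swing is better at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if we_bunt(mid) > we_swing(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, "either"


# Invented linear WE curves for a weak hitter, purely for illustration.
def weak_hitter_bunt(d):
    return 0.66 - 0.04 * d


def weak_hitter_swing(d):
    return 0.63 + 0.02 * d


depth, choice = optimal_depth(weak_hitter_bunt, weak_hitter_swing)
print(round(depth, 3), choice)
```

With these invented curves the third baseman ends up about halfway in and the batter is indifferent. Swap in a curve where the bunt WE never drops below the swing WE (a pitcher at the plate, say) and the function returns the all-the-way-in corner with the batter always bunting.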

So unless you see the 3B playing all the way in or all the way back and they are playing reasonably optimally it NEVER matters what the batter does. Bunt or not bunt and the win expectancy is exactly the same! And if the 3rd baseman plays all the way in or all the way back and is playing optimally, then it is always correct for the batter to bunt or not bunt 100% of the time.

I won’t go into this too much because the post assumed that the defense was playing optimally, i.e. it was in a “Nash Equilibrium” (as I explained, it is playing in a position such that the WE for bunting and swinging are exactly equal) or it was correctly playing all the way in (the WE for bunting is still equal to or greater than for swinging) or all the way back (the WE for swinging is >= that of bunting), but if the defense is NOT playing optimally, then the batter MUST bunt or swing away 100% of the time.

This is critical and amazingly there is not ONE manager or coach in MLB that understands it and thus correctly utilizes a correct bunt strategy or bunt defense.

Note: I updated the pinch hitting data to include a larger sample (previously I went back to 2008. Now, 2000).

Note: It was pointed out by a commenter below and another one on Twitter that you can’t look only at innings where the #9 and #1 batters batted (eliminating innings where the #1 hitter led off), as Russell did in his study, and which he uses to support his theory (he says that it is the best evidence). That creates a huge bias, of course. It eliminates all PA in which the #9 hitter made the last out of an inning or at least an out was made while he was at the plate. In fact, the wOBA for a #9 hitter, who usually bats around .300, is .432 in innings where he and the #1 hitter bat (after eliminating so many PA in which an out was made). How that got past Russell, I have no idea.  Perhaps he can explain.

Recently, Baseball Prospectus published an article by one of their regular writers, Russell Carleton (aka Pizza Cutter), in which he examined whether the so-called “times through the order” penalty (TTOP) was in fact a function of how many times a pitcher has turned over the lineup in a game or whether it was merely an artifact of a pitcher’s pitch count. In other words, is it pitcher fatigue or batter familiarity (the more the batter sees the pitcher during the game, the better he performs) which causes this effect?

It is certainly possible that most or all of the TTOP is really due to fatigue, as “times through the order” is clearly a proxy for pitch count. In any case, after some mathematical gyrations that Mr. Carleton is wont to do (he is the “Warning: Gory Mathematical Details Ahead” guy) in his articles, he concludes unequivocally that there is no such thing as a TTOP – that it is really a PCP or Pitch Count Penalty effect that makes a pitcher less and less effective as he goes through the order and it has little or nothing to do with batter/pitcher familiarity. In fact, in the first line of his article, he declares, “There is no such thing as the ‘times through the order’ penalty!”

If that is true, this is a major revelation which has slipped through the cracks in the sabermetric community and its readership. I don’t believe it is, however.

As one of the primary researchers (along with Tom Tango) of the TTOP, I was taken quite aback by Russell’s conclusion, not because I was personally affronted (the “truth” is not a matter of opinion), but because my research suggested that pitch count or fatigue was likely not a significant part of the penalty. In my BP article on the TTOP a little over 2 years ago, I wrote this: “…the TTOP is not about fatigue. It is about familiarity. The more a batter sees a pitcher’s delivery and repertoire, the more likely he is to be successful against him.” What was my evidence?

First, I looked at the number of pitches thrown going into the second, third, and fourth times through the order. I split that up into two groups—a low pitch count and a high pitch count. Here are those results. The numbers in parentheses are the average number of pitches thrown going into that “time through the order.”

Times Through the Order | Low Pitch Count | High Pitch Count
1 | .341 | .340
2 | .351 (28) | .349 (37)
3 | .359 (59) | .359 (72)
4 | .361 (78) | .360 (97)

 

If Russell’s thesis were true, you should see a little more of a penalty in the “high pitch count” column on the right, which you don’t. The penalty appears to be the same regardless of whether the pitcher has thrown few or many pitches. To be fair, the difference in pitch count between the two groups is not large and there is obviously sample error in the numbers.

The second way I examined the question was this: I looked only at individual batters in each group who had seen few or many pitches in their prior PA. For example, I looked at batters in their second time through the order who had seen fewer than three pitches in their first PA, and also batters who saw more than four pitches in their first PA. Those were my two groups. I did the same thing for each time through the order. Here are those results. The numbers in parentheses are the average number of pitches seen in the prior PA, for every batter in the group combined.

 

Times Through the Order | Low Pitch Count each Batter | High Pitch Count each Batter
1 | .340 | .340
2 | .350 (1.9) | .365 (4.3)
3 | .359 (2.2) | .361 (4.3)

 

As you can see, if a batter sees more pitches in his first or second PA, he performs better in his next PA than if he sees fewer pitches. The effect appears to be much greater from the first to the second PA. This lends credence to the theory of “familiarity” and not pitcher fatigue. It is unlikely that 2 or 3 extra pitches would cause enough fatigue to elevate a batter’s wOBA by 8.5 points per PA (the average of 15 and 2, the “bonuses” for seeing more pitches during the first and second PA, respectively).

So how did Russell come to his conclusion and is it right or wrong? I believe there is a fatal flaw in his methodology which led him to a faulty conclusion (that the TTOP does not exist).

Among other statistical tests, here is the primary one which led Russell to conclude that the TTOP is a mirage and merely a product of pitcher fatigue due to an ever-increasing pitch count:

“This time, I tried something a little different. If we’re going to see a TTOP that is drastic, the place to look for it is as the lineup turns over. I isolated all cases in which a pitcher was facing the ninth batter in the lineup for the second time and then the first batter in the lineup for the third time. To make things fair, neither hitter was allowed to be the pitcher (this essentially limited the sample to games in AL parks), and the hitters needed to be faced in the same inning. Now, because the leadoff hitter is usually a better hitter, we need to control for that. I created a control variable for all outcomes using the log odds ratio method, which controls for the skills of the batter, as well as that of the pitcher. I also controlled for whether or not the pitcher had the platoon advantage in either case.”

First of all, there was no reason to limit the data to “the same inning”. Regardless of whether the pitcher faces the 9th and 1st batters in the same inning or they are split up (the 9 hitter makes the last out), since one naturally follows the other, they will always have around the same pitch count, and the leadoff hitter will always be one time through the order ahead of the number nine hitter.

Anyway, what did Russell find? He found that TTOP was not a predictor of outcome. In other words, that the effect on the #9 hitter was the same as the #1 hitter, even though the #1 hitter had faced the pitcher one more time than the #9 hitter.

I thought about this for a long time and I finally realized why that would be the case even if there was a “times through the order” penalty (mostly) independent of pitch count. Remember that in order to compare the effect of TTO on that #9 and #1 hitter, he had to control for the overall quality of the hitter. The last hitter in the lineup is going to be a much worse hitter overall than the leadoff hitter, on the average, in his sample.

So the results should look something like this if there were a true TTOP: Say the #9 batters are normally .300 wOBA batters, and the leadoff guys are .330. In this situation, the #9 batters should bat around .300 (during the second time through the order we see around a normal wOBA) but the leadoff guys should bat around .340 – they should have a 10 point wOBA bonus for facing the pitcher for the third time.

Russell, without showing us the data (he should!), presumably gets something like .305 for the #9 batters (since the pitcher has gone essentially 2 ½ times through the lineup, pitch count-wise) and the leadoff hitters should hit .335, or 5 points above their norm as well (maybe .336 since they are facing a pitcher with a few more pitches under his belt than the #9 hitter).

So if he gets those numbers, .335 and .305, is that evidence that there is no TTOP? Do we need to see numbers like .340 and .300 to support the TTOP theory rather than the PCP theory? I submit that even if Russell sees numbers like the former ones, that is not evidence that there is no TTOP and it’s all about the pitch count. I believe that Russell made a fatal error.

Here is where he went wrong:

Remember that he uses the log-odds method to compute the baseline numbers, or what he would expect from a given batter-pitcher matchup, based on their overall season numbers. In this experiment, there is no need to do that, since both batters, #1 and #9, are facing the same pitcher the same number of times. All he has to do is use each batter’s seasonal numbers to establish the baseline.

But where do those baselines come from? Well, it is likely that the #1 hitters are mostly #1 hitters throughout the season and that #9 hitters usually hit at the bottom of the order. #1 hitters get around 150 more PA than #9 hitters over a full season. Where do those extra PA come from? Some of them come from relievers of course. But many of them come from facing the starting pitcher more often per game than those bottom-of-the-order guys. In addition, #9 hitters sometimes are removed for pinch hitters late in a game against a starter such that they lose even more of those 3rd and 4th time through the order PA’s. Here is a chart of the mean TTO per game versus the starting pitcher for each batting slot:

 

Batting Slot | Mean TTO/game
1 | 2.15
2 | 2.08
3 | 2.02
4 | 1.98
5 | 1.95
6 | 1.91
7 | 1.86
8 | 1.80
9 | 1.77

(By the way, if Russell’s thesis is true, bottom of the order guys have it even easier, since they are always batting when the pitcher has a higher pitch count, per time through the order. Also, this is the first time you have been introduced to the concept that the top of the order batters have it a little easier than the bottom of the order guys, and that switching spots in the order can affect overall performance because of the TTOP or PCP.)

What that does is result in the baseline for the #1 hitter being higher than for the #9 hitter, because the baseline includes more pitcher TTOP (more times facing the starter for the 3rd and 4th times). That makes it look like the #1 hitter is not getting his advantage as compared to the #9 hitter, or at least he is only getting a partial advantage in Russell’s experiment.

In other words, the #9 hitter is really a true .305 hitter and the #1 hitter is really a true .325 hitter, even though their seasonal stats suggest .300 and .330. The #9 hitters are being hurt by not facing starters late in the game compared to the average hitter and the #1 hitters are being helped by facing starters for the 3rd and 4th times more often than the average hitter.

So if #9 hitters are really .305 hitters, then the second time through the order, we expect them to hit .305, if the TTOP is true. If the #1 hitters are really .325 hitters, despite hitting .330 for the whole season, we expect them to hit .335 the third time through the order, if the TTOP is true. And that is exactly what we see (presumably).

But when Russell sees .305 and .335 he concludes, “no TTOP!” He sees what he thinks is a true .300 hitter hitting .305 after the pitcher has thrown around 65 pitches and what he thinks is a true .330 hitter hitting .335 after 68 or 69 pitches. He therefore concludes that both hitters are being affected equally even though one is batting for the second time and the other for the third time – thus, there is no TTOP!

As I have shown, those numbers are perfectly consistent with a TTOP of around 8-10 points per times through the order, which is exactly what we see.
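To make the bookkeeping concrete, here is the same argument as a short Python sketch. The wOBA figures are the illustrative ones used above (they are not Russell's data, which he did not publish), I am assuming a 10-point penalty per extra time through the order, and I am treating the second time through as the "normal" reference point, as the example above does.

```python
TTOP = 0.010  # assumed penalty per extra time through the order (text: 8-10 points)

# Illustrative figures from the text: true talent vs. the seasonal line
# that would be used as the baseline for each lineup slot.
true_woba = {"#9": 0.305, "#1": 0.325}
seasonal_woba = {"#9": 0.300, "#1": 0.330}
times_through = {"#9": 2, "#1": 3}  # the matchup studied: 9th hitter 2nd time, leadoff 3rd time

gap_vs_baseline = {}
for slot in ("#9", "#1"):
    # Second time through is taken as "normal"; each extra time adds one TTOP.
    expected = true_woba[slot] + TTOP * (times_through[slot] - 2)
    gap_vs_baseline[slot] = expected - seasonal_woba[slot]
    print(f"{slot}: expected {expected:.3f}, "
          f"{gap_vs_baseline[slot] * 1000:+.0f} points vs seasonal baseline")
```

Both slots come out about 5 points above their seasonal baselines even though a full 10-point TTOP is built into the leadoff hitter's number, which is why equal bumps for the two slots do not rule out a TTOP.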

Finally, I ran one other test which I think can give us more evidence one way or another. I looked at pinch hitting appearances against starting pitchers. If the TTOP is real and pitch count is not a significant factor in the penalty, we should see around the same performance for pinch hitters regardless of the pitcher’s pitch count, since the pinch hitter always faces the pitcher for the first time and the first time only. In fact, this is a test that Russell probably should have run. The only problem is sample size. Because there are relatively few pinch hitting PA versus starting pitchers, we have quite a bit of sample error in the numbers. I split the sample of pinch hitting appearances up into 2 groups: Low pitch count and high pitch count.

 

Here is what I got:

PH TTO | Overall | Low Pitch Count | High Pitch Count
2 | .295 (PA=4901) | .295 (PA=2494) | .293 (PA=2318)
3 | .289 (PA=10774) | .290 (PA=5370) | .287 (PA=5404)

 

I won’t comment on the fact that the pinch hitters performed a little better against pitchers with a low pitch count (the differences are not nearly statistically significant) other than to say that there is no evidence that pitch count has any influence on the performance of pinch hitters who are naturally facing pitchers for the first and only time. Keep in mind that the times through the order (the left column) is a good proxy for pitch count in and of itself and we also see no evidence that that makes a difference in terms of pinch hitting performance. In other words, if pitch count significantly influenced pitching effectiveness, we should see pinch hitters overall performing better when the pitcher is in the midst of his 3rd time through the order as opposed to the 2nd time (his pitch count would be around 30-35 pitches higher). We don’t. In fact, we see a worse performance (the difference is not statistically significant – one SD is 8 points of wOBA).
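As a quick sanity check on "not statistically significant," the gap between the two overall pinch hitting lines can be put in SD units. This Python sketch simply takes the 8-point SD quoted above at face value; it does not try to re-derive it from the PA counts.

```python
woba_ph_2nd_time = 0.295  # pinch hitters, starter in his 2nd time through (overall)
woba_ph_3rd_time = 0.289  # pinch hitters, starter in his 3rd time through (overall)
sd_points = 8             # one SD in wOBA points, as quoted in the text

# Difference in wOBA points (negative = worse the 3rd time through).
diff_points = (woba_ph_3rd_time - woba_ph_2nd_time) * 1000
z = diff_points / sd_points

print(f"difference: {diff_points:+.0f} points, or {z:+.2f} SD")
```

A gap of less than one SD, and in the opposite direction from what a pitch count penalty would predict, is what the paragraph above is describing.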

 

I have to say that it is difficult to follow Russell’s chain of logic and his methodology in many of his articles because he often fails to “show his work” and he uses somewhat esoteric and opaque statistical techniques only. In this case, I believe that he made a fatal mistake in his methodology as I have described above which led him to the erroneous conclusion that, “The TTOP does not exist.” I believe that I have shown fairly strong evidence that the penalty that we see pitchers incur as the game wears on is mostly or wholly as a result of the TTO and not due to fatigue caused by an increasing pitch count.

I look forward to someone doing additional research to support one theory or the other.

There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That’s ridiculous of course. The outcome of a decision or the game has nothing to do with whether the decision was correct or not. Some decisions may raise or lower a team’s chances of winning from 90% and other decisions may affect a baseline of 10 or 15%.

If decision A results in a team’s theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious is that if the manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the “unwritten rule” we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second guess the manager whether he chooses A or B. All of that is silly and counter-productive.
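Here is the same point as a tiny Python sketch, with the two decisions labeled A and B and the win expectancies from the two examples in this section. The `judge` helper is just my illustration: it grades the decision by the WE it gave up and never looks at who won.

```python
def judge(we_by_decision, chosen):
    """Return (best_decision, we_cost_of_choice); the game result is never consulted."""
    best = max(we_by_decision, key=we_by_decision.get)
    cost = we_by_decision[best] - we_by_decision[chosen]
    return best, cost


high_leverage = {"A": 0.95, "B": 0.90}  # team is a big favorite either way
low_leverage = {"A": 0.15, "B": 0.10}   # team is a big underdog either way

for label, we in (("favorite", high_leverage), ("underdog", low_leverage)):
    best, cost = judge(we, chosen="B")
    print(f"{label}: best is {best}, choosing B costs {cost:.0%} in WE, "
          f"yet B still wins {we['B']:.0%} of the time")
```

The cost of choosing B is the same 5% in both games. The only thing that changes is how often the final score happens to hide it.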

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well-above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts and assuming controversial facts and legal theories in a light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must accept the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I’ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We’ll assume he is an average-hitting pitcher or so (I have heard that he is a poor-hitting pitcher). An average pitcher’s batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..

63.2%.

If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?

Whirrrrr……

62.1%.

So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let’s assume 1.5 innings, which I think is generous.

What is DeGrom’s expected RA9 for those 2 innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom’s normal expected RA9? Steamer, ZIPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter. It is equivalent to an elite starter, but not quite the level of the Kershaws, Arrietas, or even the Prices or Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations and all managers can do in making these decisions is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 in 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 and convert that into WE. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
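For anyone who wants to check the arithmetic, here is a minimal Python sketch of that calculation. The function name is mine, and the runs-per-win factor of 10 is an assumed rule-of-thumb conversion; it is simply the divisor that turns .085 leverage-weighted runs into the .0085 quoted above.

```python
def pitching_we_gain(starter_ra9, reliever_ra9, innings, leverage, runs_per_win=10.0):
    """Approximate WE gained by using the starter instead of the reliever.

    runs_per_win=10 is an assumed rule-of-thumb conversion, not a figure
    stated in the article.
    """
    # Runs saved over the stretch: RA9 gap scaled from 9 innings down to `innings`.
    runs_saved = (reliever_ra9 - starter_ra9) * innings / 9.0
    # Weight by leverage, then convert runs to wins.
    return runs_saved * leverage / runs_per_win


gain = pitching_we_gain(starter_ra9=3.7, reliever_ra9=4.0, innings=1.5, leverage=1.7)
print(f"{gain:.4f}")  # 0.0085
```

Change any one input and you can see how little the answer moves, which is the whole point.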

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of over 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.7% rather than 5.05%. 4.7% is still an enormous advantage. Reducing your team’s chances of winning by 4.7% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What about if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference in 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with was a replacement level pitcher – the worst pitcher in the major leagues? He allows around 113% of league average runs, or 4.6 RA9. Difference between DeGrom and him for 1.5 innings? 2.7%, for a net loss of 3.1% by letting him hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, “Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead of DeGrom, you will increase your chances of winning by 3.1%,” what do you think he would say?
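The “what if” scenarios so far can be run side by side with a short Python sketch. The scenario labels, the function name, and the runs-per-win factor of 10 are my assumptions; the RA9 values, the 1.7 leverage index, and the 5.8% hitting cost are the numbers used above. Expect the output to differ from the figures in the text by a tenth of a percentage point or two, since the text rounds at each step.

```python
HITTING_COST = 0.058   # WE lost by letting the pitcher hit (68.8% vs. about 63%)
LEVERAGE = 1.7         # leverage index quoted for the 6th and 7th innings
RUNS_PER_WIN = 10.0    # assumed conversion factor, not stated in the article

# (label, starter RA9 third time through, reliever RA9, innings pitched)
scenarios = [
    ("baseline",                 3.7, 4.0, 1.5),
    ("starter goes 2 innings",   3.7, 4.0, 2.0),
    ("Kershaw-level starter",    2.9, 4.0, 1.5),
    ("replacement-level relief", 3.7, 4.6, 1.5),
]


def net_cost(starter_ra9, reliever_ra9, innings):
    """Net WE lost by letting the starter hit and then keep pitching."""
    pitching_gain = (reliever_ra9 - starter_ra9) * innings / 9.0 * LEVERAGE / RUNS_PER_WIN
    return HITTING_COST - pitching_gain


results = {label: net_cost(s, r, inn) for label, s, r, inn in scenarios}
for label, cost in results.items():
    print(f"{label:26s} net cost of letting him hit: {cost:.1%}")
```

Every scenario stays comfortably positive, meaning that letting him hit loses WE in all of them.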

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the light most favorable to allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits, which is to be expected as platoon splits in general get heavily regressed toward the mean, because they tend to be laden with noise for two reasons: One, the samples are rarely large because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, fewer than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his actual platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large) simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.
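To make “regressing a split” concrete, here is a minimal Python sketch of the blending step. The function name is mine, and both the league-average split of 25 points and the observed reverse split of -30 points are hypothetical placeholders, not numbers for Soler or any real player. Only the 95% and 80% regression amounts come from the paragraph above.

```python
def regress_split(observed_split, league_split, regression):
    """Blend an observed platoon split with the league-average split.

    `regression` is the share of weight placed on the league average
    (1.0 means ignore the player's own numbers entirely).
    """
    return regression * league_split + (1 - regression) * observed_split


# Hypothetical inputs in points of wOBA (NOT taken from the article):
league_rhb_split = 25    # assumed average split for a right-handed batter
observed_split = -30     # an observed reverse split

small_sample = regress_split(observed_split, league_rhb_split, 0.95)
large_sample = regress_split(observed_split, league_rhb_split, 0.80)
print(f"95% regression: {small_sample:.1f} points")
print(f"80% regression: {large_sample:.1f} points")
```

Even at the lighter regression, the hypothetical reverse split turns back into a normal, positive one.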

Soler seems to be the obvious choice, so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrest has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrest 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrest at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrest against a right handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrest may or may not be your ordinary left-handed pitcher. What if Siegrest actually has reverse splits? What if we expect him to pitch better against right-handed batters and worse against left-handed batters? In that case, Coghlan might actually be the better choice than Soler even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand. It would be exactly as if Coghlan or Soler were facing a RHP. Or maybe Siegrest has no splits – i.e. RH and LH batters of equal overall talent perform about the same. Or very small platoon splits compared to the average left-hander? So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrest no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrest does indeed have very small splits either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrest has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrest is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher’s true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that’s what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against for a pitcher is a worthless stat in and of itself (what, walks don’t count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one season splits for a batter or pitcher are not very useful for the reasons I explained above.

If you look at Siegrest’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrest is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrest to pitch to Coghlan knowing that Siegrest has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrest is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrest’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrest has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrest’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different than the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 

Coins

Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up heads or tails in one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.
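If you want to see that 5% for yourself, here is a small Python sketch that computes the binomial standard deviation and then checks it by brute force. The helper names, the seed, and the choice of 10,000 simulated coins are my own, purely for illustration.

```python
import math
import random


def binomial_sd(p, n):
    """Standard deviation of the observed heads share in n independent flips."""
    return math.sqrt(p * (1 - p) / n)


print(f"theoretical SD for 100 flips: {binomial_sd(0.5, 100):.3f}")  # 0.050

# Brute-force check: flip 10,000 fair coins 100 times each.
random.seed(0)


def heads_share(n, p=0.5):
    return sum(random.random() < p for _ in range(n)) / n


shares = [heads_share(100) for _ in range(10_000)]
mean = sum(shares) / len(shares)
sd = math.sqrt(sum((s - mean) ** 2 for s in shares) / len(shares))
print(f"simulated mean: {mean:.3f}, simulated SD: {sd:.3f}")
```

The simulated numbers should land very close to 50% and 5%, even though every single coin is perfectly fair.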

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin to be? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers’ platoon stats and no matter what they are and over how many TBF, you conclude that the pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can’t know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let’s get back to the coins to see why that is. Let’s say that we don’t have 100% fair coins. Our sly kid put in his piggy bank a bunch of trick coins, but not really, really tricky. Most are still 50/50, but some are 48/52, 52/48, a few less are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of “true coin talent” but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that. If your kid were smart enough to put in a normal distribution of “coin talent,” even one with a small spread, the further away from 50/50, the fewer coins there are. Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60. Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It’s called Bayesian inference, but you don’t need to worry about what it’s called or even how it technically works. It is true that a true 60/40 coin is much more likely to come up 60/40 heads in 100 flips than a 50/50 coin is. That should be obvious too. But here’s the catch. There are many, many more 50/50 coins in your kid’s piggy bank than there are 60/40 coins. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chance of a 50/50 coin coming up exactly 60/40 in 100 flips is around .01. The chance of a true 60/40 coin coming up 60/40 is about 8 times that amount, or .08. But if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then it is only 50/50 whether the coin that came up 60/40 is a fair coin or a 60/40 biased coin. If there are 800 times more 50/50 coins than 60/40 coins in your bank, as there is likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin even though the coin came up 60 heads in 100 flips.
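Here is that math as a short Python sketch, using only the standard library. It assumes a simplified bank that holds just two kinds of coins, fair ones and 60/40 ones; the function names are mine.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with heads chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


like_fair = binom_pmf(60, 100, 0.5)     # around .01
like_biased = binom_pmf(60, 100, 0.6)   # around .08
print(f"fair coin:  {like_fair:.4f}")
print(f"60/40 coin: {like_biased:.4f}")


def prob_fair(fair_per_biased):
    """Chance the coin is fair after 60 heads in 100 flips, when the bank
    holds `fair_per_biased` fair coins for every 60/40 coin."""
    fair = fair_per_biased * like_fair
    return fair / (fair + like_biased)


for ratio in (8, 800):
    print(f"{ratio} fair coins per biased coin -> P(fair) = {prob_fair(ratio):.3f}")
```

With 8 fair coins per biased coin it comes out close to a toss-up, and with 800 it is overwhelmingly likely that the coin is fair.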

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than that you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but 10,000 of them (1 in 100) will have a false positive. So the odds that you have AIDS are 10,000 to 10, or 1000 to 1, against.
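The same head count can be written out in a few lines of Python. The variable names are mine; the rates are the ones used in the paragraph above.

```python
population = 1_000_000
prevalence = 1 / 100_000      # "maybe 1 in 100,000" in this demographic
false_positive_rate = 0.01    # 1 in 100 healthy people test positive
false_negative_rate = 0.0     # assumed, as in the text

sick = population * prevalence
true_positives = sick * (1 - false_negative_rate)
false_positives = (population - sick) * false_positive_rate

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"chance a positive test is real: {p_sick_given_positive:.2%}")
```

Roughly 1 positive test in 1,000 is real, which is the same 1000 to 1 against.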

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its “true talent” may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as “an estimate of true talent” and is the crucial factor in making decisions based on what we think the talent of the coins/players are likely to be in the present and in the future.
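Here is a sketch in Python of how that weighted mean could be computed the long way. The mix of coins in `PRIOR` is a hypothetical piggy bank that I made up to have a small spread; it is not a set of numbers from anywhere, so the output is illustrative only.

```python
from math import comb

# HYPOTHETICAL piggy bank: true heads probability -> share of all coins.
PRIOR = {0.40: 0.01, 0.45: 0.04, 0.48: 0.15, 0.50: 0.60,
         0.52: 0.15, 0.55: 0.04, 0.60: 0.01}


def posterior(heads, flips, prior=PRIOR):
    """Chance of each coin type, given the observed flips (Bayes' rule)."""
    weights = {p: share * comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)
               for p, share in prior.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def weighted_mean(post):
    """Collapse the posterior into one number: the estimate of true talent."""
    return sum(p * prob for p, prob in post.items())


post = posterior(60, 100)
for p, prob in post.items():
    print(f"true {p:.0%} heads coin: {prob:.1%}")
print(f"weighted mean after 60 heads in 100 flips: {weighted_mean(post):.1%}")
```

Swap in a different bank and the estimate changes, which is exactly why the spread of talent matters so much.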

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. That result is about 24,000 times more likely with a 60/40 coin than with a 50/50 coin! So now we are much more certain that we have a true 60/40 coin even if we don’t have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about half certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be 55%.
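A quick Python check of how much faster the evidence piles up with 500 flips than with 100. The function name is mine; the calculation works in logs so the tiny probabilities don’t underflow.

```python
from math import exp, log


def likelihood_ratio(heads, flips, p_a, p_b):
    """How many times more likely the result is under coin A than coin B."""
    tails = flips - heads
    log_ratio = heads * log(p_a / p_b) + tails * log((1 - p_a) / (1 - p_b))
    return exp(log_ratio)


lr_100 = likelihood_ratio(60, 100, 0.6, 0.5)
lr_500 = likelihood_ratio(300, 500, 0.6, 0.5)
print(f"60 heads in 100 flips:  {lr_100:,.1f} times more likely with a 60/40 coin")
print(f"300 heads in 500 flips: {lr_500:,.0f} times more likely with a 60/40 coin")
```

The first ratio is in the single digits and the second is in the tens of thousands, from the very same 60/40 result.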

There is a much easier way to do it. We have to do some math gyrations which I won’t go into that will enable us to figure out how much to regress our observed flip percentage to the mean flip percentage of all the coins, 50%. For 100 flips it was a large regression such that with a 60/40 result we might estimate a true flip talent of 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less towards 50% to give us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inferences using all the possible true talent states assumes a normal distribution or close to one.
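One common shortcut, sketched here in Python, weights the observed result by talent variance divided by talent variance plus sampling variance. To be clear, this is a standard textbook approximation and my assumption about what the shortcut looks like, not necessarily the exact gyrations I had in mind above, and the function name is made up.

```python
def regress_to_mean(observed, flips, talent_sd, mean=0.5):
    """Estimate true talent by shrinking the observed rate toward the mean.

    Assumes roughly normal talent and noise; the weight on the observed
    result is talent variance / (talent variance + sampling variance).
    """
    noise_var = mean * (1 - mean) / flips      # binomial noise in the sample
    talent_var = talent_sd ** 2                # spread of true talent
    weight = talent_var / (talent_var + noise_var)
    return mean + weight * (observed - mean)


est_100 = regress_to_mean(0.60, flips=100, talent_sd=0.03)
est_500 = regress_to_mean(0.60, flips=500, talent_sd=0.03)
print(f"60/40 in 100 flips -> estimated true talent {est_100:.1%}")
print(f"60/40 in 500 flips -> estimated true talent {est_500:.1%}")
```

Under these assumptions the two estimates come out in the same neighborhood as the rough 52% and 55% above, and with a talent SD of zero the function returns 50% no matter what was observed.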

The point is that the sample size of the observed measurement determines how much we regress the observed amount towards the mean. The larger the sample, the less we regress. One season observed splits and we regress a lot. Career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe a SD of 10 percent so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at true 60/40 and even some coins at 70/30 and 80/20. The chances that we have a truly biased coin when we get an unusual result is much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.
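The two rules can be put side by side in a rough sketch. It assumes the usual shortcut that the regression amount is chance variance divided by total variance, with chance variance taken from the binomial at the population mean; the function name is hypothetical and the printed percentages are only as good as that assumption.

```python
def regression_fraction(n_flips, talent_sd, mean=0.5):
    """Fraction of the way to move an observed rate toward the population mean."""
    random_var = mean * (1 - mean) / n_flips   # binomial noise in an observed rate
    talent_var = talent_sd ** 2                # spread of true talent in the population
    return random_var / (random_var + talent_var)

for talent_sd in (0.03, 0.10):
    for n_flips in (100, 500):
        frac = regression_fraction(n_flips, talent_sd)
        print(f"talent SD {talent_sd:.0%}, {n_flips} flips: regress {frac:.0%} toward the mean")
```

More flips means less regression, more spread means less regression, and a spread of zero gives 100% regression no matter how many flips there are.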

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, that it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrist (I almost forgot what started this whole thing) his career TBF is 604, but that’s not his sample size for platoon splits because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and righties. If you don’t know what that means don’t worry about it. A shortcut is to use the lesser of the two which is almost always TBF versus lefties, or in Siegrist’s case, 231. That’s not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrist was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it’s not), and two, he has a really odd observed split of 47 points in reverse. That’s like flipping a coin 100 times and getting 70/30 or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.
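If you do want to see the harmonic mean next to the shortcut, here is a small sketch. The 604 and 231 come from the text; the count versus righties is an assumption (simply the remainder), and the variable names are made up.

```python
from statistics import harmonic_mean

career_tbf = 604                       # from the text
tbf_vs_lhb = 231                       # from the text
tbf_vs_rhb = career_tbf - tbf_vs_lhb   # assumption: everyone else batted right-handed

effective_n = harmonic_mean([tbf_vs_lhb, tbf_vs_rhb])
shortcut_n = min(tbf_vs_lhb, tbf_vs_rhb)
print(f"harmonic mean: {effective_n:.0f}, shortcut (smaller side): {shortcut_n}")
```

The harmonic mean is pulled toward the smaller of the two counts, which is why using the smaller side alone is a reasonable and slightly conservative stand-in.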

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, are the only things that determine how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent are the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrist’s true platoon split if the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.
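That arithmetic is short enough to write down as a sketch. The function and constant names are made up for illustration; the numbers are the ones used in the text, and the 50% is only the placeholder regression, not the real one.

```python
def regress(observed, mean, regression):
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

LHP_MEAN_SPLIT = 29  # average platoon split for a left-handed pitcher, in points

print(regress(-47, LHP_MEAN_SPLIT, 0.5))  # the observed 47 point reverse split
print(regress(-10, LHP_MEAN_SPLIT, 0.5))  # a milder observed reverse split
```

Same regression, different starting points, different estimates: -9 in the first case and about +10 in the second.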

How do we determine the spread of talent in a population, like platoon talent? That is actually easy but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season and the 300 is the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn’t matter. Nor does the number of opportunities. They don’t even have to be the same for all pitchers. It is just easier to explain that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial distribution. .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that each player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore that. They really are small. We can of course estimate that too and throw it into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can now calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember we need only 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: spread of talent and sample size of the observed results. That gives us the regression amount. From that we merely move the observed result toward the mean by that amount, like I did above with Siegrist’s -47 points and the mean of +29 for a league-average LHP.
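Here is the first method as a minimal sketch, plugging in the made-up BA variances from the example. The function name is hypothetical, and the small context and day-to-day effects are ignored, as in the text.

```python
import math

def talent_spread(observed_var, random_var):
    """Split observed variance into chance and talent; return (talent SD, regression fraction)."""
    talent_var = observed_var - random_var   # observed = random + talent
    regression = random_var / observed_var   # share of the total variance that is just chance
    return math.sqrt(talent_var), regression

talent_sd, regression = talent_spread(observed_var=0.00125, random_var=0.000625)
print(f"talent SD: {talent_sd:.3f}, regression: {regression:.0%}")
```

With those inputs the talent SD comes out to 25 points of BA and the regression for a 400 AB sample is 50%.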

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or we can use an intra-class correlation (ICC), which is essentially the same thing except that it correlates every PA (or whatever the opportunity is) to every other PA within a sample. When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
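The “little more math” for other sample sizes can be sketched like this. It assumes the common model in which the correlation at sample size n equals n / (n + k) for some constant k (chance variance shrinking in proportion to 1/n); the function names are made up, and the .4 at 500 AB is just the example from the text.

```python
def regression_constant(r, n):
    """Sample size at which the regression would be exactly 50%, given correlation r at size n."""
    # Assumes r = n / (n + k), i.e. chance variance shrinks in proportion to 1/n.
    return n * (1 - r) / r

def regression_at(n, k):
    """Fraction to regress toward the mean for a sample of size n."""
    return k / (n + k)

def estimate(observed, mean, n, k):
    """True-talent estimate: keep (1 - regression) of the observed gap from the mean."""
    return mean + (1 - regression_at(n, k)) * (observed - mean)

k = regression_constant(r=0.4, n=500)             # about 750 AB with the example numbers
print(round(estimate(0.300, 0.250, n=500, k=k), 3))
print(round(regression_at(1500, k), 2))           # a larger sample is regressed less
```

Under that assumption, a .300 hitter over 500 AB is estimated at .270, and tripling the sample drops the regression from 60% to about a third.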

Using method #1 tells us precisely what the spread in talent is. Method 2 tells us that implicitly by looking at the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.

Platoon

Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, AB, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample, the higher the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

This is for an average of 200 TBF versus a LHP, so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrist’s -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrist’s true platoon split. The fact that -47 points is very far from the average left-handed pitcher’s +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we “trust it more.” If the regression were 99% then whether the observed were -47 or +10, we would arrive at nearly the same answer. Don’t confuse the regression with the observed result. One has nothing to do with the other. And don’t think in terms of “trusting” the observed result or not. Regress the result and that’s your answer. If you arrive at answer X it makes no difference whether your starting point, the observed result, was B or C. None whatsoever. That is a very important point. I don’t know how many times I have heard, “But he had a 47 point reverse split in his entire career! You can’t possibly be saying that you estimate his real split to be +10 or +12 or whatever it is.” Yes, that’s exactly what I’m saying. A +10 estimated split is exactly the same whether the observed split were -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrist is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember the +10 is a weighted mean which means that it is in the middle of our error bars. The answer to that is, “No, no and no!” Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution rarely should come into play. Basically the noise in the result of a sample of 1 is so large that it doesn’t matter at all what the uncertainty level of your estimates is.

So what do we estimate Siegrist’s true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed split 1 minus 230/1255, or .82 (82%), towards the league average for lefty pitchers, which is around 29 points. 82% of 76 points is 62 points. So we regress his -47 points 62 points in the plus direction which gives us an estimate of +15 points in true platoon split. That is half the split of an average LHP, but it is plus nonetheless.

That means that a left-handed hitter like Coghlan will hit better than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrist’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrist as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrist is small.

We also have another method for better estimating true platoon splits for pitchers, which can be used to enhance the method based on sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It’s not like pitchers deliberately throw better or harder to one side or the other or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things: One is arm angle. If you’ve ever played or watched baseball that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrist? His arm angle is estimated by Jared Cross of Steamer, using pitch f/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrist’s 2015 numbers into that regression equation, we get an estimated platoon split from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he uses around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But, the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.
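The two estimates can be lined up in a short sketch. The function name is made up; the 82% is written as 1 minus 230/1255 as in the text, and the 92% is the rough figure quoted above rather than a derived value.

```python
def regress(observed, mean, regression):
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

observed_split = -47  # observed reverse split, in points

# Regress toward the average lefty: 1 - 230/1255, about 82%, toward +29.
vs_average_lefty = regress(observed_split, 29, 1 - 230 / 1255)
# Regress toward the arm-angle and repertoire estimate: about 92% toward +14.
vs_similar_lefty = regress(observed_split, 14, 0.92)

print(round(vs_average_lefty), round(vs_similar_lefty))
```

Rounded to the nearest point, those come out to +15 and +9. A lower mean combined with a heavier regression lands on a smaller, but still positive, split.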

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn’t want to piss off Lester, his teammates, and possibly the fan base. Who knows?

In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average than the ones who do not, assuming that the latter were allowed to stay in the game of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.
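The logic can be sketched as a weighted average. The function name is made up; the 4.0, 4.3 and 4.8 are the figures from the example, and the 74% is the share of starters left in for the 7th in close games that is reported in the data.

```python
def implied_ttop(true_talent, ra9_stayed, ra9_pulled, share_stayed):
    """TTOP implied if every starter had been left in, given how each group would have pitched."""
    ra9_all = share_stayed * ra9_stayed + (1 - share_stayed) * ra9_pulled
    return ra9_all - true_talent

# The example above: half stay (4.3), half would have pitched at 4.8.
print(round(implied_ttop(4.0, 4.3, 4.8, share_stayed=0.5), 2))
# The same hypothetical gap with a 74% stay rate.
print(round(implied_ttop(4.0, 4.3, 4.8, share_stayed=0.74), 2))
```

The size of the implied penalty moves with the proportion, but for any split it comes out larger than what the starters who stayed actually showed, which is the point of the argument.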

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision as far as whether or not he should remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That’s a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue especially when close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0 or the game is tied 0-0, 1-1, 2-2, and the starter’s pitch count is not high, managers are typically loath to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP penalty and then adjusting for batter and park.

Surely their managers, in allowing them to pitch in a very close game in the 7th know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them to weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitch through 6 were allowed to continue even in games that were not close. No different from the close games. The other numbers are similar too. The ones who are allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should be (and presumably think they are) making very careful decisions about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn’t really matter who they pitch, in terms of the WE of the game, and the decision-making goal should be oriented toward the long-term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If we did, we would see much better performance from our starters than in not-so-close games and good performance in general. Instead we see rather poor performance, replacement level reliever numbers in the 7th inning of both close and not-so-close games. Surely that belies the, “Managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not,” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In all games in which a starter allows 1 or 0 runs through 6, even though his FIP was high, greater than 4, suggesting that he really wasn’t pitching such a great game, his manager let him continue 78% of the time, which was more than the 74% overall that starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP, less than 3, suggesting that he pitched better than his RA suggests, managers let him continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than component results when deciding whether to pull their starter in the 7th. We know that that is not a good decision-making process as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.

In addition, there is no evidence that managers can correctly determine who should stay and who to pull in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters has gas left in their tank and which do not? They don’t. They really, really, really don’t.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.
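Here is that baseline adjustment as a small sketch, using the same illustrative numbers (4.40 seasonal RA9, 150 IP). The function name and parameters are made up for this example.

```python
def rest_of_season_ra9(season_ra9, season_ip, runs_thru_6, innings=6):
    """Expected RA9 outside the innings already pitched, holding the season line fixed."""
    season_runs = season_ra9 * season_ip / 9
    return (season_runs - runs_thru_6) / (season_ip - innings) * 9

print(round(rest_of_season_ra9(4.40, 150, runs_thru_6=1), 2))  # dealing: 1 run through 6
print(round(rest_of_season_ra9(4.40, 150, runs_thru_6=4), 2))  # shelled: 4 runs through 6
```

Those reproduce the 4.52 and 4.33 figures. The fewer innings a pitcher has thrown on the season, the bigger this adjustment gets, since those 6 innings are a larger share of his line.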

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

The other day I wrote that pitcher performance through 6 innings, as measured solely by runs allowed, is not a good predictor of performance in the 7th inning. Whether a pitcher is pitching a shutout or has allowed 4 runs thus far, his performance in the 7th is best projected mostly by his full-season true talent level plus a times through the order penalty of around .33 runs per 9 innings (the average batter faced in the 7th inning appears for the 3rd time). Pitch count has a small effect on those late inning projections as well.

Obviously if you have allowed no or even 1 run through 6 your component results will tend to be much better than if you have allowed 3 or 4 runs. However, there is going to be some overlap. Some small proportion of 0 or 1 run starters will have allowed a HR, 6 or 7 walks and hits, and few if any strikeouts. Similarly, some small percentage of pitchers who allow 3 or 4 runs through 6 will have struck out 7 or 8 batters and only allowed a few hits and walks.

If we want to know whether pitching “well” or not through 6 innings has some predictive value for the 7th (and later) inning, it is better to focus on things that reflect the pitcher’s raw performance than simply runs allowed. It is an established fact that pitchers have little control over whether their non-HR batted balls fall for hits or outs or whether their hits and walks get “clustered” to produce lots of runs or are spread out such that few if any runs are scored.

It is also established that the components most under control by a pitcher are HR, walks, and strikeouts, and that pitchers who excel at the K, and limit walks and HR tend to be the most talented, and vice versa. It also follows that when a pitcher strikes out a lot of batters in a game and limits his HR and walks total that he is pitching “well,” regardless of how many runs he has allowed – and vice versa.

Accordingly, I have extended my inquiry into whether pitching “well” or not has some predictive value intra-game to focus on in-game FIP rather than runs allowed. My intra-game FIP is merely HR, walks, and strikeouts per inning, using the same weights as are used in the standard FIP formula – 13 for HR, 3 for walks and 2 for strikeouts (with strikeouts counting in the pitcher’s favor).
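Here is a rough sketch of that in-game FIP in Python. The 13/3/2 weights and the "less than 3" and "greater than 4" cutoffs come from this article; the scaling constant of 3.10 is purely my assumption for illustration (the text does not say whether or how one was applied), and the function names are made up.

```python
def in_game_fip(hr, bb, k, innings, constant=0.0):
    """(13*HR + 3*BB - 2*K) per inning, plus an optional scaling constant."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    return (13 * hr + 3 * bb - 2 * k) / innings + constant

def label_start(fip, dealing_below=3.0, not_dealing_above=4.0):
    """Bucket a start using the thresholds from the article."""
    if fip < dealing_below:
        return "dealing"
    if fip > not_dealing_above:
        return "not dealing"
    return "in between"

ASSUMED_CONSTANT = 3.10  # hypothetical; chosen only to put FIP on an RA9-like scale

# Two made-up lines through 6 innings.
sharp = in_game_fip(hr=0, bb=1, k=7, innings=6, constant=ASSUMED_CONSTANT)
rough = in_game_fip(hr=2, bb=3, k=3, innings=6, constant=ASSUMED_CONSTANT)

print(label_start(sharp), label_start(rough))
```

Note that runs allowed never enters the calculation, which is the whole point of switching from runs to components.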

So, rather than defining dealing as allowing 1 or fewer runs through 6 and not dealing as 3 or more runs, I will define the former as an FIP through 6 innings below some maximum threshold and the latter as above some minimum threshold. Although I am not nearly convinced that managers and pitching coaches, and certainly not the casual fan, look much further than runs allowed, I think we can all agree that they should be looking at these FIP components instead.

Here is the same data that I presented in my last article, this time using FIP rather than runs allowed to differentiate pitchers who have been pitching very well through 6 innings or not.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

| Starters through 6 innings | Avg runs allowed through 6 | # of Games | RA9 in the 7th inning |
|---|---|---|---|
| Dealing (FIP less than 3 through 6) | 1.02 | 5,338 | 4.39 |
| Not-dealing (FIP greater than 4) | 2.72 | 3,058 | 5.03 |

The first thing that should jump out at you is while our pitchers who are not pitching well do indeed continue to pitch poorly, our dealing pitchers, based upon K, BB, and HR rate over the first 6 innings, are not exactly breaking the bank either in the 7th inning.

Let’s put some context into those numbers.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

| Starters through 6 innings | True talent level based on season RA9 | Expected RA9 in 7th | RA9 in the 7th inning |
|---|---|---|---|
| Dealing (FIP less than 3 through 6) | 4.25 | 4.50 | 4.39 |
| Not-dealing (FIP greater than 4) | 4.57 | 4.62 | 5.03 |

As you can see, our new dealing pitchers are much better pitchers. They normally allow 4.25 runs per 9 innings during the season. Yet they allow 4.39 runs per 9 in the 7th despite pitching very well through 6, irrespective of runs allowed (and of course they allow few runs too). In other words, we have eliminated those pitchers who allowed few runs but may have actually pitched badly or at least not as well as their meager runs allowed would suggest. All of these dealing pitchers had some combination of high K rates, and low BB and HR rates through 6 innings. But still, we see only around .1 runs per 9 in predictive value (4.39 actual versus 4.50 expected) – not significantly different from zero.

On the other hand, pitchers who have genuinely been pitching badly, at least in terms of some combination of a low K rate and high BB and HR rates, do continue to pitch around .4 runs per 9 innings worse than we would expect given their true talent level and the TTOP.

There is one other thing that is driving some of the difference. Remember that in our last inquiry we found that pitch count was a factor in future performance. We found that while pitchers who only had 78 pitches through 6 innings pitched about as well as expected in the 7th, pitchers with an average of 97 pitches through 6 performed more than .2 runs worse than expected.

In our above 2 groups, the dealing pitchers averaged 84 pitches through 6 and the non-dealing 88, so we expect some bump in the 7th inning performance of the latter group because of a touch of fatigue, at least as compared to the dealing group.

So when we use a more granular approach to determining whether pitchers have been dealing through 6, there is not any evidence that it has much predictive value – the same thing we concluded when we looked at runs allowed only. These pitchers only pitched .11 runs per 9 better than expected.

On the other hand, if pitchers have been pitching poorly for 6 innings, as reflected in the components in which they exert the most control, K, BB, and HR rates, they do in fact pitch worse than expected, even after accounting for a slight elevation in pitch count as compared to the dealing pitchers. That decrease in performance is about .4 runs per 9.

I also want to take this time to state that based on this data and the data from my previous article, there is little evidence that managers are able to identify when pitchers should stay in the game or should be removed. We are only looking at pitchers who were chosen to continue pitching in the 7th inning by their managers and coaches. Yet, the performance of those pitchers is worse than their seasonal numbers, even for the dealing pitchers. If managers could identify those pitchers who were likely to pitch well, whether they had pitched well in prior innings or not, clearly we would see better numbers from them in the 7th inning. At best a dealing pitcher is able to mitigate his TTOP, and a non-dealing pitcher who is allowed to pitch the 7th pitches terribly, which does not bode well for the notion that managers know whom to pull and whom to keep in the game.

For example, in the above charts, we see that dealing pitchers threw .14 runs per 9 worse than their seasonal average – which also happens to be exactly at league average levels. The non-dealing pitchers, who were also deemed fit to continue by their managers, pitched almost ½ run worse than their seasonal performance and more than .6 runs worse than the league average pitcher. Almost any reliever in the 7th inning would have been a better alternative than either the dealing or non-dealing pitchers. Once again, I have yet to see some concrete evidence that the ubiquitous cry from some of the sabermetric naysayers, “Managers know more about their players’ performance prospects than we do,” has any merit whatsoever.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

Almost everyone, to a man, thinks that a manager’s decision as to whether to allow his starter to pitch in the 6th, 7th, or 8th (or later) innings of an important game hinges, at least in part, on whether said starter has been dealing or getting banged around thus far in the game.

Obviously there are many other variables that a manager can and does consider in making such a decision, including pitch count, times through the order (not high in a manager’s hierarchy of criteria, as analysts have been pointing out more and more lately), the quality and handedness of the upcoming hitters, and the state of the bullpen, both in terms of quality and availability.

For the purposes of this article, we will put aside most of these other criteria. The two questions we are going to ask are these:

  • If a starter is dealing thus far, say, in the first 6 innings, and he is allowed to continue, how does he fare in the very next inning? Again, most people, including almost every baseball insider, (player, manager, coach, media commentator, etc.), will assume that he will continue to pitch well.
  • If a starter has not been dealing, or worse yet, he is achieving particularly poor results, these same folks will usually argue that it is time to take him out and replace him with a fresh arm from the pen. As with the starter who has been dealing, the presumption is that the pitcher’s bad performance over the first, say, 6 innings, is at least somewhat predictive of his performance in the next inning or two. Is that true as well?

Keep in mind that one thing we are not able to look at is how a poorly performing pitcher might perform if he were left in a game, even though he was removed. In other words, we can’t do the controlled experiment we would like – start a bunch of pitchers, track how they perform through 6 innings and then look at their performance through the next inning or two.

So, while we have to assume that, in some cases at least, when a pitcher is pitching poorly and his manager allows him to pitch a while longer, that said manager still had some confidence in the pitcher’s performance over the remaining innings, we also must assume that if most people’s instincts are right, the dealing pitchers through 6 innings will continue to pitch exceptionally well and the not-so dealing pitchers will continue to falter.

Let’s take a look at some basic numbers before we start to parse them and do some necessary adjustments. The data below is from the AL only, 2003-2013.

 

 Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

| Starters through 6 innings | # of Games | RA9 in the 7th inning |
|---|---|---|
| Dealing (0 or 1 run allowed through 6) | 5,822 | 4.46 |
| Not-dealing (3 or more runs allowed through 6) | 2,960 | 4.48 |

First, let me explain what “RA9 in the 7th inning” means: It is the average number of runs allowed by the starter in the 7th inning extrapolated to 9 innings, i.e. runs per inning in the 7th multiplied by 9. Since the starter is often removed in the middle of the 7th inning whether he has been dealing or not, I calculated his runs allowed in the entire inning by adding together his actual runs allowed while he was pitching plus the run expectancy of the average pitcher when he left the game, scaled to his talent level and adjusted for time through the order, based on the number of outs and base runners.
For example, let’s say that a starter who is normally 10% worse than a league average pitcher allowed 1 run in the 7th inning and then left with 2 outs and a runner on first base. He would be charged with allowing 1 plus (.231 * 1.1 * 1.08) runs or 1.274 runs in the 7th inning. The .231 is the average run expectancy for a runner on first base and 2 outs, the 1.1 multiplier is because he is 10% worse than a league average pitcher, and the 1.08 multiplier is because most batters in the 7th inning are appearing for the 3rd time (TTOP). When all the 7th inning runs are tallied, we can convert them into a runs per 9 innings or the RA9 you see in the chart above.
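For those who like to see the bookkeeping spelled out, here is a minimal Python sketch of the example above. The .231 run expectancy, the 1.1 talent multiplier, and the 1.08 TTOP multiplier are the figures from the text; the function names are my own.

```python
def charged_runs(actual_runs, base_out_re, talent_multiplier, ttop_multiplier):
    """Runs charged to a starter for a full inning: the runs he actually
    allowed, plus the scaled run expectancy of the state he left behind."""
    return actual_runs + base_out_re * talent_multiplier * ttop_multiplier

def ra9(total_runs, innings):
    """Convert a run total over some number of innings to runs per 9."""
    return total_runs / innings * 9

# Starter 10% worse than average, 1 run in, leaves with 2 outs and a runner on first.
runs = charged_runs(actual_runs=1, base_out_re=0.231,
                    talent_multiplier=1.10, ttop_multiplier=1.08)
print(round(runs, 3))
```

Summing `charged_runs` over every qualifying 7th inning and passing the total to `ra9` along with the number of innings gives the RA9 figures in the chart.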

At first glance it appears that whether a starter has been dealing in prior innings or not has absolutely no bearing on how he is expected to pitch in the following inning, at least with respect to those pitchers who were allowed to remain in the game past the 6th inning. However, we have different pools of pitchers, batters, parks, etc., so the numbers will have to be parsed to make sure we are comparing apples to apples.

Let’s add some pertinent data to the above chart:

| Starters through 6 | RA9 in the 7th | Seasonal RA9 |
|---|---|---|
| Dealing | 4.46 | 4.29 |
| Not-dealing | 4.48 | 4.46 |

As you can see, the starters who have been dealing are, not surprisingly, better pitchers. However, interestingly, we have a reverse hot and cold effect. The pitchers who have allowed only 1 run or less through 6 innings pitch worse than expected in the 7th inning, based on their season-long RA9. Many of you will know why – the times through the order penalty. If you have not read my two articles on the TTOP, and I suggest you do, each time through the order, a starting pitcher fares worse and worse, to the tune of about .33 runs per 9 innings each time he faces the entire lineup. In the 7th inning, the average TTO is 3.0, so we expect our good pitchers, the ones with the 4.29 RA9 during the season, to average around 4.76 RA9 in the 7th inning (the 3rd time through the order, a starter pitches about .33 runs per 9 worse than he pitches overall, and the seasonal adjustment – see the note above – adds another .14 runs). They actually pitch to the tune of 4.46 or .3 runs better than expected after considering the TTOP. What’s going on there?

Well, as it turns out, there are 3 contextual factors that depress a dealing starter’s results in the 7th inning that have nothing to do with his performance in the 6 previous innings:

  • The batters that a dealing pitcher is allowed to face are 5 points lower in wOBA than the average batter that each faces over the course of the season, after adjusting for handedness. This should not be surprising. If any starting pitcher is allowed to pitch the 7th inning, it is likely that the batters in that inning are slightly less formidable or more advantageous platoon-wise, than is normally the case. Those 5 points of wOBA translate to around .17 runs per 9 innings, reducing our expected RA9 to 4.59.
  • The parks in which we find dealing pitchers are, not surprisingly, slightly pitcher friendly, with an average PF of .995, which lowers our expectation of future performance by .02 runs per 9, to 4.57.
  • The temperature in which this performance occurs is also slightly more pitcher friendly by around a degree F, although this would have a de minimis effect on run scoring (it takes about a 10 degree difference in temperature to move run scoring by around .025 runs per game).

So our dealing starters pitch .11 runs per 9 innings better than expected, a small effect, but nothing to write home about, and well within the range of values that can be explained purely by chance.
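To make the chain of adjustments for the dealing group easy to follow, here it is as a short Python tally. Every number is taken from the text above; only the variable names and labels are mine.

```python
seasonal_ra9 = 4.29   # dealing starters' season-long RA9
actual_7th = 4.46     # their RA9 in the 7th inning

# (label, runs per 9) for each adjustment described in the text.
adjustments = [
    ("seasonal baseline adjustment", +0.14),
    ("times through the order penalty", +0.33),
    ("batters 5 points of wOBA weaker", -0.17),
    ("park factor of .995", -0.02),
]

expected = seasonal_ra9
for label, delta in adjustments:
    expected += delta
    print(f"{label:35s} {delta:+.2f} -> {expected:.2f}")

# Positive means the pitchers did better than expected.
better_than_expected = expected - actual_7th
print(round(expected, 2), round(better_than_expected, 2))
```

The temperature effect is left out because, as noted, it is too small to register at two decimal places.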

What about the starters who were not dealing? They out-perform their seasonal RA9 plus the TTOP by around .3 runs per 9. The batters they face in the 7th inning are 6 points worse than the average league batter after adjusting for the platoon advantage, and the average park and ambient temperature tend to slightly favor the hitter. Adjusting their seasonal RA9 to account for the fact that they pitched poorly through 6 (see my note at the beginning of this article), we get an expectation of 4.51. So these starters fare almost exactly as expected (4.48 to 4.51) in the 7th inning, after adjusting for the batter pool, despite allowing 3 or more runs for the first 6 innings. Keep in mind that we are only dealing with data from around 9,000 BF. One standard deviation in “luck” is around 5 points of wOBA which translates to around .16 runs per 9.

It appears to be quite damning that starters who are allowed to continue after pitching 6 stellar or mediocre to poor innings pitch almost exactly as (poorly as) expected – their normal adjusted level plus .33 runs per 9 because of the TTOP – as if we had no idea how well or poorly they pitched in the prior 6 innings.

Score one for simply using a projection plus the TTOP to project how any pitcher is likely to pitch in the middle to late innings, regardless of how well or poorly they have pitched thus far in the game. Prior performance in the same game has almost no bearing on that performance. If anything, when a manager allows a dealing pitcher to continue pitching after 6 innings, when facing the lineup for the 3rd time on the average, he is riding that pitcher too long. And, more importantly, presumably he has failed to identify anything that the pitcher might be doing, velocity-wise, mechanics-wise, repertoire-wise, command-wise, results-wise, that would suggest that he is indeed on that day and will continue to pitch well for another inning or so.

In fact, whether pitchers have pitched very well or very poorly or anything in between for the first 6 innings of a game, managers and pitching coaches seem to have no ability to determine whether they are likely to pitch well if they remain in the game. The best predictor of 7th inning performance for any pitcher who is allowed to remain in the game, is his seasonal performance (or projection) plus a fixed times through the order penalty. The TTOP is approximately .33 runs per 9 innings for every pass through the order. Since the second time through the order is roughly equal to a pitcher’s overall performance, starting with the 3rd time through the lineup we expect that starter to pitch .33 runs worse than he does overall, again, regardless of how he has pitched thus far in the game. The 4th time TTO, we expect a .66 drop in performance. Pitchers rarely if ever get to throw to the order for the 5th time.
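The projection rule in the paragraph above is simple enough to write down as a function. This is a bare-bones sketch: the .33 figure and the idea that the 2nd time through roughly equals overall performance come from the text, while the function name is mine and I have deliberately not extrapolated to the 1st time through, since the paragraph only gives figures from the 2nd pass on.

```python
TTOP_PER_PASS = 0.33  # runs per 9 innings, per the text

def projected_ra9(overall_ra9, times_through_order):
    """Seasonal (or projected) RA9 plus a fixed times-through-the-order penalty.
    The 2nd time through is treated as the pitcher's overall level."""
    if times_through_order < 2:
        raise ValueError("sketch only covers the 2nd time through and later")
    return overall_ra9 + TTOP_PER_PASS * (times_through_order - 2)

for tto in (2, 3, 4):
    print(tto, round(projected_ra9(4.25, tto), 2))
```

Batter quality, park, weather, and pitch count would be layered on top of this number; nothing about how the pitcher has fared so far in the game goes into it.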

Fatigue and Pitch Counts

Let’s look at fatigue using pitch count as a proxy, and see if that has any effect on 7th inning performance for pitchers who allowed 3 or more runs through 6 innings. For example, if a pitcher has not pitched particularly well, should we allow him to continue if he has a low pitch count?

Pitch count and 7th inning performance for non-dealing pitchers:

| Pitch count through 6 | Expected RA9 | Actual RA9 |
|---|---|---|
| Less than 85 (avg=78) | 4.56 | 4.70 |
| Greater than 90 (avg=97) | 4.66 | 4.97 |

 

Expected RA9 accounts for the pitchers’ adjusted seasonal RA9 plus the pool of batters faced in the 7th inning including platoon considerations, as well as park and weather. The latter 2 affect the numbers minimally. As you can see, pitchers who had relatively high pitch counts going into the 7th inning but were allowed to pitch for whatever reasons despite allowing at least 3 runs thus far, fared .3 runs worse than expected, even after adjusting for the TTOP. Pitchers with low pitch counts did only about .14 runs worse than expected, including the TTOP. Those 20 extra pitches appear to account for around .17 runs per 9, not a surprising result. Again, please keep in mind that we are dealing with limited sample sizes, so these small differences are inferential suggestions and are not to be accepted with a high degree of certainty. They do point us in a certain direction, however, and one which comports with our prior expectation – at least my prior expectation.
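The comparison in that paragraph is just actual minus expected for each pitch-count group, then the gap between the two groups. A quick Python sketch using the numbers from the chart (the dictionary keys are my own labels):

```python
# (expected RA9, actual RA9) for non-dealing starters, from the chart above.
groups = {
    "low pitch count (avg 78)": (4.56, 4.70),
    "high pitch count (avg 97)": (4.66, 4.97),
}

# Positive residual means the group pitched worse than expected.
residuals = {name: actual - expected for name, (expected, actual) in groups.items()}
for name, resid in residuals.items():
    print(f"{name}: {resid:+.2f} runs per 9 versus expectation")

fatigue_gap = (residuals["high pitch count (avg 97)"]
               - residuals["low pitch count (avg 78)"])
print(round(fatigue_gap, 2))
```

With samples this size the gap is suggestive rather than conclusive, as noted above.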

What about if a pitcher has been dealing and he also has a low pitch count going into the 7th inning? Very few managers, if any, would remove a starter who allowed zero or 1 run through 6 innings and has only thrown 65 or 70 pitches. That would be baseball blasphemy. Besides the affront to the pitcher (which may be a legitimate concern, but one which is beyond the scope of this article), the assumption by nearly everyone is that the pitcher will continue to pitch exceptionally well. After all, he is not at all tired and he has been dealing! Let’s see if that is true – that these starters continue to pitch well, better than expected based on their projections or seasonal performance plus the TTOP.

Pitch count and 7th inning performance for dealing pitchers:

| Pitch count through 6 | Expected RA9 | Actual RA9 |
|---|---|---|
| Less than 80 (avg=72) | 4.75 | 4.50 |
| Greater than 90 (avg=96) | 4.39 | 4.44 |

Keep in mind that these pitchers normally allow 4.30 runs per 9 innings during the entire season (4.44 after doing the seasonal adjustment). The reason the expected RA9 is so much higher for pitchers with a low pitch count is primarily due to the TTOP. For pitchers with a high pitch count, the batters they face in the 7th are 10 points less in wOBA than league average, thus the 4.39 expected RA9, despite the usual .3 to .35 TTOP.

Similar to the non-dealing pitchers, fatigue appears to play a factor in a dealing pitcher’s performance in the 7th. However, in either case, low-pitch or high-pitch, their performance through the first 6 innings has little bearing on their 7th inning performance. With no fatigue they out-perform their expectation by .25 runs per 9. The fatigued pitchers under-performed their overall season-long adjusted talent plus the usual TTOP by .05 runs per 9.

Again, we see that there is little value to taking out a pitcher who has been getting a little knocked around or leaving in a pitcher who has been dealing for 6 straight innings. Both groups will continue to perform at around their expected full-season levels plus any applicable TTOP, with a slight increase in performance for a low-pitch count pitcher and a slight decrease for a high-pitch count pitcher. The biggest increase we see, .25 runs, is for pitchers who were dealing and had very low pitch counts.

What about if we increase our threshold to pitchers who allow 4 or more runs over 6 innings and those who are pitching a shutout?

| Starters through 6 | Seasonal RA9 | Expected RA9 | 7th inning RA9 |
|---|---|---|---|
| Dealing (shutouts only) | 4.23 | 4.62 | 4.70 |
| Not-dealing (4 or more runs) | 4.62 | 4.81 | 4.87 |

Here, we see no predictive value in the first 6 innings of performance. In fact, for some reason starters pitching a shutout pitched slightly worse than expected in the 7th inning, after adjusting for the pool of batters faced and the TTOP.

How about the holy grail of starters who are expected to keep lighting it up in the 7th inning – starters pitching a shutout and with a low pitch count? These were true talent 4.25 pitchers facing better than average batters in the 7th, mostly for the third time in the game, so we expect a .3 bump or so for the TTOP. Our expected RA9 was 4.78 after making all the adjustments, and the actual was 4.61. Nothing much to speak of. Their dealing combined with a low pitch count had a very small predictive value in the 7th. Less than .2 runs per 9 innings.

Conclusion

As I have been preaching for what seems like forever – and the data are in accordance – however a pitcher is pitching through X innings in a game, at least as measured by runs allowed, even at the extremes, has very little relevance with regard to how he is expected to pitch in subsequent innings. The best marker for whether to pull a pitcher or not seems to be pitch count.

If you want to know the most likely result, or the mean expected result at any point in the game, you should mostly ignore prior performance in that game and use a credible projection plus a fixed times through the order penalty, which is around .33 runs per 9 the 3rd time through, and another .33 the 4th time through. Of course the batters faced, park, weather, etc. will further dictate the absolute performance of the pitcher in question.

Keep in mind that I have not looked at a more granular approach to determining whether a pitcher has been pitching extremely well or getting shelled, such as hits, walks, strikeouts, and the like. It is possible that such an approach might yield a subset of pitching performance that indeed has some predictive value within a game. For now, however, you should be pretty convinced that run prevention alone during a game has little predictive value in terms of subsequent innings. Certainly a lot less than what most fans, managers, and other baseball insiders think.

Note: These are rules of thumb which apply 90-99% of the time (or so). Some of them have a few or even many exceptions and nuances to consider. I do believe, however, that if every manager followed these religiously, even without employing any exceptions or considering any of the nuances, that he would be much better off than the status quo. There are also many other suggestions, commandments, and considerations that I would use, that are not included in this list.

1)      Thou shalt never use individual batter/pitcher matchups, recent batter or pitcher stats, or even seasonal batter or pitcher stats. Ever. The only thing that this organization uses are projections based on long-term performance. You will use those constantly.

2)      Thou shalt never, ever use batting average again. wOBA is your new BA. Learn how to construct it and learn what it means.

3)      Thou shalt be given and thou shalt use the following batter/pitcher matchups every game: Each batter’s projection versus each pitcher. They include platoon considerations. Those numbers will be used for all your personnel decisions. They are your new “index cards.”

4)      Thou shalt never issue another IBB again, other than obvious late and close-game situations.

5)      Thou shalt instruct your batters whether to sacrifice bunt or not, in all sacrifice situations, based on a “commit line.” If the defense plays in front of that line, thy batters will hit away. If they play behind the line, thy batters will bunt. If they are at the commit line, they may do as they please. Each batter will have his own commit line against each pitcher. Some batters will never bunt.

6)      Thou shalt never sacrifice with runners at first and third, even with a pitcher at bat. You may squeeze if you want. With 1 out and a runner on 1st only your worst hitting pitchers will bunt.

7)      Thou shalt keep thy starter in or remove him based on two things and two things only: One, his pitch count, and two, the number of times he has faced the order. Remember that ALL pitchers lose 1/3 of a run in ERA each time through the order, regardless of how they are pitching thus far.

8)      Thou shalt remove thy starter for a pinch hitter in a high leverage situation if he is facing the order for the 3rd time or more, regardless of how he is pitching.

9)      Speaking of leverage, thou shalt be given a leverage chart with score, inning, runners, and outs. Use it!

10)   Thou shalt, if at all possible, use thy best pitchers in high leverage situations and thy worst pitchers in low leverage situations, regardless of the score or inning.  Remember that “best” and “worst” are based on your new “index cards” (batter v. pitcher projections) or your chart which contains each pitcher’s generic projection. It is never based on how they did yesterday, last week, or even the entire season. Thou sometimes may use “specialty” pitchers, such as when a GDP or a K are at a premium.

11)   Thou shalt be given a chart for every base runner and several of the most common inning, out, and score situations. There will be a number next to each player’s name for each situation. If the pitcher’s time home plus the catcher’s pop time are less than that number, thy runner will not steal. If it is greater, thy runner may steal. No runner shall steal second base with a lefty pitcher on the mound.

12)   Thou shalt not let thy heart be troubled by the outcome of your decisions. No one who works for this team will ever question your decision based on the outcome. Each decision you make is either right, wrong, or a toss-up, before we know, and regardless of, the outcome.

13)   Thou shalt be held responsible for your decisions, also regardless of the outcome. If your decisions are contrary to what we believe as an organization, we would like to hear your explanation and we will discuss it with you. However, you are expected to make the right decisions at all times, based on the beliefs and philosophies of the organization. We don’t care what the fans or the media think.  We will take care of that. We will all make sure that our players are on the same page as we are.

14)   Finally, thou shalt know that we respect and admire your leadership and motivational skills. That is one of the reasons we hired you. However, if you are not on board with our decision-making processes and willing to employ them at all times, please find yourself another team to manage.
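As an illustration of how mechanical some of these rules are meant to be, commandment 11 above boils down to one comparison plus one blanket exception. The sketch below is hypothetical throughout: the function name, the parameters, and the sample times and threshold are made up for the example, and real thresholds would come from the chart the manager is given.

```python
def may_steal(threshold_seconds, pitcher_time_home, catcher_pop_time,
              target_base=2, pitcher_throws="R"):
    """Commandment 11 as a decision rule (hypothetical numbers only)."""
    # No runner steals second with a lefty on the mound.
    if target_base == 2 and pitcher_throws == "L":
        return False
    # Steal only if the battery is slower than the runner's chart number.
    return pitcher_time_home + catcher_pop_time > threshold_seconds

# Made-up chart number for one runner in one situation: 3.35 seconds.
print(may_steal(3.35, pitcher_time_home=1.40, catcher_pop_time=2.00))
print(may_steal(3.35, pitcher_time_home=1.25, catcher_pop_time=1.95))
print(may_steal(3.35, 1.40, 2.00, pitcher_throws="L"))
```

The situation (inning, outs, score) is baked into the chart number itself, so the manager never has to weigh it on the fly.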

Last night I lambasted the Cardinals’ sophomore manager, Mike Matheny, for some errors in bullpen management that I estimated cost his team over 2% in win expectancy (WE). Well, after tonight’s game, all I have to say is, as BTO so eloquently said, “You ain’t seen nothin’ yet!”

Tonight (or last night, or whatever), John Farrell, the equally clueless manager of the Red Sox (God, I hope I don’t ever have to meet these people I call idiots and morons!), basically told Matheny, “I’ll see your stupid bullpen management and raise you one moronic non-pinch hit appearance!”

I’m talking of course about the top of the 7th inning in Game 5. The Red Sox had runners on second and third, one out, and Jon Lester, the Sox’ starter, was due to hit (some day, I’ll be telling my grandkids, “Yes, Johnny, pitchers once were also hitters.”). Lester was pitching well (assuming you define “well” as how many hits/runs he allowed so far – not that I am suggesting that he wasn’t pitching “well”) and had only thrown 69 pitches, I think. I don’t think it ever crossed Farrell’s mind to pinch hit for him in that spot. The Sox were also winning 2-1 at the time, so, you know, they didn’t need any more runs in order to guarantee a win <sarcasm>.

Anyway, I’m not going to engage in a lot of hyperbole and rhetoric (yeah, I probably will). It doesn’t take a genius to figure out that not pinch hitting for Lester in that particular spot (runners on 2nd and 3rd, and one out) is going to cost a decent fraction of a run. It doesn’t even take a genius, I don’t think, to figure out that that means that it also costs the Red Sox some chance of ultimately winning the game. I’ll explain it like I would to a 6-year-old child. With a pinch hitter, especially Napoli, you are much more likely to score, and if you do, you are likely to score more runs. And if on the average you score more runs that inning with a pinch hitter, you are more likely to win the game, since you only have a 1 run lead and the other team still gets to come to bat 3 more times. Surely, Farrell can figure that part out.

How many runs and how much win expectancy does that cost, on the average? That is pretty easy to figure out. I’ll get to that in a second (spoiler alert: it’s a lot). So that’s the downside. What is the upside? It is two-fold, sort of. One, you get to continue to pitch Lester for another inning or two. I assume that Farrell does not know exactly how much longer he plans on using Lester, but he probably has some idea. Two, you get to rest your bullpen in the 7th and possibly the 8th.

Both of those upsides are questionable in my opinion, but, as you’ll see, I will actually give Farrell and any other naysayer (to my way of thinking) the benefit of the doubt. The reason I think it is questionable is this: Lester, despite pitching well so far, and only throwing 69 pitches, is facing the order for the 3rd time in the 7th inning, which means that he is likely .4 runs per 9 innings worse than he is overall, and the Red Sox, like most World Series teams, have several very good options in the pen who are actually at least as good as Lester when facing the order for the third time, not to mention the fact that Farrell can mix and match his relievers in those two innings in order to get the platoon advantage. So, in my opinion, the first upside for leaving in Lester is not an upside at all. But, when I do my final analysis, I will sort of assume that it is, as you will see.

The second upside is the idea of saving the bullpen, or more specifically, saving the back end of the bullpen, the short relievers. In my opinion, again, that is a sketchy argument. We are talking about the World Series, where you carry 11 or 12 pitchers in order to play 7 games in 9 days and then take 5 months off. In fact, tomorrow (today?) is an off day followed by 2 more games and then they all go home. Plus, it’s not like either bullpen has been overworked in the post-season so far. But, I will be happy to concede that “saving your pen” is indeed an upside for leaving Lester in the game. How much is it worth? No one knows, but I don’t think anyone would disagree with this: A manager would not choose to “save” his bullpen for 1-2 innings when there is an off day followed by 2 more games, followed by 100 off days, when the cost of that savings is a significant chunk of win expectancy in the game he is playing at the present time. I mean, if you don’t agree with that, just stop reading and don’t ever come back to this site.

The final question, then, is how much in run or win expectancy did that non-pinch hit cost? Remember in my last post how I talked about “categories” of mistakes that a manager can make? I said that a Category I mistake, a big one, costs a team 1-2% in win expectancy. That may not seem like a lot for one game, but it is. We all criticize managers for “costing” their team the game when we think they made a mistake and their team loses. If you’ve never done that, then you can stop reading too. The fact of the matter is that there is almost nothing a manager can do, short of losing his mind and pinch hitting the bat boy in a high leverage situation, that is worth more than 1 or 2% in win expectancy. Other than this.

The run expectancy with runners on second and third and one out in a low run environment is around 1.40. That means that with a roughly average hitter at the plate, the batting team will score, on the average, 1.40 runs during that inning, from that point on. We’ll assume that it is about the same if Napoli pinch hits. He is a very good hitter, but there is a pinch hitting penalty and he is facing a right handed pitcher. To be honest, it doesn’t really matter. It could be 1.2 runs or 1.5 runs. It won’t make much of a difference.

What is the run expectancy with Lester at the plate? I don’t know much about his hitting, but I assume that since he has never been in the NL, and therefore hardly ever hits, it is not good. We can easily say that it is below that of an average pitcher, but that doesn’t really matter either. With an average pitcher batting in that same situation, and the top of the order coming up, the average RE is around 1.10 runs. So the difference is .3 runs. Again, it doesn’t matter much if it is .25 or .4 runs. And there really isn’t much wiggle room. We know that it is a run scoring situation and we know that a pinch hitter like Napoli (or almost anyone for that matter) is going to be a much better hitter than Lester. So .3 runs sounds more than reasonable. Basically we are saying that, on the average, with a pinch hitter like Napoli at the plate in that situation, runners on 2nd and 3rd with 1 out, the Red Sox will score .3 more runs than with Lester at the plate. I don’t know that anyone would quarrel with that – even someone like a Tim McCarver or Joe Morgan.

In order to figure out how much in win expectancy that is going to cost, again, on the average, first we need to multiply that number by the leverage index in that situation. The LI is 1.64. 1.64 times .3 runs divided by 10 (the rough number of runs per win) is .049 or 4.9%. That is the difference in WE between batting Lester and batting a pinch hitter. It means that with the pinch hitter, the Red Sox can expect, on the average, to win the game around 5% more often than if Lester hits, everything else being equal. I don’t know whether you can appreciate the enormity of that number. I have been working with these kinds of numbers for over 20 years. If you can’t appreciate it, you will just have to take my word for it that that is a ginormous number when it comes to WE in one game. As I said, I usually consider an egregious error to be worth 1-2%. This is worth almost 5%. That is ridiculous. It’s like someone offering you a brand new Chevy or Mercedes for the same price. And you take the Chevy, if you are John Farrell.
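For anyone who wants to check that arithmetic, here is a minimal Python sketch of the back-of-the-envelope conversion described above. The function name and the `runs_per_win` parameter are my own labels for illustration; the divisor of 10 is the one used in the text, and treating it as "runs per win" is an assumption.

```python
def run_diff_to_win_expectancy(run_diff, leverage_index, runs_per_win=10.0):
    """Rough WE change: run difference, scaled by leverage, divided by runs per win.

    This is only the quick rule of thumb from the text, not a full WE model.
    """
    return leverage_index * run_diff / runs_per_win


# Numbers quoted in the text: RE of 1.40 with a pinch hitter, 1.10 with a pitcher batting.
re_pinch_hitter = 1.40
re_pitcher = 1.10
leverage_index = 1.64

run_diff = re_pinch_hitter - re_pitcher
we_cost = run_diff_to_win_expectancy(run_diff, leverage_index)
print(f"Run difference: {run_diff:.2f}, WE cost: {we_cost:.1%}")
```

Changing `run_diff` to .25 or .4, the range mentioned earlier, shows how little the conclusion moves.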

Just to see if we are in the right ballpark with our calculations, I am going to run this scenario through my baseball simulator, which is pretty darn accurate (even though it does not have an algorithm for heart or grit) in these kinds of relatively easy situations to analyze.

Sound of computers whirring….

With Lester hitting, the Red Sox win the game 76.6% of the time. And therein lies the problem! Farrell knows that no matter what he does, he is probably going to win the game, and if he takes out Lester, not only is he going to bruise his feelings (boo hoo), but if the relief corps blows the game, he is going to be lambasted and probably feel like crap. If he takes Lester out, he knows he’s also going to probably win the game, and what’s a few percent here and there. But if he lets Lester continue, as all of Red Sox nation assumes and hopes he will, and then they blow the game, no one is going to blame Farrell. You know why? Because at the first sign of trouble, he is going to pull Lester, and no one is going to criticize a manager for leaving in a pitcher who is pitching a 3-hitter through 6 innings on only 69 pitches and then yanking him as soon as he gives up a baserunner or two. So letting Lester hit for himself is the safe decision. Not a good one, but a safe one.

After that rant, you probably want to know how often the Sox win if they pinch hit for Lester. 79.5% of the time. So that’s only a 2.9% difference. Still higher than my formerly highest Category of manager mistakes, 1-2%.

Let’s be conservative and call it a 3% mistake. I wonder if you told John Farrell that by not pinch hitting for Jon Lester his team’s chances of winning go from 79.5% to 76.6%. Even if he believed that, do you think it would sway his decision? I don’t think so, because he feels with all his heart and soul that having Lester, who is “dealing,” pitch another inning or two, and saving his bullpen, is well worth the difference between 77% and 80%. After all, either way, they probably win.

So how much does Lester pitching another inning or two (we’ll call it 1.5 innings, since at the time it could have been anywhere from 0 to 2, I think – I am pretty sure that Koji was pitching the 9th no matter what) gain over another pitcher? Well, I already said that the answer is nothing. Any of their good relievers are at least as good as Lester the 3rd time through the order. But I also said that I will concede that Lester is going to be just amazing, on the average, if Farrell leaves him in the game. How good does he have to be in order to make up the .3 runs or 3% in WE that are lost by allowing Lester to hit?

A league average reliever allows around 4 runs a game. It doesn’t matter what that exact number is – we are only using it for comparison purposes. A good short reliever actually allows more like 3 or 3.5 runs a game. Starting pitchers, in general, are a little worse than the average pitcher (because of that nasty times through the order penalty). A very good pitcher like Lester allows around 3.5 runs a game (a pitcher like Wainwright around 3 runs a game). So let’s assume that a very average reliever came into the game to pitch the 7th and half the 8th rather than Lester. He would allow 4 runs a game. That is very pedestrian for a reliever. Almost any short reliever can do that with his eyes closed. In order to make up the .3 runs we lost by letting Lester hit, Lester needs to allow fewer than 4 runs a game. How many fewer? Well, .3 runs in 1.5 innings is .2 runs per inning. .2 runs per inning times 9 innings is 1.8 runs. So Lester would have to pitch like a pitcher who allows 2.2 runs per 9 innings. No starting pitcher like that exists. Even the best starter in baseball, Clayton Kershaw, is a 2.5 run per 9 pitcher at best.
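The break-even calculation above can be sketched in a few lines of Python. The function name is hypothetical and exists only for illustration; the inputs (a 4 runs per 9 reliever, .3 runs to make up, 1.5 innings) are the ones quoted in the text.

```python
def break_even_starter_ra9(reliever_ra9, runs_to_recover, innings):
    """Runs per 9 the starter must pitch at to win back `runs_to_recover`
    over `innings`, compared with a reliever who allows `reliever_ra9`."""
    runs_per_inning = runs_to_recover / innings   # .3 / 1.5 = .2 runs per inning
    return reliever_ra9 - runs_per_inning * 9     # .2 * 9 = 1.8 runs per 9 better


required_ra9 = break_even_starter_ra9(reliever_ra9=4.0, runs_to_recover=0.3, innings=1.5)
print(f"Lester must pitch like a {required_ra9:.1f} runs per 9 pitcher")
```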

Let’s go another route. Remember that I said Lester was probably around a 3.5 run pitcher (Steamer, a very good projection system, has him projected with a 3.60 FIP, which is around a 3.5 pitcher in my projection system), but that the third time through the order he is probably a 3.80 or 3.90 pitcher. Forget about that. Let’s decree that Lester is indeed going to pitch the 7th and 8th innings, or however long he continues, like an ace reliever. Let’s call him a 3.00 pitcher, not the 3.80 or 3.90 pitcher that I think he really is, going into the 7th inning.

In case you are wondering, there is no evidence that good or even great pitching through 6 or 7 innings predicts good pitching for future innings. Quite the contrary. Even starters who are pitching well have the times through the order penalty, and if they are allowed to continue, they end up pitching worse than they do overall in a random game. That is what real life says. That is what happens. It is not my opinion, observation, or recollection. A wise person once said that, “Truth comes from evidence and not opinion or faith.”

But, again, we are living on Planet Farrell, so we are conceding that Lester is a great pitcher going into the 7th inning and the third time through the order. (Please don’t tell me how he did that inning. If you do or even think that, you need to leave and never come back. Seriously.)  We are calling him a 3.0 pitcher, around the same as a very good closer.

How bad does a replacement for Lester for 1.5 innings have to be to make up for that .3 runs? Again, we need .2 runs per inning, times 9 innings, or a total of 1.8 runs per 9. So the reliever to replace him would have to be a 4.8 pitcher. That is a replacement pitcher, folks. There is no one on either roster who is even close to that.
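Here is the same break-even run from the other direction, as a rough Python sketch. The function name is made up for illustration, and the loop over 1, 1.5, and 2 innings is my own sensitivity check on the 1.5 inning assumption, not something taken from the simulator.

```python
def break_even_reliever_ra9(starter_ra9, runs_lost, innings):
    """Runs per 9 at which a replacement gives back exactly `runs_lost`
    over `innings`, compared with a starter pitching at `starter_ra9`."""
    return starter_ra9 + (runs_lost / innings) * 9


# Lester generously treated as a 3.0 pitcher; .3 runs lost by letting him hit.
for innings in (1.0, 1.5, 2.0):
    ra9 = break_even_reliever_ra9(starter_ra9=3.0, runs_lost=0.3, innings=innings)
    print(f"{innings} innings: reliever would have to be a {ra9:.2f} pitcher")
```

Even if Lester were guaranteed two full innings, the replacement would still have to be well worse than the league average reliever at 4 runs a game before leaving Lester in breaks even.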

So there you have it. Like Keith Olbermann’s “Worst Person in the World,” we have the worst manager in baseball – John Farrell.

Addendum: Please keep in mind that some of the hyperbole and rhetoric is just that. Take comments like, “Farrell is an idiot,” or, “the worst manager in baseball,” with a grain of salt and chalk it up to flowery emotion. It is not relevant to the argument of course. The argument speaks for itself, and you, the reader, are free to conclude what you want about whether his moves, or any other managerial moves that I might discuss, were warranted or not.

I am not insensitive to factors that drive all managers’ decisions, like the reaction, desires, and opinions of the fans, media, upper management, and especially, the players. As several people have pointed out, if a manager were to do things that were “technically” correct, yet in doing so, alienate his players (and/or the fans) thereby affecting morale, loyalty, and perhaps a conscious or subconscious desire to win, then those “correct” decisions may become “incorrect” in the grand scheme of things.

That being said, my intention is to inform the reader and to take the hypothetical perspective of informing the manager of the relevant and correct variables and inputs such that they and you can make an informed decision. Imagine this scenario: I am sitting down with Farrell and perhaps the Red Sox front office and we are rationally and intelligently discussing ways to improve managerial strategy. Surely no manager can be so arrogant as to think that everything he does is correct. You would not want an employee like that working for your company no matter how much you respect him and trust his skills. Anyway, let’s say that we are discussing this very same situation, and Farrell says something like, “You know, I really didn’t care whether I removed Lester for a pinch hitter or not, and I don’t think he or my players would either. Plus, the preservation of my bullpen was really a secondary issue. I could have easily used Morales, Dempster, or even Breslow again. Managers have to make tough decisions like that all the time. I genuinely thought that with Lester pitching and us already being up a run, we had the best chance to win. But now that you have educated me on the numbers, I realize that that assumption on my part was wrong. In the future I will have to rethink my position if that or a similar situation should come up.”

That may not be a realistic scenario, but that is the kind of discussion and thinking I am trying to foster.

MGL