Note: This post was edited to include some new data that points toward a different conclusion. The addendum is at the end of the original post.

This is another one of my attempts at looking at “conventional wisdoms” that you hear and read about all the time without anyone stopping for a second to catch their breath and ask themselves, “Is this really true?” Or more appropriately, “To what extent is this true?” Bill James used those very questions to pioneer a whole new field called sabermetrics.

As usual in science, we can rarely if ever answer questions with, “Yes it is true,” or “No, it is not true.” We can only look at the evidence and try and draw some inferences with some degree of certainty between 0 and 100%. This is especially true in sports when we are dealing with empirical data and limited sample sizes.

You often read something like, “So-and-so pitcher had a poor season (say, in ERA) but he had a few really bad outings so it wasn’t really that bad.” Let’s see if we can figure out to what extent that may or may not be true.

First I looked at all starting pitcher outings over the last 40 years, 1977-2016. I created a group of starters who had at least 4 very bad outings and at least 100 IP in one season. A “bad outing” was defined as 5 IP or less and at least 6 runs allowed, so a minimum RA9 of almost 11 in at least 4 games in a season. Had those starts been typical starts, each of these pitchers’ ERA or RA9 would have been at least a run or so lower.
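As a rough sketch of the “bad outing” filter just described (the function names are illustrative and are not taken from the code used for the study):

```python
def ra9(runs, innings):
    """Runs allowed per 9 innings."""
    return 9.0 * runs / innings


def is_bad_outing(innings, runs, max_ip=5.0, min_runs=6):
    """True for a start of 5 IP or less with at least 6 runs allowed."""
    return innings <= max_ip and runs >= min_runs


# The mildest start that still qualifies: exactly 5 IP and 6 runs.
mildest_ra9 = ra9(6, 5.0)
print(round(mildest_ra9, 1))
```

The mildest qualifying start works out to an RA9 of 10.8, which is the “almost 11” floor mentioned above.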

Next I only looked at those pitchers who had an overall RA9 of at least 5.00 in the seasons in question. The average RA9 for these pitchers with some really bad starts was 5.51 where 4.00 is the average starting pitcher’s RA9 in every season regardless of the run environment or league. Basically I normalized all pitchers to the average of his league and year and set the average at 4.00. I also park adjusted everything.
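One simple way to do that kind of normalization is a straight ratio to the league average. The sketch below assumes that approach; the park-factor convention (above 1.0 means a hitter-friendly park) and the example numbers are hypothetical, and the actual adjustment used for the study may differ.

```python
LEAGUE_BASELINE = 4.00  # the fixed "average starter" RA9 used in this post


def normalized_ra9(raw_ra9, league_ra9, park_factor=1.0):
    """Rescale a raw RA9 so the league-year average maps to 4.00.

    park_factor is an assumed convention: values above 1.0 mean a
    hitter-friendly park, so the pitcher's RA9 is scaled down.
    """
    return raw_ra9 / park_factor / league_ra9 * LEAGUE_BASELINE


# Hypothetical example: a 5.00 RA9 in a league that averaged 4.50.
print(round(normalized_ra9(5.00, 4.50), 2))

# On this scale, 5.51 as a percentage of the 4.00 league average:
print(round(5.51 / LEAGUE_BASELINE * 100))
```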

OK, what were these pitchers projected to do the following season? I used basic Marcel-type projections for all pitchers. The projections treated all RA9 equally. In other words a 5.51 RA with a few really bad starts was equivalent to a 5.51 RA with consistently below-average starts. The projections only used full season data (RA9).
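For readers unfamiliar with the idea, here is a minimal sketch of a Marcel-type projection: weight recent seasons more heavily, then regress toward the league average by mixing in some league-average innings. The 3/2/1 weights and the 150 regression innings are illustrative assumptions, there is no age adjustment, and this is not the exact model used here, so it will not reproduce the numbers in this post.

```python
def marcel_type_projection(seasons, league_ra9=4.00,
                           weights=(3, 2, 1), regression_ip=150.0):
    """Project next-season RA9 from full-season totals only.

    seasons: list of (ra9, innings) tuples, most recent season first.
    weights and regression_ip are assumed, illustrative values.
    """
    weighted_runs = 0.0
    weighted_ip = 0.0
    for (ra9, innings), weight in zip(seasons, weights):
        weighted_runs += weight * ra9 * innings / 9.0
        weighted_ip += weight * innings

    # Regression toward the mean: add league-average innings to the sample.
    total_runs = weighted_runs + league_ra9 * regression_ip / 9.0
    total_ip = weighted_ip + regression_ip
    return 9.0 * total_runs / total_ip


# Hypothetical pitcher: a 5.51 season preceded by two better ones.
history = [(5.51, 150.0), (4.10, 180.0), (3.90, 170.0)]
print(round(marcel_type_projection(history), 2))
```

Note that the function sees only season totals. Two pitchers with the same 5.51 RA9 get the same projection no matter how their runs were distributed across starts.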

So basically these 5.51 RA9 pitchers pitched near average for most of their starts but had 4-6 really bad (and short) starts that upped their overall RA9 for the season by more than a run. Which was more indicative of their true talent? The vast majority of the games where they pitched around average, the few games where they blew up, or their overall runs allowed per 9 innings? Or, their overall RA9 for that season (regardless of how it was created) plus their RA9 from previous seasons and then some regression thrown in for good measure – in other words, a regular, old-fashioned projection?

Our average projection for these pitchers for the next season (which is an estimate of their true talent that season) was 4.46. How did they pitch the next season – which is an unbiased sample of their true talent (I didn’t set an innings requirement for this season so there is no survivorship bias)? It was 4.48 in 10,998 TBF! So the projection which had no idea that these were pitchers who pitched OK for most of the season but had a terrible seasonal result (5.51 RA9) because of a few terrible starts, was right on the money. All the projection model knew was that these pitchers had very bad RA9 for the season – in fact, their average RA was 138% of league average.

Of course since we sampled these pitchers based on some bad outings and an overall bad ERA (over 5.00) we know that in prior seasons their RA9 would be much lower, similar to their projection (4.46) – actually better. In fact, you should know that a projection can apply just as well to previous years as it can to subsequent years. There is almost no difference. You just have to make sure you apply the proper age adjustments.

Somewhat interestingly, if we look at all pitchers with a RA9 above 5 (an average of 5.43) who did not have the requisite very bad outings, i.e. they pitched consistently bad but with few disastrous starts, their projected RA9 was 4.45 and their actual was 4.25, in 25,479 TBF.

While we have significant sample error in these limited samples, not only is there no suggestion that you should ignore or even discount bad ERA or RA that are the result of a few horrific starts, there is a (admittedly weak) suggestion that pitchers who pitch badly but more consistently may be able to outperform their projections for some reason.

The next time you read that, “So-and-so pitcher has bad numbers but it was only because of a few really bad outings,” remember that there is no evidence that an ERA or RA which includes a “few bad outings” should be treated any differently than a similar ERA or RA without that qualification, at least as far as projections are concerned.

Addendum: I was concerned about the way I defined pitchers who had “a few disastrous starts.” I included all starters who gave up at least 6 runs in 5 innings or less at least 5 times in a season. The average number of bad starts was 5.5. So basically these were mostly pitchers who had 5 or 6 really bad starts in a season, occasionally more.

I thought that most of the time when we hear the “A few bad starts” refrain, we’re talking literally about “a few bad starts,” as in 2 or 3. So I changed the criteria to include only those pitchers with 2 or 3 awful starts. I also upped the ante on those terrible starts. Before it was > 5 runs in 5 IP or less.  Now it is >7 runs in 5 IP or less – truly a blowup of epic proportions. We still had 508 pitcher seasons that fit the bill which gives us a decent sample size.

These pitchers overall had a normalized (4.00 is average) RA9 of 4.19 in the seasons in question, so 2 or 3 awful starts didn’t produce such a bad overall RA. Remember I am using a 100 IP minimum so all of these pitchers pitched at least fairly well for the season whether they had a few awful starts or not. (This is selective sampling and survivorship bias at work. Any time you set a minimum IP or PA, you select players who had above average performance, through luck and talent.)

Their next year’s projection was 3.99 and the actual was 3.89 so there is a slight inference that indeed you can discount the bad starts a little. This is in around 12,000 IP. A difference of .1 RA9 is only around 1 SD so it’s not nearly statistically significant. I also don’t know that we have any Bayesian prior to work with.

The control group – all other starters, namely those without 2 or 3 awful outings – had a RA9 in the season in question of 3.72 (compare to 4.19 for the pitchers with 2 or 3 bad starts). Their projection for the next season was 3.85 and actual was 3.86. This was in around 130,000 IP so 1 SD is now around .025 runs so we can be pretty confident that the 3.86 actual RA9 reflects their true talent within around .05 runs (2 SD) or so.

What about starters who not only had 2 or 3 disastrous starts but also had an overall poor RA9? In the original post I looked at those pitchers in our experimental group who also had a seasonal RA9 of > 5.00. I’ll do the same thing with this new experimental group – starters with only 2 or 3 very awful starts.

Their average RA9 for the experimental season was 5.52. Their projection was 4.45 and actual was 4.17, so now we have an even stronger inference that a bad season caused by a few bad starts creates a projection that is too pessimistic; thus maybe we should discount those few bad starts. We only have around 1600 IP (in the projected season) for these pitchers so 1 SD is around .25 runs. A difference between projected and actual of .28 runs is once again not nearly statistically significant. There is, nonetheless, a suggestion that we are on to something. (Don’t ever ignore – assume it’s random – an observed effect just because it isn’t statistically significant – that’s poor science.)

What about the control group? Last time we noticed that the control group’s actual RA was less than its projection for some reason. I’ll look at pitchers who had > 5 RA9 in one season but were not part of the group that had 2 or 3 disastrous starts.

Their average RA9 was 5.44 – similar to the 5.52 of the experimental group. Their projected was 4.45 and actual was 4.35, so we see the same “too high” projection in this group as well. (In fact, in testing my RA projections based on RA only – as opposed to say FIP or ERC – I find an overall bias such that pitchers with a one-season high RA have projections that are too high, not a surprising result actually.) This is in around 7,000 IP which gives us a SD of around .1 runs per 9.

So, the “a few bad starts” group outperformed their projections by around .1 runs. This same group, limiting it to starters with an overall RA of over 5.00, outperformed their projections by .28 runs. The control group with an overall RA also > 5.00 outperformed their projections by .1 runs. None of these differences are even close to statistically significant.
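The SD figures quoted above all follow the usual rule that sampling error shrinks with the square root of the sample size. Here is a small sketch of that arithmetic. The constant of 10 is not something I was given; it is backed out from the “.25 runs at 1,600 IP” figure and is an assumption.

```python
import math

# Assumed constant, back-solved from "1 SD is around .25 runs" at 1,600 IP.
K = 10.0


def sd_ra9(innings, k=K):
    """Approximate 1-SD sampling error of RA9 over a given number of innings."""
    return k / math.sqrt(innings)


def z_score(projected, actual, innings):
    """Projected-minus-actual RA9, expressed in standard deviations."""
    return (projected - actual) / sd_ra9(innings)


for innings in (1600, 7000, 12000, 130000):
    print(innings, round(sd_ra9(innings), 3))

# The .28-run gap for the "2 or 3 awful starts, RA9 > 5.00" group:
print(round(z_score(4.45, 4.17, 1600), 2))
```

With that constant, the four sample sizes give roughly .25, .12, .09 and .03 runs, in line with the rounded figures quoted above, and the .28-run gap comes out a bit over 1 SD.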

Let’s increase the sample size of our experimental group (those who also had a particularly bad overall RA) a little by expanding it to starters with an overall RA of > 4.50 rather than > 5.00. We now have 3,500 IP, roughly 2x as many, which reduces our error by around 30% (sampling error shrinks with the square root of the sample size). The average RA9 of this group was 5.13. Their projected RA was 4.33 and actual was 4.05 – exactly the same difference as before. Keep in mind that the more samples we look at the more we are “data mining,” which is a bit dangerous in this kind of research.

A control group of starters with > 4.50 RA had an overall RA9 of 4.99. Their projection was exactly the same as the experimental group, 4.33, but their actual was 4.30 – almost exactly the same as their projection.

In conclusion, while we initially found no evidence that discounting a bad ERA or RA caused by “several very poor starts” is warranted when doing a projection for starters with at least 100 IP, once we change the criteria for “a few bad starts” from “at least 5 starts with 6 runs or more allowed in 5 IP or less” to “exactly 2 or 3 starts with 8 runs or more in 5 IP or less” we do find evidence that some kind of discount may be necessary. In other words, for starters whose runs allowed are inflated due to 2 or 3 really bad starts, if we simply use overall season RA or ERA for our projections we will overstate their subsequent season’s RA or ERA by maybe .2 or .3 runs per 9.

Our certainty of this conclusion, especially with regard to the size of the effect – if it exists at all – is pretty weak given the magnitude of the differences we found and the sample sizes we had to work with. However, as I said before, it would be a mistake to ignore any inference – even a weak one – that is not contradicted by some Bayesian prior (or common sense).

 

Note: I updated the pinch hitting data to include a larger sample (previously I went back to 2008. Now, 2000).

Note: It was pointed out by a commenter below and another one on Twitter that you can’t look only at innings where the #9 and #1 batters batted (eliminating innings where the #1 hitter led off), as Russell did in his study, and which he uses to support his theory (he says that it is the best evidence). That creates a huge bias, of course. It eliminates all PA in which the #9 hitter made the last out of an inning or at least an out was made while he was at the plate. In fact, the wOBA for a #9 hitter, who usually bats around .300, is .432 in innings where he and the #1 hitter bat (after eliminating so many PA in which an out was made). How that got past Russell, I have no idea.  Perhaps he can explain.

Recently, Baseball Prospectus published an article by one of their regular writers, Russell Carleton (aka Pizza Cutter), in which he examined whether the so-called “times through the order” penalty (TTOP) was in fact a function of how many times a pitcher has turned over the lineup in a game or whether it was merely an artifact of a pitcher’s pitch count. In other words, is it pitcher fatigue or batter familiarity (the more the batter sees the pitcher during the game, the better he performs) which causes this effect?

It is certainly possible that most or all of the TTOP is really due to fatigue, as “times through the order” is clearly a proxy for pitch count. In any case, after some mathematical gyrations that Mr. Carleton is wont to do (he is the “Warning: Gory Mathematical Details Ahead” guy) in his articles, he concludes unequivocally that there is no such thing as a TTOP – that it is really a PCP or Pitch Count Penalty effect that makes a pitcher less and less effective as he goes through the order and it has little or nothing to do with batter/pitcher familiarity. In fact, in the first line of his article, he declares, “There is no such thing as the ‘times through the order’ penalty!”

If that is true, this is a major revelation which has slipped through the cracks in the sabermetric community and its readership. I don’t believe it is, however.

As one of the primary researchers (along with Tom Tango) of the TTOP, I was taken quite aback by Russell’s conclusion, not because I was personally affronted (the “truth” is not a matter of opinion), but because my research suggested that pitch count or fatigue was likely not a significant part of the penalty. In my BP article on the TTOP a little over 2 years ago, I wrote this: “…the TTOP is not about fatigue. It is about familiarity. The more a batter sees a pitcher’s delivery and repertoire, the more likely he is to be successful against him.” What was my evidence?

First, I looked at the number of pitches thrown going into the second, third, and fourth times through the order. I split that up into two groups—a low pitch count and a high pitch count. Here are those results. The numbers in parentheses are the average number of pitches thrown going into that “time through the order.”

| Times Through the Order | Low Pitch Count | High Pitch Count |
|---|---|---|
| 1 | .341 | .340 |
| 2 | .351 (28) | .349 (37) |
| 3 | .359 (59) | .359 (72) |
| 4 | .361 (78) | .360 (97) |

If Russell’s thesis were true, you should see a little more of a penalty in the “high pitch count” column on the right, which you don’t. The penalty appears to be the same regardless of whether the pitcher has thrown few or many pitches. To be fair, the difference in pitch count between the two groups is not large and there is obviously sample error in the numbers.

The second way I examined the question was this: I looked only at individual batters in each group who had seen few or many pitches in their prior PA. For example, I looked at batters in their second time through the order who had seen fewer than three pitches in their first PA, and also batters who saw more than four pitches in their first PA. Those were my two groups. I did the same thing for each time through the order. Here are those results. The numbers in parentheses are the average number of pitches seen in the prior PA, for every batter in the group combined.

 

| Times Through the Order | Low Pitch Count each Batter | High Pitch Count each Batter |
|---|---|---|
| 1 | .340 | .340 |
| 2 | .350 (1.9) | .365 (4.3) |
| 3 | .359 (2.2) | .361 (4.3) |

As you can see, if a batter sees more pitches in his first or second PA, he performs better in his next PA than if he sees fewer pitches. The effect appears to be much greater from the first to the second PA. This lends credence to the theory of “familiarity” and not pitcher fatigue. It is unlikely that 2 or 3 extra pitches would cause enough fatigue to elevate a batter’s wOBA by 8.5 points per PA (the average of 15 and 2, the “bonuses” for seeing more pitches during the first and second PA, respectively).
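The 8.5-point figure comes straight from the second and third rows of the table. As a quick check of that arithmetic (the variable names are only for illustration):

```python
# wOBA by time through the order:
# (few pitches seen in the prior PA, many pitches seen in the prior PA)
woba = {2: (0.350, 0.365), 3: (0.359, 0.361)}


def bonus_points(low, high):
    """Gain in wOBA points (thousandths) from seeing more pitches last PA."""
    return round((high - low) * 1000, 1)


bonuses = {tto: bonus_points(low, high) for tto, (low, high) in woba.items()}
average_bonus = sum(bonuses.values()) / len(bonuses)

print(bonuses)
print(average_bonus)
```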

So how did Russell come to his conclusion and is it right or wrong? I believe there is a fatal flaw in his methodology which led him to a faulty conclusion (that the TTOP does not exist).

Among other statistical tests, here is the primary one which led Russell to conclude that the TTOP is a mirage and merely a product of pitcher fatigue due to an ever-increasing pitch count:

This time, I tried something a little different. If we’re going to see a TTOP that is drastic, the place to look for it is as the lineup turns over. I isolated all cases in which a pitcher was facing the ninth batter in the lineup for the second time and then the first batter in the lineup for the third time. To make things fair, neither hitter was allowed to be the pitcher (this essentially limited the sample to games in AL parks), and the hitters needed to be faced in the same inning. Now, because the leadoff hitter is usually a better hitter, we need to control for that. I created a control variable for all outcomes using the log odds ratio method, which controls for the skills of the batter, as well as that of the pitcher. I also controlled for whether or not the pitcher had the platoon advantage in either case.

First of all, there was no reason to limit the data to “the same inning”. Regardless of whether the pitcher faces the 9th and 1st batters in the same inning or they are split up (the 9 hitter makes the last out), since one naturally follows the other, they will always have around the same pitch count, and the leadoff hitter will always be one time through the order ahead of the number nine hitter.

Anyway, what did Russell find? He found that TTOP was not a predictor of outcome. In other words, that the effect on the #9 hitter was the same as the #1 hitter, even though the #1 hitter had faced the pitcher one more time than the #9 hitter.

I thought about this for a long time and I finally realized why that would be the case even if there was a “times through the order” penalty (mostly) independent of pitch count. Remember that in order to compare the effect of TTO on that #9 and #1 hitter, he had to control for the overall quality of the hitter. The last hitter in the lineup is going to be a much worse hitter overall than the leadoff hitter, on the average, in his sample.

So the results should look something like this if there were a true TTOP: Say the #9 batters are normally .300 wOBA batters, and the leadoff guys are .330. In this situation, the #9 batters should bat around .300 (during the second time through the order we see around a normal wOBA) but the leadoff guys should bat around .340 – they should have a 10 point wOBA bonus for facing the pitcher for the third time.

Russell, without showing us the data (he should!), presumably gets something like .305 for the #9 batters (since the pitcher has gone essentially 2 ½ times through the lineup, pitch count-wise) and the leadoff hitters should hit .335, or 5 points above their norm as well (maybe .336 since they are facing a pitcher with a few more pitches under his belt than the #9 hitter).

So if he gets those numbers, .335 and .305, is that evidence that there is no TTOP? Do we need to see numbers like .340 and .300 to support the TTOP theory rather than the PCP theory? I submit that even if Russell sees numbers like the former ones, that is not evidence that there is no TTOP and it’s all about the pitch count. I believe that Russell made a fatal error.

Here is where he went wrong:

Remember that he uses the log-odds method to compute the baseline numbers, or what he would expect from a given batter-pitcher matchup, based on their overall season numbers. In this experiment, there is no need to do that, since both batters, #1 and #9, are facing the same pitcher the same number of times. All he has to do is use each batter’s seasonal numbers to establish the baseline.

But where do those baselines come from? Well, it is likely that the #1 hitters are mostly #1 hitters throughout the season and that #9 hitters usually hit at the bottom of the order. #1 hitters get around 150 more PA than #9 hitters over a full season. Where do those extra PA come from? Some of them come from relievers of course. But many of them come from facing the starting pitcher more often per game than those bottom-of-the-order guys. In addition, #9 hitters sometimes are removed for pinch hitters late in a game against a starter such that they lose even more of those 3rd and 4th time through the order PAs. Here is a chart of the mean TTO per game versus the starting pitcher for each batting slot:

 

| Batting Slot | Mean TTO/game |
|---|---|
| 1 | 2.15 |
| 2 | 2.08 |
| 3 | 2.02 |
| 4 | 1.98 |
| 5 | 1.95 |
| 6 | 1.91 |
| 7 | 1.86 |
| 8 | 1.80 |
| 9 | 1.77 |

(By the way, if Russell’s thesis is true, bottom of the order guys have it even easier, since they are always batting when the pitcher has a higher pitch count, per time through the order. Also, this is the first time you have been introduced to the concept that the top of the order batters have it a little easier than the bottom of the order guys, and that switching spots in the order can affect overall performance because of the TTOP or PCP.)

What that does is result in the baseline for the #1 hitter being higher than for the #9 hitter, because the baseline includes more pitcher TTOP (more times facing the starter for the 3rd and 4th times). That makes it look like the #1 hitter is not getting his advantage as compared to the #9 hitter, or at least he is only getting a partial advantage in Russell’s experiment.

In other words, the #9 hitter is really a true .305 hitter and the #1 hitter is really a true .325 hitter, even though their seasonal stats suggest .300 and .330. The #9 hitters are being hurt by not facing starters late in the game compared to the average hitter and the #1 hitters are being helped by facing starters for the 3rd and 4th times more often than the average hitter.

So if #9 hitters are really .305 hitters, then the second time through the order, we expect them to hit .305, if the TTOP is true. If the #1 hitters are really .325 hitters, despite hitting .330 for the whole season, we expect them to hit .335 the third time through the order, if the TTOP is true. And that is exactly what we see (presumably).

But when Russell sees .305 and .335 he concludes, “no TTOP!” He sees what he thinks is a true .300 hitter hitting .305 after the pitcher has thrown around 65 pitches and what he thinks is a true .330 hitter hitting .335 after 68 or 69 pitches. He therefore concludes that both hitters are being affected equally even though one is batting for the second time and the other for the third time – thus, there is no TTOP!

As I have shown, those numbers are perfectly consistent with a TTOP of around 8-10 points per times through the order, which is exactly what we see.
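To make the argument concrete, here is the same example as a few lines of arithmetic. The seasonal and “true” wOBA values are the illustrative ones used above, the 10-point penalty is an assumed round number, and true talent is taken to mean performance the second time through the order.

```python
TTOP = 0.010  # assumed penalty per time through the order, in wOBA


def expected_woba(true_talent, times_through):
    """Expected wOBA if true talent is defined at the 2nd time through."""
    return true_talent + TTOP * (times_through - 2)


hitters = {
    "#9": {"seasonal": 0.300, "true": 0.305, "tto": 2},
    "#1": {"seasonal": 0.330, "true": 0.325, "tto": 3},
}

for slot, h in hitters.items():
    observed = expected_woba(h["true"], h["tto"])
    h["observed"] = round(observed, 3)
    # What Russell's comparison sees: observed versus the seasonal baseline.
    h["vs_seasonal"] = round(observed - h["seasonal"], 3)
    # What is actually happening: observed versus true talent.
    h["vs_true"] = round(observed - h["true"], 3)
    print(slot, h["observed"], h["vs_seasonal"], h["vs_true"])
```

Measured against the seasonal baselines, both hitters look 5 points better than expected, so the extra time through the order appears to do nothing. Measured against true talent, the #9 hitter gains nothing and the #1 hitter gains the full 10 points.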

Finally, I ran one other test which I think can give us more evidence one way or another. I looked at pinch hitting appearances against starting pitchers. If the TTOP is real and pitch count is not a significant factor in the penalty, we should see around the same performance for pinch hitters regardless of the pitcher’s pitch count, since the pinch hitter always faces the pitcher for the first time and the first time only. In fact, this is a test that Russell probably should have run. The only problem is sample size. Because there are relatively few pinch hitting PA versus starting pitchers, we have quite a bit of sample error in the numbers. I split the sample of pinch hitting appearances up into 2 groups: Low pitch count and high pitch count.

 

Here is what I got:

| PH TTO | Overall | Low Pitch Count | High Pitch Count |
|---|---|---|---|
| 2 | .295 (PA=4901) | .295 (PA=2494) | .293 (PA=2318) |
| 3 | .289 (PA=10774) | .290 (PA=5370) | .287 (PA=5404) |

I won’t comment on the fact that the pinch hitters performed a little better against pitchers with a low pitch count (the differences are not nearly statistically significant) other than to say that there is no evidence that pitch count has any influence on the performance of pinch hitters who are naturally facing pitchers for the first and only time. Keep in mind that the times through the order (the left column) is a good proxy for pitch count in and of itself and we also see no evidence that that makes a difference in terms of pinch hitting performance. In other words, if pitch count significantly influenced pitching effectiveness, we should see pinch hitters overall performing better when the pitcher is in the midst of his 3rd time through the order as opposed to the 2nd time (his pitch count would be around 30-35 pitches higher). We don’t. In fact, we see a worse performance (the difference is not statistically significant – one SD is 8 points of wOBA).
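For anyone who wants to check the size of that standard deviation, here is a back-of-the-envelope sketch. It assumes a per-PA wOBA standard deviation of about .5, which is a round number chosen for illustration and not a figure from the data above.

```python
import math

WOBA_SD_PER_PA = 0.5  # assumed round figure for the spread of wOBA per PA


def se_diff(pa_a, pa_b, sd=WOBA_SD_PER_PA):
    """Standard error of the difference between two independent wOBA samples."""
    return sd * math.sqrt(1.0 / pa_a + 1.0 / pa_b)


# Pinch hitters facing the starter's 2nd vs. 3rd time through the order.
se = se_diff(4901, 10774)
diff = 0.295 - 0.289

print(round(se * 1000, 1))   # SE in points of wOBA
print(round(diff / se, 2))   # observed gap in standard deviations
```

Under that assumption the standard error comes out between 8 and 9 points of wOBA, so the observed 6-point gap is less than one SD.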

 

I have to say that it is difficult to follow Russell’s chain of logic and his methodology in many of his articles because he often fails to “show his work” and he uses somewhat esoteric and opaque statistical techniques. In this case, I believe that he made a fatal mistake in his methodology as I have described above which led him to the erroneous conclusion that, “The TTOP does not exist.” I believe that I have shown fairly strong evidence that the penalty that we see pitchers incur as the game wears on is mostly or wholly a result of the TTO and not due to fatigue caused by an increasing pitch count.

I look forward to someone doing additional research to support one theory or the other.

There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That’s ridiculous of course. The outcome of a decision or the game has nothing to do with whether the decision was correct or not. Some decisions may raise or lower a team’s chances of winning from 90% and other decisions may affect a baseline of 10 or 15%.

If decision A results in a team’s theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious is that if the manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the “unwritten rule” we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second guess the manager whether he chooses A or B. All of that is silly and counter-productive.

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts and assuming controversial facts and legal theories in a light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must accept the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I’ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We’ll assume he is an average-hitting pitcher or so (I have heard that he is a poor-hitting pitcher). An average pitcher’s batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..

63.2%.

If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?

Whirrrrr……

62.1%.

So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.
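The arithmetic behind those three WE figures is simple enough to write down. Here is a minimal sketch that only re-does the subtraction, using the numbers quoted above (59.5% for an automatic out, the rounded 63% for DeGrom, 68.8% for a league-average pinch hitter). The variable and function names are mine, chosen for illustration, and the WE values themselves come from the model, not from this code.

```python
# WE figures quoted in the text, in percent (Mets lead by 1, top of the 6th,
# runners on second and third, 2 outs).
WE_AUTOMATIC_OUT = 59.5   # batter makes an out every time
WE_PITCHER_BATS = 63.0    # rounded figure used for DeGrom (63.2 average, 62.1 poor)
WE_PINCH_HITTER = 68.8    # league-average hitter

# Average pitcher's batting line from the text, in percent of plate appearances.
PITCHER_LINE = {
    "single": 10.0,
    "double_or_triple": 2.0,
    "home_run": 0.3,
    "walk": 4.0,
    "out": 83.7,
}


def we_added(we_with_batter, we_baseline=WE_AUTOMATIC_OUT):
    """Percentage points of WE a batter adds relative to an automatic out."""
    return round(we_with_batter - we_baseline, 1)


line_total = sum(PITCHER_LINE.values())          # sanity check: should be 100
pitcher_gain = we_added(WE_PITCHER_BATS)
pinch_gain = we_added(WE_PINCH_HITTER)
cost_of_letting_pitcher_hit = round(WE_PINCH_HITTER - WE_PITCHER_BATS, 1)

print(f"batting line sums to {line_total:.1f}%")
print(f"pitcher adds {pitcher_gain} points, pinch hitter adds {pinch_gain} points")
print(f"letting the pitcher hit costs {cost_of_letting_pitcher_hit} points of WE")
```

Everything downstream in the argument is a comparison against that last number, so it helps to keep it in one place.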

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let’s assume 1.5 innings, which I think is generous.

What is DeGrom’s expected RA9 for those 2 innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom’s normal expected RA9? Steamer, ZIPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter. It is equivalent to an elite starter, but not quite the level of the Kershaws, Arrietas, or even the Prices or Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations and all managers can do in making these decisions is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 in 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 and convert that into WE. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
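For anyone who wants to follow the steps in that paragraph, here is a rough sketch of the conversion. The one thing not stated explicitly in the text is the runs-to-wins factor; going from .05 × 1.7 = .085 to .0085 implies roughly 10 runs per win, so that is what I assume below. The function name and parameters are hypothetical.

```python
def we_gain_from_pitching(ra9_starter, ra9_reliever, innings,
                          leverage_index, runs_per_win=10.0):
    """WE (as a fraction) gained by using the starter instead of the reliever.

    runs_per_win=10.0 is an assumption inferred from the arithmetic in the
    text (.05 runs x 1.7 leverage -> .0085 WE), not a quoted figure.
    """
    # Runs saved over the stint: RA9 gap scaled from 9 innings to `innings`.
    runs_saved = (ra9_reliever - ra9_starter) / 9.0 * innings
    # Leverage scales how much those runs matter; runs_per_win converts to WE.
    return runs_saved * leverage_index / runs_per_win


runs_saved = (4.0 - 3.7) / 9.0 * 1.5              # about .05 runs
gain = we_gain_from_pitching(3.7, 4.0, 1.5, 1.7)  # about .0085

print(f"runs saved: {runs_saved:.3f}")
print(f"WE gained:  {gain:.4f} ({gain * 100:.2f}%)")
```

Compare that fraction of a percent with the cost of letting the pitcher bat and the size of the mismatch is obvious.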

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of over 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.7% rather than 5.05%. 4.7% is still an enormous advantage. Reducing your team’s chances of winning by 4.7% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What about if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference in 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets’ chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with was a replacement level pitcher – the worst pitcher in the major leagues? He allows around 113% league average runs, or 4.6 RA9. Difference between DeGrom and him for 1.5 innings? 2.7%, for a net loss of 3.1% by letting him hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, “Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead of DeGrom, you will increase your chances of winning by 3.1%,” what do you think he would say?
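The “tweak the inputs” exercise can be run as a small scenario table. This is only a sketch under stated assumptions: it uses the 5.8-point cost of letting the pitcher hit, the 1.7 leverage index, and the RA9 figures from the text, plus an assumed 10 runs per win (implied by the arithmetic, not quoted). The scenario labels and names are mine. Because it works from unrounded intermediate values, the outputs can differ from the figures above by a tenth or two of a percentage point.

```python
HITTING_COST = 5.8    # WE points lost by letting the pitcher hit (from the text)
LEVERAGE = 1.7        # average leverage index, 6th-7th, road team up by 1
RUNS_PER_WIN = 10.0   # assumption implied by the text's arithmetic


def pitching_gain_pct(ra9_starter, ra9_reliever, innings):
    """WE points gained by keeping the starter in instead of the reliever."""
    runs_saved = (ra9_reliever - ra9_starter) / 9.0 * innings
    return runs_saved * LEVERAGE / RUNS_PER_WIN * 100.0


# (starter RA9 third time through, reliever RA9, innings pitched)
scenarios = {
    "baseline": (3.7, 4.0, 1.5),
    "two more innings": (3.7, 4.0, 2.0),
    "Kershaw-level starter": (2.9, 4.0, 1.5),
    "replacement-level reliever": (3.7, 4.6, 1.5),
}

net_cost = {
    name: round(HITTING_COST - pitching_gain_pct(*inputs), 2)
    for name, inputs in scenarios.items()
}

for name, cost in net_cost.items():
    print(f"{name:28s} net cost of letting him hit: {cost:.2f} WE points")
```

Every scenario still comes out well on the side of pinch hitting, which is the “summary judgment” idea: tilt each input toward leaving the pitcher in and see whether the sign flips.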

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the best possible light of allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

As an addendum to my article on platoon splits from a few days ago, I want to give you a simple trick for answering a question about a player, such as, “Given that a player performs X in time period T, what is the average performance we can expect in the future (or present, which is essentially the same thing, or at least a subset of it)?” I also want to illustrate the folly of using unusual single-season splits for projecting the future.

The trick is to identify as many players as you can in some period of time in the past (the more, the better, but sometimes the era matters so you often want to restrict your data to more recent years) that conform to the player in question in relevant ways, and then see how they do in the future. That always answers your question as best as it can. The certainty of your answer depends upon the sample size of the historical performance of similar players. That is why it is important to use as many players and as many years as possible, without causing problems by going too far back in time.

For example, say you have a player whom you know nothing about other than that he hit .230 in one season of 300 AB. What do you expect that he will hit next year? Easy to answer. There are thousands of players who have done that in the past. You can look at all of them and see what their collective BA was in their next season. That gives you your answer. There are other more mathematically rigorous ways to arrive at the same answer, but much of the time the “historical similar player method” will yield a more accurate answer, especially when you have a large sample to work with, because it captures all the things that your mathematical model may not. It is real life! You can’t do much better than that!

You can of course refine your “similar players” comparative database if you have more information about the player in question. He is left-handed? Use only left-handers in your comparison. He is 25? Use only 25-year-olds. What if you have so much information about the player in question that your “comp pool” starts to be too small to have a meaningful sample size (which only means that the certainty of your answer decreases, but not necessarily the accuracy)? Let’s say that he is 25, left-handed, 5’10” and 170 pounds, he hit .273 in 300 AB, and you want to include all of these things in your comparison. That obviously will not apply to too many players in the past. Your sample size of “comps” will be small. In that case, you can use players who were between the ages of 24 and 26, stood between 5’9” and 5’11”, weighed between 160 and 180 pounds, and hit .265-.283 in 200 to 400 AB. It doesn’t have to be those exact numbers, but as long as you are not biasing your sample compared to the player in question, you should arrive at an accurate answer to your question.

What if we do that with a .230 player in 300 AB? I’ll use .220 to .240 and between 200 and 400 AB. We know intuitively that we have to regress the .230 towards the league average around 60 or 65%, which will yield around .245 as our answer. But we can do better using actual players and actual data. Of course our answer depends on the league average BA for our player in question and the league average BA for the historical data. Realistically, we would probably use something like BA+ (BA as compared to league-average batting average) to arrive at our answer. Let’s try it without that. I looked at all players who batted in that range from 2010-2014 in 200-400 AB and recorded their collective BA the next year. If I wanted to be a little more accurate (for this question it is probably not necessary), I might weight the results in year 2 by the AB in year 1, or use the delta method, or something like that.

If I do that for just 5 years, 2010-2014, I get 49 players who hit a collective .230 in year 1 in an average of 302 AB. The next year, they hit a collective .245, around what we would expect. That answers our question, “What would a .230 hitter in 300 AB hit next year, assuming he were allowed to play again (we don’t know from the historical data what players who were not allowed to play would hit)?”
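Here is a minimal sketch of the “historical similar player method” described above. The data are made up purely for illustration (they are not the 49 players from my sample), and the `SeasonPair` record and `comp_projection` function are hypothetical names. The sketch pools hits over at-bats, so year 2 is implicitly weighted by year-2 AB; weighting by year-1 AB, as mentioned earlier, would be a small change.

```python
from dataclasses import dataclass


@dataclass
class SeasonPair:
    """One player's consecutive seasons: year-1 and year-2 at-bats and hits."""
    ab1: int
    h1: int
    ab2: int
    h2: int


def comp_projection(pairs, ba_lo, ba_hi, ab_lo, ab_hi):
    """Return (number of comps, collective year-1 BA, collective year-2 BA)."""
    comps = [
        p for p in pairs
        if ab_lo <= p.ab1 <= ab_hi and ba_lo <= p.h1 / p.ab1 <= ba_hi
    ]
    if not comps:
        raise ValueError("no comparable players; widen the criteria")
    ba1 = sum(p.h1 for p in comps) / sum(p.ab1 for p in comps)
    ba2 = sum(p.h2 for p in comps) / sum(p.ab2 for p in comps)
    return len(comps), ba1, ba2


# Hypothetical toy data, NOT real players.
pairs = [
    SeasonPair(300, 69, 350, 87),    # .230 in year 1
    SeasonPair(250, 57, 200, 48),    # .228
    SeasonPair(380, 90, 400, 99),    # .237
    SeasonPair(320, 96, 500, 140),   # .300, outside the BA range
    SeasonPair(150, 34, 100, 22),    # too few AB
]

n, ba1, ba2 = comp_projection(pairs, 0.220, 0.240, 200, 400)
print(f"{n} comps: year 1 BA {ba1:.3f}, year 2 BA {ba2:.3f}")
```

The year-2 number is the projection. Adding criteria such as age or handedness just means adding fields to the record and conditions to the filter, at the price of a smaller comp pool.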

What about .300 in 400 AB? I looked at all players from .280 to .350 in year 1 and between 300 and 450 AB. They hit a collective .299 in year 1 and .270 in year 2. Again, that answers the question, “What do we expect Player A to hit next year if he hit .300 this year in around 400 AB?”

For Siegrest with the -47 reverse split, we can use the same method to answer the question, “What do we expect his platoon split to be in the future given 230 TBF versus lefties in the past?” That is such an unusual split that we might have to tweak the criteria a little and then extrapolate. Remember that asking the question, “What do we expect Player A to do in the future?” is almost exactly the same thing as asking, “What is his true talent with respect to this metric?”

I am going to look at only one season for pitchers with around 200 BF versus lefties even though Siegrest’s 230 TBF versus lefties was over several seasons. It should not make much difference as the key is the number of lefty batters faced. I included all left-handed pitchers with at least 150 TBF versus LHB who had a reverse wOBA platoon difference of more than 10 points and pitched again the next year. Let’s see how they do, collectively, in the next year.

There were 76 such pitchers from 2003-2014. They had a collective platoon differential of -39 points, less than Siegrest’s -47 points, in an average of 194 TBF versus LHB, also less than Siegrest’s 231. But, we should be in the ballpark with respect to estimating Siegrest’s true splits using this “in vivo” method. How did they do in the next year, which is a good proxy (an unbiased estimate) for their true splits?

In year 2, they had an average TBF versus lefties of 161, a little less than the previous year, which is to be expected, and their collective platoon splits were plus 8.1 points. So they went from -39 to plus 8.1 in one season to the next because one season of reverse splits is mostly a fluke as I explained in my previous article on platoon splits. 21 points is around the average for LHP with > 150 TBF v. lefties in this time period, so these pitchers moved 47 points from year 1 to year 2, out of a total of 60 points from year 1 to league average. That is a 78% regression toward the mean, around what we estimated Siegrest’s regression should be (I think it was 82%). That suggests that our mathematical model is good since it creates around the same result as when we used our “real live players” method.

How much would it take to estimate a true reverse split for a lefty? Let’s look at some more numbers. I’ll raise the bar to lefty pitchers with at least a 20 point reverse split. There were only 57 in those 12 years of data. They had a collective split in year 1 of -47, just like Siegrest, in an average of 191 TBF v. LHB. How did they do in year 2, which is the answer to our question of their true split? Plus 6.4 points. That is a 78% regression, the same as before.

What about pitchers with at least a 25 point reverse split? They averaged -51 points in year 1. Can we get them to a true reverse split?  Nope. Not even close.

What if we raise the sample size bar? I’ll do at least 175 TBF and -15 reverse split in year 1. Only 45 lefty pitchers fit this bill and they had a -43 point split in year 1 in 209 TBF v. lefties. Next year? Plus 2.8 points! Close but no cigar. There is of course an error bar around only 45 pitchers with 170 TBF v. lefties in year 2, but we’ll take those numbers on faith since that’s what we got. That is a 72% regression with 209 TBF v. lefties, which is about what we would expect given that we have a slightly larger sample size than before.
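The regression percentages in the last few paragraphs all come from the same ratio: how far the group moved from year 1 to year 2, divided by how far year 1 was from the league-average split of 21 points. A small sketch using the group averages quoted above (the function name and group labels are mine):

```python
LEAGUE_SPLIT = 21.0   # average platoon split (wOBA points) for these lefties


def regression_pct(year1, year2, league_avg=LEAGUE_SPLIT):
    """Percent of the distance to league average covered from year 1 to year 2."""
    return (year2 - year1) / (league_avg - year1) * 100.0


# (year-1 collective split, year-2 collective split), in wOBA points
groups = {
    "reverse split > 10 pts, 150+ TBF": (-39.0, 8.1),
    "reverse split >= 20 pts": (-47.0, 6.4),
    "reverse split >= 15 pts, 175+ TBF": (-43.0, 2.8),
}

regressions = {name: regression_pct(y1, y2) for name, (y1, y2) in groups.items()}

for name, pct in regressions.items():
    print(f"{name:36s} {pct:.1f}% regression toward the mean")
```

The first two groups land a little above 78% and the third a little under 72%, in line with the rounded figures in the text. The same ratio works for any metric as long as you use the right league average for the population in question.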

So please, please, please, when you see or hear of a pitcher with severe reverse splits in 200 or so BF versus lefties, which is around a full year for a starting pitcher or 2 or 3 years for a reliever, remember that our best estimate of his true platoon split, or what his manager should expect when he sends him out there, is very, very different from what those actual one or three year splits suggest when those actual splits are very far away from the norm. Most of that unusual split, in either direction – almost all of it in fact – is likely a fluke. When we say “likely” we also mean that we must assume that it is a fluke and that we must also assume that the true number is the weighted mean of all the possibilities, which are those year 2 numbers, or year 1 (or multiple years) heavily regressed toward the league average.

 

In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil, I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average than the ones who do not, assuming that the latter were allowed to stay in the game of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.
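The logic in #2 is just a weighted average. Here is a quick sketch using the numbers from the example above (4.0 true talent, 4.3 for those who stay, a supposed 4.8 for those who are pulled). The function name is mine, and the 74% share in the second call is only an illustrative alternative proportion. The 4.55 figure corresponds to an even split; a different proportion changes the size of the implied penalty but not the direction.

```python
def pooled_ra9(ra9_stay, ra9_go, share_stay=0.5):
    """RA9 if everyone were allowed to continue, weighted by group share."""
    return share_stay * ra9_stay + (1.0 - share_stay) * ra9_go


TRUE_TALENT = 4.0
RA9_STAY = 4.3   # observed: those who stay pitch at talent plus the TTOP
RA9_GO = 4.8     # supposition: those pulled would have been .5 runs worse

all_pitchers = pooled_ra9(RA9_STAY, RA9_GO)        # even split
implied_ttop = all_pitchers - TRUE_TALENT

uneven = pooled_ra9(RA9_STAY, RA9_GO, share_stay=0.74)
implied_ttop_uneven = uneven - TRUE_TALENT

print(f"even split:  pooled RA9 {all_pitchers:.2f}, implied TTOP {implied_ttop:.2f}")
print(f"74/26 split: pooled RA9 {uneven:.2f}, implied TTOP {implied_ttop_uneven:.2f}")
```

Either way, crediting managers with that much foresight requires the real penalty to be noticeably larger than the one we measure.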

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision as far as whether or not he should remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That’s a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue especially when close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0 or the game is tied 0-0, 1-1, 2-2, and the starter’s pitch count is not high, managers are typically loath to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP and then adjusting for batter and park.

Surely their managers, in allowing them to pitch in a very close game in the 7th know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them to weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitch through 6 were allowed to continue even in games that were not close. No different from the close games. The other numbers are similar too. The ones who are allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.
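To put the two groups side by side, here is a small sketch that rebuilds the close-game expectation from its pieces (seasonal RA9, plus the TTOP, minus the batter and park adjustment) and then compares expected with actual 7th-inning RA9 for both groups. All figures are the ones quoted above; the names are mine, and the component breakdown for the non-close games is not given, so only its 4.50 total is used.

```python
def expected_ra9(season_ra9, ttop, context_adjustment):
    """Seasonal RA9 plus the TTOP, minus the batter/park adjustment."""
    return season_ra9 + ttop - context_adjustment


# Close games: 4.39 seasonal, .33 TTOP, about .28 for weaker batters and park.
close_expected = round(expected_ra9(4.39, 0.33, 0.28), 2)   # 4.44, quoted as ~4.45

# Expected and actual 7th-inning RA9 as quoted in the text.
results = {
    "close (tied or 1-run)": {"expected": 4.45, "actual": 4.70},
    "not close (4+ runs)": {"expected": 4.50, "actual": 4.78},
}

shortfall = {
    name: round(r["actual"] - r["expected"], 2) for name, r in results.items()
}

print(f"close-game expectation from components: {close_expected:.2f}")
for name, gap in shortfall.items():
    print(f"{name:24s} pitched {gap:.2f} runs per 9 worse than expected")
```

If managers were picking the right starters to leave in, the close-game shortfall would be clearly smaller than the other one, or negative. It is neither.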

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should be (and presumably think they are) making very careful decisions about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn’t really matter who they pitch, in terms of the WE of the game, and the decision-making goal should be oriented toward the long-term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If we did, we would see much better performance from our starters than in not-so-close games and good performance in general. Instead we see rather poor performance, replacement level reliever numbers in the 7th inning of both close and not-so-close games. Surely that belies the, “Managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not,” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In all games in which a starter allowed 1 or 0 runs through 6 even though his FIP was high (greater than 4, suggesting that he really wasn’t pitching such a great game), his manager let him continue 78% of the time, which was more than the 74% overall that starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP, less than 3, suggesting that he pitched better than his RA suggests, managers let him continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than component results when deciding whether to pull their starter in the 7th. We know that that is not a good decision-making process as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.

In addition, there is no evidence that managers can correctly determine who should stay and who to pull in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters has gas left in their tank and which do not? They don’t. They really, really, really don’t.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.
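The arithmetic in this note can be sketched in a few lines of Python. This is only an illustration of the subtraction described above; the function name and its parameters are my own labels, not anything from the original analysis.

```python
def expected_ra9_excluding_sample(season_ra9, season_ip, sample_runs, sample_ip):
    """Baseline RA9 for innings outside a known sample (hypothetical helper).

    Removes the runs and innings of the sample (here, the first 6 innings
    of the game in question) from the pitcher's seasonal totals.
    """
    season_runs = season_ra9 * season_ip / 9.0
    return (season_runs - sample_runs) / (season_ip - sample_ip) * 9.0


# Dealing starters: 4.40 seasonal RA9, 150 IP, 1 run allowed through 6.
dealing = expected_ra9_excluding_sample(4.40, 150, 1.0, 6)

# Shelled starters: same seasonal line, 4 runs allowed through 6.
shelled = expected_ra9_excluding_sample(4.40, 150, 4.0, 6)

print(round(dealing, 2), round(shelled, 2))
```

With the numbers from the note, this reproduces the 4.52 and 4.33 baselines.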

The other day I wrote that pitcher performance through 6 innings, as measured solely by runs allowed, is not a good predictor of performance in the 7th inning. Whether a pitcher is pitching a shutout or has allowed 4 runs thus far, his performance in the 7th is best projected mostly by his full-season true talent level plus a times through the order penalty of around .33 runs per 9 innings (the average batter faced in the 7th inning appears for the 3rd time). Pitch count has a small effect on those late inning projections as well.

Obviously if you have allowed no or even 1 run through 6 your component results will tend to be much better than if you have allowed 3 or 4 runs, however there is going to be some overlap. Some small proportion of 0 or 1 run starters will have allowed a HR, 6 or 7 walks and hits, and few if any strikeouts. Similarly, some small percentage of pitchers who allow 3 or 4 runs through 6 will have struck out 7 or 8 batters and only allowed a few hits and walks.

If we want to know whether pitching ”well” or not through 6 innings has some predictive value for the 7th (and later) inning, it is better to focus on things that reflect the pitcher’s raw performance than simply runs allowed. It is an established fact that pitchers have little control over whether their non-HR batted balls fall for hits or outs or whether their hits and walks get “clustered” to produce lots of runs or are spread out such that few if any runs are scored.

It is also established that the components most under control by a pitcher are HR, walks, and strikeouts, and that pitchers who excel at the K, and limit walks and HR tend to be the most talented, and vice versa. It also follows that when a pitcher strikes out a lot of batters in a game and limits his HR and walks total that he is pitching “well,” regardless of how many runs he has allowed – and vice versa.

Accordingly, I have extended my inquiry into whether pitching “well” or not has some predictive value intra-game to focus on in-game FIP rather than runs allowed.  My intra-game FIP is merely HR, walks, and strikeouts per inning, using the same weights as are used in the standard FIP formula – 13 for HR, 3 for walks and 2 for strikeouts.

So, rather than defining dealing as allowing 1 or fewer runs through 6 and not dealing as 3 or more runs, I will define the former as an FIP through 6 innings below some maximum threshold and the latter as above some minimum threshold. Although I am not nearly convinced that managers and pitching coaches, and certainly not the casual fan, look much further than runs allowed, I think we can all agree that they should be looking at these FIP components instead.
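As a rough sketch of how such an in-game FIP and the dealing/not-dealing split might be computed: the weights (13, 3, 2) are the ones given above, and the cutoffs of 3 and 4 are the ones used in the tables that follow. The additive constant is an assumption on my part. Standard FIP adds a league constant (commonly a little above 3) to put the number on an ERA-like scale, but I do not state here what value, if any, was used, so `FIP_CONSTANT` below is purely illustrative.

```python
FIP_CONSTANT = 3.10  # assumed league constant; not specified in the article


def in_game_fip(hr, bb, k, ip, constant=FIP_CONSTANT):
    """FIP-style number from one game's HR, BB and K totals (illustrative)."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


def classify(fip, dealing_max=3.0, not_dealing_min=4.0):
    """Bucket a start using the cutoffs from the tables (below 3, above 4)."""
    if fip < dealing_max:
        return "dealing"
    if fip > not_dealing_min:
        return "not-dealing"
    return "in-between"


# Three made-up 6-inning lines: (HR, BB, K)
sharp = in_game_fip(hr=0, bb=1, k=7, ip=6)
shaky = in_game_fip(hr=1, bb=4, k=2, ip=6)
middling = in_game_fip(hr=0, bb=2, k=3, ip=6)

print(classify(sharp), classify(shaky), classify(middling))
```

Starts that land between the two cutoffs fall into neither group, which is why the game counts in the tables do not add up to all starts.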

Here is the same data that I presented in my last article, this time using FIP rather than runs allowed to differentiate pitchers who have been pitching very well through 6 innings or not.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | Avg runs allowed through 6 | # of Games | RA9 in the 7th inning
Dealing (FIP less than 3 through 6) | 1.02 | 5,338 | 4.39
Not-dealing (FIP greater than 4) | 2.72 | 3,058 | 5.03

The first thing that should jump out at you is while our pitchers who are not pitching well do indeed continue to pitch poorly, our dealing pitchers, based upon K, BB, and HR rate over the first 6 innings, are not exactly breaking the bank either in the 7th inning.

Let’s put some context into those numbers.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | True talent level based on season RA9 | Expected RA9 in 7th | RA9 in the 7th inning
Dealing (FIP less than 3 through 6) | 4.25 | 4.50 | 4.39
Not-dealing (FIP greater than 4) | 4.57 | 4.62 | 5.03

As you can see, our new dealing pitchers are much better pitchers. They normally allow 4.25 runs per 9 innings during the season. Yet they allow 4.39 runs per 9 in the 7th despite pitching very well through 6, irrespective of runs allowed (and of course they allow few runs too). In other words, we have eliminated those pitchers who allowed few runs but may have actually pitched badly or at least not as well as their meager runs allowed would suggest. All of these dealing pitchers had some combination of high K rates, and low BB and HR rates through 6 innings. But still, we see only around .1 runs per 9 in predictive value – not significantly different from zero or none.

On the other hand, pitchers who have genuinely been pitching badly, at least in terms of some combination of a low K rate and high BB and HR rates, do continue to pitch around .4 runs per 9 innings worse than we would expect given their true talent level and the TTOP.

There is one other thing that is driving some of the difference. Remember that in our last inquiry we found that pitch count was a factor in future performance. We found that while pitchers who only had 78 pitches through 6 innings pitched about as well as expected in the 7th, pitchers with an average of 97 pitches through 6 performed more than .2 runs worse than expected.

In our above 2 groups, the dealing pitchers averaged 84 pitches through 6 and the non-dealing 88, so we expect some bump in the 7th inning performance of the latter group because of a touch of fatigue, at least as compared to the dealing group.

So when we use a more granular approach to determining whether pitchers have been dealing through 6, there is not any evidence that it has much predictive value – the same thing we concluded when we looked at runs allowed only. These pitchers pitched only .11 runs per 9 better than expected.

On the other hand, if pitchers have been pitching poorly for 6 innings, as reflected in the components in which they exert the most control, K, BB, and HR rates, they do in fact pitch worse than expected, even after accounting for a slight elevation in pitch count as compared to the dealing pitchers. That decrease in performance is about .4 runs per 9.

I also want to take this time to state that based on this data and the data from my previous article, there is little evidence that managers are able to identify when pitchers should stay in the game or should be removed. We are only looking at pitchers who were chosen to continue pitching in the 7th inning by their managers and coaches. Yet, the performance of those pitchers is worse than their seasonal numbers, even for the dealing pitchers. If managers could identify those pitchers who were likely to pitch well, whether they had pitched well in prior innings or not, clearly we would see better numbers from them in the 7th inning. At best a dealing pitcher is able to mitigate his TTOP, and a non-dealing pitcher who is allowed to pitch the 7th pitches terribly, which does not bode well for the notion that managers know whom to pull and whom to keep in the game.

For example, in the above charts, we see that dealing pitchers threw .14 runs per 9 worse than their seasonal average – which also happens to be exactly at league average levels. The non-dealing pitchers, who were also deemed fit to continue by their managers, pitched almost ½ run worse than their seasonal performance and more than .6 runs worse than the league average pitcher. Almost any reliever in the 7th inning would have been a better alternative than either the dealing or non-dealing pitchers. Once again, I have yet to see some concrete evidence that the ubiquitous cry from some of the sabermetric naysayers, “Managers know more about their players’ performance prospects than we do,” has any merit whatsoever.


Almost everyone, to a man, thinks that a manager’s decision as to whether to allow his starter to pitch in the 6th, 7th, or 8th (or later) innings of an important game hinges, at least in part, on whether said starter has been dealing or getting banged around thus far in the game.

Obviously there are many other variables that a manager can and does consider in making such a decision, including pitch count, times through the order (not high in a manager’s hierarchy of criteria, as analysts have been pointing out more and more lately), the quality and handedness of the upcoming hitters, and the state of the bullpen, both in terms of quality and availability.

For the purposes of this article, we will put aside most of these other criteria. The two questions we are going to ask are these:

  • If a starter is dealing thus far, say, in the first 6 innings, and he is allowed to continue, how does he fare in the very next inning? Again, most people, including almost every baseball insider, (player, manager, coach, media commentator, etc.), will assume that he will continue to pitch well.
  • If a starter has not been dealing, or worse yet, he is achieving particularly poor results, these same folks will usually argue that it is time to take him out and replace him with a fresh arm from the pen. As with the starter who has been dealing, the presumption is that the pitcher’s bad performance over the first, say, 6 innings, is at least somewhat predictive of his performance in the next inning or two. Is that true as well?

Keep in mind that one thing we are not able to look at is how a poorly performing pitcher might perform if he were left in a game, even though he was removed. In other words, we can’t do the controlled experiment we would like – start a bunch of pitchers, track how they perform through 6 innings and then look at their performance through the next inning or two.

So, while we have to assume that, in some cases at least, when a pitcher is pitching poorly and his manager allows him to pitch a while longer, that said manager still had some confidence in the pitcher’s performance over the remaining innings, we also must assume that if most people’s instincts are right, the dealing pitchers through 6 innings will continue to pitch exceptionally well and the not-so dealing pitchers will continue to falter.

Let’s take a look at some basic numbers before we start to parse them and do some necessary adjustments. The data below is from the AL only, 2003-2013.

 

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | # of Games | RA9 in the 7th inning
Dealing (0 or 1 run allowed through 6) | 5,822 | 4.46
Not-dealing (3 or more runs allowed through 6) | 2,960 | 4.48

First, let me explain what “RA9 in the 7th inning” means: It is the average number of runs allowed by the starter in the 7th inning extrapolated to 9 innings, i.e. runs per inning in the 7th multiplied by 9. Since the starter is often removed in the middle of the 7th inning whether he has been dealing or not, I calculated his runs allowed in the entire inning by adding together his actual runs allowed while he was pitching plus the run expectancy of the average pitcher when he left the game, scaled to his talent level and adjusted for time through the order, based on the number of outs and base runners.
For example, let’s say that a starter who is normally 10% worse than a league average pitcher allowed 1 run in the 7th inning and then left with 2 outs and a runner on first base. He would be charged with allowing 1 plus (.231 * 1.1 * 1.08) runs or 1.274 runs in the 7th inning. The .231 is the average run expectancy for a runner on first base and 2 outs, the 1.1 multiplier is because he is 10% worse than a league average pitcher, and the 1.08 multiplier is because most batters in the 7th inning are appearing for the 3rd time (TTOP). When all the 7th inning runs are tallied, we can convert them into a runs per 9 innings or the RA9 you see in the chart above.
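Here is a small sketch of that bookkeeping. The .231 run expectancy, the 1.1 talent multiplier and the 1.08 TTOP multiplier are the values from the example above; the function names are just labels I made up for illustration.

```python
def charged_runs(actual_runs, run_expectancy, talent_factor, ttop_factor=1.08):
    """Runs charged to a starter for a 7th inning he did not finish.

    actual_runs    : runs scored while he was still pitching
    run_expectancy : league-average run expectancy for the base/out state
                     he left behind (0 if he finished the inning)
    talent_factor  : 1.10 means 10% worse than a league-average pitcher
    ttop_factor    : third-time-through-the-order multiplier
    """
    return actual_runs + run_expectancy * talent_factor * ttop_factor


def ra9_from_innings(charged_runs_per_inning):
    """Convert a list of per-inning charged runs into runs per 9 innings."""
    return 9.0 * sum(charged_runs_per_inning) / len(charged_runs_per_inning)


# The example above: 1 run in, then removed with 2 outs and a runner on first.
example = charged_runs(actual_runs=1, run_expectancy=0.231, talent_factor=1.10)
print(round(example, 3))
```

Summing those charged runs over every 7th inning in a group and scaling to 9 innings, as `ra9_from_innings` does, gives the RA9 figures shown in the chart.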

At first glance it appears that whether a starter has been dealing in prior innings or not has absolutely no bearing on how he is expected to pitch in the following inning, at least with respect to those pitchers who were allowed to remain in the game past the 6th inning. However, we have different pools of pitchers, batters, parks, etc., so the numbers will have to be parsed to make sure we are comparing apples to apples.

Let’s add some pertinent data to the above chart:

Starters through 6 | RA9 in the 7th | Seasonal RA9
Dealing | 4.46 | 4.29
Not-dealing | 4.48 | 4.46

As you can see, the starters who have been dealing are, not surprisingly, better pitchers. However, interestingly, we have a reverse hot and cold effect. The pitchers who have allowed only 1 run or less through 6 innings pitch worse than expected in the 7th inning, based on their season-long RA9. Many of you will know why – the times through the order penalty. If you have not read my two articles on the TTOP (and I suggest you do), the gist is that a starting pitcher fares worse and worse each time through the order, to the tune of about .33 runs per 9 innings each time he faces the entire lineup. In the 7th inning, the average TTO is 3.0, so we expect our good pitchers, the ones with the 4.29 RA9 during the season, to average around 4.76 RA9 in the 7th inning (the 3rd time through the order, a starter pitches about .33 runs per 9 worse than he pitches overall, and the seasonal adjustment – see the note above – adds another .14 runs). They actually pitch to the tune of 4.46 or .3 runs better than expected after considering the TTOP. What’s going on there?

Well, as it turns out, there are 3 contextual factors that depress a dealing starter’s results in the 7th inning that have nothing to do with his performance in the 6 previous innings:

  • The batters that a dealing pitcher is allowed to face are 5 points lower in wOBA than the average batter that each faces over the course of the season, after adjusting for handedness. This should not be surprising. If any starting pitcher is allowed to pitch the 7th inning, it is likely that the batters in that inning are slightly less formidable or more advantageous platoon-wise, than is normally the case. Those 5 points of wOBA translate to around .17 runs per 9 innings, reducing our expected RA9 to 4.59.
  • The parks in which we find dealing pitchers are not-surprisingly, slightly pitcher friendly, with an average PF of .995, further reducing our expectation of future performance by .02 runs per 9, further reducing our expectation to 4.57.
  • The temperature in which this performance occurs is also slightly more pitcher friendly by around a degree F, although this would have a de minimis effect on run scoring (it takes about a 10 degree difference in temperature to move run scoring by around .025 runs per game).

So our dealing starters pitch .11 runs per 9 innings better than expected, a small effect, but nothing to write home about, and well within the range of values that can be explained purely by chance.
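To make the chain of adjustments for the dealing starters explicit, here it is as a short Python tally. Every number comes from the paragraphs above; the temperature effect is left out because it is negligible, and the variable names are only illustrative.

```python
seasonal_ra9 = 4.29   # dealing starters' season-long RA9
actual_7th = 4.46     # their observed RA9 in the 7th inning

# (label, runs per 9) adjustments applied to the seasonal baseline
adjustments = [
    ("TTOP, 3rd time through the order", +0.33),
    ("seasonal baseline correction",     +0.14),
    ("weaker batters faced (5 pts wOBA)", -0.17),
    ("pitcher-friendly parks (PF .995)",  -0.02),
]

expected_7th = seasonal_ra9
for label, delta in adjustments:
    expected_7th += delta
    print(f"{label:36s} {delta:+.2f} -> {expected_7th:.2f}")

better_than_expected = expected_7th - actual_7th
print(f"actual {actual_7th:.2f}, better than expected by {better_than_expected:.2f}")
```

The running total passes through 4.76 and 4.59 on its way to 4.57, which is where the .11 runs per 9 figure comes from.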

What about the starters who were not dealing? They out-perform their seasonal RA9 plus the TTOP by around .3 runs per 9. The batters they face in the 7th inning are 6 points worse than the average league batter after adjusting for the platoon advantage, and the average park and ambient temperature tend to slightly favor the hitter. Adjusting their seasonal RA9 to account for the fact that they pitched poorly through 6 (see my note at the beginning of this article), we get an expectation of 4.51. So these starters fare almost exactly as expected (4.48 to 4.51) in the 7th inning, after adjusting for the batter pool, despite allowing 3 or more runs for the first 6 innings. Keep in mind that we are only dealing with data from around 9,000 BF. One standard deviation in “luck” is around 5 points of wOBA which translates to around .16 runs per 9.

It appears to be quite damning that starters who are allowed to continue after pitching 6 stellar or mediocre to poor innings pitch almost exactly as (poorly as) expected – their normal adjusted level plus .33 runs per 9 because of the TTOP – as if we had no idea how well or poorly they pitched in the prior 6 innings.

Score one for simply using a projection plus the TTOP to project how any pitcher is likely to pitch in the middle to late innings, regardless of how well or poorly they have pitched thus far in the game. Prior performance in the same game has almost no bearing on that performance. If anything, when a manager allows a dealing pitcher to continue pitching after 6 innings, when facing the lineup for the 3rd time on the average, he is riding that pitcher too long. And, more importantly, presumably he has failed to identify anything that the pitcher might be doing, velocity-wise, mechanics-wise, repertoire-wise, command-wise, results-wise, that would suggest that he is indeed on that day and will continue to pitch well for another inning or so.

In fact, whether pitchers have pitched very well or very poorly or anything in between for the first 6 innings of a game, managers and pitching coaches seem to have no ability to determine whether they are likely to pitch well if they remain in the game. The best predictor of 7th inning performance for any pitcher who is allowed to remain in the game, is his seasonal performance (or projection) plus a fixed times through the order penalty. The TTOP is approximately .33 runs per 9 innings for every pass through the order. Since the second time through the order is roughly equal to a pitcher’s overall performance, starting with the 3rd time through the lineup we expect that starter to pitch .33 runs worse than he does overall, again, regardless of how he has pitched thus far in the game. The 4th time TTO, we expect a .66 drop in performance. Pitchers rarely if ever get to throw to the order for the 5th time.
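That rule of thumb is simple enough to write down as a function. This is a minimal sketch, assuming the penalty is linear in the number of passes and that the 2nd time through equals the pitcher's overall level, as described above. The value it gives for the 1st time through is just the same line extended backward, which is my extrapolation rather than something measured here.

```python
def projected_ra9(baseline_ra9, times_through_order, ttop_per_pass=0.33):
    """Seasonal (or projected) RA9 plus a fixed times-through-the-order penalty.

    The 2nd time through the order is treated as the pitcher's overall
    level, so each pass beyond the 2nd adds ttop_per_pass runs per 9.
    """
    return baseline_ra9 + ttop_per_pass * (times_through_order - 2)


# A 4.29 RA9 starter facing the lineup for the 2nd, 3rd and 4th time
for tto in (2, 3, 4):
    print(tto, round(projected_ra9(4.29, tto), 2))
```

Batter quality, park, weather and pitch count would then be layered on top of this number; nothing about how the pitcher has fared so far in the game enters the calculation.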

Fatigue and Pitch Counts

Let’s look at fatigue using pitch count as a proxy, and see if that has any effect on 7th inning performance for pitchers who allowed 3 or more runs through 6 innings. For example, if a pitcher has not pitched particularly well, should we allow him to continue if he has a low pitch count?

Pitch count and 7th inning performance for non-dealing pitchers:

Pitch count through 6 | Expected RA9 | Actual RA9
Less than 85 (avg=78) | 4.56 | 4.70
Greater than 90 (avg=97) | 4.66 | 4.97

 

Expected RA9 accounts for the pitchers’ adjusted seasonal RA9 plus the pool of batters faced in the 7th inning including platoon considerations, as well as park and weather. The latter 2 affect the numbers minimally. As you can see, pitchers who had relatively high pitch counts going into the 7th inning but were allowed to pitch for whatever reasons despite allowing at least 3 runs thus far, fared .3 runs worse than expected, even after adjusting for the TTOP. Pitchers with low pitch counts did only about .14 runs worse than expected, including the TTOP. Those 20 extra pitches appear to account for around .17 runs per 9, not a surprising result. Again, please keep in mind that we are dealing with limited sample sizes, so these small differences are inferential suggestions and are not to be accepted with a high degree of certainty. They do point us in a certain direction, however, and one which comports with our prior expectation – at least my prior expectation.
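The comparison in that chart boils down to two subtractions and a difference of differences. A quick sketch, using only the four numbers in the chart (the dictionary keys and variable names are mine):

```python
# Non-dealing starters (3+ runs through 6), split by pitch count through 6
groups = {
    "low (avg 78 pitches)":  {"expected": 4.56, "actual": 4.70},
    "high (avg 97 pitches)": {"expected": 4.66, "actual": 4.97},
}

# How much worse than expected each group pitched in the 7th (runs per 9)
shortfall = {
    name: round(g["actual"] - g["expected"], 2) for name, g in groups.items()
}

# The part of the shortfall that tracks the extra ~20 pitches
fatigue_effect = round(
    shortfall["high (avg 97 pitches)"] - shortfall["low (avg 78 pitches)"], 2
)

print(shortfall, fatigue_effect)
```

Because expected RA9 already includes the TTOP and the batter pool, what is left over is a reasonable first guess at the pitch count effect, subject to the sample size caveat above.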

What about if a pitcher has been dealing and he also has a low pitch count going into the 7th inning? Very few managers, if any, would remove a starter who allowed zero or 1 run through 6 innings and has only thrown 65 or 70 pitches. That would be baseball blasphemy. Besides the affront to the pitcher (which may be a legitimate concern, but one which is beyond the scope of this article), the assumption by nearly everyone is that the pitcher will continue to pitch exceptionally well. After all, he is not at all tired and he has been dealing! Let’s see if that is true – that these starters continue to pitch well, better than expected based on their projections or seasonal performance plus the TTOP.

Pitch count and 7th inning performance for dealing pitchers:

Pitch count through 6 | Expected RA9 | Actual RA9
Less than 80 (avg=72) | 4.75 | 4.50
Greater than 90 (avg=96) | 4.39 | 4.44

Keep in mind that these pitchers normally allow 4.30 runs per 9 innings during the entire season (4.44 after doing the seasonal adjustment). The reason the expected RA9 is so much higher for pitchers with a low pitch count is primarily due to the TTOP. For pitchers with a high pitch count, the batters they face in the 7th are 10 points less in wOBA than league average, thus the 4.39 expected RA9, despite the usual .3 to .35 TTOP.

Similar to the non-dealing pitchers, fatigue appears to play a factor in a dealing pitcher’s performance in the 7th. However, in either case, low-pitch or high-pitch, their performance through the first 6 innings has little bearing on their 7th inning performance. With no fatigue they out-perform their expectation by .25 runs per 9. The fatigued pitchers under-performed their overall season-long adjusted talent plus the usual TTOP by .05 runs per 9.

Again, we see that there is little value to taking out a pitcher who has been getting a little knocked around or leaving in a pitcher who has been dealing for 6 straight innings. Both groups will continue to perform at around their expected full-season levels plus any applicable TTOP, with a slight increase in performance for a low-pitch count pitcher and a slight decrease for a high-pitch count pitcher. The biggest increase we see, .25 runs, is for pitchers who were dealing and had very low pitch counts.

What about if we increase our threshold to pitchers who allow 4 or more runs over 6 innings and those who are pitching a shutout?

Starters through 6 | Seasonal RA9 | Expected RA9 | 7th inning RA9
Dealing (shutouts only) | 4.23 | 4.62 | 4.70
Not-dealing (4 or more runs) | 4.62 | 4.81 | 4.87

Here, we see no predictive value in the first 6 innings of performance. In fact, for some reason starters pitching a shutout pitched slightly worse than expected in the 7th inning, after adjusting for the pool of batters faced and the TTOP.

How about the holy grail of starters who are expected to keep lighting it up in the 7th inning – starters pitching a shutout and with a low pitch count? These were true talent 4.25 pitchers facing better than average batters in the 7th, mostly for the third time in the game, so we expect a .3 bump or so for the TTOP. Our expected RA9 was 4.78 after making all the adjustments, and the actual was 4.61. Nothing much to speak of. Their dealing combined with a low pitch count had a very small predictive value in the 7th. Less than .2 runs per 9 innings.

Conclusion

As I have been preaching for what seems like forever – and the data are in accordance – however a pitcher is pitching through X innings in a game, at least as measured by runs allowed, even at the extremes, has very little relevance with regard to how he is expected to pitch in subsequent innings. The best marker for whether to pull a pitcher or not seems to be pitch count.

If you want to know the most likely result, or the mean expected result at any point in the game, you should mostly ignore prior performance in that game and use a credible projection plus a fixed times through the order penalty, which is around .33 runs per 9 the 3rd time through, and another .33 the 4th time through. Of course the batters faced, park, weather, etc. will further dictate the absolute performance of the pitcher in question.

Keep in mind that I have not looked at a more granular approach to determining whether a pitcher has been pitching extremely well or getting shelled, such as hits, walks, strikeouts, and the like. It is possible that such an approach might yield a subset of pitching performance that indeed has some predictive value within a game. For now, however, you should be pretty convinced that run prevention alone during a game has little predictive value in terms of subsequent innings. Certainly a lot less than what most fans, managers, and other baseball insiders think.

Yesterday I looked at how and whether a hitter’s mid-season-to-date stats can help us to inform his rest-of-season performance, over and above a credible up-to-date mid-season projection. Obviously the answer to that depends on the quality of the projection – specifically how well it incorporates the season-to-date data in the projection model.

For players who were having dismal performances after the first, second, third, all the way through the fifth month of the season, the projection accurately predicted the last month’s performance and the first 5 months of data added nothing to the equation. In fact, those players who were having dismal seasons so far, even into the last month of the season, performed fairly admirably the rest of the way – nowhere near the level of their season-to-date stats. I concluded that the answer to the question, “When should we worry about a player’s especially poor performance?” was, “Never. It is irrelevant other than how it influences our projection for that player, which is not much, apparently.” For example, full-time players who had a .277 wOBA after the first month of the season, were still projected to be .342 hitters, and in fact, they hit .343 for the remainder of the season. Even halfway through the season, players who hit .283 for 3 solid months were still projected at .334 and hit .335 from then on. So, ignore bad performances and simply look at a player’s projection if you want to estimate his likely performance tomorrow, tonight, next week, or for the rest of the season.

On the other hand, players who have been hitting well-above their mid-season projections (crafted after and including the hot hitting) actually outhit their projections by anywhere from 4 to 16 points, still nowhere near the level of their “hotness,” however. This suggests that the projection algorithm is not handling recent “hot” hitting properly – at least my projection algorithm. Then again, when I looked at hitters who were projected at well-above average 2 months into the season, around .353, the hot ones and the cold ones each hit almost exactly the same over the rest of the season, equivalent to their respective projections. In that case, how they performed over those 2 months gave us no useful information beyond the mid-season projection. In one group, the “cold” group, players hit .303 for the first 2 months of the season, and they were still projected at .352. Indeed, they hit .349 for the rest of the season. The “hot” batters hit .403 for the first 2 months, they were projected to hit .352 after that and they did indeed hit exactly .352. So there would be no reason to treat these hot and cold above-average hitters any differently from one another in terms of playing time or slot in the batting order.

Today, I am going to look at pitchers. I think the perception is that because pitchers get injured more easily than position players, learn and experiment with new and different pitches, often lose velocity, their mechanics can break down, and their performance can be affected by psychological and emotional factors more easily than hitters, that early or mid-season “trends” are important in terms of future performance. Let’s see to what extent that might be true.

After one month, there were 256 pitchers or around 1/3 of all qualified pitchers (at least 50 TBF) who pitched terribly, to the tune of a normalized ERA (NERA) of 5.80 (league average is defined as 4.00). I included all pitchers whose NERA was at least 1/2 run worse than their projection. What was their projection after that poor first month? 4.08. How did they pitch over the next 5 months? 4.10. They faced 531 more batters over the last 5 months of the season.

What about the “hot” pitchers? They were projected after one month at 3.86 and they pitched at 2.56 for that first month. Their performance over the next 5 months was 3.85. So for the “hot” and “cold” pitchers after one month, their updated projection accurately told us what to expect for the remainder of the season and their performance to-date was irrelevant.

In fact, if we look at pitchers who had good projections after one month and divide those into two groups: One that pitches terribly for the first month, and one that pitches brilliantly for the first month, here is what we get:

Good pitchers who were cold for 1 month

First month: 5.38
Projection after that month: 3.79
Performance over the last 5 months: 3.75

Good pitchers who were hot for 1 month

First month: 2.49
Projection after that month: 3.78
Performance over the last 5 months: 3.78

So, and this is critical, one month into the season if you are projected to pitch above average, at, say 3.78, it makes no difference whether you have pitched great or terribly thus far. You are going to pitch at exactly your projection for the remainder of the season!

Yet the cold group faced 587 more batters and the hot group 630. Managers again are putting too much emphasis on those first-month stats.

What if you are projected after one month as a mediocre pitcher but you have pitched brilliantly or poorly over the first month?

Bad pitchers who were cold for 1 month

First month: 6.24
Projection after that month: 4.39
Performance over the last 5 months: 4.40

Bad pitchers who were hot for 1 month

First month: 3.06
Projection after that month: 4.39
Performance over the last 5 months: 4.47

Same thing. It makes no difference whether a poor or mediocre pitcher had pitched well or poorly over the first month of the season. If you want to know how he is likely to pitch for the remainder of the season, simply look at his projection and ignore the first month. Those stats give you no more useful information. Again, the “hot” but mediocre pitchers got 44 more TBF over the final 5 months of the season, despite pitching exactly the same as the “cold” group over that 5 month period.
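To put a number on “no more useful information,” here is a quick sketch that scores both candidate predictors against the four one-month groups above. The helper name is hypothetical, and the average is a simple unweighted one across the four groups (not weighted by the number of pitchers in each group, which I have not listed here).

```python
# (group, first-month NERA, projection after one month, last-5-months NERA)
one_month_groups = [
    ("good, cold", 5.38, 3.79, 3.75),
    ("good, hot", 2.49, 3.78, 3.78),
    ("bad, cold", 6.24, 4.39, 4.40),
    ("bad, hot", 3.06, 4.39, 4.47),
]


def mean_abs_error(rows, predictor_index):
    """Average absolute miss when predicting the rest of the season (column 3)."""
    return sum(abs(row[predictor_index] - row[3]) for row in rows) / len(rows)


mae_first_month = mean_abs_error(one_month_groups, 1)
mae_projection = mean_abs_error(one_month_groups, 2)
print(f"first month as predictor: {mae_first_month:.2f} runs per 9")
print(f"projection as predictor:  {mae_projection:.2f} runs per 9")
```

Using the first month as the forecast misses by about a run and a half per 9 on average, while the projection misses by a few hundredths of a run.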

What about halfway into the season? If two groups of pitchers have the same mid-season projection, but one group was “hot” over the first 3 months and the other was “cold,” do they pitch the same over the remaining 3 months? The projection algorithm does not handle the 3-month anomalous performances very well. Here are the numbers:

Good pitchers who were cold for 3 months

First 3 months: 4.60
Projection after 3 months: 3.67
Performance over the last 3 months: 3.84

Good pitchers who were hot for 3 months

First 3 months: 2.74
Projection after 3 months: 3.64
Performance over the last 3 months: 3.46

So for the hot pitchers the projection is too pessimistic by around .18 runs per 9 IP, and for the cold ones it is too optimistic by .17 runs per 9. Then again, the actual performance is much closer to the projection than to the season-to-date performance. As you can see, pitcher stats halfway through the season are a terrible proxy for true talent/future performance. These “hot” and “cold” pitchers, whose first-half performance and second-half projections diverged by at least .5 runs per 9, performed in the second half around .75 runs per 9 better or worse than in the first half. You are much better off using the mid-season projection than the actual first-half performance.

For poorer pitchers who were “hot” and “cold” for 3 months, we get these numbers:

Poor pitchers who were cold for 3 months

First 3 months: 5.51
Projection after 3 months: 4.41
Performance over the last 3 months: 4.64

Poor pitchers who were hot for 3 months

First 3 months: 3.53
Projection after 3 months: 4.43
Performance over the last 3 months: 4.33

The projection model is still not giving enough weight to the recent performance, apparently. That is especially true of the “cold” pitchers: it overvalues them by .23 runs per 9. It is likely that these pitchers are suffering from some kind of injury or velocity decline and the projection algorithm is not properly accounting for that. For the “hot” pitchers, the model undervalues these mediocre pitchers by only .1 runs per 9. Again, if you try to use the actual 3-month performance as a proxy for true talent or to project future performance, you would be making a much bigger mistake, to the tune of around .8 runs per 9.
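One way to read the 3-month numbers above is to ask what share of the gap between season-to-date performance and the projection actually carried over into the last 3 months. The sketch below does that arithmetic for the four groups. The function name is hypothetical, and this is just a back-of-the-envelope check, not how the projection algorithm itself weights recent data.

```python
# (group, first-3-months NERA, projection after 3 months, last-3-months NERA)
three_month_groups = [
    ("good, cold", 4.60, 3.67, 3.84),
    ("good, hot", 2.74, 3.64, 3.46),
    ("poor, cold", 5.51, 4.41, 4.64),
    ("poor, hot", 3.53, 4.43, 4.33),
]


def carried_over_share(to_date, projection, actual):
    """Fraction of the (to-date minus projection) gap that showed up afterward."""
    return (actual - projection) / (to_date - projection)


shares = {
    name: carried_over_share(to_date, proj, actual)
    for name, to_date, proj, actual in three_month_groups
}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the gap carried over")
```

The shares come out between roughly 11% and 21%. In other words, the projection would have needed to lean a little more toward the first 3 months, but most of the hot or cold streak still disappears.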

What about 5 months into the season? If the projection and the 5-month performance diverge, which is better? Is using those 5-month stats a bad idea?

Yes, it still is. In fact, it is a terrible idea. For some reason, the projection does a lot better after 5 months than after 3 months. Perhaps some of those injured pitchers are selected out. Even though the projection slightly under and over values the hot and cold pitchers, using their 5 month performance as a harbinger of the last month is a terrible idea. Look at these numbers:

Poor pitchers who were cold for 5 months

First 5 months: 5.45
Projection after 5 months: 4.41
Performance over the last month: 4.40

Poor pitchers who were hot for 5 months

First 5 months: 3.59
Projection after 5 months: 4.39
Performance over the last month: 4.31

For the mediocre pitchers, the projection almost nails both groups, despite it being nowhere near the level of the first 5 months of the season. I cannot emphasize this enough: Even 5 months into the season, using a pitcher’s season-to-date stats as a predictor of future performance or a proxy for true talent (which is pretty much the same thing) is a terrible idea!

Look at the mistakes you would be making. You would be thinking that the hot group was made up of 3.59 pitchers when in fact they pitched at 4.31 over the last month, right around their 4.39 projection. That is a difference of .72 runs per 9. For your cold pitchers, you would undervalue them by more than a run per 9! What do managers do after 5 months of “hot” and “cold” pitching, despite the fact that both groups pitched almost the same for the last month of the season? They give the hot group an average of 13 more TBF per pitcher. That is around a 3 inning difference in one month.

Here are the good pitchers who were hot and cold over the first 5 months of the season:

Good pitchers who were cold for 5 months

First 5 months: 4.62
Projection after 5 months: 3.72
Performance over the last month: 3.54

Good pitchers who were hot for 5 months

First 5 months: 2.88
Projection after 5 months: 3.71
Performance over the last month: 3.72

Here the “hot,” good pitchers pitched exactly at their projection despite pitching .83 runs per 9 better over the first 5 months of the season. The “cold” group actually outperformed their projection by .18 runs and pitched better than the “hot” group! This is probably a sample size blip, but the message is clear: Even after 5 months, forget about how your favorite pitcher has been pitching, even for most of the season. The only thing that counts is his projection, which utilizes many years of performance plus a regression component, and not just 5 months’ worth of data. It would be a huge mistake to use those 5-month stats to predict these pitchers’ performances.

Managers can learn a huge lesson from this. The average number of batters faced in the last month of the season among the hot pitchers was 137, or around 32 IP. For the cold group, it was 108 TBF, or 25 IP. Again, the “hot” group pitched 7 more IP in only a month, yet they pitched worse than the “cold” group and both groups had the same projection!

The moral of the story here is that for the most part, and especially at the beginning and end of the season, ignore actual pitching performance to-date and use credible mid-season projections if you want to predict how your favorite or not-so-favorite pitcher is likely to pitch tonight or over the remainder of the season. If you don’t, and that actual performance is significantly different from the updated projection, you are making a sizable mistake.

 

 

Yesterday, I posted an article describing how I modeled, to some extent, whether and by how much pitchers are able to allow fewer or more runs than their components suggest, including the more subtle components like balks, SB/CS, WP, catcher PB, GIDP, and ROE.

For various reasons, I suggest taking these numbers with a grain of salt. For one thing, I need to tweak my RA9 simulator to take into consideration a few more of these subtle components. For another, there may be some things that stick with a pitcher from year to year that have nothing to do with his “RA9 skill” but which serve to increase or decrease run scoring, given the same set of components. Two of these are a pitcher’s outfielder arms and the vagaries of his home park, which both have an effect on base runner advances on hits and outs. Using a pitcher’s actual sac flies against will mitigate this, but the sim is also using league averages for base runner advances on hits, which, as I said, can vary from pitcher to pitcher, and tend to persist from year to year (if a pitcher stays on the same team) based on his outfielders and his home park. Like DIPS, it would be better to do these correlations only on pitchers who switch teams, but I fear that the sample would be too small to get any meaningful results.

Anyway, I have a database now of the last 10 years’ differences between a pitcher’s RA9 and his sim RA9 (the runs per 27 outs generated by my sim), for all pitchers who threw to at least 100 batters in a season.

First here are some interesting categorical observations:

Jared Cross, of Steamer projections, suggested to me that perhaps some pitchers, like lefties, might hold base runners on first base better than others, and therefore depress scoring a little as compared to the sim, which uses league-average base running advancement numbers. Well, lefties actually did a hair worse in my database. Their RA9 was .02 greater than their sim RA. Righties were .01 better. That does not necessarily mean that RHP have some kind of RA skill that LHP do not have. It is more likely a bias in the sim that I am not correcting for.

How about the number of pitches in a pitcher’s repertoire? I hypothesized that pitchers with more pitches would be better able to tailor their approach to the situation. For example, with a base open, you want your pitcher to be able to throw lots of good off-speed pitches in order to induce a strikeout or weak contact, whereas you don’t mind if he walks the batter.

I was wrong. Pitchers with 3 or more pitches that they throw at least 10% of the time are .01 runs worse in RA9. Pitchers with 2 or fewer pitches are .02 runs better. I have no idea why that is.

How about pitchers who are just flat out good in their components such that their sim RA is low, like under 4.00 runs? Their RA9 is .04 worse. Again, there might be some bias in the sim which is causing that. Or perhaps if you just go out there and “air it out” and try to get as many outs and strikeouts as possible, regardless of the situation, you are not pitching optimally.

Conversely, pitchers with a sim RA of 4.5 or greater shave .03 points off their RA9. If you are over 5 in your sim RA, your actual RA9 is .07 points better and if you are below 3.5, your RA9 is .07 runs higher. So, there probably is something about having extreme components that even the sim is not picking up. I’m not sure what that could be. Or, perhaps if you are simply not that good of a pitcher, you have to find ways to minimize run scoring above and beyond the hits and walks you allow overall.

For the NL pitchers, their RA9 is .05 runs better than their sim RA, and for the AL, they are .05 runs worse. So the sim is not doing a good job with respect to the leagues, likely because of pitchers batting. I’m not sure why, but I need to fix that. For now, I’ll adjust a pitcher’s sim RA according to his league.
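A minimal sketch of that league adjustment, assuming it is nothing more than a flat shift of .05 runs in each direction. The names are hypothetical and this is not the sim’s actual code, just the sign convention spelled out.

```python
# Average (RA9 minus sim RA) by league, as reported above
LEAGUE_BIAS = {"NL": -0.05, "AL": 0.05}


def league_adjusted_sim_ra(sim_ra, league):
    """Shift sim RA by the league-wide gap so both leagues center on zero."""
    return sim_ra + LEAGUE_BIAS[league]


def ra_skill(ra9, sim_ra, league):
    """RA9 minus league-adjusted sim RA; negative means fewer runs than expected."""
    return ra9 - league_adjusted_sim_ra(sim_ra, league)


# An NL pitcher who beats his raw sim RA by exactly the league-wide .05
# shows no skill of his own once the league bias is removed.
nl_example = ra_skill(3.95, 4.00, "NL")
al_example = ra_skill(4.05, 4.00, "AL")
print(round(nl_example, 2), round(al_example, 2))
```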

You might think that younger pitchers would be “throwers” and older ones would be “pitchers” and thus their RA skill would reflect that. This time you would be right – to some extent.

Pitchers less than 26 years old were .01 runs worse in RA9. Pitchers older than 30 were .03 better. But that might just reflect the fact that pitchers older than 30 are just not very good – remember, we have a bias in terms of quality of the sim RA and the difference between that and regular RA9.

Actually, even when I control for the quality of the pitcher, the older pitchers had more RA skill than the younger ones by around .02 to .04 runs. As you can see, none of these effects, even if they are other than noise, is very large.

Finally, here are the lists of the 10 best and worst pitchers with respect to “RA skill,” with no commentary. I adjusted for the “quality of the sim RA” bias, as well as the league bias. Again, take these with a large grain of salt, considering the discussion above.

Best, 2004-2013:

Shawn Chacon -.18

Steve Trachsel -.18

Francisco Rodriguez -.18

Jose Mijares -.17

Scott Linebrink -.16

Roy Oswalt -.16

Dennys Reyes -.15

Dave Riske -.15

Ian Snell -.15

5 others tied for 10th.

Worst:

Derek Lowe .27

Luke Hochevar .20

Randy Johnson .19

Jeremy Bonderman .18

Blaine Boyer .18

Rich Hill .18

Jason Johnson .18

5 others tied for 8th place.

(None of these pitchers stand out to me one way or another. The “good” ones are not any you would expect, I don’t think.)

We showed in The Book that there is a small but palpable “pitching from the stretch” talent. That of course would affect a pitcher’s RA as compared to some kind of base runner and “timing” neutral measure like FIP or component ERA, or really any of the ERA estimators.

As well, a pitcher’s ability to tailor his approach to the situation, runners, outs, score, batter, etc., would also implicate some kind of “RA talent,” again, as compared to a “timing” neutral RA estimator.

A few months ago I looked to see if RE24 results for pitchers showed any kind of talent for pitching to the situation, by comparing that to the results of a straight linear weights analysis or even a BaseRuns measure. I found no year-to-year correlations for the difference between RE24 and regular linear weights. In other words, I was trying to see if some pitchers were able to change their approach to benefit them in certain bases/outs situations more than other pitchers. I was surprised that there was no discernible correlation, i.e., that it didn’t seem to be much of a skill if at all. You would think that some pitchers would either be smarter than others or have a certain skill set that would enable them, for example, to get more K with a runner on 3rd and less than 2 outs, more walks and fewer hits with a base open, or fewer home runs with runners on base or with 2 outs and no one on base. Obviously all pitchers, on the average, vary their approach a lot with respect to these things, but I found nothing much when doing these correlations. Essentially an “r” of zero.

To some extent the pitching from the stretch talent should show up in comparing RE24 to regular lwts, but it didn’t, so again, I was a little surprised at the results.

Anyway, I decided to try one more thing.

I used my “pitching sim” to compute a component ERA for each pitcher. I tried to include everything that would create or not create runs while he was pitching, like WP/PB, SB/CS, GIDP, and ROE, in addition to singles, doubles, triples, HR, BB, and SO. I counted an IBB as 1/2 of a BB in the sim, since I didn’t program IBB into it.

So now, for each year, I recorded the difference between a pitcher’s RA9 and his simulated component RA9, and then ran year-to-year correlations. This was again to see if I could find a “RA talent” wherever it may lie – clutch pitching, stretch talent, approach talent, etc.
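The procedure just described can be sketched roughly as follows. The data layout (a dict keyed by pitcher and year), the function names, and the toy rows are all made up for illustration. The toy rows are not real pitchers, and they are built so that each second-year difference is twice the first-year difference, which makes the correlation come out at essentially 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def year_to_year_r(seasons, min_tbf=500):
    """seasons maps (pitcher, year) -> (ra9, sim_ra9, tbf).

    Pairs each pitcher-season with the same pitcher's next season, keeps
    pairs where both years meet the TBF minimum, and correlates the
    RA9 minus sim RA9 differences across the paired years.
    """
    first, second = [], []
    for (pitcher, year), (ra9, sim, tbf) in seasons.items():
        nxt = seasons.get((pitcher, year + 1))
        if nxt is None or tbf < min_tbf or nxt[2] < min_tbf:
            continue
        first.append(ra9 - sim)
        second.append(nxt[0] - nxt[1])
    return pearson_r(first, second)


# Made-up toy rows, only to show the call shape
toy = {
    ("A", 2009): (4.10, 4.00, 800), ("A", 2010): (4.30, 4.10, 820),
    ("B", 2009): (3.70, 3.90, 700), ("B", 2010): (3.50, 3.90, 750),
    ("C", 2009): (4.50, 4.50, 300), ("C", 2010): (4.00, 4.40, 900),
    ("D", 2009): (4.00, 4.00, 600), ("D", 2010): (4.05, 4.05, 600),
}
toy_r = year_to_year_r(toy)
print(round(toy_r, 3))
```

Pitcher C drops out because his first year falls short of the 500 TBF minimum, which is the same filter I applied to the real data.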

I got a small year-to-year correlation which, as always, varied with the underlying sample size – TBF in each of the paired years. When I limited it to pitchers with at least 500 TBF in each year, I got an “r” of .142 with an average PA of 791 in each year. That comes out to a 50% regression at around 5000 PA, or 5 years for a full-time starter, similar to BABIP for pitchers. In other words, the “stabilization” point was around 5,000 TBF.
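For anyone who wants to check the arithmetic, here is the conversion from “r” to a regression point, assuming the usual relationship r = n / (n + k), where n is the average sample size and k is the sample size at which you regress 50%. The function name is just for illustration.

```python
def regression_constant(r, n):
    """Sample size k at which the regression is 50%, assuming r = n / (n + k)."""
    return n * (1 - r) / r


k = regression_constant(0.142, 791)
print(round(k))  # 4779, which I am rounding to about 5,000

# The same conversion at the two ends of the 2-sigma interval on r
k_if_r_low = regression_constant(0.072, 791)
k_if_r_high = regression_constant(0.211, 791)
print(round(k_if_r_low), round(k_if_r_high))
```

At the ends of the 2-sigma interval for “r” (.072 and .211), the same formula puts the 50% point anywhere from roughly 3,000 to roughly 10,000 TBF, so “around 5,000” is a loose estimate.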

If that .142 is accurate (at 2 sigma the confidence interval is .072 to .211), I think that is pretty interesting. For example, notable “ERA whiz” Tom Glavine from 2001 to 2006, was an average of .246 in RA9 better than his sim RA9 (simulated component RA). If we regress that difference 50%, we get .133 runs per game, which is pretty sizable I think. That is over 1/3 of a win per season. Notable “ERA hack” Ricky Nolasco from 2008 to 2010 (I only looked at 2001-2010) was an average of .357 worse in his ERA. Regress that 62.5%, and we get .134 runs worse per season, also 1/3 of a win.

So, for example, if you want to know how to reconcile fWAR (FG) and bWAR (B-R) for pitchers, take the difference and regress according to the number of TBF, using the formula 5000/(5000+TBF) for the amount of regression.

Here are a couple more interesting ones, off the top of my head. I thought that Livan Hernandez seemed like a crafty pitcher, despite having inferior stuff late in his career. Sure enough, he out-pitched his components by around .164 runs per game over 9 seasons. After regressing, that’s .105 rpg.

The other name that popped into my head was Wakefield. I always wondered if a knuckler was able to pitch to the situation as well as other pitchers could. It does not seem like they can, with only one pitch with comparatively little control. His RA9 was .143 worse than his components suggest, despite his FIP being .3 runs per 9 worse than his ERA! After regressing, he is around .095 worse than his simulated component RA.

Of course, after looking at Wake, we have to check Dickey as well. He didn’t start throwing a knuckle ball until 2005, and then only half the time. His average difference between RA9 and simulated RA9 is .03 on the good side, but our sample size for him is small with a total of only 1600 TBF, implying a regression of 76%.
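The regression formula from above, 5000/(5000+TBF), takes only a couple of lines to write out. The function names are hypothetical, and I am using the convention that a negative difference means RA9 was better than the simulated component RA.

```python
def regression_fraction(tbf, k=5000):
    """Share of the observed RA9 minus sim RA9 gap to regress toward zero."""
    return k / (k + tbf)


def regressed_gap(observed_gap, tbf, k=5000):
    """Observed gap shrunk toward zero by the regression fraction."""
    return observed_gap * (1 - regression_fraction(tbf, k))


# Dickey: .03 on the good side (so -0.03) in 1600 TBF
dickey_fraction = regression_fraction(1600)
dickey_regressed = regressed_gap(-0.03, 1600)
print(f"{dickey_fraction:.0%}")   # 76%
print(f"{dickey_regressed:.3f}")  # -0.007
```

So after regression, Dickey’s .03 shrinks to well under a hundredth of a run, which is to say we know next to nothing about him yet.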

If you want the numbers on any of your favorite or not-so-favorite pitchers, let me know in the comments section.