Note: This post was edited to include some new data which leads us in the direction of a different conclusion. The addendum is at the end of the original post .
This is another one of my attempts at looking at “conventional wisdoms” that you hear and read about all the time without anyone stopping for a second to catch their breath and ask themselves, “Is this really true?” Or more appropriately, “To what extent is this true?” Bill James used those very questions to pioneer a whole new field called sabermetrics.
As usual in science, we can rarely if ever answer questions with, “Yes it is true,” or “No, it is not true.” We can only look at the evidence and try and draw some inferences with some degree of certainty between 0 and 100%. This is especially true in sports when we are dealing with empirical data and limited sample sizes.
You often read something like, “So-and-so pitcher had a poor season (say, in ERA) but he had a few really bad outings so it wasn’t really that bad.” Let’s see if we can figure out to what extent that may or may not be true.
First I looked at all starting pitcher outings over the last 40 years, 1977-2016. I created a group of starters who had at least 4 very bad outings and at least 100 IP in one season. A “bad outing” was defined as 5 IP or less and at least 6 runs allowed, so a minimum RA9 of almost 11 in at least 4 games in a season. Had those starts been typical starts, each of these pitchers’ ERA’s or RA9 would have been at least a run less or so.
Next I only looked at those pitchers who had an overall RA9 of at least 5.00 in the seasons in question. The average RA9 for these pitchers with some really bad starts was 5.51 where 4.00 is the average starting pitcher’s RA9 in every season regardless of the run environment or league. Basically I normalized all pitchers to the average of his league and year and set the average at 4.00. I also park adjusted everything.
OK, what were these pitchers projected to do the following season? I used basic Marcel-type projections for all pitchers. The projections treated all RA9 equally. In other words a 5.51 RA with a few really bad starts was equivalent to a 5.51 RA with consistently below-average starts. The projections only used full season data (RA9).
So basically these 5.51 RA9 pitchers pitched near average for most of the their starts but had 4-6 really bad (and short) starts that upped their overall RA9 for the season by more than a run. Which was more indicative of their true talent? The vast majority of the games where they pitched around average, the few games where they blew up, or their overall runs allowed per 9 innings? Or, their overall RA9 for that season (regardless of how it was created) plus their RA9 from previous seasons and then some regression thrown in for good measure – in other words, a regular, old-fashioned projection?
Our average projection for these pitchers for the next season (which is an estimate of their true talent that season) was 4.46. How did they pitch the next season – which is an unbiased sample of their true talent (I didn’t set an innings requirement for this season so there is no survivorship bias)? It was 4.48 in 10,998 TBF! So the projection which had no idea that these were pitchers who pitched OK for most of the season but had a terrible seasonal result (5.51 RA9) because of a few terrible starts, was right on the money. All the projection model knew was that these pitchers had very bad RA9 for the season – in fact, their average RA was 138% of league average.
Of course since we sampled these pitchers based on some bad outings and an overall bad ERA (over 5.00) we know that in prior seasons their RA9 would be much lower, similar to their projection (4.46) – actually better. In fact, you should know that a projection can apply just as well to previous years as it can to subsequent years. There is almost no difference. You just have to make sure you apply the proper age adjustments.
Somewhat interestingly, if we look at all pitchers with a RA9 above 5 (an average of 5.43) who did not have the requisite very bad outings, i.e. they pitched consistently bad but with few disastrous starts, their projected RA9 was 4.45 and their actual was 4.25, in 25,479 TBF.
While we have significant sample error in these limited samples, not only is there no suggestion that you should ignore or even discount bad ERA or RA that are the result of a few horrific starts, there is a (admittedly weak) suggestion that pitchers who pitch badly but more consistently may be able to outperform their projections for some reason.
The next time you read that, “So-and-so pitcher has bad numbers but it was only because of a few really bad outings,” remember that there is no evidence that an ERA or RA which includes a “few bad outings” should be treated any differently than a similar ERA or RA without that qualification, at least as far as projections are concerned.
Addendum: I was concerned about the way I defined pitchers who had “a few disastrous starts.” I included all starters who gave up at least 6 runs in 5 innings or less at least 5 times in a season. The average number of bad starts was 5.5. So basically these were mostly pitchers who had 5 or 6 really bad starts in a season, occasionally more.
I thought that most of the time when we hear the “A few bad starts” refrain, we’re talking literally about “a few bad starts,” as in 2 or 3. So I changed the criteria to include only those pitchers with 2 or 3 awful starts. I also upped the ante on those terrible starts. Before it was > 5 runs in 5 IP or less. Now it is >7 runs in 5 IP or less – truly a blowup of epic proportions. We still had 508 pitcher seasons that fit the bill which gives us a decent sample size.
These pitchers overall had a normalized (4.00 is average) RA9 of 4.19 in the seasons in question, so 2 or 3 awful starts didn’t produce such a bad overall RA. Remember I am using a 100 IP minimum so all of these pitchers pitched at least fairly well for the season whether they had a few awful starts or not. (This is selective sampling and survivorship bias at work. Any time you set a minimum IP or PA, you select players who had above average performance, through luck and talent.)
Their next year’s projection was 3.99 and the actual was 3.89 so there is a slight inference that indeed you can discount the bad starts a little. This is in around 12,000 IP. A difference of .1 RA9 is only around 1 SD so it’s not nearly statistically significant. I also don’t know that we have any Bayesian prior to work with.
The control group – all other starters, namely those without 2 or 3 awful outings – had a RA9 in the season in question of 3.72 (compare to 4.19 for the pitchers with 2 or 3 bad starts). Their projection for the next season was 3.85 and actual was 3.86. This was in around 130,000 IP so 1 SD is now around .025 runs so we can be pretty confident that the 3.86 actual RA9 reflects their true talent within around .05 runs (2 SD) or so.
What about starters who not only had 2 or 3 disastrous starts but also had an overall poor RA9? In the original post I looked at those pitchers in our experimental group who also had a seasonal RA9 of > 5.00. I’ll do the same thing with this new experimental group – starters with only 2 or 3 very awful starts.
Their average RA9 for the experimental season was 5.52. Their projection was 4.45 and actual was 4.17, so now we have an even stronger inference that a bad season caused by a few bad starts creates a projection that is too pessimistic; thus maybe we should discount those few bad starts. We only have around 1600 IP (in the projected season) for these pitchers so 1 SD is around .25 runs. A difference between projected and actual of .28 runs is once again not nearly statistically significant. There is, nonetheless, a suggestion that we are on to something. (Don’t ever ignore – assume it’s random – an observed effect just because it isn’t statistically significant – that’s poor science.)
What about the control group? Last time we noticed that the control group’s actual RA was less than its projection for some reason. I’ll look at pitchers who had > 5 RA9 in one season but were not part of the group that had 2 or 3 disastrous starts.
Their average RA9 was 5.44 – similar to the 5.52 of the experimental group. Their projected was 4.45 and actual was 4.35, so we see the same “too high” projection in this group as well. (In fact, in testing my RA projections based on RA only – as opposed to say FIP or ERC – I find an overall bias such that pitchers with a one-season high RA have projections that are too high, not a surprising result actually.) This is in around 7,000 IP which gives us a SD of around .1 runs per 9.
So, the “a few bad starts” group outperformed their projections by around .1 runs. This same group, limiting it to starters with an overall RA or over 5.00, outperformed their projections by .28 runs. The control group with an overall RA also > 5.00 outperformed their projections by .1 runs. None of these differences are even close to statistically significant.
Let’s increase the sample size a little of our experimental group who also had particularly bad RA overall by expanding it to starters with an overall RA of > 4.50 rather than > 5.00. We now have 3,500 IP, 2x as many IP, reducing our error by around 50%. The average RA9 of this group was 5.13. Their projected RA was 4.33 and actual was 4.05 – exactly the same difference as before. Keep in mind that the more samples we look at the more we are “data mining,” which is a bit dangerous in this kind of research.
A control group of starters with > 4.50 RA had an overall RA9 of 4.99. Their projection was exactly the same as the experimental group, 4.33, but their actual was 4.30 – almost exactly the same as their projection.
In conclusion, while we initially found no evidence that discounting a bad ERA or RA caused by “several very poor starts” is warranted when doing a projection for starters with at least 100 IP, once we change the criteria for “a few bad starts” from “at least 5 starts with 6 runs or more allowed in 5 IP or less” to “exactly 2 or 3 starts with 8 runs or more in 5 IP or less” we do find evidence that some kind of discount may be necessary. In other words, for starters whose runs allowed are inflated due to 2 or 3 really bad starts, if we simply use overall season RA or ERA for our projections we will understate their subsequent season’s RA or ERA by maybe .2 or .3 runs per 9.
Our certainty of this conclusion, especially with regard to the size of the effect – if it exists at all – is pretty weak given the magnitude of the differences we found and the sample sizes we had to work with. However, as I said before, it would be a mistake to ignore any inference – even a weak one – that is not contradicted by some Bayesian prior (or common sense).