It’s quite simple actually.

Apropos of the myriad articles and discussions about the run scoring and HR surge starting in late 2015 and continuing through 2017 to date, I want to go over what can cause league run scoring to increase or decrease from one year to the next:

  1. Changes in equipment, such as the ball or bat.
  2. Changes to the strike zone, either the overall size or the shape.
  3. Rule changes.
  4. Changes in batter strength, conditioning, etc.
  5. Changes in batter or pitcher approaches.
  6. Random variation.
  7. Weather and park changes.
  8. Natural variation in player talent.

I’m going to focus on the last one, variation in player talent from year to year. How does the league “replenish” its talent from one year to the next? Poorer players get less playing time, including some who get no playing time at all (they retire, get injured, or switch to another league). Better players get more playing time, and new players enter the league. Much of that is because of the aging curve. Younger players generally get better and thus amass more playing time, while older players get worse and play less – eventually retiring or being released. All these moves can lead to each league having a little more or less overall talent and run scoring than in the previous year. How can we measure that change in talent/scoring?

One good method is to look at how a player’s league-normalized stats change from year X to year X+1. First, we have to establish a baseline. To do that, we track the average change in some league-normalized stat like Linear Weights, RC+ or wOBA+ over many years. It is best to confine this to players in a narrow age range, like 25 to 29, so that we minimize the problem of the average league age being different from one year to the next, and thus the amount of decline with age also being different.

We’ll start with batting. The stat I’m using is linear weights, which is generally zeroed out at the league level. In other words, the average player in each league, NL and AL separately, has a linear weights value of exactly zero. If we look at the average change from 2000 to 2017 for all batters from 25 to 29 years old, we get -.12 runs per team per game in the NL and -.10 in the AL. That means that these players decline with age and/or the quality of each league’s batting gets better every year. We’ll assume that most of that -.12 runs is due to aging (and that peak age is close to 25 or 26, which it probably is in the modern era), but it doesn’t matter for our purposes.

So, for example, if from year X to X+1 in the NL all batters aged 25-29 lost .2 runs per game per team, what would that tell us? It would tell us that league batting in year X+1 was better than in year X by .1 runs per team per game. Why is that? If these players should lose only about .1 runs to aging but they actually lost .2 runs, they look worse than they should relative to the league as a whole, which means that the league’s batting got better.
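To make the arithmetic concrete, here is a minimal sketch of the calculation in Python. The weighting scheme (the lesser of the two seasons’ PA) and every number in it are my own illustrative assumptions – the real figures above come from full, league-normalized player seasons.

```python
# Minimal sketch of the "delta method" described above.  Each pair holds a
# player's league-normalized linear weights (runs per team per game) and PA
# for year X and year X+1.  All values, and the weighting choice, are toy
# assumptions for illustration only.

def average_change(pairs):
    """Playing-time-weighted average change in normalized lwts, year X to X+1."""
    weight = lambda p: min(p["pa_y1"], p["pa_y2"])
    total = sum(weight(p) for p in pairs)
    return sum((p["lwts_y2"] - p["lwts_y1"]) * weight(p) for p in pairs) / total

# Observed change for the 25-29 year olds in one particular pair of seasons.
pairs = [
    {"lwts_y1": 0.05,  "lwts_y2": -0.20, "pa_y1": 600, "pa_y2": 550},
    {"lwts_y1": -0.10, "lwts_y2": -0.25, "pa_y1": 450, "pa_y2": 500},
]
observed_change = average_change(pairs)     # about -0.2 with these toy numbers

# Baseline aging decline, established over many seasons (the text rounds the
# NL's -.12 to -.1 for the worked example).
expected_change = -0.10

# If the group fell further behind the league than aging alone explains,
# the league's batting talent must have improved by the difference.
talent_change = expected_change - observed_change
print(round(talent_change, 2))              # about +0.1: the league got better
```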

Keep in mind that the quality of the pitching has no effect on this method. Whether the overall pitching talent changes from year 1 to year 2 has no bearing on these calculations. Nor do changes in parks, differences in weather, or any other variable that might change from year to year and affect run scoring and raw offensive stats. We’re using linear weights, which is always relative to other batters in the league. The sum of everyone’s offensive linear weights in any given year and league is always zero.

Using this method, here is the change in batting talent from year to year, in the NL and AL, from 2000 to 2017. Plus means the league got better in batting talent. Minus means it got worse. In other words, a plus value means that run scoring should increase, everything else being the same. Notice the decline in offense in both leagues from 2016 to 2017 even though we see increased run scoring. Either pitching got much worse or something else is going on. We’ll see about the pitching.

Table I

Change in batting linear weights, in runs per game per team

Years NL AL
00-01 .09 -.07
01-02 -.12 -.23
02-03 -.15 -.11
03-04 .09 -.11
04-05 -.10 -.14
05-06 .15 .05
06-07 .09 .08
07-08 -.05 .08
08-09 -.13 .08
09-10 .17 -.12
10-11 -.18 .04
11-12 .12 .00
12-13 -.03 -.05
13-14 .01 .07
14-15 .06 .09
15-16 .01 .05
16-17 -.03 -.12

 

Here is the same chart for league pitching. The stat I am using is ERC, or component ERA. Component ERA takes a pitcher’s raw rate stats – singles, doubles, triples, home runs, walks, and outs per PA, park and defense adjusted – and converts them into theoretical runs allowed per 9 innings, using a BaseRuns formula. Like linear weights, it is scaled to league average. A plus number means that league pitching got worse, and hence run scoring should go up.
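If you’re not familiar with BaseRuns, here is a rough sketch of the kind of conversion involved. The coefficients are one commonly cited BaseRuns version, not necessarily the exact ones behind the table, and the park/defense adjustments and league scaling are not shown.

```python
# Rough BaseRuns-style sketch: convert a pitcher's allowed components into
# theoretical runs allowed per 9 innings.  Coefficients are a commonly cited
# BaseRuns variant and are illustrative only; the ERC used for Table II is
# also park and defense adjusted and scaled to league average.

def base_runs_per9(singles, doubles, triples, hr, bb, outs):
    h = singles + doubles + triples + hr
    tb = singles + 2 * doubles + 3 * triples + 4 * hr
    a = h + bb - hr                                       # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.1    # advancement factor
    c = outs                                              # outs
    d = hr                                                # HR always score the batter
    runs = a * b / (b + c) + d
    return 9 * runs / (outs / 3)

# Example: a roughly league-average line over ~180 innings (made-up numbers).
print(round(base_runs_per9(singles=120, doubles=35, triples=3, hr=20, bb=55, outs=540), 2))
```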

Table II

Change in pitching, in runs per game per team

Years NL AL
00-01 .02 .21
01-02 .03 .00
02-03 -.04 -.23
03-04 .07 .11
04-05 .00 .07
05-06 -.14 -.12
06-07 .10 .06
07-08 -.15 -.10
08-09 -.13 -.17
09-10 .01 .04
10-11 .03 .16
11-12 .03 -.06
12-13 -.02 .26
13-14 -.02 -.04
14-15 .06 -.02
15-16 .03 .04
16-17 .04 -.01

 

Notice that pitching in the NL got a little worse. Overall, when you combine pitching and batting, the NL has worse talent in 2017 compared to 2016, by .07 runs per team per game. NL teams should score .01 runs per game more than in 2016, again, all other things being equal (they usually are not).

In the AL, while we’ve seen a decrease in batting of .12 runs per team per game (which is a lot), we’ve also seen a slight increase in pitching talent, .01 runs per game per team. We would expect the AL to score .13 runs per team per game less in 2017 than in 2016, assuming nothing else has changed. The overall talent in the AL, pitching plus batting, decreased by .11 runs.

The gap in talent between the NL and AL – at least with respect to pitching and batting only (not including base running and defense, which can also vary from year to year) – has presumably moved in favor of the NL by .04 runs per game per team, despite the AL’s .600 record in inter-league play so far this year compared to .550 last year. (One standard deviation of the difference between this year’s and last year’s inter-league W/L records is over .05, so the observed difference is less than one SD – not even close to statistically significant.)
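For what it’s worth, here is the back-of-the-envelope version of that SD calculation. The game counts are my assumptions (a full inter-league schedule is roughly 300 games, and this year’s is only partly played); the text above only states that one SD of the difference is over .05.

```python
# Rough check on the inter-league W/L comparison.  Game counts are assumed.
from math import sqrt

def sd_win_pct(p, games):
    """Binomial SD of a winning percentage over a given number of games."""
    return sqrt(p * (1 - p) / games)

sd_2016 = sd_win_pct(0.550, 300)   # full 2016 inter-league schedule (assumed ~300 games)
sd_2017 = sd_win_pct(0.600, 120)   # partial 2017 schedule (assumed ~120 games so far)

# SD of the difference between two independent winning percentages
sd_diff = sqrt(sd_2016 ** 2 + sd_2017 ** 2)
print(round(sd_diff, 3))           # ~0.05, so the observed .05 difference is about 1 SD
```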

Let’s complete the analysis by doing the same thing for UZR (defense) and UBR (base running). A plus defensive change means that the defense got worse (thus more runs scored). For base running, plus means better (more runs) and minus means worse.

Table III

Change in defense (UZR), in runs per game per team

Years NL AL
00-01 .01 -.07
01-02 -.01 .05
02-03 .18 -.07
03-04 .10 .03
04-05 .12 .00
05-06 -.08 -.07
06-07 .02 .03
07-08 .04 .01
08-09 -.02 -.02
09-10 -.01 -.02
10-11 .15 -.04
11-12 -.10 -.07
12-13 -.02 .03
13-14 -.10 .03
14-15 -.02 -.02
15-16 -.07 -.05
16-17 -.06 .05

 

From last year to this year, defense in the NL got better by .06 runs per team per game, signifying a decrease in run scoring. In the AL, the defense appears to have gotten worse, by .05 runs a game. By the way, since 2012, you’ll notice that teams have gotten much better on defense in general, likely due to an increased awareness of the value of defense, and the move away from the slow, defensively-challenged power hitter.

Let’s finish by looking at base running and then we can add everything up.

Table IV

Change in base running (UBR), in runs per game per team

Years NL AL
00-01 -.02 -.01
01-02 -.02 -.01
02-03 -.01 .00
03-04 .00 -.04
04-05 .02 .02
05-06 .00 -.01
06-07 -.01 -.01
07-08 .00 .00
08-09 .02 .02
09-10 -.02 -.02
10-11 .04 -.01
11-12 .00 -.02
12-13 -.01 -.01
13-14 .01 -.01
14-15 .01 .05
15-16 .01 -.03
16-17 .01 .01

 

Remember that the batting and pitching talent in the AL presumably decreased by .11 runs per team per game, and the AL was expected to score .13 fewer runs per game per team in 2017 than in 2016. Adding in defense and base running, those numbers become a decrease in AL talent of .15 runs and a decrease in run scoring of only .07 runs per team per game.

In the NL, when we add defense and base running to batting and pitching, we get no overall change in talent from 2016 to 2017, and a decrease in run scoring of .04 runs.
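To keep the sign conventions straight, here is a small sketch of how the four tables combine, using the 16-17 AL values from Tables I through IV.

```python
# Combining the four component tables into overall talent and expected
# run-scoring changes (sign conventions as defined above: batting plus = better,
# pitching plus = worse, defense plus = worse, base running plus = better).
bat, pit, dfn, ubr = -0.12, -0.01, 0.05, 0.01     # 16-17 AL values

# Talent: worse pitching and worse defense are talent losses, so they flip sign.
talent_change = bat - pit - dfn + ubr             # -0.15

# Expected run scoring: every term is already in runs, so they simply add.
runs_change = bat + pit + dfn + ubr               # -0.07

print(round(talent_change, 2), round(runs_change, 2))
```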

We also see a slight trend towards better base running since 2011, which should naturally occur with better defense.

Here is everything combined into one table.

Table V

Change in talent and run scoring, in runs per game per team. Plus means a gain in talent and more runs scored.

Years NL Talent AL Talent NL Runs AL Runs
00-01 .04 -.22 .09 .06
01-02 -.16 -.29 -.12 -.19
02-03 -.30 .19 -.02 -.41
03-04 -.08 -.29 .26 -.01
04-05 -.20 -.19 .04 -.05
05-06 .37 .23 -.07 -.15
06-07 -.02 -.02 .23 .16
07-08 .06 .17 -.16 -.01
08-09 .04 .29 -.26 -.09
09-10 .15 -.16 .05 -.12
10-11 -.31 -.09 .04 .15
11-12 .19 .11 .05 -.15
12-13 .00 -.35 -.08 .23
13-14 .14 .07 -.10 .05
14-15 .03 .18 .11 .10
15-16 .06 .03 -.02 .03
16-17 .00 -.15 -.04 -.07

If you haven’t read it, here’s the link.

For MY ball tests, the difference I found in COR was 2.6 standard deviations, as indicated in the article. The difference in seam height is around 1.5 SD. The difference in circumference is around 1.8 SD.

For those of you a little rusty on your statistics, the SD of the difference between two independent sample means is the square root of the sum of their respective variances (that is, of their squared standard errors).
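As a tiny sketch, with made-up sample SDs and sample sizes (only the formula is the point):

```python
# SD of the difference between two independent sample means:
# the square root of the sum of the two squared standard errors.
from math import sqrt

def sd_of_difference(sd1, n1, sd2, n2):
    se1 = sd1 / sqrt(n1)          # standard error of the first mean
    se2 = sd2 / sqrt(n2)          # standard error of the second mean
    return sqrt(se1 ** 2 + se2 ** 2)

# e.g., mean COR measured on two hypothetical batches of 36 balls each
print(sd_of_difference(sd1=0.005, n1=36, sd2=0.005, n2=36))
```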

The use of statistical significance is one of the most misunderstood and abused concepts in science. You can read about this on the internet if you want to know why. It has a bit to do with frequentist versus Bayesian statistics/inference.

For example, when you have a non-null hypothesis going into an experiment, such as, “The data suggest an altered baseball,” then ANY positive result supports that hypothesis and increases the probability of it being true, regardless of the statistical significance of those results.

Of course, the more significant the result, the more we move the prior probability up (i.e., the higher the posterior probability becomes). However, the classic case of using 2 or 2.5 SD to define “statistical significance” really only applies when you start out with the null hypothesis. In this case, for example, that would be if you had no reason to suspect a juiced ball, and you merely tested balls just to see if perhaps there were differences. In reality, you almost always have a prior probability, which is why the traditional practice of accepting or rejecting the null hypothesis based on the statistical significance of your experiment’s results is an obsolete concept.
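Here is a toy illustration of that point. Every probability in it is made up; the only takeaway is that with a real prior in hand, even a “non-significant” positive result pushes the posterior up.

```python
# Toy Bayes update for a binary hypothesis ("the ball was altered" vs. not).
# All numbers are invented for illustration.
def posterior(prior, p_data_if_true, p_data_if_false):
    num = prior * p_data_if_true
    return num / (num + (1 - prior) * p_data_if_false)

prior = 0.5   # we already suspected an altered ball before testing (assumed prior)

# A roughly 1.5 SD result: fairly likely if the ball changed, unlikely if it didn't.
print(round(posterior(prior, p_data_if_true=0.60, p_data_if_false=0.13), 2))   # ~0.82
```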

In any case, from the results of MLB’s own tests, in which they tested something like 180 balls a year, the seam height reduction we found was something like 6 or 7 SD and the COR increase was something like 3 or 4 SD. We can also add to the mix Ben’s original test, in which he found an increase in COR of .003, or around 60% of what I found.

So yes, the combined results of all three tests are almost unequivocal evidence that the ball was altered. There’s not much else you can do other than to test balls. Of course the ball testing would mean almost nothing if we didn’t have the batted ball data to back it up. We do.

I don’t think this “ball change” was intentional by MLB, although it could be.

In my extensive research for this project, I have uncovered two things:

One, there is quite a large actual year to year difference in the construction of the ball which can and does have a significant impact on HR and offensive rates in general. The concept of a “juiced” (or “de-juiced”) ball doesn’t really mean anything unless it is compared to some other ball – for example, in our case, 2014 to 2016/2017.

Two, we now know, because of Statcast and lots of great work and insight by Alan Nathan and others, that very small changes in things like COR, seam height, and size can have a dramatic impact on offense. My (wild) guess is that we probably have something like a 2- or 3-foot variation (one SD) in batted ball distance, for a typical HR trajectory, from year to year based on the (random) fluctuating composition and construction of the ball. And from 2014 to 2016 (and so far this year), we just happened to have seen a 2 or 3 standard deviation variation.

We’ve seen it before, most notably in 1987, and we’ll probably see it again. I have also altered my thinking about the “steroid era.” Now that I know that balls can fluctuate from year to year, sometimes greatly, it is entirely possible that balls were constructed differently starting in 1993 or so – perhaps in combination with burgeoning and rampant PED use.

Finally, it is true that there are many things that can influence run scoring and HR rates, some more than others. Weather and parks are very minor. Even a big change in one or two parks, or a very hot or cold year, will have very small effects overall. And of course we can easily test or account for these things.

Change in talent can surprisingly have a large effect on overall offense. For example, this year, the AL lost a lot of offensive talent which is one reason why the NL and the AL have almost equal scoring despite the AL having the DH.

The only other thing that can fairly drastically change offense is the strike zone. Obviously it depends on the magnitude of the change. In the pitch f/x era we can measure that, as Joe Roegele and others do every year. It had not changed much over the last few years, until this year. It is smaller now, which is causing an uptick in offense from last year. I also believe, as others have said, that the uptick since last year is due to batters realizing that they are playing with a livelier ball and thus hitting more air balls. They may be hitting more air balls even without thinking that the ball is juiced – they may just be jumping on the “fly-ball bandwagon.” Either way, hitting more fly balls compounds the effect of a juiced ball, because it is correct to hit more fly balls.

Then there is the bat, which I know nothing about. I have not heard anything about the bats being different or what you can do to a bat to increase or decrease offense, within allowable MLB limits.

Do I think that the “juiced ball” (in combination with players taking advantage of it) is the only reason for the HR/scoring surge? I think it’s the primary driver, by far.

There’s been some discussion lately on Twitter about the sacrifice bunt. Of course it is used very little anymore in MLB other than with pitchers at the plate. I’ll spare you the numbers. If you want to verify that, you can look it up on the interweb. The reason it’s not used anymore is not because it was or is a bad strategy. It’s simply because there is no point in sac bunting in most cases. I’ve written about why before on this blog and on other sabermetric sites. It has to do with game theory. I’ll briefly explain it again along with some other things. This is mostly a copy and paste from my recent tweets on the subject.

First, the notion that you can analyze the efficacy (or anything really) about a sac bunt attempt by looking at what happens (say, the RE or WE) after an out and a runner advance is ridiculous. For some reason sabermetricians did that reflexively for a long time ever since Palmer and Thorn wrote The Hidden Game and concluded (wrongly) that the sac bunt was a terrible strategy in most cases. What they meant was that advancing the runner in exchange for an out is a terrible strategy in most cases, which it is. But again, EVERYONE knows that that isn’t the only thing that happens when a batter attempts to bunt. That’s not a shock. We all know that the batter can reach base on a single or an error, he can strike out, hit into a force or DP, pop out, or even walk. We obviously have to know  how often those things occur on a bunt attempt to have any chance to figure out whether a bunt might increase, decrease or not change the RE or WE, compared to hitting away. Why Palmer and Thorn or anyone else ever thought that looking at the RE or WE after something that occurs less than half the time on a bunt attempt (yeah, on the average an out and runner advance occurs around 47% of the time) could answer the question of whether a sac bunt might be a good play or not, is a mystery to me. Then again, there are probably plenty of stupid things we’re saying and doing now with respect to baseball analysis that we’ll be laughing or crying about in the future, so I don’t mean that literally.

What I am truly in disbelief about is that there are STILL saber-oriented writers and pundits who talk about the sac bunt attempt as if all that ever happens is an out and a runner advance. That’s indefensible. For cripes sake I wrote all about this in The Book 12 years ago. I have thoroughly debunked the idea that “bunts are bad because they considerably reduce the RE or WE.” They don’t. This is not controversial. It never was. It was kind of a, “Shit I don’t know why I didn’t realize that,” moment. If you still look at bunt attempts as an out and a runner advance instead of as an amalgam of all kinds of different results, you have no excuse. You are either profoundly ignorant, stubborn, or both. (I’ll give the casual fan a pass).

Anyway, without further ado, here is a summary of some of what I wrote in The Book 12 years ago about the sac bunt, and what I just obnoxiously tweeted in 36 or so separate tweets:

Someone asked me to post my 2017 W/L projections for each team. I basically added up the run values of my individual projections, using Fangraphs projected playing time for every player, as of around March 15.

I did use the actual schedule for a “strength of opponent” adjustment. I didn’t add anything additional for injuries, the chances of each team making roster adjustments at the trade deadline or otherwise, managerial skill, etc. I didn’t try to simulate lineups or anything like that. Plus, these are based on my preliminary projections without incorporating any Statcast or pitch f/x data. Also, these kinds of projections tend to regress toward a mean of .500 for all teams. That’s because bad teams tend to weed out bad players and otherwise improve, and injuries don’t hurt them much – in some cases improving them. And good teams tend to be hurt more by injuries (and I don’t think the depth charts I use account enough for the chance of injury). As well, if good teams are not contending at the deadline, they tend to trade their good players.
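For anyone curious about the mechanics, here is a sketch of the kind of runs-to-wins conversion that underlies projections like these. The Pythagenpat exponent and the toy run totals are my assumptions, and the schedule-strength adjustment isn’t shown.

```python
# Sketch of converting projected runs scored/allowed into expected wins
# using a Pythagenpat-style formula.  Exponent and inputs are illustrative.

def pythagenpat_wins(runs_scored, runs_allowed, games=162):
    rpg = (runs_scored + runs_allowed) / games
    exponent = rpg ** 0.287                     # commonly used Pythagenpat form
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return games * rs / (rs + ra)

# e.g., a team projected to score 780 runs and allow 700
print(round(pythagenpat_wins(780, 700)))        # ~89 wins
```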

So take these for what they are worth.

team wins div wc div+wc ds lcs ws
 

NL EAST

was 89 0.499 0.097 0.597 0.257 0.117 0.048
nyn 88 0.437 0.114 0.55 0.239 0.106 0.044
mia 78 0.046 0.02 0.066 0.024 0.01 0.004
phi 72 0.007 0.002 0.009 0.003 0.001 0
atl 72 0.011 0.004 0.014 0.006 0.002 0.001
 

NL Central

chn 100 0.934 0.044 0.978 0.56 0.303 0.146
sln 86 0.049 0.273 0.322 0.137 0.059 0.022
pit 82 0.017 0.129 0.146 0.056 0.023 0.008
cin 67 0 0.001 0.001 0 0 0
mil 61 0 0 0 0 0 0
 

NL WEST

lan 102 0.961 0.025 0.987 0.591 0.327 0.164
sfn 85 0.03 0.214 0.245 0.098 0.041 0.016
col 78 0.005 0.047 0.052 0.018 0.007 0.003
ari 77 0.003 0.03 0.033 0.011 0.004 0.002
sdn 66 0 0 0 0 0 0
 

AL EAST

tor 87 0.34 0.114 0.455 0.229 0.118 0.061
bos 87 0.359 0.129 0.487 0.238 0.117 0.064
tba 83 0.15 0.077 0.227 0.105 0.051 0.027
bal 81 0.099 0.056 0.155 0.071 0.032 0.014
nya 79 0.053 0.035 0.088 0.038 0.018 0.008
 

AL CENTRAL

cle 93 0.861 0.027 0.888 0.471 0.254 0.146
det 82 0.097 0.077 0.174 0.076 0.033 0.016
min 76 0.021 0.015 0.036 0.014 0.005 0.002
kca 75 0.02 0.014 0.033 0.014 0.005 0.003
cha 68 0.001 0.001 0.002 0 0 0
 

AL WEST

hou 91 0.541 0.13 0.671 0.362 0.188 0.11
sea 86 0.228 0.155 0.383 0.192 0.09 0.047
ala 84 0.181 0.12 0.301 0.146 0.071 0.036
tex 80 0.044 0.042 0.086 0.038 0.017 0.008
oak 73 0.006 0.007 0.014 0.006 0.002 0.001

The most important thing, bar none, that your government can do – must do – is to be truthful and transparent regardless of party, policy, or ideology. Your government works for you. It is your servant. As Lincoln famously said, in America we have a government, “of the people, by the people and for the people.” That is the bedrock of our Democracy.

A government that withholds, obfuscates, misrepresents or tells falsehoods should never be tolerated in a democracy. Raw, naked honesty is the first thing you must demand from your government. They. Work. For. You. Regardless of what you think of their promises and policies, if they are not honest with you, they cannot govern effectively because you can never trust that they have your best interests in mind.

Demand that your politicians are honest with you. If not, you must vote them out. It is every American’s responsibility to do so. It doesn’t matter what their party is or what you think they may accomplish. A dishonest government is like a dishonest employee. They will eventually sink your company. Anything but a transparent and forthright government is a cancer in a Democracy. It is self-serving by definition. You should demand honesty first and foremost from your public servants or our Democracy will crumble.

Maybe.

In this article, Tuffy Gosewisch, the new backup catcher for the Braves, talks catching with FanGraphs’ David Laurila. He says about what you would expect from a catcher. Nothing groundbreaking or earth-shattering – nothing blatantly silly or wrong either. In fact, catchers almost always sound like baseball geniuses. They do have to be one of the smarter ones on the field. But…

Note: This is almost verbatim from my comment on that web page:

I have to wonder how much better a catcher could be if he understood what he was actually doing (of course they do, they get paid millions, they’ve been doing it all their lives, and are presumably the best in the world at what they do. Who the hell are you, you’ve never put on the gear in your life?).

All catchers talk about how they determine the “right” pitch. I’m waiting for a catcher to say, “There is no ‘right’ pitch – there can’t be! There’s a matrix of pitches and we choose one randomly. Because you see, if there were a ‘right’ pitch and that was the one we called, the batter would know or at least have a pretty good idea of that same pitch and it would be a terrible pitch, especially if the batter were a catcher!”

If different catchers and pitchers have different “right” pitches and that’s why batters can’t guess them then there certainly isn’t a “right” pitch – it must be a (somewhat) random one.

When I say “random” I mean from a distribution of pitches, each with a pre-determined (optimal) frequency, based on the batter and the game situation. Rather than it being the catcher’s and pitcher’s job to come up with the “right” pitch – and I explained why that concept cannot be correct – it is their responsibility to come up with the “right” distribution matrix, for example, 20% FB away, 10% FB inside, 30% curveball, 15% changeup, etc. In fact, once you do that, you can tell the batter your matrix and it won’t make any difference! He can’t exploit that information, and you will maximize your success as a pitcher, assuming that the batter will exploit you if you use any other strategy.

If a catcher could come up with the “right” single pitch that the batter is not likely to figure out, without randomly choosing one from a pre-determined matrix, well….that can’t be right, again, because whatever the catcher can figure, so can (and will) the batter.

We also know that catchers don’t hit well. If there were “right” pitches, catchers would be the best hitters in baseball!

Tuffy also said this:

“You also do your best to not be predictable with pitch-calling. You remember what you’ve done to guys in previous at-bats, and you try not to stay in those patterns. Certain guys — veteran guys — will look for patterns. They’ll recognize them, and will sit on pitches.”

Another piece of bad advice! Changing your patterns is being predictable! If you have to change your patterns to fool batters your patterns were not correct in the first place! As I said, the “pattern” you choose is the only optimal one. By “pattern” I mean a certain matrix of pitches thrown a certain percentage of time given the game situation and participants involved. Any other definition of “pattern” implies predictability so for a catcher to be talking about “patterns” at all is not a good thing. There should never be an identifiable pattern in pitching unless it is a random one which looks like a pattern. (As it turns out, researchers have shown that when people are shown random sequences of coin flips and ones that are chosen to look random but are not, people more often choose the non-random ones as being random.)

Say I throw lots of FB to a batter the first 2 times through the order and he rakes (hits a HR and a double) on them. If those two FB were part of the correct matrix, I would be an idiot to throw him fewer FB in the next PA. Because if that were part of my plan, once again, he could (and would) guess that and have a huge advantage. How many times have you heard Darling, Smoltz or some other ex-pitcher announcer say something like, “After that blast last AB (on a fastball), the last thing he’ll do here is throw him another fastball in this AB”? Thankfully for the pitcher, the announcer will invariably be wrong, and the pitcher will throw his normal percentage of fastballs to that batter – as he should.

What if I am mixing up my pitches randomly each PA but I change my mixture from time to time? Is that a good plan? No! The fact that I am choosing randomly from a matrix of pitches (each with a different fixed frequency for that exact situation) on each and every pitch means that I am “somewhat” unpredictable by definition (“somewhat” is in quotes because sometimes the correct matrix is 90% FB and 10% off-speed – is that “unpredictable?”) but the important thing is that those frequencies are optimal. If I constantly change those frequencies, even randomly, then they often will not be correct (optimal). That means that I am sometimes pitching optimally and other times not. That is not the overall optimal way to pitch of course.

The optimal way to pitch is to pitch optimally all the time (duh)! So my matrix should always be the same as long as the game situation is the same. In reality of course, the game situation changes all the time. So I should be changing my matrices all the time. But it’s not in order to “mix things up” and keep the batters guessing. That happens naturally (and in fact optimally) on each and every pitch as long as I am using the optimal frequencies in my matrix.

Once again, all of this assumes a “smart” batter. For a “dumb” batter, my strategy changes and things get complicated, but I am still using a matrix and then randomizing from it. Always. Unless I am facing the dumbest batter in the universe who is incapable of ever learning anything or perhaps if it’s the last pitch I am going to throw in my career.

There are only two correct things that a pitcher/catcher have to do – their pitch-calling jobs are actually quite easy. This is a mathematical certainty. (Again, it assumes that the batter is acting optimally – if he isn’t that requires a whole other analysis and we have to figure out how to exploit a “dumb” batter without causing him to play too much more optimally):

One, establish the game theory optimal matrix of pitches and frequencies given the game situation, personnel, and environment.

Two, choose one pitch randomly around those frequencies (for example, if the correct matrix is 90% FB and 10% off-speed, you flip a 10-side mental coin).
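As a minimal sketch of those two steps – the frequencies below are invented for illustration, not anyone’s actual matrix:

```python
# Step one: an (assumed) optimal frequency matrix for this batter, count, and
# situation.  Step two: draw from it at random on every pitch.
import random

pitch_matrix = {
    "FB away": 0.20,
    "FB inside": 0.10,
    "curveball": 0.30,
    "changeup": 0.15,
    "slider": 0.25,
}

def call_pitch(matrix):
    """Pick one pitch according to the fixed (optimal) frequencies."""
    pitches, weights = zip(*matrix.items())
    return random.choices(pitches, weights=weights, k=1)[0]

print(call_pitch(pitch_matrix))   # the batter can know the matrix and it won't help him
```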

Finally, it may be that catchers and pitchers do nearly the right thing (i.e. they can’t be much better even if I explain to them the correct way to think about pitching – who the hell do you think you are?) even though they don’t realize what it is they’re doing right. However, that’s possible only to an extent.

Many people are successful at what they do without understanding what it is they do that makes them successful. I’ve said before that I think catchers and pitchers do randomize their pitches to a large extent. They have to. Otherwise batters would guess what they are throwing with a high degree of certainty and Ron Darling and John Smoltz wouldn’t be wrong as often as they are when they tell us what the pitcher is going to throw (or should throw).

So how is it that catchers and pitchers can think their job is to figure out the “right” pitch (no one ever says they “flip a mental coin”), yet those pitches appear to be random? It is because they go through so many chaotic decisions in their brains that for all intents and purposes the pitch selection often ends up being random. For example, “I threw him a fastball twice in a row so maybe I should throw him an off-speed now. But wait, he might be thinking that, so I’ll throw another fastball. But wait, he might be thinking that too, so…” Where they stop in that train of thought might be random!

Even if pitchers and catchers are essentially randomizing their pitches, two things are certain. One, they can’t possibly be coming up with the exact game theory optimal (GTO) matrices – and trust me, there IS an optimal one (it may be impossible for anyone to determine it, but I guarantee that someone can do a better job overall – it’s like man versus machine). Two, some pitchers and catchers will be better at pseudo-randomizing than others. In both cases there is a great deal of room for improvement in calling games and pitches.

Note: This post was edited to include some new data which leads us in the direction of a different conclusion. The addendum is at the end of the original post.

This is another one of my attempts at looking at “conventional wisdoms” that you hear and read about all the time without anyone stopping for a second to catch their breath and ask themselves, “Is this really true?” Or more appropriately, “To what extent is this true?” Bill James used those very questions to pioneer a whole new field called sabermetrics.

As usual in science, we can rarely if ever answer questions with, “Yes it is true,” or “No, it is not true.” We can only look at the evidence and try and draw some inferences with some degree of certainty between 0 and 100%. This is especially true in sports when we are dealing with empirical data and limited sample sizes.

You often read something like, “So-and-so pitcher had a poor season (say, in ERA) but he had a few really bad outings so it wasn’t really that bad.” Let’s see if we can figure out to what extent that may or may not be true.

First I looked at all starting pitcher outings over the last 40 years, 1977-2016. I created a group of starters who had at least 4 very bad outings and at least 100 IP in one season. A “bad outing” was defined as 5 IP or less and at least 6 runs allowed, so a minimum RA9 of almost 11 in at least 4 games in a season. Had those starts been typical starts, each of these pitchers’ ERAs or RA9s would have been at least a run or so lower.

Next I only looked at those pitchers who had an overall RA9 of at least 5.00 in the seasons in question. The average RA9 for these pitchers with some really bad starts was 5.51 where 4.00 is the average starting pitcher’s RA9 in every season regardless of the run environment or league. Basically I normalized all pitchers to the average of his league and year and set the average at 4.00. I also park adjusted everything.

OK, what were these pitchers projected to do the following season? I used basic Marcel-type projections for all pitchers. The projections treated all RA9 equally. In other words a 5.51 RA with a few really bad starts was equivalent to a 5.51 RA with consistently below-average starts. The projections only used full season data (RA9).
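For readers who haven’t seen one, here is a bare-bones sketch of a Marcel-style RA9 projection. The weights, the regression constant, and the inputs are assumptions in the spirit of Marcel, not the exact ones I used, and the age adjustment is omitted.

```python
# Bare-bones Marcel-style projection: weight the last three seasons, then
# regress toward the league mean.  Constants are illustrative assumptions.

def marcel_ra9(seasons, league_ra9=4.00, regression_ip=1600):
    """seasons: list of (ra9, ip) for the last three years, most recent first."""
    weights = [5, 4, 3]
    num = sum(w * ra9 * ip for w, (ra9, ip) in zip(weights, seasons))
    den = sum(w * ip for w, (ra9, ip) in zip(weights, seasons))
    reliability = den / (den + regression_ip)    # more data, less regression
    return reliability * (num / den) + (1 - reliability) * league_ra9

# e.g., a 5.51 RA9 season (however it was produced) on top of two average-ish ones
print(round(marcel_ra9([(5.51, 160), (4.10, 180), (3.95, 170)]), 2))   # ~4.3-4.4
```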

So basically these 5.51 RA9 pitchers pitched near average for most of their starts but had 4-6 really bad (and short) starts that upped their overall RA9 for the season by more than a run. Which was more indicative of their true talent? The vast majority of the games where they pitched around average, the few games where they blew up, or their overall runs allowed per 9 innings? Or their overall RA9 for that season (regardless of how it was created), plus their RA9 from previous seasons, with some regression thrown in for good measure – in other words, a regular, old-fashioned projection?

Our average projection for these pitchers for the next season (which is an estimate of their true talent that season) was 4.46. How did they pitch the next season – which is an unbiased sample of their true talent (I didn’t set an innings requirement for this season so there is no survivorship bias)? It was 4.48 in 10,998 TBF! So the projection which had no idea that these were pitchers who pitched OK for most of the season but had a terrible seasonal result (5.51 RA9) because of a few terrible starts, was right on the money. All the projection model knew was that these pitchers had very bad RA9 for the season – in fact, their average RA was 138% of league average.

Of course since we sampled these pitchers based on some bad outings and an overall bad ERA (over 5.00) we know that in prior seasons their RA9 would be much lower, similar to their projection (4.46) – actually better. In fact, you should know that a projection can apply just as well to previous years as it can to subsequent years. There is almost no difference. You just have to make sure you apply the proper age adjustments.

Somewhat interestingly, if we look at all pitchers with a RA9 above 5 (an average of 5.43) who did not have the requisite very bad outings, i.e. they pitched consistently bad but with few disastrous starts, their projected RA9 was 4.45 and their actual was 4.25, in 25,479 TBF.

While we have significant sample error in these limited samples, not only is there no suggestion that you should ignore or even discount a bad ERA or RA that is the result of a few horrific starts, there is an (admittedly weak) suggestion that pitchers who pitch badly but more consistently may be able to outperform their projections for some reason.

The next time you read that, “So-and-so pitcher has bad numbers but it was only because of a few really bad outings,” remember that there is no evidence  that an ERA or RA which includes a “few bad outings” should be treated any differently than a similar ERA or RA without that qualification, at least as far as projections are concerned.

Addendum: I was concerned about the way I defined pitchers who had “a few disastrous starts.” I included all starters who gave up at least 6 runs in 5 innings or less at least 5 times in a season. The average number of bad starts was 5.5. So basically these were mostly pitchers who had 5 or 6 really bad starts in a season, occasionally more.

I thought that most of the time when we hear the “a few bad starts” refrain, we’re talking literally about a few bad starts, as in 2 or 3. So I changed the criteria to include only those pitchers with 2 or 3 awful starts. I also upped the ante on those terrible starts. Before it was > 5 runs in 5 IP or less. Now it is > 7 runs in 5 IP or less – truly a blowup of epic proportions. We still had 508 pitcher seasons that fit the bill, which gives us a decent sample size.

These pitchers overall had a normalized (4.00 is average) RA9 of 4.19 in the seasons in question, so 2 or 3 awful starts didn’t produce such a bad overall RA. Remember I am using a 100 IP minimum so all of these pitchers pitched at least fairly well for the season whether they had a few awful starts or not. (This is selective sampling and survivorship bias at work. Any time you set a minimum IP or PA, you select players who had above average performance, through luck and talent.)

Their next year’s projection was 3.99 and the actual was 3.89 so there is a slight inference that indeed you can discount the bad starts a little. This is in around 12,000 IP. A difference of .1 RA9 is only around 1 SD so it’s not nearly statistically significant. I also don’t know that we have any Bayesian prior to work with.

The control group – all other starters, namely those without 2 or 3 awful outings – had a RA9 in the season in question of 3.72 (compare to 4.19 for the pitchers with 2 or 3 bad starts). Their projection for the next season was 3.85 and actual was 3.86. This was in around 130,000 IP so 1 SD is now around .025 runs so we can be pretty confident that the 3.86 actual RA9 reflects their true talent within around .05 runs (2 SD) or so.
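Where do figures like “1 SD is around .025 runs” come from? Roughly this, assuming the SD of runs allowed over a single 9-inning chunk is about 3 runs (a common rule of thumb; the exact value is my assumption):

```python
# SD of a mean RA9 shrinks with the square root of the number of 9-inning
# chunks in the sample, given an assumed per-game SD of about 3 runs.
from math import sqrt

def sd_of_mean_ra9(innings, sd_per_game=3.0):
    return sd_per_game / sqrt(innings / 9)

for ip in (1600, 7000, 12000, 130000):
    print(ip, round(sd_of_mean_ra9(ip), 3))   # ~0.22, ~0.11, ~0.08, ~0.025
```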

What about starters who not only had 2 or 3 disastrous starts but also had an overall poor RA9? In the original post I looked at those pitchers in our experimental group who also had a seasonal RA9 of > 5.00. I’ll do the same thing with this new experimental group – starters with only 2 or 3 very awful starts.

Their average RA9 for the experimental season was 5.52. Their projection was 4.45 and actual was 4.17, so now we have an even stronger inference that a bad season caused by a few bad starts creates a projection that is too pessimistic; thus maybe we should  discount those few bad starts. We only have around 1600 IP (in the projected season) for these pitchers so 1 SD is around .25 runs. A difference between projected and actual of .28 runs is once again not nearly statistically significant. There is, nonetheless, a suggestion that we are on to something. (Don’t ever ignore – assume it’s random – an observed effect just because it isn’t statistically significant – that’s poor science.)

What about the control group? Last time we noticed that the control group’s actual RA was less than its projection for some reason. I’ll look at pitchers who had > 5 RA9 in one season but were not part of the group that had 2 or 3 disastrous starts.

Their average RA9 was 5.44 – similar to the 5.52 of the experimental group. Their projected was 4.45 and actual was 4.35, so we see the same “too high” projection in this group as well. (In fact, in testing my RA projections based on RA only – as opposed to say FIP or ERC – I find an overall bias such that pitchers with a one-season high RA have projections that are too high, not a surprising result actually.) This is in around 7,000 IP which gives us a SD of around .1 runs per 9.

So, the “a few bad starts” group outperformed their projections by around .1 runs. This same group, limiting it to starters with an overall RA or over 5.00, outperformed their projections by .28 runs. The control group with an overall RA also > 5.00 outperformed their projections by .1 runs. None of these differences are even close to statistically significant.

Let’s increase the sample size a little for our experimental group that also had a particularly bad overall RA by expanding it to starters with an overall RA of > 4.50 rather than > 5.00. We now have 3,500 IP, more than twice as many, which shrinks our random error by roughly a third (error scales with the square root of the sample size). The average RA9 of this group was 5.13. Their projected RA was 4.33 and actual was 4.05 – exactly the same difference as before. Keep in mind that the more samples we look at, the more we are “data mining,” which is a bit dangerous in this kind of research.

A control group of starters with > 4.50 RA had an overall RA9 of 4.99. Their projection was exactly the same as the experimental group, 4.33, but their actual was 4.30 – almost exactly the same as their projection.

In conclusion, while we initially found no evidence that discounting a bad ERA or RA caused by “several very poor starts” is warranted when doing a projection for starters with at least 100 IP, once we change the criteria for “a few bad starts” from “at least 5 starts with 6 runs or more allowed in 5 IP or less” to “exactly 2 or 3 starts with 8 runs or more in 5 IP or less” we do find evidence that some kind of discount may be necessary. In other words, for starters whose runs allowed are inflated due to 2 or 3 really bad starts, if we simply use overall season RA or ERA for our projections we will understate their subsequent season’s RA or ERA by maybe .2 or .3 runs per 9.

Our certainty of this conclusion, especially with regard to the size of the effect – if it exists at all – is pretty weak given the magnitude of the differences we found and the sample sizes we had to work with. However, as I said before, it would be a mistake to ignore any inference – even a weak one – that is not contradicted by some Bayesian prior (or common sense).

 

Richard Nichols (@RNicholsLV on Twitter) sent me this link. These are notes that the author, Lee Judge, a Royals blogger for the K.C. Star, took during the season. They reflect thoughts and comments from players, coaches, etc. I thought I’d briefly comment on each one. Hope you enjoy!

Random, but interesting, things about baseball – Lee Judge

▪ If a pitcher does not have a history of doubling up on pickoff throws (two in a row) take a big lead, draw a throw and then steal on the next pitch.

Of course you can do that. But how many times can you get away with it? Once? If the pitcher or one of his teammates or coaches notices it, he’ll pick you off the next time by “doubling up.” Basically by exploiting the pitcher’s non-random and thus exploitable strategy, the runner becomes exploitable himself. A pitcher, of course, should be picking a certain percentage of the time each time he goes into the set position, based on the likelihood of the runner stealing and the value of the steal attempt. That “percentage” must be randomized by the pitcher and it “resets” each time he throws a pitch or attempts a pickoff.

By “randomize” I mean the prior action, pick or no pick, cannot affect the percentage chance of a pick. If a pitcher is supposed to pick 50% prior to the next pitch he must do so whether he’s just attempted a pickoff 0, 1, 2, or 10 times in a row. The runner can’t know that a pickoff is more or less likely based on how many picks were just attempted. In fact you can tell him, “Hey every time I come set, there’s a 50% (or 20%, or whatever) chance I will attempt to pick you off,” and there’s nothing he can do to exploit that information.

For example, if he decides that he must throw over 50% of the time he comes set (in reality the optimal % changes with the count), then he flips a mental coin (or uses something – unknown to the other team – to randomize his decision, with a .5 mean). What will happen on the average is that he won’t pick half the time, 25% of the time he’ll pick once only, 12.5% of the time he’ll pick exactly twice, 25% of the time he’ll pick at least twice, etc.
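A quick check of that arithmetic – with an independent 50/50 decision every time he comes set, the number of pickoff throws before the next pitch is geometric:

```python
# Geometric distribution of pickoff attempts before the next pitch when the
# pitcher throws over with probability 0.5 each time he comes set.
def p_exactly_k_picks(k, p=0.5):
    return (p ** k) * (1 - p)

print(p_exactly_k_picks(0))                               # 0.5   -> no pick half the time
print(p_exactly_k_picks(1))                               # 0.25  -> exactly one pick
print(p_exactly_k_picks(2))                               # 0.125 -> exactly two picks
print(1 - p_exactly_k_picks(0) - p_exactly_k_picks(1))    # 0.25  -> at least two picks
```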

Now, the tidbit from the player or coach says, “does not have a history of doubling up.” I’m not sure what that means. Surely most pitchers when they do pick, will pick once sometimes and twice sometimes, etc. Do any pitchers really never pick more than once per pitch? If they do, I would guess that it’s because the runner is not really a threat and the one-time pick is really a pick with a low percentage. If a runner is not much of a threat to run, then maybe the correct pick percentage is 10%. If that’s the case, then they will not double-up 99% of the time and correctly so. That cannot be exploited, again, assuming that a 10% rate is optimal for that runner in that situation. So while it may look like they never double up, they do in fact double up 1% of the time, which is correct and cannot be exploited (assuming the 10% is correct for that runner and in that situation).

Basically what I’m saying is that this person’s comment is way too simple and doesn’t really mean anything without putting it into context, as I explained above.

▪ Foul balls with two strikes can indicate a lack of swing-and-miss stuff; the pitcher can get the batters to two strikes, but then can’t finish them off.

Not much to say here. Some pitchers have swing-and-miss stuff and others don’t, and everything in-between. You can find that out by looking at…uh…their swing-and-miss percentages (presuming a large enough sample size to give you some minimum level of certainty). Foul balls with two strikes? That’s just silly. A pitcher without swing-and-miss stuff will get more foul balls and balls in play with two strikes. That’s a tautology. He’ll also get more foul balls and balls in play with no strikes, one strike, etc.

▪ Royals third-base coach Mike Jirschele will walk around the outfield every once in a while just to remind himself how far it is to home plate and what a great throw it takes to nail a runner trying to score.

If my coach has to do that, I’m not sure I want him coaching for me. That being said, whatever little quirks he has or needs in order to send or hold runners the correct percentage of the time are fine by me. I don’t know that I would be teaching or recommending that to my coaches – again, not that there’s anything necessarily wrong with it.

Bottom line is that he better know the minimum percentages that runners need to be safe in any given situation (mostly # of outs) – i.e. the break-even points – and apply them correctly to the situation (arm strength and accuracy etc.) in order to make optimal decisions. I would surely be going over those numbers with my coaches from time to time and then evaluating his sends and holds to make sure he’s not making systematic errors or too many errors in general.

▪ For the most part, the cutter is considered a weak contact pitch; the slider is considered a swing-and-miss pitch.

If that’s confirmed by pitch f/x, fine. If it’s not, then I guess it’s not true. Swing-and-miss is really just a subset of weak contact and weak contact is a subset of contact which is a subset of a swing. The result of a swing depends on the naked quality of the pitch, where it is thrown, and the count. So while for the most part (however you want to define that – words are important!) it may be true, surely it depends on the quality of each of the pitches, on what counts they tend to be thrown, how often they are thrown at those counts, and the location they are thrown to. Pitches away from the heart of the plate tend to be balls and swing-and-miss pitches. Pitches nearer the heart tend to be contacted more often, everything else being equal.

▪ With the game on the line and behind in the count, walk the big-money guys; put your ego aside and make someone else beat you.

Stupid. Just. Plain. Stupid. Probably the dumbest thing a pitcher or manager can think/do in a game. I don’t even know what it means and neither do they. So tie game in the 9th, no one on base, 0 outs, count is 1-0. Walk the batter? That’s what he said! I can think of a hundred stupid examples like that. A pitcher’s approach changes with every batter and every score, inning, outs, runners, etc. A blanket statement like that, even as a rule of thumb, is Just. Plain. Dumb. Any interpretation of that by players and coaches can only lead to sub-optimal decisions – and does. All the time. Did I say that one is stupid?

▪ A pitcher should not let a hitter know what he’s thinking; if he hits a batter accidentally he shouldn’t pat his chest to say “my bad.” Make the hitter think you might have drilled him intentionally and that you just might do it again.

O.K. To each his own.

▪ Opposition teams are definitely trying to get into Yordano Ventura’s head by stepping out and jawing with him; anything to make him lose focus.

If he says so. I doubt much of that goes on in baseball. Not that kind of game. Some, but not much.

▪ In the big leagues, the runner decides when he’s going first-to-third; he might need a coach’s help on a ball to right field — it’s behind him — but if the play’s in front of him, the runner makes the decision.

Right, we teach that in Little League (a good manager that is). You teach your players that they are responsible for all base running decisions until they get to third. Then it’s up to the third base coach. It’s true that the third base coach can and should help the runner on a ball hit to RF, but ultimately the decision is on the runner whether to try and take third.

Speaking of taking third, while the old adage “don’t make the first or third out at third base” is a good rule of thumb, players should know that it doesn’t mean, “Never take a risk on trying to advance to third.” It means the risk has to be low (like 10-20%), but that the risk can be twice as high with 0 outs as with 2 outs. So really, the adage should be, “Never make the third out at third base, but you can sometimes make the first out at third base.”

You can also just forget about the first-out part of that adage. Really, the no-out break-even point is almost exactly in between the one-out and two-out ones. In other words, with no outs you need to be safe at third around 80% of the time, with one out around 70%, and with two outs around 90%. Players should be taught that and not just the “rule of thumb.” They should also be taught that the numbers change with trailing runners, the pitcher, and who the next batter or batters are. For example, with a trailing runner, making the third out is really bad, but making the first out when the trailing runner can advance is a bonus.
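For anyone who wants to see where break-even numbers like these come from, here is a sketch using a runner on second with nobody out deciding whether to take third. The run expectancy values are rough, made-up approximations of an RE24 table, not the actual figures.

```python
# Break-even probability for an advancement attempt: the point where trying
# and staying put have the same expected runs.  RE values below are assumed.

def break_even(re_success, re_fail, re_stay):
    return (re_stay - re_fail) / (re_success - re_fail)

re_third_0_out = 1.35    # runner on 3rd, 0 out (assumed)
re_second_0_out = 1.10   # runner stays on 2nd, 0 out (assumed)
re_empty_1_out = 0.27    # runner thrown out: bases empty, 1 out (assumed)

print(round(break_even(re_third_0_out, re_empty_1_out, re_second_0_out), 2))   # ~0.77, i.e. roughly 80%
```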

▪ Even in a blowout there’s something to play for; if you come close enough to make the other team use their closer, maybe he won’t be available the next night.

I’m pretty sure the evidence suggests that players play at their best (more or less) regardless of the score. That makes sense under almost any economic or cognitive theory of behavior since players get paid big money to have big numbers. Maybe they do partially because managers and coaches encourage them to do so with tidbits like that. I don’t know.

Depending on what they mean by blowout, what they’re saying is that, say, you have a 5% chance of winning a game down six runs in the late innings. Now say you have a 20% chance of making it a 3-run or less game, and that means that the opponent’s closer comes into the game. And say that him coming into the game gives you another 2% chance of winning tomorrow because he might not be available, and an extra 1% the day after that (if it’s the first game in a series). So rather than a 5% win expectancy, you actually have 5% plus 20% * 3%, or a 5.6% WE. Is that worth extra effort? To be honest, managers and coaches are supposed to teach their players to play hard (within reason) regardless of the score, for two reasons: one, because it makes for better habits when the game is close, and two, at exactly what point is the game a blowout (Google the sorites paradox)?

▪ If it’s 0-2, 1-2 and 2-2, those are curveball counts and good counts to run on. That’s why pitchers often try pickoffs in those counts.

On the other hand, 0-2 is not a good count to run on because of the threat of the pitchout. As it turns out, the majority of SB attempts (around 68%) occur at neutral counts. Only around 16% of all steal attempts occur at those pitchers’ counts. So whoever said that is completely wrong.

Of course pitchers should (and do) attempt more pickoffs the greater the chance of a steal attempt. That also tends to make it harder to steal (hence the game theory aspect).

That being said, some smart people (e.g., Professor Ted Turocy of Chadwick Baseball Bureau) believe that there is a Nash equilibrium between the offense and defense with respect to base stealing (for most players – not at the extremes) such that neither side can exploit the other by changing their strategy. I don’t know if it’s true or not. I think Professor Turocy may have a paper on this. You can check it out on the web or contact him.

▪ Don’t worry about anyone’s batting average until they have 100 at-bats.

How about “Don’t worry about batting average…period.” In so many ways this is wrong. I would have to immediately fire whoever said that if it was a coach, manager or executive.

▪ It’s hard to beat a team three times in a row; teams change starting pitchers every night and catching three different pitchers having a down night is not the norm.

Whoever said this should be fired sooner than the one above. As in, before they even finished that colossally innumerate sentence.

▪ At this level, “see-it-and-hit” will only take you so far. The best pitchers are throwing so hard you have to study the scouting reports and have some idea of what’s coming next.

If that’s your approach at any level you have a lot to learn. That goes for 20 or 50 years ago the same as it does today. If pitchers were throwing maybe 60 mph not so much I guess. But even at 85 you definitely need to know what you’re likely to get at any count and in any situation from that specific pitcher. Batters who tell you that they are “see-it-and-hit-it” batters are lying to you or to themselves. There is no such thing in professional baseball. Even the most unsophisticated batter in the world knows that at 3-0, no outs, no runners on, his team is down 6 runs, he’s likely to be getting 100% fastballs.

▪ If a pitcher throws a fastball in a 1-1 count, nine out of 10 times, guess fastball. But if it’s that 10th time and he throws a slider instead, you’re going to look silly.

WTF? If you go home expecting your house to be empty but there are two giraffes and a midget, you’re going to be surprised.

▪ Good hitters lock in on a certain pitch, look for it and won’t come off it. You can make a guy look bad until he gets the pitch he was looking for and then he probably won’t miss it.

Probably have to fire this guy too. That’s complete bullshit. Makes no sense from a game-theory perspective or from any perspective for that matter. So just never throw him that pitch right? Then he can’t be a good hitter. But now if you never throw him the pitch he’s looking for, he’ll stop looking for it, and will instead look for the alternative pitch you are throwing him. So you’ll stop throwing him that pitch and then…. Managers and hitting coaches (and players) really (really) need a primer on game theory. I am available for the right price.

▪ According to hitting coach Dale Sveum, hitters should not give pitchers too much credit; wait for a mistake and if the pitcher makes a great pitch, take it. Don’t start chasing great pitches; stick to the plan and keep waiting for that mistake.

Now why didn’t I think of that!

▪ The Royals are not a great off-speed hitting club, so opposition pitchers want to spin it up there.

Same as above. Actually, remember this: You cannot tell how good or bad a player or team is at hitting any particular pitch by looking at the results. You can only tell by how often they get each type of pitch. Game theory tells us that the results of all the different pitches (type, location, etc.) will be about the same to any hitter. What changes depending on that hitter’s strengths and weaknesses are the frequencies. And this whole, “Team is good/bad at X” is silly. It’s about the individual players of course. I’m pretty sure there was at least one hitter on the team who is good at hitting off-speed.

Also, never evaluate or define “good hitting” based on batting average which most coaches and managers do even in 2016. I don’t have to tell you, dear sophisticated reader, that. However, you should also not define good or bad hitting on a pitch level based on OPS or wOBA (presumably on contact) either. You need to include pitches not put into play and you need to incorporate count. For example, at a 3-ball count there is a huge premium on not swinging at a ball. Your result on contact is not so important. At 2-strike counts, not taking a strike is also especially important. Whenever you see pitch level numbers without including balls not swung at, or especially only on balls put into play (which is usually the case), be very wary of those numbers. For example, a good off-speed hitting player will tend to have good strike zone recognition (and not necessarily good results on contact) skills because many more off-speed pitches are thrown in pitchers’ counts and out of the strike zone.

▪ According to catcher Kurt Suzuki, opposition pitchers should not try to strike out the Royals. Kansas City hitters make contact and a pitcher that’s going for punchouts might throw 100 pitches in five innings.

Wait. If they are a good contact team, doesn’t that mean that you can try and strike them out without running up your pitch count? Another dumb statement. Someone should tell Mr. Suzuki that pitch framing is really important.

▪ If you pitch down in the zone you can use the whole plate; any pitch at the knees is a pretty good pitch (a possible exception is down-and-in to lefties). If you pitch up in the zone you have to hit corners.

To some extent that’s true though it’s (a lot) more complicated than that. What’s probably more important is that when pitching down in the zone you want to pitch more away and when pitching up in the zone more inside. By the way, is it true lefties like (hit better) the down-and-in pitch more than righties? No, it is not. Where does that pervasive myth come from? Where do all the hundreds of myths that players, fans, coaches, managers, and pundits think are true come from?

▪ If you pitch up, you have to be above the swing path.

Not really sure what that means. Above the swing “path?” The swing path tends to follow the pitch, so that doesn’t make too much sense. “Path” implies angle of attack, and to say “above” or “below” an angle of attack doesn’t really make sense. Maybe he means, “If you are going to pitch high, pitch really high?” Or, “If the batter tends to be a high-ball hitter, pitch really high?”

▪ Numbers without context might be meaningless; or worse — misleading

I don’t know what that means. Anything might be misleading or worthless without context. Words, numbers, apple pie, dogs, cats…

▪ All walks are not equal: a walk at the beginning of an inning is worth more than a walk with two outs, a walk to Jarrod Dyson is worth more than a walk to Billy Butler.

Correct. I might give this guy one of the other guys’ (that I fired) jobs. Players, especially pitchers (but batters and fielders too), should always know the relative value of the various offensive events depending on the batter, pitcher, score, inning, count, runners, etc., and then tailor their approach to those values. This is one of the most important things in baseball.

▪ So when you look at a pitcher’s walks, ask yourself who he walked and when he walked them.

True. Walks should be weighted toward bases open, 2 outs, sluggers, close games, etc. If not, and the sample is large, then the pitcher is likely either doing something wrong, or he has terrible command/control, or both. For example, Greg Maddux went something like 10 years before he walked his first pitcher.

▪ When a pitcher falls behind 2-0 or 3-1, what pitch does he throw to get back in the count? Can he throw a 2-0 cutter, sinker or slider, or does he have to throw a fastball down the middle and hope for the best?

All batters, especially in this era of big data, should be acutely aware of a pitcher’s tendencies against their type of batter in any given situation and count. One of the most important questions is, “Does he have enough command of his secondary pitches (and how good is his fastball even when the batter knows it’s coming) to throw them in hitters’ counts, especially the 3-2 count?”

▪ Hitters who waggle the bat head have inconsistent swing paths.

I’ve never heard that before. I doubt it’s anything useful.

▪ The more violent the swing, the worse the pitch recognition. So if a guy really cuts it loose when he swings and allows his head to move, throw breaking stuff and change-ups. If he keeps his head still, be careful.

Honestly, if that’s all you know about a batter, someone is not doing their homework. And again, there’s game theory that must be accounted for and appreciated. Players, coaches and managers are just terrible at understanding this very important part of baseball especially the batter/pitcher matchup. If you think you can tell a pitcher to throw a certain type of pitch in a certain situation (like if the batter swings violently throw him off-speed), then surely the batter can and will know that too. If he does, which he surely will – eventually – then he basically knows what’s coming and the pitcher will get creamed!

There’s been much research and many articles over the years with respect to hitter (and other) aging curves. (I even came across in a Google search a fascinating component aging curve for PGA golfers!) I’ve publicly and privately been doing aging curves for 20 years. So has Tango Tiger. Jeff Zimmerman has also been prolific in this regard. Others have contributed as well. You can Google them if you want.

Most of the credible aging curves use some form of the delta method, which is described in this excellent series on aging by the crafty ne’er-do-well, MGL. If you’re too lazy to look it up, the delta method is basically this, from the article:

The “delta method” looks at all players who have played in back-to-back years. Many players have several back-to-back year “couplets,” obviously. For every player, it takes the difference between their rate of performance in Year I and Year II and puts that difference into a “bucket,” which is defined by the age of the player in those two years….

When we tally all the differences in each bucket and divide by the number of players, we get the average change from one age to the next for every player who ever played in at least one pair of back-to-back seasons. So, for example, for all players who played in their age 29 and 30 seasons, we get the simple average of the rate of change in offensive performance between 29 and 30.

That’s really the only way to do an aging curve, as far as I know, unless you want to use an opaque statistical method like J.C. Bradbury did back in 2009 (you can look that up too). One of the problems with aging curves, which I also discuss in the aforementioned article, and one that comes up a lot in baseball research, is survivorship bias. I’ll get to that in a paragraph or two.

Let’s say we want to use the delta method to compute the average change in wOBA performance from age 29 to 30. To do that, we look at all players who played in their age 29 and age 30 years, record each player’s difference, weight it by some number of PA (maybe the lesser of the two – either year 1 or year 2, maybe the harmonic mean of the two, or maybe weight them all equally – it’s hard to say), and then take the simple weighted average of all the differences. For example, say we have two players. Player A has a .300 wOBA in his age 29 season in 100 PA and a .290 wOBA in his age 30 season in 150 PA. Player B is .320 in year one in 200 PA and .300 in year two in 300 PA. Using the delta method we get a difference of -.010 (a decline) for player A weighted by, say, 100 PA (the lesser of 100 and 150), and a difference of -.020 for Player B in 200 PA (also the lesser of the two PA). So we have an average decline in our sample of (10 * 100 + 20 * 200) / (300), or 16.67 points of wOBA decline. We would do the same for all age intervals and all players and if we chain them together we get an aging curve for the average MLB player.
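If you want to see that arithmetic spelled out, here is a quick Python sketch using the two hypothetical players above and the “lesser of the two PA” weighting (any of the other weighting choices would work the same way):

```python
# Delta method for one age bucket (29 to 30), using the two hypothetical
# players from the example and weighting each difference by the lesser PA.

players = [
    # (wOBA age 29, PA age 29, wOBA age 30, PA age 30)
    (0.300, 100, 0.290, 150),  # Player A
    (0.320, 200, 0.300, 300),  # Player B
]

weighted_sum = 0.0
total_weight = 0.0
for woba1, pa1, woba2, pa2 in players:
    diff = woba2 - woba1      # change from season one to season two
    weight = min(pa1, pa2)    # "lesser of the two PA" weighting
    weighted_sum += diff * weight
    total_weight += weight

avg_change = weighted_sum / total_weight
print(f"Average change from 29 to 30: {avg_change:.5f}")  # about -0.01667, i.e., -16.67 points
```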

There are issues with that calculation, such as our choice to weight each player’s difference by the “lesser of the two PA,” what it means to compute “an average decline” for that age interval (since it includes all types of players, part-time, full-time, etc.) and especially what it means when we chain every age interval together to come up with an aging curve for the average major league player when it’s really a compendium of a whole bunch of players all with different career lengths at different age intervals.

Typically when we construct an aging curve, we’re not at all looking at the careers of any individual players. If we do that, we end up with severe selective sampling and survivorship problems. I’m going to ignore all of these issues and focus on survivorship bias only. It has the potential to be extremely problematic, even when using the delta method.

Let’s say that a player is becoming a marginal player for whatever reason, perhaps because he is at the end of his career. Let’s also say that we have a bunch of players like that and their true talent is a wOBA of .280. If we give them 300 PA, half will randomly perform better than that and half will randomly perform worse, simply because 300 PA is just a random sample of their talent. In fact, we know that the random standard deviation of wOBA in 300 trials is around 25 points, such that 5% of our players, who we know have a true talent of .280, will actually hit .230 or less by chance alone. That’s a fact. There’s nothing they or anyone else can do about it. No player has an “ability” to fluctuate less than random variance dictates in any specific number of PA. There might be something about them that creates more variance on the average, but it is mathematically impossible to have less (actually the floor is a bit higher than that because of varying opponents and conditions).

Let’s assume that all players who hit less than .230 will retire or be cut – they’ll never play again, at least not in the following season. That is not unlike what happens in real life when a marginal player has a bad season. He almost always gets fewer PA the following season than he would have gotten had he not had an unlucky season. In fact, not playing at all is just a subset of playing less – both are examples of survivorship bias and create problems with aging curves. Let’s see what happens to our aging interval with these marginal players when 5% of them don’t play the next season.

We know that this entire group of players are .280 hitters because we said so. If 5% of them hit, on average, .210, then the other 95% must have hit .284 since the whole group must hit .280 – that’s their true talent. This is just a typical season for a bunch of .280 hitters. Nothing special going on here. We could have split them up any way we wanted, as long as in the aggregate they hit at their true talent level.

Now let’s say that these hitters are in their age 30 season and they are supposed to decline by 10 points in their age 31 season. If we do an aging calculation on these players in a typical pair of seasons, we absolutely should see .280 in the first year and .270 in the second. In fact, if we let all our players play a random or a fixed number of PA in season two, that is exactly what we would see. It has to be. It is a mathematical certainty, given everything we stated. However, survivorship bias screws up our numbers and results in an incorrect aging value from age 30 to age 31. Let’s try it.

Only 95% of our players play in season two, so 5% drop out of our sample, at least from age 30 to age 31. There’s nothing we can do about that. When we compute a traditional aging curve using the delta method, we only use numbers from pairs of years. We can never use the last year of a player’s career as the first year in a year pairing. We don’t have any information about that player’s next season. We can use a player’s last year, say, at age 30 in an age 29 to 30 pairing but not in a 30 to 31 pairing. Remember that the delta method always uses age pairings for each player in the sample.

What do those 95% hit in season one? Remember, they are true .280 hitters. Well, they don’t hit .280. I already said that they hit .284. That is because they got a little lucky. The ones who got really unlucky, balancing out the lucky ones, are not playing in season two and thus dropped out of our aging-curve sample. What do these true .280 players (who hit .284) hit in season two? Season two is an unbiased sample of their true talent. We know that their true talent was .280 in season one and we know that from age 30 to age 31 all players lose 10 points in true talent, because we said so. So they will naturally hit .270 in year two.

What does our delta method calculation tell us about how players age from age 30 to age 31? It tells us they lose 14 points in wOBA and not 10! It’s giving us a wrong answer because of survivorship bias. Had those other 5% of players played, they would have also hit .270 in year two and when we add everyone up, including the unlucky players, we would come up with the correct answer of a 10-point loss from age 30 to age 31 (the unlucky players would have improved in year two by 60 points).
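Here is that toy example done explicitly in Python. Nothing is simulated or estimated; the 5% drop-out rate, the .210 season-one average for the unlucky group, and the 10-point true decline are just the numbers stipulated above, and everything else follows from them:

```python
# Toy survivorship-bias example: a pool of true .280 hitters at age 30 whose true
# talent falls to .270 at age 31. By construction, 5% get unlucky in season one
# (hitting .210 on average) and never play again; the rest must have hit about
# .284 for the whole group to average .280.

drop_rate = 0.05
unlucky_season1 = 0.210
group_true_season1 = 0.280
group_true_season2 = 0.270  # everyone loses 10 points of true talent

# Season-one average of the survivors, given the group average must be .280:
survivor_season1 = (group_true_season1 - drop_rate * unlucky_season1) / (1 - drop_rate)

# Season two is an unbiased sample of true talent, so the survivors hit .270:
survivor_season2 = group_true_season2

observed = (survivor_season1 - survivor_season2) * 1000  # what the delta method sees
actual = (group_true_season1 - group_true_season2) * 1000

print(f"Survivors' season-one wOBA: {survivor_season1:.4f}")         # ~0.2837, i.e., .284
print(f"Apparent decline (survivors only): {observed:.1f} points")   # ~13.7, which rounds to 14
print(f"True decline: {actual:.0f} points")                          # 10
```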

One way to avoid this problem (survivorship bias will always make it look like players lose more or gain less as they age because the players that drop out from season to season always, on the average, got unlucky in season one) is to ignore the last season of a player’s career in our calculations. That’s fine and dandy, but survivorship bias exists in every year of a player’s career. As I wrote earlier, dropping out is just a small subset of this bias. Every player that gets unlucky in one season will see fewer PA in his next season, which creates the same kind of erroneous results. For example, if the 5% of unlucky players did play in season two, but only got 50 PA whereas the other 95% of slightly lucky players got 500 PA, we would still come up with a decline of more than 10 points of wOBA – again an incorrect answer.

To correct for this survivorship bias, which really wreaks havoc with aging curves, a number of years ago, I decided to add a phantom year for players after their last season of action. For that year, I used a projection – our best estimate of what they would have done had they been allowed to play another year. That reduced the survivorship bias but it didn’t nearly eliminate it because, as I said, every player suffers from it in reduced PA for unlucky players and increased PA for lucky ones, in their subsequent seasons.

Not only that, but we get the same effect within years. If two players have .300 wOBA true talents, but player A hits worse than .250 by luck alone in his first month (which will happen more than 16% of the time) and player B hits .350 or more, who do you think will get more playing time for the remainder of the season even though we know that they have the same talent, and that both, on the average, will hit exactly .300 for the remainder of the season?

I finally came up with a comprehensive solution based on the following thought process: If we were conducting an experiment, how would we approach the question of computing aging intervals? We would record every player’s season one (which would be an unbiased sample of his talent, so no problem so far) and then we would guarantee that every player would get X number of PA the next season, preferably something like 500 or 600 to create large samples of seasonal data. We would also give everyone a large number of PA in all season ones too, but it’s not really necessary.

How do we do that? We merely extend season two data using projections, just as I did in adding phantom seasons after a player’s career was over (or he missed a season in the middle of his career). Basically I’m doing the same thing, whether I’m adding 600 PA to a player who didn’t play (the phantom season) or I’m adding 300 PA to a player who only had 300 PA in season two. By doing this I am completely eliminating survivorship bias. Of course this correction method lives or dies with how accurate the projections are but even a simple projection system like Marcel will suffice when dealing with a large number of players of different talent levels. Now let’s get to the results.

I looked at all players from 1977 to 2016 and I park and league adjusted their wOBA for each season. Essentially I am using wOBA+. I also only looked at seasonal pairs (with a minimum of 10 PA in each season) where the player played on the same team. I didn’t have to do that, but my sample was large enough that I felt that getting rid of any lingering park biases was worth the reduction in sample size, even though I was dealing with park-adjusted numbers.

Using the delta method with no survivorship bias other than ignoring the last year of every player’s career, this is the aging curve I arrived at after chaining all of the deltas. This is the typical curve you will see in most of the prior research.

1977-2016 Aging Curve using Delta Method Without Correcting for Survivorship Bias

[Figure: curve1]

Here is the same curve after completing all season two’s with projections. For example, let’s say that a player is projected to hit .300 in his age 30 season and he hits .250 in only 150 PA (his manager benches him because he’s hit so poorly). His in-season projection would change because of the .250. It might now be .290. So I complete a 600 PA season by adding 450 PA of .290 hitting to the 150 PA of .250 hitting for a complete season of .280 in 600 PA.

If that same player hits .320 in season two in 620 PA then I add nothing to his season two data. Only players with less than 600 PA have their seasons completed with projections. How do I weight the season pairs? Without any completion correction, as in the first curve above, I weighted each season pair by the harmonic mean of the two PA. With correction, as in the second curve above, I weighted each pair by the number of PA in season one. This corrects for intra-season survivorship bias in season one as well.
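Here is a bare-bones sketch of the completion step in Python, using the player above (.250 over 150 PA with an in-season projection of .290, completed to 600 PA). The function name and the 600 PA target are just for illustration:

```python
# "Complete" a partial season two up to a fixed number of PA by padding the
# observed line with the player's (in-season) projection.

TARGET_PA = 600

def complete_season(observed_woba, observed_pa, projection, target_pa=TARGET_PA):
    """Pad a partial season with projected performance up to target_pa."""
    if observed_pa >= target_pa:
        return observed_woba, observed_pa            # e.g., .320 in 620 PA is left alone
    phantom_pa = target_pa - observed_pa
    completed_woba = (observed_woba * observed_pa + projection * phantom_pa) / target_pa
    return completed_woba, target_pa

woba, pa = complete_season(0.250, 150, 0.290)
print(f"Completed season two: {woba:.3f} in {pa} PA")  # 0.280 in 600 PA
```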

1977-2016 Aging Curve using Delta Method and Correcting for Survivorship Bias

[Figure: curve2]

You can see that in the first curve, uncorrected for survivorship bias, players gain around 40 points in wOBA from age 21 to age 27, seven points per year, plateau from age 27 to 28, then decline by also around seven points a year after that. In the second curve, after we correct for survivorship bias, we have a slightly quicker ascension from age 21 to 26, more than eight points per year, a plateau from age 26 to age 27, then a much slower decline at around 3 points per year.

Keep in mind that these curves represent all players from 1977 to 2016. It is likely that aging has changed significantly from era to era due to medical advances, PED use and the like. In fact, if we limit our data to 2003 and later, after the so-called steroid era, we get an uncorrected curve that plateaus between ages 24-28 and then declines by an average of 9 points a year from age 28 to 41.

In my next installment I’ll do some survivorship corrections for components like strikeout and walk percentage.

Now that Adam Eaton has been traded from the White Sox to the Nationals much has been written about his somewhat unusual “splits” in his outfield defense as measured by UZR and DRS, two of the more popular batted-ball defensive metrics. In RF, his career UZR per 150 games is around +20 runs and in CF, -8 runs. He has around 100 career games in RF and 300 in CF. These numbers do not include “arm runs” as I’m going to focus only on range and errors in this essay. If you are not familiar with UZR or DRS you can do some research on the net or just assume that they are useful metrics for quantifying defensive performance and for projecting defense.

In 2016 Eaton was around -13 in CF and +20 in RF. DRS was similar but with a narrower (but still unusual) spread. We expect that a player who plays both CF and the corners in a season, or within a career, will have a spread of around 5 or 6 runs between CF and the corners (more between CF and RF than between CF and LF). For example, a CF’er who has a UZR of zero, and thus is exactly average among all CF’ers, will have a UZR of around +5.5 at the corners, again a bit more in RF than LF (LF’ers are better fielders than RF’ers).

This has nothing to do with how “difficult” each position is (that is hard to define anyway – you could even make the argument that the corner positions are “harder” than CF), as UZR and DRS are calculated as runs above or below the average fielder at that position. It merely means that the average CF’er is a better fielder than the average corner OF’er by around 5 or 6 runs. Mostly they are faster. The reason teams put their better fielder in CF is not because it is an inherently more “difficult” position but because it gets around twice the number of opportunities per game as the corner positions, such that you can leverage talent in the OF.

Back to Eaton. He appears to have performed much better in RF than we would expect given his performance in CF (or vice versa) or even overall. Does this mean that he is better suited to RF (and perhaps LF, where he hasn’t played much in his career) or that the big, unusual gap we see is just a random fluctuation, or somewhere in the middle as is often (usually) the case? Should the Nationals make every effort to play him in RF and not CF? After all, their current RF’er, Harper, has unusual splits too, but in the opposite direction – his career CF UZR is better than his career RF UZR! Or perhaps the value they’re getting from Eaton is diminished if they’re going to play him in CF rather than RF.

How could it be that a fielder could have such unusual defensive splits and it be solely or mostly due to chance only? The same reason a hitter can have unusual but random platoon splits or a pitcher can have unusual but random home/road or day/night splits. A metric like UZR or DRS, like almost all metrics, contains a large element of chance, or noise if you will. That noise comes from two sources – one is because the data and methodology are far from perfect and two is that actual defensive performance can fluctuate randomly (or for reasons we are just not aware of) from one time period to another – from play to play, game to game, or position to position, for various reasons or for no reason at all.

To the first point, just because our metric “says” that a player was +10 in UZR, that does not necessarily mean that he performed exactly that well. In reality, he might have performed at a +15 level, or at a +5, a 0, or even a -10 level. It’s more likely of course that he performed at +10 than at +20 or 0, but because of the limits of our data and methodology, the +10 is only an estimate of his performance. To the second point, actual fielding performance, even if we could measure it precisely, is, like hitting and pitching, subject to random fluctuations for reasons known (or at least speculated) and unknown to us. On one play a player can get a great jump and make a spectacular play; on another, that same player can take a bad route, get a bad jump, or have the ball pop out of his glove. Some days fielders probably feel better than others. Etc.

So whenever we compare one time period to another or one position to another, even ones which require similar, perhaps even identical, skills, like in the OF, it is possible, even likely, that we are going to get different results by chance alone, or at least because of the two dynamics I explained above (don’t get hung up on the words “luck”, “chance” or “random”). Statistics tell us that those random differences will be more and more unlikely the further away we get from what is expected (e.g., we expect that play in CF will be 5 or 6 runs “worse” than play in RF or LF), however, statistics also tells us that any difference, even large ones like we see with Eaton (or more), can and do occur by chance alone.

At the same time, it is possible, maybe even likely, that a player could somehow be more suited to RF (or LF) than CF, or vice versa. So how do we determine how much of an unusual “split” in OF defense, for example, is likely chance and how much is likely “skill?” In other words, what would we expect future defense to be in RF and in CF for a player with unusual RF/CF splits? Remember that future performance always equates to an estimate of talent, more or less. For example, if we find strong evidence that almost all of these unusual splits are due to chance alone (virtually no skill), then we must assume that in the future the player with the unusual splits will revert to normal splits in any future time frame. In the case of Eaton that would mean that we would construct an OF projection based on all of his OF play, adjusted for position, and then do the normal adjustment for our CF or RF projection, such that his RF projection will be around 7 runs greater than his CF projection rather than the 20 run or more gap that we see in his past performance.

To examine this question, I looked at all players who played at least 20 games in CF and RF or LF from 2003 through 2015. I isolated those with various unusual splits. I also looked at all players to establish a baseline. At the same time, I crafted a basic one-season Marcel-like projection from that CF and corner performance combined. The way I did that was to adjust the corners to represent CF by subtracting 4 runs from LF UZR and 7 runs from RF UZR. Then I regressed that number based on the number of total games in that one season, added in an aging factor (-.5 runs for players under 27 and -1.5 runs for players 27 and older), and the resulting number was a projection for CF.

We can then take that number and add 4 runs for a LF projection and 7 runs for a RF projection. Remember these are range and errors only (no arm). So, for example, if a player was -10 in CF per 150 in 50 games and +3 in RF in 50 games, his projection would be:

Subtract 7 runs from his RF UZR to convert it into “CF UZR”, so it’s now -4. Average that with his -10 UZR in CF, which gives him a total of -7 runs in 100 games. I am using 150 games as the 50% regression point, so we regress this player 150/(150+100), or 60%, toward a mean of -3 (because these are players who play both CF and corner, they are below-average CF’ers). That comes out to -4.6. Add in an aging factor, say -.5 for a 25-year-old, and we get a projection of -5.1 for CF. That would mean a projection of -1.1 in LF, a +4 run adjustment, and +1.9 in RF, a +7 run adjustment, assuming normal “splits.”
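Here is that toy projection written out as a short Python sketch. This is just the recipe above spelled out, not my actual code, and the names and structure are purely illustrative:

```python
# Toy OF projection: convert corner UZR to a "CF scale," take a games-weighted
# average, regress toward the dual-player CF mean, apply a simple aging factor,
# then add back the positional offsets for LF/RF projections.

REG_GAMES = 150          # the 50% regression point, in games
CF_MEAN = -3.0           # mean for players who split time between CF and a corner
LF_OFFSET, RF_OFFSET = 4.0, 7.0

def project_cf(cf_uzr150, cf_games, rf_uzr150, rf_games, age):
    rf_as_cf = rf_uzr150 - RF_OFFSET                                # +3 in RF becomes -4 "CF UZR"
    games = cf_games + rf_games
    combined = (cf_uzr150 * cf_games + rf_as_cf * rf_games) / games # -7 runs in 100 games
    shrink = REG_GAMES / (REG_GAMES + games)                        # 150/250 = 60% toward the mean
    regressed = CF_MEAN + (1 - shrink) * (combined - CF_MEAN)       # -4.6
    aging = -0.5 if age < 27 else -1.5
    return regressed + aging

cf_proj = project_cf(cf_uzr150=-10, cf_games=50, rf_uzr150=3, rf_games=50, age=25)
print(f"CF projection: {cf_proj:+.1f}")               # about -5.1
print(f"LF projection: {cf_proj + LF_OFFSET:+.1f}")   # about -1.1
print(f"RF projection: {cf_proj + RF_OFFSET:+.1f}")   # about +1.9
```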

So let’s look at some numbers. To establish a baseline and test (and calibrate) our projections, let’s look at all players who played CF and LF or RF in season one (min 20 games in each) and then their next season in either CF or the corners:

Position  | UZR season one        | UZR season two   | Projected UZR
LF or RF  | +6.0 (N games=11,629) | 2.1 (N=42,866)   | 2.1
CF        | -3.0 (N=9,955)        | -.8 (N=23,083)   | -.9

 

The spread we see in column 2, “UZR season one” is based on the “delta method”. It is expected to be a little wider than the normal talent spread we expect between CF and LF/RF which is around 6 runs. That is because of selective sampling. Players who do well at the corners will tend to also play CF and players who play poorly in CF will tend to get some play at the corners. The spread we see in column 3, “UZR season two” does not mean anything per se. In season two these are not necessarily players who played both positions again (they played either one or the other or both). All it means is that of players who played both positions in season one, they are 2.1 runs above average at the corners and .8 runs below average in CF, in season two.

Now let’s look at the same table for players like Eaton, who had larger than normal splits between a corner position and CF. I used a threshold of at least a 10-run difference (5.5 is typical). There were 254 players who played at least 20 games in CF and in RF or LF in one season and then played in LF in the next season, and 138 players who played in CF and LF or RF in one season and in RF in the next.

Position  | UZR season one        | UZR season two   | Projected UZR
LF or RF  | +12.7 (N games=4,924) |                  | 1.4
CF        | -12.3 (N=4,626)       |                  | .3

 

For now, I’m leaving the third column, their UZR in season two, empty. These are players who appeared to be better suited at a corner position than in CF. If we assume that these unusual splits are merely noise, a random fluctuation, and that we expect them to have a normal split in season two, we can use the method I describe above to craft a projection for them. Notice the small split in the projections. The projection model I am using creates a CF projection and then it merely adds +4 runs for LF and +7 for RF. Given a 25-run split in season one rather than a normal 6-run split, we might assume that these players will play better, maybe much better, in RF or LF than in CF, in season two. In other words, there is a significant “true talent defensive split” in the OF. So rather than 1.4 in LF or RF (our projection assumes a normal split), we might see a performance of +5, and instead of .3 in CF, we might see -5, or something like that.

Remember that our projection doesn’t care how the CF and corner OF UZR’s are distributed in season one. It assumes static talent and just converts corner UZR to CF UZR by subtracting 4 or 7 runs. Then when it finalizes the CF projection, it assumes we can just add 4 runs for a LF projection and 7 runs for a RF one. It treats all OF positions the same, with a static conversion, regardless of the actual splits. The projection assumes that there is no such thing as “true talent OF splits.”

Now let’s see how well the projection does with that assumption (no such thing as “true talent OF defensive splits”). Remember that if we assume that there is “something” to those unusual splits, we expect our CF projection to be too high and our LF/RF projection to be too low.

Position  | UZR season one        | UZR season two   | Projected UZR
LF or RF  | +12.7 (N games=4,924) | .9 (N=16,857)    | 1.4
CF        | -12.3 (N=4,626)       | .8 (N=10,250)    | .3

 

We don’t see any evidence of a “true talent OF split” when we compare projected to actual. In fact, we see the opposite effect, which is likely just noise (our projection model is pretty basic and not very precise). Instead of seeing better than expected defense at the corners as we might expect from players like Eaton who had unusually good defense at the corners compared to CF in season one, we see slightly worse than projected defense. And in CF, we see slightly better defense than projected even though we might have expected these players to be especially unsuited to CF.

Let’s look at players, unlike Eaton, who have “reverse” splits. These are players who, in at least 20 games in both CF and LF or RF, had a better UZR in CF than at the corners.

Position  | UZR season one        | UZR season two   | Projected UZR
LF or RF  | -4.8 (N games=3,299)  | 1.4 (N=15,007)   | 2.4
CF        | +7.8 (N=3,178)        | -4.4 (N=6,832)   | -2.6

 

Remember, the numbers in column two, season one UZR “splits” are based on the delta method. Therefore, every player in our sample had a better UZR in CF than in LF or RF and the average difference was 12.6 runs (in favor of CF) whereas we expected an average difference of minus 6 runs or so (in favor of LF/RF). The “delta method” just means that I averaged all of the players’ individual differences weighted by the lesser of their games, either in CF or LF/RF.

Again, according to the “these unusual splits must mean something” (in terms of talent and what we expect in the next season) theory, we expect these players to significantly exceed their projection in CF and undershoot it at the corners. Again, we don’t see that. We see that our projections are high for both positions; in fact, we overshoot more in CF than in RF/LF, exactly the opposite of what we would expect if there were any significance to these unusual splits. Again we see no evidence of a “true talent split in OF defense.”

For players with unusual splits in OF defense, we see that a normal projection at CF or at the corners suffices. We treat LF/RF/CF UZR exactly the same making static adjustments regardless of the direction and magnitude of the empirical splits. What about the idea that, “We don’t know what to expect with a player like Eaton?” I don’t really know what that means, but we hear it all the time when we see numbers that look unusual or “trendy” or appear to follow a “pattern.” Does that mean we expect there to be more fluctuation in season two UZR? Perhaps even though on the average they revert to normal spreads, we see a wider spread of results in these players who exhibit unusual splits in season one. Let’s look at that in our final analysis.

When we look at all players who played CF and LF/RF in season one, remember the average spread was 9 runs, +6 at the corners and -3 in CF. In season two, 28% of the players who played RF or LF had a UZR greater than +10, and 26% of those in CF had a UZR of -10 or worse. The standard deviation of the distribution of season-two UZR was 13.9 runs for LF/RF and 15.9 in CF.

What about our players like Eaton? Can we expect more players to have a poor UZR in CF and a great one at a corner? No. 26% of these players had a UZR greater than +10 at a corner and 25% had a UZR less than -10 in CF, around the same as for all “dual” players in season one. In fact, we get a smaller spread with these players with unusual splits, as we would expect given that their means in CF and at the corners are actually closer together (look at the tables above). The standard deviation of the distribution of season-two UZR for these players was 13.2 runs for LF/RF and 15.3 in CF, slightly smaller than for all “dual” players combined.

In conclusion, there is simply nothing to write about when it comes to Eaton’s or anyone else’s unusual outfield UZR or DRS splits. If you want to estimate their UZR going forward simply adjust and combine all of their OF numbers and do a normal projection. It doesn’t matter if they have -16 in LF and +20 in CF, 0 runs in CF only, or +4 runs in LF only. It’s all the same thing with exactly the same projection and exactly the same distribution of results the next season.

As far as we can tell there is simply no such thing (to any significant or identifiable degree) as an outfielder who is more suited to one OF position than another. There is outfield defense – period. It doesn’t matter where you are standing in the OF. The ability to catch line drives and fly balls in the OF is more or less the same whether you are standing in the middle or on the sides of the OF (yes it could take some time to get used to a position if you are unfamiliar with it). If you are good in one location you will be good at another, and if you are bad at one location you will be bad at another. Your UZR or DRS might change in a somewhat predictable fashion depending upon what position, CF, LF, or RF is being measured, but that’s only because the players you are measured against (those metrics are relative) differ in their average ability to catch fly balls and line drives. More importantly, when you see a player who has an unusual “split” in their outfield numbers, like Eaton, you will be tempted to think that they are intrinsically better at one position than another and that the unusual split will tend to continue in the future. When you see really large splits you will be tempted even more. Remember the words in this paragraph and remember this analysis to avoid being fooled by randomness into drawing faulty conclusions, as all human beings, even smart ones, are wont to do.