Let me explain game theory wrt sac bunting using tonight’s CLE game as an example. Bottom of the 10th, leadoff batter on first, Gimenez is up. He is a very weak batter with little power or on-base skills, and the announcers say, “You would expect him to be bunting.” He clearly is.

Now, in general, to determine whether to bunt or not, you estimate the win expectancies (WE) based on the frequencies of the various outcomes of the bunt, versus the frequencies of the various outcomes of swinging away. Since, for a position player, those two final numbers are usually close, even in late tied-game situations, the correct decision usually hinges on a few things. On the swing side: whether the batter is a good hitter or not, and his expected GDP rate. On the bunt side: how good a sac bunter he is and how fast he is (both of which affect the single and ROE frequencies, an important part of the bunt WE).

Gimenez is a terrible hitter, which favors the bunt attempt, but he is also not a good bunter and he is slow, which favors hitting away. So the WE's are probably somewhat close.

One thing that affects the WE for both bunting and swinging, of course, is where the third baseman plays before the pitch is thrown. Now, in this game, it was obvious that Gimenez was bunting all the way and everyone seemed fine with that. I think the announcers and probably everyone would have been shocked if he didn’t (we’ll ignore the count completely for this discussion – the decision to bunt or not clearly can change with it).

The announcers also said, “Sano is playing pretty far back for a bunt.” He was playing just on the dirt I think, which is pretty much “in between when expecting a bunt.” So it did seem like he was not playing up enough.

So what happens if he moves up a little? Maybe now it is correct to NOT bunt because the more he plays in, the lower the WE for a bunt and the higher the WE for hitting away! So maybe he shouldn’t play up more (the assumption is that if he is bunting, then the closer he plays, the better). Maybe then the batter will hit away and correctly so, which is now better for the offense than bunting with the third baseman playing only half way. Or maybe if he plays up more, the bunt is still correct but less so than with him playing back, in which case he SHOULD play up more.

So what is supposed to happen? Where is the third baseman supposed to play and what does the batter do? There is one answer and one answer only. How many managers and coaches do you think know the answer (they should)?

The third baseman is supposed to play all the way back “for starters” in his own mind, such that it is clearly correct for the batter to bunt. Now he knows he should play in a little more. So in his mind again, he plays up just a tad bit.

Now is it still correct for the batter to bunt? IOW, is the bunt WE higher than the swing WE given where the third baseman is playing? If it is, of course he is supposed to move up just a little more (in his head).

When does he stop? He stops of course when the WE from bunting is exactly the same as the WE from swinging. Where that is completely depends on those things I talked about before, like the hitting and bunting prowess of the batter, his speed, and even the pitcher himself.

What if he keeps moving up in his mind and the WE from bunting is always higher than hitting, like with most pitchers at the plate with no outs? Then the 3B simply plays in as far as he can, assuming that the batter is bunting 100%.
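To make that stopping rule concrete, here is a minimal sketch. The two WE functions are invented placeholders (in reality they would come from empirical WE tables or a simulation), and the depth scale and numbers are my own assumptions; the only point is the logic described above.

```python
# Hypothetical sketch of the third baseman's "mental" positioning process.
# depth runs from 0.0 (all the way in) to 1.0 (all the way back).
# The WE functions below are invented placeholders, not real estimates.

def we_bunt(depth):
    # Bunting gets better for the offense as the 3B plays deeper.
    return 0.55 + 0.10 * depth

def we_swing(depth):
    # Swinging away gets worse for the offense as the 3B plays deeper.
    return 0.62 - 0.04 * depth

def equilibrium_depth(lo=0.0, hi=1.0, tol=1e-6):
    """Find the depth where WE(bunt) == WE(swing), or a boundary if they never cross."""
    if we_bunt(lo) >= we_swing(lo):
        return lo   # bunt is better even with the 3B all the way in: he plays in, batter always bunts
    if we_bunt(hi) <= we_swing(hi):
        return hi   # swinging is better even with the 3B all the way back: batter never bunts
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if we_bunt(mid) < we_swing(mid):
            lo = mid   # WE(bunt) < WE(swing) here, so the crossing point is deeper
        else:
            hi = mid
    return (lo + hi) / 2

d = equilibrium_depth()
print(f"equilibrium depth: {d:.3f}")
print(f"WE(bunt) = {we_bunt(d):.4f}, WE(swing) = {we_swing(d):.4f}")
```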

So in our example, if Sano is indeed playing at the correct depth which maybe he was and maybe he wasn’t, then the WE from bunting and hitting must be exactly the same, in which case, what does the batter do? It doesn’t matter, obviously! He can do whatever he wants, as long as the 3B is playing correctly.

So in a bunt situation like this, assuming that the 3B (and other fielders, if applicable) is playing reasonably correctly, it NEVER matters what the batter does. That should be the case in every single potential sac bunt situation you see in a baseball game. It NEVER matters what the batter does. Bunting and not bunting are equally "correct." They result in exactly the same WE.

The only exceptions (which do occur) are when the WE from bunting is always higher than swinging when the 3B is playing all the way up (a poor hitter and/or exceptional bunter) OR the WE from swinging is always higher even when the 3B is playing completely back (a good or great hitter and/or poor bunter).

So unless you see the 3B playing all the way in or all the way back, and the defense is playing reasonably optimally, it NEVER matters what the batter does. Bunt or not bunt and the win expectancy is exactly the same! And if the 3rd baseman is playing all the way in or all the way back, and is playing optimally, then it is always correct for the batter to bunt 100% of the time (all the way in) or to swing away 100% of the time (all the way back).

I won't go into this too much because the post assumed that the defense was playing optimally, i.e., it was in a "Nash equilibrium" (as I explained, it is positioned such that the WE for bunting and swinging are exactly equal), or it was correctly playing all the way in (the WE for bunting is still equal to or greater than that for swinging) or all the way back (the WE for swinging is >= that for bunting). But if the defense is NOT playing optimally, then the batter MUST bunt or swing away 100% of the time, whichever of the two yields the higher WE.

This is critical, and amazingly there is not ONE manager or coach in MLB who understands it and thus utilizes a correct bunt strategy or bunt defense.

Note: There is the beginning of a very good discussion about this topic on The Book blog. If this topic interests you, feel free to check it out and participate if you want to.

I’ve been thinking about this for many years and in fact I have been threatening to redo my UZR methodology, in order to try and reduce one of the biggest weaknesses inherent in most if not all of the batted ball advanced defensive metrics.

Here is how most of these metrics work: Let’s say a hard hit ball was hit down the third base line and the third baseman made the play and threw the runner out. He would be credited with an out minus the percentage of time that an average fielder would make the same or similar play, perhaps 40% of the time. So the third baseman would get credit for 60% of a “play” on that ball, which is roughly .9 runs (the difference between the average value of a hit down the 3rd base line and an out) times .6 or .54 runs. Similarly, if he does not make the play, he gets debited with .4 plays or minus .36 runs.
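As a rough sketch of that bookkeeping (just the arithmetic described above, with the 40% catch rate and the .9 run hit-minus-out value as assumed example inputs, not anyone's actual metric code):

```python
# Standard credit/debit bookkeeping for one batted ball in a bucket.
# bucket_catch_rate and hit_minus_out_runs are the example values from the text.

def plays_and_runs(made_play, bucket_catch_rate=0.40, hit_minus_out_runs=0.9):
    """Return (plays credited, runs credited) for a single batted ball."""
    if made_play:
        plays = 1 - bucket_catch_rate   # e.g. +0.60 of a play for the catch
    else:
        plays = -bucket_catch_rate      # e.g. -0.40 of a play for the miss
    return plays, plays * hit_minus_out_runs

print(plays_and_runs(True))    # ≈ (0.60, 0.54)   -> +.54 runs
print(plays_and_runs(False))   # ≈ (-0.40, -0.36) -> -.36 runs
```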

There are all kinds of adjustments that can be made, such as park effects, handedness of the batter, speed of the runner, outs and base runners (these affect the positioning of the fielders and therefore the average catch rate), and even the G/F ratio of the pitcher (e.g., a ground ball pitcher's "hard" hit balls will be a little softer than a fly ball pitcher's "hard" hit balls).

Anyway, here is the problem with this methodology which, as I said, is basic to most if not all of these defensive metrics, and it has to do with our old friend Bayes. As is usually the case, this problem is greater in smaller sample sizes. We don't really know the probability of an average fielder making any given play; we can only roughly infer it from the characteristics of the batted ball that we have access to and perhaps from the context that I described above (like the outs, runners, batter hand, park, etc.).

In the above example, a hard hit ground ball down the third base line, I said that the league average catch rate was 40%. Where did I get that number from? (Actually, I made it up, but let's assume that it is the correct number in MLB over the last few years, given the batted ball location database that we are working with.) We looked at all hard hit balls hit to that approximate location (right down the third base line), according to the people who provide us with the database, and found that of those 600-some-odd balls over the last 4 years, 40% of them were turned into outs by the third baseman on the field.

So what is wrong with giving a third baseman .6 credit when he makes the play and .4 debit when he doesn’t? Well, surely not every single play, if you were to “observe” and “crunch” the play like, say, Statcast would do, is caught exactly 40% of the time. For any given play in that bucket, whether the fielder caught the ball or not, we know that he didn’t really have exactly a 40% chance of catching it if he were an average fielder. You knew that already. That 40% is the aggregate for all of the balls that fit into that “bucket” (“hard hit ground ball right down the third base line”).

Sometimes it’s 30%. Other times it’s 50%. Still other times it is near 0 (like if the 3rd baseman happens to be playing way off the line, and correctly so) or near 100% (like when he is guarding the line and he gets a nice big hop right in front of him), and everything in between.

On the average it is 40%, so you say, well, what are we to do? We can’t possibly tell from the data how much it really varies from that 40% on any particular play, which is true. So the best we can do is assume 40%, which is also true. That’s just part of the uncertainty of the metric. On the average, it’s right, but with error bars. Right? Wrong!

We do have information which helps us to nail down the true catch percentage of the average fielder given that exact same batted ball, at least how it is recorded by the people who provide us with the data. I’m not talking about the above-mentioned adjustments like the speed of the batter, his handedness, or that kind of thing. Sure, that helps us and we can use it or not. Let’s assume that we are using all of these “contextual adjustments” to the best of our ability. There is still something else that can help us to tweak those “league average caught” percentages such that we don’t have to use 40% on every hard hit ground ball down the line. Unfortunately, most metrics, including my own UZR, don’t take advantage of this valuable information even though it is staring us right in the face. Can you guess what it is?

The information that is so valuable is whether the player caught the ball or not! You may be thinking that that is circular logic or perhaps illogical. We are using that information to credit or debit the fielder. How and why would we also use it to change the base line catch percentage – in our example, 40%? In comes Bayes.

Basically what is happening is this: Hard ground ball is hit down the third base line. Overall 40% of those plays are made, but we know that not every play has a 40% chance of being caught because we don’t know where the fielder was positioned and we don’t really know the exact characteristics of the ball which greatly affect its chances of being caught: it was hit hard, but how hard? What kind of a bounce did it take? Did it have spin? Was it exactly down the line or 2 feet from the line (they were all classified as being in the same “location”)? We know the runner is fast (let’s say we created a separate bucket for those batted balls with a fast runner at the plate), but exactly how fast was he? Maybe he was a blazer and he beat it out by an eyelash.

So what does that have to do with whether the fielder caught the ball or not? That should be obvious by now. If the third baseman did not catch the ball, on the average, it should be clear that the ball tended to be one of those balls that were harder to catch than the average ball in that bucket. In other words, the chance that any ball that was not caught would have been caught by an average fielder is clearly less than 40%. Similarly, if a ball was caught, by any fielder, it was more likely to be an easier play than the average ball in that bucket. What we want are conditional probabilities, based on whether the ball was caught or not.

How much easier are the caught balls than the not-caught ones in any given bucket? That's hard to say. Really hard to say. One would have to have lots of information in order to apply Bayes' theorem to better estimate the "catch rate" of a ball in a particular bucket based on whether it is caught or not caught. I can tell you that I think the differences are pretty significant. It mostly depends on the spread (and what the actual distribution looks like) of actual catch rates in any given bucket. That depends on a lot of things. For one thing, the "size" and accuracy of the locations and other characteristics which make up the buckets. For example, if the unique locations were pretty large, say, one "location bucket" is anywhere from down the third base line to 20 feet off the bag (about 1/7 of the total distance from line to line), then the spread of actual catch rates versus the average catch rate in that bucket is going to be huge. Therefore the difference between the true catch rates for caught balls and non-caught balls is going to be large as well.

Speed of the batted ball is important as well. On very hard hit balls, the distribution of actual catch rates within a certain location will tend to be polarized or “bi-modal.” Either the ball will tend to be hit near the fielder and he makes the play or a little bit away from the fielder and he doesn’t. In other words, a catch might have a 75% true catch rate and non-catch, 15%, on the average, even if the overall rate is 40%.
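Here is a small simulation sketch of that point. The Beta distributions are stand-ins I made up for the unknown spread of true catch rates within a 40% bucket; the takeaway is only that the wider (or more bimodal) the spread, the further apart the average true catch rates of the caught and non-caught balls fall.

```python
import random

def conditional_catch_rates(alpha, beta, n=200_000, seed=1):
    """Simulate a bucket whose true catch rates follow Beta(alpha, beta).
    Return (overall catch rate, mean true rate of caught balls, mean of non-caught balls)."""
    rng = random.Random(seed)
    caught, missed = [], []
    for _ in range(n):
        p = rng.betavariate(alpha, beta)   # true catch rate of this particular ball
        (caught if rng.random() < p else missed).append(p)
    return len(caught) / n, sum(caught) / len(caught), sum(missed) / len(missed)

# A narrow spread around 40% vs. a wide, nearly bimodal spread around 40%.
for a, b in [(8, 12), (0.8, 1.2)]:
    overall, p_caught, p_missed = conditional_catch_rates(a, b)
    print(f"Beta({a},{b}): overall {overall:.2f}, caught {p_caught:.2f}, not caught {p_missed:.2f}")
# Narrow spread: roughly .40 / .43 / .38.  Wide spread: roughly .40 / .60 / .27.
```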

Again, most metrics use the same base line catch rate for catches and non-catches because that seems like the correct and intuitive thing to do. It is incorrect! The problem, of course, is what number to assign to a catch and to a non-catch in any given bucket. How do we figure that out? Well, I haven’t gotten to that point yet, and I don’t think anyone else has either (I could be wrong). I do know, however, that it is guaranteed that if I use 39% for a non-catch and 41% for a catch, in that 40% bucket, I am going to be more accurate in my results, so why not do that? Probably 42/38 is better still. I just don’t know when to stop. I don’t want to go too far so that I end up cutting my own throat.

This is similar to the problem with park factors and MLE’s (among other “adjustments”). We don’t know that using 1.30 for Coors Field is correct but we surely know that using 1.05 is better than 1.00. We don’t know that taking 85% of player’s AAA stats to convert them to a major league equivalency is correct, but we definitely know that 95% is better than nothing.

Anyway, here is what I did today (other than torture myself by watching the Ringling Brothers and…I mean the Republican debates). I took a look at all ground balls that were hit in vector "C" according to BIS and were either caught or went through the infield in less than 1.5 seconds, basically hard hit balls down the third base line. If you watch these plays, even though I would put them in the same bucket in the UZR engine, it is clear that some are easy to field and others are nearly impossible. You would be surprised at how much variability there is. On paper they "look" almost exactly the same. In reality they can vary from day to night and everything in between. Again, we don't really care about the variance per se, but we definitely care about the mean catch rates when they are caught and when they are not.

Keep in mind that we can never empirically figure out those mean catch rates like we do when we aggregate all of the plays in the bucket (and then simply use the average catch rate of all of those balls). You can’t figure out the “catch rate” of a group of balls that were caught. It would be 100% right? We are interested in the catch rate of an average fielder when these balls were caught by these particular fielders, for whatever reasons they caught them. Likewise we want to know the league average catch rates of a group of balls that were not caught by these particular fielders for whatever reasons.

We can make these estimates (the catch rates of caught balls and non-caught balls in this bucket) in one of two ways: the first way is probably better and much less prone to human bias. It is also way more difficult to do in practice. We can try and observe all of the balls in this bucket and then try and re-classify them into many buckets according to the exact batted ball characteristics and fielder positioning. In other words, one bucket might be hard hit ground huggers right down the line with the third baseman playing roughly 8 feet off the line. Another might be, well, you get the point. Then we can actually use the catch rates in those sub-buckets.

When we are done, we can figure out the average catch rate on balls that were caught and those that were not, in the entire bucket. If that is hard to conceptualize, try constructing an example yourself and you will see how it works.
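For instance, here is a made-up two-sub-bucket example of that aggregation: an "easy" sub-bucket caught 70% of the time and a "hard" one caught 10% of the time, in equal numbers, which average out to the familiar 40% bucket.

```python
# Each tuple: (number of balls in the sub-bucket, true catch rate in that sub-bucket).
sub_buckets = [(100, 0.70), (100, 0.10)]

total = sum(n for n, _ in sub_buckets)
expected_catches = sum(n * p for n, p in sub_buckets)        # 80
expected_misses = sum(n * (1 - p) for n, p in sub_buckets)   # 120

overall_rate = expected_catches / total                      # 0.40
# Average true catch rate of the balls that end up caught vs. not caught:
rate_given_caught = sum(n * p * p for n, p in sub_buckets) / expected_catches
rate_given_missed = sum(n * (1 - p) * p for n, p in sub_buckets) / expected_misses

print(overall_rate, rate_given_caught, rate_given_missed)    # 0.40, 0.625, 0.25
```

So even though the bucket as a whole is a 40% bucket, the balls that get caught are, on average, 62.5% balls and the ones that don't are 25% balls.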

As I said, that is a lot of work. You have to watch a lot of plays and try and create lots and lots of sub-buckets. And then, even in the sub-buckets you will have the same situation, although much less problematic. For example, in one of those sub-buckets, a caught ball might be catchable 20% of the time in reality and a non-caught one only 15% – not much to worry about. In the large, original bucket, it might be 25% and 60%, as I said before. And that is a problem, especially for small samples.

Keep in mind that this problem will be mitigated in large samples, but it will never go away: it will always overrate a good performance and underrate a bad one. In small samples, like even one season, the distortion can be substantial. The better the numbers, the more they overstate the actual performance; the same is true for bad numbers. This is why I have been saying for years to regress what you see from UZR or DRS, even if you only want to estimate "what happened." (You would have to regress even more if you want to estimate true fielding talent.)

This is one of the problems with simply combining offense and defense to generate WAR. The defensive component needs to be regressed while the offensive one does not (base running needs to be regressed too. It suffers from the same malady as the defensive metrics).

Anyway, I looked at 20 or so plays in one particular bucket and tried to use the second method of estimating true catch rates for catches and non-catches. I simply observed the play and tried to estimate how often an average fielder would have made the play whether it was caught or not.

This is not nearly as easy as you might think. For one thing, guessing an average “catch rate” number like 60% or 70%, even if you’ve watched thousands of games in your life like I have, is incredibly difficult. The 0-10% and 90-100% ones are not that hard. Everything else is. I would guess that my uncertainty is something like 25% on a lot of plays, and my uncertainty on that estimate of uncertainty is also high!

The other problem is bias. When a play is made, you will overrate the true average catch rate (how often an average fielder would have made the play) and vice versa for plays that are not made. Or maybe you will underrate them because you are trying to compensate for the tendency to overrate them. Either way, you will be biased by whether the play was made or not, and remember you are trying to figure out the true catch rate on every play you observe with no regard to whether the play was made or not. (In actuality maybe whether it was made or not can help you with that assessment).

Here is a condensed version of the numbers I got. In that one location, presumably from the third base line to around 6 feet off the line, for ground balls that arrive in less than 1.5 seconds (I have 4 such categories of speed/time for GB), the average catch rate overall was 36%. However, for balls that were not caught (and I only looked at 6 random ones), I estimated the average catch rate to be 11% (that varied from 0 to 35%). For balls that were caught (also 6 of them), it was 53% (from 10% to 95%). That is a ridiculously large difference, and look at the variation even within those two groups (caught and not caught). Even though using 11% for non-catches and 53% for catches is better than using the overall 36% for everything, we are still making lots of mistakes within the new caught and not-caught buckets!

How does that affect a defensive metric? Let's look at a hypothetical example: Third baseman A makes 10 plays in that bucket and misses 20. Third baseman B makes 15 and misses 15. B clearly had a better performance, but how much better? Let's assume that the average fielder makes 26% of the plays in the bucket, that the true catch rate on the misses is 15%, and that on the catches it is 56% (actually a smaller spread than I estimated). Using 15% and 56% yields an overall catch rate of around 26%.

UZR and most of the other metrics will do the calculations this way: Player A’s UZR is 10 * .74 – 20 * .26, or plus 2.2 plays which is around plus 2 runs. Player B is 15 * .74 – 15 * .26, or plus 7.2 plays, which equals plus 6.5 runs.

What if we use the better numbers, 15% for missed plays and 56% for made ones? Now for Player A we have: 10 * .44 – 20 * .15, or 1.4 plays, which is 1.3 runs. Player B is 3.9 runs. So Player A's UZR for those 30 plays went from +2 to +1.3 and Player B went from +6.5 to +3.9. Each player regressed around 35-40% toward zero. That's a lot!
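Here is the same hypothetical comparison in one small script, so you can see both methods side by side (all numbers are the made-up ones above):

```python
RUNS_PER_PLAY = 0.9   # approximate hit-minus-out run value used in the text

def bucket_runs(made, missed, rate_for_catches, rate_for_misses):
    """Plays and runs above average, crediting catches and debiting misses."""
    plays = made * (1 - rate_for_catches) - missed * rate_for_misses
    return round(plays, 2), round(plays * RUNS_PER_PLAY, 1)

for name, made, missed in [("A", 10, 20), ("B", 15, 15)]:
    standard = bucket_runs(made, missed, 0.26, 0.26)   # one bucket-average rate for everything
    adjusted = bucket_runs(made, missed, 0.56, 0.15)   # conditional rates for catches and misses
    print(name, "standard:", standard, "adjusted:", adjusted)
# A  standard: (2.2, 2.0)   adjusted: (1.4, 1.3)
# B  standard: (7.2, 6.5)   adjusted: (4.35, 3.9)
```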

Now I have to figure out how to incorporate this “solution” to all of the UZR buckets in some kind of fairly elegant way, short of spending hundreds of hours observing plays. Any suggestions would be appreciated.

 

Note: I updated the pinch hitting data to include a larger sample (previously I went back to 2008. Now, 2000).

Note: It was pointed out by a commenter below and another one on Twitter that you can’t look only at innings where the #9 and #1 batters batted (eliminating innings where the #1 hitter led off), as Russell did in his study, and which he uses to support his theory (he says that it is the best evidence). That creates a huge bias, of course. It eliminates all PA in which the #9 hitter made the last out of an inning or at least an out was made while he was at the plate. In fact, the wOBA for a #9 hitter, who usually bats around .300, is .432 in innings where he and the #1 hitter bat (after eliminating so many PA in which an out was made). How that got past Russell, I have no idea.  Perhaps he can explain.

Recently, Baseball Prospectus published an article by one of their regular writers, Russell Carleton (aka Pizza Cutter), in which he examined whether the so-called “times through the order” penalty (TTOP) was in fact a function of how many times a pitcher has turned over the lineup in a game or whether it was merely an artifact of a pitcher’s pitch count. In other words, is it pitcher fatigue or batter familiarity (the more the batter sees the pitcher during the game, the better he performs) which causes this effect?

It is certainly possible that most or all of the TTOP is really due to fatigue, as "times through the order" is clearly a proxy for pitch count. In any case, after some mathematical gyrations that Mr. Carleton is wont to do (he is the "Warning: Gory Mathematical Details Ahead" guy) in his articles, he concludes unequivocally that there is no such thing as a TTOP – that it is really a PCP or Pitch Count Penalty effect that makes a pitcher less and less effective as he goes through the order and it has little or nothing to do with batter/pitcher familiarity. In fact, in the first line of his article, he declares, "There is no such thing as the 'times through the order' penalty!"

If that is true, this is a major revelation which has slipped through the cracks in the sabermetric community and its readership. I don’t believe it is, however.

As one of the primary researchers (along with Tom Tango) of the TTOP, I was taken quite aback by Russell’s conclusion, not because I was personally affronted (the “truth” is not a matter of opinion), but because my research suggested that pitch count or fatigue was likely not a significant part of the penalty. In my BP article on the TTOP a little over 2 years ago, I wrote this: “…the TTOP is not about fatigue. It is about familiarity. The more a batter sees a pitcher’s delivery and repertoire, the more likely he is to be successful against him.” What was my evidence?

First, I looked at the number of pitches thrown going into the second, third, and fourth times through the order. I split that up into two groups—a low pitch count and a high pitch count. Here are those results. The numbers in parentheses are the average number of pitches thrown going into that “time through the order.”

Times Through the Order    Low Pitch Count    High Pitch Count
1                          .341               .340
2                          .351 (28)          .349 (37)
3                          .359 (59)          .359 (72)
4                          .361 (78)          .360 (97)

 

If Russell’s thesis were true, you should see a little more of a penalty in the “high pitch count” column on the right, which you don’t. The penalty appears to be the same regardless of whether the pitcher has thrown few or many pitches. To be fair, the difference in pitch count between the two groups is not large and there is obviously sample error in the numbers.

The second way I examined the question was this: I looked only at individual batters in each group who had seen few or many pitches in their prior PA. For example, I looked at batters in their second time through the order who had seen fewer than three pitches in their first PA, and also batters who saw more than four pitches in their first PA. Those were my two groups. I did the same thing for each time through the order. Here are those results. The numbers in parentheses are the average number of pitches seen in the prior PA, for every batter in the group combined.

 

Times Through the Order    Low Pitch Count each Batter    High Pitch Count each Batter
1                          .340                           .340
2                          .350 (1.9)                     .365 (4.3)
3                          .359 (2.2)                     .361 (4.3)

 

As you can see, if a batter sees more pitches in his first or second PA, he performs better in his next PA than if he sees fewer pitches. The effect appears to be much greater from the first to the second PA. This lends credence to the theory of “familiarity” and not pitcher fatigue. It is unlikely that 2 or 3 extra pitches would cause enough fatigue to elevate a batter’s wOBA by 8.5 points per PA (the average of 15 and 2, the “bonuses” for seeing more pitches during the first and second PA, respectively).

So how did Russell come to his conclusion and is it right or wrong? I believe he made a fatal flaw in his methodology which led him to a faulty conclusion (that the TTOP does not exist).

Among other statistical tests, here is the primary one which led Russell to conclude that the TTOP is a mirage and merely a product of pitcher fatigue due to an ever-increasing pitch count:

This time, I tried something a little different. If we’re going to see a TTOP that is drastic, the place to look for it is as the lineup turns over. I isolated all cases in which a pitcher was facing the ninth batter in the lineup for the second time and then the first batter in the lineup for the third time. To make things fair, neither hitter was allowed to be the pitcher (this essentially limited the sample to games in AL parks), and the hitters needed to be faced in the same inning. Now, because the leadoff hitter is usually a better hitter, we need to control for that. I created a control variable for all outcomes using the log odds ratio method, which controls for the skills of the batter, as well as that of the pitcher. I also controlled for whether or not the pitcher had the platoon advantage in either case.

First of all, there was no reason to limit the data to "the same inning." Regardless of whether the pitcher faces the #9 and #1 batters in the same inning or they are split up (e.g., the #9 hitter makes the last out of an inning), since one naturally follows the other, they will always have around the same pitch count, and the leadoff hitter will always be one time through the order ahead of the number nine hitter.

Anyway, what did Russell find? He found that TTOP was not a predictor of outcome. In other words, that the effect on the #9 hitter was the same as the #1 hitter, even though the #1 hitter had faced the pitcher one more time than the #9 hitter.

I thought about this for a long time and I finally realized why that would be the case even if there was a "times through the order" penalty (mostly) independent of pitch count. Remember that in order to compare the effect of TTO on that #9 and #1 hitter, he had to control for the overall quality of the hitter. The last hitter in the lineup is going to be a much worse hitter overall than the leadoff hitter, on the average, in his sample.

So the results should look something like this if there were a true TTOP: Say the #9 batters are normally .300 wOBA batters, and the leadoff guys are .330. In this situation, the #9 batters should bat around .300 (during the second time through the order we see around a normal wOBA) but the leadoff guys should bat around .340 – they should have a 10 point wOBA bonus for facing the pitcher for the third time.

Russell, without showing us the data (he should!), presumably gets something like .305 for the #9 batters (since the pitcher has gone essentially 2 ½ times through the lineup, pitch count-wise) and the leadoff hitters should hit .335, or 5 points above their norm as well (maybe .336 since they are facing a pitcher with a few more pitches under his belt than the #9 hitter).

So if he gets those numbers, .335 and .305, is that evidence that there is no TTOP? Do we need to see numbers like .340 and .300 to support the TTOP theory rather than the PCP theory? I submit that even if Russell sees numbers like the former ones, that is not evidence that there is no TTOP and it’s all about the pitch count. I believe that Russell made a fatal error.

Here is where he went wrong:

Remember that he uses the log odds ratio method to compute the baseline numbers, or what he would expect from a given batter-pitcher matchup, based on their overall season numbers. In this experiment, there is no need to do that, since both batters, #1 and #9, are facing the same pitcher the same number of times. All he has to do is use each batter's seasonal numbers to establish the base line.
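For reference, the odds ratio baseline he is referring to works roughly like this. This is a generic sketch of the odds ratio matchup method, not Russell's actual code, and the sample rates are invented:

```python
def odds(p):
    return p / (1 - p)

def expected_rate(batter_rate, pitcher_rate, league_rate):
    """Generic odds-ratio matchup baseline: combine batter skill and pitcher skill
    relative to the league, then convert the combined odds back to a rate."""
    exp_odds = odds(batter_rate) * odds(pitcher_rate) / odds(league_rate)
    return exp_odds / (1 + exp_odds)

# e.g. a batter with a .340 event rate vs. a pitcher allowing .310 in a .320 league
print(round(expected_rate(0.340, 0.310, 0.320), 3))   # ≈ 0.330
```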

But where do those base lines come from? Well, it is likely that the #1 hitters are mostly #1 hitters throughout the season and that #9 hitters usually hit at the bottom of the order. #1 hitters get around 150 more PA than #9 hitters over a full season. Where do those extra PA come from? Some of them come from relievers of course. But many of them come from facing the starting pitcher more often per game than those bottom-of-the-order guys. In addition, #9 hitters sometimes are removed for pinch hitters late in a game against a starter such that they lose even more of those 3rd and 4th time through the order PA’s. Here is a chart of the mean TTO per game versus the starting pitcher for each batting slot:

 

Batting Slot    Mean TTO/game
1               2.15
2               2.08
3               2.02
4               1.98
5               1.95
6               1.91
7               1.86
8               1.80
9               1.77

(By the way, if Russell’s thesis is true, bottom of the order guys have it even easier, since they are always batting when the pitcher has a higher pitch count, per time through the order. Also, this is the first time you have been introduced to the concept that the top of the order batters have it a little easier than the bottom of the order guys, and that switching spots in the order can affect overall performance because of the TTOP or PCP.)

What that does is result in the baseline for the #1 hitter being higher than for the #9 hitter, because the baseline includes more pitcher TTOP (more times facing the starter for the 3rd and 4th times). That makes it look like the #1 hitter is not getting his advantage as compared to the #9 hitter, or at least he is only getting a partial advantage in Russell’s experiment.

In other words, the #9 hitter is really a true .305 hitter and the #1 hitter is really a true .325 hitter, even though their seasonal stats suggest .300 and .330. The #9 hitters are being hurt by not facing starters late in the game compared to the average hitter and the #1 hitters are being helped by facing starters for the 3rd and 4th times more often than the average hitter.

So if #9 hitters are really .305 hitters, then the second time through the order, we expect them to hit .305, if the TTOP is true. If the #1 hitters are really .325 hitters, despite hitting .330 for the whole season, we expect them to hit .335 the third time through the order, if the TTOP is true. And that is exactly what we see (presumably).

But when Russell sees .305 and .335 he concludes, “no TTOP!” He sees what he thinks is a true .300 hitter hitting .305 after the pitcher has thrown around 65 pitches and what he thinks is a true .330 hitter hitting .335 after 68 or 69 pitches. He therefore concludes that both hitters are being affected equally even though one is batting for the second time and the other for the third time – thus, there is no TTOP!
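To put numbers to the argument in the last few paragraphs, here is a tiny worked sketch using the hypothetical figures above (a roughly 10-point-per-TTO penalty and the .300/.330 seasonal lines):

```python
ttop_per_tto = 0.010   # assumed ~10 points of wOBA per extra time through the order

# Seasonal lines already bake in how often each slot reaches the 3rd/4th time
# through the order, so the "true" baselines differ from the seasonal lines:
nine_seasonal, nine_true = 0.300, 0.305   # #9 hitters lose late-TTO PAs vs. starters
one_seasonal, one_true   = 0.330, 0.325   # #1 hitters gain them

# What we expect to observe in Russell's sample if the TTOP is real:
nine_observed = nine_true                    # 2nd time through: roughly his baseline
one_observed  = one_true + ttop_per_tto      # 3rd time through: gets the TTO bonus

print(round(nine_observed, 3), round(one_observed, 3))              # 0.305 0.335
print(round(nine_observed - nine_seasonal, 3),
      round(one_observed - one_seasonal, 3))                        # 0.005 0.005
# Measured against the seasonal lines, both hitters look like a flat +5 points,
# which is exactly the "no TTOP" pattern, even though a real TTOP is baked in.
```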

As I have shown, those numbers are perfectly consistent with a TTOP of around 8-10 points per times through the order, which is exactly what we see.

Finally, I ran one other test which I think can give us more evidence one way or another. I looked at pinch hitting appearances against starting pitchers. If the TTOP is real and pitch count is not a significant factor in the penalty, we should see around the same performance for pinch hitters regardless of the pitcher’s pitch count, since the pinch hitter always faces the pitcher for the first time and the first time only. In fact, this is a test that Russell probably should have run. The only problem is sample size. Because there are relatively few pinch hitting PA versus starting pitchers, we have quite a bit of sample error in the numbers. I split the sample of pinch hitting appearances up into 2 groups: Low pitch count and high pitch count.

 

Here is what I got:

PH TTO    Overall            Low Pitch Count    High Pitch Count
2         .295 (PA=4901)     .295 (PA=2494)     .293 (PA=2318)
3         .289 (PA=10774)    .290 (PA=5370)     .287 (PA=5404)

 

I won’t comment on the fact that the pinch hitters performed a little better against pitchers with a low pitch count (the differences are not nearly statistically significant) other than to say that there is no evidence that pitch count has any influence on the performance of pinch hitters who are naturally facing pitchers for the first and only time. Keep in mind that the times through the order (the left column) is a good proxy for pitch count in and of itself and we also see no evidence that that makes a difference in terms of pinch hitting performance. In other words, if pitch count significantly influenced pitching effectiveness, we should see pinch hitters overall performing better when the pitcher is in the midst of his 3rd time through the order as opposed to the 2nd time (his pitch count would be around 30-35 pitches higher). We don’t. In fact, we see a worse performance (the difference is not statistically significant – one SD is 8 points of wOBA).

 

I have to say that it is difficult to follow Russell's chain of logic and his methodology in many of his articles because he often fails to "show his work" and he relies only on somewhat esoteric and opaque statistical techniques. In this case, I believe that he made a fatal mistake in his methodology, as I have described above, which led him to the erroneous conclusion that "the TTOP does not exist." I believe that I have shown fairly strong evidence that the penalty that we see pitchers incur as the game wears on is mostly or wholly a result of the TTO and not due to fatigue caused by an increasing pitch count.

I look forward to someone doing additional research to support one theory or the other.

There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That's ridiculous of course. The outcome of a decision or of the game has nothing to do with whether the decision was correct or not. Some decisions raise or lower a team's chances of winning from a baseline of around 90%, and other decisions move a baseline of 10 or 15%.

If decision A results in a team's theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious: if the manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the "unwritten rule" we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second-guess the manager whether he chooses A or B. All of that is silly and counter-productive.

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well-above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts, viewing disputed facts and legal theories in the light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must grant the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I‘ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We'll assume he is roughly an average-hitting pitcher (I have heard that he is a poor-hitting pitcher). An average pitcher's batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..

63.2%.

If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?

Whirrrrr……

62.1%.

So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.
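For what it's worth, here is a sketch of how a batting line gets folded into a single WE number. The per-outcome WE values below are placeholders I invented purely to show the mechanism; the 63.2% and 68.8% figures in the text come from an actual WE model, not from these numbers.

```python
# Placeholder WE values for each outcome by the batter in this exact game state
# (top 6th, up 1, runners on 2nd and 3rd, 2 outs). These are illustrative only.
outcome_we = {"out": 0.595, "bb": 0.645, "1b": 0.770, "2b3b": 0.800, "hr": 0.860}

# Roughly average-hitting pitcher's batting line from the text.
pitcher_line = {"1b": 0.100, "2b3b": 0.020, "hr": 0.003, "bb": 0.040, "out": 0.837}

we_pitcher_hitting = sum(p * outcome_we[evt] for evt, p in pitcher_line.items())
print(round(we_pitcher_hitting, 3))   # a weighted average a few points above the "sure out" WE
# A league-average hitter's line shifts weight toward the higher-WE outcomes,
# which is where the pinch-hitting advantage in the text comes from.
```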

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest, it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let's assume 1.5 innings, which I think is generous.

What is DeGrom's expected RA9 for those innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom's normal expected RA9? Steamer, ZIPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter, an elite starter, but not quite at the level of the Kershaws and Arrietas, or even the Prices and Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations and all managers can do in making these decisions is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 in 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 and convert that into WE. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
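Here is that conversion in one place, so the arithmetic is easy to follow. The runs-per-win divisor of roughly 9.5 to 10 is my assumption; the RA9 figures, innings, and leverage are the estimates from the text.

```python
def we_gain_from_pitcher(ra9_starter, ra9_reliever, innings, leverage, runs_per_win=10.0):
    """Win expectancy gained by using the lower-RA9 pitcher for `innings` innings
    at the given average leverage index."""
    runs_saved = (ra9_reliever - ra9_starter) / 9 * innings
    return runs_saved * leverage / runs_per_win

# DeGrom, 3rd time through (3.7 RA9), vs. an average reliever (4.0), 1.5 IP, LI 1.7
print(round(we_gain_from_pitcher(3.7, 4.0, 1.5, 1.7), 4))   # ≈ 0.0085
# Same comparison over a full 2 innings
print(round(we_gain_from_pitcher(3.7, 4.0, 2.0, 1.7), 4))   # ≈ 0.011
# A Kershaw-level starter after the TTOP (2.9 RA9) vs. an average reliever
print(round(we_gain_from_pitcher(2.9, 4.0, 1.5, 1.7), 4))   # ≈ 0.031
```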

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of over 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.7% rather than 5.05%. 4.7% is still an enormous advantage. Reducing your team‘s chances of winning by 4.7% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What about if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference in 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with was a replacement level pitcher – the worst pitcher in the major leagues? He allows around 113% of league average runs, or 4.6 RA9. The difference between DeGrom and him for 1.5 innings? 2.7%, for a net loss of 3.1% by letting DeGrom hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, "Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead, you will increase your chances of winning by 3.1%," what do you think he would say?

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the best possible light of allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

As an addendum to my article on platoon splits from a few days ago, I want to give you a simple trick for answering a question about a player, such as, "Given that a player performs X in time period T, what is the average performance we can expect in the future (or the present, which is essentially the same thing, or at least a subset of it)?" I also want to illustrate the folly of using unusual single-season splits to project the future.

The trick is to identify as many players as you can in some period of time in the past (the more, the better, but sometimes the era matters so you often want to restrict your data to more recent years) that conform to the player in question in relevant ways, and then see how they do in the future. That always answers your question as best as it can. The certainty of your answer depends upon the sample size of the historical performance of similar players. That is why it is important to use as many players and as many years as possible, without causing problems by going too far back in time.

For example, say you have a player whom you know nothing about other than that he hit .230 in one season of 300 AB. What do you expect that he will hit next year? Easy to answer. There are thousands of players who have done that in the past. You can look at all of them and see what their collective BA was in their next season. That gives you your answer. There are other more mathematically rigorous ways to arrive at the same answer, but much of the time the “historical similar player method” will yield a more accurate answer, especially when you have a large sample to work with, because it captures all the things that your mathematical model may not. It is real life! You can’t do much better than that!

You can of course refine your “similar players” comparative database if you have more information about the player in question. He is left-handed? Use only left-handers in your comparison. He is 25? Use only 25-year-olds. What if you have so much information about the player in question that your “comp pool” starts to be too small to have a meaningful sample size (which only means that the certainty of your answer decreases, not necessarily the accuracy)? Let’s say that he is 25, left-handed, 5’10” and 170 pounds, he hit .273 in 300 AB, and you want to include all of these things in your comparison. That obviously will not apply to too many players in the past. Your sample size of “comps” will be small. In that case, you can use players between the ages of 24 and 26, between 5’9” and 5’11”, who weigh between 160 and 180, and who hit .265-.283 in 200 to 400 AB. It doesn’t have to be those exact numbers, but as long as you are not biasing your sample compared to the player in question, you should arrive at an accurate answer to your question.
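If you like to see these things in code, here is a minimal sketch of that comp-pool query. The pandas table, the column names (age, bats, height_in, weight_lb, ab, ba, next_ab, next_ba) and the toy data are all hypothetical placeholders, not a real database – this is only to show the mechanics.

```python
# A minimal sketch of the "similar players" comp-pool method, with made-up data.
import pandas as pd

seasons = pd.DataFrame({
    "age":       [25, 24, 26, 25, 30],
    "bats":      ["L", "L", "L", "L", "R"],
    "height_in": [70, 69, 71, 70, 74],
    "weight_lb": [170, 165, 178, 172, 210],
    "ab":        [310, 250, 390, 301, 520],
    "ba":        [.273, .268, .281, .275, .301],
    "next_ab":   [280, 300, 410, 320, 500],
    "next_ba":   [.259, .266, .270, .262, .288],
})

# Keep every historical season that matches the player in question on the
# relevant criteria, using ranges rather than exact values so the comp pool
# stays reasonably large.
mask = (
    seasons.age.between(24, 26)
    & (seasons.bats == "L")
    & seasons.height_in.between(69, 71)
    & seasons.weight_lb.between(160, 180)
    & seasons.ab.between(200, 400)
    & seasons.ba.between(0.265, 0.283)
)
comps = seasons[mask]

# The collective next-season BA of the comp pool (here weighted by year-1 AB,
# as mentioned in the text) is the answer to "what do we expect next year?"
expected_ba = (comps.next_ba * comps.ab).sum() / comps.ab.sum()
print(f"{len(comps)} comps, expected BA next year: {expected_ba:.3f}")
```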

What if we do that with a .230 player in 300 AB? I’ll use .220 to .240 and between 200 and 400 AB. We know intuitively that we have to regress the .230 towards the league average around 60 or 65%, which will yield around .245 as our answer. But we can do better using actual players and actual data. Of course our answer depends on the league average BA for our player in question and the league average BA for the historical data. Realistically, we would probably use something like BA+ (BA as compared to league-average batting average) to arrive at our answer. Let’s try it without that. I looked at all players who batted in that range from 2010-2014 in 200-400 AB and recorded their collective BA the next year. If I wanted to be a little more accurate (for this question it is probably not necessary), I might weight the results in year 2 by the AB in year 1, or use the delta method, or something like that.

If I do that for just 5 years of year 1 seasons, 2010-2014 (with the year 2 results running through 2015), I get 49 players who hit a collective .230 in year 1 in an average of 302 AB. The next year, they hit a collective .245, around what we would expect. That answers our question, “What would a .230 hitter in 300 AB hit next year, assuming he were allowed to play again (we don’t know from the historical data what players who were not allowed to play would hit)?”

What about .300 in 400 AB? I looked at all players from .280 to .350 in year 1 and between 300 and 450 AB. They hit a collective .299 in year 1 and .270 in year 2. Again, that answers the question, “What do we expect Player A to hit next year if he hit .300 this year in around 400 AB?”

For Siegrest with the -47 reverse split, we can use the same method to answer the question, “What do we expect his platoon split to be in the future given 230 TBF versus lefties in the past?” That is such an unusual split that we might have to tweak the criteria a little and then extrapolate. Remember that asking the question, “What do we expect Player A to do in the future?” is almost exactly the same thing as asking, “What is his true talent with respect to this metric?”

I am going to look at only one season for pitchers with around 200 BF versus lefties even though Siegrest’s 230 TBF versus lefties was over several seasons. It should not make much difference as the key is the number of lefty batters faced. I included all left-handed pitchers with at least 150 TBF versus LHB who had a reverse wOBA platoon difference of more than 10 points and pitched again the next year. Let’s see how they do, collectively, in the next year.

There were 76 such pitchers from 2003-2014. They had a collective platoon differential of -39 points, less than Siegrest’s -47 points, in an average of 194 TBF versus LHB, also less than Siegrest’s 231. But we should be in the ballpark with respect to estimating Siegrest’s true splits using this “in vivo” method. How did they do in the next year, which is a good proxy (an unbiased estimate) for their true splits?

In year 2, they had an average TBF versus lefties of 161, a little less than the previous year, which is to be expected, and their collective platoon splits were plus 8.1 points. So they went from -39 to plus 8.1 from one season to the next, because one season of reverse splits is mostly a fluke, as I explained in my previous article on platoon splits. 21 points is around the average for LHP with > 150 TBF v. lefties in this time period, so these pitchers moved 47 points from year 1 to year 2, out of a total of 60 points from year 1 to league average. That is a 78% regression toward the mean, around what we estimated Siegrest’s regression should be (I think it was 82%). That suggests that our mathematical model is good, since it produces around the same result as our “real live players” method.
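For the record, the 78% figure is just the fraction of the gap between the year 1 split and the league mean that got closed by year 2. A tiny sketch with the numbers quoted above:

```python
# How much of the gap between the year-1 split and the league-average split
# was closed by year 2? Numbers are the ones quoted in the text.
year1_split = -39.0   # collective year-1 platoon split (wOBA points)
year2_split = 8.1     # collective year-2 platoon split
league_mean = 21.0    # average split for LHP with >150 TBF v. lefties in this sample

regression = (year2_split - year1_split) / (league_mean - year1_split)
print(f"regression toward the mean: {regression:.0%}")   # ~78%
```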

How much would it take to estimate a true reverse split for a lefty? Let’s look at some more numbers. I’ll raise the bar to lefty pitchers with at least a 20 point reverse split. There were only 57 in those 12 years of data. They had a collective split in year 1 of -47, just like Siegrest, in an average of 191 TBF v. LHB. How did they do in year 2, which is the answer to our question of their true split? Plus 6.4 points. That is a 78% regression, the same as before.

What about pitchers with at least a 25 point reverse split? They averaged -51 points in year 1. Can we get them to a true reverse split?  Nope. Not even close.

What if we raise the sample size bar? I’ll do at least 175 TBF and -15 reverse split in year 1. Only 45 lefty pitchers fit this bill and they had a -43 point split in year 1 in 209 TBF v. lefties. Next year? Plus 2.8 points! Close but no cigar. There is of course an error bar around only 45 pitchers with 170 TBF v. lefties in year 2, but we’ll take those numbers on faith since that’s what we got. That is a 72% regression with 208 TBF v. lefties, which is about what we would expect given that we have a slightly larger sample size than before.

So please, please, please, when you see or hear of a pitcher with severe reverse splits in 200 or so BF versus lefties, which is around a full year for a starting pitcher or 2 or 3 years for a reliever, remember that our best estimate of his true platoon splits – what his manager should expect when he sends him out there – is very, very different from what those actual one- or three-year splits suggest when those actual splits are very far away from the norm. Most of that unusual split, in either direction – almost all of it, in fact – is likely a fluke. And when we say “likely,” we mean that we must assume that it is a fluke, and that the true number is the weighted mean of all the possibilities – which is what those year 2 numbers show, or year 1 (or multiple years) heavily regressed toward the league average.

 

With all the hullaballoo about Utley’s slide last night and the umpires’ calls or non-calls, including the one or ones in NY (whose names, addresses, telephone numbers, and social security numbers should be posted on the internet, according to Pedro Martinez), what was lost – or at least there was much confusion – was a discussion of the specific rule(s) that applies to that exact situation – the take-out slide that is, not whether Utley was safe or not on replay. For that you need to download the 2015 complete rule book, I guess. If you Google certain rule numbers, it takes you to the MLB “official rules” portion of their website in which at least some of the rule numbers appear to be completely different than in the actual current rule book.

In any case, last night after a flurry of tweets, Rob Neyer, from Fox Sports, pointed out the clearly applicable rule (although other rules come close): It is 5.09 (a) (13) in the PDF version of the current rulebook. It reads, in full:

The batter is out when… “A preceding runner shall, in the umpire’s judgment, intentionally interfere with a fielder who is attempting to catch a thrown ball or to throw a ball in an attempt to complete any play;”

That rule is unambiguous and crystal clear. 1) Umpire, in his judgment, determines that runner intentionally interferes with the pivot man. 2) The batter must be called out.

By the way, the runner himself may or may not be out. This rule does not address that. There is a somewhat common misperception that the umpire calls both players out according to this rule. Another rule might require the umpire to call the runner also out on interference even if he arrived before the ball/fielder or the fielder missed the bag – but that’s another story.

Keep in mind that if you ask the umpire, “Excuse me, Mr. umpire, but in your judgment, did you think that the runner intentionally interfered with the fielder,” and his answer is, “Yes,” then he must call the batter out. There is no more judgment. The only judgment allowed in this rule is whether the runner intentionally interfered or not. If the rule had said, “The runner may be called out,” then there would be two levels of judgment, presumably. There are other rules which explicitly say the umpire may do certain things, in which case there is presumably some judgment that goes into whether he decides to do them or not. Sometimes those rules provide guidelines for that judgment (the may part) and sometimes they do not. Anyway, this rule does not provide that may judgment. If the umpire thinks it is intentional interference, the batter (not the runner) is automatically out.

So clearly the umpire should have called the batter out on that play, unless he could say with a straight face, “In my judgment, I don’t think that Utley intentionally interfered with the fielder.” That is not a reasonable judgment of course. Not that there is much recourse for poor or even terrible judgment. Judgment calls are not reviewable, I don’t think. Perhaps umpires can get together and overturn a poor judgment call. I don’t know.

But that’s not the end of the story. There is a comment to this rule which reads:

“Rule 5.09(a)(13) Comment (Rule 6.05(m) Comment): The objective of this rule is to penalize the offensive team for deliberate, unwarranted, unsportsmanlike action by the runner in leaving the baseline for the obvious purpose of crashing the pivot man on a double play, rather than trying to reach the base. Obviously this is an umpire’s judgment play.”

Now that throws a monkey wrench into this situation. Apparently this is where the “runner must be so far away from the base that he cannot touch it in order for the ‘automatic double play’ to be called” rule came from (I always thought it was an unwritten rule). Only it’s not a rule. It is a comment which clearly adds a wrinkle to the rule.

The rule is unambiguous. If the runner interferes with the fielder trying to make the play (whether he would have completed the DP or not), then the batter is out. There is no mention of where the runner has to be or not be. The comment changes the rule. It adds another requirement (and another level of judgment). The runner must have been “outside the baseline” in the umpire’s judgment. In addition, it adds some vague requirements about the action of the runner. The original rule says only that the runner must “intentionally interfere” with the fielder. The comment adds words that require the runner’s actions to be more egregious – deliberate, unwarranted, and unsportsmanlike.

But the comment doesn’t really require that to be the case for the umpire to call the batter out. I don’t think. It says, “The objective of this rule is to penalize the offensive team….” I guess if the comment is meant to clarify the rule, MLB really doesn’t want the umpire to call the batter out unless the requirements in the comment are met (runner out of the baseline and his action was not only intentional but deliberate, unwarranted, and unsportsmanlike, a higher bar than just intentional).

Of course the rule doesn’t need clarification. It is crystal clear. If MLB wanted to make sure that the runner is outside of the baseline and acts more egregiously than just intentionally, then they should change the rule, right? Especially if comments are not binding, which I presume they are not.

Also, the comment starts off with: “The objective of this rule is to…”

Does that mean that this rule is only to be applied in double play situations? What if a fielder at second base fields a ball, starts to throw to first base to retire the batter, and the runner tackles him or steps in front of the ball? Is rule 5.09(a)(13) meant to apply? The comment says that the objective of the rule is to penalize the offensive team for trying to break up the double play. In this hypothetical, there is no double play being attempted. There has to be some rule that applies to this situation, right? If there isn’t, then MLB should not have written in the comment, “The objective of this rule….”

There is another rule which also appears to clearly apply to a take-out slide at second base, like Utley’s, with no added comments requiring that the runner be out of the baseline, or that his actions be unwarranted and unsportsmanlike. It is 6.01(6). Or 7.09(e) on the MLB web site. In fact, I tweeted this rule last night thinking that it addressed the Utley play 100% and that the runner and the batter should have been called out.

“If, in the judgment of the umpire, a base runner willfully and deliberately interferes with a batted ball or a fielder in the act of fielding a batted ball with the obvious intent to break up a double play, the ball is dead. The umpire shall call the runner out for interference and also call out the batter-runner because of the action of his teammate.”

The only problem there are the words, “interferes with a batted ball or a fielder in the act of fielding a batted ball.” A lawyer would say that the plain meaning of the words precludes this from applying to an attempt to interfere with a middle infielder tagging second base and throwing to first, because he is not fielding or attempting to field a batted ball and the runner is not interfering with a batted ball. The runner, in this case, is interfering with a thrown ball or a fielder attempting to tag second and then make a throw to first.

So if this rule is not meant to apply to a take-out slide at second, what is it meant to apply to? That would leave only one thing really. A ground ball is hit in the vicinity of the runner and he interferes with the ball or a fielder trying to field the ball. But there also must be “an obvious intent to break up a double play.” That is curious wording. Would a reasonable person consider that an attempt to break up a double play? Perhaps “obvious intent to prevent a double play” would have been better. Using the words “break up” sure sounds like this rule is meant to apply to a runner trying to take out the pivot man on a potential double play. But then why write “fielding a batted ball” rather than “making a play or a throw”?

A good lawyer working for the Mets would try and make the case that “fielding a batted ball” includes everything that happens after someone actually “fields the batted ball,” including catching and throwing it. In order to do so, he would probably need to find that kind of definition somewhere else in the rule book. It is a stretch, but it is not unreasonable, I don’t think.

Finally, Eric Byrnes, on MLB Tonight, had one of the more intelligent and reasonable comments regarding this play that I have ever heard from an ex-player. He said, and I paraphrase:

“Of course it was a dirty slide. But all players are taught to do whatever it takes to break up the DP, especially in a post-season game. Until umpires start calling an automatic double play on slides like that, aggressive players like Utley will continue to do that. I think we’ll see a change soon.”

P.S. For the record, since there was judgment involved, and judgment is supposed to represent fairness and common sense, I think that Utley should not have been ruled safe at second on appeal.

Postscript:

Perhaps comments are binding. From the foreword to the rules, on the MLB web site:

The Playing Rules Committee, at its December 1977 meeting, voted to incorporate the Notes/Case Book/Comments section directly into the Official Baseball Rules at the appropriate places. Basically, the Case Book interprets or elaborates on the basic rules and in essence have the same effect as rules when applied to particular sections for which they are intended.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits, which is to be expected, as platoon splits in general get heavily regressed toward the mean because they tend to be laden with noise, for two reasons: One, the samples are rarely large, because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, less than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty, and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his true platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large), simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.

Soler seems to be the obvious choice,  so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrest has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrest 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrest at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrest against a right handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrest may or may not be your ordinary left-handed pitcher. What if Siegrest actually has reverse splits? What if we expect him to pitch better against right-handed batters and worse against left-handed batters? In that case, Coghlan might actually be the better choice than Soler, even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand. It would be exactly as if Coghlan or Soler were facing a RHP. Or maybe Siegrest has no splits – i.e., RH and LH batters of equal overall talent perform about the same against him. Or very small platoon splits compared to the average left-hander? So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrest no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrest does indeed have very small splits, either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrest has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrest is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher’s true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that’s what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against is a worthless stat for a pitcher in and of itself (what, walks don’t count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one-season splits for a batter or pitcher are not very useful, for the reasons I explained above.

If you look at Siegrest’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrest is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrest to pitch to Coghlan knowing that Siegrest has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrest is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrest’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrest has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrest’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different than the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 

Coins

Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up heads or tails on any one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times each, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.
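If you want to see where the 5% comes from, it is just sqrt(.5 * .5 / 100), the binomial standard deviation of a proportion. A quick sketch, with a simulation thrown in as a sanity check:

```python
import math
import random

n, p = 100, 0.5
sd = math.sqrt(p * (1 - p) / n)          # binomial SD of the heads percentage
print(f"theoretical SD of heads% in {n} flips: {sd:.1%}")   # 5.0%

# Simulate a pile of fair coins, 100 flips each, and look at the spread.
random.seed(1)
results = [sum(random.random() < p for _ in range(n)) / n for _ in range(10_000)]
mean = sum(results) / len(results)
sim_sd = math.sqrt(sum((x - mean) ** 2 for x in results) / len(results))
print(f"simulated SD: {sim_sd:.1%}")                        # about 5%
```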

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers platoon stats and no matter what they are and over how many TBF, you conclude that the pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can’t know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let’s get back to the coins to see why that is. Let’s say that we don’t have 100% fair coins. Our sly kid put in his piggy bank a bunch of trick coins, but not really, really tricky. Most are still 50/50, but some are 48/52, 52/48, a few less are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of “true coin talent” but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that.  If your kid were smart enough to put in a normal distribution of “coin talent,” even one with a small spread, the further away from 50/50, the fewer coins there are.  Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60.  Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It’s called Bayesian inference, but you don’t need to worry about what it’s called or even how it technically works. It is true that a 60/40 heads result is much more likely to come from a true 60/40 coin than from a true 50/50 coin. That should be obvious too. But here’s the catch. There are many, many more 50/50 coins in your kid’s piggy bank than there are 60/40 coins. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chance of a 50/50 coin coming up exactly 60/40 is around .01. The chance of a true 60/40 coin coming up 60/40 is 8 times that amount, or .08. But if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then the chances of your 60-heads coin being a fair coin or a 60/40 biased coin are only 50/50. If there are 800 times more 50/50 coins than 60/40 coins in your bank, as there are likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin, even though the coin came up 60 heads in 100 flips.
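Here is that arithmetic spelled out, as a minimal sketch that assumes just two kinds of coins in the bank (true 50/50 and true 60/40) and the prior ratios mentioned above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with heads prob p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood of observing exactly 60 heads in 100 flips under each hypothesis.
like_fair  = binom_pmf(60, 100, 0.5)   # ~0.011
like_60_40 = binom_pmf(60, 100, 0.6)   # ~0.081, roughly 8 times the fair-coin value

for prior_ratio in (8, 800):           # fair coins per 60/40 coin in the bank
    posterior_odds_fair = prior_ratio * like_fair / like_60_40
    print(f"{prior_ratio} fair coins per 60/40 coin -> "
          f"odds the coin is actually fair: about {posterior_odds_fair:.0f} to 1")
```

With 8 fair coins per trick coin the odds come out roughly even, and with 800 per trick coin they come out around 100 to 1 in favor of the fair coin, matching the numbers in the text.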

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but 10,000 of them (1 in 100) will have a false positive. So the odds you have AIDS is 10,000 to 10 or 1000 to 1 against.

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its “true talent” may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as “an estimate of true talent” and is the crucial factor in making decisions based on what we think the talent of the coins/players is likely to be in the present and in the future.

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. The chances of that happening with a 60/40 coin is 24,000 times more likely than if the coin were 50/50! So now we are much more certain that we have a true 60/40 coin even if we don’t have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about ½ certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be 55%.

There is a much easier way to do it. We have to do some math gyrations which I won’t go into that will enable us to figure out how much to regress our observed flip percentage to the mean flip percentage of all the coins, 50%. For 100 flips it was a large regression such that with a 60/40 result we might estimate a true flip talent of 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less towards 50% to give us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inferences using all the possible true talent states assumes a normal distribution or close to one.
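Here is roughly what those math gyrations boil down to, as a sketch. The regression amount is the share of the observed variance that is just random binomial noise, given an assumed spread of true coin talent (the 3% SD from the example above):

```python
import math

def regressed_estimate(observed, n, pop_mean=0.50, talent_sd=0.03):
    """Shrink an observed heads rate toward the population mean.

    Regression fraction = random variance / (random variance + talent variance),
    with the random variance taken from the binomial at the population mean.
    """
    random_var = pop_mean * (1 - pop_mean) / n
    regression = random_var / (random_var + talent_sd**2)
    return regression, observed - regression * (observed - pop_mean)

for flips in (100, 500):
    reg, est = regressed_estimate(0.60, flips)
    print(f"{flips} flips at 60% heads: regress {reg:.0%} -> estimate {est:.1%}")
```

This toy version lands at about 53% for 100 flips and about 56% for 500 flips, the same ballpark as the 52% and roughly 55% figures above; the exact values depend on the assumed talent spread.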

The point is that the sample size of the observed measurement determines how much we regress the observed amount towards the mean. The larger the sample, the less we regress. With one season of observed splits, we regress a lot. With career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe an SD of 10 percent, so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at a true 60/40 and even some coins at 70/30 and 80/20. The chances that we have a truly biased coin when we get an unusual result are much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, that it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrest (I almost forgot what started this whole thing), his career TBF is 604, but that’s not his sample size for platoon splits, because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and TBF versus righties. If you don’t know what that means, don’t worry about it. A shortcut is to use the lesser of the two, which is almost always TBF versus lefties, or in Siegrest’s case, 231. That’s not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrest was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it’s not), and two, he has a really odd observed split of 47 points in reverse. That’s like flipping a coin 100 times and getting 70 heads and 30 tails, or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.
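For the curious, the harmonic mean business looks like this. The ~373 TBF versus righties is just the 604 career TBF minus the 231 versus lefties, so treat it as approximate:

```python
def harmonic_mean(a, b):
    """Effective sample size for a split based on two unequal sample sizes."""
    return 2 * a * b / (a + b)

tbf_vs_lhb = 231          # quoted in the text
tbf_vs_rhb = 604 - 231    # rough: career TBF minus TBF v. LHB
print(f"effective platoon sample: {harmonic_mean(tbf_vs_lhb, tbf_vs_rhb):.0f} TBF")
# The shortcut in the text -- just use the smaller side, 231 -- is a bit more
# conservative than the harmonic mean, which comes out around 285 here.
```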

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, is the only thing that determines how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent is the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrest’s true platoon split if  the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.

How do we determine the spread of talent in a population, like platoon talent? That is actually easy, but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season, with the 300 being the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn’t matter. Nor does the number of opportunities. They don’t even have to be the same for all pitchers. It is just easier to explain it that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial theorem. .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that the player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore that. They really are small. We can of course estimate that too and throw it into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember, we need 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: the spread of talent and the sample size of the observed results. That gives us the regression amount. From there we merely move the observed result toward the mean by that amount, like I did above with Siegrest’s -47 points and the mean of +29 for a league-average LHP.
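In code form, method one is just a couple of lines. This sketch uses the made-up BA numbers from the paragraph above (a random variance of .000625 for 400 AB and an observed variance twice that):

```python
import math

# The illustrative numbers from the text: random (binomial) variance of BA in
# 400 AB, and an observed variance across players that is twice as large.
random_var   = 0.000625
observed_var = 2 * random_var

talent_var = observed_var - random_var   # observed = random + talent
talent_sd  = math.sqrt(talent_var)       # ~.025, i.e. 25 points of BA
regression = random_var / observed_var   # ratio of random to total variance

print(f"talent SD: {talent_sd:.3f}")
print(f"regression for this sample size: {regression:.0%}")   # 50% for these made-up numbers
```

The 50% regression at the end applies only to these made-up numbers; the real figure for a 400 AB sample of BA would depend on the actual variances.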

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or an intra-class correlation (ICC) which is essentially the same thing but it correlates every PA (or whatever the opportunity is) to every other PA within a sample.  When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
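And method two is even simpler in code. A minimal sketch of the 1-minus-r shrinkage from the BA example in the paragraph above:

```python
def regress_toward_mean(observed, mean, r):
    """Shrink an observed rate toward the mean; r is the year-to-year
    correlation at this sample size, so (1 - r) is the regression amount."""
    return mean + r * (observed - mean)

# The example from the text: .300 hitter in 500 AB, league average .250, r = .4.
print(f"{regress_toward_mean(0.300, 0.250, 0.4):.3f}")   # 0.270
```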

Using method #1 tells us precisely what the spread in talent is. Method 2 tells us that implicitly by looking at the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.

Platoon

Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, AB, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample, the higher the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size, but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

This is for an average of 200 opportunities on the smaller side of the split (TBF versus LHB for pitchers, PA versus LHP for batters), so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrest’s -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrest’s true platoon split. The fact that -47 points is very far from the average left-handed pitcher’s +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we “trust it more.” If the regression were 99%, then whether the observed split were -47 or +10, we would arrive at nearly the same answer. Don’t confuse the regression with the observed result. One has nothing to do with the other. And don’t think in terms of “trusting” the observed result or not. Regress the result and that’s your answer. If you arrive at answer X, it makes no difference whether your starting point, the observed result, was A, B, or C. None whatsoever. That is a very important point. I don’t know how many times I have heard, “But he had a 47 point reverse split in his entire career! You can’t possibly be saying that you estimate his real split to be +10 or +12 or whatever it is.” Yes, that’s exactly what I’m saying. A +10 estimated split is exactly the same whether the observed split were -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrest is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember, the +10 is a weighted mean, which means that it is in the middle of our error bars. The answer to that is, “No, no and no!” Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution rarely should come into play. Basically, the noise in the result of a sample of 1 is so large that it doesn’t matter at all what the uncertainty level of your estimates is.

So what do we estimate Siegrest’s true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed split by 1 minus 230/1255, or .82 (82%), towards the league average for lefty pitchers, which is around 29 points for a LHP. 82% of 76 points is 62 points. So we regress his -47 points by 62 points in the plus direction, which gives us an estimate of +15 points for his true platoon split. That is half the split of an average LHP, but it is plus nonetheless.
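Here is that calculation as a small sketch. The 1255 constant is simply the number implied by the 1 minus 230/1255 figure above, not some official formula:

```python
def estimate_true_split(observed, tbf, league_mean, denom=1255):
    """Regress an observed platoon split toward the league mean.

    The regression amount is 1 - tbf/denom (i.e. the reliability is tbf/denom),
    mirroring the 1 - 230/1255 = ~.82 figure used in the text.
    """
    regression = 1 - tbf / denom
    estimate = observed + regression * (league_mean - observed)
    return estimate, regression

est, reg = estimate_true_split(observed=-47, tbf=230, league_mean=29)
print(f"regress {reg:.0%} -> estimated true split: {est:+.0f} points")  # ~+15
```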

That means that a left-handed hitter like Coghlan will hit better than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrest’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrest as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrest is small.

We also have another method for estimating true platoon splits for pitchers, which can be used to enhance the method above that uses sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It’s not like pitchers deliberately throw better or harder to one side or the other, or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things: One is arm angle. If you’ve ever played or watched baseball, that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrest? His arm angle is estimated by Jared Cross of Steamer, using pitch f/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrest’s 2015 numbers into that regression equation, we get an estimated platoon from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he uses around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now, rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn’t want to piss off Lester, his teammates, and possibly the fan base. Who knows?

Many people don’t realize that one of the (many) weaknesses of UZR, at least for the infield, is that it ignores any ground ball in which the infield was configured in some kind of a “shift” and it “influenced the play.” I believe that’s true of DRS as well.

What exactly constitutes “a shift” and how they determine whether or not it “influenced the play” I unfortunately don’t know. It’s up to the “stringers” (the people who watch the plays and input and categorize the data) and the powers that be at Baseball Info Solutions (BIS). When I get the data, there is merely a code, “1” or “0”, for whether there was a “relevant shift” or not.

How many GB are excluded from the UZR data? It varies by team, but in 2015 so far, about 21% of all GB are classified by BIS as “hit into a relevant shift.” The average team has had 332 shifts in which a GB was ignored by UZR (and presumably DRS) and 1268 GB that were included in the data that the UZR engine uses to calculate individual UZR’s. The number of shifts varies considerably from team to team, with the Nationals, somewhat surprisingly, employing the fewest, with only 181, and the Astros with a whopping 682 so far this season. Remember these are not the total number of PA in which the infield is in a shifted configuration. These are the number of ground balls in which the infield was shifted and the outcome was “relevant to the shift,” according to BIS. Presumably, the numbers track pretty well with the overall number of times that each team employs some kind of a shift. It appears that Washington disdains the shift, relatively speaking, and that Houston loves it.

In 2014, there were many fewer shifts than in this current season. Only 11% of ground balls involved a relevant shift, around half the 2015 rate. The trailer was the Rockies, with only 92, and the leader, the Astros, with 666. The Nationals last year had the 4th fewest in baseball.

Here is the complete data set for 2014 and 2015 (as of August 30):

2014

Team GB Shifted Not shifted % Shifted
ari 2060 155 1905 8
atl 1887 115 1772 6
chn 1958 162 1796 8
cin 1938 125 1813 6
col 2239 92 2147 4
hou 2113 666 1447 32
lan 2056 129 1927 6
mil 2046 274 1772 13
nyn 2015 102 1913 5
phi 2105 177 1928 8
pit 2239 375 1864 17
sdn 1957 133 1824 7
sln 2002 193 1809 10
sfn 2007 194 1813 10
was 1985 116 1869 6
mia 2176 125 2051 6
ala 1817 170 1647 9
bal 1969 318 1651 16
bos 1998 247 1751 12
cha 2101 288 1813 14
cle 2003 265 1738 13
det 1995 122 1873 6
kca 1948 274 1674 14
min 2011 235 1776 12
nya 1902 394 1508 21
oak 1980 244 1736 12
sea 1910 201 1709 11
tba 1724 376 1348 22
tex 1811 203 1608 11
tor 1919 328 1591 17

 

2015

Team GB Shifted Not shifted % Shifted
ari 1709 355 1354 21
atl 1543 207 1336 13
chn 1553 239 1314 15
cin 1584 271 1313 17
col 1741 533 1208 31
hou 1667 682 985 41
lan 1630 220 1410 13
mil 1603 268 1335 17
nyn 1610 203 1407 13
phi 1673 237 1436 14
pit 1797 577 1220 32
sdn 1608 320 1288 20
sln 1680 266 1414 16
sfn 1610 333 1277 21
was 1530 181 1349 12
mia 1591 229 1362 14
ala 1493 244 1249 16
bal 1554 383 1171 25
bos 1616 273 1343 17
cha 1585 230 1355 15
cle 1445 335 1110 23
det 1576 349 1227 22
kca 1491 295 1196 20
min 1655 388 1267 23
nya 1619 478 1141 30
oak 1599 361 1238 23
sea 1663 229 1434 14
tba 1422 564 858 40
tex 1603 297 1306 19
tor 1539 398 1141 26

 

The individual fielding data (UZR) for infielders that you see on Fangraphs is based on non-shifted ground balls only, or on ground balls where there was a shift but it wasn’t relevant to the outcome. The reason that shifts are ignored in UZR (and DRS, I think) is because we don’t know where the individual fielders are located. It could be a full shift, a partial shift, the third baseman could be the left-most fielder as he usually is or he could be the man in short right field between the first baseman and the second baseman, etc. The way most of the PBP defensive metrics work, it would be useless to include this data.

But what we can do, with impunity, is include all ground ball data in a team UZR. After all, if a hard ground ball is hit at the 23 degree vector, and we are only interested in team fielding, we don’t care who the closest fielder is or where he is playing. All we care about is whether the ball was turned into an out, relative to the league average out rate for a similar ground ball in a similar context (or adjusted for context). In other words, using the same UZR methodology, we can calculate a team UZR using all ground ball data, with no regard for the configuration of the IF on any particular play. And if it is true that the type, number and timing (for example, against which batters and/or with which pitchers) of shifts is relevant to a team’s overall defensive efficiency, team UZR in the infield should reflect not only the sum of individual fielding talent and performance, but also the quality of the shifts in terms of hit prevention. In addition, if we subtract the sum of the individual infielders’ UZR on non-shift plays from the total team UZR on all plays, the difference should reflect, at least somewhat, the quality of the shifts.
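The bookkeeping for that last idea is trivial, something like this sketch (all of the numbers and variable names are hypothetical; team_uzr_all_gb would come from running the UZR engine on every ground ball, and the individual numbers from the non-shifted balls only):

```python
# Hypothetical example: back out a rough "shift value" for a team by comparing
# team UZR on all ground balls with the sum of its infielders' individual UZRs,
# which are computed on non-shifted ground balls only.
team_uzr_all_gb = 12.5                       # runs, from the UZR engine on every GB
individual_uzr_nonshift = {
    "1B": -1.0, "2B": 4.5, "SS": 6.0, "3B": 1.5,   # made-up individual UZRs
}

shift_value = team_uzr_all_gb - sum(individual_uzr_nonshift.values())
print(f"implied value of shifting/positioning: {shift_value:+.1f} runs")
```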

I want to remind everyone that UZR accounts for several contexts. One, park factors. For infield purposes, although the dimensions of all infields are the same, the hardness and quality of the infield can differ from park to park. For example, in Coors Field in Colorado and Chase Field in Arizona, the infields are hard and quick, and thus more ground balls scoot through for hits even if they leave the bat with the same speed and trajectory.

Two, the speed of the batter. Obviously faster batters require the infielders to play a little closer to home plate and they beat out infield ground balls more often than slower batters. In some cases the third baseman and/or first baseman have to play in to protect against the bunt. This affects the average “caught percentage” for any given category of ground balls. The speed of the opposing batters tends to even out for fielders and especially for teams, but still, the UZR engine tries to account for this just in case it doesn’t, especially in small samples.

The third context is the position of the base runners and number of outs. This affects the positioning of the fielders, especially the first baseman (whether first base is occupied or not). The handedness of the batters is the next context. As with batter speed, these also tend to even out in the long run, but it is better to adjust for them just in case.

Finally, the overall GB propensity of the pitchers is used to adjust the average catch rates for all ground balls. The more GB oriented a pitcher is, the softer his ground balls are. While all ground balls are classified in the data as soft, medium, or hard, even within each category the speed, and consequently the catch rates, vary according to the GB tendencies of the pitcher. For example, for GB pitchers, their medium ground balls will be caught at a higher rate than the medium ground balls allowed by fly ball pitchers.
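If you wanted to fold those contexts into the sketch above, one simple (purely illustrative) way is to treat each one as a multiplier on the baseline out rate for a bucket. The factor values below are made up for the example; the actual UZR adjustments are estimated from the data itself.

def adjusted_out_rate(base_rate, park=1.0, batter_speed=1.0, batter_hand=1.0,
                      runners_outs=1.0, pitcher_gb=1.0):
    # base_rate: league-average out rate for this GB bucket; each factor is a
    # context multiplier relative to 1.0 (illustrative values only)
    rate = base_rate * park * batter_speed * batter_hand * runners_outs * pitcher_gb
    return min(rate, 1.0)  # an out rate can't exceed 100%

# e.g. a medium GB on a quick infield like Coors, fast batter, GB pitcher
print(adjusted_out_rate(0.60, park=0.96, batter_speed=0.97, pitcher_gb=1.03))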

So keep in mind that individual and team UZR adjust, as best they can, for these contexts. In most cases, there is not a whole lot of difference between the context adjusted UZR numbers and the unadjusted ones. Also keep in mind that the team UZR numbers you see in this article are adjusted for park, batter hand and speed, and runners/outs, the same as the individual UZR’s you see on Fangraphs.

For this article, I am interested in team UZR including when the IF is shifted. Even though we are typically interested in individual defensive performance and talent, it is becoming more and more difficult to evaluate individual fielding for infielders, because of the prevalence of the shift, and because there is so much disparity in how often each team employs the shift (so that we might be getting a sample of only 60% of the available ground balls for one team and 85% for another).

One could speculate that teams that employ the most shifts would have the best team defense. To test that, we could look at each team’s UZR versus their relevant shift percentage. The problem, of course, is that the talent of the individual fielders is a huge component of team UZR, regardless of how often a team shifts. There may also be selective sampling going on. Maybe teams that don’t have good infield defense feel the need to shift more often such that lots of shifts get unfairly correlated with (but are not the cause of) bad defense.

One way we can separate out talent from shifting is to compare team UZR on all ground balls with the total of the individual UZR’s for all the infielders (on non-shifted ground balls). The difference may tell us something about the efficacy of the shifts and non-shifts. In other words, total team individual infield UZR, which is just the total of each infielder’s UZR as you would see on Fangraphs (range and ROE runs only), represents what we generally consider to be a sample of team talent. This is measured on non-shifted ground balls only, as explained above.

Team total UZR, which measures team runs saved or cost with no regard for which fielder did or did not make each play, and which is based on every ground ball, shifted or not, represents how the team actually performed on defense and is a much better measure of team defense than totaling the individual UZR’s. The difference, then, to some degree, represents how efficient teams are at shifting or not shifting, regardless of how often they shift.
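Here is the rough shape of that difference. I prorate the individual total to 100% of the plays and regress the proration (more on that below); the regression amount in the sketch is just a placeholder, not the exact one used for the tables, so it will only land in the neighborhood of the published values.

def team_minus_individual(team_uzr_all, individual_uzr_nonshift,
                          nonshift_gb, total_gb, regress=0.5):
    # prorate the non-shift individual total to all ground balls...
    prorated = individual_uzr_nonshift * (total_gb / nonshift_gb)
    # ...then pull it part of the way back toward the unprorated total
    # (regress=0.5 is a placeholder, not the actual regression used here)
    prorated = prorated * (1 - regress) + individual_uzr_nonshift * regress
    return team_uzr_all - prorated

# e.g. Kansas City's 2015 line: +10 team UZR, +26.3 individual UZR,
# 1196 non-shifted ground balls out of 1491 total
print(round(team_minus_individual(10, 26.3, 1196, 1491), 1))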

There are lots of issues that would have to be considered when evaluating whether shifts work or not. For example, maybe shifting too much with runners on base results in fewer DP because infielders are often out of position. Maybe stolen bases are affected for the same reason. Maybe the number and quality of hits to the outfield change as a result of the shift. For example, if a team shifts a lot, maybe they don’t appear to record more ground ball outs, but the shifted batters are forced to try and hit the ball to the opposite field more often and thus they lose some of their power.

Maybe it appears that more ground balls are caught, but because pitchers are pitching “to the shift” they become more predictable and batters are actually more successful overall (despite their ground balls being caught more often). Maybe shifts are spectacularly successful against some stubborn and pull-happy batters and not very successful against others who can adjust or even take advantage of a shift in order to produce more, not less, offense. Those issues are beyond the scope of UZR and this article.

Let’s now look at each team in 2014 and 2015, their shift percentage, their overall team UZR, their team UZR when shifting, when not shifting, and their total individual combined UZR when not shifting. Remember this is for the infield only.

2015

Team   % Shifts   Shift Runs   Non-Shift Runs   Team Runs   Total Individual Runs   Team Minus Individual (Individual Runs prorated to 100% of plays)
KCA 20 -2.2 10.5 10 26.3 -19.6
LAN 13 -5 -7.3 -13.3 0.8 -14.2
TOR 26 -2.5 13.9 11 22.6 -15.6
CHA 15 -7.7 -12.3 -21.8 -11.9 -8.9
CLE 23 0.6 3.3 3.3 12.8 -11.4
MIN 23 3.5 -11.6 -7.6 1.8 -9.7
MIL 17 0.3 -7.1 -6.7 2.5 -9.5
SEA 14 -2.6 -8.7 -13.8 -5.1 -8.3
SFN 21 2.3 12.6 15.8 24.4 -11.8
MIA 14 0.5 2.7 2.4 8.4 -6.7
ARI 21 3.4 -1.5 2.1 8 -7.0
HOU 41 -7.6 -3.2 -11.3 -6.1 -3.1
PHI 14 -6.4 -16.4 -23.5 -19 -3.0
COL 31 -7.3 0 -5.5 -1.5 -3.7
ATL 13 3.1 6.9 9.8 12.6 -3.7
SLN 16 -1.1 -5.8 -8.8 -7 -1.1
DET 22 1.8 -16.2 -17.8 -16 0.5
ALA 16 -2.4 -0.4 -3.6 -2.8 -0.5
BOS 17 0.3 4.8 3.5 2.7 0.5
NYN 13 -3.8 3.1 0.8 -2.7 3.7
WAS 12 1.1 -9.4 -8.4 -12.6 5.1
CIN 17 5 9.8 16.2 11.2 3.9
CHN 15 0.2 18.7 17.4 10.5 6.0
BAL 25 10.6 -0.5 14.4 5.8 7.6
SDN 20 7.5 -6.8 1.5 -7.8 10.3
TEX 19 4.1 12.8 19.6 10.1 8.3
TBA 40 0.1 4.5 7 -9.2 19.3
NYA 30 0.1 11.8 12.2 -6.6 20.2
PIT 32 0.3 0.3 0.1 -21 26.0
OAK 23 3.9 -8.8 -5 -31.4 31.1


The last column, as I explained above, represents the difference between how we think the infield defense performed based on individual UZR’s only (on non-shifted GB), prorated to 100% of the plays (the proration is actually regressed so that we don’t have the “on pace for” problem), and how the team actually performed on all ground balls. If the difference is positive, then we might assume that the shifts and non-shifts are being done in an effective fashion, regardless of how often shifts are occurring. If it is negative, then somehow the combination of shifts and non-shifts is costing some runs. Or the difference might not be meaningful at all – it could just be noise. At the very least, this is the first time that you are seeing real infield team defense being measured based on the characteristics of each and every ground ball and the context in which they were hit, regardless of where the infielders are playing.

First of all, if we compare the teams with a negative difference in the last column (presumably the worst shift/no-shift efficiency) to those with a positive difference (presumably the best), we find that there is no difference in their average shift percentages. For example, TBA and HOU have the most shifts by far, yet HOU “cost” its team 5.2 runs while TBA benefited by 16.2 runs. LAN and WAS had the fewest shifts, and one of them gained 4 runs while the other lost 14 runs. The other teams are all over the board with respect to number of shifts and the difference between the individual UZR’s and team UZR.
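That comparison is easy to reproduce from the table above. The snippet below just shows the mechanics on a handful of teams; extend the dictionary to all 30 before drawing any conclusions.

# 2015 shift% and the last-column delta for a few teams (from the table above)
deltas = {'kca': (20, -19.6), 'lan': (13, -14.2), 'tor': (26, -15.6),
          'tba': (40, 19.3), 'nya': (30, 20.2), 'pit': (32, 26.0)}

plus = [pct for pct, d in deltas.values() if d > 0]
minus = [pct for pct, d in deltas.values() if d < 0]
print(sum(plus) / len(plus), sum(minus) / len(minus))  # average shift% for each group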

Let’s look at that last column for 2014 and compare it to 2015 to see if there is any consistency from year to year within teams. Do some teams consistently do better or worse with their shifting and non-shifting, at least for 2014 and 2015? Let’s also see if adding more data gives us any relationship between the last column (delta team and individual UZR) and shift percentage.

Team   2015 % Shift   2014 % Shift   2015 Team Minus Individual   2014 Team Minus Individual   Combined 2014 and 2015 Team Minus Individual
HOU 41 32 -5.2 45.6 40.4
TBA 40 22 16.2 12.7 28.9
PIT 32 17 21.1 5.5 26.6
TEX 19 11 9.5 9.9 19.4
WAS 12 6 4.2 13.0 17.2
OAK 23 12 26.4 -9.3 17.1
BAL 25 16 8.6 7.6 16.2
NYN 13 5 3.5 9.0 12.5
NYA 30 21 18.8 -8.4 10.4
CHA 15 14 -9.9 12.5 2.6
CHN 15 8 6.9 -5.8 1.1
TOR 26 17 -11.6 12.6 1.0
DET 22 6 -1.8 2.4 0.6
SFN 21 10 -8.6 6.0 -2.6
CIN 17 6 5 -8.2 -3.2
CLE 23 13 -9.5 5.2 -4.3
MIL 17 13 -9.2 3.1 -6.1
ARI 21 8 -5.9 -0.2 -6.1
SDN 20 7 9.3 -15.7 -6.4
MIA 14 6 -6 -0.9 -6.9
BOS 17 12 0.8 -10.6 -9.8
KCA 20 14 -16.3 6.3 -10.0
ATL 13 6 -2.8 -7.5 -10.3
PHI 14 8 -4.5 -6.2 -10.7
ALA 16 9 -0.8 -11.6 -12.4
SLN 16 10 -1.8 -12.2 -14.0
LAN 13 6 -14.1 -2.5 -16.6
MIN 23 12 -9.4 -9.3 -18.7
SEA 14 11 -8.7 -11.3 -20.0
COL 31 4 -4 -23.0 -27.0


Although there appears to be little correlation from one year to the next for each of the teams, we do find that the teams with the least efficient shifts/non-shifts (negative values in the last column) averaged 14% shifts per season in 2014 and 2015, while those with the most efficient (positive values in the last column) averaged 19%. As well, the two teams with the biggest gains, HOU and TB, had the most shifts, at 37% and 31% per season, respectively. The two worst teams, Colorado and Seattle, shifted 17% and 13% per season. On the other hand, the team with the fewest shifts in baseball in 2014 and 2015 combined, the Nationals, gained over 17 runs in team UZR on all ground balls compared to the total of the individual UZR’s on non-shifted balls only, suggesting that the few shifts they employed were very effective, which seems reasonable.
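For anyone who wants to check the year-to-year consistency and the relationship with shift percentage directly, this is the shape of that calculation. Only a few teams from the table are filled in here as a placeholder; the real check uses all 30.

from statistics import correlation  # Python 3.10+

teams = {  # (2015 shift%, 2014 shift%, 2015 delta, 2014 delta) from the table
    'hou': (41, 32, -5.2, 45.6),
    'tba': (40, 22, 16.2, 12.7),
    'was': (12, 6, 4.2, 13.0),
    'lan': (13, 6, -14.1, -2.5),
    'col': (31, 4, -4.0, -23.0),
}
d15 = [v[2] for v in teams.values()]
d14 = [v[3] for v in teams.values()]
avg_shift = [(v[0] + v[1]) / 2 for v in teams.values()]
combined = [a + b for a, b in zip(d15, d14)]

print(correlation(d14, d15))             # year-to-year repeatability of the delta
print(correlation(avg_shift, combined))  # shift% vs. the combined 2014-2015 delta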

It is also interesting to note that the team that had the worst difference between team and individual UZR in 2014, the Rockies, shifted only 4% of the time, easily the lowest rate in baseball. In 2015, they have been one of the most shift-heavy teams, and still their team UZR is 4 runs worse than their total individual UZR’s. Still, that’s a lot better than in 2014.

It also appears that many of the smarter teams are figuring out how to optimize their defense beyond the talent of the individual players. TB, PIT, HOU, WAS, and OAK are at the top of the list in positive deltas (the last column). These teams are generally considered to have progressive front offices. Some of the teams with the most negative numbers in the last column, those which appear to be sub-optimal in their defensive alignment, are LAN, MIN, SEA, PHI, COL, ATL, SLN, and ALA, all with reputations for having less than progressive front offices and philosophies, to one degree or another. In fact, other than a few outliers, like Boston, Texas, and the White Sox, the order of the teams in the chart above looks like a reasonable ranking from most to least progressive. Certainly the teams in the top half appear to be the most saber-savvy, and those in the bottom half, the least.

In conclusion, it is hard to look at this data and figure out whether and which teams are using their shifts and non-shifts effectively. There doesn’t appear to be a strong correlation between shift percentage and the difference between team and individual defense, although there are a few anecdotes that suggest otherwise. As well, in the aggregate for 2014 and 2015 combined, the teams whose team defense outperformed the total of their individual UZR’s shifted more often, 19% to 13%.

There also appears to the naked eye to be a strong correlation between the perceived sabermetric orientation of a team’s front office and the efficiency of their shift/non-shift strategy, at least as measured by the numbers in the last column, explained above.

I think the most important thing to take away from this discussion is that there can be a fairly large difference between team infield UZR which uses every GB, and the total of the individual UZR’s which uses only those plays in which no shift was relevant to the outcome of the play. As well, the more shifts employed by a team, the less we should trust that the total of the individual performances are representative of the entire team’s defense on the infield. I am also going to see if Fangraphs can start publishing team UZR for infielders and for outfielders, although in the outfield, the numbers should be similar if not the same.

Recently there has been some discussion about the use of WAR in determining or at least discussing an MVP candidate for position players (pitchers are eligible too for MVP, obviously, and WAR includes defense and base running, but I am restricting my argument to position players and offensive WAR). Judging from the comments and questions coming my way, many people don’t understand exactly what WAR measures, how it is constructed, and what it can or should be used for.

In a nutshell, offensive WAR takes each of a player’s offensive events in a vacuum, without regard to the timing and context of the event or whether that event actually produced or contributed to any runs or wins, and assigns a run value to it, based on the theoretical run value of that event (linear weights), adds up all the run values, converts them to theoretical “wins” by dividing by some number around 10, and then subtracts the approximate runs/wins that a replacement player would have in that many PA. A replacement player produces around 20 runs less than average for every 650 PA, by definition. This can vary a little by defensive position and by era. And of course a replacement player is defined as the talent/value of a player who can be signed for the league minimum even if he is not protected (a so-called “freely available player”).

For example, let’s say that a player had 20 singles, 5 doubles, 1 triple, 4 HR, 10 non-intentional BB+HP, and 60 outs in 100 PA. The approximate run values for these events are .47, .78, 1.04, 1.40, .31, and -.30. These values are marginal run values and by definition are above or below a league average position player. So, for example, if a player steps up to the plate and gets a single, on the average he will generate .47 more runs than 1 generic PA of a league average player. These run values and the zero run value of a PA for a league average player assume the player bats in a random slot in the lineup, on a league average team, in a league average park, against a league-average opponent, etc.

If you were to add up all those run values for our hypothetical player, you would get +5 runs. That means that theoretically this player would produce 5 more runs than a league-average player on a league average team, etc. A replacement player would generate around 3 fewer runs than a league average player in 100 PA (remember I said that replacement level was around -20 runs per 650 PA), so our hypothetical player is 8 runs above replacement in those 100 PA.
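Here is that arithmetic spelled out. The divide-by-10 runs-to-wins conversion and the -20 runs per 650 PA replacement level are the round numbers used above, not precise constants.

# the hypothetical 100-PA line: (count, approximate marginal run value)
events = {'1B': (20, 0.47), '2B': (5, 0.78), '3B': (1, 1.04),
          'HR': (4, 1.40), 'NIBB+HP': (10, 0.31), 'out': (60, -0.30)}

runs_above_avg = sum(n * v for n, v in events.values())  # roughly +5
runs_above_repl = runs_above_avg + 20 * (100 / 650)      # roughly +8
wins_above_repl = runs_above_repl / 10                   # roughly 0.8

print(round(runs_above_avg, 1), round(runs_above_repl, 1), round(wins_above_repl, 2))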

The key here is that these are hypothetical runs. If that player produced those offensive events while in a league average context an infinite number of times he would produce exactly 5 runs more than an average player would produce in 100 PA and his team would win around .5 more games (per 100 PA) than an average player and .8 more games (and 8 runs) than a replacement player.

In reality, for those 100 PA, we have no idea how many runs or wins our player contributed to. On the average, or after an infinite number of 100 PA trials, his results would have produced an extra 5 runs and 1/2 win, but in any one 100 PA trial, that exact result is unlikely, just as in 100 flips of a coin, exactly 50 heads and 50 tails is an unlikely though “mean” or “average” outcome. Perhaps 15 of those 20 singles didn’t result in a single run being produced. Perhaps all 4 of his HR were hit after his team was down by 5 or 10 runs and they were meaningless. On the other hand, maybe 10 of those hits were game winning hits in the 9th inning. Similarly, of those 60 outs, what if 10 times there was a runner on third and 0 or 1 out, and our player struck out every single time? Alternatively, what if he drove in the runner 8 out of 10 times with an out, and half the time that run amounted to the game winning run? WAR would value those 10 outs exactly the same in either case.

You see where I’m going here? Context is ignored in WAR (for a good reason, which I’ll get to in a minute), yet context is everything in an MVP discussion. Let me repeat that: Context is everything in an MVP discussion. An MVP is about the “hero” nature of a player’s seasonal performance. How much did he contribute to his team’s wins and to a lesser extent, what did those wins mean or produce (hence, the “must be on a contending team” argument). Few rational people are going to consider a player MVP-quality if little of his performance contributed to runs and wins no matter how “good” that performance was in a vacuum. No one is going to remember a 4 walk game when a team loses in a 10-1 blowout. 25 HR with most of them occurring in losing games, likely through no fault of the player? Ho-hum. 20 HR, where 10 of them were in the latter stages of a close game and directly led to 8 wins? Now we’re talking possible MVP! .250 wOBA in clutch situations but .350 overall? Choker and bum, hardly an MVP.

I hope you are getting the picture. While there are probably several reasonable ways to define an MVP, and reasonable and smart people can legitimately debate whether it is Trout, Miggy, Kershaw or Goldy, I think that most reasonable people will agree that an MVP has to have had some – no, a lot – of articulable performance contributing to actual, real-life runs and wins; otherwise that “empty WAR” is merely a tree falling in the forest with no one to hear it.

So what is WAR good for and why was it “invented?” Mostly it was invented as a way to combine all aspects of a player’s performance – offense, defense, base running, etc. – on a common scale. It was also invented to be able to estimate player talent and to project future performance. For that it is nearly perfect. The reason it ignores context is because we know that context is not part of a player’s skill set to any significant degree. Which also means that context-non-neutral performance is not predictive – if we want to project future performance, we need a metric that strips out context – hence WAR.

But, for MVP discussions? It is a terrible metric for the aforementioned reasons. Again, regardless of how you define MVP caliber performance, almost everyone is in agreement that it includes and needs context, precisely that which WAR disdains and ignores. Now, obviously WAR will correlate very highly with non-context-neutral performance. That goes without saying. It would be unlikely that a player who is a legitimate MVP candidate does not have a high WAR. It would be equally unlikely that a player with a high WAR did not specifically contribute to lots of runs and wins and to his team’s success in general. But that doesn’t mean that WAR is a good metric to use for MVP considerations. Batting average correlates well with overall offensive performance and pitcher wins correlate well with good pitching performance, but we would hardly use those two stats to determine who was the better overall batter or pitcher. And to say, for example, that Trout is the proper MVP and not Cabrera because Trout was 1 or 2 WAR better than Miggy, without looking at context, is an absurd and disingenuous argument.

So, is there a good or at least a better metric than WAR for MVP discussions? I don’t know. WPA perhaps. WPA in winning games only? WPA with more weight for winning games? RE27? RE27, again, adjusted for whether the team won or lost or scored a run or not? What really matters is not which metric you use for these discussions but why you use it. It is not so much that WAR is a poor metric for determining an MVP; the mistake is using WAR without understanding what it measures and why, by itself, it is a poor choice for an MVP discussion. As long as you understand what each metric means (including traditional mundane ones like RBI, runs, etc.), how it relates to the player in question and the team’s success, feel free to use whatever you like (hopefully a combination of metrics and statistics) – just make sure you can justify your position in a rational, logical, and accurate fashion.


In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average than the ones who do not, assuming that the latter were allowed to stay in the game of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.
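To spell out that back-of-the-envelope calculation: blend the observed performance of the pitchers who stayed with the assumed performance of the ones who would have been pulled, as if everyone had continued, and see what TTOP that implies. With the 50/50 split from the example above:

def implied_ttop(true_talent, stay_ra9, go_ra9, stay_frac=0.5):
    # blend the two groups as if everyone had been allowed to continue
    blended = stay_frac * stay_ra9 + (1 - stay_frac) * go_ra9
    return blended - true_talent

# 4.0 true talent, stayers at 4.3, pulled pitchers assumed to be at 4.8
print(round(implied_ttop(4.0, 4.3, 4.8), 2))  # 0.55, versus the ~.33 TTOP we observe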

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision about whether or not to remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That’s a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue, especially since close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0, or the game is tied 0-0, 1-1, or 2-2, and the starter’s pitch count is not high, managers are typically loath to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP penalty and then adjusting for batter and park.
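The expectation works out like this (the .33 TTOP and the combined -.28 park/batter adjustment are the figures just described):

seasonal_ra9 = 4.39      # league-average starters, 2003-2013 AL sample
ttop_penalty = 0.33      # facing the order for the third time
park_batter_adj = -0.28  # slight pitcher's parks plus weaker-than-average batters

expected_ra9 = seasonal_ra9 + ttop_penalty + park_batter_adj
print(round(expected_ra9, 2))  # 4.44, i.e. the ~4.45 level cited above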

Surely their managers, in allowing them to pitch in a very close game in the 7th, know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitched through 6 were allowed to continue even in games that were not close, essentially no different from the close games. The other numbers are similar too. The ones who were allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should be (and presumably think they are) making very careful decisions about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn’t really matter whom they pitch, in terms of the WE of the game, and the decision-making should be oriented toward the long term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If they were, we would see much better performance from our starters in close games than in not-so-close ones, and good performance in general. Instead we see rather poor performance, replacement-level reliever numbers, in the 7th inning of both close and not-so-close games. Surely that belies the “managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In games in which a starter allowed 0 or 1 runs through 6 but his FIP was high (greater than 4), suggesting that he really wasn’t pitching such a great game, his manager let him continue 78% of the time, which was more than the 74% overall rate at which starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP (less than 3), suggesting that he pitched better than his runs allowed would indicate, managers let him continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than to component results when deciding whether to pull their starter in the 7th. We know that this is not a good decision-making process, as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.
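If you wanted to reproduce that tabulation, the shape of it is below. The game-log field names are hypothetical; the thresholds are the ones used above.

def continue_rates(games):
    # games: list of dicts with 'runs_thru6', 'fip_thru6' and 'pitched_7th' (0 or 1)
    buckets = {
        'low RA, high FIP': lambda g: g['runs_thru6'] <= 1 and g['fip_thru6'] > 4,
        'high RA, low FIP': lambda g: g['runs_thru6'] >= 3 and g['fip_thru6'] < 3,
    }
    rates = {}
    for name, rule in buckets.items():
        subset = [g for g in games if rule(g)]
        if subset:
            rates[name] = sum(g['pitched_7th'] for g in subset) / len(subset)
    return rates

# toy usage with two made-up games
toy = [{'runs_thru6': 1, 'fip_thru6': 4.5, 'pitched_7th': 1},
       {'runs_thru6': 3, 'fip_thru6': 2.5, 'pitched_7th': 0}]
print(continue_rates(toy))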

In addition, there is no evidence that managers can correctly determine who should stay and who should be pulled in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters have gas left in the tank and which do not? They don’t. They really, really, really don’t.