There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That’s ridiculous of course. The outcome of a decision or the game has nothing to do with whether the decision was correct or not. Some decisions may raise or lower a team’s chances of winning from a baseline of 90% and other decisions may affect a baseline of 10 or 15%.

If decision A results in a team’s theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious is that if the manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the “unwritten rule” we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second-guess the manager whether he chooses A or B. All of that is silly and counter-productive.

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well-above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts and assuming controversial facts and legal theories in a light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must accept the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I’ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We’ll assume he is an average-hitting pitcher or so (I have heard that he is a poor-hitting pitcher). An average pitcher’s batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..


If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?



So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.
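The arithmetic in the last few paragraphs can be laid out as a short Python sketch. The three win-expectancy figures are the ones quoted in the text (59.5%, roughly 63%, and 68.8%); the constant and function names are made up for illustration.

```python
# Win expectancy (WE) figures quoted in the text, expressed as fractions.
WE_AUTOMATIC_OUT = 0.595  # Mets lead by 1, Cubs leading off the bottom of the 6th
WE_PITCHER_HITS = 0.63    # DeGrom bats ("Let's call it 63%")
WE_PINCH_HITTER = 0.688   # a league-average hitter bats instead


def we_added(we_with_batter, we_baseline=WE_AUTOMATIC_OUT):
    """WE a batter adds compared with making an automatic third out."""
    return we_with_batter - we_baseline


pitcher_gain = we_added(WE_PITCHER_HITS)
pinch_hitter_gain = we_added(WE_PINCH_HITTER)
# Cost of letting the pitcher hit, before accounting for who pitches next.
cost_of_letting_pitcher_hit = WE_PINCH_HITTER - WE_PITCHER_HITS

print(f"pitcher adds      {pitcher_gain:.1%}")
print(f"pinch hitter adds {pinch_hitter_gain:.1%}")
print(f"difference        {cost_of_letting_pitcher_hit:.1%}")
```

This covers only the batting half of the decision. The pitching half is a separate calculation.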

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let’s assume 1.5 innings, which I think is generous.

What is DeGrom’s expected RA9 for those innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom’s normal expected RA9? Steamer, ZIPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter. It is equivalent to an elite starter, but not quite the level of the Kershaws, Arrietas, or even the Prices or Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations and all managers can do in making these decisions is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.
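Here is a rough Python sketch of that adjustment. The text never states the league-average RA9 it uses, so the 4.1 below is an assumption back-solved from the quoted figures (83% of league average plus the .3 penalty coming out near 3.7). The function name is hypothetical.

```python
LEAGUE_RA9 = 4.1   # ASSUMPTION: not stated in the text, back-solved from its figures
TTO_PENALTY = 0.3  # third-time-through-the-order penalty, in runs per 9 innings


def third_time_ra9(pct_of_league, league_ra9=LEAGUE_RA9, penalty=TTO_PENALTY):
    """Expected RA9 for a starter facing the order for the third time."""
    overall_ra9 = pct_of_league * league_ra9
    return overall_ra9 + penalty


degrom_third_time = third_time_ra9(0.83)
print(f"83% of league average, third time through: {degrom_third_time:.1f} RA9")
```

A different league baseline would shift the result by a few hundredths of a run, which does not change the rest of the analysis.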

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 in 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 and convert that into WE. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
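The conversion just described can be written out as a small Python sketch. The text does not name its runs-to-wins factor. Going from .05 runs times a 1.7 leverage index to .0085 implies about 10 runs per win, so that value is used here as an assumption. The function name is made up for illustration.

```python
RUNS_PER_WIN = 10.0  # ASSUMPTION: implied by the text's .05 * 1.7 -> .0085 step


def we_edge(starter_ra9, reliever_ra9, innings, leverage, runs_per_win=RUNS_PER_WIN):
    """WE gained by using the starter instead of the reliever for `innings` innings."""
    runs_saved = (reliever_ra9 - starter_ra9) / 9.0 * innings
    return runs_saved * leverage / runs_per_win


# DeGrom at 3.7 RA9 versus an average reliever at 4.0, for 1.5 innings at 1.7 leverage.
edge = we_edge(3.7, 4.0, innings=1.5, leverage=1.7)
print(f"WE gained by leaving the starter in: {edge:.2%}")
```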

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of about 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.6% rather than 4.95% (5.8% minus 1.2%, rather than 5.8% minus .85%). 4.6% is still an enormous advantage. Reducing your team’s chances of winning by 4.6% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What about if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference in 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets’ chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with was a replacement level pitcher – the worst pitcher in the major leagues? He allows around 113% league average runs, or 4.6 RA9. Difference between DeGrom and him for 1.5 innings? 2.7% for a net loss of 3.1% by letting him hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, “Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead of DeGrom, you will increase your chances of winning by 3.1%,” what do you think he would say?
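These “what if” scenarios amount to a small sensitivity check, sketched below in Python. The RA9 values, the 1.7 leverage index and the 5.8% batting gain are taken from the text. The 10 runs per win factor is an assumption implied by the text's arithmetic, and all names are hypothetical. Because the text rounds its intermediate values, the printed results land close to, but not exactly on, the quoted percentages.

```python
RUNS_PER_WIN = 10.0     # ASSUMPTION: implied by the text's .05 runs * 1.7 LI -> .0085 WE
LEVERAGE = 1.7          # average leverage index in the 6th and 7th, road team up by 1
PINCH_HIT_GAIN = 0.058  # WE gained by a league-average hitter batting for DeGrom


def net_gain_from_pinch_hitting(starter_ra9, reliever_ra9, innings):
    """Batting gain minus the WE given up by replacing the starter on the mound."""
    runs_saved_by_starter = (reliever_ra9 - starter_ra9) / 9.0 * innings
    starter_we_edge = runs_saved_by_starter * LEVERAGE / RUNS_PER_WIN
    return PINCH_HIT_GAIN - starter_we_edge


# (starter RA9 third time through, reliever RA9, innings the starter would pitch)
scenarios = {
    "baseline: 1.5 IP, average reliever": (3.7, 4.0, 1.5),
    "starter pitches 2 full innings": (3.7, 4.0, 2.0),
    "Kershaw-level starter": (2.9, 4.0, 1.5),
    "replacement-level reliever": (3.7, 4.6, 1.5),
}

results = {name: net_gain_from_pinch_hitting(*args) for name, args in scenarios.items()}
for name, net in results.items():
    print(f"{name}: pinch hitting gains {net:.1%}")
```

Every scenario stays clearly positive for the pinch hitter, which is the “summary judgment” point: even inputs tilted toward leaving the starter in do not flip the sign.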

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the best possible light of allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

As an addendum to my article on platoon splits from a few days ago, I want to give you a simple trick for answering a question about a player, such as, “Given that a player performs X in time period T, what is the average performance we can expect in the future (or present, which is essentially the same thing, or at least a subset of it)?” I also want to illustrate the folly of using unusual single-season splits for projecting the future.

The trick is to identify as many players as you can in some period of time in the past (the more, the better, but sometimes the era matters so you often want to restrict your data to more recent years) that conform to the player in question in relevant ways, and then see how they do in the future. That always answers your question as best as it can. The certainty of your answer depends upon the sample size of the historical performance of similar players. That is why it is important to use as many players and as many years as possible, without causing problems by going too far back in time.

For example, say you have a player whom you know nothing about other than that he hit .230 in one season of 300 AB. What do you expect that he will hit next year? Easy to answer. There are thousands of players who have done that in the past. You can look at all of them and see what their collective BA was in their next season. That gives you your answer. There are other more mathematically rigorous ways to arrive at the same answer, but much of the time the “historical similar player method” will yield a more accurate answer, especially when you have a large sample to work with, because it captures all the things that your mathematical model may not. It is real life! You can’t do much better than that!

You can of course refine your “similar players” comparative database if you have more information about the player in question. He is left-handed? Use only left-handers in your comparison. He is 25? Use only 25-year-olds. What if you have so much information about the player in question that your “comp pool” starts to be too small to have a meaningful sample size (which only means that the certainty of your answer decreases, but not necessarily the accuracy)? Let’s say that he is 25, left-handed, 5’10” and 170 pounds, he hit .273 in 300 AB, and you want to include all of these things in your comparison. That obviously will not apply to too many players in the past. Your sample size of “comps” will be small. In that case, you can use players between the ages of 24 and 26, between 5’9” and 5’11”, who weigh between 160 and 180 and hit .265-.283 in 200 to 400 AB. It doesn’t have to be those exact numbers, but as long as you are not biasing your sample compared to the player in question, you should arrive at an accurate answer to your question.

What if we do that with a .230 player in 300 AB? I’ll use .220 to .240 and between 200 and 400 AB. We know intuitively that we have to regress the .230 towards the league average around 60 or 65%, which will yield around .245 as our answer. But we can do better using actual players and actual data. Of course our answer depends on the league average BA for our player in question and the league average BA for the historical data. Realistically, we would probably use something like BA+ (BA as compared to league-average batting average) to arrive at our answer. Let’s try it without that. I looked at all players who batted in that range from 2010-2014 in 200-400 AB and recorded their collective BA the next year. If I wanted to be a little more accurate (for this question it is probably not necessary), I might weight the results in year 2 by the AB in year 1, or use the delta method, or something like that.
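The “regress toward the league average” shortcut is a one-line formula, sketched here in Python. The league-average BA is not given in the text. The .255 used below is an assumption chosen because it reproduces the roughly .245 result, and the function name is hypothetical.

```python
LEAGUE_BA = 0.255  # ASSUMPTION: not stated in the text


def regress_toward_mean(observed, league_mean, regression):
    """Move `observed` the given fraction of the way toward `league_mean`."""
    return observed + regression * (league_mean - observed)


low_estimate = regress_toward_mean(0.230, LEAGUE_BA, 0.60)
high_estimate = regress_toward_mean(0.230, LEAGUE_BA, 0.65)
print(f"projected BA: {low_estimate:.3f} to {high_estimate:.3f}")
```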

If I do that for just 5 years, 2010-2015, I get 49 players who hit a collective .230 in year 1 in an average of 302 AB. The next year, they hit a collective .245, around what we would expect. That answers our question, “What would a .230 hitter in 300 AB hit next year, assuming he were allowed to play again (we don’t know from the historical data what players who were not allowed to play would hit)?”
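A minimal Python sketch of the “historical similar player” lookup follows. The handful of player seasons are invented purely to show the mechanics and are not real data. A real run would load every qualifying player-season from a historical database. The function and field layout are assumptions, not the author's actual tooling.

```python
# Each tuple: (year-1 hits, year-1 AB, year-2 hits, year-2 AB).
# INVENTED numbers for illustration only; not real players.
SEASONS = [
    (69, 300, 75, 310),   # .230 in year 1: qualifies
    (56, 250, 60, 240),   # .224 in year 1: qualifies
    (83, 350, 100, 400),  # .237 in year 1: qualifies
    (90, 300, 85, 300),   # .300 in year 1: outside the BA window
    (23, 100, 30, 120),   # .230 in year 1 but too few AB
]


def comp_projection(seasons, ba_range, ab_range):
    """Return (number of comps, collective year-1 BA, collective year-2 BA)."""
    comps = [
        s for s in seasons
        if ab_range[0] <= s[1] <= ab_range[1]
        and ba_range[0] <= s[0] / s[1] <= ba_range[1]
    ]
    year1_ba = sum(s[0] for s in comps) / sum(s[1] for s in comps)
    year2_ba = sum(s[2] for s in comps) / sum(s[3] for s in comps)
    return len(comps), year1_ba, year2_ba


n_comps, year1_ba, year2_ba = comp_projection(SEASONS, (0.220, 0.240), (200, 400))
print(f"{n_comps} comps hit {year1_ba:.3f} in year 1 and {year2_ba:.3f} in year 2")
```

Collective BA here is total hits over total AB, so players with more AB in year 2 count for more. Weighting by year-1 AB instead, as the text mentions, would be a small change to the last sum.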

What about .300 in 400 AB? I looked at all players from .280 to .350 in year 1 and between 300 and 450 AB. They hit a collective .299 in year 1 and .270 in year 2. Again, that answers the question, “What do we expect Player A to hit next year if he hit .300 this year in around 400 AB?”

For Siegrest with the -47 reverse split, we can use the same method to answer the question, “What do we expect his platoon split to be in the future given 230 TBF versus lefties in the past?” That is such an unusual split that we might have to tweak the criteria a little and then extrapolate. Remember that asking the question, “What do we expect Player A to do in the future?” is almost exactly the same thing as asking, “What is his true talent with respect to this metric?”

I am going to look at only one season for pitchers with around 200 BF versus lefties even though Siegrest’s 230 TBF versus lefties was over several seasons. It should not make much difference as the key is the number of lefty batters faced. I included all left-handed pitchers with at least 150 TBF versus LHB who had a reverse wOBA platoon difference of more than 10 points and pitched again the next year. Let’s see how they do, collectively, in the next year.

There were 76 such pitchers from 2003-2014. They had a collective platoon differential of -39 points, less extreme than Siegrest’s -47 points, in an average of 194 TBF versus LHB, also less than Siegrest’s 231. But, we should be in the ballpark with respect to estimating Siegrest’s true splits using this “in vivo” method. How did they do in the next year, which is a good proxy (an unbiased estimate) for their true splits?

In year 2, they had an average TBF versus lefties of 161, a little less than the previous year, which is to be expected, and their collective platoon splits were plus 8.1 points. So they went from -39 to plus 8.1 from one season to the next because one season of reverse splits is mostly a fluke as I explained in my previous article on platoon splits. 21 points is around the average for LHP with > 150 TBF v. lefties in this time period, so these pitchers moved 47 points from year 1 to year 2, out of a total of 60 points from year 1 to league average. That is a 78% regression toward the mean, around what we estimated Siegrest’s regression should be (I think it was 82%). That suggests that our mathematical model is good since it creates around the same result as when we used our “real live players” method.

How much would it take to estimate a true reverse split for a lefty? Let’s look at some more numbers. I’ll raise the bar to lefty pitchers with at least a 20 point reverse split. There were only 57 in those 12 years of data. They had a collective split in year 1 of -47, just like Siegrest, in an average of 191 TBF v. LHB. How did they do in year 2, which is the answer to our question of their true split? Plus 6.4 points. That is a 78% regression, the same as before.

What about pitchers with at least a 25 point reverse split? They averaged -51 points in year 1. Can we get them to a true reverse split?  Nope. Not even close.

What if we raise the sample size bar? I’ll do at least 175 TBF and -15 reverse split in year 1. Only 45 lefty pitchers fit this bill and they had a -43 point split in year 1 in 209 TBF v. lefties. Next year? Plus 2.8 points! Close but no cigar. There is of course an error bar around only 45 pitchers with 170 TBF v. lefties in year 2, but we’ll take those numbers on faith since that’s what we got. That is a 72% regression with 208 TBF v. lefties, which is about what we would expect given that we have a slightly larger sample size than before.
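The regression percentages quoted for these groups can be reproduced with a few lines of Python. The year-1 and year-2 splits and the 21-point league average come from the text, while the function name and group labels are only illustrative.

```python
LEAGUE_AVG_SPLIT = 21.0  # average platoon split (wOBA points) for lefty pitchers, per the text


def observed_regression(year1_split, year2_split, league_mean=LEAGUE_AVG_SPLIT):
    """Fraction of the distance to the league mean covered from year 1 to year 2."""
    return (year2_split - year1_split) / (league_mean - year1_split)


# (collective year-1 split, collective year-2 split), in wOBA points
groups = {
    "150+ TBF, reverse split over 10 points": (-39.0, 8.1),
    "150+ TBF, reverse split of 20+ points": (-47.0, 6.4),
    "175+ TBF, reverse split of 15+ points": (-43.0, 2.8),
}

regressions = {name: observed_regression(y1, y2) for name, (y1, y2) in groups.items()}
for name, frac in regressions.items():
    print(f"{name}: {frac:.1%} regression toward the mean")
```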

So please, please, please, when you see or hear of a pitcher with severe reverse splits in 200 or so BF versus lefties, which is around a full year for a starting pitcher or 2 or 3 years for a reliever, remember that our best estimate of his true platoon splits, or what his manager should expect when he sends him out there, is very, very different from what those actual one- or three-year splits suggest when those actual splits are very far away from the norm. Most of that unusual split, in either direction – almost all of it in fact – is likely a fluke. When we say “likely” we also mean that we must assume that it is a fluke and that we must also assume that the true number is the weighted mean of all the possibilities, which are those year 2 numbers, or year 1 (or multiple years) heavily regressed toward the league average.


With all the hullaballoo about Utley’s slide last night and the umpires’ calls or non-calls, including the one or ones in NY (whose names, addresses, telephone numbers, and social security numbers should be posted on the internet, according to Pedro Martinez), what was lost – or at least there was much confusion – was a discussion of the specific rule(s) that applies to that exact situation – the take-out slide that is, not whether Utley was safe or not on replay. For that you need to download the 2015 complete rule book, I guess. If you Google certain rule numbers, it takes you to the MLB “official rules” portion of their website in which at least some of the rule numbers appear to be completely different than in the actual current rule book.

In any case, last night after a flurry of tweets, Rob Neyer, from Fox Sports, pointed out the clearly applicable rule (although other rules come close): It is 5.09 (a) (13) in the PDF version of the current rulebook. It reads, in full:

The batter is out when… “A preceding runner shall, in the umpire’s judgment, intentionally interfere with a fielder who is attempting to catch a thrown ball or to throw a ball in an attempt to complete any play;”

That rule is unambiguous and crystal clear. 1) Umpire, in his judgment, determines that runner intentionally interferes with the pivot man. 2) The batter must be called out.

By the way, the runner himself may or may not be out. This rule does not address that. There is a somewhat common misperception that the umpire calls both players out according to this rule. Another rule might require the umpire to call the runner also out on interference even if he arrived before the ball/fielder or the fielder missed the bag – but that’s another story.

Keep in mind that if you ask the umpire, “Excuse me, Mr. umpire, but in your judgment, did you think that the runner intentionally interfered with the fielder,” and his answer is, “Yes,” then he must call the batter out. There is no more judgment. The only judgment allowed in this rule is whether the runner intentionally interfered or not. If the rule had said, “The runner may be called out,” then there would be two levels of judgment, presumably. There are other rules which explicitly say the umpire may do certain things, in which case there is presumably some judgment that goes into whether he decides to do them or not. Sometimes those rules provide guidelines for that judgment (the may part) and sometimes they do not. Anyway, this rule does not provide that may judgment. If the umpire thinks it is intentional interference, the batter (not runner) is automatically out.

So clearly the umpire should have called the batter out on that play, unless he could say with a straight face, “In my judgment, I don’t think that Utley intentionally interfered with the fielder.” That is not a reasonable judgment of course. Not that there is much recourse for poor or even terrible judgment. Judgment calls are not reviewable, I don’t think. Perhaps umpires can get together and overturn a poor judgment call. I don’t know.

But that’s not the end of the story. There is a comment to this rule which reads:

“Rule 5.09(a)(13) Comment (Rule 6.05(m) Comment): The objective of this rule is to penalize the offensive team for deliberate, unwarranted, unsportsmanlike action by the runner in leaving the baseline for the obvious purpose of crashing the pivot man on a double play, rather than trying to reach the base. Obviously this is an umpire’s judgment play.”

Now that throws a monkey wrench into this situation. Apparently this is where the (I always thought it was an unwritten rule), “Runner must be so far away from the base that he cannot touch it in order for the ‘automatic double play’ to be called” rule came from. Only it’s not a rule. It is a comment which clearly adds a wrinkle to the rule.

The rule is unambiguous. If the runner interferes with the fielder trying to make the play (whether he would have completed the DP or not), then the batter is out. There is no mention of where the runner has to be or not be. The comment changes the rule. It adds another requirement (and another level of judgment). The runner must have been “outside the baseline” in the umpire’s judgment. In addition, it adds some vague requirements about the action of the runner. The original rule says only that the runner must “intentionally interfere” with the fielder. The comment adds words that require the runner’s actions to be more egregious – deliberate, unwarranted, and unsportsmanlike.

But the comment doesn’t really require that to be the case for the umpire to call the batter out. I don’t think. It says, “The objective of this rule is to penalize the offensive team….” I guess if the comment is meant to clarify the rule, MLB really doesn’t want the umpire to call the batter out unless the requirements in the comment are met (runner out of the baseline and his action was not only intentional but deliberate, unwarranted, and unsportsmanlike, a higher bar than just intentional).

Of course the rule doesn’t need clarification. It is crystal clear. If MLB wanted to make sure that the runner is outside of the baseline and acts more egregiously than just intentionally, then they should change the rule, right? Especially if comments are not binding, which I presume they are not.

Also, the comment starts off with: “The objective of this rule is to…”

Does that mean that this rule is only to be applied in double play situations? What if a fielder at second base fields a ball, starts to throw to first base to retire the batter, and the runner tackles him or steps in front of the ball? Is rule 5.09(a)(13) meant to apply? The comment says that the objective of the rule is to penalize the offensive team for trying to break up the double play. In this hypothetical, there is no double play being attempted. There has to be some rule that applies to this situation? If there isn’t, then MLB should not have written in the comment, “The objective of this rule….”

There is another rule which also appears to clearly apply to a take-out slide at second base, like Utley’s, with no added comments requiring that the runner be out of the baseline, or that his actions be unwarranted and unsportsmanlike. It is 6.01(6). Or 7.09(e) on the MLB web site. In fact, I tweeted this rule last night thinking that it addressed the Utley play 100% and that the runner and the batter should have been called out.

“If, in the judgment of the umpire, a base runner willfully and deliberately interferes with a batted ball or a fielder in the act of fielding a batted ball with the obvious intent to break up a double play, the ball is dead. The umpire shall call the runner out for interference and also call out the batter-runner because of the action of his teammate.”

The only problem there is the words, “interferes with a batted ball or a fielder in the act of fielding a batted ball.” A lawyer would say that the plain meaning of the words precludes this from applying to an attempt to interfere with a middle infielder tagging second base and throwing to first, because he is not fielding or attempting to field a batted ball and the runner is not interfering with a batted ball. The runner, in this case, is interfering with a thrown ball or a fielder attempting to tag second and then make a throw to first.

So if this rule is not meant to apply to a take-out slide at second, what is it meant to apply to? That would leave only one thing really. A ground ball is hit in the vicinity of the runner and he interferes with the ball or a fielder trying to field the ball. But there also must be, “an obvious intent to break up a double play.” That is curious wording. Would a reasonable person consider that an attempt to break up a double play? Perhaps “obvious intent to prevent a double play.” Using the words break up sure sounds like this rule is meant to apply to a runner trying to take out the pivot man on a potential double play. But then why write “fielding a batted ball” rather than “making a play or a throw?”

A good lawyer working for the Mets would try and make the case that “fielding a batted ball” includes everything that happens after someone actually “fields the batted ball,” including catching and throwing it. In order to do so, he would probably need to find that kind of definition somewhere else in the rule book. It is a stretch, but it is not unreasonable, I don’t think.

Finally, Eric Byrnes on MLB Tonight, had one of the more intelligent and reasonable comments regarding this play that I have ever heard from an ex-player. He said, and I paraphrase:

“Of course it was a dirty slide. But all players are taught to do whatever it takes to break up the DP, especially in a post-season game. Until umpires start calling an automatic double play on slides like that, aggressive players like Utley will continue to do that. I think we’ll see a change soon.”

P.S. For the record, since there was judgment involved, and judgment is supposed to represent fairness and common sense, I think that Utley should not have been ruled safe at second on appeal.


Perhaps comments are binding. From the foreword to the rules, on the MLB web site:

The Playing Rules Committee, at its December 1977 meeting, voted to incorporate the Notes/Case Book/Comments section directly into the Official Baseball Rules at the appropriate places. Basically, the Case Book interprets or elaborates on the basic rules and in essence have the same effect as rules when applied to particular sections for which they are intended.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits, which is to be expected as platoon splits in general get heavily regressed toward the mean, because they tend to be laden with noise for two reasons: One, the samples are rarely large because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, less than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his actual platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large) simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.

Soler seems to be the obvious choice,  so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrest has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrest 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrest at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrest against a right handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrest may or may not be your ordinary left-handed pitcher. What if Siegrest actually has reverse splits? What if we expect him to pitch better against right handed batters and worse against left-handed batters?  In that case, Coghlan might actually be the better choice than Soler even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand.  It would be exactly like Coghlan or Soler were facing a RHP. Or maybe Siegrest has no splits – i.e. RH and LH batters of equal overall talent perform about the same. Or very small platoon splits compared to the average left-hander? So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrest no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrest does indeed have very small splits either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrest has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrest is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher’s true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that’s what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against for a pitcher is a worthless stat in and of itself (what, walks don’t count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one season splits for a batter or pitcher are not very useful for the reasons I explained above.

If you look at Siegrest’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrest is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrest to pitch to Coghlan knowing that Siegrest has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrest is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrest’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrest has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrest’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different than the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 


Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up heads or tails in one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.
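If you want to check the 5% figure yourself, here is a minimal Python sketch. The function name `binomial_sd` is my own illustrative choice, not something from a library.

```python
import math

def binomial_sd(p: float, n: int) -> float:
    """Standard deviation of the observed proportion after n trials."""
    return math.sqrt(p * (1 - p) / n)

# A fair coin flipped 100 times, as in the text.
sd_100 = binomial_sd(0.5, 100)
print(f"SD of the heads percentage after 100 flips: {sd_100:.1%}")
```

Plug in a larger `n` and the SD shrinks with the square root of the number of flips.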

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers’ platoon stats and no matter what they are and over how many TBF, you conclude that the pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can’t know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let’s get back to the coins to see why that is. Let’s say that we don’t have 100% fair coins. Our sly kid put in his piggy bank a bunch of trick coins, but not really, really tricky. Most are still 50/50, but some are 48/52, 52/48, a few less are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of “true coin talent” but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that.  If your kid were smart enough to put in a normal distribution of “coin talent,” even one with a small spread, the further away from 50/50, the fewer coins there are.  Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60.  Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It’s called Bayesian inference, but you don’t need to worry about what it’s called or even how it technically works. It is true that if you flipped a coin and got 60/40 heads that that coin was much more likely to be a true 60/40 coin than it is to be a 50/50 coin. That should be obvious too.  But here’s the catch. There are many, many more 50/50 coins in your kid’s piggy bank than there are 60/40. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chance of a 50/50 coin coming up exactly 60/40 is around .01. The chance of a true 60/40 coin coming up 60/40 is 8 times that amount, or .08. But, if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then the chances of the coin that came up 60/40 being a fair coin or a 60/40 biased coin are only 50/50. If there are 800 times more 50/50 coins than 60/40 coins in your bank, as there are likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin even though the coin came up 60 heads in 100 flips.
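The arithmetic above can be sketched in a few lines of Python. This assumes, as the simplified example does, that the bank holds only two kinds of coins (fair and 60/40). The helper names `binom_pmf` and `prob_fair` are hypothetical.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips of a coin with heads rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

like_fair = binom_pmf(60, 100, 0.5)    # roughly .01
like_biased = binom_pmf(60, 100, 0.6)  # roughly .08

def prob_fair(fair_per_biased: float) -> float:
    """Chance the coin is fair after 60 heads in 100 flips, when the bank
    holds `fair_per_biased` fair coins for every 60/40 coin (assumed mix)."""
    w_fair = fair_per_biased * like_fair
    return w_fair / (w_fair + like_biased)

print(f"P(60 heads | fair)  = {like_fair:.4f}")
print(f"P(60 heads | 60/40) = {like_biased:.4f}")
for ratio in (8, 800):
    print(f"{ratio} fair coins per biased coin -> P(fair) = {prob_fair(ratio):.3f}")
```

With 8 fair coins per biased coin the answer lands near a coin flip, and with 800 it is overwhelmingly likely that the coin is fair, which is the point of the paragraph.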

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but 10,000 of them (1 in 100) will have a false positive. So the odds you have AIDS is 10,000 to 10 or 1000 to 1 against.
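Here is the same base-rate arithmetic as a short Python sketch, using only the numbers in the paragraph above (1 in 100,000 prevalence, a 1% false positive rate, and an assumed 0% false negative rate). The variable names are mine.

```python
population = 1_000_000
prevalence = 1 / 100_000      # from the text
false_positive_rate = 0.01    # from the text
false_negative_rate = 0.0     # the text's simplifying assumption

infected = population * prevalence
true_positives = infected * (1 - false_negative_rate)
false_positives = (population - infected) * false_positive_rate

# Of everyone who tests positive, what share is actually infected?
p_infected_given_positive = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"P(infected | positive) = {p_infected_given_positive:.3%}")
```

The positive test moves you from 1 in 100,000 to about 1 in 1,000, which is a big jump but still very unlikely.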

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its “true talent” may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as “an estimate of true talent” and is the crucial factor in making decisions based on what we think the talent of the coins/players are likely to be in the present and in the future.
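For the curious, here is a rough sketch of how such a weighted mean could be computed. I am assuming the piggy bank's mix of coins follows a normal curve centered on 50% with a standard deviation of 3%, evaluated on a 1%-step grid. That prior, and the function name `posterior_mean`, are illustrative assumptions, so the outputs will only be in the neighborhood of the numbers quoted above.

```python
from math import comb, exp

def posterior_mean(heads: int, flips: int,
                   prior_mean: float = 0.5, prior_sd: float = 0.03) -> float:
    """Weighted mean of the possible true heads rates, given the flips.

    Each candidate rate is weighted by (how common that coin is in the
    assumed bank) times (how likely it is to produce the observed result).
    """
    grid = [i / 100 for i in range(1, 100)]  # candidate true rates, 1% to 99%
    weights = []
    for p in grid:
        prior = exp(-0.5 * ((p - prior_mean) / prior_sd) ** 2)
        likelihood = comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)
        weights.append(prior * likelihood)
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total

for heads in (60, 55, 30):
    print(f"{heads} heads in 100 flips -> estimate {posterior_mean(heads, 100):.1%}")
```

Change `prior_sd` to see how much the answer depends on the spread of coin talent you assume.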

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. That result is about 24,000 times more likely with a true 60/40 coin than with a 50/50 coin! So now we are much more certain that we have a true 60/40 coin even if we don’t have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about half certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be 55%.
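The 24,000 figure is easy to verify. A minimal sketch, working in logs so the tiny probabilities do not underflow (the function name `likelihood_ratio` is my own):

```python
from math import exp, log

def likelihood_ratio(heads: int, flips: int, p_a: float, p_b: float) -> float:
    """How many times more likely the result is under rate p_a than under p_b."""
    tails = flips - heads
    log_ratio = heads * log(p_a / p_b) + tails * log((1 - p_a) / (1 - p_b))
    return exp(log_ratio)

lr_100 = likelihood_ratio(60, 100, 0.6, 0.5)
lr_500 = likelihood_ratio(300, 500, 0.6, 0.5)
print(f"60/40 in 100 flips: {lr_100:.1f} times more likely with a 60/40 coin")
print(f"60/40 in 500 flips: {lr_500:,.0f} times more likely with a 60/40 coin")
```

The ratio is the 100-flip ratio raised to the fifth power, which is why five times the flips makes such an enormous difference.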

There is a much easier way to do it. We have to do some math gyrations which I won’t go into that will enable us to figure out how much to regress our observed flip percentage to the mean flip percentage of all the coins, 50%. For 100 flips it was a large regression such that with a 60/40 result we might estimate a true flip talent of 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less towards 50% to give us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inferences using all the possible true talent states assumes a normal distribution or close to one.
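A sketch of that shortcut, assuming the regression amount is the ratio of random (binomial) variance to total variance and the spread of coin talent has an SD of 3%. The function names are illustrative. The outputs come out close to, though not exactly at, the 52% and 55% mentioned above.

```python
def regression_fraction(n: int, talent_sd: float, p: float = 0.5) -> float:
    """Share of the distance to the mean that we give back, for n flips."""
    random_var = p * (1 - p) / n      # binomial noise in the observed rate
    talent_var = talent_sd ** 2       # real differences between coins
    return random_var / (random_var + talent_var)

def estimate(observed: float, n: int, talent_sd: float, mean: float = 0.5) -> float:
    """Move the observed rate toward the mean by the regression fraction."""
    r = regression_fraction(n, talent_sd, mean)
    return observed + r * (mean - observed)

for n in (100, 500):
    r = regression_fraction(n, 0.03)
    print(f"{n} flips: regress {r:.0%}, estimate {estimate(0.60, n, 0.03):.1%}")
```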

The point is that the sample size of the observed measurement determines how much we regress the observed amount towards the mean. The larger the sample, the less we regress. One season observed splits and we regress a lot. Career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe a SD of 10 percent so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at true 60/40 and even some coins at 70/30 and 80/20. The chances that we have a truly biased coin when we get an unusual result is much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.
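To see both rules side by side, here is a small sketch that prints the regression amount for a few spreads of coin talent and a few numbers of flips. It assumes the regression is random variance divided by total variance. The particular spreads (0%, 3%, 10%) are the ones used in the coin examples, and the function name is my own.

```python
def regression_fraction(n: int, talent_sd: float, p: float = 0.5) -> float:
    """Random variance as a share of total variance for n flips."""
    random_var = p * (1 - p) / n
    return random_var / (random_var + talent_sd ** 2)

spreads = (0.0, 0.03, 0.10)   # SD of true coin talent
flips = (100, 500, 5000)      # number of trials

print("talent SD " + "".join(f"{n:>8}" for n in flips))
for sd in spreads:
    row = "".join(f"{regression_fraction(n, sd):>8.0%}" for n in flips)
    print(f"{sd:>9.0%} " + row)
```

Reading across a row shows the first rule (more trials, less regression), and reading down a column shows the second (more spread, less regression). The top row is all 100% because with no spread there is nothing to learn from the flips.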

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, that it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrest (I almost forgot what started this whole thing) his career TBF is 604, but that’s not his sample size for platoon splits because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and righties. If you don’t know what that means don’t worry about it. A shortcut is to use the lesser of the two which is almost always TBF versus lefties, or in Siegrest’s case, 231. That’s not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrest was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it’s not), and two, he has a really odd observed split of 47 points in reverse. That’s like flipping a coin 100 times and getting 70 heads and 30 tails or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.
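For those who do want to see the harmonic mean, here is a quick sketch with Python's standard library. The 604 and 231 come from the text. I am assuming the remaining 373 TBF were all against right-handed batters.

```python
from statistics import harmonic_mean

total_tbf = 604               # career TBF, from the text
vs_lhb = 231                  # TBF versus lefties, from the text
vs_rhb = total_tbf - vs_lhb   # assumed: everyone else batted right-handed

effective_n = harmonic_mean([vs_lhb, vs_rhb])
print(f"TBF v. RHB: {vs_rhb}")
print(f"harmonic mean of the two samples: {effective_n:.0f}")
print(f"shortcut (the smaller sample):    {min(vs_lhb, vs_rhb)}")
```

The harmonic mean always sits closer to the smaller of the two numbers than the ordinary average does, which is why the shortcut is a reasonable, slightly conservative stand-in.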

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, are the only things that determine how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent are the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrest’s true platoon split if  the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.
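In code, the whole operation is one line. A minimal sketch (the function name `regress_toward_mean` is hypothetical, and the 50% is only the placeholder regression used in the example above, not the correct amount):

```python
def regress_toward_mean(observed: float, mean: float, regression: float) -> float:
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

LHP_MEAN_SPLIT = 29  # average LHP platoon split in points of wOBA, from the text

for observed in (-47, -10):
    est = regress_toward_mean(observed, LHP_MEAN_SPLIT, 0.50)
    print(f"observed {observed:+d} -> estimate {est:+.1f}")
```

A regression of 1.0 returns the league mean no matter what was observed, and a regression of 0.0 returns the observed split untouched.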

How do we determine the spread of talent in a population, like platoon talent? That is actually easy but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season and the 300 is the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn’t matter. Nor does the number of opportunities. They don’t even have to be the same for all pitchers. It is just easier to explain that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial variance. .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that the player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore that. They really are small. We can of course estimate that too and throw it into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can now calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember we need only 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: spread of talent and sample size of the observed results. That gives us the regression amount. From that we merely move the observed result toward the mean by that amount, like I did above with Siegrest’s -47 points and the mean of +29 for a league-average LHP.
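Here is the BA example worked through as a short sketch. The two variances are the made-up numbers from the example above, taken as given, and the variable names are mine.

```python
import math

observed_var = 0.00125   # made-up observed variance in BA, from the text
random_var = 0.000625    # variance expected by chance at 400 AB, from the text

# observed variance = random variance + talent variance
talent_var = observed_var - random_var
talent_sd = math.sqrt(talent_var)

# Regression amount = random variance as a share of total variance.
regression = random_var / observed_var

print(f"talent SD: {talent_sd:.3f} ({talent_sd * 1000:.0f} points of BA)")
print(f"regression at this sample size: {regression:.0%}")
```

When the observed variance is exactly twice the chance variance, noise and talent contribute equally, so we regress half way.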

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or an intra-class correlation (ICC) which is essentially the same thing but it correlates every PA (or whatever the opportunity is) to every other PA within a sample.  When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
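A sketch of that "little more math." It assumes the usual shortcut that the correlation at sample size n behaves like n / (n + k), where k is a constant you back out from one known correlation. That form is my assumption here, not something spelled out above, and the function names are hypothetical.

```python
def regression_at(n: float, r_known: float, n_known: float) -> float:
    """Regression fraction at sample size n, given a correlation r_known
    measured at sample size n_known. Assumes r = n / (n + k)."""
    k = n_known * (1 - r_known) / r_known   # sample size at which r would be .5
    return k / (n + k)

def true_talent(observed: float, mean: float, n: float,
                r_known: float, n_known: float) -> float:
    """Regress the observed value toward the mean for a sample of size n."""
    keep = 1 - regression_at(n, r_known, n_known)
    return mean + (observed - mean) * keep

# The example from the text: .300 in 500 AB, league average .250, r = .4 at 500 AB.
print(f"{true_talent(0.300, 0.250, 500, 0.4, 500):.3f}")

# Same hitter, same r, but three times the sample.
print(f"regression at 1500 AB: {regression_at(1500, 0.4, 500):.0%}")
print(f"{true_talent(0.300, 0.250, 1500, 0.4, 500):.3f}")
```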

Using method #1 tells us precisely what the spread in talent is. Method 2 tells us that implicitly by looking at the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.


Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, AB, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample, the higher the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

This is for an average of 200 TBF versus a LHP, so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrist’s -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrist’s true platoon split. The fact that -47 points is very far from the average left-handed pitcher’s +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we “trust it more.” If the regression were 99% then whether the observed were -47 or +10, we would arrive at nearly the same answer. Don’t confuse the regression with the observed result. One has nothing to do with the other. And don’t think in terms of “trusting” the observed result or not. Regress the result and that’s your answer. If you arrive at answer X it makes no difference whether your starting point, the observed result, was B or C. None whatsoever. That is a very important point. I don’t know how many times I have heard, “But he had a 47 point reverse split in his entire career! You can’t possibly be saying that you estimate his real split to be +10 or +12 or whatever it is.” Yes, that’s exactly what I’m saying. A +10 estimated split is exactly the same whether the observed split was -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrist is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember the +10 is a weighted mean which means that it is in the middle of our error bars. The answer to that is, “No, no and no!” Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution rarely should come into play. Basically the noise in the result of a sample of 1 is so large that it doesn’t matter at all what the uncertainty level of your estimates is.

So what do we estimate Siegrist’s true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed splits 1 minus 230/1255, or .82 (82%) towards the league average for lefty pitchers, which is around 29 points for a LHP. 82% of 76 points is 62 points. So we regress his -47 points 62 points in the plus direction which gives us an estimate of +15 points in true platoon split. That is half the split of an average LHP, but it is plus nonetheless.

That means that a left-handed hitter like Coghlan will hit better than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrist’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrist as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrist is small.

We also have another method for better estimating true platoon splits for pitchers, which can be used to enhance the method based on sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It’s not like pitchers deliberately throw better or harder to one side or the other or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things: One is arm angle. If you’ve ever played or watched baseball that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrist? His arm angle is estimated by Jared Cross of Steamer, using PITCHf/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrist’s 2015 numbers into that regression equation, we get an estimated platoon split from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he has thrown around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But, the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.
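For anyone who wants to check the arithmetic, here is a short Python sketch of the two regressions worked through above: first toward the average lefty (+29, regressing 1 minus 230/1255), then toward the repertoire-and-arm-angle mean (+14, regressing roughly 92%). The function name is mine; the numbers are the ones quoted in the text.

```python
def regress_split(observed, mean, regression):
    """Move an observed platoon split toward a population mean.

    regression is the fraction of the gap to close (0 = trust the sample, 1 = use the mean).
    """
    return observed + regression * (mean - observed)

observed = -47  # observed reverse split, in points

# Regress toward the average LHP
vs_average_lefty = regress_split(observed, mean=29, regression=1 - 230 / 1255)

# Regress toward the narrower population of similar pitchers;
# the 92% is the approximate figure given in the text
vs_similar_pitchers = regress_split(observed, mean=14, regression=0.92)

print(round(vs_average_lefty), round(vs_similar_pitchers))  # 15 9
```

Note that the starting point is the same in both cases. Only the mean and the regression amount change.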

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn’t want to piss off Lester, his teammates, and possibly the fan base. Who knows?

Many people don’t realize that one of the (many) weaknesses of UZR, at least for the infield, is that it ignores any ground ball in which the infield was configured in some kind of a “shift” and it “influenced the play.” I believe that’s true of DRS as well.

What exactly constitutes “a shift” and how they determine whether or not it “influenced the play” I unfortunately don’t know. It’s up to the “stringers” (the people who watch the plays and input and categorize the data) and the powers that be at Baseball Info Solutions (BIS). When I get the data, there is merely a code, “1” or “0”, for whether there was a “relevant shift” or not.

How many GB are excluded from the UZR data? It varies by team, but in 2015 so far, about 21% of all GB are classified by BIS as “hit into a relevant shift.” The average team has had 332 shifts in which a GB was ignored by UZR (and presumably DRS) and 1268 GB that were included in the data that the UZR engine uses to calculate individual UZR’s. The number of shifts varies considerably from team to team, with the Nationals, somewhat surprisingly, employing the fewest, with only 181, and the Astros with a whopping 682 so far this season. Remember these are not the total number of PA in which the infield is in a shifted configuration. These are the number of ground balls in which the infield was shifted and the outcome was “relevant to the shift,” according to BIS. Presumably, the numbers track pretty well with the overall number of times that each team employs some kind of a shift. It appears that Washington disdains the shift, relatively speaking, and that Houston loves it.

In 2014, there were many fewer shifts than in this current season. Only 11% of ground balls involved a relevant shift, about half the 2015 rate. The trailer was the Rockies, with only 92, and the leader was the Astros, with 666. The Nationals last year had the 4th fewest in baseball.

Here is the complete data set for 2014 and 2015 (as of August 30):


2014
Team  GB  Shifted  Not shifted  % Shifted
ari 2060 155 1905 8
atl 1887 115 1772 6
chn 1958 162 1796 8
cin 1938 125 1813 6
col 2239 92 2147 4
hou 2113 666 1447 32
lan 2056 129 1927 6
mil 2046 274 1772 13
nyn 2015 102 1913 5
phi 2105 177 1928 8
pit 2239 375 1864 17
sdn 1957 133 1824 7
sln 2002 193 1809 10
sfn 2007 194 1813 10
was 1985 116 1869 6
mia 2176 125 2051 6
ala 1817 170 1647 9
bal 1969 318 1651 16
bos 1998 247 1751 12
cha 2101 288 1813 14
cle 2003 265 1738 13
det 1995 122 1873 6
kca 1948 274 1674 14
min 2011 235 1776 12
nya 1902 394 1508 21
oak 1980 244 1736 12
sea 1910 201 1709 11
tba 1724 376 1348 22
tex 1811 203 1608 11
tor 1919 328 1591 17



2015 (through August 30)
Team  GB  Shifted  Not shifted  % Shifted
ari 1709 355 1354 21
atl 1543 207 1336 13
chn 1553 239 1314 15
cin 1584 271 1313 17
col 1741 533 1208 31
hou 1667 682 985 41
lan 1630 220 1410 13
mil 1603 268 1335 17
nyn 1610 203 1407 13
phi 1673 237 1436 14
pit 1797 577 1220 32
sdn 1608 320 1288 20
sln 1680 266 1414 16
sfn 1610 333 1277 21
was 1530 181 1349 12
mia 1591 229 1362 14
ala 1493 244 1249 16
bal 1554 383 1171 25
bos 1616 273 1343 17
cha 1585 230 1355 15
cle 1445 335 1110 23
det 1576 349 1227 22
kca 1491 295 1196 20
min 1655 388 1267 23
nya 1619 478 1141 30
oak 1599 361 1238 23
sea 1663 229 1434 14
tba 1422 564 858 40
tex 1603 297 1306 19
tor 1539 398 1141 26
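
The percentage column in the two tables above is just relevant-shift ground balls divided by all ground balls. A small Python sketch, using a few rows copied from the tables plus the 2015 per-team averages quoted earlier (332 shifted, 1268 not shifted), shows the calculation. The variable and function names are mine.

```python
# (team, total ground balls, relevant-shift ground balls) copied from the tables above
rows_2014 = [("hou", 2113, 666), ("col", 2239, 92), ("was", 1985, 116)]
rows_2015 = [("hou", 1667, 682), ("col", 1741, 533), ("was", 1530, 181)]

def shift_pct(gb, shifted):
    """Percentage of ground balls classified as hit into a relevant shift."""
    return 100 * shifted / gb

for year, rows in (("2014", rows_2014), ("2015", rows_2015)):
    for team, gb, shifted in rows:
        print(year, team, f"{shift_pct(gb, shifted):.0f}%")

# League-wide 2015 rate from the average team: 332 shifted + 1268 not shifted
league_2015 = shift_pct(332 + 1268, 332)
print(f"2015 league: {league_2015:.0f}%")
```

The Rockies row makes the year-to-year change obvious: about 4% in 2014 and about 31% in 2015.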


The individual fielding data (UZR) for infielders that you see on Fangraphs is based on non-shifted ground balls only, or on ground balls where there was a shift but it wasn’t relevant to the outcome. The reason that shifts are ignored in UZR (and DRS, I think) is because we don’t know where the individual fielders are located. It could be a full shift, a partial shift, the third baseman could be the left-most fielder as he usually is or he could be the man in short right field between the first baseman and the second baseman, etc. The way most of the PBP defensive metrics work, it would be useless to include this data.

But what we can do, with impunity, is to include all ground ball data in a team UZR. After all, if a hard ground ball is hit at the 23 degree vector, and we are only interested in team fielding, we don’t care who is the closest fielder or where he is playing. All we care about is whether the ball was turned into an out, relative to the league average out rate for a similar ground ball in a similar context, or adjusted for context. In other words, using the same UZR methodology, we can calculate a team UZR using all ground ball data, with no regard for the configuration of the IF on any particular play. And if it is true that the type, number and timing (for example, against which batters and/or with which pitchers) of shifts is relevant to a team’s overall defensive efficiency, team UZR in the infield should reflect not only the sum of individual fielding talent and performance, but also the quality of the shift in terms of hit prevention. In addition, if we subtract the sum of the individual infielders’ UZR on non-shift plays from the total team UZR on all plays, the difference should reflect, at least somewhat, the quality of the shifts.

I want to remind everyone that UZR accounts for several contexts. One, park factors. For infield purposes, although the dimensions of all infields are the same, the hardness and quality of the infield can differ from park to park. For example, in Coors Field in Colorado and Chase Field in Arizona, the infields are hard and quick, and thus more ground balls scoot through for hits even if they leave the bat with the same speed and trajectory.

Two, the speed of the batter. Obviously faster batters require the infielders to play a little closer to home plate and they beat out infield ground balls more often than slower batters. In some cases the third baseman and/or first baseman have to play in to protect against the bunt. This affects the average “caught percentage” for any given category of ground balls. The speed of the opposing batters tends to even out for fielders and especially for teams, but still, the UZR engine tries to account for this just in case it doesn’t, especially in small samples.

The third context is the position of the base runners and number of outs. This affects the positioning of the fielders, especially the first baseman (whether first base is occupied or not). The handedness of the batters is the next context. As with batter speed, these also tend to even out in the long run, but it is better to adjust for them just in case.

Finally, the overall GB propensity of the pitchers is used to adjust the average catch rates for all ground balls. The more GB oriented a pitcher is, the softer his ground balls are. While all ground balls are classified in the data as soft, medium, or hard, even within each category, the speed and consequently the catch rates, vary according to the GB tendencies of the pitcher. For example, for GB pitchers, their medium ground balls will be caught at a higher rate than the medium ground balls allowed by fly ball pitchers.

So keep in mind that individual and team UZR adjust as best they can for these contexts. In most cases, there is not a whole lot of difference between the context adjusted UZR numbers and the unadjusted ones. Also keep in mind that the team UZR numbers you see in this article are adjusted for park, batter hand and speed, and runners/outs, the same as the individual UZR’s you see on Fangraphs.

For this article, I am interested in team UZR including when the IF is shifted. Even though we are typically interested in individual defensive performance and talent, it is becoming more and more difficult to evaluate individual fielding for infielders, because of the prevalence of the shift, and because there is so much disparity in how often each team employs the shift (so that we might be getting a sample of only 60% of the available ground balls for one team and 85% for another).

One could speculate that teams that employ the most shifts would have the best team defense. To test that, we could look at each team’s UZR versus their relevant shift percentage. The problem, of course, is that the talent of the individual fielders is a huge component of team UZR, regardless of how often a team shifts. There may also be selective sampling going on. Maybe teams that don’t have good infield defense feel the need to shift more often such that lots of shifts get unfairly correlated with (but are not the cause of) bad defense.

One way we can separate out talent from shifting is to compare team UZR on all ground balls with the total of the individual UZR’s for all the infielders (on non-shifted ground balls). The difference may tell us something about the efficacy of the shifts and non-shifts. In other words, total team individual infield UZR, which is just the total of each infielder’s UZR as you would see on Fangraphs (range and ROE runs only), represents what we generally consider to be a sample of team talent. This is measured on non-shifted ground balls only, as explained above.

Team total UZR, which measures team runs saved or cost, with no regard for who caught each ball or not, and is based on every batted ball, shifted or not, represents how the team actually performed on defense and is a much better measure of team defense than totaling the individual UZR’s. The difference, then, to some degree, represents how efficient teams are at shifting or not shifting, regardless of how often they shift.

There are lots of issues that would have to be considered when evaluating whether shifts work or not. For example, maybe shifting too much with runners on base results in fewer DP because infielders are often out of position. Maybe stolen bases are affected for the same reason. Maybe the number and quality of hits to the outfield change as a result of the shift. For example, if a team shifts a lot, maybe they don’t appear to record more ground ball outs, but the shifted batters are forced to try and hit the ball to the opposite field more often and thus they lose some of their power.

Maybe it appears that more ground balls are caught, but because pitchers are pitching “to the shift” they become more predictable and batters are actually more successful overall (despite their ground balls being caught more often). Maybe shifts are spectacularly successful against some stubborn and pull-happy batters and not very successful against others who can adjust or even take advantage of a shift in order to produce more, not less, offense. Those issues are beyond the scope of UZR and this article.

Let’s now look at each team in 2014 and 2015, their shift percentage, their overall team UZR, their team UZR when shifting, when not shifting, and their total individual combined UZR when not shifting. Remember this is for the infield only.


2015
Team  % Shifts  Shift Runs  Non-Shift Runs  Team Runs  Total Individual Runs  Team Minus Individual (after prorating individual runs to 100% of plays)
KCA 20 -2.2 10.5 10 26.3 -19.6
LAN 13 -5 -7.3 -13.3 0.8 -14.2
TOR 26 -2.5 13.9 11 22.6 -15.6
CHA 15 -7.7 -12.3 -21.8 -11.9 -8.9
CLE 23 0.6 3.3 3.3 12.8 -11.4
MIN 23 3.5 -11.6 -7.6 1.8 -9.7
MIL 17 0.3 -7.1 -6.7 2.5 -9.5
SEA 14 -2.6 -8.7 -13.8 -5.1 -8.3
SFN 21 2.3 12.6 15.8 24.4 -11.8
MIA 14 0.5 2.7 2.4 8.4 -6.7
ARI 21 3.4 -1.5 2.1 8 -7.0
HOU 41 -7.6 -3.2 -11.3 -6.1 -3.1
PHI 14 -6.4 -16.4 -23.5 -19 -3.0
COL 31 -7.3 0 -5.5 -1.5 -3.7
ATL 13 3.1 6.9 9.8 12.6 -3.7
SLN 16 -1.1 -5.8 -8.8 -7 -1.1
DET 22 1.8 -16.2 -17.8 -16 0.5
ALA 16 -2.4 -0.4 -3.6 -2.8 -0.5
BOS 17 0.3 4.8 3.5 2.7 0.5
NYN 13 -3.8 3.1 0.8 -2.7 3.7
WAS 12 1.1 -9.4 -8.4 -12.6 5.1
CIN 17 5 9.8 16.2 11.2 3.9
CHN 15 0.2 18.7 17.4 10.5 6.0
BAL 25 10.6 -0.5 14.4 5.8 7.6
SDN 20 7.5 -6.8 1.5 -7.8 10.3
TEX 19 4.1 12.8 19.6 10.1 8.3
TBA 40 0.1 4.5 7 -9.2 19.3
NYA 30 0.1 11.8 12.2 -6.6 20.2
PIT 32 0.3 0.3 0.1 -21 26.0
OAK 23 3.9 -8.8 -5 -31.4 31.1


The last column, as I explained above, represents the difference between how we think the infield defense performed based on individual UZR’s only (on non-shifted GB), prorated to 100% of the games (the proration is actually regressed so that we don’t have the “on pace for” problem), and how the team actually performed on all ground balls. If the difference is positive, then we might assume that the shifts and non-shifts are being done in an effective fashion regardless of how often shifts are occurring. If it is negative, then somehow the combination of shifts and non-shifts is costing some runs. Or the difference might not be meaningful at all – it could just be noise. At the very least, this is the first time that you are seeing real infield team defense being measured based on the characteristics of each and every ground ball and the context in which it was hit, regardless of where the infielders are playing.

First of all, if we look at all the teams that have a negative difference in the last column, the teams that presumably have the worst shift/no-shift efficiency, and compare them to those that are plus and presumably have the best shift/no-shift efficiency, we find that there is no difference in their average shift percentages. For example, TBA and HOU have the most shifts by far, and HOU “cost” its team 5.2 runs while TBA benefited by 16.2 runs. LAN and WAS had the fewest shifts and one of them gained 4 runs and the other lost 14 runs. The other teams are all over the board with respect to number of shifts and the difference between the individual UZR’s and team UZR.

Let’s look at that last column for 2014 and compare it to 2015 to see if there is any consistency from year to year within teams. Do some teams consistently do better or worse with their shifting and non-shifting, at least for 2014 and 2015? Let’s also see if adding more data gives us any relationship between the last column (delta team and individual UZR) and shift percentage.

Team  2015 % Shift  2014 % Shift  2015 Team Minus Individual  2014 Team Minus Individual  Combined 2014 and 2015 Team Minus Individual
HOU 41 32 -5.2 45.6 40.4
TBA 40 22 16.2 12.7 28.9
PIT 32 17 21.1 5.5 26.6
TEX 19 11 9.5 9.9 19.4
WAS 12 6 4.2 13.0 17.2
OAK 23 12 26.4 -9.3 17.1
BAL 25 16 8.6 7.6 16.2
NYN 13 5 3.5 9.0 12.5
NYA 30 21 18.8 -8.4 10.4
CHA 15 14 -9.9 12.5 2.6
CHN 15 8 6.9 -5.8 1.1
TOR 26 17 -11.6 12.6 1.0
DET 22 6 -1.8 2.4 0.6
SFN 21 10 -8.6 6.0 -2.6
CIN 17 6 5 -8.2 -3.2
CLE 23 13 -9.5 5.2 -4.3
MIL 17 13 -9.2 3.1 -6.1
ARI 21 8 -5.9 -0.2 -6.1
SDN 20 7 9.3 -15.7 -6.4
MIA 14 6 -6 -0.9 -6.9
BOS 17 12 0.8 -10.6 -9.8
KCA 20 14 -16.3 6.3 -10.0
ATL 13 6 -2.8 -7.5 -10.3
PHI 14 8 -4.5 -6.2 -10.7
ALA 16 9 -0.8 -11.6 -12.4
SLN 16 10 -1.8 -12.2 -14.0
LAN 13 6 -14.1 -2.5 -16.6
MIN 23 12 -9.4 -9.3 -18.7
SEA 14 11 -8.7 -11.3 -20.0
COL 31 4 -4 -23.0 -27.0


Although there appears to be little correlation from one year to the next for each of the teams, we do find that the teams that had the least efficient shifts/non-shifts (negative values in the last column) averaged 14% shifts per season in 2014 and 2015. Those that had the most effective (plus values in the last column) shifted an average of 19% in 2014 and 2015. As well, the two teams with the biggest gains, HOU and TB, had the most shifts, at 37% and 31% per season, respectively. The two worst teams, Colorado and Seattle, shifted 17% and 13% per season. On the other hand, the team with the fewest shifts in baseball in 2014 and 2015 combined, the Nationals, gained over 17 runs in team UZR on all ground balls compared to a total of the individual UZR’s on non-shifted balls only, suggesting that the few shifts they employed were very effective, which seems reasonable.
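The 19% and 14% group averages can be reproduced directly from the chart above. The sketch below copies each team's 2015 and 2014 shift percentages and the combined last column, splits the teams by the sign of that column, and takes a simple unweighted average of the per-season percentages. Treating every team-season equally is my assumption about how the averages were computed.

```python
# (team, 2015 shift %, 2014 shift %, combined 2014-15 team-minus-individual runs)
data = [
    ("HOU", 41, 32, 40.4), ("TBA", 40, 22, 28.9), ("PIT", 32, 17, 26.6),
    ("TEX", 19, 11, 19.4), ("WAS", 12, 6, 17.2), ("OAK", 23, 12, 17.1),
    ("BAL", 25, 16, 16.2), ("NYN", 13, 5, 12.5), ("NYA", 30, 21, 10.4),
    ("CHA", 15, 14, 2.6), ("CHN", 15, 8, 1.1), ("TOR", 26, 17, 1.0),
    ("DET", 22, 6, 0.6), ("SFN", 21, 10, -2.6), ("CIN", 17, 6, -3.2),
    ("CLE", 23, 13, -4.3), ("MIL", 17, 13, -6.1), ("ARI", 21, 8, -6.1),
    ("SDN", 20, 7, -6.4), ("MIA", 14, 6, -6.9), ("BOS", 17, 12, -9.8),
    ("KCA", 20, 14, -10.0), ("ATL", 13, 6, -10.3), ("PHI", 14, 8, -10.7),
    ("ALA", 16, 9, -12.4), ("SLN", 16, 10, -14.0), ("LAN", 13, 6, -16.6),
    ("MIN", 23, 12, -18.7), ("SEA", 14, 11, -20.0), ("COL", 31, 4, -27.0),
]

def avg_shift_pct(rows):
    """Unweighted mean of the per-season shift percentages for a group of teams."""
    pcts = [pct for _, pct15, pct14, _ in rows for pct in (pct15, pct14)]
    return sum(pcts) / len(pcts)

plus = [row for row in data if row[3] > 0]    # team UZR beat the individual total
minus = [row for row in data if row[3] < 0]   # team UZR fell short

print(len(plus), round(avg_shift_pct(plus), 1))    # 13 19.2
print(len(minus), round(avg_shift_pct(minus), 1))  # 17 13.7
```

So 13 teams are on the plus side and 17 on the minus side, and the gap in average shift rate is a little over 5 percentage points.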

It is also interesting to note that the team that had the worst difference in team and individual UZR in 2014, the Rockies, only shifted 4% of the time, easily the lowest rate in baseball. In 2015, they have been one of the teams that shifts the most and still their team UZR is 4 runs worse than their total individual UZR’s. Still, that’s a lot better than in 2014.

It also appears that many of the smarter teams are figuring out how to optimize their defense beyond the talent of the individual players. TB, PIT, HOU, WAS, and OAK are at the top of the list in plus value deltas (the last column). These teams are generally considered to have progressive front offices. Some of the teams with the most negative numbers in the last column, those teams which appear to be sub-optimal in their defensive alignment, are LAN, MIN, SEA, PHI, COL, ATL, SLN, and ALA, all with reputations for having less than progressive front offices and philosophies, to one degree or another. In fact, other than a few outliers, like Boston, Texas, and the White Sox, the order of the teams in the chart above looks like a reasonable order of teams from most to least progressive teams. Certainly the teams in the top half appear to be the most saber-savvy teams and those in bottom half, the least.

In conclusion, it is hard to look at this data and figure out whether and which teams are using their shifts and non-shifts effectively. There doesn’t appear to be a strong correlation between shift percentage and the difference between team and individual defense although there are a few anecdotes that suggest otherwise. As well, in the aggregate for 2014 and 2015 combined, teams that have been able to outperform on team defense the total of their individual UZR’s have shifted more often, 19% to 14%.

There also appears to the naked eye to be a strong correlation between the perceived sabermetric orientation of a team’s front office and the efficiency of their shift/non-shift strategy, at least as measured by the numbers in the last column, explained above.

I think the most important thing to take away from this discussion is that there can be a fairly large difference between team infield UZR which uses every GB, and the total of the individual UZR’s which uses only those plays in which no shift was relevant to the outcome of the play. As well, the more shifts employed by a team, the less we should trust that the total of the individual performances is representative of the entire team’s defense on the infield. I am also going to see if Fangraphs can start publishing team UZR for infielders and for outfielders, although in the outfield, the numbers should be similar if not the same.

Recently there has been some discussion about the use of WAR in determining or at least discussing an MVP candidate for position players (pitchers are eligible too for MVP, obviously, and WAR includes defense and base running, but I am restricting my argument to position players and offensive WAR). Judging from the comments and questions coming my way, many people don’t understand exactly what WAR measures, how it is constructed, and what it can or should be used for.

In a nutshell, offensive WAR takes each of a player’s offensive events in a vacuum, without regard to the timing and context of the event or whether that event actually produced or contributed to any runs or wins, and assigns a run value to it, based on the theoretical run value of that event (linear weights), adds up all the run values, converts them to theoretical “wins” by dividing by some number around 10, and then subtracts the approximate runs/wins that a replacement player would have in that many PA. A replacement player produces around 20 runs less than average for every 650 PA, by definition. This can vary a little by defensive position and by era. And of course a replacement player is defined as the talent/value of a player who can be signed for the league minimum even if he is not protected (a so-called “freely available player”).

For example, let’s say that a player had 20 singles, 5 doubles, 1 triple, 4 HR, 10 non-intentional BB+HP, and 60 outs in 100 PA. The approximate run values for these events are .47, .78, 1.04, 1.40, .31, and -.30. These values are marginal run values and by definition are above or below a league average position player. So, for example, if a player steps up to the plate and gets a single, on the average he will generate .47 more runs than 1 generic PA of a league average player. These run values and the zero run value of a PA for a league average player assume the player bats in a random slot in the lineup, on a league average team, in a league average park, against a league-average opponent, etc.

If you were to add up all those run values for our hypothetical player, you would get +5 runs. That means that theoretically this player would produce 5 more runs than a league-average player on a league average team, etc. A replacement player would generate around 3 fewer runs than a league average player in 100 PA (remember I said that replacement level was around -20 runs per 650 PA), so our hypothetical player is 8 runs above replacement in those 100 PA.

The key here is that these are hypothetical runs. If that player produced those offensive events while in a league average context an infinite number of times he would produce exactly 5 runs more than an average player would produce in 100 PA and his team would win around .5 more games (per 100 PA) than an average player and .8 more games (and 8 runs) than a replacement player.
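The whole calculation for that hypothetical player fits in a few lines of Python. This is only a sketch of the arithmetic described above, not any site's actual WAR implementation. One assumption to flag: the out is valued at -.30 runs here, which is the value that reproduces the +5 total quoted in the text.

```python
# Event counts for the hypothetical 100 PA line
counts = {"1B": 20, "2B": 5, "3B": 1, "HR": 4, "BB+HP": 10, "out": 60}

# Approximate linear weights, in runs relative to an average PA.
# Assumption: out = -0.30, chosen because it reproduces the +5 total in the text.
weights = {"1B": 0.47, "2B": 0.78, "3B": 1.04, "HR": 1.40, "BB+HP": 0.31, "out": -0.30}

RUNS_PER_WIN = 10            # "some number around 10"
REPL_RUNS_PER_650_PA = -20   # replacement level relative to average

pa = sum(counts.values())
runs_above_avg = sum(counts[event] * weights[event] for event in counts)
repl_runs = REPL_RUNS_PER_650_PA * pa / 650   # about -3 runs in 100 PA
runs_above_repl = runs_above_avg - repl_runs

print(f"{runs_above_avg:+.1f} runs vs. average, "
      f"{runs_above_repl:+.1f} runs vs. replacement, "
      f"{runs_above_repl / RUNS_PER_WIN:.1f} WAR")
```

Nothing in that calculation knows the score, the inning, or the base-out state when each event happened.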

In reality, for those 100 PA, we have no idea how many runs or wins our player contributed to. On the average, or after an infinite number of 100 PA trials, his results would have produced an extra 5 runs and 1/2 win, but in one 100 PA trial, that exact result is unlikely, just like in 100 flips of a coin, exactly 50 heads and tails is an unlikely though “mean” or “average” event. Perhaps 15 of those 20 singles didn’t result in a single run being produced. Perhaps all 4 of his HR were hit after his team was down by 5 or 10 runs and they were meaningless. On the other hand, maybe 10 of those hits were game winning hits in the 9th inning. Similarly, of those 60 outs, what if 10 times there was a runner on third and 0 or 1 out, and our player struck out every single time? Alternatively, what if he drove in the runner 8 out of 10 times with an out, and half the time that run amounted to the game winning run? WAR would value those 10 outs exactly the same in either case.

You see where I’m going here? Context is ignored in WAR (for a good reason, which I’ll get to in a minute), yet context is everything in an MVP discussion. Let me repeat that: Context is everything in an MVP discussion. An MVP is about the “hero” nature of a player’s seasonal performance. How much did he contribute to his team’s wins and to a lesser extent, what did those wins mean or produce (hence, the “must be on a contending team” argument). Few rational people are going to consider a player MVP-quality if little of his performance contributed to runs and wins no matter how “good” that performance was in a vacuum. No one is going to remember a 4 walk game when a team loses in a 10-1 blowout. 25 HR with most of them occurring in losing games, likely through no fault of the player? Ho-hum. 20 HR, where 10 of them were in the latter stages of a close game and directly led to 8 wins? Now we’re talking possible MVP! .250 wOBA in clutch situations but .350 overall? Choker and bum, hardly an MVP.

I hope you are getting the picture. While there are probably several reasonable ways to define an MVP and reasonable and smart people can legitimately debate about whether it is Trout, Miggy, Kershaw or Goldy, I think that most reasonable people will agree that an MVP has to have had some – no a lot – of articulable performance contributing to actual, real-life runs and wins, otherwise that “empty WAR” is merely a tree falling in the forest with no one to hear it.

So what is WAR good for and why was it “invented?” Mostly it was invented as a way to combine all aspects of a player’s performance – offense, defense, base running, etc. – on a common scale. It was also invented to be able to estimate player talent and to project future performance. For that it is nearly perfect. The reason it ignores context is because we know that context is not part of a player’s skill set to any significant degree. Which also means that context-non-neutral performance is not predictive – if we want to project future performance, we need a metric that strips out context – hence WAR.

But, for MVP discussions? It is a terrible metric for the aforementioned reasons. Again, regardless of how you define MVP caliber performance, almost everyone is in agreement that it includes and needs context, precisely that which WAR disdains and ignores. Now, obviously WAR will correlate very highly with non-context-neutral performance. That goes without saying. It would be unlikely that a player who is a legitimate MVP candidate does not have a high WAR. It would be equally unlikely that a player with a high WAR did not specifically contribute to lots of runs and wins and to his team’s success in general. But that doesn’t mean that WAR is a good metric to use for MVP considerations. Batting average correlates well with overall offensive performance and pitcher wins correlate well with good pitching performance, but we would hardly use those two stats to determine who was the better overall batter or pitcher. And to say, for example, that Trout is the proper MVP and not Cabrera because Trout was 1 or 2 WAR better than Miggy, without looking at context, is an absurd and disingenuous argument.

So, is there a good or at least a better metric than WAR for MVP discussions? I don’t know. WPA perhaps. WPA in winning games only? WPA with more weight for winning games? RE27? RE27, again, adjusted for whether the team won or lost or scored a run or not? It is not really important what you use for these discussions but why you use it. It is not so much that WAR is a poor metric for determining an MVP. It is using WAR without understanding what it means and why it is a poor choice for an MVP discussion in and of itself, that is the mistake. As long as you understand what each metric means (including traditional mundane ones like RBI, runs, etc.), how it relates to the player in question and the team’s success, feel free to use whatever you like (hopefully a combination of metrics and statistics) – just make sure you can justify your position in a rational, logical, and accurate fashion.


In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average than the ones who do not, assuming that the latter were allowed to stay in the game of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.
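For anyone who wants to check that arithmetic, here is a minimal Python sketch. The function and variable names are mine, not part of the original analysis, and it assumes the 50/50 split used in the example. The 4.55 pooled figure does depend on that split, even though the direction of the argument does not.

```python
def pooled_ra9(stay_ra9, pulled_ra9, share_staying=0.5):
    """Weighted-average RA9 if every starter were allowed to continue.

    share_staying is the fraction of starters the manager left in;
    0.5 matches the 50/50 split in the example (an assumption).
    """
    return share_staying * stay_ra9 + (1.0 - share_staying) * pulled_ra9


true_talent = 4.0   # RA9 true talent of the whole group, from the example
stay_ra9 = 4.3      # what the starters who stayed actually pitched to
pulled_ra9 = 4.8    # hypothetical level of the starters who were pulled

pooled = pooled_ra9(stay_ra9, pulled_ra9)
implied_ttop = pooled - true_talent

print(f"pooled RA9 if everyone continued: {pooled:.2f}")
print(f"implied 3rd-time penalty:         {implied_ttop:.2f}")
```

Changing `share_staying` or `pulled_ra9` shows how quickly the implied penalty grows once you credit managers with even a modest ability to pick the right starters.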

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision as far as whether or not he should remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That’s a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue especially when close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0 or the game is tied 0-0, 1-1, 2-2, and the starter’s pitch count is not high, managers are typically loath to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP penalty and then adjusting for batter and park.
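The expectation above is just a sum of adjustments, which the following Python sketch makes explicit. The function name and the sign convention (negative means fewer runs expected) are my own assumptions for illustration. With the rounded inputs quoted in the text, the sum comes to 4.44, a hair under the 4.45 reported, presumably because the underlying figures carry more decimals.

```python
def expected_ra9(season_ra9, ttop=0.33, context_adj=0.0):
    """Expected RA9 for the next inning.

    season_ra9  : the group's seasonal RA9
    ttop        : times-through-the-order penalty in runs per 9
    context_adj : combined batter and park adjustment in runs per 9
                  (negative when the context favors the pitcher)
    """
    return season_ra9 + ttop + context_adj


# Close games: 4.39 seasonal RA9, .33 TTOP, about .28 runs easier context.
close_game_expectation = expected_ra9(4.39, ttop=0.33, context_adj=-0.28)
print(f"expected RA9 in the 7th: {close_game_expectation:.2f}")
```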

Surely their managers, in allowing them to pitch in a very close game in the 7th know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them to weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitch through 6 were allowed to continue even in games that were not close. No different from the close games. The other numbers are similar too. The ones who are allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should be (and presumably think they are) making very careful decisions about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn’t really matter who they pitch, in terms of the WE of the game, and the decision-making goal should be oriented toward the long-term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If we did, we would see much better performance from our starters than in not-so-close games and good performance in general. Instead we see rather poor performance, replacement level reliever numbers in the 7th inning of both close and not-so-close games. Surely that belies the, “Managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not,” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In all games in which a starter allows 1 or 0 runs through 6, even though his FIP was high, greater than 4, suggesting that he really wasn’t pitching such a great game, his manager let him continue 78% of the time, which was more than the 74% overall that starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP, less than 3, suggesting that he pitched better than his RA suggests, managers let them continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than component results when deciding whether to pull their starter in the 7th. We know that that is not a good decision-making process as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.

In addition, there is no evidence that managers can correctly determine who should stay and who to pull in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters has gas left in their tank and which do not? They don’t. They really, really, really don’t.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.
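The correction described in this note can be sketched in a few lines of Python. The function name and arguments are hypothetical. The inputs (4.40 seasonal RA9, 150 IP, 6 innings removed) are the ones from the example above, and the sketch treats the 6 innings as a single block subtracted from the season totals, as the note does.

```python
def rest_of_season_ra9(season_ra9, season_ip, runs_in_sample, sample_ip=6.0):
    """RA9 over everything except the sampled innings.

    Converts the seasonal RA9 to total runs, removes the runs and innings
    from the sample (here, the first 6 innings of the game in question),
    and converts back to runs per 9 innings.
    """
    season_runs = season_ra9 * season_ip / 9.0
    return (season_runs - runs_in_sample) / (season_ip - sample_ip) * 9.0


# Dealing: 1 run through 6 (a 1.50 RA9 for those innings).
dealing_baseline = rest_of_season_ra9(4.40, 150, runs_in_sample=1.0)
# Shelled: 4 runs through 6.
shelled_baseline = rest_of_season_ra9(4.40, 150, runs_in_sample=4.0)

print(f"dealing baseline: {dealing_baseline:.2f}")   # about .12 above 4.40
print(f"shelled baseline: {shelled_baseline:.2f}")   # about .07 below 4.40
```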

The other day I wrote that pitcher performance through 6 innings, as measured solely by runs allowed, is not a good predictor of performance in the 7th inning. Whether a pitcher is pitching a shutout or has allowed 4 runs thus far, his performance in the 7th is best projected mostly by his full-season true talent level plus a times through the order penalty of around .33 runs per 9 innings (the average batter faced in the 7th inning appears for the 3rd time). Pitch count has a small effect on those late inning projections as well.

Obviously if you have allowed no or even 1 run through 6 your component results will tend to be much better than if you have allowed 3 or 4 runs, however there is going to be some overlap. Some small proportion of 0 or 1 run starters will have allowed a HR, 6 or 7 walks and hits, and few if any strikeouts. Similarly, some small percentage of pitchers who allow 3 or 4 runs through 6 will have struck out 7 or 8 batters and only allowed a few hits and walks.

If we want to know whether pitching “well” or not through 6 innings has some predictive value for the 7th (and later) inning, it is better to focus on things that reflect the pitcher’s raw performance than simply runs allowed. It is an established fact that pitchers have little control over whether their non-HR batted balls fall for hits or outs or whether their hits and walks get “clustered” to produce lots of runs or are spread out such that few if any runs are scored.

It is also established that the components most under control by a pitcher are HR, walks, and strikeouts, and that pitchers who excel at the K, and limit walks and HR tend to be the most talented, and vice versa. It also follows that when a pitcher strikes out a lot of batters in a game and limits his HR and walks total that he is pitching “well,” regardless of how many runs he has allowed – and vice versa.

Accordingly, I have extended my inquiry into whether pitching “well” or not has some predictive value intra-game to focus on in-game FIP rather than runs allowed. My intra-game FIP is merely HR, walks, and strikeouts per inning, using the same weights as are used in the standard FIP formula – 13 for HR, 3 for walks and 2 for strikeouts.
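As a rough illustration of that calculation, here is a Python sketch. The weights come from the text. In the standard formula the strikeout term is subtracted and a league constant is added to put the result on a runs scale. The 3.10 constant below is only a placeholder assumption, since the article does not say what constant (if any) was used, and the function names and sample stat lines are invented for the example.

```python
def in_game_fip(hr, bb, k, innings=6.0, constant=3.10):
    """In-game FIP from HR, walks and strikeouts.

    Uses the standard weights (13 per HR, 3 per walk, 2 per strikeout,
    with strikeouts counted in the pitcher's favor). `constant` is an
    assumed league constant, not a value taken from the article.
    """
    return (13 * hr + 3 * bb - 2 * k) / innings + constant


def classify(fip, dealing_max=3.0, not_dealing_min=4.0):
    """Bucket a start using the article's thresholds (below 3, above 4)."""
    if fip < dealing_max:
        return "dealing"
    if fip > not_dealing_min:
        return "not dealing"
    return "in between"


sharp = in_game_fip(hr=0, bb=1, k=7)   # made-up line: lots of K, no HR
shaky = in_game_fip(hr=1, bb=3, k=2)   # made-up line: HR, walks, few K

print(f"{sharp:.2f} -> {classify(sharp)}")
print(f"{shaky:.2f} -> {classify(shaky)}")
```

Note that a starter can land in either bucket regardless of how many runs he has actually allowed, which is the whole point of switching from runs to components.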

So, rather than defining dealing as allowing 1 or fewer runs through 6 and not dealing as 3 or more runs, I will define the former as an FIP through 6 innings below some maximum threshold and the latter as above some minimum threshold. Although I am not nearly convinced that managers and pitching coaches, and certainly not the casual fan, look much further than runs allowed, I think we can all agree that they should be looking at these FIP components instead.

Here is the same data that I presented in my last article, this time using FIP rather than runs allowed to differentiate pitchers who have been pitching very well through 6 innings or not.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | Avg runs allowed through 6 | # of Games | RA9 in the 7th inning
Dealing (FIP less than 3 through 6) | 1.02 | 5,338 | 4.39
Not-dealing (FIP greater than 4) | 2.72 | 3,058 | 5.03

The first thing that should jump out at you is while our pitchers who are not pitching well do indeed continue to pitch poorly, our dealing pitchers, based upon K, BB, and HR rate over the first 6 innings, are not exactly breaking the bank either in the 7th inning.

Let’s put some context into those numbers.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | True talent level based on season RA9 | Expected RA9 in 7th | RA9 in the 7th inning
Dealing (FIP less than 3 through 6) | 4.25 | 4.50 | 4.39
Not-dealing (FIP greater than 4) | 4.57 | 4.62 | 5.03

As you can see, our new dealing pitchers are much better pitchers. They normally allow 4.25 runs per game during the season. Yet they allow 4.39 runs in the 7th despite pitching very well through 6, irrespective of runs allowed (and of course they allow few runs too). In other words, we have eliminated those pitchers who allowed few runs but may have actually pitched badly or at least not as well as their meager runs allowed would suggest. All of these dealing pitchers had some combination of high K rates, and low BB and HR rates through 6 innings. But still, we see only around .1 runs per 9 in predictive value – not significantly different from zero or none.

On the other hand, pitchers who have genuinely been pitching badly, at least in terms of some combination of a low K rate and high BB and HR rates, do continue to pitch around .4 runs per 9 innings worse than we would expect given their true talent level and the TTOP.

There is one other thing that is driving some of the difference. Remember that in our last inquiry we found that pitch count was a factor in future performance. We found that while pitchers who only had 78 pitches through 6 innings pitched about as well as expected in the 7th, pitchers with an average of 97 pitches through 6 performed more than .2 runs worse than expected.

In our above 2 groups, the dealing pitchers averaged 84 pitches through 6 and the non-dealing 88, so we expect some bump in the 7th inning performance of the latter group because of a touch of fatigue, at least as compared to the dealing group.

So when we use a more granular approach to determining whether pitchers have been dealing through 6, there is not any evidence that it has much predictive value – the same thing we concluded when we looked at runs allowed only. These pitchers pitch only .11 runs per 9 better than expected.

On the other hand, if pitchers have been pitching poorly for 6 innings, as reflected in the components in which they exert the most control, K, BB, and HR rates, they do in fact pitch worse than expected, even after accounting for a slight elevation in pitch count as compared to the dealing pitchers. That decrease in performance is about .4 runs per 9.

I also want to take this time to state that based on this data and the data from my previous article, there is little evidence that managers are able to identify when pitchers should stay in the game or should be removed. We are only looking at pitchers who were chosen to continue pitching in the 7th inning by their managers and coaches. Yet, the performance of those pitchers is worse than their seasonal numbers, even for the dealing pitchers. If managers could identify those pitchers who were likely to pitch well, whether they had pitched well in prior innings or not, clearly we would see better numbers from them in the 7th inning. At best a dealing pitcher is able to mitigate his TTOP, and a non-dealing pitcher who is allowed to pitch the 7th pitches terribly, which does not bode well for the notion that managers know whom to pull and whom to keep in the game.

For example, in the above charts, we see that dealing pitchers threw .14 runs per 9 worse than their seasonal average – which also happens to be exactly at league average levels. The non-dealing pitchers, who were also deemed fit to continue by their managers, pitched almost ½ run worse than their seasonal performance and more than .6 runs worse than the league average pitcher. Almost any reliever in the 7th inning would have been a better alternative than either the dealing or non-dealing pitchers. Once again, I have yet to see some concrete evidence that the ubiquitous cry from some of the sabermetric naysayers, “Managers know more about their players’ performance prospects than we do,” has any merit whatsoever.

Almost everyone, to a man, thinks that a manager’s decision as to whether to allow his starter to pitch in the 6th, 7th, or 8th (or later) innings of an important game hinges, at least in part, on whether said starter has been dealing or getting banged around thus far in the game.

Obviously there are many other variables that a manager can and does consider in making such a decision, including pitch count, times through the order (not high in a manager’s hierarchy of criteria, as analysts have been pointing out more and more lately), the quality and handedness of the upcoming hitters, and the state of the bullpen, both in term of quality and availability.

For the purposes of this article, we will put aside most of these other criteria. The two questions we are going to ask are these:

  • If a starter is dealing thus far, say, in the first 6 innings, and he is allowed to continue, how does he fare in the very next inning? Again, most people, including almost every baseball insider (player, manager, coach, media commentator, etc.), will assume that he will continue to pitch well.
  • If a starter has not been dealing, or worse yet, he is achieving particularly poor results, these same folks will usually argue that it is time to take him out and replace him with a fresh arm from the pen. As with the starter who has been dealing, the presumption is that the pitcher’s bad performance over the first, say, 6 innings, is at least somewhat predictive of his performance in the next inning or two. Is that true as well?

Keep in mind that one thing we are not able to look at is how a poorly performing pitcher might perform if he were left in a game, even though he was removed. In other words, we can’t do the controlled experiment we would like – start a bunch of pitchers, track how they perform through 6 innings and then look at their performance through the next inning or two.

So, while we have to assume that, in some cases at least, when a pitcher is pitching poorly and his manager allows him to pitch a while longer, that said manager still had some confidence in the pitcher’s performance over the remaining innings, we also must assume that if most people’s instincts are right, the dealing pitchers through 6 innings will continue to pitch exceptionally well and the not-so dealing pitchers will continue to falter.

Let’s take a look at some basic numbers before we start to parse them and do some necessary adjustments. The data below is from the AL only, 2003-2013.


Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings | # of Games | RA9 in the 7th inning
Dealing (0 or 1 run allowed through 6) | 5,822 | 4.46
Not-dealing (3 or more runs allowed through 6) | 2,960 | 4.48

First, let me explain what “RA9 in the 7th inning” means: It is the average number of runs allowed by the starter in the 7th inning extrapolated to 9 innings, i.e. runs per inning in the 7th multiplied by 9. Since the starter is often removed in the middle of the 7th inning whether he has been dealing or not, I calculated his runs allowed in the entire inning by adding together his actual runs allowed while he was pitching plus the run expectancy of the average pitcher when he left the game, scaled to his talent level and adjusted for time through the order, based on the number of outs and base runners.

For example, let’s say that a starter who is normally 10% worse than a league average pitcher allowed 1 run in the 7th inning and then left with 2 outs and a runner on first base. He would be charged with allowing 1 plus (.231 * 1.1 * 1.08) runs or 1.274 runs in the 7th inning. The .231 is the average run expectancy for a runner on first base and 2 outs, the 1.1 multiplier is because he is 10% worse than a league average pitcher, and the 1.08 multiplier is because most batters in the 7th inning are appearing for the 3rd time (TTOP). When all the 7th inning runs are tallied, we can convert them into a runs per 9 innings or the RA9 you see in the chart above.
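Here is a short Python sketch of that bookkeeping, using the numbers from the example. The function names are hypothetical, and the run expectancy value is passed in directly rather than looked up from a full base-out table.

```python
def charged_runs(actual_runs, base_out_re, talent_factor=1.0, ttop_factor=1.08):
    """Runs charged to a starter for a full inning he did not finish.

    actual_runs   : runs that scored while he was on the mound
    base_out_re   : league-average run expectancy for the base/out state
                    he left behind (e.g. .231 for runner on 1st, 2 outs)
    talent_factor : his run prevention relative to league (1.1 = 10% worse)
    ttop_factor   : multiplier for facing the order the 3rd time
    """
    return actual_runs + base_out_re * talent_factor * ttop_factor


def ra9(total_runs, innings):
    """Convert runs over some number of innings to runs per 9."""
    return total_runs / innings * 9.0


example = charged_runs(1, 0.231, talent_factor=1.1)
print(f"runs charged for the 7th: {example:.3f}")
```

Summing `charged_runs` over every 7th inning in a group and passing the total to `ra9` with the number of innings gives the figure shown in the chart.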

At first glance it appears that whether a starter has been dealing in prior innings or not has absolutely no bearing on how he is expected to pitch in the following inning, at least with respect to those pitchers who were allowed to remain in the game past the 6th inning. However, we have different pools of pitchers, batters, parks, etc., so the numbers will have to be parsed to make sure we are comparing apples to apples.

Let’s add some pertinent data to the above chart:

Starters through 6 | RA9 in the 7th | Seasonal RA9
Dealing | 4.46 | 4.29
Not-dealing | 4.48 | 4.46

As you can see, the starters who have been dealing are, not surprisingly, better pitchers. However, interestingly, we have a reverse hot and cold effect. The pitchers who have allowed only 1 run or less through 6 innings pitch worse than expected in the 7th inning, based on their season-long RA9. Many of you will know why – the times through the order penalty. If you have not read my two articles on the TTOP, and I suggest you do, each time through the order, a starting pitcher fares worse and worse, to the tune of about .33 runs per 9 innings each time he faces the entire lineup. In the 7th inning, the average TTO is 3.0, so we expect our good pitchers, the ones with the 4.29 RA9 during the season, to average around 4.76 RA9 in the 7th inning (the 3rd time through the order, a starter pitches about .33 runs per 9 worse than he pitches overall, and the seasonal adjustment – see the note above – adds another .14 runs). They actually pitch to the tune of 4.46 or .3 runs better than expected after considering the TTOP. What’s going on there?

Well, as it turns out, there are 3 contextual factors that depress a dealing starter’s results in the 7th inning that have nothing to do with his performance in the 6 previous innings:

  • The batters that a dealing pitcher is allowed to face are 5 points lower in wOBA than the average batter that each faces over the course of the season, after adjusting for handedness. This should not be surprising. If any starting pitcher is allowed to pitch the 7th inning, it is likely that the batters in that inning are slightly less formidable or more advantageous platoon-wise, than is normally the case. Those 5 points of wOBA translate to around .17 runs per 9 innings, reducing our expected RA9 to 4.59.
  • The parks in which we find dealing pitchers are, not surprisingly, slightly pitcher friendly, with an average PF of .995, which lowers our expectation of future performance by another .02 runs per 9, to 4.57.
  • The temperature in which this performance occurs is also slightly more pitcher friendly by around a degree F, although this would have a de minimis effect on run scoring (it takes about a 10 degree difference in temperature to move run scoring by around .025 runs per game).

So our dealing starters pitch .11 runs per 9 innings better than expected, a small effect, nothing to write home about, and well within the range of values that can be explained purely by chance.
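To make the chain of adjustments easy to follow, this Python sketch walks the dealing starters' seasonal RA9 through each step quoted above. The labels and variable names are mine. The numbers are the rounded ones from the text, and the temperature effect is left out because it is negligible.

```python
season_ra9 = 4.29   # seasonal RA9 of the dealing starters
actual_7th = 4.46   # what they actually allowed in the 7th, per 9

# (label, adjustment in runs per 9) in the order the text applies them
adjustments = [
    ("times through the order penalty", +0.33),
    ("seasonal baseline adjustment",    +0.14),
    ("weaker batters faced (5 pts wOBA)", -0.17),
    ("pitcher-friendly parks (PF .995)", -0.02),
]

expected = season_ra9
for label, delta in adjustments:
    expected += delta
    print(f"{label:<36} {delta:+.2f} -> {expected:.2f}")

gap = expected - actual_7th
print(f"better than expected by {gap:.2f} runs per 9")
```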

What about the starters who were not dealing? They out-perform their seasonal RA9 plus the TTOP by around .3 runs per 9. The batters they face in the 7th inning are 6 points worse than the average league batter after adjusting for the platoon advantage, and the average park and ambient temperature tend to slightly favor the hitter. Adjusting their seasonal RA9 to account for the fact that they pitched poorly through 6 (see my note at the beginning of this article), we get an expectation of 4.51. So these starters fare almost exactly as expected (4.48 to 4.51) in the 7th inning, after adjusting for the batter pool, despite allowing 3 or more runs for the first 6 innings. Keep in mind that we are only dealing with data from around 9,000 BF. One standard deviation in “luck” is around 5 points of wOBA which translates to around .16 runs per 9.

It appears to be quite damning that starters who are allowed to continue after pitching 6 stellar or mediocre to poor innings pitch almost exactly as (poorly as) expected – their normal adjusted level plus .33 runs per 9 because of the TTOP – as if we had no idea how well or poorly they pitched in the prior 6 innings.

Score one for simply using a projection plus the TTOP to project how any pitcher is likely to pitch in the middle to late innings, regardless of how well or poorly they have pitched thus far in the game. Prior performance in the same game has almost no bearing on that performance. If anything, when a manager allows a dealing pitcher to continue pitching after 6 innings, when facing the lineup for the 3rd time on the average, he is riding that pitcher too long. And, more importantly, presumably he has failed to identify anything that the pitcher might be doing, velocity-wise, mechanics-wise, repertoire-wise, command-wise, results-wise, that would suggest that he is indeed on that day and will continue to pitch well for another inning or so.

In fact, whether pitchers have pitched very well or very poorly or anything in between for the first 6 innings of a game, managers and pitching coaches seem to have no ability to determine whether they are likely to pitch well if they remain in the game. The best predictor of 7th inning performance for any pitcher who is allowed to remain in the game, is his seasonal performance (or projection) plus a fixed times through the order penalty. The TTOP is approximately .33 runs per 9 innings for every pass through the order. Since the second time through the order is roughly equal to a pitcher’s overall performance, starting with the 3rd time through the lineup we expect that starter to pitch .33 runs worse than he does overall, again, regardless of how he has pitched thus far in the game. The 4th time through, we expect a .66 drop in performance. Pitchers rarely if ever get to throw to the order for the 5th time.
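As a minimal sketch of that rule of thumb, the projection below adds .33 runs per 9 for each pass through the order relative to the second pass, which is treated as the pitcher's overall level. The function name is hypothetical, and the sketch deliberately ignores the batter pool, park, and weather adjustments discussed earlier.

```python
TTOP_PER_PASS = 0.33  # runs per 9 innings per pass through the order, from the text


def expected_ra9(projected_ra9, times_through_order):
    """Projection plus a fixed penalty per pass, with the 2nd pass as the baseline."""
    return projected_ra9 + TTOP_PER_PASS * (times_through_order - 2)


# A 4.30 pitcher on his 2nd, 3rd, and 4th pass through the lineup.
for tto in (2, 3, 4):
    print(tto, round(expected_ra9(4.30, tto), 2))
```

Nothing about the first 6 innings of the game enters the calculation, which is the whole point.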

Fatigue and Pitch Counts

Let’s look at fatigue using pitch count as a proxy, and see if that has any effect on 7th inning performance for pitchers who allowed 3 or more runs through 6 innings. For example, if a pitcher has not pitched particularly well, should we allow him to continue if he has a low pitch count?

Pitch count and 7th inning performance for non-dealing pitchers:

Pitch count through 6 | Expected RA9 | Actual RA9
Less than 85 (avg=78) | 4.56 | 4.70
Greater than 90 (avg=97) | 4.66 | 4.97


Expected RA9 accounts for the pitchers’ adjusted seasonal RA9 plus the pool of batters faced in the 7th inning including platoon considerations, as well as park and weather. The latter 2 affect the numbers minimally. As you can see, pitchers who had relatively high pitch counts going into the 7th inning but were allowed to pitch for whatever reasons despite allowing at least 3 runs thus far, fared .3 runs worse than expected, even after adjusting for the TTOP. Pitchers with low pitch counts did only about .14 runs worse than expected, including the TTOP. Those 20 extra pitches appear to account for around .17 runs per 9, not a surprising result. Again, please keep in mind that we are dealing with limited sample sizes, so these small differences are inferential suggestions and are not to be accepted with a high degree of certainty. They do point us in a certain direction, however, and one which comports with our prior expectation – at least my prior expectation.
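The arithmetic behind those three numbers is just actual minus expected for each group, then the gap between the two groups. A short sketch using the figures from the non-dealing table (the dictionary labels are mine):

```python
# (expected RA9, actual RA9) in the 7th inning for non-dealing starters
non_dealing = {
    "under 85 pitches": (4.56, 4.70),
    "over 90 pitches": (4.66, 4.97),
}

# Positive residual means the group pitched worse than expected.
residuals = {label: round(actual - expected, 2)
             for label, (expected, actual) in non_dealing.items()}
fatigue_gap = round(residuals["over 90 pitches"] - residuals["under 85 pitches"], 2)
print(residuals, fatigue_gap)
```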

What about if a pitcher has been dealing and he also has a low pitch count going into the 7th inning? Very few managers, if any, would remove a starter who allowed zero or 1 run through 6 innings and has only thrown 65 or 70 pitches. That would be baseball blasphemy. Besides the affront to the pitcher (which may be a legitimate concern, but one which is beyond the scope of this article), the assumption by nearly everyone is that the pitcher will continue to pitch exceptionally well. After all, he is not at all tired and he has been dealing! Let’s see if that is true – that these starters continue to pitch well, better than expected based on their projections or seasonal performance plus the TTOP.

Pitch count and 7th inning performance for dealing pitchers:

Pitch count through 6 | Expected RA9 | Actual RA9
Less than 80 (avg=72) | 4.75 | 4.50
Greater than 90 (avg=96) | 4.39 | 4.44

Keep in mind that these pitchers normally allow 4.30 runs per 9 innings during the entire season (4.44 after doing the seasonal adjustment). The reason the expected RA9 is so much higher for pitchers with a low pitch count is primarily due to the TTOP. For pitchers with a high pitch count, the batters they face in the 7th are 10 points less in wOBA than league average, thus the 4.39 expected RA9, despite the usual .3 to .35 TTOP.

Similar to the non-dealing pitchers, fatigue appears to play a factor in a dealing pitcher’s performance in the 7th. However, in either case, low-pitch or high-pitch, their performance through the first 6 innings has little bearing on their 7th inning performance. With no fatigue they out-perform their expectation by .25 runs per 9. The fatigued pitchers under-performed their overall season-long adjusted talent plus the usual TTOP by .05 runs per 9.

Again, we see that there is little value to taking out a pitcher who has been getting a little knocked around or leaving in a pitcher who has been dealing for 6 straight innings. Both groups will continue to perform at around their expected full-season levels plus any applicable TTOP, with a slight increase in performance for a low-pitch count pitcher and a slight decrease for a high-pitch count pitcher. The biggest increase we see, .25 runs, is for pitchers who were dealing and had very low pitch counts.

What about if we increase our threshold to pitchers who allow 4 or more runs over 6 innings and those who are pitching a shutout?

Starters through 6 | Seasonal RA9 | Expected RA9 | 7th inning RA9
Dealing (shutouts only) | 4.23 | 4.62 | 4.70
Not-dealing (4 or more runs) | 4.62 | 4.81 | 4.87

Here, we see no predictive value in the first 6 innings of performance. In fact, for some reason starters pitching a shutout pitched slightly worse than expected in the 7th inning, after adjusting for the pool of batters faced and the TTOP.

How about the holy grail of starters who are expected to keep lighting it up in the 7th inning – starters pitching a shutout and with a low pitch count? These were true talent 4.25 pitchers facing better than average batters in the 7th, mostly for the third time in the game, so we expect a .3 bump or so for the TTOP. Our expected RA9 was 4.78 after making all the adjustments, and the actual was 4.61. Nothing much to speak of. Their dealing combined with a low pitch count had a very small predictive value in the 7th. Less than .2 runs per 9 innings.


As I have been preaching for what seems like forever – and the data are in accordance – however a pitcher is pitching through X innings in a game, at least as measured by runs allowed, even at the extremes, has very little relevance with regard to how he is expected to pitch in subsequent innings. The best marker for whether to pull a pitcher or not seems to be pitch count.

If you want to know the most likely result, or the mean expected result at any point in the game, you should mostly ignore prior performance in that game and use a credible projection plus a fixed times through the order penalty, which is around .33 runs per 9 the 3rd time through, and another .33 the 4th time through. Of course the batters faced, park, weather, etc. will further dictate the absolute performance of the pitcher in question.

Keep in mind that I have not looked at a more granular approach to determining whether a pitcher has been pitching extremely well or getting shelled, such as hits, walks, strikeouts, and the like. It is possible that such an approach might yield a subset of pitching performance that indeed has some predictive value within a game. For now, however, you should be pretty convinced that run prevention alone during a game has little predictive value in terms of subsequent innings. Certainly a lot less than what most fans, managers, and other baseball insiders think.

There is a prolific base stealer on first base in a tight game. The pitcher steps off the rubber, varies his timing, or throws over to first several times during the AB. You’ve no doubt heard some version of the following refrain from your favorite media commentator: “The runner is disrupting the defense and the pitcher, and the latter has to throw more fastballs and perhaps speed up his delivery or use a slide step, thus giving the batter an advantage.”

There may be another side of the same coin: The batter is distracted by all these ministrations, he may even be distracted if and when the runner takes off for second, and he may take a pitch that he would ordinarily swing at in order to let the runner steal a base. All of this leads to decreased production from the batter, as compared to a proverbial statue on first, to which the defense and the pitcher pay little attention.

So what is the actual net effect? Is it in favor of the batter, as the commentators would have you believe (after all, they’ve played the game and you haven’t), or does it benefit the pitcher – an unintended negative consequence of being a frequent base stealer?

Now, even if the net effect of a stolen base threat is negative for the batter, that doesn’t mean that being a prolific base stealer is necessarily a bad thing. Attempting stolen bases, given a high enough success rate, presumably provides extra value to the offense independent of the effect on the batter. If that extra value exceeds that given up by virtue of the batter being distracted, then being a good and prolific base stealer may be a good thing. If the pundits are correct and the “net value of distraction” is in favor of the batter, then perhaps the stolen base or stolen base attempt is implicitly worth more than we think.

Let’s also not forget that the stolen base attempt, independent of the success rate, is surely a net positive for the offense, notwithstanding any potential distraction effects. That is due to the fact that when the batter puts the ball in play, whether it is a hit and run or a straight steal, there are fewer forces at second, fewer GDPs, and the runner advances the extra base more often on a single, double, or out. Granted, there are a few extra line drive and fly ball DP, but there are many fewer GDP to offset those.

If you’ve already gotten the feeling that this whole steal thing is a lot more complicated than it appears on its face, you would be right. It is also not easy, to say the least, to try and ascertain whether there is a distraction effect and who gets the benefit, the offense or the defense. You might think, “Let’s just look at batter performance with a disruptive runner on first as compared to a non-disruptive runner.” We can even use a “delta,” “matched pairs,” or “WOWY” approach in order to control for the batter, and perhaps even the pitcher and other pertinent variables. For example, with Cabrera at the plate, we can look at his wOBA with a base stealing threat on first and a non-base stealing threat. We can take the difference, say 10 points in wOBA in favor of with the threat (IOW, the defense is distracted and not the batter), and weight that by the number of times we find a matched pair (the lesser of the two PA). In other words, a “matched pair” is one PA with a stolen base threat on first and one PA with a non-threat.

If Cabrera had 10 PA with a stolen base threat and 8 PA with someone else on first, we would weight the wOBA difference by 8 – we have 8 matched pairs. We do that for all the batters, weighting each batter’s difference by their number of matched pairs, and voila, we have a measure of the amount that a stolen base threat on first affects the batter’s production, as compared to a non-stolen base threat. Seems pretty simple and effective, right? Eh, not so fast.

Unfortunately there are myriad problems associated with that methodology. First of all, do we use all PA where the runner started on first but may have ended up on another base, or was thrown out, by the time the batter completed his PA? If we do that, we will be comparing apples to oranges. With the base stealing threats, there will be many more PA with a runner on second or third, or with no runners at all (on a CS or PO). And we know that wOBA goes down once we remove a runner from first base, because we are eliminating the first base “hole” with the runner being held on. We also know that the values of the offensive components are different depending on the runners and outs. For example, with a runner on second, the walk is not as valuable to the batter and the K is worse than a batted ball out which has a chance to advance the runner.

What if we only look at PA where the runner was still at first when the batter completed his PA? Several researchers have done that, including myself and my co-authors in The Book. The problem with that method is that those PA are not an unbiased sample. For the non-base stealers, most PA will end with a runner on first, so that is not a problem. But with a stolen base threat on first, if we only include those PA that end with the runner still on first, we are only including PA that are likely biased in terms of count, score, game situation, and even the pitcher. In other words, we are only including PA where the runner has not attempted a steal yet (other than on a foul ball). That could mean that the pitcher is difficult to steal on (many of these PA will be with a LHP on the mound), the score is lopsided, the count is biased one way or another, etc. Again, if we only look at times where the PA ended with the runner on first, we are comparing apples to oranges when looking at the difference in wOBA between a stolen base threat on first and a statue.

It almost seems like we are at an impasse and there is nothing we can do, unless perhaps we try to control for everything, including the count, which would be quite an endeavor. Fortunately there is a way to solve this – or at least come close. We can first figure out the overall difference in value to the offense between having a base stealer and a non-base stealer on first, including the actual stolen base attempts. How can we do that? That is actually quite simple. We need only look at the change in run expectancy starting from the beginning to the end of the PA, starting with a runner on first base only. We can then use the delta or matched pairs method to come up with an average difference in change in RE. This difference represents the sum total of the value of a base stealer at first versus a non-base stealer, including any effect, positive or negative, on the batter.

From there we can try and back out the value of the stolen bases and caught stealings (including pick-offs, balks, pick-off errors, catcher errors on the throw, etc.) as well as the extra base runner advances and the avoidance of the GDP when the ball is put into play. What is left is any “distraction effect” whether it be in favor of the batter or the pitcher.

First, in order to classify the base runners, I looked at their number of steal attempts per times on first (BB+HP+S+ROE) for that year and the year before. If it was greater than 20%, they were classified as a “stolen-base threat.” If it was less than 2%, they were classified as a statue. Those were the two groups I looked at vis-à-vis the runner on first base. All other runners (the ones in the middle) were ignored. Around 10% of all runners were in the SB threat group and around 50% were in the rarely steal group.
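The classification rule is simple enough to sketch in a few lines. The function name and the "ignored" label are mine; the 20% and 2% cutoffs are the ones described above, applied to attempts and times on first pooled over the two seasons.

```python
def classify_runner(steal_attempts, times_on_first):
    """Label a runner from his steal attempts per time on first (BB+HP+S+ROE)."""
    if times_on_first == 0:
        return "ignored"  # no opportunities, so nothing to classify
    rate = steal_attempts / times_on_first
    if rate > 0.20:
        return "threat"   # stolen-base threat
    if rate < 0.02:
        return "statue"   # rarely or never runs
    return "ignored"      # the runners in the middle are dropped from the study


print(classify_runner(30, 100), classify_runner(1, 100), classify_runner(10, 100))
```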

Then I looked at all situations starting with a runner on first (in one or the other stolen base group) and ending when the batter completes his PA or the runner makes the third out of the inning. The batter may have completed his PA with the runner still on first, on second or third, or with no one on base because the runner was thrown out or scored, via stolen bases, errors, balks, wild pitches, passed balls, etc.

I only included innings 1-6 (to try and eliminate pinch runners, elite relievers, late and close-game strategies, etc.) and batters who occupied the 1-7 slots. I created matched pairs for each batter such that I could use the “delta method” described above to compute the average difference in RE change. I did it year by year, i.e., the matched pairs had to be in the same year, but I included 20 years of data, from 1994-2013. The batters in each matched pair had to be on the same team as well as the same year. For example, Cabrera’s matched pairs of 10 PA with base stealers and 8 PA with non-base stealers would be in one season only. In another season, he would have another set of matched pairs.

Here is how it works: Batter A may have had 3 PA with a base stealer on first and 5 with a statue. His average change in RE (everyone starts with a runner on first only) at the end of the PA may have been +.130 runs for those 3 PA with the stolen base threat on first at the beginning of the PA.

For the 5 PA with a non-threat on first, his average change in RE may have been .110 runs. The difference is .02 runs in favor of the stolen base threat on first and that gets weighted by 3 PA (the lesser of the 5 and the 3 PA). We do the same thing for the next batter. He may have had a difference of -.01 runs (in favor of the non-threat) weighted by, say, 2 PA. So now we have (.02 * 3 – .01 * 2) / 5 as our total average difference in RE change using the matched pair or delta method. Presumably (hopefully) the pitcher, score, parks, etc. are the same or very similar for both groups. If they are, then that final difference represents the advantage of having a stolen base threat on first base, including the stolen base attempts themselves.
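Here is a minimal sketch of that weighting, using the two example batters. The function name and tuple layout are hypothetical, and the second batter's individual RE figures and non-threat PA count are made up so that his difference comes out to -.01 with 2 matched pairs, as in the example.

```python
def matched_pair_delta(batters):
    """Weighted average of (threat minus statue) change in RE.

    Each entry is (re_change_with_threat, pa_with_threat,
                   re_change_with_statue, pa_with_statue).
    A batter's weight is the lesser of his two PA counts.
    """
    weighted_sum = 0.0
    total_pairs = 0
    for re_threat, pa_threat, re_statue, pa_statue in batters:
        pairs = min(pa_threat, pa_statue)
        weighted_sum += (re_threat - re_statue) * pairs
        total_pairs += pairs
    return weighted_sum / total_pairs if total_pairs else 0.0


batters = [
    (0.130, 3, 0.110, 5),  # Batter A: +.02 difference, 3 matched pairs
    (0.100, 2, 0.110, 4),  # next batter: -.01 difference, 2 matched pairs (inputs invented)
]
print(round(matched_pair_delta(batters), 3))
```

With these inputs the weighted difference works out to (.06 – .02) / 5, or .008 runs per PA in favor of the threat.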

A plus number means a total net advantage to the offense with a prolific base stealer on first, including his SB, CS, and speed on the bases when the ball is put into play, and a negative number means that the offense is better off with a slow, non-base stealer on first, which is unlikely of course. Let’s see what the initial numbers tell us. By the way, for the changes in RE, I am using Tango’s 1969-1992 RE matrix from this web site:

We’ll start with 0 outs, so one of the advantages of a base stealer on first is staying out of the GDP (again, offset by a few extra line drive and fly ball DP). There were a total of 5,065 matched pair PA (adding the lesser of the two PA for each matched pair). Remember a matched pair is a certain batter with a base stealing threat on first and that same batter in the same year with a non-threat on first. The runners are on first base when the batter steps up to the plate but may not be when the PA is completed. That way we are capturing the run expectancy change of the entire PA, regardless of what happens to the runner during the PA.

The average advantage in RE change (again, that is the ending RE after the PA is over minus the starting RE, which is always with a runner on first only, in this case with 0 out) was .032 runs per PA. So, as we expect, a base stealing threat on first confers an overall advantage to the offensive team, at least with no outs. This includes the net run expectancy of SB (including balks, errors, etc.) and CS (including pick-offs), advancing on WP and PB, advancing on balls in play, staying out of the GDP, etc., as well as any advantage or disadvantage to the batter by virtue of the “distraction effect.”

The average wOBA of the batter, for all PA, whether the runner advanced a base or was thrown out during the PA, was .365 with a non-base stealer on first and .368 for a base stealer.

What are the differences in individual offensive components between a base stealing threat and a non-threat originally on first base? The batter with a statue who starts on first base has a few more singles, which is expected given that he hits with a runner on first more often. As well, the batter with a base stealing threat walks and strikes out a lot more, due to the fact he is hitting with a base open more often.

If we then compute the RE value of SB, CS (and balks, pickoffs, errors, etc.) for the base stealer and non-base stealer, as well as the RE value of advancing the extra base and staying out of the DP, we get an advantage to the offense with a base stealer on first of .034 runs per PA.

So, if the overall value of having a base stealer on first is .032 runs per PA, and we compute that .034 runs comes from greater and more efficient stolen bases and runner advances, we must conclude that there is a .002 runs disadvantage to the batter when there is a stolen base threat on first base. That corresponds to around 2 points in wOBA. So we can say that with no outs, there is a 2 point penalty that the batter pays when there is a prolific base stealer on first base, as compared to a runner who rarely attempts a SB. In 5,065 matched PA, one SD of the difference between a threat and non-threat is around 10 points in wOBA, so we have to conclude that there is likely no influence on the batter.

Let’s do the same exercise with 1 and then 2 outs.

With 1 out, in 3,485 matched pair PA, batters with non-threats hit .388 and batters with threats hit .367. The former had many more singles and of course fewer BB (a lot fewer) and K. Overall, with a non-base stealer starting on first base at the beginning of the PA, batters produced an RE that was .002 runs per PA better than with a base stealing threat. In other words, having a prolific, and presumably very fast, base stealer on first base offered no overall advantage to the offensive team, including the value of the SB, base runner advances, and avoiding the GDP.

If we compute the value that the stolen base threats provide on the base paths, we get .019 runs per PA, so the disadvantage to the batter by virtue of having a prolific base stealer on first base is .021 runs per PA, which is the equivalent of the batter losing 24 points in wOBA.

What about with 2 outs? With 2 outs, we can ignore the GDP advantage for the base stealer as well as the extra value from moving up a base on an out. So, once we get the average RE advantage for a base stealing threat, we can more easily factor out the stolen base and base running advantage to arrive at the net advantage or disadvantage to the batter himself.

With 2 outs, the average RE advantage with a base stealer on first (again, as compared to a non-base stealer) is .050 runs per PA, in a total of 2,390 matched pair PA. Here, the batter has a wOBA of .350 with a non-base stealer on first, and .345 with a base stealer. There is still a difference in the number of singles because of the extra hole with the first baseman holding on the runner, as well as the usual greater rate of BB with a prolific stealer on base. (Interestingly, with 2 outs, the batter has a higher K rate with a non-threat on base – it is usually the opposite.) Let’s again tease out the advantage due to the actual SB/CS and base running and see what we’re left with. Here, you can see how I did the calculations.

With the non-base stealer, the runner on first is out before the PA is completed 1.3% of the time, he advances to second, 4.4% of the time, and to third, .2%. The total RE change for all that is .013 * -.216 + .044 * .109 + .002 * .157, or .0023 runs, not considering the count when these events occurred. The minus .216, plus .109, and plus .157 are the change in RE when a base runner is eliminated from first, advances from first to second, and advances from first to third prior to the end of the PA (technically prior to the beginning of the PA). The .013, .044, and .002 are the frequencies of those base running events.
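That calculation is a frequency-weighted sum of RE changes, which can be sketched as follows. The RE deltas and the non-base stealer frequencies are the ones quoted above; the dictionary keys and function name are hypothetical.

```python
# Change in run expectancy with 2 outs for a runner who starts on first (from the text)
RE_DELTA = {"out": -0.216, "to_second": 0.109, "to_third": 0.157}


def pre_pa_runner_value(freq):
    """Sum of frequency times RE change over runner events before the PA ends."""
    return sum(freq[event] * RE_DELTA[event] for event in RE_DELTA)


# How often each event happens with a non-base stealer on first
statue = {"out": 0.013, "to_second": 0.044, "to_third": 0.002}
print(round(pre_pa_runner_value(statue), 4))
```

Like the calculation in the text, this ignores the count at the time each event occurred.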

For the base stealer, we have .085 (thrown out) times -.216 + .199 (advance to 2nd) * .109 + .025 (advance to 3rd) * .157, or .0117. So the net advantage to the base stealer from advancing or being thrown out is .0117 minus .0023, or .014 runs per PA.

What about the advantage to the prolific and presumably fast base stealers from advancing on hits? The above .014 runs was from advances prior to the completion of the PA, from SB, CS, pick-offs, balks, errors, WP, and PB.

The base stealer advances the extra base from first on a single 13.5% more often and 21.7% more often on a double. Part of that is from being on the move and part of that is from being faster.

12.5% of the time, there is a single with a base stealing threat on first. He advances the extra base 13.5% more often, but the extra base with 2 outs is only worth .04 runs, so the gain is negligible (.0007 runs).

A runner on second and a single occurs 2.8% of the time with a stolen base threat on base. The base stealer advances the extra base and scores 14.6% more often than the non-threat for a gain of .73 runs (being able to score from second on a 2-out single is extremely valuable), for a total gain of .73 * .028 * .146, or .003 runs.

With a runner on first and a double, the base stealer gains an extra .0056 runs.

So, the total base running advantage when the runner on first is a stolen base threat is .00925 runs per PA. Add that to the SB/CS advantage of .014 runs, and we get a grand total of .023 runs.

Remember that the overall RE advantage was .050 runs, so if we subtract out the base runner advantage, we get a presumed advantage to the batter of .050 – .023, or .027 runs per PA. That is around 31 points in wOBA.
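The 2-out bookkeeping can be laid out in one place as a rough sketch. Every input below is a figure quoted in the walk-through above; only the variable names are mine.

```python
# Extra value from advancing on hits (2 outs)
single_runner_on_first = 0.125 * 0.135 * 0.04   # frequency * extra-advance rate * run value
single_runner_on_second = 0.73 * 0.028 * 0.146  # run value * frequency * extra-advance rate
double_runner_on_first = 0.0056                 # quoted directly in the text

base_running = single_runner_on_first + single_runner_on_second + double_runner_on_first
sb_cs_value = 0.014      # advances and outs before the PA is completed
overall_re_edge = 0.050  # total RE advantage with a threat on first

runner_value = sb_cs_value + base_running
batter_residual = overall_re_edge - runner_value  # what is left for the batter
print(round(base_running, 4), round(runner_value, 3), round(batter_residual, 3))
```

Whatever is left after subtracting the runner's contribution is attributed to the batter, so any error in the runner-side estimates flows straight into the "distraction" number.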

So let’s recap what we found. For each of no outs, 1 out, and 2 outs, we computed the average change in RE for every batter with a base stealer on first (at the beginning of the PA) and a non-base stealer on first. That tells us the value of the PA from the batter and the base runner combined. (That is RE24, by the way.) We expect that this number will be higher with base stealers, otherwise what is the point of being a base stealer in the first place if you are not giving your team an advantage?

Table I – Overall net value of having a prolific and disruptive base stealing threat on first base at the beginning of the PA, the value of his base stealing and base running, and the presumed value to the batter in terms of any “distraction effect.” Plus is good for the offense and minus good for the defense.

Outs | Overall net value (runs per PA) | SB and base running value (runs per PA) | “Batter distraction” value
0 | .032 | .034 | -.002 runs (-2 points of wOBA)
1 | -.002 | .019 | -.021 runs (-24 points)
2 | .050 | .023 | +.027 runs (+31 points)


We found that very much to be the case with no outs and with 2 outs, but not with 1 out. With no outs, the effect of a prolific base stealer on first was .032 runs per PA, the equivalent of raising the batter’s wOBA by 37 points, and with 2 outs, the overall effect was .050 runs, the equivalent of an extra 57 points for the batter. With 1 out, however, the prolific base stealer is in effect lowering the wOBA of the batter by 2 points. Remember that these numbers include the base running and base stealing value of the runner as well as any “distraction effect” that a base stealer might have on the batter, positive or negative. In other words, RE24 captures the influence of the batter as well as the base runners.
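For anyone who wants to reproduce the runs-to-wOBA equivalents, here is a hedged sketch. It assumes a conversion factor of about 1.15 between runs per PA and wOBA; that factor is not stated anywhere above, it is simply the value that reproduces the equivalences quoted in this article (for example, .032 runs and 37 points). The names are illustrative.

```python
# Assumption: runs per PA convert to wOBA with a scale factor of about 1.15.
# The factor is inferred from the equivalences quoted in the text, not given there.
WOBA_SCALE = 1.15


def runs_to_woba_points(runs_per_pa, scale=WOBA_SCALE):
    """Convert a per-PA run value into points (thousandths) of wOBA."""
    return runs_per_pa * scale * 1000


# outs -> (overall net value, SB and base running value), in runs per PA, from Table I
table = {0: (0.032, 0.034), 1: (-0.002, 0.019), 2: (0.050, 0.023)}
for outs, (overall, running) in table.items():
    residual = overall - running  # presumed effect on the batter
    print(outs, round(residual, 3), round(runs_to_woba_points(residual)))
```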

In order to estimate the effect on the batter component, we can “back out” the base running value by looking at how often the various base running events occur and their value in terms of the “before and after” RE change. When we do that, we find that with 0 outs there is no effect on the batter from a prolific base stealer starting on first base. With 1 out, there is a 24 point wOBA disadvantage to the batter, and with 2 outs, there is a 31 point advantage to the batter. Overall, that leaves around a 3 or 4 point negative effect on the batter. Given the relatively small sample sizes of this study, one would not want to reject the hypothesis that having a prolific base stealer on first base has no net effect on the batter’s performance. Why the effect depends so much on the number of outs, and what if anything managers and players can do to mitigate or eliminate these effects, I will leave for the reader to ponder.