Archive for October, 2015

There seems to be an unwritten rule in baseball – not on the field, but in the stands, at home, in the press box, etc.

“You can’t criticize a manager’s decision if it doesn’t directly affect the outcome of the game, if it appears to ‘work’, or if the team goes on to win the game despite the decision.”

That’s ridiculous, of course. The outcome of a decision or the game has nothing to do with whether the decision was correct or not. Some decisions may raise or lower a team’s chances of winning from a baseline of 90%, and other decisions may affect a baseline of 10 or 15%.

If decision A results in a team’s theoretical chances of winning of 95% and decision B, 90%, obviously A is the correct move. Choosing B would be malpractice. Equally obvious is that if the manager chooses B, an awful decision, he is still going to win the game 90% of the time, and based on the “unwritten rule” we rarely get to criticize him. Similarly, if decision A results in a 15% win expectancy (WE) and B results in 10%, A is the clear choice, yet the team still loses most of the time and we get to second guess the manager whether he chooses A or B. All of that is silly and counter-productive.

If your teenager drives home drunk yet manages to not kill himself or anyone else, do you say nothing because “it turned out OK?” I hope not. In sports, most people understand the concept of “results versus process” if they are cornered into thinking about it, but in practice, they just can’t bring themselves to accept it in real time. No one is going to ask Terry Collins in the post-game presser why he didn’t pinch hit for DeGrom in the 6th inning – no one. The analyst – a competent one at least – doesn’t give a hoot what happened after that. None whatsoever. He looks at a decision and if it appears questionable at the time, he tries to determine what the average consequences are – with all known data at the time the decision is made – with the decision or with one or more alternatives. That’s it. What happens after that is irrelevant to the analyst. For some reason this is a hard concept for the average fan – the average person – to apply. As I said, I truly think they understand it, especially if you give obvious examples, like the drunk driving one. They just don’t seem to be able to break the “unwritten rule” in practice. It goes against their grain.

Well, I’m an analyst and I don’t give a flying ***k whether the Mets won, lost, tied, or Wrigley Field collapsed in the 8th inning. The “correctness” of the decision to allow DeGrom to hit or not in the top of the 6th, with runners on second and third, boiled down to this question and this question only:

“What is the average win expectancy (WE) of the Mets with DeGrom hitting and then pitching some number of innings and what is the average WE with a pinch hitter and someone else pitching in place of DeGrom?”

Admittedly the gain, if there is any, from making the decision to bring in a PH and reliever or relievers must be balanced against any known or potential negative consequences for the Mets not related to the game at hand. Examples of these might be: 1) limiting your relief possibilities in the rest of the series or the World Series. 2) Pissing off DeGrom or his teammates for taking him out and thus affecting the morale of the team.

I’m fine with the fans or the manager and coaches including these and other considerations in their decision. I am not fine with them making their decision not knowing how it affects the win expectancy of the game at hand, since that is clearly the most important of the considerations.

My guess is that if we asked Collins about his decision-making process, and he was honest with us, he would not say, “Yeah, I knew that letting him hit would substantially lower our chances of winning the game, but I also wanted to save the pen a little and give DeGrom a chance to….” I’m pretty sure he thought that with DeGrom pitching well (which he usually does, by the way – it’s not like he was pitching well-above his norm), his chances of winning were better with him hitting and then pitching another inning or two.

At this point, and before I get into estimating the WE of the two alternatives facing Collins, letting DeGrom hit and pitch or pinch hitting and bringing in a reliever, I want to discuss an important concept in decision analysis in sports. In American civil law, there is a thing called a summary judgment. When a party in a civil action moves for one, the judge makes his decision based on the known facts and assuming controversial facts and legal theories in a light most favorable to the non-moving party. In other words, if everything that the other party says is true is true (and is not already known to be false) and the moving party would still win the case according to the law, then the judge must accept the motion and the moving party wins the case without a trial.

When deciding whether a particular decision was “correct” or not in a baseball game or other contest, we can often do the same thing in order to make up for an imperfect model (which all models are by the way). You know the old saw in science – all models are wrong, but some are useful. In this particular instance, we don’t know for sure how DeGrom will pitch in the 6th and 7th innings to the Cubs order for the 3rd time, we don’t know for how much longer he will pitch, we don’t know how well DeGrom will bat, and we don’t know who Collins can and will bring in.

I’m not talking about the fact that we don’t know whether DeGrom or a reliever is going to give up a run or two, or whether he or they are going to shut the Cubs down. That is in the realm of “results-based analysis” and I’ve already explained how and why that is irrelevant. I’m talking about what is DeGrom’s true talent, say in runs allowed per 9 facing the Cubs for the third time, what is a reliever’s or relievers’ true talent in the 6th and 7th, how many innings do we estimate DeGrom will pitch on the average if he stays in the game, and what is his true batting talent.

Our estimates of all of those things will affect our model’s results – our estimate of the Mets’ WE with and without DeGrom hitting. But what if we assumed everything in favor of keeping DeGrom in the game – we looked at all controversial items in a light most favorable to the non-moving party – and it was still a clear decision to pinch hit for him? Well, we get a summary judgment! Pinch hitting for him would clearly be the correct move.

There is one more caveat. If it is true that there are indirect negative consequences to taking him out – and I’m not sure that there are – then we also have to look at the magnitude of the gain from taking him out and then decide whether it is worth it. In order to do that, we have to have some idea as to what is a small and what is a large advantage. That is actually not that hard to do. Managers routinely bring in closers in the 9th inning with a 2-run lead, right? No one questions that. In fact, if they didn’t – if they regularly brought in their second or third best reliever instead, they would be crucified by the media and fans. How much does bringing in a closer with a 2-run lead typically add to a team’s WE, compared to a lesser reliever? According to The Book, an elite reliever compared to an average reliever in the 9th inning with a 2-run lead adds around 4% to the team’s WE. So we know that 4% is a big advantage, which it is.

That brings up another way to account for the imperfection of our models. The first way was to use the “summary judgment” method, or assume things most favorable to making the decision that we are questioning. The second way is to simply estimate everything to the best of our ability and then look at the magnitude of the results. If the difference between decision A and B is 4%, it is extremely unlikely that any reasonable tweak to the model will change that 4% to 0% or -1%.

In this situation, whether we assume DeGrom is going to pitch 1.5 more innings or 1.6 or 1.4, it won’t change the results much. If we assume that DeGrom is an average hitting pitcher or a poor one, it won’t change the result all that much. If we assume that the “times through the order penalty” is .25 runs or .3 runs per 9 innings, it won’t change the results much. If we assume that the relievers used in place of DeGrom have a true talent of 3.5, 3.3, 3.7, or even 3.9, it won’t change the results all that much. Nothing can change the results from 4% in favor of decision A to something in favor of decision B. 4% is just too much to overcome even if our model is not completely accurate. Now, if our results assuming “best of our ability estimates” for all of these things yield a 1% advantage for choosing A, then it is entirely possible that B is the real correct choice and we might defer to the manager in case he knows some things that we don’t or we simply are mistaken in our estimates or we failed to account for some important variable.

Let’s see what the numbers say, assuming “average” values for all of these relevant variables and then again making reasonable assumptions in favor of allowing DeGrom to hit (assuming that pinch hitting for him appears to be correct).

What is the win expectancy with DeGrom batting? We’ll assume he is an average-hitting pitcher or so (I have heard that he is a poor-hitting pitcher). An average pitcher’s batting line is around 10% single, 2% double or triple, .3% HR, 4% BB, and 83.7% out. The average WE for an average team leading by 1 run in the top of the 6th, with runners on second and third, 2 outs, and a batter with this line, is…..

63.2%.

If DeGrom were an automatic out, the WE would be 59.5%. That is the average WE leading off the bottom of the 6th with the visiting team winning by a run. So an average pitcher batting in that spot adds a little more than 3.5% in WE. That’s not wood. What if DeGrom were a poor hitting pitcher?

Whirrrrr……

62.1%.

So whether DeGrom is an average or poor-hitting pitcher doesn’t change the Mets’ WE in that spot all that much. Let’s call it 63%. That is reasonable. He adds 3.5% to the Mets’ WE compared to an out.

What about a pinch hitter? Obviously the quality of the hitter matters. The Mets have some decent hitters on the bench – notably Cuddyer from the right side and Johnson from the left. Let’s assume a league-average hitter. Given that, the Mets’ WE with runners on second and third, 2 outs, and a 1-run lead, is 68.8%. A league-average hitter adds over 9% to the Mets’ WE compared to an out. The difference between DeGrom as a slightly below-average hitting pitcher and a league-average hitter is 5.8%. That means, unequivocally, assuming that our numbers are reasonably accurate, that letting DeGrom hit cost the Mets almost 6% in their chances of winning the game.
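The comparison above is plain subtraction from a common baseline. Here is a minimal sketch in Python that uses only the WE figures quoted so far. The variable names are made up for illustration, and the 63.0 for DeGrom is the rounded value settled on above.

```python
# Win expectancy (WE) figures quoted in the text, as percentages.
WE_AUTOMATIC_OUT = 59.5   # Mets up by 1, Cubs leading off the bottom of the 6th
WE_AVG_PITCHER = 63.2     # average-hitting pitcher at the plate
WE_POOR_PITCHER = 62.1    # poor-hitting pitcher at the plate
WE_DEGROM = 63.0          # rounded value the text uses for DeGrom
WE_PINCH_HITTER = 68.8    # league-average hitter at the plate


def gain_over_out(we):
    """WE added by a batter relative to an automatic out, in percentage points."""
    return round(we - WE_AUTOMATIC_OUT, 1)


degrom_gain = gain_over_out(WE_DEGROM)
ph_gain = gain_over_out(WE_PINCH_HITTER)
cost_of_letting_degrom_hit = round(WE_PINCH_HITTER - WE_DEGROM, 1)

print(degrom_gain, ph_gain, cost_of_letting_degrom_hit)  # 3.5 9.3 5.8
```

Swapping WE_DEGROM for WE_AVG_PITCHER or WE_POOR_PITCHER moves the cost by less than a point either way.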

That is enormous of course. Remember we said that bringing in an elite reliever in the 9th of a 2-run game, as compared to a league-average reliever, is worth 4% in WE. You can’t really make a worse decision as a manager than reducing your chances of winning by 5.8%, unless you purposely throw the game. But, that’s not nearly the end of the story. Collins presumably made this decision thinking that DeGrom pitching the 6th and perhaps the 7th would more than make up for that. Actually he’s not quite thinking, “Make up for that.” He is not thinking in those terms. He does not know that letting him hit “cost 5.8% in win expectancy” compared to a pinch hitter. I doubt that the average manager knows what “win expectancy” means let alone how to use it in making in-game decisions. He merely thinks, “I really want him to pitch another inning or two, and letting him hit is a small price to pay,” or something like that.

So how much does he gain by letting him pitch the 6th and 7th rather than a reliever? To be honest it is debatable whether he gains anything at all. Not only that, but if we look back in history to see how many innings starters end up pitching, on the average, in situations like that, we will find that it is not 2 innings. It is probably not even 1.5 innings. He was at 82 pitches through 5. He may throw 20 or 25 pitches in the 6th (like he did in the first), in which case he may be done. He may give up a base runner or two, or even a run or two, and come out in the 6th, perhaps before recording an out. At best, he pitches 2 more innings, and once in a blue moon he pitches all or part of the 8th I guess (as it turned out, he pitched 2 more effective innings and was taken out after seven). Let’s assume 1.5 innings, which I think is generous.

What is DeGrom’s expected RA9 for those innings? He has pitched well thus far but not spectacularly well. In any case, there is no evidence that pitching well through 5 innings tells us anything about how a pitcher is going to pitch in the 6th and beyond. What is DeGrom’s normal expected RA9? Steamer, ZiPS and my projection systems say about 83% of league-average run prevention. That is equivalent to a #1 or #2 starter. It is equivalent to an elite starter, but not quite the level of the Kershaws, Arrietas, or even the Prices or Sales. Obviously he could turn out to be better than that – or worse – but all we can do in these calculations and all managers can do in making these decisions is use the best information and the best models available to estimate player talent.

Then there is the “times through the order penalty.” There is no reason to think that this wouldn’t apply to DeGrom in this situation. He is going to face the Cubs for the third time in the 6th and 7th innings. Research has found that the third time through the order a starter’s RA9 is .3 runs worse than his overall RA9. So a pitcher who allows 83% of league average runs allows 90% when facing the order for the 3rd time. That is around 3.7 runs per 9 innings against an average NL team.
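The three numbers in that paragraph (83%, 90%, and 3.7) can be cross-checked against each other. The sketch below backs out the league-average RA9 they imply, then rebuilds the third-time-through figure from the overall projection. The implied league average is a derived value, not a figure quoted in the text.

```python
TTO_PENALTY = 0.3             # runs per 9 added the third time through (from the text)
DEGROM_PCT_OF_LEAGUE = 0.83   # overall projection, share of league-average runs
THIRD_TIME_PCT = 0.90         # share of league-average runs the third time through
THIRD_TIME_RA9 = 3.7          # RA9 quoted for the third time through

# League-average RA9 implied by "90% of league average is around 3.7".
implied_league_ra9 = THIRD_TIME_RA9 / THIRD_TIME_PCT

# Rebuild the third-time-through RA9 from the overall projection plus the penalty.
overall_ra9 = DEGROM_PCT_OF_LEAGUE * implied_league_ra9
third_time_ra9 = overall_ra9 + TTO_PENALTY

print(f"implied league RA9: {implied_league_ra9:.2f}")
print(f"DeGrom overall RA9: {overall_ra9:.2f}")
print(f"DeGrom third time through: {third_time_ra9:.2f} "
      f"({third_time_ra9 / implied_league_ra9:.0%} of league)")
```

The round trip lands back on roughly 3.7 and 90%, so the quoted figures are consistent with one another.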

Now we have to compare that to a reliever. The Mets have Niese, Robles, Reed, Colon, and Gilmartin available for short or long relief. Colon might be the obvious choice for the 6th and 7th inning, although they surely could use a combination of righties and lefties, especially in very high leverage situations. What do we expect these relievers’ RA9 to be? The average reliever is around 4.0 to start with, compared to DeGrom’s 3.7. If Collins uses Colon, Reed, Niese or some combination of relievers, we might expect them to be better than the average NL reliever. Let’s be conservative and assume an average, generic reliever for those 1.5 innings.

How much does that cost the Mets in WE? To figure that, we take the difference in run prevention between DeGrom and the reliever(s), multiply by the game leverage and convert it into WE. The difference between a 3.7 RA9 and a 4.0 RA9 in 1.5 innings is .05 runs. The average expected leverage index in the 6th and 7th innings where the road team is up by a run is around 1.7. So we multiply .05 by 1.7 and convert that into WE. The final number is .0085, or less than 1% in win expectancy gained by allowing DeGrom to pitch rather than an average reliever.
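The conversion just described can be written as a small function. This is a sketch under one stated assumption: the step from leveraged runs to WE uses the common rule of thumb of about 10 runs per win, which is what turning .085 runs into .0085 implies but which the text does not state outright. The function name is hypothetical.

```python
RUNS_PER_WIN = 10.0  # assumption: rule-of-thumb conversion implied by .085 runs -> .0085 WE


def we_gain_from_starter(starter_ra9, reliever_ra9, innings, leverage,
                         runs_per_win=RUNS_PER_WIN):
    """WE (as a fraction) gained by letting the starter pitch instead of the reliever."""
    runs_saved = (reliever_ra9 - starter_ra9) / 9.0 * innings   # 0.05 runs in the example
    return runs_saved * leverage / runs_per_win


gain = we_gain_from_starter(starter_ra9=3.7, reliever_ra9=4.0, innings=1.5, leverage=1.7)
print(f"{gain:.4f}")  # 0.0085
```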

That might shock some people. It certainly should shock Collins, since that is presumably his reason for allowing DeGrom to hit – he really, really wanted him to pitch another inning or two. He presumably thought that that would give his team a much better chance to win the game as opposed to one or more of his relievers. I have done this kind of calculation dozens of times and I know that keeping good or even great starters in the game for an inning or two is not worth much. For some reason, the human mind, in all its imperfect and biased glory, overestimates the value of 1 or 2 innings of a pitcher who is “pitching well” as compared to an “unknown entity” (of course we know the expected performance of our relievers almost as well as we know the expected performance of the starter). It is like a manager who brings in his closer in a 3-run game in the 9th. He thinks that his team has a much better chance of winning than if he brings in an inferior pitcher. The facts say that he is wrong, but tell that to a manager and see if he agrees with you – he won’t. Of course, it’s not a matter of opinion – it’s a matter of fact.

Do I need to go any further? Do I need to tweak the inputs? Assuming average values for the relevant variables yields a loss of over 5% in win expectancy by allowing DeGrom to hit. What if we knew that DeGrom were going to pitch two more innings rather than an average of 1.5? He saves .07 runs rather than .05 which translates to 1.2% WE rather than .85%, which means that pinch hitting for him increases the Mets’ chances of winning by 4.7% rather than 5.05%. 4.7% is still an enormous advantage. Reducing your team’s chances of winning by 4.7% by letting DeGrom hit is criminal. It’s like pinch hitting Jeff Mathis for Mike Trout in a high leverage situation – twice!

What about if our estimate of DeGrom’s true talent is too conservative? What if he is as good as Kershaw and Arrieta? That’s 63% of league average run prevention or 2.6 RA9. Third time through the order and it’s 2.9. The difference between that and an average reliever is 1.1 runs per 9, which translates to a 3.1% WE difference in 1.5 innings. So allowing Kershaw to hit in that spot reduces the Mets’ chances of winning by 2.7%. That’s not wood either.

What if the reliever you replaced DeGrom with was a replacement level pitcher – the worst pitcher in the major leagues? He allows around 113% of league average runs, or 4.6 RA9. Difference between DeGrom and him for 1.5 innings? 2.7% for a net loss of 3.1% by letting him hit rather than pinch hitting for him and letting the worst pitcher in baseball pitch the next 1.5 innings. If you told Collins, “Hey genius, if you pinch hit for DeGrom and let the worst pitcher in baseball pitch for another inning and a half instead of DeGrom, you will increase your chances of winning by 3.1%,” what do you think he would say?
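The “what if” scenarios above can be run side by side. The sketch below reuses the RA9 values, the 1.7 leverage, and the 5.8-point pinch-hitting gain from the text, and again assumes about 10 runs per win, which the text does not state. Because the quoted inputs are rounded, the output lands within a few tenths of a point of the figures in the prose rather than matching them exactly.

```python
RUNS_PER_WIN = 10.0     # assumption: rule-of-thumb runs-to-wins conversion
LEVERAGE = 1.7          # average leverage index quoted in the text
PINCH_HIT_GAIN = 5.8    # WE points gained at the plate by pinch hitting (from the text)


def pitching_gain(starter_ra9, reliever_ra9, innings):
    """WE points gained by keeping the starter in rather than using the reliever."""
    runs_saved = (reliever_ra9 - starter_ra9) / 9.0 * innings
    return 100.0 * runs_saved * LEVERAGE / RUNS_PER_WIN


# (starter RA9, reliever RA9, innings) for each scenario discussed in the text.
scenarios = {
    "baseline: 3.7 vs. average reliever 4.0, 1.5 IP": (3.7, 4.0, 1.5),
    "starter certain to go 2 innings": (3.7, 4.0, 2.0),
    "Kershaw-level starter, 2.9 third time through": (2.9, 4.0, 1.5),
    "replacement-level reliever, 4.6": (3.7, 4.6, 1.5),
}

net_by_scenario = {}
for label, (starter, reliever, innings) in scenarios.items():
    kept = pitching_gain(starter, reliever, innings)
    net_by_scenario[label] = PINCH_HIT_GAIN - kept
    print(f"{label}: starter adds {kept:.2f}, "
          f"pinch hitting still nets {net_by_scenario[label]:.2f}")
```

Every scenario leaves pinch hitting ahead by more than two and a half points, which is the “summary judgment” argument in numeric form.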

What if DeGrom were a good hitting pitcher? What if….?

You should be getting the picture. Allowing him to hit is so costly, assuming reasonable and average values for all the pertinent variables, that even if we are missing something in our model, or some of our numbers are a little off – even if we assume everything in the best possible light of allowing him to hit – the decision is a no-brainer in favor of a pinch hitter.

If Collins truly wanted to give his team the best chance of winning the game, or in the vernacular of ballplayers, putting his team in the best position to succeed, the clear and unequivocal choice was to lift DeGrom for a pinch hitter. It’s too bad that no one cares because the Mets ultimately won the game, which they were going to do at least 60% of the time anyway, regardless of whether Collins made the right or wrong decision.

The biggest loser, other than the Cubs, is Collins (I don’t mean he is a loser, as in the childish insult), because every time you use results to evaluate a decision and the results are positive, you deprive yourself of the opportunity to learn a valuable lesson. In this case, the analysis could have and should have been done before the game even started. All managers should know the importance of bringing in pinch hitters for pitchers in high leverage situations in important games, no matter how good the pitchers are or how well they are pitching in the game so far. Maybe someday they will.

As an addendum to my article on platoon splits from a few days ago, I want to give you a simple trick for answering a question about a player, such as, “Given that a player performs X in time period T, what is the average performance we can expect in the future (or present, which is essentially the same thing, or at least a subset of it)?” I also want to illustrate the folly of using unusual single-season splits for projecting the future.

The trick is to identify as many players as you can in some period of time in the past (the more, the better, but sometimes the era matters so you often want to restrict your data to more recent years) that conform to the player in question in relevant ways, and then see how they do in the future. That always answers your question as best as it can. The certainty of your answer depends upon the sample size of the historical performance of similar players. That is why it is important to use as many players and as many years as possible, without causing problems by going too far back in time.

For example, say you have a player whom you know nothing about other than that he hit .230 in one season of 300 AB. What do you expect that he will hit next year? Easy to answer. There are thousands of players who have done that in the past. You can look at all of them and see what their collective BA was in their next season. That gives you your answer. There are other more mathematically rigorous ways to arrive at the same answer, but much of the time the “historical similar player method” will yield a more accurate answer, especially when you have a large sample to work with, because it captures all the things that your mathematical model may not. It is real life! You can’t do much better than that!

You can of course refine your “similar players” comparative database if you have more information about the player in question. He is left-handed? Use only left-handers in your comparison. He is 25? Use only 25-year olds. What if you have so much information about the player in question that your “comp pool” starts to be too small to have a meaningful sample size (which only means that the certainty of your answer decreases, but not necessarily the accuracy)? Let’s say that he is 25, left-handed, 5’10” and 170 pounds, he hit .273 in 300 AB, and you want to include all of these things in your comparison. That obviously will not apply to too many players in the past. Your sample size of “comps” will be small. In that case, you can use players between the ages of 24 and 26, between 5’9” and 5’11”, weighing between 160 and 180, and hitting .265-.283 in 200 to 400 AB. It doesn’t have to be those exact numbers, but as long as you are not biasing your sample compared to the player in question, you should arrive at an accurate answer to your question.
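The “historical similar player method” amounts to a filter followed by a pooled average. The sketch below shows one way it could look. The `SeasonPair` record, the function name, and the three sample rows are all hypothetical; the rows are made-up numbers that only exercise the code, not real players.

```python
from dataclasses import dataclass


@dataclass
class SeasonPair:
    """One player's consecutive seasons (hypothetical record layout)."""
    age: int
    ab1: int   # year-1 at bats
    h1: int    # year-1 hits
    ab2: int   # year-2 at bats (0 if he did not play)
    h2: int    # year-2 hits


def collective_next_year_ba(pairs, ba_range, ab_range, age_range=None):
    """Pool every comp inside the ranges and return (year-2 BA, number of comps)."""
    comps = [
        p for p in pairs
        if p.ab2 > 0                                        # must have played again
        and ab_range[0] <= p.ab1 <= ab_range[1]
        and ba_range[0] <= p.h1 / p.ab1 <= ba_range[1]
        and (age_range is None or age_range[0] <= p.age <= age_range[1])
    ]
    total_ab2 = sum(p.ab2 for p in comps)
    if total_ab2 == 0:
        return None, 0
    # "Collective" BA: total hits over total at bats, not an average of averages.
    return sum(p.h2 for p in comps) / total_ab2, len(comps)


# Made-up rows for illustration only.
toy_pairs = [
    SeasonPair(age=25, ab1=300, h1=69, ab2=350, h2=84),
    SeasonPair(age=27, ab1=280, h1=63, ab2=300, h2=75),
    SeasonPair(age=30, ab1=500, h1=150, ab2=520, h2=150),   # outside the AB range
]

ba, n = collective_next_year_ba(toy_pairs, ba_range=(0.220, 0.240), ab_range=(200, 400))
print(n, round(ba, 3))
```

Widening or narrowing the ranges trades sample size against similarity, which is the tradeoff described above.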

What if we do that with a .230 player in 300 AB? I’ll use .220 to .240 and between 200 and 400 AB. We know intuitively that we have to regress the .230 towards the league average around 60 or 65%, which will yield around .245 as our answer. But we can do better using actual players and actual data. Of course our answer depends on the league average BA for our player in question and the league average BA for the historical data. Realistically, we would probably use something like BA+ (BA as compared to league-average batting average) to arrive at our answer. Let’s try it without that. I looked at all players who batted in that range from 2010-2014 in 200-400 AB and recorded their collective BA the next year. If I wanted to be a little more accurate (for this question it is probably not necessary), I might weight the results in year 2 by the AB in year 1, or use the delta method, or something like that.
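The “regress toward the league average” shortcut is a one-line formula. In the sketch below the league-average BA of .255 is an assumption, chosen because it is roughly what makes a 60 to 65% regression of .230 come out near .245; the text does not give the league figure.

```python
LEAGUE_BA = 0.255   # assumption: not stated in the text


def regress_toward_mean(observed, league_mean, regression):
    """Move `observed` the given fraction of the way toward `league_mean`."""
    return observed + regression * (league_mean - observed)


for pct in (0.60, 0.65):
    estimate = regress_toward_mean(0.230, LEAGUE_BA, pct)
    print(f"{pct:.0%} regression: {estimate:.3f}")
```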

If I do that for just 5 years, 2010-2015, I get 49 players who hit a collective .230 in year 1 in an average of 302 AB. The next year, they hit a collective .245, around what we would expect. That answers our question, “What would a .230 hitter in 300 AB hit next year, assuming he were allowed to play again (we don’t know from the historical data what players who were not allowed to play would hit)?”

What about .300 in 400 AB? I looked at all players from .280 to .350 in year 1 and between 300 and 450 AB. They hit a collective .299 in year 1 and .270 in year 2. Again, that answers the question, “What do we expect Player A to hit next year if he hit .300 this year in around 400 AB?”

For Siegrist with the -47 reverse split, we can use the same method to answer the question, “What do we expect his platoon split to be in the future given 230 TBF versus lefties in the past?” That is such an unusual split that we might have to tweak the criteria a little and then extrapolate. Remember that asking the question, “What do we expect Player A to do in the future?” is almost exactly the same thing as asking, “What is his true talent with respect to this metric?”

I am going to look at only one season for pitchers with around 200 BF versus lefties even though Siegrist’s 230 TBF versus lefties was over several seasons. It should not make much difference as the key is the number of lefty batters faced. I included all left-handed pitchers with at least 150 TBF versus LHB who had a reverse wOBA platoon difference of more than 10 points and pitched again the next year. Let’s see how they do, collectively, in the next year.

There were 76 such pitchers from 2003-2014. They had a collective platoon differential of -39 points, less than Siegrist’s -47 points, in an average of 194 TBF versus LHB, also less than Siegrist’s 231. But, we should be in the ballpark with respect to estimating Siegrist’s true splits using this “in vivo” method. How did they do in the next year, which is a good proxy (an unbiased estimate) for their true splits?

In year 2, they had an average TBF versus lefties of 161, a little less than the previous year, which is to be expected, and their collective platoon splits were plus 8.1 points. So they went from -39 to plus 8.1 in one season to the next because one season of reverse splits is mostly a fluke as I explained in my previous article on platoon splits. 21 points is around the average for LHP with > 150 TBF v. lefties in this time period, so these pitchers moved 47 points from year 1 to year 2, out of a total of 60 points from year 1 to league average. That is a 78% regression toward the mean, around what we estimated Siegrist’s regression should be (I think it was 82%). That suggests that our mathematical model is good since it creates around the same result as when we used our “real live players” method.

How much would it take to estimate a true reverse split for a lefty? Let’s look at some more numbers. I’ll raise the bar to lefty pitchers with at least a 20 point reverse split. There were only 57 in those 12 years of data. They had a collective split in year 1 of -47, just like Siegrist, in an average of 191 TBF v. LHB. How did they do in year 2, which is the answer to our question of their true split? Plus 6.4 points. That is a 78% regression, the same as before.

What about pitchers with at least a 25 point reverse split? They averaged -51 points in year 1. Can we get them to a true reverse split? Nope. Not even close.

What if we raise the sample size bar? I’ll do at least 175 TBF and -15 reverse split in year 1. Only 45 lefty pitchers fit this bill and they had a -43 point split in year 1 in 209 TBF v. lefties. Next year? Plus 2.8 points! Close but no cigar. There is of course an error bar around only 45 pitchers with 170 TBF v. lefties in year 2, but we’ll take those numbers on faith since that’s what we got. That is a 72% regression with 208 TBF v. lefties, which is about what we would expect given that we have a slightly larger sample size than before.
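The regression percentages quoted for these groups all come from the same ratio: how far the split moved from year 1 to year 2, divided by how far year 1 was from the league average of 21 points. Here is a short sketch using only the figures given above. The function name and the cohort labels are illustrative.

```python
LEAGUE_SPLIT = 21.0   # average platoon split, in wOBA points (from the text)


def regression_fraction(year1_split, year2_split, league_split=LEAGUE_SPLIT):
    """Share of the gap to league average that was closed between year 1 and year 2."""
    return (year2_split - year1_split) / (league_split - year1_split)


# (year-1 split, year-2 split) for the three cohorts with reported year-2 results.
cohorts = {
    "reverse split > 10 points, 150+ TBF": (-39.0, 8.1),
    "reverse split 20+ points, 150+ TBF": (-47.0, 6.4),
    "reverse split 15+ points, 175+ TBF": (-43.0, 2.8),
}

fractions = {label: regression_fraction(y1, y2) for label, (y1, y2) in cohorts.items()}
for label, frac in fractions.items():
    print(f"{label}: {frac:.1%}")
```

The first two cohorts come out between 78% and 79%, and the larger-sample cohort just under 72%, in line with the figures in the text.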

So please, please, please, when you see or hear of a pitcher with severe reverse splits in 200 or so BF versus lefties, which is around a full year for a starting pitcher or 2 or 3 years for a reliever, remember that our best estimate of his true platoon splits, or what his manager should expect when he sends him out there, is very, very different from what those actual one- or three-year splits suggest when those actual splits are very far away from the norm. Most of that unusual split, in either direction – almost all of it in fact – is likely a fluke. When we say “likely” we also mean that we must assume that it is a fluke and that we must also assume that the true number is the weighted mean of all the possibilities, which are those year 2 numbers, or year 1 (or multiple years) heavily regressed toward the league average.

 

With all the hullabaloo about Utley’s slide last night and the umpires’ calls or non-calls, including the one or ones in NY (whose names, addresses, telephone numbers, and social security numbers should be posted on the internet, according to Pedro Martinez), what was lost, or at least badly confused, was a discussion of the specific rule(s) that apply to that exact situation – the take-out slide that is, not whether Utley was safe or not on replay. For that you need to download the complete 2015 rule book, I guess. If you Google certain rule numbers, it takes you to the MLB “official rules” portion of their website, in which at least some of the rule numbers appear to be completely different from those in the actual current rule book.

In any case, last night after a flurry of tweets, Rob Neyer, from Fox Sports, pointed out the clearly applicable rule (although other rules come close): It is 5.09 (a) (13) in the PDF version of the current rulebook. It reads, in full:

The batter is out when… “A preceding runner shall, in the umpire’s judgment, intentionally interfere with a fielder who is attempting to catch a thrown ball or to throw a ball in an attempt to complete any play;”

That rule is unambiguous and crystal clear. 1) Umpire, in his judgment, determines that runner intentionally interferes with the pivot man. 2) The batter must be called out.

By the way, the runner himself may or may not be out. This rule does not address that. There is a somewhat common misperception that the umpire calls both players out according to this rule. Another rule might require the umpire to call the runner also out on interference even if he arrived before the ball/fielder or the fielder missed the bag – but that’s another story.

Keep in mind that if you ask the umpire, “Excuse me, Mr. umpire, but in your judgment, did you think that the runner intentionally interfered with the fielder,” and his answer is, “Yes,” then he must call the batter out. There is no more judgment. The only judgment allowed in this rule is whether the runner intentionally interfered or not. If the rule had said, “The batter may be called out,” then there would be two levels of judgment, presumably. There are other rules which explicitly say the umpire may do certain things, in which case there is presumably some judgment that goes into whether he decides to do them or not. Sometimes those rules provide guidelines for that judgment (the may part) and sometimes they do not. Anyway, this rule does not provide that may judgment. If the umpire thinks it is intentional interference, the batter (not runner) is automatically out.

So clearly the umpire should have called the batter out on that play, unless he could say with a straight face, “In my judgment, I don’t think that Utley intentionally interfered with the fielder.” That is not a reasonable judgment of course. Not that there is much recourse for poor or even terrible judgment. Judgment calls are not reviewable, I don’t think. Perhaps umpires can get together and overturn a poor judgment call. I don’t know.

But that’s not the end of the story. There is a comment to this rule which reads:

“Rule 5.09(a)(13) Comment (Rule 6.05(m) Comment): The objective of this rule is to penalize the offensive team for deliberate, unwarranted, unsportsmanlike action by the runner in leaving the baseline for the obvious purpose of crashing the pivot man on a double play, rather than trying to reach the base. Obviously this is an umpire’s judgment play.”

Now that throws a monkey wrench into this situation. Apparently this is where the “Runner must be so far away from the base that he cannot touch it in order for the ‘automatic double play’ to be called” rule (which I always thought was an unwritten rule) came from. Only it’s not a rule. It is a comment which clearly adds a wrinkle to the rule.

The rule is unambiguous. If the runner interferes with the fielder trying to make the play (whether he would have completed the DP or not), then the batter is out. There is no mention of where the runner has to be or not be. The comment changes the rule. It adds another requirement (and another level of judgment). The runner must have been “outside the baseline” in the umpire’s judgment. In addition, it adds some vague requirements about the action of the runner. The original rule says only that the runner must “intentionally interfere” with the fielder. The comment adds words that require the runner’s actions to be more egregious – deliberate, unwarranted, and unsportsmanlike.

But the comment doesn’t really require that to be the case for the umpire to call the batter out. I don’t think. It says, “The objective of this rule is to penalize the offensive team….” I guess if the comment is meant to clarify the rule, MLB really doesn’t want the umpire to call the batter out unless the requirements in the comment are met (runner out of the baseline and his action was not only intentional but deliberate, unwarranted, and unsportsmanlike, a higher bar than just intentional).

Of course the rule doesn’t need clarification. It is crystal clear. If MLB wanted to make sure that the runner is outside of the baseline and acts more egregiously than just intentionally, then they should change the rule, right? Especially if comments are not binding, which I presume they are not.

Also, the comment starts off with: “The objective of this rule is to…”

Does that mean that this rule is only to be applied in double play situations? What if a fielder at second base fields a ball, starts to throw to first base to retire the batter, and the runner tackles him or steps in front of the ball? Is rule 5.09(a)(13) meant to apply? The comment says that the objective of the rule is to penalize the offensive team for trying to break up the double play. In this hypothetical, there is no double play being attempted. There has to be some rule that applies to this situation? If there isn’t, then MLB should not have written in the comment, “The objective of this rule….”

There is another rule which also appears to clearly apply to a take-out slide at second base, like Utley’s, with no added comments requiring that the runner be out of the baseline, or that his actions be unwarranted and unsportsmanlike. It is 6.01(6). Or 7.09(e) on the MLB web site. In fact, I tweeted this rule last night thinking that it addressed the Utley play 100% and that the runner and the batter should have been called out.

“If, in the judgment of the umpire, a base runner willfully and deliberately interferes with a batted ball or a fielder in the act of fielding a batted ball with the obvious intent to break up a double play, the ball is dead. The umpire shall call the runner out for interference and also call out the batter-runner because of the action of his teammate.”

The only problem there is the words, “interferes with a batted ball or a fielder in the act of fielding a batted ball.” A lawyer would say that the plain meaning of the words precludes this from applying to an attempt to interfere with a middle infielder tagging second base and throwing to first, because he is not fielding or attempting to field a batted ball and the runner is not interfering with a batted ball. The runner, in this case, is interfering with a thrown ball or a fielder attempting to tag second and then make a throw to first.

So if this rule is not meant to apply to a take-out slide at second, what is it meant to apply to? That would leave only one thing really. A ground ball is hit in the vicinity of the runner and he interferes with the ball or a fielder trying to field the ball. But there also must be, “an obvious intent to break up a double play.” That is curious wording. Would a reasonable person consider that an attempt to break up a double play? Perhaps “obvious intent to prevent a double play.” Using the words break up sure sounds like this rule is meant to apply to a runner trying to take out the pivot man on a potential double play. But then why write “fielding a batted ball” rather than “making a play or a throw?”

A good lawyer working for the Mets would try and make the case that “fielding a batted ball” includes everything that happens after someone actually “fields the batted ball,” including catching and throwing it. In order to do so, he would probably need to find that kind of definition somewhere else in the rule book. It is a stretch, but it is not unreasonable, I don’t think.

Finally, Eric Byrnes, on MLB Tonight, had one of the more intelligent and reasonable comments regarding this play that I have ever heard from an ex-player. He said, and I paraphrase:

“Of course it was a dirty slide. But all players are taught to do whatever it takes to break up the DP, especially in a post-season game. Until umpires start calling an automatic double play on slides like that, aggressive players like Utley will continue to do that. I think we’ll see a change soon.”

P.S. For the record, since there was judgment involved, and judgment is supposed to represent fairness and common sense, I think that Utley should not have been ruled safe at second on appeal.

Postscript:

Perhaps comments are binding. From the foreword to the rules, on the MLB web site:

The Playing Rules Committee, at its December 1977 meeting, voted to incorporate the Notes/Case Book/Comments section directly into the Official Baseball Rules at the appropriate places. Basically, the Case Book interprets or elaborates on the basic rules and in essence have the same effect as rules when applied to particular sections for which they are intended.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits. That is to be expected, as platoon splits in general get heavily regressed toward the mean because they tend to be laden with noise, for two reasons. One, the samples are rarely large because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, fewer than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his actual platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large) simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.

Soler seems to be the obvious choice, so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrist has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrist 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrist at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrist against a right-handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrist may or may not be your ordinary left-handed pitcher. What if Siegrist actually has reverse splits? What if we expect him to pitch better against right-handed batters and worse against left-handed batters? In that case, Coghlan might actually be the better choice than Soler even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand. It would be exactly as if Coghlan or Soler were facing a RHP. Or maybe Siegrist has no splits – i.e. RH and LH batters of equal overall talent perform about the same. Or very small platoon splits compared to the average left-hander. So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrist no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrist does indeed have very small splits either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrist has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrist is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher’s true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that’s what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against for a pitcher is a worthless stat in and of itself (what, walks don’t count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one season splits for a batter or pitcher are not very useful for the reasons I explained above.

If you look at Siegrist’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrist is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrist to pitch to Coghlan knowing that Siegrist has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrist is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrist’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrist has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrist’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different from the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 

Coins

Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up heads or tails in one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be approximately a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.
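For anyone who wants to check that 5% figure, here is a minimal Python sketch. It assumes every flip is independent and the coin is exactly fair; the function name is made up for illustration.

```python
import math

def binomial_sd_of_rate(p, n):
    """Standard deviation of the observed success rate over n independent trials."""
    return math.sqrt(p * (1 - p) / n)

# A fair coin flipped 100 times
sd_100 = binomial_sd_of_rate(0.5, 100)
print(f"SD of heads rate over 100 flips: {sd_100:.3f}")
```

So a fair coin landing anywhere from about 45% to 55% heads in 100 flips is within one standard deviation, which is why the results are "all over the place."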

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin to be? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers’ platoon stats and, no matter what they are and over how many TBF, concluding that each pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can’t know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let’s get back to the coins to see why that is. Let’s say that we don’t have 100% fair coins. Our sly kid put in his piggy bank a bunch of trick coins, but not really, really tricky ones. Most are still 50/50, but some are 48/52, 52/48, fewer are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of “true coin talent” but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that. If your kid were smart enough to put in a normal distribution of “coin talent,” even one with a small spread, the further away from 50/50, the fewer coins there are. Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60. Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It’s called Bayesian inference, but you don’t need to worry about what it’s called or even how it technically works. It is true that a true 60/40 coin is much more likely to come up 60/40 heads in 100 flips than a 50/50 coin is. That should be obvious too. But here’s the catch. There are many, many more 50/50 coins in your kid’s piggy bank than there are 60/40. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chance of a 50/50 coin coming up exactly 60/40 is around .01. The chance of a true 60/40 coin coming up 60/40 is about 8 times that amount, or .08. But, if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then the chances of the coin that came up 60/40 being a fair coin or a 60/40 biased coin are only 50/50. If there are 800 times as many 50/50 coins as 60/40 coins in your bank, as there is likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin even though the coin came up 60 heads in 100 flips.
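The arithmetic in that paragraph can be reproduced with a short Python sketch. It pretends, as the paragraph does, that the bank holds only two kinds of coins (true 50/50 and true 60/40); the function names are hypothetical.

```python
from math import comb

def binom_pmf(heads, flips, p_heads):
    """Probability of exactly `heads` heads in `flips` independent flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

like_fair = binom_pmf(60, 100, 0.5)         # roughly .01
like_biased = binom_pmf(60, 100, 0.6)       # roughly .08
likelihood_ratio = like_biased / like_fair  # a bit under 8

def odds_fair_vs_biased(fair_coins_per_biased_coin):
    """Posterior odds (fair : biased) for a coin that showed 60 heads in 100 flips."""
    return fair_coins_per_biased_coin / likelihood_ratio

print(f"P(60 heads | fair)   = {like_fair:.4f}")
print(f"P(60 heads | 60/40)  = {like_biased:.4f}")
print(f"odds fair:biased with 8x as many fair coins   = {odds_fair_vs_biased(8):.1f} to 1")
print(f"odds fair:biased with 800x as many fair coins = {odds_fair_vs_biased(800):.0f} to 1")
```

The exact ratio comes out slightly below the rounded "8 times," so the odds land near, not exactly on, 1 to 1 and 100 to 1.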

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than that you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but about 10,000 of them (1 in 100) will have a false positive. So the odds that you have AIDS are 10,000 to 10, or 1000 to 1, against.
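The same head count, written out as a small Python sketch. The prevalence (1 in 100,000) and the error rates are the illustrative numbers from the paragraph above, not real medical data, and the variable names are made up.

```python
population = 1_000_000
prevalence = 1 / 100_000        # illustrative base rate from the text
false_positive_rate = 0.01      # 1 in 100 healthy people test positive
false_negative_rate = 0.0       # assumed: every true case tests positive

infected = population * prevalence
true_positives = infected * (1 - false_negative_rate)
false_positives = (population - infected) * false_positive_rate

# Chance that a positive test is a real case
prob_infected_given_positive = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"P(infected | positive) = {prob_infected_given_positive:.4f}")
```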

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its “true talent” may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as “an estimate of true talent” and is the crucial factor in making decisions based on what we think the talent of the coins/players is likely to be in the present and in the future.
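Here is one way that weighted mean could be computed, as a rough Python sketch. It assumes the piggy bank follows a normal-shaped distribution of coin talent centered on 50% with a standard deviation of 3% (the figure used a little further on), and it only considers coin types in 1% steps. Both choices, and the function name, are assumptions for illustration.

```python
from math import exp

def weighted_mean_talent(heads, flips, prior_mean=0.5, prior_sd=0.03):
    """Posterior mean of the true heads rate over a grid of possible coin types."""
    rates = [i / 100 for i in range(1, 100)]  # candidate true heads rates
    # How common each coin type is in the bank (normal-shaped, unnormalized)
    priors = [exp(-0.5 * ((p - prior_mean) / prior_sd) ** 2) for p in rates]
    # How likely each coin type is to produce the observed flips
    # (the binomial coefficient is the same for every type, so it cancels out)
    likelihoods = [p**heads * (1 - p) ** (flips - heads) for p in rates]
    weights = [pr * lk for pr, lk in zip(priors, likelihoods)]
    return sum(p * w for p, w in zip(rates, weights)) / sum(weights)

for heads in (60, 55, 30):
    estimate = weighted_mean_talent(heads, 100)
    print(f"{heads} heads in 100 flips -> estimated true heads rate {estimate:.3f}")
```

Under these assumptions the estimates land in the same neighborhood as the rough figures quoted above. A different assumed spread would move them.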

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. That result is about 24,000 times more likely with a 60/40 coin than with a 50/50 coin! So now we are much more certain that we have a true 60/40 coin even if we don’t have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about half certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be 55%.
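The 24,000 figure is easy to verify. This short Python sketch works in logarithms so the very small probabilities don't cause trouble; the function name is hypothetical.

```python
from math import exp, log

def likelihood_ratio(heads, flips, p_a, p_b):
    """How many times more likely the observed flips are under coin A than coin B."""
    tails = flips - heads
    log_ratio = heads * log(p_a / p_b) + tails * log((1 - p_a) / (1 - p_b))
    return exp(log_ratio)

ratio_100 = likelihood_ratio(60, 100, 0.6, 0.5)   # 60/40 result in 100 flips
ratio_500 = likelihood_ratio(300, 500, 0.6, 0.5)  # 60/40 result in 500 flips

print(f"100 flips: 60/40 coin is {ratio_100:,.1f}x more likely than a fair coin")
print(f"500 flips: 60/40 coin is {ratio_500:,.0f}x more likely than a fair coin")
```

Five times the flips raises the ratio to the fifth power, which is why the evidence piles up so quickly.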

There is a much easier way to do it. We have to do some math gyrations which I won’t go into that will enable us to figure out how much to regress our observed flip percentage to the mean flip percentage of all the coins, 50%. For 100 flips it was a large regression such that with a 60/40 result we might estimate a true flip talent of 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less towards 50% to give us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inferences using all the possible true talent states assumes a normal distribution or close to one.

The point is that the sample size of the observed measurement determines how much we regress the observed amount towards the mean. The larger the sample, the less we regress. One season observed splits and we regress a lot. Career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe a SD of 10 percent so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at true 60/40 and even some coins at 70/30 and 80/20. The chances that we have a truly biased coin when we get an unusual result are much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.
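Both rules can be seen side by side in a small Python sketch. It uses the shortcut stated later in this article (regression amount = random variance divided by total variance) and assumes a binomial model around a 50% mean; the function name is made up.

```python
def regression_fraction(n, talent_sd, p_mean=0.5):
    """Share of the gap between observed and mean that we give back to the mean."""
    random_var = p_mean * (1 - p_mean) / n   # noise from n flips
    talent_var = talent_sd ** 2              # real spread among the coins
    return random_var / (random_var + talent_var)

observed, mean = 0.60, 0.50
for talent_sd in (0.0, 0.03, 0.10):          # no spread, small spread, large spread
    for n in (100, 500):
        r = regression_fraction(n, talent_sd)
        estimate = observed + r * (mean - observed)
        print(f"spread SD {talent_sd:.2f}, {n} flips: "
              f"regress {r:.0%}, estimate {estimate:.3f}")
```

With no spread the regression is 100% at any sample size. With a small spread the sample size matters a lot. With a large spread even 100 flips earns a fair amount of trust.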

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to the mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrist (I almost forgot what started this whole thing) his career total is 604 TBF, but that’s not his sample size for platoon splits because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and righties. If you don’t know what that means don’t worry about it. A shortcut is to use the lesser of the two which is almost always TBF versus lefties, or in Siegrist’s case, 231. That’s not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrist was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it’s not), and two, he has a really odd observed split of 47 points in reverse. That’s like flipping a coin 100 times and getting 70 heads and 30 tails or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.
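If you do want to see the harmonic mean, here is a minimal Python sketch. The 373 TBF versus righties is not stated in the article; it is inferred here by subtracting 231 from the 604 career total, so treat it as an assumption.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two sample sizes; it is pulled toward the smaller one."""
    return 2 * a * b / (a + b)

tbf_total = 604
tbf_vs_lhb = 231
tbf_vs_rhb = tbf_total - tbf_vs_lhb   # inferred by subtraction, not from the article

effective_n = harmonic_mean(tbf_vs_lhb, tbf_vs_rhb)
print(f"TBF vs LHB: {tbf_vs_lhb}, TBF vs RHB: {tbf_vs_rhb}")
print(f"harmonic mean: {effective_n:.0f}")
```

The harmonic mean sits much closer to the smaller number than a plain average would, which is why the "lesser of the two" shortcut is a reasonable and slightly conservative stand-in.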

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, are the only things that determine how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent are the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrist’s true platoon split if the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.
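In code, the whole operation is one line. This Python sketch uses the 50% regression purely as the hypothetical from the text; the function name is made up.

```python
def regress(observed, mean, regression):
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

league_lhp_split = 29   # average LHP platoon split, in points of wOBA

print(regress(-47, league_lhp_split, 0.5))   # -9.0
print(regress(-10, league_lhp_split, 0.5))   # 9.5, which the text rounds to +10
```

Same regression amount, different starting points, different answers. The regression itself never looks at how unusual the observed number is.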

How do we determine the spread of talent in a population, like platoon talent? That is actually easy but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season and the 300 is the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn’t matter. Nor does the number of opportunities. They don’t even have to be the same for all pitchers. It is just easier to explain that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial variance. .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that each player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore those. They really are small. We can of course estimate them too and throw them into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can now calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember we need only 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: spread of talent and sample size of the observed results. That gives us the regression amount. From that we merely move the observed result toward the mean by that amount, like I did above with Siegrist’s -47 points and the mean of +29 for a league-average LHP.
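Method one, written out as a short Python sketch. The two variances are the made-up BA numbers from the example above, so the output is only as good as those inputs.

```python
import math

observed_var = 0.00125    # variance in BA actually seen among the players (made up)
random_var = 0.000625     # variance in BA expected by chance alone (made up)

# observed variance = random variance + talent variance
talent_var = observed_var - random_var
talent_sd = math.sqrt(talent_var)

# regression amount = random variance / total (observed) variance
regression = random_var / observed_var

print(f"talent SD: {talent_sd:.3f} (in BA)")
print(f"regression amount at this sample size: {regression:.0%}")
```

With half of the observed variance coming from luck, a batter's observed BA at this sample size gets moved halfway back to the league mean.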

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or we can use an intra-class correlation (ICC), which is essentially the same thing but it correlates every PA (or whatever the opportunity is) to every other PA within a sample. When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
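The "little more math" can be sketched in Python as follows. It assumes the common shortcut in which the correlation at sample size n has the form r = n / (n + k) for some constant k. That form follows from the variance formula above if random variance shrinks in proportion to 1/n, but it is an assumption here, and the function names are made up.

```python
def shrinkage_constant(n, r):
    """Solve r = n / (n + k) for k, given a correlation r observed at sample size n."""
    return n * (1 - r) / r

def correlation_at(n, k):
    """Expected correlation (the share of the observed result we keep) at sample size n."""
    return n / (n + k)

def estimate_true_talent(observed, mean, n, k):
    """Regress an observed rate toward the mean by (1 - r) for this sample size."""
    r = correlation_at(n, k)
    return mean + r * (observed - mean)

k_ba = shrinkage_constant(500, 0.4)   # from the BA example: r = .4 at 500 AB

print(f"k for BA in this example: {k_ba:.0f} AB")
print(f".300 in 500 AB  -> {estimate_true_talent(0.300, 0.250, 500, k_ba):.3f}")
print(f".300 in 150 AB  -> {estimate_true_talent(0.300, 0.250, 150, k_ba):.3f}")
print(f".300 in 2000 AB -> {estimate_true_talent(0.300, 0.250, 2000, k_ba):.3f}")
```

A handy way to read k: it is the number of opportunities at which you regress exactly 50%.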

Method #1 tells us precisely what the spread in talent is. Method #2 tells us that implicitly, through the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.

Platoon

Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, AB, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample size, the larger the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size, but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

These are for an average of 200 TBF versus left-handers, so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrist’s -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrist’s true platoon split. The fact that -47 points is very far from the average left-handed pitcher’s +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we “trust it more.” If the regression were 99% then whether the observed were -47 or +10, we would arrive at nearly the same answer. Don’t confuse the regression with the observed result. One has nothing to do with the other. And don’t think in terms of “trusting” the observed result or not. Regress the result and that’s your answer. If you arrive at answer X it makes no difference whether your starting point, the observed result, was B or C. None whatsoever. That is a very important point. I don’t know how many times I have heard, “But he had a 47 point reverse split in his entire career! You can’t possibly be saying that you estimate his real split to be +10 or +12 or whatever it is.” Yes, that’s exactly what I’m saying. A +10 estimated split is exactly the same whether the observed split were -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrist is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember the +10 is a weighted mean, which means that it is in the middle of our error bars. The answer to that is, “No, no and no!” Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution should rarely come into play. Basically the noise in the result of a sample of 1 is so large that it doesn’t matter at all what the uncertainty level of your estimates is.

So what do we estimate Siegrist’s true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed split 1 minus 230/1255, or .82 (82%), towards the league average for lefty pitchers, which is around 29 points. The distance from -47 to +29 is 76 points, and 82% of 76 points is 62 points. So we regress his -47 points 62 points in the plus direction, which gives us an estimate of +15 points in true platoon split. That is half the split of an average LHP, but it is plus nonetheless.
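For anyone who wants to check that arithmetic, here is a small Python sketch. It takes the 230 and 1255 straight from the paragraph above. The function name and its arguments are just labels I made up for illustration.

```python
def estimate_true_split(observed, mean, n, n_plus_k):
    """Regress an observed split toward the population mean.

    n / n_plus_k is the share of the observed deviation we keep;
    1 minus that is the regression amount.
    """
    weight = n / n_plus_k
    regression = 1 - weight
    estimate = mean + weight * (observed - mean)
    return estimate, regression

# Observed -47 point split, league-average LHP split of +29 points
est, reg = estimate_true_split(observed=-47, mean=29, n=230, n_plus_k=1255)

print(f"regression amount: {reg:.0%}")
print(f"estimated true split: {est:+.0f} points")
```

Keeping 18% of the 76 point gap and regressing 82% of it are the same operation, which is why the function works from the mean outward rather than from the observed split inward.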

That means that a left-handed hitter like Coghlan will hit better against Siegrist than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrist’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrist as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrist is small.

We also have another method for better estimating true platoon splits for pitchers, which can be used to enhance the method based on sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It’s not like pitchers deliberately throw better or harder to one side or the other or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things. One is arm angle. If you’ve ever played or watched baseball that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrist? His arm angle is estimated by Jared Cross of Steamer, using PITCHf/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrist’s 2015 numbers into that regression equation, we get an estimated platoon split from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he has thrown around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But, the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.
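To see the two estimates side by side, here is a quick Python sketch. The means and regression percentages are the ones quoted in the text (the 92% is itself only approximate), and the `regress` helper is a hypothetical name of mine.

```python
def regress(observed, mean, regression):
    """Move `observed` toward `mean` by the fraction `regression`."""
    return observed + regression * (mean - observed)

observed_split = -47  # observed reverse split, in points

# Average-LHP mean of +29 points with 82% regression
league_est = regress(observed_split, mean=29, regression=0.82)

# Arm-angle and repertoire mean of +14 points with roughly 92% regression
specific_est = regress(observed_split, mean=14, regression=0.92)

print(f"toward the average LHP:            {league_est:+.0f} points")
print(f"toward the arm angle/repertoire mean: {specific_est:+.0f} points")
```

The narrower population pulls the estimate harder toward its own mean, so the two changes (lower mean, more regression) partly offset each other.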

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn’t want to piss off Lester, his teammates, and possibly the fan base. Who knows?