
Last night in game 4 of the 2017 World Series, the Astros manager, A.J. Hinch, sort of a sabermetric wunderkind, at least as far as managers go (the Astros are one of the more, if not the most, analytically oriented teams), brought in their closer, Ken Giles, to pitch the 9th in a tie game. This is standard operating procedure for the sabermetrically inclined team – bring in your best pitcher in a tie game in the 9th inning or later, especially if you're the home team, where you'll never have the opportunity to protect a lead. The reasoning is simple: you want to guarantee that you'll use your best pitcher in the 9th or a later inning, in a high-leverage situation (in the 9th or later inning of a tie game, the LI is always at least 1.73 to start the inning).

So what's the problem? Hinch did exactly what he was supposed to do. It is more or less the optimal move, although it depends a bit on the quality of that closer against the batters he's going to face, as opposed to the alternative (as well as other bullpen considerations). In this case, it was Giles versus, say, Devenski. Let's look at their (my) normalized (4.00 is average) runs allowed per 9 innings projections:

Devenski: 3.37

That’s a very good reliever. That’s closer quality although not elite closer quality.

Giles: 2.71

That is an elite closer. In fact, I have Giles as the 6th best closer in baseball. The gap between the two pitchers is pretty substantial, .66 runs per 9 innings. For one inning with a leverage index (LI) of 2.0, that translates to a 1.5% win expectancy (WE) advantage for Giles over Devenski. As one-decision "swings" (the difference between the optimal and a sub-optimal move) go, that's considered huge. Of course, if you stay with Devenski for another inning or two and still plan to use Giles later (assuming the game goes that long), you get some of that WE back. Not all of it (because Giles may never get to pitch), but some of it. Anyway, that's not really the issue I want to discuss.
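For the curious, here is a back-of-the-envelope version of that calculation. It's a minimal sketch assuming the common rule of thumb of roughly 10 runs per win; that conversion is my assumption, not a figure from the post:

```python
# Rough check of the ~1.5% win expectancy swing between Giles and Devenski.
# The ~10 runs-per-win conversion is a common rule of thumb assumed here.
giles_ra9 = 2.71
devenski_ra9 = 3.37
leverage_index = 2.0
runs_per_win = 10.0

runs_saved_per_inning = (devenski_ra9 - giles_ra9) / 9       # ~0.073 runs
we_gain = runs_saved_per_inning / runs_per_win * leverage_index

print(f"win expectancy gain: {we_gain:.2%}")                 # ~1.5%
```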

Why were many of the so-called sabermetric writers (they often know just enough about sabermetrics or mathematical/logical thinking in general to be "dangerous," although that's a bit unfair on my part – let's just say they know enough to be "right" much of the time, but "wrong" some of the time) aghast at, or at least critical of, this seemingly correct move?

First, it was of course due to the result, which belies the fact that these are sabermetric writers. The first thing they teach you in sabermetrics 101 is not to be results oriented. For the most part, the results of a decision have virtually no correlation with the "correctness" of the decision itself. Sure, some of them will claim that they thought or even publicly said beforehand that it was the wrong move, and some of them are not lying – but it doesn't really matter. That's only one reason why lots of people were complaining about this move – maybe even the secondary reason (or not the reason at all), especially for the saber-writers.

The primary reason (or at least the stated one – I'm 100% certain that the result strongly influenced nearly all of the detractors) was that these naysayers had little or no confidence in Giles going into this game. He must have had a bad season, right, despite my stellar projection? After all, good projection systems use 3, 4 or more years of data along with a healthy dose of regression, especially with relievers, who never have a large sample size of innings pitched or batters faced. Occasionally you can have a great projection for a player who had a mediocre or poor season, and that projection will be just as reliable as any other (because the projection model accurately includes the current season, but doesn't give it as much weight as nearly all fans and media do). So what were Giles' 2017 numbers?

Only a 2.30 ERA and 2.39 FIP in a league where the average ERA was 4.37! His career ERA and FIP are 2.43 and 2.25, and he throws 98 mph. He’s a great pitcher. One of the best. There’s little doubt that’s true. But….

He’s thrown terribly thus far in the post-season. That is, his results have been poor. In 7.2 IP his ERA is 11.74. Of course he’s also struck out 10 and has a BABIP of .409. But he “looked terrible” these naysayers keep saying. Well, no shit. When you give up 10 runs in 7.2 innings on the biggest stage in sports, you’re pretty much going to “look bad.” Is there any indication, other than having poor results, that there’s “something wrong with Giles?” Given that his velocity is fine (97.9 so far) and that Hinch saw fit to remove Devenski who was “pitching well” and insert Giles in a critical situation, I think we can say with some certainty that there is no indication that anything is wrong with him. In fact, the data, such as his 12 K/9 rate, normal velocity, and an “unlucky” .409 BABIP, all suggest that there is nothing “wrong with him.” But honestly, I’m not here to discuss that kind of thing. I think it’s a futile and silly discussion. I’ve written many times how the notion that you can just tell (or that a manager can tell – which is not the case here, since Hinch was the one who decided to use him!) when a player is hot or cold by observing him is one of the more silly myths in sports, at least in baseball, and I have reams of data-driven evidence to support that assertion.

What I'm interested in discussing right now is, "What do the data say?" How do we expect a reliever to pitch after 6 or 7 innings or appearances in which he's gotten shelled? It doesn't have to be 7 IP of course, but for research like this, it doesn't matter. Whatever you find in 7 IP you're going to find in 5 IP or in 12 IP, assuming you have large enough sample sizes and you don't get really unlucky with a Type I or II error. The same goes for what constitutes getting shelled, however you perceive or define it. Again, you're going to get the same answer whether you define getting shelled (or pitching brilliantly) by wOBA against, runs allowed, hard-hit balls, FIP, etc. It also doesn't matter what thresholds you set – you'll likely get the same answer.

Here's what I did to answer this question – or at least to shed some light on it. I looked at all relievers over the last 10 years and split them up into three groups, depending on how they pitched over every 6-game sequence. Group I pitched brilliantly over a 6-game span: the criterion I set was a wOBA against of less than .175. Group III were pitchers who got hammered over a 6-game stretch, at least as far as wOBA was concerned (of course, in large samples you will get equivalent RA for these wOBA): they allowed a wOBA of at least .450. Group II was all the rest. Here is what the groups looked like:

Group | Average wOBA against | Equivalent RA9
I     | .130                 | Around 0
II    | .308                 | Around 3
III   | .496                 | Around 10


Then I looked at their very next appearance. Again, I could have looked at their next 2 or 3 appearances but it wouldn’t make any difference (other than increasing the sample size – at the risk of the “hot” or “cold” state wearing off).
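For anyone who wants to see the mechanics, here is a minimal sketch of the grouping and next-appearance lookup described above. The per-appearance log, the simple averaging of per-game wOBA (the actual study pooled plate appearances across 10 years of relievers), and the made-up numbers are all my assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical per-appearance wOBA-against log for one reliever.
# Illustrative numbers only; the real study used 10 years of reliever data.
appearances = [0.210, 0.150, 0.120, 0.180, 0.160, 0.170, 0.290,
               0.480, 0.510, 0.460, 0.500, 0.470, 0.455, 0.320]

def classify(avg_woba):
    """Assign a 6-appearance span to a group by average wOBA against."""
    if avg_woba < 0.175:
        return "I"      # pitched brilliantly
    if avg_woba >= 0.450:
        return "III"    # got hammered
    return "II"         # everyone else

next_by_group = defaultdict(list)

# Slide a 6-appearance window and record the wOBA of the very next outing.
for i in range(len(appearances) - 6):
    window = appearances[i:i + 6]
    group = classify(sum(window) / 6)
    next_by_group[group].append(appearances[i + 6])

for group in ("I", "II", "III"):
    outings = next_by_group[group]
    if outings:
        print(f"Group {group}: mean wOBA in next appearance = "
              f"{sum(outings) / len(outings):.3f} ({len(outings)} outings)")
```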


Group | Average wOBA against | wOBA in next appearance
I     | .130                 | .307
II    | .308                 | .312
III   | .496                 | .317


While we certainly don’t see a large carryover effect, we do appear to see some effect. The relievers who have been throwing brilliantly continue to pitch 10 points better than the ones who have been getting hammered. 10 points in wOBA is equivalent to about .3 runs per 9 innings, so that would make a pitcher like Giles closer to Devenski, but still not quite there. But wait! Are these groups of pitchers of the same quality? No. The ones who were pitching brilliantly belong to a much better pool of pitchers than the ones who were getting hammered. Much better. This should not be surprising. I already assumed that when doing the research. How much better? Let’s look at their seasonal numbers (those will be a little biased because we already established that these groups pitched brilliantly or terribly for some period of time in the same season).

Group | Average wOBA against | wOBA in next appearance | Season wOBA
I     | .130                 | .307                    | .295
II    | .308                 | .312                    | .313
III   | .496                 | .317                    | .330


As you can see, our brilliant pitchers are much better than our terrible ones. Even if we were able to back out the bias (say, by looking at the previous season's wOBA), we still get .305 for the brilliant relievers and .315 for the hammered ones. In fact, we'll use those prior-season numbers instead.

Group | Average wOBA against | wOBA in next appearance | Prior season wOBA
I     | .130                 | .307                    | .305
II    | .308                 | .312                    | .314
III   | .496                 | .317                    | .315


Now that's brilliant. We do have some sampling error: the number of PA in the "next appearance" bucket for Groups I and III is around 40,000 each (SD of wOBA = 2 points). However, look at the "expected" wOBA against – essentially the pitchers' talent, the equivalent of Giles' and Devenski's projections – compared to their actual results. They are almost identical. Regardless of how a reliever has pitched in his last 6 appearances, he pitches exactly as his normal projection would suggest in that 7th appearance. The last 6 IP has virtually no predictive value, even at the extremes. I don't want to hear, "Well, he's really (really, really) been getting hammered – what about that, big shot?" Allowing a .496 wOBA is getting really, really, really hammered, and .130 is throwing almost no-hit baseball, so we've already looked at the extremes!
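As a quick sanity check on two numbers used above (the 10-points-of-wOBA ≈ 0.3 runs per 9 innings conversion and the 2-point sampling SD), here's a short sketch. The wOBA scale of ~1.25, ~38 batters faced per 9 innings, and a per-PA wOBA standard deviation of roughly 0.5 are all ballpark assumptions of mine, not figures from the post:

```python
import math

# 1) Convert a 10-point wOBA gap to runs per 9 innings.
woba_gap = 0.010      # 10 points of wOBA
woba_scale = 1.25     # runs per unit of wOBA, roughly
bf_per_9 = 38         # approximate batters faced per 9 innings

runs_per_9 = woba_gap / woba_scale * bf_per_9
print(f"10 points of wOBA is about {runs_per_9:.2f} runs per 9 innings")   # ~0.30

# 2) Sampling error of a group's "next appearance" wOBA with ~40,000 PA,
#    assuming a per-PA wOBA standard deviation of roughly 0.5.
per_pa_sd = 0.5
n_pa = 40_000

standard_error = per_pa_sd / math.sqrt(n_pa)
print(f"standard error of the group wOBA: {standard_error:.4f}")           # ~0.0025
```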

So, as you can clearly see, and exactly as you should have expected if you really understood sabermetrics (unlike some of these so-called saber-oriented writers and pundits who like to cherry-pick the sabermetric principles that suit their narratives and biases), 7 IP of pitching, compared to 150 or more, is almost worthless information. The data don't lie.

But you just know that something is wrong with Giles, right? You can just tell. You are absolutely certain that he’ll continue to pitch badly. You just knew that he was going to implode again last night (and you haven’t been wrong about that 90% of the time in your previous feelings). It’s all bullshit folks. But if it makes you feel smart or happy, it’s fine by me. I have nothing invested in all of this. I’m just trying to find the truth. It’s the nature of my personality. That makes me happy.


There's an article up on Fangraphs by Eno Sarris that talks about whether the pitch to Justin Turner in the bottom of the 9th inning in Game 2 of the 2017 NLCS was the "wrong" pitch to throw in that count (1-0) and situation (tie game, runners on first and second, two outs), given Turner's proclivities in that count. I won't go into the details of the article – you can read it yourself – but I do want to talk about what it means or doesn't mean to criticize a pitcher's pitch selection on one particular pitch, and how pitch selection even works in general.

Let's start with this – the basic tenet of pitching and pitch selection: every single situation calls for a pitch frequency matrix. One pitch is chosen randomly from that matrix according to the "correct" frequencies. The "correct" frequencies are those at which every pitch in the matrix yields exactly the same expected "result" (where the result is measured by the win expectancy impact of all the possible outcomes combined).

Now, obviously, most pitchers “think” they’re choosing one specific pitch for some specific reason, but in reality since the batter doesn’t know the pitcher’s reasoning, it is essentially a random selection as far as he is concerned. For example, a pitcher throws an inside fastball to go 0-1 on the batter. He might think to himself, “OK, I just threw the inside fastball so I’ll throw a low and away off-speed to give him a ‘different look.’ But wait, he might be expecting that. I’ll double up with the fastball! Nah, he’s a pretty good fastball hitter. I’ll throw the off-speed! But I really don’t want to hang one on an 0-1 count. I’m not feeling that confident in my curve ball yet. OK, I’ll throw the fastball, but I’ll throw it low and away. He’ll probably think it’s an off-speed and lay off of it and I’ll get a called strike, or he’ll be late if he swings.”

As you can imagine, there are an infinite number of permutations of ‘reasoning’ that a pitcher can use to make his selection. The backdrop to his thinking is that he knows what tends to be effective at 0-1 counts in that situation (score, inning, runners, outs, etc.) given his repertoire, and he knows the batter’s strengths and weaknesses. The result is a roughly game theory optimal (GTO) approach which cannot be exploited by the batter and is maximally effective against a batter who is thinking roughly GTO too.

The optimal pitch selection frequency matrix depends on the pitcher, the batter, the count, and the game situation. In that situation, with Lackey on the mound and Turner at the plate, it might be something like 50% 4-seam, 20% sinker, 20% slider, and 10% cutter. The numbers are irrelevant. Then a random pitch is selected according to those frequencies, where, for example, the 4-seamer is chosen two and a half times as often as the sinker or slider, and the sinker and slider each twice as often as the cutter.
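In code, that selection step is nothing more than a weighted random draw. Here's a minimal sketch using the illustrative Lackey-vs-Turner frequencies from the paragraph above (the weights are made-up numbers, not real scouting data):

```python
import random

# Hypothetical frequency matrix for this pitcher/batter/count/situation.
frequency_matrix = {"4-seam": 0.50, "sinker": 0.20, "slider": 0.20, "cutter": 0.10}

# Draw one pitch according to those frequencies.
pitches = list(frequency_matrix)
weights = list(frequency_matrix.values())
pitch = random.choices(pitches, weights=weights, k=1)[0]

print(f"pitch selected: {pitch}")
```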

Obviously doing that even close to accurately is impossible, but that’s essentially what happens and what is supposed to happen. Miraculously, pitchers and catchers do a pretty good job (really you just have to have a pretty good idea as to what pitches to throw, adjusted a little for the batter). At least I presume they do. It is likely that some pitchers and batters are better than others at employing these GTO strategies as well as exploiting opponents who don’t.

The more a batter likes (or dislikes) a certain pitch (in that count or overall), the less (or more) that pitch will be thrown. In order to understand why, you must understand that the result of a pitch is directly tied to the frequency at which it is thrown in a particular situation: the more often a batter sees a pitch, the better he does against it. For example, if Turner is particularly good against a sinker in that situation or in general, it might be thrown 10% rather than 20% of the time. The same is true for locations of course, which makes everything quite complex.

Remember that you cannot tell what types and locations of pitches a batter likes or dislikes in a certain count and game situation from his results! This is a very important concept to understand. The results of every pitch type and location in each count, game situation, and versus each pitcher (you would have to do a “delta method” to figure this) are and should be exactly the same! Any differences you see are noise – random differences (or the result of wholesale exploitative play or externalities as I explain below). We can easily prove this with an example.

Imagine that in all 1-0 counts, early in a game with no runners on base and 0 outs (we’re just choosing a ‘particular situation’ – which situation doesn’t matter), we see that Turner gets a FB 80% of the time and a slider 20% of the time (again, the actual numbers are irrelevant). And we see that Turner’s results (we have to add up the run or win value of all the results – strike, ball, batted ball out, single, double, etc.) are much better against those 80% FB than the 20% SL. Can we conclude that Turner is better against the FB in that situation?

No! Why is that? Because if we did, we would HAVE TO also conclude that the pitchers were throwing him too many FB, right? They would then reduce the frequency of the fastball. Why throw a certain pitch 80% of the time (or at all, for that matter) when you know that another pitch is better?

You would obviously throw it less often than 80% of the time. How much less? Well, say you throw it 79% and the slider 21%. You must be better off with that ratio (rather than 80/20) since the slider is the better pitch, as we just said for this thought exercise. Now what if the FB still yields better results for Turner (and it’s not just noise – he’s still better versus the FB when he knows it’s coming 79% of the time)? Well, again obviously, you should throw the FB even less often and the slider more often.

Where does this end? Every time we decrease the frequency of the FB, the batter gets worse at it since it’s more of a surprise. Remember the relationship between the frequency of a pitch and its effectiveness. At the same time, he gets better and better at the slider since we throw it more and more frequently. It ends at the point in which the results of both pitches are exactly equal. It HAS to. If it “ends” anywhere else, the pitcher will continue to make adjustments until an equilibrium point is reached. This is called a Nash equilibrium in game theory parlance, at which point the batter can look for either pitch (or any pitch if the GTO mixed strategy includes more than two pitches) and it won’t make any difference in terms of the results. (If the batter doesn’t employ his own GTO strategy, then the pitcher can exploit him by throwing one particular pitch – in which case he then becomes exploitable, which is why it behooves both players to always employ a GTO strategy or risk being exploited.) As neutral observers, unless we see evidence otherwise, we must assume that all actors (batters and pitchers) are indeed using a roughly GTO strategy and that we are always in equilibrium. Whether they are or they aren’t, to whatever degree and in whichever situations, it certainly is instructive for us and for them to understand these concepts.
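Here is a toy version of that adjustment process. The linear "the more a batter sees a pitch, the better he does against it" response curves are invented purely for illustration; the point is only that the iterative adjustment stops exactly where the two pitches yield the same result:

```python
# Toy model: the batter's expected run value against a pitch rises with how
# often he sees it. The specific curves below are invented for illustration.
def value_vs_fb(fb_freq):
    return 0.020 + 0.10 * fb_freq          # run value per pitch vs the fastball

def value_vs_sl(sl_freq):
    return 0.010 + 0.14 * sl_freq          # run value per pitch vs the slider

fb_freq = 0.80                              # start from the 80/20 mix in the text
for _ in range(20000):
    gap = value_vs_fb(fb_freq) - value_vs_sl(1.0 - fb_freq)
    if abs(gap) < 1e-9:
        break                               # indifference reached: equilibrium
    # If the batter does better against the FB, throw it a little less often.
    fb_freq = min(max(fb_freq - 0.01 * gap, 0.0), 1.0)

print(f"equilibrium fastball frequency: {fb_freq:.3f}")
print(f"value vs FB: {value_vs_fb(fb_freq):.4f}, vs SL: {value_vs_sl(1.0 - fb_freq):.4f}")
```

Starting from the 80/20 mix, the fastball frequency in this toy example settles around 54%, and at that point the two pitches are exactly equally effective, which is the equilibrium described above.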

Assuming an equilibrium, this is what you MUST understand: any differences you see in a batter's (or a pitcher's) results across different pitches MUST be noise – an artifact of random chance. Keep in mind that this is only true for each subset of identical circumstances – the same opponent, count, and game situation (even umpire, weather, park, etc.). If you look at the results across all situations you will see legitimate differences across pitch types. That's because they are thrown with different frequencies in different situations. For example, you will likely see better results for a pitcher with his secondary pitches overall, simply because he throws them more frequently in pitcher's counts (although this is somewhat offset by the fact that he throws them more often against better batters).

Is it possible that there are some externalities that throw this Nash equilibrium out of whack? Sure. Perhaps a pitcher must throw more FB than off-speed in order to prevent injury. That might cause his numbers with the FB to be slightly worse than with his other pitches. Or the slider may be particularly risky, injury-wise, such that pitchers throw it less often than GTO (game theory optimal) play would dictate, which results in better outcomes with it (from the pitcher's standpoint) than with his other pitches.

Any other deviations you see among pitch types and locations, by definition, must be random noise, or, perhaps exploitative strategies by either batters or pitchers (one is making a mistake and the other is capitalizing on it). It would be difficult to distinguish the two without some statistical analysis of large samples of pitches (and then we would still only have limited certainty with respect to our conclusions).

So, given all that is true, which it is (more or less), how can we criticize a particular pitch that a pitcher throws in one particular situation? We can’t. We can’t say that one pitch is “wrong” and one pitch is “right” in ANY particular situation. That’s impossible to do. We cannot evaluate the “correctness” of a single pitch. Maybe the pitch that we observe is the one that is only supposed to be thrown 5 or 10% of the time, and the pitcher knew that (and the batter was presumably surprised by it whether he hit it well or not)! The only way to evaluate a pitcher’s pitch selection strategy is by knowing the frequency at which he throws his various pitches against the various batters in the various counts and game situations. And that requires an enormous sample size of course.

There is an exception.

The one time we can say that a particular pitch is “wrong” is when that pitch is not part of the correct frequency matrix at all – i.e., the GTO solution says that it should never be thrown. That rarely occurs. About the only time that occurs is on 3-0 counts where a fastball might be the only pitch thrown (for example, 3-0 count with a 5 run lead, or even a 3-1 or 2-0 count with any big lead, late in the game – or a 3-0 count on an opposing pitcher who is taking 100% of the time).

Now, that being said, let's say that Lackey is supposed to throw his cutter away only 5% of the time against Turner. If we observe only that one pitch and it is a cutter, Bayes tells us that there is an inference that Lackey was intending to throw that pitch MORE than 5% of the time, and we can indeed say, with some small level of certainty, that he "threw the wrong pitch." We don't really mean he "threw the wrong pitch." We mean that we think (with some low degree of certainty) he had the wrong frequency matrix in his head to some significant degree (maybe he intended to throw that pitch 10% or 20% rather than 5%).*

So, the next time you hear anyone say what a pitcher should be throwing on any particular pitch, or that the pitch he threw was "right" or "wrong," it's a good bet that they don't really know what they're talking about, even if they are or were a successful major league pitcher.

* Technically, we can only say something like, “We are 10% sure he was thinking 5%, 12% sure he was thinking 7%, 13% sure he was thinking 8%, etc.” – numbers for illustration purposes only.
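To make the footnote concrete, here is a minimal Bayesian sketch. Both the candidate "intended frequencies" and the prior weights are made up for illustration; the point is simply that observing one cutter shifts belief toward the higher intended frequencies:

```python
# Prior belief over what cutter frequency the pitcher "had in his head."
# Candidate frequencies and prior weights are invented for illustration.
prior = {0.05: 0.60, 0.10: 0.25, 0.20: 0.10, 0.40: 0.05}

# Likelihood of seeing one cutter given each intended frequency is just that
# frequency; the posterior is proportional to prior * likelihood.
unnormalized = {freq: p * freq for freq, p in prior.items()}
total = sum(unnormalized.values())
posterior = {freq: w / total for freq, w in unnormalized.items()}

for freq, p in sorted(posterior.items()):
    print(f"intended cutter frequency {freq:.0%}: posterior probability {p:.2f}")
```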