Archive for the ‘Projections’ Category

Note: This post was edited to include some new data which leads us in the direction of a different conclusion. The addendum is at the end of the original post.

This is another one of my attempts at looking at “conventional wisdoms” that you hear and read about all the time without anyone stopping for a second to catch their breath and ask themselves, “Is this really true?” Or more appropriately, “To what extent is this true?” Bill James used those very questions to pioneer a whole new field called sabermetrics.

As usual in science, we can rarely if ever answer questions with, “Yes it is true,” or “No, it is not true.” We can only look at the evidence and try and draw some inferences with some degree of certainty between 0 and 100%. This is especially true in sports when we are dealing with empirical data and limited sample sizes.

You often read something like, “So-and-so pitcher had a poor season (say, in ERA) but he had a few really bad outings so it wasn’t really that bad.” Let’s see if we can figure out to what extent that may or may not be true.

First I looked at all starting pitcher outings over the last 40 years, 1977-2016. I created a group of starters who had at least 4 very bad outings and at least 100 IP in one season. A “bad outing” was defined as 5 IP or less and at least 6 runs allowed, so a minimum RA9 of almost 11 in at least 4 games in a season. Had those starts been typical starts, each of these pitchers’ ERA’s or RA9 would have been at least a run less or so.

Next I only looked at those pitchers who had an overall RA9 of at least 5.00 in the seasons in question. The average RA9 for these pitchers with some really bad starts was 5.51 where 4.00 is the average starting pitcher’s RA9 in every season regardless of the run environment or league. Basically I normalized all pitchers to the average of his league and year and set the average at 4.00. I also park adjusted everything.
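To make the normalization concrete, here is a minimal sketch. It assumes a simple ratio scaling to the league average, which is consistent with the later remark that a 5.51 RA9 is 138% of league average (5.51 / 4.00), but the exact method is not spelled out in the post. The function name, the `park_factor` argument and the example inputs are all hypothetical.

```python
def normalize_ra9(ra9, league_ra9, park_factor=1.0, target=4.00):
    """Scale a pitcher's RA9 so the league-average starter sits at `target`.

    park_factor is a hypothetical run multiplier for the pitcher's home
    park (1.0 = neutral); the post does not say how parks were handled.
    """
    park_neutral = ra9 / park_factor
    return park_neutral / league_ra9 * target


# Made-up example: a 6.20 RA9 in a league where starters average 4.50.
print(round(normalize_ra9(6.20, 4.50), 2))
```

With this kind of scaling, pitchers from high and low run environments can be pooled on a single scale before any projection is run.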

OK, what were these pitchers projected to do the following season? I used basic Marcel-type projections for all pitchers. The projections treated all RA9 equally. In other words a 5.51 RA with a few really bad starts was equivalent to a 5.51 RA with consistently below-average starts. The projections only used full season data (RA9).
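As a rough illustration of what a "Marcel-type" projection looks like, the sketch below takes a recency-weighted average of season-level RA9 and regresses it toward the league mean. The 3/2/1 weights, the 1,000 TBF of regression ballast and the sample seasons are assumptions made for the example, not the values used in this study, and aging is left out. The key point is that the only inputs are season totals, so the function cannot tell a few blowups from steady mediocrity.

```python
def marcel_type_ra9(seasons, league_ra9=4.00, ballast_tbf=1000,
                    weights=(3, 2, 1)):
    """Project next-season normalized RA9 from up to three past seasons.

    seasons: list of (ra9, tbf) tuples, most recent season first.
    weights, ballast_tbf: hypothetical recency weights and regression
    ballast; neither is taken from the post.
    """
    num = league_ra9 * ballast_tbf   # ballast pulls toward league average
    den = ballast_tbf
    for (ra9, tbf), w in zip(seasons, weights):
        num += w * tbf * ra9
        den += w * tbf
    return num / den


# Made-up pitcher: a 5.51 season preceded by two roughly average ones.
history = [(5.51, 700), (4.30, 750), (4.10, 800)]
print(round(marcel_type_ra9(history), 2))
```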

So basically these 5.51 RA9 pitchers pitched near average for most of their starts but had 4-6 really bad (and short) starts that upped their overall RA9 for the season by more than a run. Which was more indicative of their true talent? The vast majority of the games where they pitched around average, the few games where they blew up, or their overall runs allowed per 9 innings? Or, their overall RA9 for that season (regardless of how it was created) plus their RA9 from previous seasons and then some regression thrown in for good measure – in other words, a regular, old-fashioned projection?

Our average projection for these pitchers for the next season (which is an estimate of their true talent that season) was 4.46. How did they pitch the next season – which is an unbiased sample of their true talent (I didn’t set an innings requirement for this season so there is no survivorship bias)? It was 4.48 in 10,998 TBF! So the projection which had no idea that these were pitchers who pitched OK for most of the season but had a terrible seasonal result (5.51 RA9) because of a few terrible starts, was right on the money. All the projection model knew was that these pitchers had very bad RA9 for the season – in fact, their average RA was 138% of league average.

Of course since we sampled these pitchers based on some bad outings and an overall bad ERA (over 5.00) we know that in prior seasons their RA9 would be much lower, similar to their projection (4.46) – actually better. In fact, you should know that a projection can apply just as well to previous years as it can to subsequent years. There is almost no difference. You just have to make sure you apply the proper age adjustments.

Somewhat interestingly, if we look at all pitchers with a RA9 above 5 (an average of 5.43) who did not have the requisite very bad outings, i.e. they pitched consistently bad but with few disastrous starts, their projected RA9 was 4.45 and their actual was 4.25, in 25,479 TBF.

While we have significant sample error in these limited samples, not only is there no suggestion that you should ignore or even discount bad ERA or RA that are the result of a few horrific starts, there is a (admittedly weak) suggestion that pitchers who pitch badly but more consistently may be able to outperform their projections for some reason.

The next time you read that, “So-and-so pitcher has bad numbers but it was only because of a few really bad outings,” remember that there is no evidence that an ERA or RA which includes a “few bad outings” should be treated any differently than a similar ERA or RA without that qualification, at least as far as projections are concerned.

Addendum: I was concerned about the way I defined pitchers who had “a few disastrous starts.” I included all starters who gave up at least 6 runs in 5 innings or less at least 5 times in a season. The average number of bad starts was 5.5. So basically these were mostly pitchers who had 5 or 6 really bad starts in a season, occasionally more.

I thought that most of the time when we hear the “A few bad starts” refrain, we’re talking literally about “a few bad starts,” as in 2 or 3. So I changed the criteria to include only those pitchers with 2 or 3 awful starts. I also upped the ante on those terrible starts. Before it was > 5 runs in 5 IP or less.  Now it is >7 runs in 5 IP or less – truly a blowup of epic proportions. We still had 508 pitcher seasons that fit the bill which gives us a decent sample size.

These pitchers overall had a normalized (4.00 is average) RA9 of 4.19 in the seasons in question, so 2 or 3 awful starts didn’t produce such a bad overall RA. Remember I am using a 100 IP minimum so all of these pitchers pitched at least fairly well for the season whether they had a few awful starts or not. (This is selective sampling and survivorship bias at work. Any time you set a minimum IP or PA, you select players who had above average performance, through luck and talent.)

Their next year’s projection was 3.99 and the actual was 3.89 so there is a slight inference that indeed you can discount the bad starts a little. This is in around 12,000 IP. A difference of .1 RA9 is only around 1 SD so it’s not nearly statistically significant. I also don’t know that we have any Bayesian prior to work with.

The control group – all other starters, namely those without 2 or 3 awful outings – had a RA9 in the season in question of 3.72 (compare to 4.19 for the pitchers with 2 or 3 bad starts). Their projection for the next season was 3.85 and actual was 3.86. This was in around 130,000 IP so 1 SD is now around .025 runs so we can be pretty confident that the 3.86 actual RA9 reflects their true talent within around .05 runs (2 SD) or so.

What about starters who not only had 2 or 3 disastrous starts but also had an overall poor RA9? In the original post I looked at those pitchers in our experimental group who also had a seasonal RA9 of > 5.00. I’ll do the same thing with this new experimental group – starters with only 2 or 3 very awful starts.

Their average RA9 for the experimental season was 5.52. Their projection was 4.45 and actual was 4.17, so now we have an even stronger inference that a bad season caused by a few bad starts creates a projection that is too pessimistic; thus maybe we should discount those few bad starts. We only have around 1600 IP (in the projected season) for these pitchers so 1 SD is around .25 runs. A difference between projected and actual of .28 runs is once again not nearly statistically significant. There is, nonetheless, a suggestion that we are on to something. (Don’t ever ignore – assume it’s random – an observed effect just because it isn’t statistically significant – that’s poor science.)

What about the control group? Last time we noticed that the control group’s actual RA was less than its projection for some reason. I’ll look at pitchers who had > 5 RA9 in one season but were not part of the group that had 2 or 3 disastrous starts.

Their average RA9 was 5.44 – similar to the 5.52 of the experimental group. Their projected was 4.45 and actual was 4.35, so we see the same “too high” projection in this group as well. (In fact, in testing my RA projections based on RA only – as opposed to say FIP or ERC – I find an overall bias such that pitchers with a one-season high RA have projections that are too high, not a surprising result actually.) This is in around 7,000 IP which gives us a SD of around .1 runs per 9.

So, the “a few bad starts” group outperformed their projections by around .1 runs. This same group, limiting it to starters with an overall RA of over 5.00, outperformed their projections by .28 runs. The control group with an overall RA also > 5.00 outperformed their projections by .1 runs. None of these differences are even close to statistically significant.
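The three comparisons above can be put on a common scale with a quick sketch. It assumes the sampling error of RA9 shrinks with the square root of innings, using a constant `k = 10` that is simply back-fitted from the figures quoted in the post (about .25 runs at 1,600 IP); that constant and the function name are assumptions, not part of the original analysis.

```python
import math


def sd_ra9(ip, k=10.0):
    """Approximate 1-SD sampling error of RA9 over `ip` innings.

    k is a hypothetical constant back-fitted from the post's own
    figures (about .25 runs at 1,600 IP).
    """
    return k / math.sqrt(ip)


# (projected RA9, actual RA9, approximate IP in the projected season)
comparisons = {
    "2-3 awful starts": (3.99, 3.89, 12_000),
    "2-3 awful starts, RA9 > 5": (4.45, 4.17, 1_600),
    "control, RA9 > 5": (4.45, 4.35, 7_000),
}

for label, (proj, actual, ip) in comparisons.items():
    sd = sd_ra9(ip)
    z = (proj - actual) / sd
    print(f"{label}: diff={proj - actual:.2f}, sd={sd:.2f}, z={z:.1f}")
```

Under this assumption each gap is roughly one standard deviation or less, which is why none of them can be called significant on its own.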

Let’s increase the sample size a little of our experimental group who also had particularly bad RA overall by expanding it to starters with an overall RA of > 4.50 rather than > 5.00. We now have 3,500 IP, about 2x as many, which reduces our error by around 30% (sampling error shrinks with the square root of the sample size). The average RA9 of this group was 5.13. Their projected RA was 4.33 and actual was 4.05 – exactly the same difference as before. Keep in mind that the more samples we look at the more we are “data mining,” which is a bit dangerous in this kind of research.

A control group of starters with > 4.50 RA had an overall RA9 of 4.99. Their projection was exactly the same as the experimental group, 4.33, but their actual was 4.30 – almost exactly the same as their projection.

In conclusion, while we initially found no evidence that discounting a bad ERA or RA caused by “several very poor starts” is warranted when doing a projection for starters with at least 100 IP, once we change the criteria for “a few bad starts” from “at least 5 starts with 6 runs or more allowed in 5 IP or less” to “exactly 2 or 3 starts with 8 runs or more in 5 IP or less” we do find evidence that some kind of discount may be necessary. In other words, for starters whose runs allowed are inflated due to 2 or 3 really bad starts, if we simply use overall season RA or ERA for our projections we will overstate their subsequent season’s RA or ERA by maybe .2 or .3 runs per 9.

Our certainty of this conclusion, especially with regard to the size of the effect – if it exists at all – is pretty weak given the magnitude of the differences we found and the sample sizes we had to work with. However, as I said before, it would be a mistake to ignore any inference – even a weak one – that is not contradicted by some Bayesian prior (or common sense).

 


Now that Adam Eaton has been traded from the White Sox to the Nationals much has been written about his somewhat unusual “splits” in his outfield defense as measured by UZR and DRS, two of the more popular batted-ball defensive metrics. In RF, his career UZR per 150 games is around +20 runs and in CF, -8 runs. He has around 100 career games in RF and 300 in CF. These numbers do not include “arm runs” as I’m going to focus only on range and errors in this essay. If you are not familiar with UZR or DRS you can do some research on the net or just assume that they are useful metrics for quantifying defensive performance and for projecting defense.

In 2016 Eaton was around -13 in CF and +20 in RF. DRS was similar but with a narrower (but still unusual) spread. We expect that a player who plays at both CF and the corners in a season or within a career will have a spread of around 5 or 6 runs between CF and the corners (more between CF and RF than between CF and LF). For example, a CF’er who has a UZR of zero and thus is exactly average among all CF’ers, will have a UZR at around +5.5 at the corners, again a bit more in RF than LF (LF’ers are better fielders than RF’ers).

This has nothing to do with how “difficult” each position is (that is hard to define anyway – you could even make the argument that the corner positions are “harder” than CF), as UZR and DRS are calculated as runs above or below the average fielder at that position. It merely means that the average CF’er is a better fielder than the average corner OF’er by around 5 or 6 runs. Mostly they are faster. The reason teams put their better fielder in CF is not because it is an inherently more “difficult” position but because it gets around twice the number of opportunities per game than the corner positions such that you can leverage talent in the OF.

Back to Eaton. He appears to have performed much better in RF than we would expect given his performance in CF (or vice versa) or even overall. Does this mean that he is better suited to RF (and perhaps LF, where he hasn’t played much in his career) or that the big, unusual gap we see is just a random fluctuation, or somewhere in the middle as is often (usually) the case? Should the Nationals make every effort to play him in RF and not CF? After all, their current RF’er, Harper, has unusual splits too, but in the opposite direction – his career CF UZR is better than his career RF UZR! Or perhaps the value they’re getting from Eaton is diminished if they’re going to play him in CF rather than RF.

How could it be that a fielder could have such unusual defensive splits and it be solely or mostly due to chance only? The same reason a hitter can have unusual but random platoon splits or a pitcher can have unusual but random home/road or day/night splits. A metric like UZR or DRS, like almost all metrics, contains a large element of chance, or noise if you will. That noise comes from two sources – one is because the data and methodology are far from perfect and two is that actual defensive performance can fluctuate randomly (or for reasons we are just not aware of) from one time period to another – from play to play, game to game, or position to position, for various reasons or for no reason at all.

To the first point, just because our metric “says” that a player was +10 in UZR that does not necessarily mean that he performed exactly that well. In reality, he might have performed at a +15 level or he might have performed at a 0 or even a -10 level. It’s more likely of course that he performed at +5 than +20 or 0, but because of the limits of our data and methodology, the +10 is only an estimate of his performance. To the second point, actual fielding performance, even if we could measure it precisely, like hitting and pitching, is subject to random fluctuations for reasons known (or at least speculated) and unknown to us. On one play a player can get a great jump and make a spectacular play and on another that same player can take a bad route, get a bad jump, the ball can pop out of his glove, etc. Some days fielders probably feel better than others. Etc.

So whenever we compare one time period to another or one position to another, even ones which require similar, perhaps even identical, skills, like in the OF, it is possible, even likely, that we are going to get different results by chance alone, or at least because of the two dynamics I explained above (don’t get hung up on the words “luck”, “chance” or “random”). Statistics tell us that those random differences will be more and more unlikely the further away we get from what is expected (e.g., we expect that play in CF will be 5 or 6 runs “worse” than play in RF or LF). However, statistics also tell us that any difference, even large ones like we see with Eaton (or more), can and do occur by chance alone.

At the same time, it is possible, maybe even likely, that a player could somehow be more suited to RF (or LF) than CF, or vice versa. So how do we determine how much of an unusual “split” in OF defense, for example, is likely chance and how much is likely “skill?” In other words, what would we expect future defense to be in RF and in CF for a player with unusual RF/CF splits? Remember that future performance always equates to an estimate of talent, more or less. For example, if we find strong evidence that almost all of these unusual splits are due to chance alone (virtually no skill), then we must assume that in the future the player with the unusual splits will revert to normal splits in any future time frame. In the case of Eaton that would mean that we would construct an OF projection based on all of his OF play, adjusted for position, and then do the normal adjustment for our CF or RF projection, such that his RF projection will be around 7 runs greater than his CF projection rather than the 20 run or more gap that we see in his past performance.

To examine this question, I looked at all players who played at least 20 games in CF and RF or LF from 2003 through 2015. I isolated those with various unusual splits. I also looked at all players to establish a baseline. At the same time, I crafted a basic one-season Marcel-like projection from that CF and corner performance combined. The way I did that was to adjust the corners to represent CF by subtracting 4 runs from LF UZR and 7 runs from RF UZR. Then I regressed that number based on the number of total games in that one season, added in an aging factor (-.5 runs for players under 27 and -1.5 runs for players 27 and older), and the resulting number was a projection for CF.

We can then take that number and add 4 runs for a LF projection and 7 runs for a RF projection. Remember these are range and errors only (no arm). So, for example, if a player was -10 in CF per 150 in 50 games and +3 in RF in 50 games, his projection would be:

Subtract 7 runs from his RF UZR to convert into “CF UZR”, so it’s now -4. Average that with his -10 UZR in CF, which gives him a total of -7 runs per 150 in 100 games. I am using 150 games as the 50% regression point so we regress this player 150/(150+100) or 60% toward a mean of -3 (because these are players who play both CF and corner, they are below average CF’ers). That comes out to .4 × -7 + .6 × -3 = -4.6. Add in an aging factor, say -.5 for a 25-year old, and we get a projection of -5.1 for CF. That would mean a projection of -1.1 in LF, a +4 run adjustment, and +1.9 in RF, a +7 run adjustment, assuming normal “splits.”
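The same steps can be written out as a short sketch, using the inputs exactly as stated in the text: a -4/-7 run conversion for LF/RF, a 150-game regression point, a mean of -3 and the two aging values. The function and variable names are made up for illustration, and weighting the positions by games (which equals the simple average when games are equal) is an assumption.

```python
# Runs per 150 games added to a corner UZR to express it as "CF UZR".
CORNER_TO_CF = {"CF": 0.0, "LF": -4.0, "RF": -7.0}


def project_cf_uzr(lines, age, regress_games=150, mean_cf=-3.0):
    """One-season CF projection from (position, uzr_per_150, games) lines."""
    games = sum(g for _, _, g in lines)
    # Convert every line to a CF-equivalent rate, then weight by games.
    combined = sum((uzr + CORNER_TO_CF[pos]) * g for pos, uzr, g in lines) / games
    # Share of the observed rate that is pulled toward the mean.
    regress = regress_games / (regress_games + games)
    regressed = (1 - regress) * combined + regress * mean_cf
    aging = -0.5 if age < 27 else -1.5
    return regressed + aging


# The player from the example: -10 in CF and +3 in RF, 50 games each, age 25.
cf = project_cf_uzr([("CF", -10.0, 50), ("RF", 3.0, 50)], age=25)
lf, rf = cf + 4.0, cf + 7.0
print(round(cf, 1), round(lf, 1), round(rf, 1))
```

Note that the function never looks at how the runs were distributed between positions once they are converted, which is exactly the "no true talent split" assumption being tested.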

So let’s look at some numbers. To establish a baseline and test (and calibrate) our projections, let’s look at all players who played CF and LF or RF in season one (min 20 games in each) and then their next season in either CF or the corners:

Position | UZR season one | UZR season two | Projected UZR
LF or RF | +6.0 (N games=11629) | 2.1 (N=42866) | 2.1
CF | -3.0 (N=9955) | -.8 (N=23083) | -.9

 

The spread we see in column 2, “UZR season one” is based on the “delta method”. It is expected to be a little wider than the normal talent spread we expect between CF and LF/RF which is around 6 runs. That is because of selective sampling. Players who do well at the corners will tend to also play CF and players who play poorly in CF will tend to get some play at the corners. The spread we see in column 3, “UZR season two” does not mean anything per se. In season two these are not necessarily players who played both positions again (they played either one or the other or both). All it means is that of players who played both positions in season one, they are 2.1 runs above average at the corners and .8 runs below average in CF, in season two.

Now let’s look at the same table for players like Eaton, who had larger than normal splits between a corner position and CF. I used a threshold of at least a 10-run difference (5.5 is typical). There were 254 players who played at least 20 games in CF and in RF or LF in one season and then played in LF in the next season, and 138 players who played in CF and LF or RF in one season and in RF in the next.

Position | UZR season one | UZR season two | Projected UZR
LF or RF | +12.7 (N games=4924) | | 1.4
CF | -12.3 (N=4626) | | .3

 

For now, I’m leaving the third column, their UZR in season two, empty. These are players who appeared to be better suited at a corner position than in CF. If we assume that these unusual splits are merely noise, a random fluctuation, and that we expect them to have a normal split in season two, we can use the method I describe above to craft a projection for them. Notice the small split in the projections. The projection model I am using creates a CF projection and then it merely adds +4 runs for LF and +7 for RF. Given a 25-run split in season one rather than a normal 6-run split, we might assume that these players will play better, maybe much better, in RF or LF than in CF, in season two. In other words, there is a significant “true talent defensive split” in the OF. So rather than 1.4 in LF or RF (our projection assumes a normal split), we might see a performance of +5, and instead of .3 in CF, we might see -5, or something like that.

Remember that our projection doesn’t care how the CF and corner OF UZR’s are distributed in season one. It assumes static talent and just converts corner UZR to CF UZR by subtracting 4 or 7 runs. Then when it finalizes the CF projection, it assumes we can just add 4 runs for a LF projection and 7 runs for a RF one. It treats all OF positions the same, with a static conversion, regardless of the actual splits. The projection assumes that there is no such thing as “true talent OF splits.”

Now let’s see how well the projection does with that assumption (no such thing as “true talent OF defensive splits”). Remember that if we assume that there is “something” to those unusual splits, we expect our CF projection to be too high and our LF/RF projection to be too low.

Position | UZR season one | UZR season two | Projected UZR
LF or RF | +12.7 (N games=4924) | .9 (N=16857) | 1.4
CF | -12.3 (N=4626) | .8 (N=10250) | .3

 

We don’t see any evidence of a “true talent OF split” when we compare projected to actual. In fact, we see the opposite effect, which is likely just noise (our projection model is pretty basic and not very precise). Instead of seeing better than expected defense at the corners as we might expect from players like Eaton who had unusually good defense at the corners compared to CF in season one, we see slightly worse than projected defense. And in CF, we see slightly better defense than projected even though we might have expected these players to be especially unsuited to CF.

Let’s look at players, unlike Eaton, who have “reverse” splits. These are players who in at least 20 games in both CF and LF or RF, had a better UZR in CF than at the corners.

Position | UZR season one | UZR season two | Projected UZR
LF or RF | -4.8 (N games=3299) | 1.4 (N=15007) | 2.4
CF | 7.8 (N=3178) | -4.4 (N=6832) | -2.6

 

Remember, the numbers in column two, season one UZR “splits” are based on the delta method. Therefore, every player in our sample had a better UZR in CF than in LF or RF and the average difference was 12.6 runs (in favor of CF) whereas we expected an average difference of minus 6 runs or so (in favor of LF/RF). The “delta method” just means that I averaged all of the players’ individual differences weighted by the lesser of their games, either in CF or LF/RF.
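For readers who want to see the "delta method" weighting spelled out, here is a minimal sketch. The field names and the two sample players are invented for illustration; only the weighting rule (the lesser of the two game counts) comes from the description above.

```python
def delta_method_split(players):
    """Average CF-minus-corner UZR difference across players.

    Each player's difference is weighted by the lesser of his games in
    CF and his games at the corners. Field names are hypothetical.
    """
    num = den = 0.0
    for p in players:
        w = min(p["cf_games"], p["corner_games"])
        num += w * (p["cf_uzr"] - p["corner_uzr"])
        den += w
    return num / den


# Two made-up players with "reverse" splits (better in CF than at the corners).
sample = [
    {"cf_uzr": 8.0, "cf_games": 40, "corner_uzr": -6.0, "corner_games": 90},
    {"cf_uzr": 5.0, "cf_games": 100, "corner_uzr": -3.0, "corner_games": 20},
]
print(round(delta_method_split(sample), 2))
```

Weighting by the smaller game count keeps a player with 100 games at one position and 20 at the other from counting as if his split were based on 100 games.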

Again, according to the “these unusual splits must mean something” (in terms of talent and what we expect in the next season) theory, we expect these players to significantly exceed their projection in CF and undershoot it at the corners. Again, we don’t see that. We see that our projections are high for both positions; in fact we overshoot more in CF than in RF/LF, exactly the opposite of what we would expect if there were any significance to these unusual splits. Again we see no evidence of a “true talent split in OF defense.”

For players with unusual splits in OF defense, we see that a normal projection at CF or at the corners suffices. We treat LF/RF/CF UZR exactly the same making static adjustments regardless of the direction and magnitude of the empirical splits. What about the idea that, “We don’t know what to expect with a player like Eaton?” I don’t really know what that means, but we hear it all the time when we see numbers that look unusual or “trendy” or appear to follow a “pattern.” Does that mean we expect there to be more fluctuation in season two UZR? Perhaps even though on the average they revert to normal spreads, we see a wider spread of results in these players who exhibit unusual splits in season one. Let’s look at that in our final analysis.

When we look at all players who played CF and LF/RF in season one, remember the average spread was 9 runs, +6 at the corners and -3 in CF. In season two, 28% of the players who played RF or LF had a UZR greater than +10 and 26% in CF had a UZR of -10 or worse. The standard deviation of the distribution in season two UZR was 13.9 runs for LF/RF and 15.9 in CF.

What about our players like Eaton? Can we expect more players to have a poor UZR in CF and a great one at a corner? No. 26% of these players had a UZR greater than +10 at the corners and 25% had a UZR less than -10 in CF, around the same as all “dual” players in season one. In fact we get a smaller spread with these players with unusual splits as we would expect given that their means in CF and at the corners are actually closer together (look at the tables above). The standard deviation of the distribution in season two UZR for these players was 13.2 runs for LF/RF and 15.3 in CF, slightly smaller than for all “dual” players combined.

In conclusion, there is simply nothing to write about when it comes to Eaton’s or anyone else’s unusual outfield UZR or DRS splits. If you want to estimate their UZR going forward simply adjust and combine all of their OF numbers and do a normal projection. It doesn’t matter if they have -16 in LF and +20 in CF, 0 runs in CF only, or +4 runs in LF only. It’s all the same thing with exactly the same projection and exactly the same distribution of results the next season.
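A small sketch shows why those three stat lines are "all the same thing" once the static position adjustments are applied (-4 runs for LF, -7 for RF, as used earlier). It assumes the player with the split logged equal games at both positions, which the text does not state; the names and game counts are illustrative only.

```python
# Runs per 150 games added to a corner UZR to express it as "CF UZR".
TO_CF = {"CF": 0.0, "LF": -4.0, "RF": -7.0}


def cf_equivalent(lines):
    """Games-weighted CF-equivalent UZR from (position, uzr_per_150, games)."""
    games = sum(g for _, _, g in lines)
    return sum((uzr + TO_CF[pos]) * g for pos, uzr, g in lines) / games


# Equal games at each position is an assumption for the split player.
split_player = [("LF", -16.0, 75), ("CF", 20.0, 75)]
cf_only = [("CF", 0.0, 150)]
lf_only = [("LF", 4.0, 150)]

for label, lines in [("split", split_player), ("CF only", cf_only), ("LF only", lf_only)]:
    print(label, cf_equivalent(lines))
```

All three collapse to the same CF-equivalent number, so any projection built on that number treats them identically.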

As far as we can tell there is simply no such thing (to any significant or identifiable degree) as an outfielder who is more suited to one OF position than another. There is outfield defense – period. It doesn’t matter where you are standing in the OF. The ability to catch line drives and fly balls in the OF is more or less the same whether you are standing in the middle or on the sides of the OF (yes it could take some time to get used to a position if you are unfamiliar with it). If you are good in one location you will be good at another, and if you are bad at one location you will be bad at another. Your UZR or DRS might change in a somewhat predictable fashion depending upon what position, CF, LF, or RF is being measured, but that’s only because the players you are measured against (those metrics are relative) differ in their average ability to catch fly balls and line drives. More importantly, when you see a player who has an unusual “split” in their outfield numbers, like Eaton, you will be tempted to think that they are intrinsically better at one position than another and that the unusual split will tend to continue in the future. When you see really large splits you will be tempted even more. Remember the words in this paragraph and remember this analysis to avoid being fooled by randomness into drawing faulty conclusions, as all human beings, even smart ones, are wont to do.

Last night in the Cubs/Cardinals game, the Cardinals skipper took his starter, Lackey, out in the 8th inning of a 1-run game with one out, no one on base and lefty Chris Coghlan coming to the plate. Coghlan is mostly a platoon player. He has faced almost four times as many righties in his career as lefties. His career wOBA against righties is a respectable .342. Against lefties it is an anemic .288. I have him with a projected platoon split of 27 points, less than his actual splits, which is to be expected as platoon splits in general get heavily regressed toward the mean, because they tend to be laden with noise for two reasons: One, the samples are rarely large because you are comparing performance against righties to performance against lefties and the smaller of the two tends to dominate the effective sample size – in Coghlan’s case, he has faced only 540 lefties in his entire 7-year career, less than the number of PA a typical full-time batter gets in one season. Two, there is not much of a spread in platoon talent among both batters and pitchers. The less spread in talent for any statistic, the more the differences you see among players, especially in small samples, are noise. Sort of like DIPS for pitchers.

Anyway, even with a heavy regression, we think that Coghlan has a larger than average platoon split for a lefty and the average lefty split tends to be large. You typically would not want him facing a lefty in that situation. That is especially true when you have a very good and fairly powerful right-handed bat on the bench – Jorge Soler. Soler has a reverse career platoon split, but with only 114 PA versus lefties, that number is almost meaningless. I estimate his actual platoon split to be 23 points, a little less than the average righty. For RHB, there is always a heavy regression of actual platoon splits, regardless of the sample size (while the greater the sample of actual PA versus lefties, the less you regress, it might be a 95% regression for small samples and an 80% regression for large samples – either way, large) simply because there is not a very large spread of talent among RHB. If we look at the actual splits for all RHB over many, many PA, we see a narrow range of results. In fact, there is virtually no such thing as a RHB with true reverse platoon splits.
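The regression described here can be sketched with the usual "add ballast PA" shrinkage. The 2,200 PA ballast is an assumption chosen only because it reproduces the roughly 95% and 80% regression amounts mentioned above; the observed and league-average splits in the usage lines are made-up numbers, not Soler's actual figures.

```python
def regression_fraction(pa, ballast_pa):
    """Share of an observed split that is regressed toward the league mean."""
    return ballast_pa / (pa + ballast_pa)


def regressed_split(observed, league_split, pa, ballast_pa):
    """Estimated true platoon split (wOBA points) after regression."""
    r = regression_fraction(pa, ballast_pa)
    return r * league_split + (1 - r) * observed


BALLAST = 2200  # hypothetical ballast for RHB, in PA versus lefties

print(round(regression_fraction(114, BALLAST), 2))   # small sample
print(round(regression_fraction(550, BALLAST), 2))   # larger sample

# Made-up inputs: a -20 point (reverse) observed split, 25 point league average.
print(round(regressed_split(-20.0, 25.0, 114, BALLAST), 1))
```

With so little weight on the observed split, even a reverse split in a small sample barely moves the estimate away from the league-average split.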

Soler seems to be the obvious choice, so of course that’s what Maddon did – he pinch hit for Coghlan with Soler, right? This is also a perfect opportunity since Matheny cannot counter with a RHP – Siegrest has to pitch to at least one batter after entering the game. Maddon let Coghlan hit and he was easily dispatched by Siegrest 4 pitches later. Not that the result has anything to do with the decision by Matheny or Maddon. It doesn’t. Matheny’s decision to bring in Siegrest at that point in time was rather curious too, if you think about it. Surely he must have assumed that Maddon would bring in a RH pinch hitter. So he had to decide whether to pitch Lackey against Coghlan or Siegrest against a right handed hitter, probably Soler. Plus, the next batter, Russell, is another righty. It looks like he got extraordinarily lucky when Maddon did what he did – or didn’t do – in letting Coghlan bat. But that’s not the whole story…

Siegrest may or may not be your ordinary left-handed pitcher. What if Siegrest actually has reverse splits? What if we expect him to pitch better against right handed batters and worse against left-handed batters? In that case, Coghlan might actually be the better choice than Soler even though he doesn’t often face lefty pitchers. When a pitcher has reverse splits – true reverse splits – we treat him exactly like a pitcher of the opposite hand. It would be exactly like Coghlan or Soler were facing a RHP. Or maybe Siegrest has no splits – i.e. RH and LH batters of equal overall talent perform about the same. Or very small platoon splits compared to the average left-hander? So maybe hitting Coghlan or Soler is a coin flip.

It might also have been correct for Matheny to bring in Siegrest no matter who he was going to face, simply because Lackey, who is arguably a good but not great pitcher, was about to face a good lefty hitter for the third time – not a great matchup. And if Siegrest does indeed have very small splits either positive or negative, or no splits at all, that is a perfect opportunity to bring him in, and not care whether Maddon leaves Coghlan in or pinch hits Soler. At the same time, if Maddon thinks that Siegrest has significant reverse splits, he can leave in Coghlan, and if he thinks that the lefty pitcher has somewhere around a neutral platoon split, he can still leave Coghlan in and save Soler for another pinch hit opportunity. Of course, if he thinks that Siegrest is like your typical lefty pitcher, with a 30 point platoon split, then using Coghlan is a big mistake.

So how do managers determine what a pitcher’s true or expected (the same thing) platoon split is? The typical troglodyte will use batting average against during the season in question. After all, that’s what you hear ad nauseam from the talking heads on TV, most of them ex-players or even ex-managers. Even the slightly informed fan knows that batting average against for a pitcher is a worthless stat in and of itself (what, walks don’t count, and a HR is the same as a single?), especially in light of DIPS. The slightly more informed fan also knows that one season splits for a batter or pitcher are not very useful for the reasons I explained above.

If you look at Siegrest’s BA against splits for 2015, you will see .163 versus RHB and .269 versus LHB. Cue the TV commentators: “Siegrest is much better against right-handed batters than left-handed ones.” Of course, is and was are very different things in this context and with respect to making decisions like Matheny and Maddon did. The other day David Price was a pretty mediocre to poor pitcher. He is a great pitcher and you would certainly be taking your life into your hands if you treated him like a mediocre to poor pitcher in the present. Kershaw was a poor pitcher in the playoffs…well, you get the idea. Of course, sometimes, was is very similar to is. It depends on what we are talking about and how long the was was, and what the was actually is.

Given that Matheny is not considered to be such an astute manager when it comes to data-driven decisions, it may be surprising that he would bring in Siegrest to pitch to Coghlan knowing that Siegrest has an enormous reverse BA against split in 2015. Maybe he was just trying to bring in a fresh arm – Siegrest is a very good pitcher overall. He also knows that the lefty is going to have to pitch to the next batter, Russell, a RHB.

What about Maddon? Surely he knows better than to look at such a garbage stat for one season to inform a decision like that. Let’s use a much better stat like wOBA and look at Siegrest’s career rather than just one season. Granted, a pitcher’s true platoon splits may change from season to season as he changes his pitch repertoire, perhaps even arm angle, position on the rubber, etc. Given that, we can certainly give more weight to the current season if we like. For his career, Siegrest has a .304 wOBA against versus LHB and .257 versus RHB. Wait, let me double check that. That can’t be right. Yup, it’s right. He has a career reverse wOBA split of 47 points! All hail Joe Maddon for leaving Coghlan in to face essentially a RHP with large platoon splits! Maybe.

Remember how in the first few paragraphs I talked about how we have to regress actual platoon splits a lot for pitchers and batters, because we normally don’t have a huge sample and because there is not a great deal of spread among pitchers with respect to true platoon split talent? Also remember that what we, and Maddon and Matheny, are desperately trying to do is estimate Siegrest’s true, real-life honest-to-goodness platoon split in order to make the best decision we can regarding the batter/pitcher matchup. That estimate may or may not be the same as or even remotely similar to his actual platoon splits, even over his entire career. Those actual splits will surely help us in this estimate, but the was is often quite different than the is.

Let me digress a little and invoke the ole’ coin flipping analogy in order to explain how sample size and spread of talent come into play when it comes to estimating a true anything for a player – in this case platoon splits.

Note: If you want you can skip the “coins” section and go right to the “platoon” section. 

Coins

Let’s say that we have a bunch of fair coins that we stole from our kid’s piggy bank. We know of course that each of them has a 50/50 chance of coming up heads or tails in one flip – sort of like a pitcher with exactly even true platoon splits. If we flip a bunch of them 100 times, we know we’re going to get all kinds of results – 42% heads, 61% tails, etc. For the math inclined, if we flip enough coins the distribution of results will be approximately a normal curve, with the mean and median at 50% and the standard deviation equal to the binomial standard deviation of 100 flips, which is 5%.
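For anyone who wants to check the 5% figure, here is a minimal sketch. The function name is made up for illustration; the only thing it relies on is the standard binomial formula for the standard deviation of a proportion, sqrt(p(1-p)/n).

```python
import math

def binomial_sd_of_proportion(p: float, n: int) -> float:
    """Standard deviation of the observed success rate over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

# A fair coin flipped 100 times: the heads percentage has an SD of 5 points.
sd_100 = binomial_sd_of_proportion(0.5, 100)
print(f"SD of heads rate over 100 flips: {sd_100:.3f}")
```

So a 42% or a 61% result is roughly 1.5 to 2 standard deviations from 50%, which is unusual but not shocking.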

Based on the actual results of 100 flips of any of the coins, what would you estimate the true heads/tails percentage of that coin? If one coin came up 65/35 in favor of heads, what is your estimate for future flips? 50% of course. 90/10? 50%. What if we flipped a coin 1000 or even 5000 times and it came up 55% heads and 45% tails? Still 50%. If you don’t believe or understand that, stop reading and go back to whatever you were doing. You won’t understand the rest of this article. Sorry to be so blunt.

That’s like looking at a bunch of pitchers’ platoon stats and no matter what they are and over how many TBF, you conclude that the pitcher really has an even split and what you observed is just noise. Why is that? With the coins it is because we know beforehand that all the coins are fair (other than that one trick coin that your kid keeps for special occasions). We can say that there is no “spread in talent” among the coins and therefore regardless of the result of a number of flips and regardless of how many flips, we regress the result 100% of the way toward the mean of all the coins, 50%, in order to estimate the true percentage of any one coin.

But, there is a spread of talent among pitcher and batter platoon splits. At least we think there is. There is no reason why it has to be so. Even if it is true, we certainly can’t know off the top of our head how much of a spread there is. As it turns out, that is really important in terms of estimating true pitcher and batter splits. Let’s get back to the coins to see why that is. Let’s say that we don’t have 100% fair coins. Our sly kid put in his piggy bank a bunch of trick coins, but not really, really tricky. Most are still 50/50, but some are 48/52, 52/48, a few less are 45/55, and 1 or 2 are 40/60 and 60/40. We can say that there is now a spread of “true coin talent” but the spread is small. Most of the coins are still right around 50/50 and a few are more biased than that.  If your kid were smart enough to put in a normal distribution of “coin talent,” even one with a small spread, the further away from 50/50, the fewer coins there are.  Maybe half the coins are still fair coins, 20% are 48/52 or 52/48, and a very, very small percentage are 60/40 or 40/60.  Now what happens if we flip a bunch of these coins?

If we flip them 100 times, we are still going to be all over the place, whether we happen to flip a true 50/50 coin or a true 48/52 coin. It will be hard to guess what kind of a true coin we flipped from the result of 100 flips. A 50/50 coin is almost as likely to come up 55 heads and 45 tails as a coin that is truly a 52/48 coin in favor of heads. That is intuitive, right?

This next part is really important. It’s called Bayesian inference, but you don’t need to worry about what it’s called or even how it technically works. It is true that a true 60/40 coin is much more likely to come up 60/40 heads than a 50/50 coin is. That should be obvious too. But here’s the catch. There are many, many more 50/50 coins in your kid’s piggy bank than there are 60/40. Your kid was smart enough to put in a normal distribution of trick coins.

So even though it seems like if you flipped a coin 100 times and got 60/40 heads, it is more likely you have a true 60/40 coin than a true 50/50 coin, it isn’t. It is much more likely that you have a 50/50 coin that got “heads lucky” than a true 60/40 coin that landed on the most likely result after 100 flips (60/40) because there are many more 50/50 coins in the bank than 60/40 coins – assuming a somewhat normal distribution with a small spread.

Here is the math: The chance of a 50/50 coin coming up exactly 60/40 is around .01. The chance of a true 60/40 coin coming up 60/40 is 8 times that amount, or .08. But if there are 8 times as many 50/50 coins in your piggy bank as 60/40 coins, then the chances that the coin that came up 60/40 is a fair coin or a 60/40 biased coin are only 50/50. If there are 800 times more 50/50 coins than 60/40 coins in your bank, as there are likely to be if the spread of coin talent is small, then it is 100 times more likely that you have a true 50/50 coin than a true 60/40 coin even though the coin came up 60 heads in 100 flips.
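The numbers above can be reproduced with a few lines of Python. This is only a sketch under a simplifying assumption: the piggy bank holds just two kinds of coins, fair ones and 60/40 ones. The helper names are invented for the example.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips of a coin with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Likelihood of seeing exactly 60 heads in 100 flips under each kind of coin.
like_fair = binom_pmf(60, 100, 0.5)    # roughly .01
like_biased = binom_pmf(60, 100, 0.6)  # roughly .08

def posterior_fair(fair_per_biased: float) -> float:
    """Chance the coin is fair, given 60 heads, when the bank holds
    `fair_per_biased` fair coins for every 60/40 coin (two coin types only)."""
    weight_fair = fair_per_biased * like_fair
    weight_biased = 1.0 * like_biased
    return weight_fair / (weight_fair + weight_biased)

p8 = posterior_fair(8)      # close to a coin flip
p800 = posterior_fair(800)  # overwhelmingly a fair coin
odds800 = p800 / (1 - p800)
print(f"8:1 prior   -> P(fair) = {p8:.2f}")
print(f"800:1 prior -> odds of fair = {odds800:.0f} to 1")
```

The likelihoods never change. Only the mix of coins in the bank does, and that mix is what swings the answer.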

It’s like the AIDS test contradiction. If you are a healthy, heterosexual, non-drug user, and you take an AIDS test which has a 1% false positive rate and you test positive, you are extremely unlikely to have AIDS. There are very few people with AIDS in your population so it is much more likely that you do not have AIDS and got a false positive (1 in 100) than you did have AIDS in the first place (maybe 1 in 100,000) and tested positive. Out of a million people in your demographic, if they all got tested, 10 will have AIDS and test positive (assuming a 0% false negative rate) and 999,990 will not have AIDS, but 10,000 of them (1 in 100) will have a false positive. So the odds you have AIDS is 10,000 to 10 or 1000 to 1 against.
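The same head count can be written out as a short sketch. The prevalence, false positive rate, and false negative rate are the illustrative figures from the paragraph above, not real epidemiological data, and the variable names are mine.

```python
population = 1_000_000
prevalence = 1 / 100_000      # 1 in 100,000 in this demographic (illustrative)
false_positive_rate = 0.01    # 1 in 100 healthy people test positive
false_negative_rate = 0.0     # assume every infected person tests positive

infected = population * prevalence
true_positives = infected * (1 - false_negative_rate)
false_positives = (population - infected) * false_positive_rate

# Chance that a person who tested positive is actually infected.
p_infected_given_positive = true_positives / (true_positives + false_positives)
print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"P(infected | positive) = {p_infected_given_positive:.4f}")
```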

In the coin example where the spread of coin talent is small and most coins are still at or near 50/50, pretty much no matter what we get when flipping a coin 100 times, we are going to conclude that there is a good chance that our coin is still around 50/50 because most of the coins are around 50/50 in true coin talent. However, there is some chance that the coin is biased, if we get an unusual result.

Now, it is awkward and not particularly useful to conclude something like, “There is a 60% chance that our coin is a true 50/50 coin, 20% it is a 55/45 coin, etc.” So what we usually do is combine all those probabilities and come up with a single number called a weighted mean.

If one coin comes up 60/40, our weighted mean estimate of its “true talent” may be 52%. If we come up with 55/45, it might be 51%. 30/70 might be 46%. Etc. That weighted mean is what we refer to as “an estimate of true talent” and is the crucial factor in making decisions based on what we think the talent of the coins/players are likely to be in the present and in the future.

Now what if the spread of coin talent were still small, as in the above example, but we flipped the coins 500 times each? Say we came up with 60/40 again in 500 flips. That result is about 24,000 times more likely with a 60/40 coin than with a 50/50 coin! So now we are much more certain that we have a true 60/40 coin even if we don’t have that many of them in our bank. In fact, if the standard deviation of our spread in coin talent were 3%, we would be about half certain that our coin was a true 50/50 coin and half certain it was a true 60/40 coin, and our weighted mean would be 55%.
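Here is a rough sketch of where a number like 24,000 comes from. It compares how likely the same 60/40 result is under a true 60/40 coin versus a fair coin, working in logs so the numbers stay manageable. The function name is hypothetical.

```python
import math

def likelihood_ratio(heads: int, flips: int, p_a: float, p_b: float) -> float:
    """How many times more likely the observed result is under coin A than coin B.
    The binomial coefficient is the same for both coins, so it cancels out."""
    tails = flips - heads
    log_ratio = heads * math.log(p_a / p_b) + tails * math.log((1 - p_a) / (1 - p_b))
    return math.exp(log_ratio)

lr_100 = likelihood_ratio(60, 100, 0.6, 0.5)   # single digits
lr_500 = likelihood_ratio(300, 500, 0.6, 0.5)  # tens of thousands
print(f"100 flips: {lr_100:.1f} times more likely under the 60/40 coin")
print(f"500 flips: {lr_500:,.0f} times more likely under the 60/40 coin")
```

Five times the flips raises the ratio to the fifth power, which is why a bigger sample overwhelms even a lopsided piggy bank.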

There is a much easier way to do it. We have to do some math gyrations which I won’t go into that will enable us to figure out how much to regress our observed flip percentage to the mean flip percentage of all the coins, 50%. For 100 flips it was a large regression such that with a 60/40 result we might estimate a true flip talent of 52%, assuming a spread of coin talent of 3%. For 500 flips, we would regress less towards 50% to give us around 55% as our estimate of coin talent. Regressing toward a mean rather than doing the long-hand Bayesian inferences using all the possible true talent states assumes a normal distribution or close to one.

The point is that the sample size of the observed measurement determines how much we regress the observed amount towards the mean. The larger the sample, the less we regress. One season observed splits and we regress a lot. Career observed splits that are 5 times that amount, like our 500 versus 100 flips, we regress less.

But sample size of the observed results is not the only thing that determines how much to regress. Remember if all our coins were fair and there were no spread in talent, we would regress 100% no matter how many flips we did with each coin.

So what if there were a large spread in talent in the piggy bank? Maybe a SD of 10 percent so that almost all of our coins were anywhere from 20/80 to 80/20 (in a normal distribution the rule of thumb is that almost all of the values fall within 3 SD of the mean in either direction)? Now what if we flipped a coin 100 times and came up with 60 heads? Now there are lots more coins at true 60/40 and even some coins at 70/30 and 80/20. The chance that we have a truly biased coin when we get an unusual result is much greater than if the spread in coin talent were smaller, even in 100 flips.

So now we have the second rule. The first rule was that the number of trials is important in determining how much credence to give to an unusual result, i.e., how much to regress that result towards the mean, assuming that there is some spread in true talent. If there is no spread, then no matter how many trials our result is based on, and no matter how unusual our result, we still regress 100% toward the mean.

All trials whether they be coins or human behavior have random results around a mean that we can usually model as long as the mean is not 0 or 1. That is an important concept, BTW. Put it in your “things I should know” book. No one can control or influence that random distribution. A human being might change his mean from time to time but he cannot change or influence the randomness around that mean. There will always be randomness, and I mean true randomness, around that mean regardless of what we are measuring, as long as the mean is between 0 and 1, and there is more than 1 trial (in one trial you either succeed or fail of course). There is nothing that anyone can do to influence that fluctuation around the mean. Nothing.

The second rule is that the spread of talent also matters in terms of how much to regress the actual results toward the mean. The more the spread, the less we regress the results for a given sample size. What is more important? That’s not really a specific enough question, but a good answer is that if the spread is small, no matter how many trials the results are based on, within reason, we regress a lot. If the spread is large, it doesn’t take a whole lot of trials, again, within reason, in order to trust the results more and not regress them a lot towards the mean.
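Both rules can be seen in one small sketch. It assumes the regression amount is random variance divided by total (random plus talent) variance, and it approximates the random variance with the binomial variance at the population mean of 50%. The function names and the grid of sample sizes and spreads are chosen only for illustration.

```python
def regression_fraction(n_flips: int, talent_sd: float, mean: float = 0.5) -> float:
    """Fraction of the way to move an observed rate toward the population mean."""
    random_var = mean * (1 - mean) / n_flips   # noise from a finite number of flips
    talent_var = talent_sd ** 2                # real differences between coins
    return random_var / (random_var + talent_var)

def estimate_true_rate(observed: float, n_flips: int, talent_sd: float,
                       mean: float = 0.5) -> float:
    """Observed rate moved toward the mean by the regression fraction."""
    reg = regression_fraction(n_flips, talent_sd, mean)
    return observed + reg * (mean - observed)

for talent_sd in (0.0, 0.03, 0.10):
    for n_flips in (100, 500):
        reg = regression_fraction(n_flips, talent_sd)
        est = estimate_true_rate(0.60, n_flips, talent_sd)
        print(f"spread SD {talent_sd:.2f}, {n_flips} flips: "
              f"regress {reg:.0%}, 60% observed -> {est:.1%} estimated")
```

With a 3% spread this lands near 53% after 100 flips and 56% after 500, in the neighborhood of the rounded figures used earlier. With no spread the regression is always 100%, and with a 10% spread even 100 flips gets regressed only about 20%.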

Let’s get back to platoon splits, now that you know almost everything about sample size, spread of talent, regression to mean, and watermelons. We know that how much to trust and regress results depends on their sample size and on the spread of true talent in the population with respect to that metric, be it coin flipping or platoon splits. Keep in mind that when we say trust the results, that it is not a binary thing, as in, “With this sample and this spread of talent, I believe the results – the 60/40 coin flips or the 50 point reverse splits, and with this sample and spread, I don’t believe them.” That’s not the way it works. You never believe the results. Ever. Unless you have enough time on your hands to wait for an infinite number of results and the underlying talent never changes.

What we mean by trust is literally how much to regress the results toward a mean. If we don’t trust the stats much, we regress a lot. If we trust them a lot, we regress a little. But. We. Always. Regress. It is possible to come up with a scenario where you might regress almost 100% or 0%, but in practice most regressions are in the 20% to 80% range, depending on sample size and spread of talent. That is just a very rough rule of thumb.

We generally know the sample size of the results we are looking at. With Siegrest (I almost forgot what started this whole thing) his career TBF is 604, but that’s not his sample size for platoon splits because platoon splits are based on the difference between facing lefties and righties. The real sample size for platoon splits is the harmonic mean of TBF versus lefties and righties. If you don’t know what that means don’t worry about it. A shortcut is to use the lesser of the two which is almost always TBF versus lefties, or in Siegrest’s case, 231. That’s not a lot, obviously, but we have two possible things going for Maddon, who played his cards like Siegrest was a true reverse split lefty pitcher. One, maybe the spread of platoon skill among lefty pitchers is large (it’s not), and two, he has a really odd observed split of 47 points in reverse. That’s like flipping a coin 100 times and getting 70 heads and 30 tails or 65/35. It is an unusual result. The question is, again, not binary – whether we believe that -47 point split or not. It is how much to regress it toward the mean of +29 – the average left-handed platoon split for MLB pitchers.
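For the curious, here is a quick sketch of the harmonic mean next to the shortcut. It assumes TBF versus righties is simply the career total minus TBF versus lefties (604 minus 231), which the post does not state directly.

```python
from statistics import harmonic_mean

tbf_total = 604
tbf_vs_lhb = 231
tbf_vs_rhb = tbf_total - tbf_vs_lhb  # assumption: everyone else was a righty

effective_n = harmonic_mean([tbf_vs_lhb, tbf_vs_rhb])
shortcut_n = min(tbf_vs_lhb, tbf_vs_rhb)

print(f"harmonic mean of {tbf_vs_lhb} and {tbf_vs_rhb}: {effective_n:.0f}")
print(f"shortcut (lesser of the two): {shortcut_n}")
```

The shortcut is a little smaller than the harmonic mean, so it regresses slightly more. The rest of the article uses the shortcut figure.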

While the unusual nature of the observed result is not a factor in how much regressing to do, it does obviously come into play, in terms of our final estimate of true talent. Remember that the sample size and spread of talent in the underlying population, in this case, all lefty pitchers, maybe all lefty relievers if we want to get even more specific, are the only things that determine how much we trust the observed results, i.e., how much we regress them toward the mean. If we regress -47 points 50% toward the mean of +29 points, we get quite a different answer than if we regress, say, an observed -10 split 50% towards the mean. In the former case, we get a true talent estimate of -9 points and in the latter we get +10. That’s a big difference. Are we “trusting” the -47 more than the -10 because it is so big? You can call it whatever you want, but the regression is the same assuming the sample size and spread of talent are the same.

The “regression”, by the way, if you haven’t figured it out yet, is simply the amount, in percent, we move the observed toward the mean. -47 points is 76 points “away” from the mean of +29 (the average platoon split for a LHP). 50% regression means to move it half way, or 38 points. If you move -47 points 38 points toward +29 points, you get -9 points, our estimate of Siegrest’s true platoon split if  the correct regression is 50% given his 231 sample size and the spread of platoon talent among LH MLB pitchers. I’ll spoil the punch line. It is not even close to 50%. It’s a lot more.
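The arithmetic in the last two paragraphs fits in a one-line function. This is just an illustration, with a function name of my own choosing, and the 50% regression is the hypothetical figure from the text rather than the correct one.

```python
def regress_toward_mean(observed: float, mean: float, regression: float) -> float:
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

LHP_MEAN_SPLIT = 29  # average platoon split for a lefty pitcher, in points of wOBA

big_reverse = regress_toward_mean(-47, LHP_MEAN_SPLIT, 0.50)
small_reverse = regress_toward_mean(-10, LHP_MEAN_SPLIT, 0.50)
print(f"-47 observed, 50% regression -> {big_reverse:+.1f}")
print(f"-10 observed, 50% regression -> {small_reverse:+.1f}")
```

Same regression, different starting points, different estimates: the size of the observed split affects where you end up but not how far, in percentage terms, you move.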

How do we determine the spread of talent in a population, like platoon talent? That is actually easy but it requires some mathematical knowledge and understanding. Most of you will just have to trust me on this. There are two basic methods which are really the same thing and yield the same answer. One, we can take a sample of players, say 100 players who all had around the same number of opportunities (sample size), say, 300. That might be all full-time starting pitchers in one season and the 300 is the number of LHB faced. Or it might be all pitchers over several seasons who faced around 300 LHB. It doesn’t matter. Nor does the number of opportunities. They don’t even have to be the same for all pitchers. It is just easier to explain that way. Now we compute the variance in that group – stats 101. Then we compare that variance with the variance expected by chance – still stats 101.

Let’s take BA, for example. If we have a bunch of players with 400 AB each, what is the variance in BA among the players expected by chance? Easy. Binomial variance: .000625 in BA. What if we observe a variance of twice that, or .00125? Where is the extra variance coming from? A tiny bit is coming from the different contexts that the player plays in, home/road, park, weather, opposing pitchers, etc. A tiny bit comes from his own day-to-day changes in true talent. We’ll ignore that. They really are small. We can of course estimate that too and throw it into the equation. Anyway, that extra variance, the .000625, is coming from the spread of talent. The square root of that is .025 or 25 points of BA, which would be one SD of talent in this example. I just made up the numbers, but that is probably close to accurate.

Now that we know the spread in talent for BA, which we get from this formula – observed variance = random variance + talent variance – we can now calculate the exact regression amount for any sample of observed batting average or whatever metric we are looking at. It’s the ratio of random variance to total variance. Remember we need only 2 things and 2 things only to be able to estimate true talent with respect to any metric, like platoon splits: spread of talent and sample size of the observed results. That gives us the regression amount. From that we merely move the observed result toward the mean by that amount, like I did above with Siegrest’s -47 points and the mean of +29 for a league-average LHP.
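Here is the BA example worked through as a sketch. The two variances are the made-up figures from the text, and the variable names are mine.

```python
import math

# Illustrative figures from the BA example (400 AB per player).
# For reference, p*(1-p)/n for a .270 hitter over 400 AB is about .00049,
# so the .000625 used here is on the generous side.
random_var = 0.000625    # variance expected from chance alone
observed_var = 0.00125   # variance actually seen across the group

# observed variance = random variance + talent variance
talent_var = observed_var - random_var
talent_sd = math.sqrt(talent_var)

# Regression amount = share of the observed variance that is just noise.
regression = random_var / observed_var

print(f"talent SD: {talent_sd:.3f} (about {talent_sd * 1000:.0f} points of BA)")
print(f"regression for a 400 AB sample: {regression:.0%}")
```

With these numbers half of what we see is noise, so a 400 AB batting average would be moved halfway back toward the league mean.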

The second way, which is actually more handy, is to run a regression of player results from one time period to another. We normally do year-to-year but it can be odd days to even, odd PA to even PA, etc. Or an intra-class correlation (ICC) which is essentially the same thing but it correlates every PA (or whatever the opportunity is) to every other PA within a sample.  When we do that, we either use the same sample size for every player, like we did in the first method, or we can use different sample sizes and then take the harmonic mean of all of them as our average sample size.

This second method yields a more intuitive and immediately useful answer, even though they both end up with the same result. This actually gives you the exact amount to regress for that sample size (the average of the group in your regression). In our BA example, if the average sample size of all the players were 500 and we got a year-to-year (or whatever time period) correlation of .4, that would mean that for BA, the correct amount of regression for a sample size of 500 is 60% (1 minus the correlation coefficient or “r”). So if a player bats .300 in 500 AB and the league average is .250 and we know nothing else about him, we estimate his true BA to be (.300 – .250) * .4 + .250 or .270. We move his observed BA 60% towards the mean of .250. We can easily with a little more math calculate the amount of regression for any sample size.
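The BA example, plus one way of extending it to other sample sizes, might be sketched as follows. The extension assumes the correlation for a sample of size n behaves like n / (n + k) for some constant k. That relationship is commonly used for this purpose, but it is my assumption here rather than something spelled out above, and all the function names are hypothetical.

```python
def estimate_true_talent(observed: float, mean: float, r: float) -> float:
    """Keep the fraction r of the gap from the mean, i.e. regress (1 - r) of the way."""
    return mean + r * (observed - mean)

def regression_constant(n: float, r: float) -> float:
    """Solve r = n / (n + k) for k (assumed relationship)."""
    return n * (1 - r) / r

def correlation_at(n: float, k: float) -> float:
    """Expected correlation for a sample of size n under the same assumption."""
    return n / (n + k)

# The example from the text: .300 in 500 AB, league average .250, r = .4.
est = estimate_true_talent(0.300, 0.250, 0.4)
print(f"estimated true BA: {est:.3f}")

# Extending to other sample sizes.
k = regression_constant(500, 0.4)
for n in (250, 500, 1000):
    r_n = correlation_at(n, k)
    print(f"{n} AB: r = {r_n:.2f}, regress {1 - r_n:.0%}")
```

Under that assumption the constant works out to 750 AB, so a 250 AB sample would be regressed 75% and a 1000 AB sample a bit over 40%.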

Using method #1 tells us precisely what the spread in talent is. Method 2 tells us that implicitly by looking at the correlation coefficient and the sample size. With either method, we get the amount to regress for any given sample size.

Platoon

Let’s look at some year-to-year correlations for a 500 “opportunity” (PA, AB, etc.) sample for some common metrics. Since we are using the same sample size for each, the correlation tells us the relative spreads in talent for each of these metrics. The higher the correlation for any given sample, the higher the spread in talent (there are other factors that slightly affect the correlation other than spread of talent for any given sample size but we can safely ignore them).

BA: .450

OBA: .515

SA: .525

Pitcher ERA: .240

BABIP for pitchers (DIPS): .155

BABIP for batters: .450

Now let’s look at platoon splits:

These are for an average of 200 matchups against left-handers (LHB for pitchers, LHP for batters), so the sample size is smaller than the ones above.

Platoon wOBA differential for pitchers (200 BF v. LHB): .135

RHP: .110

LHP: .195

Platoon wOBA differential for batters (200 BF v. LHP): .180

RHB: .0625

LHB: .118

Those numbers are telling us that, like DIPS, the spread of talent among batters and pitchers with respect to platoon splits is very small. You all know now that this, along with sample size, tells us how much to regress an observed split like Siegrest’s -47 points. Yes, a reverse split of 47 points is a lot, but that has nothing to do with how much to regress it in order to estimate Siegrist’s true platoon split. The fact that -47 points is very far from the average left-handed pitcher’s +29 points means that it will take a lot of regression to move it into the plus zone, but the -47 points in and of itself does not mean that we “trust it more.” If the regression were 99% then whether the observed were -47 or +10, we would arrive at nearly the same answer. Don’t confuse the regression with the observed result. One has nothing to do with the other. And don’t think in terms of “trusting” the observed result or not. Regress the result and that’s your answer. If you arrive at answer X it makes no difference whether your starting point, the observed result, was B, or C. None whatsoever. That is a very important point. I don’t know how many times I have heard, “But he had a 47 point reverse split in his entire career! You can’t possibly be saying that you estimate his real split to be +10 or +12 or whatever it is.” Yes, that’s exactly what I’m saying. A +10 estimated split is exactly the same whether the observed split was -47 or +5. The estimate using the regression amount is the only thing that counts.

What about the certainty of the result? The certainty of the estimate depends mostly on the sample size of the observed results. If we never saw a player hit before and we estimate that he is a .250 hitter we are surely less certain than if we have a hitter who has hit .250 over 5000 AB. But does that change the estimate? No. The certainty due to the sample size was already included in the estimate. The higher the certainty the less we regressed the observed results. So once we have the estimate we don’t revise that again because of the uncertainty. We already included that in the estimate!

And what about the practical importance of the certainty in terms of using that estimate to make decisions? Does it matter whether we are 100% or 90% sure that Siegrest is a +10 true platoon split pitcher? Or whether we are only 20% sure – he might actually have a higher platoon split or a lower one? Remember the +10 is a weighted mean which means that it is in the middle of our error bars. The answer to that is, “No, no and no!” Every decision that a manager makes on the field is or should be based on weighted mean estimates of various player talents. The certainty or distribution rarely should come into play. Basically the noise in the result of a sample of 1 is so large that it doesn’t matter at all what the uncertainty level of your estimates are.

So what do we estimate Siegrest’s true platoon split to be, given a 47 point reverse split in 231 TBF versus LHB? Using no weighting for more recent results, we regress his observed splits 1 minus 230/1255, or .82 (82%) towards the league average for lefty pitchers, which is around 29 points for a LHP. 82% of 76 points is 62 points. So we regress his -47 points 62 points in the plus direction which gives us an estimate of +15 points in true platoon split. That is half the split of an average LHP, but it is plus nonetheless.

That means that a left-handed hitter like Coghlan will hit better than he normally does against a left-handed pitcher. However, Coghlan has a larger than average estimated split, so that cancels out Siegrest’s smaller than average split to some extent. That also means that Soler or another righty will not hit as well against Siegrest as he would against a LH pitcher with average splits. And since Soler himself has a slightly smaller platoon split than the average RHB, his edge against Siegrest is small.

We also have another method for better estimating true platoon splits for pitchers which can be used to enhance the method we use using sample results, sample size, and means. It is very valuable. We have a pretty good idea as to what causes one pitcher to have a smaller or greater platoon split than another. It’s not like pitchers deliberately throw better or harder to one side or the other or that RH or LH batters scare or distract them. Pitcher platoon splits mostly come from two things: One is arm angle. If you’ve ever played or watched baseball that should be obvious to you. The more a pitcher comes from the side, the tougher he is on same-side batters and the larger his platoon split. That is probably the number one factor in these splits. It is almost impossible for a side-armer not to have large splits.

What about Siegrest? His arm angle is estimated by Jared Cross of Steamer, using pitch f/x data, at 48 degrees. That is about a ¾ arm angle. That strongly suggests that he does not have true reverse splits and it certainly enables us to be more confident that he is plus in the platoon split department.

The other thing that informs us very well about likely splits is pitch repertoire. Each pitch has its own platoon profile. For example, pitches with the largest splits are sliders and sinkers and those with the lowest or even reverse are the curve (this surprises most people), splitter, and change.

In fact, Jared (Steamer) has come up with a very good regression formula which estimates platoon split from pitch repertoire and arm angle only. This formula can be used by itself for estimating true platoon splits. Or it can be used to establish the mean towards which the actual splits should be regressed. If you use the latter method the regression percentage is much higher than if you don’t. It’s like adding a lot more 50/50 coins to that piggy bank.

If we plug Siegrest’s 2015 numbers into that regression equation, we get an estimated platoon from arm angle and pitch repertoire of 14 points, which is less than the average lefty even with the 48 degree arm angle. That is mostly because he uses around 18% change ups this year. Prior to this season, when he didn’t use the change up that often, we would probably have estimated a much higher true split.

So now rather than regressing towards just an average lefty with a 29 point platoon split, we can regress his -47 points to a more accurate mean of 14 points. But, the more you isolate your population mean, the more you have to regress for any given sample size, because you are reducing the spread of talent in that more specific population. So rather than 82%, we have to regress something like 92%. That brings -47 to +9 points.
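To put the two final estimates side by side, here is a short sketch using the figures already given: 1 minus 230/1255 toward the league mean of +29, and roughly 92% toward the arm-angle-and-repertoire mean of +14. The function and variable names are my own.

```python
def regress_toward_mean(observed: float, mean: float, regression: float) -> float:
    """Move `observed` toward `mean` by the fraction `regression` (0 to 1)."""
    return observed + regression * (mean - observed)

OBSERVED_SPLIT = -47  # career reverse split, in points of wOBA

# Estimate 1: regress toward the average lefty pitcher.
regression_league = 1 - 230 / 1255          # about 82%
est_league = regress_toward_mean(OBSERVED_SPLIT, 29, regression_league)

# Estimate 2: regress harder, toward the mean implied by arm angle and repertoire.
regression_specific = 0.92                  # "something like 92%"
est_specific = regress_toward_mean(OBSERVED_SPLIT, 14, regression_specific)

print(f"toward +29 at {regression_league:.0%}: {est_league:+.0f} points")
print(f"toward +14 at {regression_specific:.0%}: {est_specific:+.0f} points")
```

Either way the estimate comes out on the plus side, just well short of an average lefty.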

So now we are down to a left-handed pitcher with an even smaller platoon split. That probably makes Maddon’s decision somewhat of a toss-up.

His big mistake in that same game was not pinch-hitting for Lester and Ross in the 6th. That was indefensible in my opinion. Maybe he didn’t want to piss off Lester, his teammates, and possibly the fan base. Who knows?

Those of you who follow me on Twitter know that I am somewhat obsessed with how teams (managers) construct their lineups. With few exceptions, managers tend to do two things when it comes to setting their daily lineups: One, they follow more or less the traditional model of lineup construction, which is to put your best overall offensive player third, a slugger fourth, and scrappy, speedy players in the one and/or two holes. Two, they monkey with lineups based on things like starting pitcher handedness (relevant), hot and cold streaks, and batter/pitcher matchups, the latter two generally being not so relevant. For example, in 2012, the average team used 122 different lineups.

If you have read The Book (co-authored by Yours Truly, Tom Tango and Andy Dolphin), you may remember that the optimal lineup differs from the traditional one. According to The Book, a team’s 3 best hitters should bat 1, 2, and 4, and the 4th and 5th best hitters 3 and 5. The 1 and 2 batters should be more walk prone than the 4 and 5 hitters. Slots 6 through 9 should feature the remaining hitters in more or less descending order of quality. As we know, managers violate or in some cases butcher this structure by batting poor, sometimes awful hitters, in the 1 and 2 holes, and usually slotting their best overall hitter third. They also sometimes bat a slow, but good offensive player, often a catcher, down in the order.

In addition to these guidelines, The Book suggests placing good base stealers in front of low walk, and high singles and doubles hitters. That often means the 6 hole rather than the traditional 1 and 2 holes in which managers like to put their speedy, base stealing players. Also, because the 3 hole faces a disproportionate number of GDP opportunities, putting a good hitter who hits into a lot of DP, like a Miguel Cabrera, into the third slot can be quite costly. Surprisingly, a good spot for a GDP-prone hitter is leadoff, where a hitter encounters relatively few GDP opportunities.

Of course, other than L/R considerations (and perhaps G/F pitcher/batter matchups for extreme players) and when substituting one player for another, optimal lineups should rarely if ever change. The notion that a team has to use 152 different lineups (like TB did in 2012) in 162 games, is silly at best, and a waste of a manager’s time and sub-optimal behavior at worst.

Contrary to the beliefs of some sabermetric naysayers, most good baseball analysts and sabermetricians are not unaware of or insensitive to the notion that some players may be more or less happy or comfortable in one lineup slot or another. In fact, the general rule should be that player preference trumps a “computer generated” optimal lineup slot. That is not to say that it is impossible to change or influence a player’s preferences.

For those of you who are thinking, “Batting order doesn’t really matter, as long as it is somewhat reasonable,” you are right and you are wrong. It depends on what you mean by “matter.” It is likely that in most cases the difference between a prevailing, traditional order and an optimal one, notwithstanding any effect from player preferences, is on the order of less than 1 win (10 or 11 runs) per season; however, teams pay on the free agent market over 5 million dollars for a player win, so maybe those 10 runs do “matter.” We also occasionally find that the difference between an actual and optimal lineup is 2 wins or more. In any case, as the old sabermetric saying goes, “Why do something wrong, when you can do it right?” In other words, in order to give up even a few runs per season, there has to be some relevant countervailing and advantageous argument, otherwise you are simply throwing away potential runs, wins, and dollars.
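To put rough numbers on that, here is a minimal sketch of the runs-to-dollars arithmetic. The 10 runs per win and 5 million dollars per win are the round figures quoted above (the text says 10 or 11 runs and "over" 5 million), so the dollar result is best read as a floor. The function name is only for illustration.

```python
RUNS_PER_WIN = 10            # the text says 10 or 11 runs per win
DOLLARS_PER_WIN = 5_000_000  # the text says "over 5 million", so this is a floor

def lineup_value(runs_gained, runs_per_win=RUNS_PER_WIN, dollars_per_win=DOLLARS_PER_WIN):
    """Convert runs gained per season into (wins, free agent market dollars)."""
    wins = runs_gained / runs_per_win
    return wins, wins * dollars_per_win

wins, dollars = lineup_value(10)
print(f"{wins:.1f} wins, about ${dollars:,.0f}")
```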

Probably the worst lineup offense that managers commit is putting a scrappy, speedy, bunt-happy, bat-control, but poor overall offensive player in the two hole. Remember that The Book (the real Book) says that the second slot in the lineup should be reserved for one of your two best hitters, not one of your worst. Yet teams like the Reds, Braves, and the Indians, among others, consistently put awful hitting, scrappy players in the two-hole. The consequence, of course, is that there are fewer base runners for the third and fourth hitters to drive in, and you give an awful hitter many more PA per season and per game. This might surprise some people, but the #2 hitter will get over 100 more PA than the #8 hitter, per 150 games. For a bad hitter, that means more outs for the team with less production. It is debatable what else a poor, but scrappy hitter batting second brings to the table to offset those extra empty 100 PA.

The other mistake (among many) that managers make in constructing what they (presumably) think is an optimal order is using current season statistics, and often spurious ones like BA and RBI, rather than projections. I would venture to guess that you can count on one hand, at best, the number of managers that actually look at credible projections when making decisions about likely future performance, especially 4 or 5 months into the season. Unless a manager has a time machine, what a player has done so far during the season has nothing to do with how he is likely to do in the upcoming game, other than how those current season stats inform an estimate of future performance. While it is true that there is obviously a strong correlation between 4 or 5 months past performance and future performance, there are many instances where a hitter is projected as a good hitter but has had an awful season thus far, and vice versa. If you have read my previous article on projections, you will know that projections trump seasonal performance at any point in the season (good projections include current season performance to-date – of course). So, for example, if a manager sees that a hitter has a .280 wOBA for the first 4 months of the season, despite a .330 projection, and bats him 8th, he would be making a mistake, since we expect him to bat like a .330 hitter and not a .280 hitter, and in fact he does, according to an analysis of historical player seasons (again, see my article on projections).

Let’s recap the mistakes that managers typically make in constructing what they think are the best possible lineups. Again, we will ignore player preferences and other “psychological factors” not because they are unimportant, but because we don’t know when a manager might slot a player in a position that even he doesn’t think is optimal in deference to that player. The fact that managers constantly monkey with lineups anyway suggests that player preferences are not that much of a factor. Additionally, more often than not I think, we hear players say things like, “My job is to hit as well as I can wherever the manager puts me in the lineup.” Again, that is not to say that some players don’t have certain preferences and that managers shouldn’t give some, if not complete, deference to them, especially with veteran players. In other words, an analyst advising a team or manager should suggest an optimal lineup taking into consideration player preferences. No credible analyst is going to say (or at least they shouldn’t), “I don’t care where Jeter is comfortable hitting or where he wants to hit, he should bat 8th!”

Managers typically follow the traditional batting order philosophy, which is to bat your best hitter 3rd, your slugger 4th, and fast, scrappy, good-bat handlers 1 or 2, whether they are good overall hitters or not. This is not nearly the same as an optimal batting order, based on extensive computer and mathematical research, which suggests that your best hitter should bat 2 or 4, and that you need to put your worst hitters at the bottom of the order in order to limit the number of PA they get per game and per season. Probably the biggest and most pervasive mistake that managers make is slotting terrible hitters at the top, especially in the 2-hole. Managers also put too many base stealers in front of power hitters and hitters who are prone to the GDP in the 3 hole.

Finally, managers pay too much attention (they should pay none) to short term and seasonal performance as well as specific batter/pitcher past results when constructing their batting orders. In general, your batting order versus lefty and righty starting pitchers should rarely change, other than when substituting/resting players, or occasionally when player projections significantly change, in order to suit certain ballparks or weather conditions, or extreme ground ball or fly ball opposing pitchers (and perhaps according to the opposing team’s defense). Other than L/R platoon considerations (and avoiding batting consecutive lefties if possible), most of these other considerations (G/F, park, etc.) are marginal at best.

With that as a background and primer on batting orders, here is what I did: I looked at all 30 teams’ lineups as of a few days ago. No preference was made for whether the opposing pitcher was right or left-handed or whether full-time starters or substitutes were in the lineup on that particular day. Basically these were middle of August random lineups for all 30 teams.

The first thing I did was to compare a team’s projected runs scored based on adding up each player’s projected linear weights in runs per PA and then weighting each lineup slot by its average number of PA per game, to the number of runs scored using a game simulator and those same projections. For example, if the leadoff batter had a linear weights projection of -.01 runs per PA, we would multiply that by 4.8 since the average number of PA per game for a leadoff hitter is 4.8. I would do that for every player in the lineup in order to get a total linear weights for the team. In the NL, I assumed an average hitting pitcher for every team. I also added in every player’s base running (not base stealing) projected linear weights, using the UBR (Ultimate Base Running) stat you see on Fangraphs. The projections I used were my own. They are likely to be similar to those you see on Fangraphs, The Hardball Times, or BP, but in some cases they may be different.
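As a rough illustration of the "adding up the linear weights" method, here is a minimal Python sketch. Only the 4.8 PA per game for the leadoff slot comes from the text. The PA figures for the other slots and the example lineup are made-up placeholders, the function names are hypothetical, and base running runs are left out for brevity.

```python
# PA per game by lineup slot. Only the leadoff value (4.8) is from the text;
# the rest are assumed values that simply decline down the order.
PA_PER_GAME = [4.8, 4.7, 4.6, 4.5, 4.4, 4.3, 4.2, 4.1, 4.0]

def lineup_lw_per_game(lw_per_pa, pa_per_game=PA_PER_GAME):
    """Sum each slot's projected linear weights (runs per PA) times its PA per game."""
    if len(lw_per_pa) != len(pa_per_game):
        raise ValueError("need one projection per lineup slot")
    return sum(lw * pa for lw, pa in zip(lw_per_pa, pa_per_game))

def per_150_games(runs_per_game):
    """Scale a runs-per-game figure to the 150-game basis used in the tables."""
    return runs_per_game * 150

# Hypothetical lineup: a -.01 leadoff hitter and league-average (0.0) hitters elsewhere.
example = [-0.01] + [0.0] * 8
print(round(lineup_lw_per_game(example), 3))   # the leadoff slot contributes -.01 * 4.8
print(round(per_150_games(lineup_lw_per_game(example)), 1))
```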

In order to calculate runs per game in a simulated fashion, I ran a simple game simulator which uses each player’s projected singles, doubles, triples, HR, UIBB+HP, ROE, G/F ratio, GDP propensity, and base running ability. No bunts, steals or any in-game strategies (such as IBB) were used in the simulation. The way the base running works is this: Every player is assigned a base running rating from 1-5, based on their base running projections in runs above/below average (typically from -5 to +5 per season). In the simulator, every time a base running opportunity is encountered, like how many bases to advance on a single or double, or whether to score from third on a fly ball, it checks the rating of the appropriate base runner and makes an adjustment. For example, on an outfield single with a runner on first, if the runner is rated as a “1” (slow and/or poor runner), he advances to third just 18% of the time, whereas if he is a “5”, he advances 2 bases 41% of the time. The same thing is done with a ground ball and a runner on first (whether he is safe at second and the play goes to first), a ground ball, runner on second, advances on hits, tagging up on fly balls, and advancing on potential wild pitches, passed balls, and errors in the middle of a play (not ROE).
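The base running adjustment might look something like the following sketch. The 18% and 41% first-to-third rates for ratings of 1 and 5 are from the description above. The rates for ratings 2 through 4 are a straight-line interpolation assumed here purely for illustration, and the function names are hypothetical, so this is not necessarily how the actual simulator handles it.

```python
import random

# First-to-third rates on an outfield single. The endpoints (rating 1 -> 18%,
# rating 5 -> 41%) are from the text; ratings 2-4 are linearly interpolated,
# which is an assumption.
LOW_RATE, HIGH_RATE = 0.18, 0.41

def first_to_third_rate(rating):
    """Chance that a runner on first with this rating takes third on an outfield single."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    return LOW_RATE + (HIGH_RATE - LOW_RATE) * (rating - 1) / 4

def bases_advanced_on_single(rating, rng=random):
    """Return 2 if the runner on first takes third, otherwise 1."""
    return 2 if rng.random() < first_to_third_rate(rating) else 1

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
took_third = sum(bases_advanced_on_single(5, rng) == 2 for _ in range(trials))
print(took_third / trials)  # should land close to 0.41
```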

Keep in mind that a lineup does 2 things. One, it gives players at the top more PA than players at the bottom, which is a pretty straightforward thing. Because of that, it should be obvious that you want your best hitters batting near the top and your worst near the bottom. But, if that were the only thing that lineups “do,” then you would simply arrange the lineup in a descending order of quality. The second way that a lineup creates runs is by each player interacting with other players, especially those near them in the order. This is very tricky and complex. Although a computer analysis can give us rules of thumb for optimal lineup construction, as we do in The Book, it is also very player dependent, in terms of each player’s exact offensive profile (again, ignoring things like player preferences or abilities of players to optimize their approach to each lineup slot). As well, if you move one player from one slot to another, you have to move at least one other player. When moving players around in order to create an optimal lineup, things can get very messy. As we discuss in The Book, in general, you want on base guys in front of power hitters and vice versa, good base stealers in front of singles hitters with low walk totals, high GDP guys in the one hole or at the bottom of the order, etc. Basically, constructing an optimal batting order is impossible for a human being to do. If any manager thinks he can, he is either lying or fooling himself. Again, that is not to say that a computer can necessarily do a better job. As with most things in MLB, the proper combination of “scouting and stats” is usually what the doctor ordered.

In any case, adding up each player’s batting and base running projected linear weights, after controlling for the number of PA per game in each batting slot, is one way to project how many runs a lineup will score per game. Running a simulation using the same projections is another way which also captures to some extent the complex interactions among the players’ offensive profiles. Presumably, if you just stack hitters from best to worst, the “adding up the linear weights” method will result in the maximum runs per game, while the simulation should result in a runs per game quite a bit less, and certainly less than with an optimal lineup construction.

I was curious as to the extent that the actual lineups I looked at optimized these interactions. In order to do that, I compared one method to the other. For example, for a given lineup, the total linear weights prorated by number of PA per game might be -30 per 150 games. That is a below average offensive lineup by 30/150 or .2 runs per game. If the lineup simulator resulted in actual runs scored of -20 per 150 games, presumably there were advantageous interactions among the players that added another 10 runs. Perhaps the lineup avoided a high GDP player in the 3-hole or perhaps they had high on base guys in front of power hitters. Again, this has nothing to do with order per se. If a lineup has poor hitters batting first and/or second, against the advice given in The Book, both the linear weights and the simulation methods would bear the brunt of that poor construction. In fact, if those poor hitters were excellent base runners and it is advisable to have good base runners at the top of the order (and I don’t know that it is), then presumably the simulation should reflect that and perhaps create added value (more runs per game) as compared to the linear weights method of projecting runs per game.
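The comparison in that example boils down to two small calculations, sketched here with hypothetical function names. The inputs are the -30 and -20 runs per 150 games used in the example above.

```python
def runs_per_game(runs_per_150):
    """Convert runs per 150 games to runs per game."""
    return runs_per_150 / 150

def interaction_gain(linear_weights_runs, simulated_runs):
    """Runs per 150 games that the simulation finds beyond the linear weights total."""
    return simulated_runs - linear_weights_runs

print(runs_per_game(-30))          # -0.2
print(interaction_gain(-30, -20))  # 10
```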

The second thing I did was to try and use a basic model for optimizing each lineup, using the prescriptions in The Book. I then re-ran the simulation and re-calculated the total linear weights to see which teams could benefit the most from a re-working of their lineup, at least based on the lineups I chose for this analysis. This is probably the more interesting query. For the simulations, I ran 100,000 games per team, which is actually not a whole lot of games in terms of minimizing the random noise in the resultant average runs per game. One standard error in runs per 150 games is around 1.31. So take these results with a grain or two of salt.
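For a sense of where a standard error of that size comes from, here is a back-of-the-envelope sketch. It assumes that simulated games are independent and that the standard deviation of a team's runs in a single game is about 2.76. That value is an assumption chosen only because it reproduces the 1.31 figure; it is not stated in the post.

```python
import math

def se_runs_per_150(sd_runs_per_game, n_games):
    """Standard error of the simulated average runs per game, scaled to 150 games."""
    return 150 * sd_runs_per_game / math.sqrt(n_games)

ASSUMED_SD = 2.76  # assumption: per-game standard deviation of runs scored

print(round(se_runs_per_150(ASSUMED_SD, 100_000), 2))
print(round(se_runs_per_150(ASSUMED_SD, 1_000_000), 2))
```

Because the error shrinks with the square root of the number of games, cutting it in half takes four times as many simulated games.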

In the NL, here are the top 3 and bottom 3 teams in terms of additional or fewer runs that a lineup simulation produced, as compared to simply adding up each player’s projected batting and base running runs, adjusting for the league average number of PA per game for each lineup slot.

Top 3

Team | Linear Weights | Lineup Simulation | Gain per 150 games
ARI | -97 | -86 | 11
COL | -23 | -13 | 10
PIT | 10 | 17 | 6

Here are those lineups:

ARI

Inciarte
Pennington
Peralta
Trumbo
Hill
Pacheco
Marte
Gosewisch

COL

Blackmon
Stubbs
Morneau
Arenado
Dickerson
Rosario
Culberson
Lemahieu

PIT

Harrison
Polanco
Martin
Walker
Marte
Snider
Davis
Alvarez

Bottom 3

Team | Linear Weights | Lineup Simulation | Gain per 150 games
LAD | 43 | 28 | -15
SFN | 35 | 27 | -7
WAS | 42 | 35 | -7

LAD

Gordon
Puig
Gonzalez
Kemp
Crawford
Uribe
Ellis
Rojas

SFN

Pagan
Pence
Posey
Sandoval
Morse
Duvall
Panik
Crawford

WAS

Span
Rendon
Werth
Laroche
Ramos
Harper
Cabrera
Espinosa

In “optimizing” each of the 30 lineups, I used some simple criteria. I put the top two overall hitters in the 2 and 4 holes. Whichever of the two had the greater SLG batted 4th. The next two best hitters batted 1 and 3, with the higher SLG in the 3 hole. From 5 through 8 or 9, I simply slotted them in descending order of quality.
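Those criteria are simple enough to write down in a few lines. What follows is a sketch under stated assumptions, not the code actually used for the analysis: the function name is hypothetical, the players and all of their numbers are invented, and the NL pitcher's slot is left out.

```python
def simple_optimal_order(hitters):
    """hitters: list of (name, projected_batting_runs, slg) tuples for the position players.
    Returns the names in batting order under the simple rules described above."""
    if len(hitters) < 4:
        raise ValueError("need at least four hitters")
    ranked = sorted(hitters, key=lambda h: h[1], reverse=True)  # best to worst
    slot2, slot4 = sorted(ranked[:2], key=lambda h: h[2])       # top two: higher SLG bats 4th
    slot1, slot3 = sorted(ranked[2:4], key=lambda h: h[2])      # next two: higher SLG bats 3rd
    order = [slot1, slot2, slot3, slot4] + ranked[4:]           # the rest in descending quality
    return [h[0] for h in order]

# Hypothetical players; both the batting runs and the SLG values are invented.
hitters = [
    ("A", 30, 0.450), ("B", 26, 0.500), ("C", 17, 0.400), ("D", 14, 0.480),
    ("E", 7, 0.420), ("F", -10, 0.380), ("G", -11, 0.370), ("H", -18, 0.330),
]
print(simple_optimal_order(hitters))
```

With these made-up numbers the two best hitters, A and B, land in the 2 and 4 holes, and the weakest bats fall to the bottom.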

Here is a comparison of the simple “optimal” lineup to the lineups that the teams actually used. Remember, I am using the same personnel and changing only the batting orders.

Before giving you the numbers, the first thing that jumped out at me was how little most of the numbers changed. Conventional, and even most sabermetric, thought is that any one reasonable lineup is usually just about as good as any other, give or take a few runs. As well, a good lineup must strike a balance between putting the better hitters at the top of the lineup and putting players there who are good base runners but poor overall hitters.

The average absolute difference between the runs per game generated by the simulator from the actual and the “optimal” lineup was 3.1 runs per 150 games per team. Again, keep in mind that much of that is noise since I am running only 100,000 games per team, which generates a standard error of something like 1.3 runs per 150 games.

The kicker, however, is that the “optimal” lineups, on the average, only slightly outperformed the actual ones, by only 2/3 of a run per team. Essentially there was no difference between the lineups chosen by the managers and ones that were “optimized” according to the simple rules explained above. Keep in mind that a real optimization – one that tried every possible batting order configuration and chose the best one – would likely generate better results.

That being said, here are the teams whose actual lineups out-performed and were out-performed by the “optimal” ones:

Most sub-optimal lineups

Team | Actual Lineup Simulation Results (Runs per 150) | “Optimal” Lineup Simulation Results | Gain per 150 games
STL | 62 | 74 | 12
ATL | 31 | 37 | 6
CLE | -33 | -27 | 6
MIA | 7 | 12 | 5

Here are those lineups. The number after each player’s name represents his projected batting runs per 630 PA (around 150 games). Keep in mind that these lineups faced either RH or LH starting pitchers. When I run my simulations, I am using overall projections for each player which do not take into consideration the handedness of the batter or any opposing pitcher.

Cardinals

Name Projected Batting runs
Carpenter 30
Wong -11
Holliday 26
Adams 14
Peralta 7
Pierz -10
Jay 17
Robinson -18

Here, even though we have plenty of good bats in this lineup, Matheny prefers to slot one of the worst in the two hole. Many managers just can’t resist doing so, and I’m not really sure why, other than it seems to be a tradition without a good reason. Perhaps it harkens back to the day when managers would often sac bunt or hit and run after the leadoff hitter reached base with no outs. It is also a mystery why Jay bats 7th. He is even having a very good year at the plate, so it’s not like his seasonal performance belies his projection.

What if we swap Wong and Jay? That generates 69 runs above average per 150 games, which is 7 runs better than with Wong batting second, and 5 runs worse than my original “optimal” lineup. Let’s try another “manual” optimization. We’ll put Jay lead off, followed by Carp, Adams, Holliday, Peralta, Wong, Pierz, and Robinson. That lineup produces 76 runs above average, 14 runs better than the actual one, and better than my computer generated simple “optimal” one. So for the Cardinals, we’ve added 1.5 wins per season just by shuffling around their lineup, and especially by removing a poor hitter from the number 2 slot and moving up a good hitter in Jay (and who also happens to be an excellent base runner).

Braves

Name Projected Batting runs
Heyward 23
Gosselin -29
Freeman 24
J Upton 20
Johnson 9
Gattis -1
Simmons -16
BJ Upton -13

Our old friend Fredi Gonzalez finally moved BJ Upton from first to last (and correctly so, although he was about a year too late), and he puts Heyward at lead off, which is pretty radical. Yet he somehow bats one of the worst batters in all of baseball in the 2-hole, accumulating far too many outs at the top of the order. If we do nothing but move Gosselin down to 8th, where he belongs, we generate 35 runs, 4 more than with him batting second. Not a huge difference, but half a win is half a win. They all count and they all add up.

Indians

Name Projected Batting runs
Kipnis 5
Aviles -19
Brantley 13
Santana 6
Gomes 8
Rayburn -9
Walters -13
Holt -21
Jose Ramirez -32

The theme here is obvious. When a team puts a terrible hitter in the two-hole, they lose runs, which is not surprising. If we merely move Aviles down to the 7 spot and move everyone up accordingly, the lineup produces -28 runs rather than -33 runs, a gain of 5 runs just by removing Aviles from the second slot.

Marlins

Name Projected Batting runs
Yelich 15
Solano -21
Stanton 34
McGhee -8
Jones -10
Salty 0
Ozuna 4
Hechavarria -27

With the Fish, we have an awful batter in the two hole, a poor hitter in the 4 hole, and decent batters in the 6 and 7 hole. What if we just swap Solano for Ozuna, getting that putrid bat out of the 2 hole? Running another simulation results in 13 runs above average per 150 games, besting the actual lineup by 6 runs.

Just for the heck of it, let’s rework the entire lineup, putting Ozuna in the 2 hole, Salty in the 3 hole, Stanton in the 4 hole, then McGhee, Jones, Solano, and Hechy. Surprisingly, that only generates 12 runs above average per 150, better than their actual lineup, but slightly worse than just swapping Solano and Ozuna. The Achilles heel for that lineup, as it is for several others, appears to be the poor hitter batting second.

Most optimal lineups

Team | Actual Lineup Simulation Results (Runs per 150) | “Optimal” Lineup Simulation Results | Gain per 150 games
LAA | 160 | 153 | -7
SEA | 45 | 39 | -6
DET | 13 | 8 | -5
TOR | 86 | 82 | -4

Finally, let’s take a look at the actual lineups that generate more runs per game than my simple “optimal” batting order.

Angels

Name Projected Batting runs
Calhoun 20
Trout 59
Pujols 7
Hamilton 17
Kendrick 10
Freese 8
Aybar 0
Iannetta 2
Cowgill -7

Mariners

Name Projected Batting runs
Jackson 11
Ackley -3
Cano 35
Morales 1
Seager 13
Zunino -14
Morrison -2
Chavez -24
Taylor -2

Tigers

Name Projected Batting runs
Davis -2
Kinsler 6
Cabrera 50
V Martinez 17
Hunter 10
JD Martinez -4
Castellanos -20
Holaday -44
Suarez -23

Blue Jays

Name Projected Batting runs
Reyes 11
Cabrera 15
Bautista 34
Encarnacion 20
Lind 6
Navarro -7
Rasmus -1
Valencia -9
Kawasaki -23

Looking at all these “optimal” lineups, the trend is pretty clear. Bat your best hitters at the top and your worst at the bottom, and do NOT put a scrappy, no-hit batter in the two hole! The average projected linear weights per 150 games for the number two hitter in our 4 best actual lineups is 19.25 runs. The average 2-hole hitter in our 4 worst lineups is -20 runs. That should tell you just about everything you need to know about lineup construction.
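Those two averages can be checked directly against the second-slot entries in the eight lineup tables above. The dictionary names and the `average` helper are just for illustration.

```python
# Projected batting runs of the number two hitters, copied from the lineup tables above.
best_actual = {"LAA Trout": 59, "SEA Ackley": -3, "DET Kinsler": 6, "TOR Cabrera": 15}
worst_actual = {"STL Wong": -11, "ATL Gosselin": -29, "CLE Aviles": -19, "MIA Solano": -21}

def average(values):
    """Plain arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)

print(average(best_actual.values()))   # 19.25
print(average(worst_actual.values()))  # -20.0
```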

Note: According to The Book, batting your pitcher 8th in an NL lineup generates slightly more runs per game than batting him 9th, as most managers do. Tony LaRussa sometimes did this, especially with McGwire in the lineup. Other managers, like Maddon, occasionally do the same. There is some controversy over which option is optimal.

When I ran my simulations above, swapping the pitcher and the 8th hitter in the NL lineups, the resultant runs per game were around 2 runs worse (per 150) than with the traditional order. It probably depends on who the position player is at the bottom of the order and perhaps on the players at the top of the order as well.

Yesterday I looked at how and whether a hitter’s mid-season-to-date stats can help us to inform his rest-of-season performance, over and above a credible up-to-date mid-season projection. Obviously the answer to that depends on the quality of the projection – specifically how well it incorporates the season-to-date data in the projection model.

For players who were having dismal performances after the first, second, third, all the way through the fifth month of the season, the projection accurately predicted the last month’s performance and the first 5 months of data added nothing to the equation. In fact, those players who were having dismal seasons so far, even into the last month of the season, performed fairly admirably the rest of the way – nowhere near the level of their season-to-date stats. I concluded that the answer to the question, “When should we worry about a player’s especially poor performance?” was, “Never. It is irrelevant other than how it influences our projection for that player, which is not much, apparently.” For example, full-time players who had a .277 wOBA after the first month of the season, were still projected to be .342 hitters, and in fact, they hit .343 for the remainder of the season. Even halfway through the season, players who hit .283 for 3 solid months were still projected at .334 and hit .335 from then on. So, ignore bad performances and simply look at a player’s projection if you want to estimate his likely performance tomorrow, tonight, next week, or for the rest of the season.

On the other hand, players who have been hitting well above their mid-season projections (crafted after and including the hot hitting) actually outhit their projections by anywhere from 4 to 16 points, still nowhere near the level of their “hotness,” however. This suggests that the projection algorithm is not handling recent “hot” hitting properly – at least my projection algorithm. Then again, when I looked at hitters who were projected at well above average 2 months into the season, around .353, the hot ones and the cold ones each hit almost exactly the same over the rest of the season, equivalent to their respective projections. In that case, how they performed over those 2 months gave us no useful information beyond the mid-season projection. In one group, the “cold” group, players hit .303 for the first 2 months of the season, and they were still projected at .352. Indeed, they hit .349 for the rest of the season. The “hot” batters hit .403 for the first 2 months, they were projected to hit .352 after that and they did indeed hit exactly .352. So there would be no reason to treat these hot and cold above-average hitters any differently from one another in terms of playing time or slot in the batting order.

Today, I am going to look at pitchers. I think the perception is that, because pitchers get injured more easily than position players, learn and experiment with new and different pitches, often lose velocity, can have their mechanics break down, and can be affected by psychological and emotional factors more easily than hitters, early or mid-season “trends” are important in terms of future performance. Let’s see to what extent that might be true.

After one month, there were 256 pitchers or around 1/3 of all qualified pitchers (at least 50 TBF) who pitched terribly, to the tune of a normalized ERA (NERA) of 5.80 (league average is defined as 4.00). I included all pitchers whose NERA was at least 1/2 run worse than their projection. What was their projection after that poor first month? 4.08. How did they pitch over the next 5 months? 4.10. They faced 531 more batters over the last 5 months of the season.

What about the “hot” pitchers? They were projected after one month at 3.86 and they pitched at 2.56 for that first month. Their performance over the next 5 months was 3.85. So for the “hot” and “cold” pitchers after one month, their updated projection accurately told us what to expect for the remainder of the season and their performance to-date was irrelevant.

In fact, if we look at pitchers who had good projections after one month and divide those into two groups: One that pitches terribly for the first month, and one that pitches brilliantly for the first month, here is what we get:

Good pitchers who were cold for 1 month

First month: 5.38
Projection after that month: 3.79
Performance over the last 5 months: 3.75

Good pitchers who were hot for 1 month

First month: 2.49
Projection after that month: 3.78
Performance over the last 5 months: 3.78

So, and this is critical, one month into the season if you are projected to pitch above average, at, say 3.78, it makes no difference whether you have pitched great or terribly thus far. You are going to pitch at exactly your projection for the remainder of the season!

Yet the cold group faced 587 more batters and the hot group 630. Managers again are putting too much emphasis on those first-month stats.

What if you are projected after one month as a mediocre pitcher but you have pitched brilliantly or poorly over the first month?

Bad pitchers who were cold for 1 month

First month: 6.24
Projection after that month: 4.39
Performance over the last 5 months: 4.40

Bad pitchers who were hot for 1 month

First month: 3.06
Projection after that month: 4.39
Performance over the last 5 months: 4.47

Same thing. It makes no difference whether a poor or mediocre pitcher had pitched well or poorly over the first month of the season. If you want to know how he is likely to pitch for the remainder of the season, simply look at his projection and ignore the first month. Those stats give you no more useful information. Again, the “hot” but mediocre pitchers got 44 more TBF over the final 5 months of the season, despite pitching exactly the same as the “cold” group over that 5 month period.

What about halfway into the season? If two groups of pitchers have the same mid-season projection, but one group was “hot” over the first 3 months and the other was “cold,” do they pitch the same for the remaining 3 months? The projection algorithm does not handle the 3-month anomalous performances very well. Here are the numbers:

Good pitchers who were cold for 3 months

First 3 months: 4.60
Projection after 3 months: 3.67
Performance over the last 3 months: 3.84

Good pitchers who were hot for 3 months

First 3 months: 2.74
Projection after 3 months: 3.64
Performance over the last 3 months: 3.46

So for the hot pitchers the projection is undershooting them by around .18 runs per 9 IP and for the cold ones, it is over-shooting them by .17 runs per 9. Then again the actual performance is much closer to the projection than to the season-to-date performance. As you can see, pitcher stats halfway through the season are a terrible proxy for true talent/future performance. These “hot” and “cold” pitchers, whose first half performance and second half projections were divergent by at least .5 runs per 9, performed in the second half around .75 runs per 9 better or worse than in the first half. You are much better off using the mid-season projection than the actual first-half performance.

For poorer pitchers who were “hot” and “cold” for 3 months, we get these numbers:

Poor pitchers who were cold for 3 months

First 3 months: 5.51
Projection after 3 months: 4.41
Performance over the last 3 months: 4.64

Poor pitchers who were hot for 3 months

First 3 months: 3.53
Projection after 3 months: 4.43
Performance over the last 3 months: 4.33

The projection model is still not giving enough weight to the recent performance, apparently. That is especially true of the “cold” pitchers. It over values them by .23 runs per 9. It is likely that these pitchers are suffering some kind of injury or velocity decline and the projection algorithm is not properly accounting for that. For the “hot” pitchers, the model only undervalues these mediocre pitchers by .1 runs per 9. Again, if you try and use the actual 3-month performance as a proxy for true talent or to project their future performance, you would be making a much bigger mistake – to the tune of around .8 runs per 9.
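To make the size of the two kinds of mistake concrete, here is a short sketch that compares each predictor with what the four 3-month groups actually did over the rest of the season. All of the numbers are copied from the groups above; the function and variable names are just for illustration.

```python
# (first 3 months, projection after 3 months, last 3 months), normalized ERA,
# copied from the four groups above.
groups = {
    "good, cold": (4.60, 3.67, 3.84),
    "good, hot":  (2.74, 3.64, 3.46),
    "poor, cold": (5.51, 4.41, 4.64),
    "poor, hot":  (3.53, 4.43, 4.33),
}

def prediction_errors(to_date, projection, rest_of_season):
    """Absolute miss of each predictor, in runs per 9, rounded to two places."""
    return (round(abs(to_date - rest_of_season), 2),
            round(abs(projection - rest_of_season), 2))

for name, numbers in groups.items():
    to_date_miss, projection_miss = prediction_errors(*numbers)
    print(f"{name}: to-date stats miss by {to_date_miss:.2f}, "
          f"projection misses by {projection_miss:.2f}")
```

In every group the projection's miss is a fraction of the miss you get from using the season-to-date numbers.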

What about 5 months into the season? If the projection and the 5 month performance are divergent, which is better? Is using those 5 month stats a bad idea?

Yes, it still is. In fact, it is a terrible idea. For some reason, the projection does a lot better after 5 months than after 3 months. Perhaps some of those injured pitchers are selected out. Even though the projection slightly under and over values the hot and cold pitchers, using their 5 month performance as a harbinger of the last month is a terrible idea. Look at these numbers:

Poor pitchers who were cold for 5 months

First 5 months: 5.45
Projection after 5 months: 4.41
Performance over the last month: 4.40

Poor pitchers who were hot for 5 months

First 5 months: 3.59
Projection after 5 months: 4.39
Performance over the last month: 4.31

For the mediocre pitchers, the projection almost nails both groups, despite it being nowhere near the level of the first 5 months of the season. I cannot emphasize this enough: Even 5 months into the season, using a pitcher’s season-to-date stats as a predictor of future performance or a proxy for true talent (which is pretty much the same thing) is a terrible idea!

Look at the mistakes you would be making. You would be thinking that the hot group was composed of 3.59 pitchers when in fact they were 4.39 pitchers who performed more or less as such (4.31). That is a difference of .72 runs per 9 between their first 5 months and their last month. For your cold pitchers, you would undervalue them by more than a run per 9! What do managers do after 5 months of “hot” and “cold” pitching, despite the fact that both groups pitched almost the same for the last month of the season? They give the hot group an average of 13 more TBF per pitcher. That is around a 3 inning difference in one month.

Here are the good pitchers who were hot and cold over the first 5 months of the season:

Good pitchers who were cold for 5 months

First 5 months: 4.62
Projection after 5 months: 3.72
Performance over the last month: 3.54

Good pitchers who were hot for 5 months

First 5 months: 2.88
Projection after 5 months: 3.71
Performance over the last month: 3.72

Here the “hot,” good pitchers pitched exactly at their projection despite pitching at .83 runs per 9 better over the first 5 months of the season. The “cold” group actually outperformed their projection by .18 runs and pitched better than the “hot” group! This is probably a sample size blip, but the message is clear: Even after 5 months, forget about how your favorite pitcher has been pitching, even for most of the season. The only thing that counts is his projection, which utilizes many years of performance plus a regression component, and not just 5 months worth of data. It would be a huge mistake to use those 5 month stats to predict these pitchers’ performances.

Managers can learn a huge lesson from this. The average number of batters faced in the last month of the season among the hot pitchers was 137, or around 32 IP. For the cold group, it was 108 TBF, or 25 IP. Again, the “hot” group pitched 7 more IP in only a month, yet they pitched worse than the “cold” group and both groups had the same projection!

The moral of the story here is that for the most part, and especially at the beginning and end of the season, ignore actual pitching performance to-date and use credible mid-season projections if you want to predict how your favorite or not-so favorite pitcher is likely to pitch tonight or over the remainder of the season. If you don’t, and that actual performance is significantly different from the updated projection, you are making a sizable mistake.
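As a quick check on the four 5-month groups listed above, here is a minimal sketch that compares how far each predictor (season-to-date RA9 versus the updated projection) landed from the last-month performance. The numbers are the ones quoted in this post; the function and variable names are made up for illustration.

```python
# (season-to-date RA9, projection, last-month RA9) for the four groups quoted above
groups = {
    "poor, cold": (5.45, 4.41, 4.40),
    "poor, hot": (3.59, 4.39, 4.31),
    "good, cold": (4.62, 3.72, 3.54),
    "good, hot": (2.88, 3.71, 3.72),
}


def prediction_errors(to_date, projection, actual):
    """Return the absolute miss of each predictor, in runs per 9."""
    return round(abs(to_date - actual), 2), round(abs(projection - actual), 2)


errors = {name: prediction_errors(*vals) for name, vals in groups.items()}

for name, (to_date_err, proj_err) in errors.items():
    print(f"{name}: to-date misses by {to_date_err:.2f}, projection by {proj_err:.2f}")
```

In every group the projection is the closer of the two, usually by a wide margin.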

 

 

Recently on Twitter I have been harping on the folly of using a player’s season-to-date stats, be it OPS, wOBA, RC+, or some other metric, for anything other than, well, how they have done so far. From a week into the season until the last pitch is thrown in November, we are inundated with articles and TV and radio commentaries about how so and so should be getting more playing time because his OPS is .956 or how player X should be benched or at least dropped in the order because he is hitting .245 (in wOBA). Commentators, writers, analysts and fans wonder whether player Y’s unusually great or poor performance is “sustainable,” whether it is a “breakout” likely to continue, an age or injury related decline that portends an end to a career, or a temporary blip that will end after said injury is healed.

With web sites such as Fangraphs.com allowing us to look up a player’s current, up-to-date projections which already account for season-to-date performance, the question that all these writers and fans must ask themselves is, “Do these current season stats offer any information over and above the projections that might be helpful in any future decisions, such as whom to play or where to slot a player in the lineup, or simply whom to be optimistic or pessimistic about on your favorite team?”

Sure, if you don’t have a projection for a player, and you know nothing about his history or pedigree, a player’s season-to-date performance tells you something about what he is likely to do in the future, but even then, it depends on the sample size of that performance – at the very least you must regress that performance towards the league mean, the amount of regression being a function of the number of opportunities (PA) underlying the seasonal stats.
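To make the idea of regressing toward the league mean concrete, here is a minimal sketch. The league mean of .320 and the regression constant `k` are assumptions chosen for illustration, not values I am claiming are correct for wOBA; the point is only that the same observed number gets pulled much harder toward the mean in a small sample.

```python
def regress_to_mean(observed, pa, league_mean=0.320, k=200):
    """Shrink an observed rate toward the league mean.

    k is a hypothetical regression constant (the number of PA at which
    the observed rate and the league mean get equal weight); it is not
    a value taken from this article.
    """
    return (observed * pa + league_mean * k) / (pa + k)


# The same .400 wOBA means much less in 50 PA than in 500 PA
small_sample = regress_to_mean(0.400, 50)
large_sample = regress_to_mean(0.400, 500)
print(f"{small_sample:.3f} {large_sample:.3f}")
```

A real projection does much more than this (multiple seasons, aging, minor league data), but the shrinkage step works the same way.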

However, when it is so easy for virtually anyone to look up a player’s projection on Fangraphs, Baseball Prospectus, The Hardball Times, or a host of other fantasy baseball web sites, why should we care about those current stats other than as a reflection of what a certain player has accomplished thus far in the season? Let’s face it, 2 or 3 months into the season, if a player who is projected at .359 (wOBA) is hitting .286, it is human nature to call for his benching, to drop him in the batting order, or simply to expect him to continue to hit in a putrid fashion. Virtually everyone thinks this way, even many astute analysts. It is an example of recency bias, which is one of the most pervasive human traits in all facets of life, including and especially in sports.

Who would you rather have in your lineup – Player A who has a Steamer wOBA projection of .350 but who is hitting .290 4 months into the season or Player B whom Steamer projects at .330, but is hitting .375 with 400 PA in July? If you said, “Player A,” I think you are either lying or you are in a very, very small minority.

Let’s start out by looking at some players whose current projection and season-to-date performance are divergent. I’ll use Steamer ROS (rest-of-season) wOBA projections from Fangraphs as compared to their actual 2014 wOBA. I’ll include anyone who has at least 200 PA and the absolute difference between their wOBA and wOBA projection is at least 40 points. The difference between a .320 and .360 hitter is the difference between an average player and a star player like Pujols or Cano, and the difference between a .280 and a .320 batter is like comparing a light-hitting backup catcher to a league average hitter.

Believe it or not, even though we are 40% into the season, around 20% of all qualified (by PA) players have a current wOBA projection that is more than 39 points greater or less than their season-to-date wOBA.

Players whose projection is higher than their actual

Name, PA, Projected wOBA, Actual wOBA

Cargo 212 .375 .328
Posey 233 .365 .322
Butler 258 .359 .278
Wright 295 .351 .307
Mauer 263 .350 .301
Craig 276 .349 .303
McCann 224 .340 .286
Hosmer 287 .339 .284
Swisher 218 .334 .288
Aoki 269 .330 .285
Brown 236 .329 .252
Alonso 223 .328 .260
Brad Miller 204 .312 .242
Schierholtz 219 .312 .265
Gyorko 221 .311 .215
De Aza 221 .311 .268
Segura 258 .308 .267
Bradley Jr. 214 .308 .263
Cozart 228 .290 .251

Players whose projection is lower than their actual

Name, PA, Projected wOBA, Actual wOBA

Tulo 259 .403 .472
Puig 267 .382 .431
V. Martinez 257 .353 .409
N. Cruz 269 .352 .421
LaRoche 201 .349 .405
Moss 255 .345 .392
Lucroy 258 .340 .398
Seth Smith 209 .337 .403
Carlos Gomez 268 .334 .405
Dunn 226 .331 .373
Morse 239 .329 .377
Frazier 260 .329 .369
Brantley 277 .327 .386
Dozier 300 .316 .357
Solarte 237 .308 .354
Alexi Ramirez 271 .306 .348
Suzuki 209 .302 .348
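The selection rule behind these two lists (at least 200 PA and a gap of at least 40 points between projection and actual) can be sketched as follows. The first four rows are copied from the lists above; the last row is a made-up player included only to show a non-qualifier. Gaps are converted to whole points before comparing so that floating-point noise does not affect the cutoff.

```python
# (name, PA, projected wOBA, actual wOBA); the first four rows come from the lists above
players = [
    ("Posey", 233, 0.365, 0.322),
    ("Hosmer", 287, 0.339, 0.284),
    ("Tulo", 259, 0.403, 0.472),
    ("Carlos Gomez", 268, 0.334, 0.405),
    ("Hypothetical Player", 250, 0.330, 0.345),  # made up: inside the 40-point band
]

MIN_PA = 200
MIN_GAP_POINTS = 40  # 1 point = .001 of wOBA


def gap_in_points(projected, actual):
    """Projection minus actual, in whole points of wOBA."""
    return round((projected - actual) * 1000)


under = [name for name, pa, proj, act in players
         if pa >= MIN_PA and gap_in_points(proj, act) >= MIN_GAP_POINTS]
over = [name for name, pa, proj, act in players
        if pa >= MIN_PA and gap_in_points(proj, act) <= -MIN_GAP_POINTS]

print("Projection higher than actual:", under)
print("Projection lower than actual:", over)
```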

Now tell the truth: Who would you rather have at the plate tonight or tomorrow, Billy Butler, with his .359 projection and .278 actual, or Carlos Gomez, projected at .334, but currently hitting at .405? How about Hosmer (not to pick on the Royals) or Michael Morse? If you are like most people, you probably would choose Gomez over Butler, despite the fact that Gomez is projected as 25 points worse, and Morse over Hosmer, even though Hosmer is supposedly 10 points better than Morse. (I am ignoring park effects to simplify this part of the analysis.)

So how can we test whether your decision or blindly going with the Steamer projections would likely be the correct thing to do, emotions and recency bias aside? That’s relatively simple, if we are willing to get our hands dirty doing some lengthy and somewhat complicated historical mid-season projections. Luckily, I’ve already done that. I have a database of my own proprietary projections on a month-by-month basis for 2007-2013. So, for example, 2 months into the 2013 season, I have a season-to-date projection for all players. It incorporates their 2009-2012 performance, including AA and AAA, as well as their 2-month performance (again, including the minor leagues) so far in 2013. These projections are park and context neutral. We can then compare the projections with both their season-to-date performance (also context-neutral) and their rest-of-season performance in order to see whether, for example, a player who is projected at .350 even though he has hit .290 after 2 months will perform any differently in the last 4 months of the season than another player who is also projected at .350 but who has hit .410 after 2 months. We can do the same thing after one month (looking at the next 5 months of performance) or 5 months (looking at the final month performance). The results of this analysis should suggest to us whether we would be better off with Butler for the remainder of the season or with Gomez, or with Hosmer or Morse.
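The shape of that comparison can be sketched roughly as below. This is only an illustration of the bookkeeping, not my actual projection code: the rows are made up, the 40-point cutoff is the one used throughout this post, and weighting each player by his rest-of-season PA is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    projection: float  # updated projection at the cutoff date (wOBA)
    to_date: float     # season-to-date wOBA at the cutoff
    ros: float         # rest-of-season wOBA
    ros_pa: int        # rest-of-season plate appearances


def weighted_mean(rows, attr):
    """Average of one field, weighted by rest-of-season PA."""
    total_pa = sum(r.ros_pa for r in rows)
    return sum(getattr(r, attr) * r.ros_pa for r in rows) / total_pa


def summarize(rows):
    """PA-weighted projection, to-date, and rest-of-season wOBA for a group."""
    return {k: round(weighted_mean(rows, k), 3) for k in ("projection", "to_date", "ros")}


def split_hot_cold(rows, gap=0.040):
    """Hot: to-date at least `gap` above projection; cold: at least `gap` below."""
    hot = [r for r in rows if r.to_date - r.projection >= gap]
    cold = [r for r in rows if r.projection - r.to_date >= gap]
    return hot, cold


# Made-up rows for illustration only; they are not from my database
sample = [
    PlayerSeason(0.350, 0.410, 0.355, 400),
    PlayerSeason(0.340, 0.395, 0.338, 450),
    PlayerSeason(0.350, 0.290, 0.352, 380),
    PlayerSeason(0.345, 0.300, 0.340, 420),
    PlayerSeason(0.330, 0.335, 0.331, 500),  # neither hot nor cold
]

hot, cold = split_hot_cold(sample)
print("hot ", summarize(hot))
print("cold", summarize(cold))
```

If season-to-date stats carried information beyond the projection, the rest-of-season number for each group would sit well away from the projection and toward the to-date figure.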

I took all players in 2007-2013 whose projection was at least 40 points less than their actual wOBA after one month into the season. They had to have had at least 50 PA. There were 116 such players, or around 20% of all qualified players. Their collective projected wOBA was .341 and they were hitting .412 after one month with an average of 111 PA per player. For the remainder of the season, in a total of 12,922 PA, or 494 PA per player, they hit .346, or 5 points better than their projection, but 66 points worse than their season-to-date performance. Again, all numbers are context (park, opponent, etc.) neutral. One standard deviation in that many PA is 4 points, so a 5 point difference between projected and actual is not statistically significant. There is some suggestion, however, that the projection algorithm is slightly undervaluing the “hot” (as compared to their projection) hitter during the first month of the season, perhaps by giving too little weight to the current season.
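For readers who want to see where a figure like “4 points” comes from, here is a rough sketch. It assumes that the standard deviation of wOBA in a single PA is about 0.5, which is a ballpark figure chosen for illustration, and it treats PA as independent.

```python
import math

# Assumption: the standard deviation of wOBA in a single PA is roughly 0.5.
# This is a ballpark figure for illustration, not a value stated in the article.
PER_PA_SD = 0.5


def woba_sd(pa, per_pa_sd=PER_PA_SD):
    """Approximate one standard deviation of a group's wOBA over `pa` PA."""
    return per_pa_sd / math.sqrt(pa)


sd = woba_sd(12_922)
observed_gap = 0.346 - 0.341  # rest-of-season minus projection
z = observed_gap / sd
print(f"SD = {sd * 1000:.1f} points, gap = {z:.1f} SD")
```

Under those assumptions the 5 point gap is only a little more than one standard deviation, which is why I do not call it statistically significant.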

What about the players who were “cold” (relative to their projections) the first month of the season? There were 92 such players and they averaged 110 PA during the first month with a .277 wOBA. Their projection after 1 month was .342, slightly higher than the first group. Interestingly, they only averaged 464 PA for the remainder of the season, 30 PA less than the “hot” group, even though they were equivalently projected, suggesting that managers were benching more of the “cold” players or moving them down in the batting order. How did they hit for the remainder of the season? .343 or almost exactly equal to their projection. This suggests that managers are depriving these players of deserved playing time. By the way, after only one month, more than 40% of all qualified players are hitting 40 points better or worse than their projections. That’s a lot of fodder for internet articles and sports talk radio!

You might be thinking, “Well, sure, if a player is “hot” or “cold” after only a month, it probably doesn’t mean anything.” In fact, most commentaries you read or hear will give the standard SSS (small sample size) disclaimer only a month or even two months into the season. But what about halfway into the season? Surely, a player’s season-to-date stats will have stabilized by then and we will be able to identify those young players who have “broken out,” old, washed-up players, or players who have lost their swing or their mental or physical capabilities.

About halfway into the season, around 9% of all qualified (50 PA per month) players were hitting at least 40 points worse than their projections in an average of 271 PA. Their collective projection was .334 and their actual performance after 3 months and 271 PA was .283. Basically, these guys, despite being supposed league-average full-time players, stunk for 3 solid months. Surely, they would stink, or at least not be up to “par,” for the rest of the season. After all, wOBA at least starts to “stabilize” after almost 300 PA, right? Well, these guys, just like the “cold” players after one month, hit .335 for the remainder of the season, 1 point better than their projection. So after 1 month or 3 months, their season-to-date performance tells us nothing that our up-to-date projection doesn’t tell us. A player is expected to perform at his projected level regardless of his current season performance after 3 months, at least for the “cold” players. What about the “hot” ones, you know, the ones who may be having a breakout season?

There were also about 9% of all qualified players who were having a “hot” first half. Their collective projection was .339, and their average performance was .391 after 275 PA. How did they hit the remainder of the season? .346, 7 points better than their projection and 45 points worse than their actual performance. Again, there is some suggestion that the projection algorithm is undervaluing these guys for some reason. Again, the “hot” first-half players accumulated 54 more PA over the last 3 months of the season than the “cold” first-half players despite hitting only 11 points better. It seems that managers are over-reacting to that first-half performance, which should hardly be surprising.

Finally, let’s look at the last month of the season as compared to the first 5 months of performance. Do we have a right to ignore projections and simply focus on season-to-date stats when it comes to discussing the future – the last month of the season?

The 5-month “hot” players were hitting .391 in 461 PA. Their projection was .343, and they hit .359 over the last month. So, we are still more than twice as close to the projection than we are to the actual, although there is a strong inference that the projection is not weighting the current season enough or doing something else wrong, at least for the “hot” players.

For the “cold” players, we see the same thing as we do at any point in the season. The season-to-date stats are worthless if you know the projection. 3% of all qualified players (at least 250 PA) hit at least 40 points worse than their projection after 5 months. They were projected at .338, hit .289 for the first 5 months in 413 PA, and then hit .339 in that last month. They only got an average of 70 PA over the last month of the season, as compared to 103 PA for the “hot” batters, despite proving that they were league-average players even though they stunk up the field for 5 straight months.

After 4 months, BTW, “cold” players actually hit 7 points better than their projection for the last 2 months of the season, even though their actual season-to-date performance was 49 points worse. The “hot” players hit only 10 points better than their projection despite hitting 52 points better over the first 4 months.

Let’s look at the numbers in another way. Let’s say that we are 2 months into the season, similar to the present time. How do .350 projected hitters fare for the rest of the season if we split them into two groups: those that have been “cold” so far and those that have been “hot”? This is like our Butler or Gomez, Morse or Hosmer question.

I looked at all “hot” and “cold” players who were projected at greater than .330 after 2 months into the season. The “hot” ones, the Carlos Gomezes and Michael Morses, hit .403 for 2 months, and were then projected at .352. How did they hit over the rest of the season? .352.

What about the “cold” hitters who were also projected at greater than .330? These are the Butlers and Hosmers. They hit a collective .303 for the first 2 months of the season, their projection was .352, the same as the “hot” hitters, and their wOBA for the last 4 months was .349! Wow. Both groups of good hitters (according to their projections) hit almost exactly the same. They were both projected at .352 and one group hit .352 and the other hit .349. Of course the “hot” group got 56 more PA per player over the remainder of the season, despite being projected the same and performing essentially the same.

Let’s try those same hitters who are projected at better than .330, but who have been “hot” or “cold” for 5 months rather than only 2.

Cold

Projected: .350 Season-to-date: .311 ROS: .351

Hot

Projected: .354 Season-to-date: .393 ROS: .363

Again, after 5 months, the players projected well who have been hot are undervalued by the projection, but not nearly as much as the season-to-date performance might suggest. Good players who have been cold for 5 months hit exactly as projected and the “cold” 5 months has no predictive value, other than how it changes the up-to-date projection.

For players who are projected poorly, less than a .320 wOBA, the 5-month hot ones outperform their projections and the cold ones under-perform their projections, both by around 8 points. After 2 months, there is no difference – both “hot” and “cold” players perform at around their projected levels over the last 4 months of the season.

So what are our conclusions? Until we get into the last month or two of the season, season-to-date stats provide virtually no useful information once we have a credible projection for a player. For “hot” players, we might “bump” the projection by a few points in wOBA even 2 or 3 months into the season – apparently the projection is slightly under-valuing these players for some reason. However, it does not appear to be correct to prefer a “hot” player like Gomez versus a “cold” one like Butler when the “cold” player is projected at 25 points better, regardless of the time-frame. Later in the season, at around the 4th or 5th month, we might need to “bump” our projection, at least my projection, by 10 or 15 points to account for a torrid first 4 or 5 months. However, the 20 or 25 point better player, according to the projection, is still the better choice.

For “cold” players, season-to-date stats appear to provide no information whatsoever over and above a player’s projection, regardless of what point in the season we are at. So, when should we be worried about a hitter if he is performing far below his “expected” performance? Never. If you want a good estimate of his future performance, simply use his projection and ignore his putrid season-to-date stats.

In the next installment, I am going to look at the spread of performance for hot and cold players. You might hypothesize that while being hot or cold for 2 or 3 months has almost no effect on the next few months of performance, perhaps it does change the distribution of that performance among the group of  hot and cold players.

 

 

Yesterday, I posted an article describing how I modeled to some extent a way to tell whether and by how much pitchers may be able to pitch in such a way as to allow fewer or more runs than their components, including the more subtle ones, like balks, SB/CS, WP, catcher PB, GIDP, and ROE suggest.

For various reasons, I suggest taking these numbers with a grain of salt. For one thing, I need to tweak my RA9 simulator to take into consideration a few more of these subtle components. For another, there may be some things that stick with a pitcher from year to year that have nothing to do with his “RA9 skill” but which serve to increase or decrease run scoring, given the same set of components. Two of these are a pitcher’s outfielder arms and the vagaries of his home park, which both have an effect on base runner advances on hits and outs. Using a pitcher’s actual sac flies against will mitigate this, but the sim is also using league averages for base runner advances on hits, which, as I said, can vary from pitcher to pitcher, and tend to persist from year to year (if a pitcher stays on the same team) based on his outfielders and his home park. Like DIPS, it would be better to do these correlations only on pitchers who switch teams, but I fear that the sample would be too small to get any meaningful results.

Anyway, I have a database now of the last 10 years’ differences between a pitcher’s RA9 and his sim RA9 (the runs per 27 outs generated by my sim), for all pitchers who threw to at least 100 batters in a season.

First here are some interesting categorical observations:

Jared Cross, of Steamer projections, suggested to me that perhaps some pitchers, like lefties, might hold base runners on first base better than others, and therefore depress scoring a little as compared to the sim, which uses league-average base running advancement numbers. Well, lefties actually did a hair worse in my database. Their RA9 was .02 greater than their sim RA. Righties were .01 better. That does not necessarily mean that RHP have some kind of RA skill that LHP do not have. It is more likely a bias in the sim that I am not correcting for.

How about the number of pitches in a pitcher’s repertoire? I hypothesized that pitchers with more pitches would be better able to tailor their approach to the situation. For example, with a base open, you want your pitcher to be able to throw lots of good off-speed pitches in order to induce a strikeout or weak contact, whereas you don’t mind if he walks the batter.

I was wrong. Pitchers with 3 or more pitches that they throw at least 10% of the time are .01 runs worse in RA9. Pitchers with only 2 or fewer pitches are .02 runs better. I have no idea why that is.

How about pitchers who are just flat out good in their components such that their sim RA is low, like under 4.00 runs? Their RA9 is .04 worse. Again, there might be some bias in the sim which is causing that. Or perhaps if you just go out there and “air it out” and try and get as many outs and strikeouts as possible, regardless of the situation, you are not pitching optimally.

Conversely, pitchers with a sim RA of 4.5 or greater shave .03 runs off their RA9. If you are over 5 in your sim RA, your actual RA9 is .07 runs better and if you are below 3.5, your RA9 is .07 runs higher. So, there probably is something about having extreme components that even the sim is not picking up. I’m not sure what that could be. Or, perhaps if you are simply not that good of a pitcher, you have to find ways to minimize run scoring above and beyond the hits and walks you allow overall.

For the NL pitchers, their RA9 is .05 runs better than their sim RA, and for the AL, they are .05 runs worse. So the sim is not doing a good job with respect to the leagues, likely because of pitchers batting. I’m not sure why, but I need to fix that. For now, I’ll adjust a pitcher’s sim RA according to his league.
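A minimal sketch of that kind of league adjustment is below. It assumes a simple additive shift of the sim RA by the league-wide bias reported above; the function name and the two example pitchers are hypothetical, and the actual adjustment may well be more involved.

```python
# Average (RA9 - sim RA) by league, as reported above
LEAGUE_BIAS = {"NL": -0.05, "AL": 0.05}


def ra_skill(ra9, sim_ra, league):
    """RA9 minus league-adjusted sim RA.

    Negative means the pitcher allowed fewer runs than his components suggest.
    Assumes the league bias is removed with a flat additive shift.
    """
    adjusted_sim = sim_ra + LEAGUE_BIAS[league]
    return round(ra9 - adjusted_sim, 2)


# Hypothetical pitchers with identical raw numbers, one in each league
nl_skill = ra_skill(3.90, 4.00, "NL")
al_skill = ra_skill(3.90, 4.00, "AL")
print(nl_skill, al_skill)
```

The same raw gap of .10 runs looks less impressive for the NL pitcher once the league-wide bias is taken out.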

You might think that younger pitchers would be “throwers” and older ones would be “pitchers” and thus their RA skill would reflect that. This time you would be right – to some extent.

Pitchers less than 26 years old were .01 runs worse in RA9. Pitchers older than 30 were .03 better. But that might just reflect the fact that pitchers older than 30 are just not very good – remember, we have a bias in terms of quality of the sim RA and the difference between that and regular RA9.

Actually, even when I control for the quality of the pitcher, the older pitchers had more RA skill than the younger ones by around .02 to .04 runs. As you can see, none of these effects, even if they are other than noise, is very large.

Finally, here are the lists of the 10 best and worst pitchers with respect to “RA skill,” with no commentary. I adjusted for the “quality of the sim RA” bias, as well as the league bias. Again, take these with a large grain of salt, considering the discussion above.

Best, 2004-2013:

Shawn Chacon -.18

Steve Trachsel -.18

Francisco Rodriguez -.18

Jose Mijares -.17

Scott Linebrink -.16

Roy Oswalt -.16

Dennys Reyes -.15

Dave Riske -.15

Ian Snell -.15

5 others tied for 10th.

Worst:

Derek Lowe .27

Luke Hochevar .20

Randy Johnson .19

Jeremy Bonderman .18

Blaine Boyer .18

Rich Hill .18

Jason Johnson .18

5 others tied for 8th place.

(None of these pitchers stand out to me one way or another. The “good” ones are not any you would expect, I don’t think.)

If anyone is out there (hello? helloooo?), as promised, here are the AL team expected winning percentages and their actual winning percentages, conglomerated over the last 5 years. In case you were waiting with bated breath, as I have been.

Combined results for all five years (AL 2009-2013), in order of the “best” teams to the “worst:”

Team, My WP, Vegas WP, Actual WP, Diff, My Starters, Actual Starters, My Batting, Actual Batting

NYA .546 .566 .585 .039 98 99 .30 .45
TEX .538 .546 .558 .020 102 95 .14 .24
OAK .498 .490 .517 .019 104 101 -.08 .07
LAA .508 .526 .522 .014 103 106 .07 .17
TBA .556 .544 .562 .006 100 102 .24 .17
BAL .460 .452 .463 .003 110 115 -.03 -.27
DET .548 .547 .550 .002 97 91 .21 .31
BOS .546 .596 .546 .000 99 98 .26 .36
CHW .489 .450 .488 -.001 99 97 -.16 -.29
TOR .479 .482 .478 -.001 106 107 -.05 .12
MIN .468 .469 .464 -.004 108 109 -.07 -.07
SEA .462 .464 .446 -.016 106 106 -.26 -.36
KCR .474 .460 .444 -.030 108 106 -.22 -.28
CLE .492 .469 .462 -.030 108 109 .13 .01
HOU .420 .420 .386 -.034 106 109 -.46 -.61

I find this chart quite interesting. As with the NL, it looks to me like the top over-performing teams are managed by stable, high-profile, peer- and player-respected guys – Girardi, Washington, Maddon, Scioscia, Leyland, Showalter.

Also, as with the NL teams, much of the differences between my model and the actual results are due to over-regression on my part, especially on offense. Keep in mind that I do include defense and base running in my model, so there may be some similar biases there.

Even after accounting for too much regression, some of the teams completely surprised me with respect to my model. Look at Oakland’s batting. I had them projected as a -.08 run per game team and somehow they managed to produce .07 rpg. That’s a huge miss over many players and many years. There has to be something going on there. Perhaps they know a lot more about their young hitters than we (I) do. That extra offense alone accounts for 16 points in WP, almost all of their 19 point over-performance. Even the A’s pitching outdid my projections.

Say what you will about the Yankees, but even though my undershooting their offense cost my model 16 points in WP, they still over-performed by a whopping 39 points, or 6.3 wins per season! I’m sure Rivera had a little to do with that even though my model includes him as closer. Then there’s the Yankee Mystique!

Again, even accounting for my too-aggressive regression, I completely missed the mark with the TOR, CLE, and BAL offense. Amazingly, while the Orioles pitched 5 points in FIP- worse than I projected and .24 runs per game worse on offense, they somehow managed to equal my projection.

Other notable anomalies are the Rangers’ and Tigers’ pitching. Those two starting staffs outdid me by seven and six points in FIP-, respectively, which is around 1/4 run in ERA – 18 points in WP. Texas did indeed win games at a 20 point clip better than I expected, but the Tigers, despite out-pitching my projections by 18 points in WP, AND outhitting me by another 11 points in WP, somehow managed to only win .3 games per season more than I expected. Must be that Leyland (anti-) magic!
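The conversions between points of winning percentage and wins per season used in the last few paragraphs are simple arithmetic over a 162-game schedule. A small sketch (the function name is made up):

```python
GAMES_PER_SEASON = 162


def wp_diff_to_wins(wp_diff, games=GAMES_PER_SEASON):
    """Convert a winning-percentage difference into wins per season."""
    return wp_diff * games


yankees = round(wp_diff_to_wins(0.039), 1)  # 39 points of over-performance
tigers = round(wp_diff_to_wins(0.002), 1)   # 2 points of over-performance
print(yankees, tigers)
```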

Ok, enough of the bad Posnanski and Woody Allen rants and back to some interesting baseball analysis – sort of. I’m not exactly sure what to make of this, but I think you might find it interesting, especially if you are a fan of a particular team, which I’m pretty sure most of you are.

I went back five years and compared every team’s performance in each and every game to what would be expected based on their lineup that day, their starting pitcher, an estimate of their reliever and pinch hitter usage for that game, as well as the same for their opponent. Basically, I created a win/loss model for every game over the last five years. I didn’t simulate the game as I have done in the past. Instead, I used a theoretical model to estimate mean runs scored for each team, given a real-time projection for all of the relevant players, as well as the run-scoring environment, based on the year, league, and ambient conditions, like the weather and park (among other things).
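To give a feel for how mean runs scored can be turned into a win expectation, here is a sketch using the well-known Pythagorean expectation. To be clear, this is a stand-in and not the theoretical model described above; the exponent of 1.83 is a commonly used value, the runs figures are hypothetical, and applying a season-level formula to a single game is itself a rough approximation.

```python
def pythagorean_wp(runs_for, runs_against, exponent=1.83):
    """Pythagorean expectation: estimated win probability from mean runs scored and allowed."""
    rf = runs_for ** exponent
    ra = runs_against ** exponent
    return rf / (rf + ra)


# Hypothetical matchup: one team expected to score 4.6 runs, its opponent 4.2
favorite = pythagorean_wp(4.6, 4.2)
underdog = pythagorean_wp(4.2, 4.6)
print(f"{favorite:.3f} {underdog:.3f}")
```

Averaging those per-game expectations over every game a team played gives the kind of composite expected WP reported in the tables.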

When I say “real-time” projections, they are actually not up-to-the game projections. They are running projections for the year, updated once per month. So, for the first month of every season, I am using pre-season projections, then for the second month, I am using pre-season projections updated to include the first month’s performance, etc.

For a “sanity check” I am also keeping track of a consensus expectation for each game, as reflected by the Las Vegas line, the closing line at Pinnacle Sports Book, one of the largest and most respected online sports books in the internet betosphere.

The results I will present are the combined numbers for all five years, 2009 to 2013. Basically, you will see something like, “The Royals had an expected 5-year winning% of .487 and this is how they actually performed – .457.” I will present two expected WP actually – one from my models and one from the Vegas line. They should be very similar. What is interesting of course is the amount that the actual WP varies from the expected WP for each team. You can make of those variations what you want. They could be due to random chance, bad expectations for whatever reasons, or poor execution by the teams for whatever reasons.

Keep in mind that the composite expectations for the entire 5-year period are based on the expectation of each and every game. And because those expectations are updated every month by my model and presumably every day by the Vegas model, they reflect the changing expected talent of the team as the season progresses. By that, I mean this: Rather than using a pre-season projection for every player and then applying that to the personnel used or presumed used (in the case of the relievers and pinch hitters) in every game that season, after the first 30 games, for example, those projections are updated and thus reflect to some extent, actual performance that season. For example, last year, pre-season, Roy Halladay might have been expected to have a 3.20 ERA or something like that. After he pitched horribly for a few weeks or months, and it was well-known that he was injured, his expected performance presumably changed in my model as well as in the Vegas model. Again, the Vegas model likely changes every day, whereas my model can only change after each month, or 5 times per season.

Here are the combined results for all five years (NL 2009-2013):

Team, My Model, Vegas, Actual, My Exp. Starting Pitching (RA9-), Actual Starting Pitching (FIP-), My Exp. Batting (marginal rpg), Actual Batting (marginal rpg)

ARI .496 .495 .486 103 103 0 -.08
ATL .530 .545 .564 100 97 .25 .21
CHC .488 .478 .446 103 102 -.09 -.17
CIN .522 .517 .536 104 108 .01 .12
COL .494 .500 .486 102 96 -.04 -.09
MIA .493 .472 .453 102 102 .01 -.05
LAD .524 .526 .542 96 99 .02 -.03
MLW .519 .509 .504 105 108 .13 .30
NYM .474 .470 .464 106 108 -.02 .01
PHI .516 .546 .554 96 98 -.01 .07
PIT .461 .454 .450 109 111 -.19 -.28
SDP .469 .463 .483 110 115 -.12 -.26
STL .532 .554 .558 100 98 .23 .40
SFG .506 .518 .515 98 102 -.19 -.30
WAS .497 .484 .486 103 103 .01 .07

If you are an American League fan, you’ll have to wait until Part II. This is a lot of work, guys!

By the way, if you think that the Vegas line is remarkably good, and much better than mine, it is at least partly an illusion. They get to “cheat,” and to some extent they do. I can do the same thing, but I don’t. I am not looking at the expected WP and result of each game and then doing some kind of RMS error to test the accuracy of my model and the Vegas “model” on a game-by-game basis. I am comparing the composite results of each model to the composite W/L results of each team, for the entire 5 years. That probably makes little sense, so here is an example which should explain what I mean by the oddsmakers being able to “cheat,” thus making their composite odds close to the actual odds for the entire 5-year period.

Let’s say that before the season starts Vegas thinks that the Nationals are a .430 team. And let’s say that after 3 months, they were a .550 team. Now, Vegas by all rights should have them as something like a .470 team for the rest of the season – numbers for illustration purposes only – and my model should too, assuming that I started off with .430 as well. And let’s say that the updated expected WP of .470 were perfect and that they went .470 for the second half. Vegas and I would have a composite expected WP of .450 for the season, .430 for the first half and .470 for the second half. The Nationals’ record would be .510 for the season, and both of our models would look pretty bad.

However, Vegas to some extent uses a team’s W/L record to date to set the lines, since that’s what the public does and since Vegas assumes that a team’s W/L record, even over a relatively short period of time, is somewhat indicative of their true talent, which it is of course. After the Nats go .550 for the first half, Vegas can set the second-half odds as .500 rather than .470, even if they think that .470 is truly the best estimate of their performance going forward.

Once they do that, their composite expected WP for the season will be (.430 + .500) / 2, or .465, rather than my .450. And even if the .470 were correct, and the Nationals go .470 for the second half, whose composite model is going to look better at the end of the season? Theirs will of course.

If Vegas wanted to look even better for the season, they could set the second-half lines to .550, on average. Even if that is completely wrong, and the team goes .470 over the second half, Vegas will look even better at the end of the season! They will be .490 for the season, I will be .450, and the Nats will have a final W/L percentage of .510. Vegas will miss by only .020 and I will miss by .060, even though we had the same “wrong” expectation for the first half of the season, and I was right on the money for the second half and they were completely and deliberately wrong. Quite the paradox, huh? So take those Vegas lines with a grain of salt as you compare them to my model and to the final composite records of the teams. I’m not saying that my model is necessarily better than the Vegas model, only that in order to fairly compare them, you would have to take them one game at a time, or always look at each team’s prospective results compared to the Vegas line or my model.
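Here is a minimal sketch of that arithmetic in Python, assuming the two halves contain the same number of games so that a season figure is just the average of the two halves. The function name `composite` and the scenario labels are mine, for illustration only.

```python
def composite(first_half, second_half):
    """Season-long percentage from two equal-length halves (an assumption)."""
    return (first_half + second_half) / 2


PRESEASON = 0.430           # what both models expected for the first half
FIRST_HALF_ACTUAL = 0.550   # what the Nationals actually did early on
SECOND_HALF_ACTUAL = 0.470  # what they actually did the rest of the way

# Second-half expectations under each scenario from the example.
second_half_lines = {
    "my model": 0.470,
    "Vegas, shaded to .500": 0.500,
    "Vegas, shaded to .550": 0.550,
}

actual_season = composite(FIRST_HALF_ACTUAL, SECOND_HALF_ACTUAL)

results = {}
for label, line in second_half_lines.items():
    expected_season = composite(PRESEASON, line)
    results[label] = {
        # How far off the composite, full-season expectation was.
        "season_miss": actual_season - expected_season,
        # How far off the second-half expectation was on its own.
        "second_half_miss": SECOND_HALF_ACTUAL - line,
    }
    print(
        f"{label}: season expected {expected_season:.3f} "
        f"vs actual {actual_season:.3f} "
        f"(miss {results[label]['season_miss']:+.3f}); "
        f"second-half miss {results[label]['second_half_miss']:+.3f}"
    )
```

The season-long miss shrinks as the second-half line is shaded toward the first-half record, while the second-half miss, which is the only prospective test in the example, gets worse. That is why a composite comparison can flatter the lines.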

Here is the same table as above, ordered by the difference between my expected W/L percentage and each team’s actual W/L percentage. The fifth column is that difference. Call those differences whatever you want: luck, team “efficiency,” good or bad managing, player development, team chemistry, etc. I hope you find these numbers as interesting as I do!

Combined results for all five years (NL 2009-2013), in order from the “best” teams to the “worst”:

| Team | My Model | Vegas | Actual | Difference | My Exp. Starting Pitching (RA9-) | Actual Starting Pitching (FIP-) | My Exp. Batting (marginal rpg) | Actual Batting (marginal rpg) |
|---|---|---|---|---|---|---|---|---|
| PHI | .516 | .546 | .554 | .038 | 96 | 98 | -.01 | .07 |
| ATL | .530 | .545 | .564 | .034 | 100 | 97 | .25 | .21 |
| STL | .532 | .554 | .558 | .026 | 100 | 98 | .23 | .40 |
| LAD | .524 | .526 | .542 | .018 | 96 | 99 | .02 | -.03 |
| SDP | .469 | .463 | .483 | .014 | 110 | 115 | -.12 | -.26 |
| CIN | .522 | .517 | .536 | .014 | 104 | 108 | .01 | .12 |
| SFG | .506 | .518 | .515 | .009 | 98 | 102 | -.19 | -.30 |
| COL | .494 | .500 | .486 | -.008 | 102 | 96 | -.04 | -.09 |
| NYM | .474 | .470 | .464 | -.010 | 106 | 108 | -.02 | .01 |
| PIT | .461 | .454 | .450 | -.010 | 109 | 111 | -.19 | -.28 |
| ARI | .496 | .495 | .486 | -.010 | 103 | 103 | 0 | -.08 |
| WAS | .497 | .484 | .486 | -.011 | 103 | 103 | .01 | .07 |
| MLW | .519 | .509 | .504 | -.015 | 105 | 108 | .13 | .30 |
| MIA | .493 | .472 | .453 | -.040 | 102 | 102 | .01 | -.05 |
| CHC | .488 | .478 | .446 | -.042 | 103 | 102 | -.09 | -.17 |

As you can see from either table, my model appears to over-regress both batting and starting pitching, especially the former.
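One rough way to put a number on that impression is to compare how spread out the actual values are relative to the expected ones across the 15 teams. The sketch below uses the four component columns from the table above. The "spread ratio" (standard deviation of actual divided by standard deviation of expected) is my own choice of summary for illustration, not something the model itself computes, and the pitching comparison is only approximate because the expected column is RA9- while the actual column is FIP-.

```python
from statistics import pstdev

# (team, exp. pitching RA9-, actual pitching FIP-,
#  exp. batting marginal rpg, actual batting marginal rpg)
ROWS = [
    ("PHI", 96, 98, -0.01, 0.07),
    ("ATL", 100, 97, 0.25, 0.21),
    ("STL", 100, 98, 0.23, 0.40),
    ("LAD", 96, 99, 0.02, -0.03),
    ("SDP", 110, 115, -0.12, -0.26),
    ("CIN", 104, 108, 0.01, 0.12),
    ("SFG", 98, 102, -0.19, -0.30),
    ("COL", 102, 96, -0.04, -0.09),
    ("NYM", 106, 108, -0.02, 0.01),
    ("PIT", 109, 111, -0.19, -0.28),
    ("ARI", 103, 103, 0.00, -0.08),
    ("WAS", 103, 103, 0.01, 0.07),
    ("MLW", 105, 108, 0.13, 0.30),
    ("MIA", 102, 102, 0.01, -0.05),
    ("CHC", 103, 102, -0.09, -0.17),
]


def spread_ratio(expected, actual):
    """Ratio of actual spread to expected spread; above 1 hints at over-regression."""
    return pstdev(actual) / pstdev(expected)


pitching_ratio = spread_ratio([r[1] for r in ROWS], [r[2] for r in ROWS])
batting_ratio = spread_ratio([r[3] for r in ROWS], [r[4] for r in ROWS])

print(f"starting pitching spread ratio: {pitching_ratio:.2f}")
print(f"batting spread ratio:           {batting_ratio:.2f}")
```

A ratio above 1 means the teams ended up further from average than the expectations had them, which is what over-regression would look like. With only 15 teams and five seasons, treat the ratios as suggestive at best.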

Also, a quick and random observation from the above table, which may mean absolutely nothing: it seems as though those top teams, most of them at least, have had notable, long-term “players’ managers,” like Manuel, La Russa, Mattingly, Torre, Black, Bochy, and Baker, while you might not be able to even recall or name most of the managers of the teams at the bottom. It will be interesting to see if the American League teams evince a similar pattern.