These are the PA levels at which a player's performance can tell us about his skill going forward:

* 50 PA - swing percentage
* 100 PA - contact rate
* 150 PA - K rate, line drive rate, pitches/PA
* 200 PA - BB rate, grounder rate, GB/FB ratio
* 250 PA - flyball rate
* 300 PA - HR rate, HR/FB
* 350 PA - sensitivity
* 400 PA - none
* 450 PA - none
* 500 PA - OBP, SLG, OPS, 1B rate, popup rate
* 550 PA - ISO
* 600 PA - none
* 650 PA - none

1. So after 100 PA (roughly a month, if a player is starting nearly every day), I can tell you about how much a batter likes to swing and how good he is at making contact.
2. At 150 PA I can tell you whether [the batter] likes to hit line drives (and line drives are good…).
3. At 150 PA, I can also start telling whether [the batter] likes to work the count and whether he's a strikeout king.
4. By 250 PA, I can tell a lot about his walking tendencies and whether he's going to be a ground ball hitter or a flyball hitter.
5. At 300 PA, I finally find out whether or not the player likes to hit the ball out of the park every once in a while.
6. Finally, a lot of the usual one-number stats (OBP, SLG, OPS) don't stabilize until 500 PA, which is also when we learn whether a player is a singles hitter.

# Let's take a look at what Pizza Cutter found out about pitcher sample sizes:

* 50 BF - nothing
* 100 BF - nothing
* 150 BF - K/PA, grounder rate, line drive rate
* 200 BF - flyball rate, GB/FB
* 250 BF - nothing
* 300 BF - nothing
* 350 BF - nothing
* 400 BF - nothing
* 450 BF - nothing
* 500 BF - K/BB, pop up rate
* 550 BF - BB/PA
* 600 BF - nothing
* 650 BF - nothing
* 700 BF - nothing
* 750 BF - nothing

You can't tell a lot about a pitcher by looking at his stats over a single season. You can get a pretty good idea of how often he walks and strikes batters out, and what type of batted balls he gives up generally… but that's about it.
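The batter cutoffs above lend themselves to a simple lookup: given a batter's PA count, which stats have crossed the .70 reliability threshold? A minimal sketch (the table values are transcribed from the list above; the dictionary and function names are my own, not from the study):

```python
# Batter stat stabilization points (PA) transcribed from the digest list above.
# "Stabilized" = split-half reliability reached .70 at this many PA.
BATTER_STABILIZATION_PA = {
    "swing%": 50,
    "contact%": 100,
    "K rate": 150, "LD rate": 150, "pitches/PA": 150,
    "BB rate": 200, "GB rate": 200, "GB/FB": 200,
    "FB rate": 250,
    "HR rate": 300, "HR/FB": 300,
    "sensitivity": 350,
    "OBP": 500, "SLG": 500, "OPS": 500, "1B rate": 500, "popup rate": 500,
    "ISO": 550,
}

def reliable_stats(pa: int) -> list[str]:
    """Return the stats we can reasonably trust for a batter with `pa` PA."""
    return sorted(s for s, cutoff in BATTER_STABILIZATION_PA.items() if pa >= cutoff)
```

For example, a batter one month into the season (around 100 PA) only has a trustworthy swing percentage and contact rate; his OBP and SLG aren't meaningful until around 500 PA.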
http://mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

On the reliability of pitching stats
By Pizza Cutter | January 6th, 2008

Executive Summary

Any study on the reliability of pitching stats is by default a paper on DIPS. When Voros McCracken wrote the original DIPS paper, it looked at the simple fact that while the correlations for strikeout rate, walk rate, HR rate, and HBP rate were fairly consistent from year to year, the correlation for BABIP (batting average on balls in play) was not. Voros, in technical terms, was measuring test-retest reliability. In the few years since his discovery, several folks have picked apart his findings and his basic premise that a pitcher has great control (as opposed to luck) over things that happen without the involvement of the defense (hence "Defense Independent Pitching Statistics"), but little control over what happens when the ball is actually in play. Despite the large amount of work spent on attempting to disprove the theory, it has generally stood up to the tests thrown at it.

A little while ago, I looked at the reliability of batting statistics using split-half reliability. Basically, I took each plate appearance, coded it as even-numbered or odd-numbered, and separated the two groups accordingly. I looked at stats like OBP in a player's even-numbered at-bats and his odd-numbered at-bats. If OBP is reliable, then OBP in even-numbered plate appearances and OBP in odd-numbered plate appearances will correlate well with each other. I also looked to see when a stat became reliable enough to be useful in analyses, using a standard cutoff of a split-half reliability of .70. The more plate appearances (or for pitchers, batters faced) a player accumulates, the better idea we have of his true talent level, and the more reliable (reproducible) a stat will be over the same time frame in the future.
If a batter hit 10 HR in 400 PA, and HR/PA was very reliable, we'd guess that he's going to hit 10 HR or so in his next 400 PA. If it's not reliable, the fact that he hit 10 HR in his last 400 PA means nothing in terms of predicting his next 400 PA. Now I turn my attention to pitching stats. I've hidden the numerical spaghetti behind the cut, and if you want to read it, it's all there for you. I've used a method similar to my previous article, in that I look at the issue in three ways.

* First, I looked to answer the question of what minimum number of PA or BF should be used in research studies. That is, suppose I want to do work on what strikeout rates predict. Usually, I would say something like "all pitchers with a minimum of X batters faced" so that the "cup of coffee" call-up guys won't contaminate the sample. What should that minimum number be? To answer this, I found the number of BF where the split-half correlation for the sample composed of that minimum number was at least .70.
* The second method is a look at how long until a stat becomes meaningful for a particular player. If Tuffy Rhodes hits 3 HR on Opening Day, we know that's not a big enough sample size to tell us any meaningful information about him. However, after a few hundred PA's, we can probably make some pretty good conclusions about his ability to hit for power. I went in 50 PA intervals to test this. For example, to test whether a stat was reliable at 50 PA's, I took a player's first 100 PA's in the database (two 50-PA samples) and calculated whatever stats were of interest from the even-numbered PA's and the odd-numbered ones. The number where the stat crossed a split-half reliability of .70 was where it became officially certified as appropriately reliable.
* Finally, I looked at what the split-half correlations were for the pitching stats under observation at 300 BF and 750 BF. This gives us an idea of how reliable stats are for a starter and a reliever, using some rough cutoffs.
Again, all of the data in all its glory is below the cut, but the main findings are:

* Strikeouts are the one outcome over which pitchers seem to have the most control. Walks are slightly less reliable, but still worthy of mention as a reliable, skill-based outcome. This checks out with previous DIPS work, including my own.
* Pitchers are astonishingly reliable in what sort of balls come off the bat when they pitch. At 750 batters faced, the split-half reliabilities for line drives and grounders were above .90. So to say that once the ball is hit, the pitcher has no control over what happens, is false. The pitcher seems to have a good amount of stability in inducing different types of batted balls. There are going to be ground ball pitchers and fly ball pitchers, and that isn't the product of random chance. Where that ball in play lands, either in someone's glove or on the grass for a hit, doesn't appear to be as reliable. I've previously shown that pitchers' results on fly balls are more consistent (at least as a matter of degree… the reliability numbers themselves aren't overwhelming) than their results on ground balls. Still, overall BABIP is largely unstable, suggesting that there is little (although not nil) skill involved on the pitcher's side.
* Contrary to original DIPS theory, home run rate isn't very stable. In fact, a ball-in-play stat, singles/PA, is more reliable than HR/PA. This could be something that has to do with the pitchers, or perhaps it has something to do with my methodology. Split-half controls for the four gentlemen standing directly behind the pitcher on the infield, so it may be that defense and pitching are once again entangled. Still, HR/PA reliability stats are fairly low. Even with a good sample size (750 BF), the split-half correlations were only .34 or so. It seems like a full season isn't a good measure of a pitcher's HR/PA ability.
* HR/FB was very, very unstable for pitchers. For batters, HR/FB stabilized pretty quickly.
This suggests that the pitcher may be the one who gives up the fly ball, but the batter is the one who makes it leave the yard. So if your favorite pitcher gave up a lot of HR/FB last year, fear not: chances are he'll be better next year.

* Relievers are hard to project because at the small sample sizes that relievers have in terms of batters faced, the stats used to describe pitchers are largely unreliable. This means that regression to the mean will take its toll on a reliever very quickly. Relievers who rely mostly on the strikeout are less likely to have this trouble.
* And while I'm in the neighborhood, there's a post at Lookout Landing rating pitchers in a way very much consistent with what I've found here. Worth a read.

Methodology and Results

Some methodological notes: Again, I used Retrosheet files for 2001-2006 in two-year windows. (I lumped 2001-2002 together, 2003-2004, and 2005-2006.) It's not ideal, because for some folks these plate appearances under study occurred a year and a half apart, but that's the only way to get a large enough sample of one man's work to draw any kind of conclusions. At least it's better that they're consecutive years. It does bring up the issue, especially with pitchers, that there is some selective sampling going on: consistent pitchers (and consistently good pitchers, especially) tend to get more playing time. Alas, baseball is a wonderful data set from a methodological point of view, but it's not a perfect one.

Again, I realize that .70 is an arbitrary cutoff. I've laid out my reasons for using it before and I stick by them. (Short version: .70 means that you have an r-squared of .49. Anything north of that means that the majority of the variance is consistent within a player.) Also, there's one minor annoyance: I had to number the events the way that Retrosheet does events. So non-pitching events, such as stolen base attempts and passed balls, were counted as events.
(As were pitching events that don't result in the end of a plate appearance, such as balks or wild pitches.) So when I say 100 batters faced, it's probably not 100 full batters, but actually something like 95 batters plus 5 other events (balks, WP, PB, SB, CS, etc.). It's annoying enough that I should mention it, but probably not a big enough deal to affect the major conclusions of the study.

Then there's another issue: which pitching stats to study. Several of the usual stats used to evaluate pitchers are game-level stats. Wins and saves are generally the yardsticks by which we measure pitchers, but they are a poor gauge of what a pitcher actually did that day. To say that C.C. Sabathia picked up a win might mean that he barely survived five innings, gave up 7 runs, and got bailed out by some run support and the bullpen. It might mean that he threw a two-hit shutout. They're rather imprecise. ERA is also a puzzler in this methodology, because a pitcher can give up an earned run (or an unearned run) despite the fact that he wasn't even in the game at the time (and I never liked ERA or the concept of "earned runs" to begin with). I stuck to looking at various rate stats (K rate, walk rate), some one-number stats (AVG, OBP), and the batted ball profile.

Part I: Setting sampling minimum cutoffs for research

A few of the "How often does he…" stats:

* K/PA - 50 BF; K/9 - 60 BF
* BB/PA - 250 BF; BB/9 - 300 BF
* K/BB ratio - 250 BF
* HR/PA and HR/9 - never did (at 750 BF, they were at .32 and .34, respectively)
* 1B/PA and 1B/9 - never did (at 750 BF, .57 and .50, respectively)
* (2B+3B)/PA and per 9 - never did (.33 and .36)
* HBP - never did (.53 and .54)
* WP - never did (.54 for both)

It looks like stats measured per PA, rather than per 9 innings, stabilize a bit more quickly, but it also looks like outside of walks and strikeouts, there is little consistency in a sample on issues of balls in play.
The surprising finding was that in the 750 BF or more sample, singles were much more consistent than home runs.

Some one-number stats:

* I'll give you the short version: I looked at AVG, OBP, SLG, OPS, and BABIP. None of them reached the magic cutoff of .70.
* Interestingly enough, good old batting average against was the most reliable stat in the 750 BF sample (split-half r = .569), with OBP and SLG at .52 and .49. OPS was at .49. BABIP was at .238, which is certainly more than zero, but certainly nothing to write home about.

A spin through the batted ball profile:

* Ground balls / balls in play - less than 50 BF
* Line drives - less than 50 BF
* Fly balls - less than 50 BF
* Pop ups - 325 BF
* HR/FB - never made it to .70; at 750 BF, it had a split-half of .208

Again, the numbers above are for researchers looking to set a cutoff of "X number of batters faced or above" for their studies.

Part II: Evaluating individual players. When does a stat become meaningful for an individual pitcher?

Again, I used 50 BF intervals from 50 to 750. I'll present the cutoffs and which stats hit the magic .70 mark at each one. I also only calculated stats per PA, rather than per 9 innings, since those seemed to be the more reliable stats, if only by a bit.

* 50 BF - nothing
* 100 BF - nothing
* 150 BF - K/PA, grounder rate, line drive rate
* 200 BF - flyball rate, GB/FB
* 250 BF - nothing
* 300 BF - nothing
* 350 BF - nothing
* 400 BF - nothing
* 450 BF - nothing
* 500 BF - K/BB, pop up rate
* 550 BF - BB/PA
* 600 BF - nothing
* 650 BF - nothing
* 700 BF - nothing
* 750 BF - nothing

You can't tell a lot about a pitcher by looking at his stats over a single season. You can get a pretty good idea of how often he walks and strikes batters out, and what type of batted balls he gives up generally… but that's about it.

Part III: How reliable is that stat?
Using the same methodology as part two, I present split-half reliability numbers at two cutoffs: 300 batters faced and 750 batters faced. At 300 batters faced, here are the split-half reliability numbers, in order from most reliable to least reliable.

Rate stats:

1. K/PA - .821
2. BB/PA - .597
3. K/BB - .575
4. 1B/PA - .340
5. HR/PA - .262
6. (2B+3B)/PA - .216

One-number stats:

1. OBP - .430
2. OPS - .386
3. AVG - .379
4. SLG - .364
5. BABIP - .135

Batted ball profile:

1. Line drives/BIP - .861
2. Ground balls/BIP - .816
3. GB/FB - .788
4. Fly balls/BIP - .779
5. Pop-ups/BIP - .586
6. HR/FB - .145

And at 750 batters faced, same idea:

Rate stats:

1. K/PA - .873
2. K/BB - .806
3. BB/PA - .789
4. 1B/PA - .525
5. HR/PA - .323
6. (2B+3B)/PA - .237

One-number stats:

1. AVG - .527
2. OBP - .522
3. OPS - .459
4. SLG - .455
5. BABIP - .188

Batted ball stats:

1. Line drives - .936
2. Ground balls - .905
3. Fly balls - .862
4. GB/FB - .852
5. Pop ups - .764
6. HR/FB - .207

This entry was posted on Sunday, January 6th, 2008 at 8:38 pm and is filed under Voros McCracken, DIPS, pitching.

12 Responses to "On the reliability of pitching stats"

1. Phil Birnbaum says: January 6th, 2008 at 9:15 pm

Might the /PA stats have their reliability inflated because of strikeouts and walks? That is, suppose that HR/PA were completely random, except that some pitchers give up fewer because, with high SO and BB, the batters don't get much chance to hit the ball. Wouldn't you get a higher (perhaps significant) correlation even if HR/BIP had zero reliability?

2. tangotiger says: January 7th, 2008 at 10:33 am

Pitchers that give up a lot of HR do not exist in MLB. Correlation is not a measure of skill, but a measure of variance. If the variance of the true-skill HR rate is low, correlation will be lower than in a "general population".
Lots of pitchers give up lots of singles; they can still be successful by doing other things. In short, if you take the top 5000 pitchers in the world as your general population, and get to select 500 of them to play MLB, the singles rate of the MLB population will be a lot closer to the general population's than the HR rate.

***

In Retro, there is a "batter event flag" (somewhere around field 50 or so… don't remember exactly). Select when that = 'T', and that gets rid of your running events.

***

For your part 3, can you show what the mean PA is for the 300 PA and 750 PA cutoffs?

***

Great job!

3. Ender says: January 7th, 2008 at 11:45 am

Could you run the numbers for ERA? After all, most non-stats people live and die by ERA as a pitcher stat. I know it is not going to show good results, but it sure would help to be able to point to this article and show people just why ERA is not reliable over a single season.

4. Eric J. Seidman says: January 7th, 2008 at 11:46 am

Ender, if you want to see some insight into why ERA is not a reliable stat over a season, look to my article on this blog from a few weeks ago, titled "Reevaluating ERA".

5. BJ says: January 7th, 2008 at 4:26 pm

Very interesting stuff. I guess people worrying about Santana b/c of his HR increase can shut up.

6. dan says: January 7th, 2008 at 7:55 pm

If HR/9 or HR/PA fluctuate wildly, that seems to further the idea that pitchers don't control how many of their fly balls go for home runs (or at least that it's mostly due to park effects).

7. Jeff Kallman says: January 7th, 2008 at 10:51 pm

BJ - I wonder what the people to whom you refer would have said about the only pitcher in the 500 home run club - as the victim, that is. It didn't exactly keep Robin Roberts out of the Hall of Fame to have surrendered one lifetime bomb more than Eddie Murray hit. Don't see why everyone gets on mah hitting. I go the other way.
I've thrown some of the longest balls in baseball history. - Weak-hitting, homer-prone Brooklyn Dodger lefthander Preacher Roe. - Jeff

8. Pizza Cutter says: January 7th, 2008 at 11:59 pm

A few answers, now that I'm finally back in one city (for now…).

Phil - I have no doubt that the per-PA stats are inflated a bit in their reliability in response to the excessive reliability of the K and BB rates.

Tango - pitchers who give up a lot of HR don't exist… for very long. As to part 3, the mean is exactly 300/750 BF on those correlation numbers. In those cases, I artificially clipped everyone down to 300/750 (by taking their first 300/750), so long as they had 300 to give. (So someone who had only 295 was politely excused from the sample.)

Ender - here's the problem with ERA in this framework. To start off a season, a pitcher gives up a single (odd BF). In the next PA (even), he gives up a double and the runner scores. Then he strikes out the next batter (odd), but the next guy singles in the guy on second (even). So in the even-numbered plate appearances, two earned runs score, though he hasn't recorded an out. In the odd-numbered ones, he's thrown 1/3 IP with no runs. But wait: one of the runs that scored got on base during an odd-numbered PA. Earned runs are (home runs excepted) the stringing together of a couple of plate appearances. Even if I can get over my dislike for the "earned run", I can't figure out a way to appropriately partial things out. I suppose the only way I could do it would be even-numbered innings vs. odd-numbered innings pitched, but that only works for pitchers who complete the inning.

9. tangotiger says: January 8th, 2008 at 10:35 am

Pizza, ok great. At 750 "PA", you've got K/PA with an r = .873. Using the equation r = PA/(PA+x):

.873 = 750/(750+x)

we get x = 109. So our general equation for the correlation of K/PA is:

r = PA/(PA+109)

So if you have 300 PA, we can estimate the likely r. Using the above equation, we get r = .73. Your sample data shows .82.
I'm not happy with this difference.

***

Let's continue with BB/PA. Using 750 PA, the equation is r = PA/(PA+201). So at 300 PA, we'd expect r = .60. Your sample shows r = .60. Bingo!

***

We CANNOT do K/BB. That is a ratio of two independent events. Unless you take the log, you need to reform it as K/(K+BB), and then run the regression. You need to create a rate stat if you want to apply linear regression. Otherwise, why not do BB/K? You'll actually get different results.

***

At 750 PA, the equation for HR/PA becomes r = PA/(PA+1572). If you have 300 PA, we expect r = .16. Your sample shows r = .26. Again, not happy here.

***

At 750 PA, the equation for 1B/PA becomes r = PA/(PA+679). If you have 300 PA, we expect r = .31. Your sample shows r = .34. Pretty close.

***

At 750 PA, the equation for XBH/PA becomes r = PA/(PA+2415). At 300 PA, r = .11. Your sample shows r = .22. This is very inconsistent: your r is very close at both PA = 300 and PA = 750. That makes little sense, and you have some sort of bias in the data here, be it park or whatnot.

***

Here is how the batted ball info looks:

Event          r @ 750 PA   "x"     expected r @ 300 PA   sample r @ 300 PA   result
Line drives    .936         51      .85                   .86                 Bingo!
Ground balls   .905         79      .79                   .82                 Pretty close
Fly balls      .862         120     .71                   .78                 Eh, not bad
Pop ups        .764         232     .56                   .59                 Pretty close
HR/FB          .207         2,873   .09                   .15                 a bit off

I find that presenting the general "r" equation as I am doing provides what you need for *any* level of PA.

***

Here's another way to think about BB/PA. I took all the players with at least 2000 BFP from 2001-2006. That's 158 pitchers. I figured the z-score for each pitcher, from Brad Radke's -11 standard deviations to Ishii's +12 SDs. The standard deviation of all those z-scores was 4.408.
The average PA was 3596. The r (which is likely the same intraclass correlation that Pizza is talking about) is r = 1 - (1/4.408)^2 = .9485. Plugging this into r = PA/(PA+x), we get:

.9485 = 3596/(3596+x)

Solving gives x = 195. So our BB/PA equation is:

r = PA/(PA+195)

At PA = 300, we'd expect r = .606; Pizza's sample says r = .597. At PA = 750, we'd expect r = .794; Pizza's sample says r = .789. That's a huge bingo! The advantage here is that it's a snap to do in Excel. Plus, you get an actual regression equation based on PA (or whatever your denominator is).

10. tangotiger says: January 8th, 2008 at 10:39 am

Pizza, looks like my post was too long. I've posted on my blog (click on my name).

11. 5 Pitching Statistics You Can't Afford to Ignore Anymore says: January 20th, 2008 at 7:20 am

[…] Singles allowed per 9 innings is a skill-based outcome according to a study done by the fine sabermetricians over at MVN. Their study concludes that pitchers have very little control over fly balls that turn into HR's (which directly goes against Baseball HQ and many other fantasy baseball sites who use DIPS to evaluate pitchers). I would highly advise you to read the study here. […]

-------------------------------------------------------------------

http://mvn.com/mlb-stats/2007/11/14/525600-minutes-how-do-you-measure-a-player-in-a-year/

What does a year really tell you about a player? Seriously. If I gave you the seasonal stats for any player last year (or the year before), how much could you really tell me about him? If I told you he hit .300 last year, are you confident that deep down, he's really a .300 hitter? How do you measure a year in the life?

Like a lot of things that happen out here in the Sabersphere, I take my inspiration for this (series of?) article(s?) from a conversation that went on at the Inside the Book blog. A few folks were discussing an article that I wrote here at StatSpeak on productive outs, and as these things are wont to do, the conversation wandered.
Inside the Book co-author MGL asked me a fair question: when I talked about productive outs, what sample size was I dealing with? Not so much how many player-years were in my data set, but for each of those player-years, how many PA's did each player have. It's a much more important question than you might think.

If you've been reading my work for a while, you know that I often say things like "minimum of 100 PA." (I'm hardly the only one to do this, by the way.) Why did I make sure that the batter had 100 PA? Well, first off, let's say that I'm interested in rating batters by how often they strike out, and I happen to come across a player who got five at-bats in a season and never struck out. I hereby crown him the king of all contact hitters! He will never ever ever strike out ever. Right? Of course not. 5 PA isn't a big enough sample size to measure anything. But what is? When I say minimum 100 PA, I must admit I'm usually using a very unscientific "yeah, that sounds about right" criterion for picking the number. What if 100 PA isn't a big enough sample for what I'm trying to measure either? I'm a scientist by training (my cancer-biologist wife laughs at me when I say that), and I should be a little more… scientific.

(Major and extensive numerical nerdiness alert. As if the reference to Rent wasn't nerdy enough. This is a really long methodological article for the hardcore researchers out there. If you're here for witty banter about statistical matters in baseball, may I suggest you pick another article.)

What we're talking about here is a concept known in social science research as measure reliability. It's the idea that if I took the same measure over and over again, I'd get (roughly) the same answer each time. This shouldn't be confused with measure validity, which is whether or not the measure I'm using is actually measuring what I think it does.
I might ask 25 people to tell me what color the sky is, and they might all say "green with orange polka dots." The measure is very reliable, but not very valid. In statistics, the way to increase the reliability of a measure is to have more observations in the data set. If I took a player's on-base percentage for his first five at-bats in a season, and then his next five, and then his next five, and so on, those numbers are going to fluctuate all over the place. But if I do it in 200 at-bat sequences, the numbers will be more stable. I'll hopefully get (roughly) the same number each time I take a sample of 200 at-bats. The question I ask is: when does that number become stable enough that we say it's OK to make inferences about a group of players? In social science, we look for a magic number, which is .70.

For example, one way of estimating measure stability and reliability is to look at things from one time point to another; in the case of baseball, from one season to another. DIPS theory sprang from this type of question. Strikeout rate is stable from year to year, suggesting that a pitcher's yearly strikeout rate is a coherent, stable measurement that tells you something about the player himself rather than his circumstances. A pitcher's BABIP, not so much, as it's not stable from one year to the next. This type of question lends itself well to stats like year-to-year correlation and my favorite, intra-class correlation, and usually the gold standard for reliability is .70.

Here's the thing about year-to-year correlation. Let's say that baseball seasons lasted one plate appearance per season. Nothing's going to correlate year to year, because one plate appearance doesn't tell you much of anything. But now let's pretend that baseball seasons lasted for billions and billions of plate appearances. If I watched a player for that long, I'd have a really good idea of what his "true" ability is.
And if I got another billion PA's, I'd probably get the exact same number the next time around. Over a bunch of players, with each one measured a billion times in each year, my correlation would probably hit 1.0 (a perfect correlation), because I would have a perfect measurement of all players at two different time points. But in a season we get 600-700 PA for regulars, and less for bench/platoon/fringe/injured players. Is that enough to do real research? It turns out that the answer is "depends on what stat you want to measure." How we get to that answer is another matter, but we will get there.

If I say that 100 PA is enough of a sample to get adequate reliability on a measure (let's say batting average), then I should be able to take 100 PA and calculate the batting average from those, and then another 100 PA with which to calculate AVG. I could do this for some group of players and see how well their AVGs in the first group of 100 PA correlate with their second group.

Why am I obsessed with .70? Because a correlation of .70 means an R-squared of 49%. Anything north of .70 means that a majority of the variance (> 50%) is stable. Higher correlations mean more stability, which is always better, but .70 is usually "good enough for government work."

Now, where can we get those samples of 100 or 300 or 500 PA? Well, first off, we'll need two samples of 100 or 300 or 500 to compare against each other. So if I want to see whether a stat is stable at 300 PA, I could take a player's first 300 PA of the season (pre-All-Star break?) and compare it against his next 300 (second half?). There's a problem in there. In the second half, he might be more tired, or he might be better in the late summer, or perhaps he played the second half with an injury. In the aggregate that probably all shakes out, but perhaps all players tire out midway through the season in some systematic, predictable way.
Perhaps I could look year to year, but I'd have some of the same sorts of issues. In the second year, the player is a year older and wiser, and that will affect him in a number of ways, good (smarter) and bad (physical decline?). There's another method which sidesteps a lot of these issues: split-half reliability. Here's what I did. I took each player's plate appearances and numbered them sequentially, from his first to his last. Then I split them up into even-numbered and odd-numbered appearances. In this way, I could split a season of 600 PA into two 300-PA samples, and there would be plate appearances from just about all games played in both samples. This seems a much fairer way (if more cumbersome) of splitting things up.

There's one other problem, though. A year usually lasts 600-700 PA's for regulars. Within a year, I don't get a second sample of 600-700 more PA's to use as a comparison. That means that the top number I can check as the "is it consistent enough" number is going to be around 350 or so, and then I'm only dealing with people who are good enough (and perhaps consistent enough) to play every day. I originally ran one-year-only samples, but found that some stats weren't reliable at 350 PA. So I took consecutive two-year windows (2001-2002, 2003-2004, 2005-2006), and used split-half reliability within those two-year windows. So for each player, I took his even-numbered PA's and compared them against the odd-numbered PA's, pulling plate appearances into both groups from both years. It's not perfect, but at least it balances things out a bit. As always, data were kindly provided by Retrosheet. I love Retrosheet.

I calculated a lot of the usual stats we like to use in baseball, for the time being focusing on batters, and checked to see where they started meeting the "at this minimum of PA's, the correlation coefficient between the even and odd PA's is at least .70" criterion at varying levels of PA's.
You can interpret them this way: when I raised the minimum inclusion criterion to include all players with a minimum of ___ PA, the stat in question was reliable enough to actually say something about the sample of players; that is, it had a split-half reliability over .70. When future researchers conduct studies on groups of players (more on why that phrase is important in a minute) using these statistics, these are the minima I recommend for inclusion in any sort of data set. (Whether anyone cares what I recommend is another issue.)

In the previous paragraph, a very important distinction must be made. The minima listed below do not mean that the statistic in question stabilizes at ___ PA for an individual player, but that it stabilizes in a sample which includes all players with ___ PA and above. Since this is the way that we usually do research, it seems to be the best way to begin. Whether a statistic is reliable in a sample of players that had exactly two samples of ___ PA to compare against each other is another study. The problem is that if I say 100 PA or more, I'm looking both at those with two 100-PA samples and at those with two 600-PA samples. The 600-PA samples will be much more stable and make things look more stable than they are. Only by restricting the range a wee bit can we answer the other player-evaluation question of "How many PA's until I have a good idea of Jones's real abilities?" But I'm not doing that yet. Right now, I'm looking for minimum inclusion criteria. Numbers are rounded a little bit to make them a little more appealing to the eye.

Some of the one-number stats for hitters stabilized at:

* AVG - never did; at 650 PA, it had only reached a split-half correlation of .668
* BABIP - never did;
at 650 PA, it had only reached a split-half correlation of .631
* OBP - 350 PA
* SLG - 350 PA
* ISO - 350 PA
* OPS - 350 PA

Even full-season batting averages for regular players aren't fully reliable stats, at least according to my definition. Add that to the list of reasons why it's silly to give out a batting title to the highest batting average in the league (apologies to Magglio Ordonez). However, OBP and SLG (and their derivative combinations) are stable for a sample including part-time players and regulars.

A few "how often does he…" stats stabilized at:

* 1B rate - 375 PA
* 2B+3B rate - never did; at 650 PA, it had only reached a split-half correlation of .411
* HR rate - 100 PA
* K rate - under 40 PA
* BB rate - under 40 PA

Talk about three true outcomes! Even when I got down to 40 PA's, walks and strikeouts remained very stable.

Batted-ball stats:

* GB rate - under 40 PA
* LD rate - under 40 PA
* FB rate - 175 PA
* IFFB rate - 350 PA
* GB/FB - 100 PA
* HR/FB - 100 PA

Hmmmm… ground balls and line drives remain pretty stable even when the sample size is ridiculously low, but you need half a season or so before the pop-ups stabilize.

A few advanced stats:

* WPA - never did; at 650 PA, it had a split-half reliability of .401
* Context-neutral wins (sum (WPA/LI)) - never did; at 650 PA, it was at .588
* Clutch (sum WPA - sum (WPA/LI)) - never did; .021 at 650 PA
* RBI above league-average expectation (based on average RBI in the base/out states faced by the batter, minus actual RBI) - 650 PA

Oh really? WPA-based stats didn't fare well in these analyses. Context-neutral WPA was more reliable than straight-up WPA, although neither made it to the magic .70 cutoff. That clutch wasn't reliable shouldn't come as a surprise to anyone. Even RBI expectation just barely made it to "reliable," and it's only reliable in a sample that takes into account full-timers. This calls into serious question whether things like WPA and WPA/LI can be used in analyses.
The following is a very important distinction. If we want to say that A-Rod added 7.51 wins of WPA to the Yankees this past year, that's a statement of fact and it is what it is. However, if we were to hit reset and replay the entire 2007 season from scratch, the first go-around's league-wide WPA numbers wouldn't be a very good guide to what would happen in the replay. So running analyses with WPA-related variables as predictors or as outcomes can get into some shaky statistical ground. (For the statistically initiated, this is the difference between descriptive and inferential analyses.)

UPDATE: Upon further review, it looks like I had some problems with my calculations on this one. I'm looking into it. All of the WPA- and LI-based stats are currently under investigation. The rest of this article still stands.

Finally, some swing diagnostic metrics:

* Swing % (swings/pitches) - under 40 PA
* Contact % (ball in play + foul / swings) - under 40 PA
* Sensitivity and Response Bias (one of my homemade stats) - under 40 PA
* Pitches/PA - under 40 PA

Well now, players have a great deal of control over whether or not they swing, so it makes sense that they are the same whenever and wherever they go. I really could do this with any batting stat out there. Notable by their omission from this study are stats like runs and stolen bases, which generally don't happen on an at-bat basis because they are baserunning stats. (A home run scores the batter, but most often players score runs by having someone else knock them home in a separate plate appearance.)

Now, when do the stats I've looked at become stable for individual players? That is, at what point in the season (measured by PA) do stats go from being garbage to being meaningful and actually describing something about the player? This is actually a little harder to do, but it's just an engineering problem. I returned to the same database and used the same split-half paradigm on consecutive two-year windows.
This time, however, when looking at a certain number of PA's (say 200), I took the first two samples of that many PA's that I could find (so, the first 400 PA in the two-year period, split into even- and odd-numbered PA's), with the rest of a player's PA's tossed away. This artificially gives everyone the same number of PA's within each analysis, so long as they actually logged twice as many as the target number in the two seasons in question (so guys who had only 10 PA were politely excluded from the analyses that required more than 200 PA). Then I ran a correlation between the results from the even and odd PA's to see if the correlation got to .70. I played around with the number of plate appearances until the correlation either hit .70 or I maxed out on the number of plate appearances available (650 was the upper limit). I ran the analyses in 50 PA increments.

The following are the stats that were stable enough (correlation > .70) at each plateau to be considered reliable. Again, these are the PA levels at which each stat can be considered to be saying something about an individual player.

* 50 PA - swing percentage
* 100 PA - contact rate, response bias (both just missed at 50… the real number is probably around 70)
* 150 PA - K rate, line drive rate, pitches/PA
* 200 PA - BB rate, grounder rate, GB/FB ratio
* 250 PA - flyball rate
* 300 PA - HR rate, HR/FB
* 350 PA - sensitivity
* 400 PA - none
* 450 PA - none
* 500 PA - OBP, SLG, OPS, 1B rate, popup rate
* 550 PA - ISO
* 600 PA - none
* 650 PA - none

So after 100 PA (roughly a month, if a player is starting nearly every day), I can tell you about how much a batter likes to swing and how good he is at making contact. But what happens when the ball leaves his bat? Well, at 150 PA I can tell you if he likes to hit line drives (and line drives are good…), which is the first indicator to stabilize that says anything about what happens to the ball off the bat.
At 150 PA, I can also start telling whether he likes to work the count and whether he's a strikeout king. By 250 PA, I can tell a lot about his walking tendencies and whether he's going to be a ground ball hitter or a flyball hitter. I still have no stats that have stabilized that tell me outcomes about where the ball landed after it left the bat. Are you the type of hitter who likes to hit balls into fielders' gloves or onto that lovely green substance in the outfield? At 300 PA, I finally find out whether or not the player likes to hit the ball out of the park every once in a while. Finally, a lot of the usual one-number stats (OBP, SLG, OPS) don't stabilize until 500 PA, which is also when you find out whether you're a singles hitter.

A few very interesting stats didn't stabilize, even after 650 PA. Those stats, with their split-half correlations at 650 PA in parentheses:

* Batting average (.586)
* BABIP (.586)
* 2B+3B rate (.401)
* WPA (.403)
* Context-neutral WPA (.590)

Even after 650 PA, batting average isn't an ideal descriptor of a player's true talent level, at least insofar as his ability to put up a repeat performance of that same AVG. Why do we make such a big deal out of batting titles? I have no idea. (A note: AVG probably stabilizes around 1000 PA or so — and that's just a guess on my part — so career AVG might be a decent enough statistic as far as reliability goes, assuming the player has been around for that many PA.)

The last question that I'll take on is exactly how stable these stats are for full-season starters (650 PA) or for part-timers (300 PA). If you're trying to predict next year's performance, how much can you trust the previous year's numbers? Again, the stats are listed with their split-half correlation coefficients in parentheses, and higher is better.
At 650 PA:

* AVG (.586), OBP (.779), SLG (.762), OPS (.773), ISO (.740), BABIP (.586)
* 1B rate (.831), 2B+3B rate (.401), HR rate (.855), BB rate (.878), K rate (.907)
* GB rate (.883), LD rate (.937), FB rate (.871), IFFB rate (.703), GB/FB (.918), HR/FB (.879)
* WPA (.403), context-neutral WPA (.590)… I didn't even bother looking at clutch
* Swing % (.954), Contact % (.959), Sensitivity (.833), Response Bias (.961), Pitches/PA (.881)

At 300 PA:

* AVG (.328), OBP (.596), SLG (.634), OPS (.624), ISO (.636), BABIP (.240)
* 1B rate (.572), 2B+3B rate (.218), HR rate (.741), BB rate (.821), K rate (.844)
* GB rate (.805), LD rate (.883), FB rate (.764), IFFB rate (.610), GB/FB (.809), HR/FB (.752)
* WPA (.327), context-neutral WPA (.398)
* Swing % (.940), Contact % (.925), Sensitivity (.742), Response Bias (.937), Pitches/PA (.857)

A few limitations of this study. As with any study of this kind, any time you slap a minimum number of PA on a sample, it becomes a selective sample. Playing time is not handed out randomly (would that it were!), and those with 650 PA are a select group of players, namely those who are good enough to justify starting every game all year. Because I'm comparing players to themselves (a within-subjects design), I can control for some of that. However, it does leave open the possibility that players who are full-time starters are more consistent than those who aren't. The split-half even-odd methodology might help to control for it, but I suppose there could be a confound in there somewhere. The other problem is that as I raise the bar for the minimum number of PA, fewer and fewer players meet the criteria. When putting together a correlation, that drives down my statistical power. In an ideal world, I'd have a million player-seasons (or in this case, two-season windows) to use, but I don't. Still, my smallest sample size was 127, which is pretty good for a correlation study.
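The reliability coefficients above scale with sample size in a predictable way, which gives a rough way to sanity-check the stabilization guesses (for instance, that AVG needs roughly 1000 PA). The standard psychometric tool is the Spearman-Brown prophecy formula, which projects how a reliability coefficient changes when the sample is lengthened by a factor of k. To be clear, this is a sketch of my own and not a method used in the study, and it naively treats each quoted coefficient as the reliability at its stated PA level:

```python
def spearman_brown(r, k):
    """Projected reliability when a sample is lengthened by a factor of k:
    r_k = k*r / (1 + (k - 1)*r)."""
    return k * r / (1 + (k - 1) * r)

def pa_for_target(r, n, target=0.70):
    """PAs needed to reach `target` reliability, given reliability r at n PA.
    Solves the Spearman-Brown formula for k, then scales n."""
    k = target * (1 - r) / (r * (1 - target))
    return n * k

# AVG showed a split-half correlation of .586 in the 650 PA sample:
pa_needed = pa_for_target(0.586, 650)
# This lands a bit above 1000 PA, consistent with the article's
# "around 1000 PA or so" guess for when AVG stabilizes.
```

The same function applied to the other never-stabilized stats (2B+3B rate at .401, WPA at .403) implies they would need several seasons of data before clearing .70, which matches the article's pessimism about using them in analyses.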
I don't see a way around these limitations, but as always, I'm open to suggestions. The lessons to be learned: those who traffic in projection systems might do well to look closely at these types of analyses. An OBP based on 600 PA is going to be a much more reliable predictor than one based on 400 PA, and that should be taken into account when projecting next year's OBP. Even just from a fan perspective, this is a good reminder that we shouldn't be fooled by a small sample size. Now we can know exactly how small a sample size we should be wary of. If you've made it this far and have read the whole thing, you win a cookie. If you have a worthwhile suggestion for the comments, you win two cookies.

This entry was posted on Wednesday, November 14th, 2007 at 12:10 am and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

10 Responses to "525,600 minutes: How do you measure a player in a year?"

1. dan says: November 16th, 2007 at 9:31 am
Phew, made it to the end. It only took me 3 days to get around to reading this…. Not sure if this can be done, but what if batting average is normalized to account for fluctuation in player BABIP? This would probably make the correlation for AVG much stronger by eliminating the BABIP component.

2. Pizza Cutter says: November 16th, 2007 at 11:59 am
Kinda like a DIPS ERA, just for batters?

3. dan says: November 16th, 2007 at 1:48 pm
Yea, that general idea, but on a more personal level…. I'd adjust the individual season BABIP to match career BABIP (not to league average, as DIPS kinda sorta tries to do), since hitters have a lot of, although not complete, control over their own BABIP.

4. Pizza Cutter says: November 16th, 2007 at 4:57 pm
I've seen a few attempts to do something like this.
There's luck-adjusted AVG, which I've seen floating around elsewhere, which takes into account where each ball was hit and the probability of its being turned into an out. I'm guessing that BABIP would stabilize over a few years' worth of data, so your approach of using career numbers has merit. Hmmm… *hamster climbs on wheel and begins running*

5. The Detroit Tiger Weblog