October 05, 2005

Home Runs and Steroids: Some Perspective, Please
Home runs were down in 2005. Was it because of steroid testing?

There has been a lot of discussion (at SportsFilter and elsewhere) about the drop in home run production during the 2005 season, and the link between that power drop and MLB's new steroid testing policy.

The number of home runs hit in major league baseball games decreased in 2005: this is an indisputable fact. In the American League (AL), home runs were down 7% (from 2605 in 2004 to 2421 in 2005). In the National League (NL), home runs were down 10% (from 2846 in 2004 to 2561 in 2005). For MLB as a whole, that's a decrease of about 8.6% in one year. Many fans, and many baseball insiders, have given the credit (or blame) for this decrease in power to the new steroid testing policy; the argument is applied both to the league as a whole and to individual players (see Jason Giambi).

I'm not going to discuss individual players here, because we have very few facts at our disposal. The question I want to explore is whether the new steroid testing policy had any impact on the ability of major league players, as a group, to hit home runs; and if it did, then how much impact did it have? I won't pretend that I can answer that question conclusively. However, I would like to put the conversation on a firmer statistical footing.

Percentage changes and "single-season" variability

Most educated people understand that the absolute change in a given quantity is not always that useful for estimating significance. For example, if I told you that there were 20 fewer winning lottery tickets in Canada in 2005 than there were in 2004, your logical next question would be "out of how many?" Without that value, my original statement is meaningless — 20 out of 100 might be a big change, but 20 out of 10,000 would not be.

A more sophisticated thinker would want to know the relative change or percentage change. If I told you that there were 400 winning lottery tickets in Canada in 2004, and only 380 in 2005, you would know that the year-over-year change was 20 out of 400, or 5%. We have an intuitive sense of the significance of 5%. We also tend to think that a 5% change is a 5% change; a decrease of 2 out of 40 is the same as a decrease of 20 out of 400, which is the same as a decrease of 2,000 out of 40,000.

There is a trap here, however; often, that common sense thinking is just plain wrong.

Specifically, the number of people who win the lottery in a year is an example of a special kind of statistic:

  • There is a large number of opportunities (tickets) for success (winning a prize).
  • Each opportunity can be counted as either a success (a win) or a failure (a loss).
  • The probability of success is the same for all opportunities.
  • The probability of success is small.
  • The number of successes is still reasonably large (about 400 per year).

Whenever you encounter such a statistic, here is a handy rule of thumb for estimating its confidence interval: the one-sigma (one standard deviation) confidence interval is plus or minus the square root of the observed number of successes. (If you are interested in the derivation, take a look at the properties of the binomial distribution.)
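
For those who would rather check that rule than derive it, here is a quick Python sketch. The ticket count and the win probability are numbers I invented purely so that the expected number of winners comes out around 400; the point is only that the exact binomial standard deviation and the square-root shortcut agree when the success probability is small.

    import math

    N = 4000000   # hypothetical number of tickets sold (made up for this sketch)
    p = 0.0001    # hypothetical chance that any one ticket wins (made up)

    expected_wins = N * p                     # about 400 winners per year
    exact_std = math.sqrt(N * p * (1 - p))    # binomial standard deviation
    rule_of_thumb = math.sqrt(expected_wins)  # square root of the number of successes

    print(expected_wins, exact_std, rule_of_thumb)
    # prints roughly: 400.0 19.999 20.0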

For our thought experiment, this means that the one-sigma confidence interval is ±√400 = ±20 people or so. A decrease of 20 out of 400, therefore, is consistent with the kind of random variability that we would expect, given the number of observed successes. Note that the percentage decrease (5%) is not a reliable guide here. For example, a 5% decrease from 40,000 to 38,000 would be a change of ten standard deviations! And that is extremely unlikely by random chance alone.
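
Here is the same comparison as a couple of lines of Python, just to make it concrete (the function name is my own invention for this sketch):

    import math

    def change_in_sigmas(before, after):
        # Express a year-over-year drop in units of the sqrt(N) rule of thumb.
        return (before - after) / math.sqrt(before)

    print(change_in_sigmas(400, 380))      # 1.0  -- one standard deviation; unremarkable
    print(change_in_sigmas(40000, 38000))  # 10.0 -- ten standard deviations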

OK, back to home runs. If we squeeze the HR statistic into this template — an HR is a "success," an AB (at-bat) is an "opportunity" — we could estimate that the confidence interval on the observed number of home runs is about ±√2500 = ±50 HR. This is what I might call the "single-season" variability of the HR total. What we're saying is that if we repeated the experiment (the 2005 season) a number of times in exactly the same way, we would expect the observed total to randomly vary by about 50 home runs either way.
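
If you plug the actual 2005 totals into that rule instead of the round number 2500, you get essentially the same answer:

    import math

    print(math.sqrt(2421))  # about 49 -- square root of the 2005 AL home run total
    print(math.sqrt(2561))  # about 51 -- square root of the 2005 NL home run total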

This would be a good time to admit that I glossed over an important issue above: an AB is not exactly like a lottery ticket. In particular, the probability of a success (HR) is not the same for every AB, since it depends on the hitter, pitcher, ballpark, weather, and so on. Therefore, we shouldn't put too much stock in the ±50 variability I derived above. I suspect that the variability in the probability of success actually leads to greater variability in the number of HR, which would mean that ±50 is a lower bound on the uncertainty. The proof of this is beyond my abilities, however.

No matter, really, because my main point is this: the relative (percentage) change is not a good indicator of the significance of the change. To assess the significance of the change in the number of home runs, we need an estimate of the random variability.

Short-term variability

For the rest of this discussion I am going to talk about the home run rate, HR/AB, which I will use as a measure of the ability of major league baseball players to hit home runs.

I am looking for an estimate of the random variability in HR/AB. In addition to purely random variability, there are numerous deterministic factors that can affect HR/AB, including equipment, the rules of the game, the skill of pitchers, etc. If we could identify a period where we think that all of those factors have been reasonably constant, or at least only randomly varying, then we could consider that time period as an ensemble of experiments, and estimate the variability from the measurements. This is a very common technique in experimental science: make N measurements and report the mean and standard deviation of the set of results.

Now, this is not quite experimental science. We don't have control over very many variables. But just for the heck of it, let's consider the eleven-year period from 1994 through 2004. Here are the HR/AB rates for the NL and AL during that time:

HR/AB from 1994 through 2004, with 2005 for comparison (source: Baseball Reference)

Season    NL (%)    AL (%)
1994      2.78      3.21
1995      2.78      3.11
1996      2.86      3.47
1997      2.80      3.17
1998      2.89      3.19
1999      3.25      3.37
2000      3.39      3.42
2001      3.35      3.21
2002      2.96      3.17
2003      3.06      3.19
2004      3.21      3.31
MEAN      3.03      3.26
STD       0.23      0.12
2005      2.93      3.11

The NL rate shows some signs of non-random variability during this time, being lower from 1994 through 1998, higher from 1999 through 2001, and then dropping to an intermediate value. A number of new ball parks came into service during this period, which might or might not explain some of this variation. The AL rate, however, is perfectly consistent with a constant rate of 3.26 HR per 100 AB, with a random variability of ±0.12 HR per 100 AB.

In this light, what are we to make of the drop in 2005? My suggestion at this point would be that we should make nothing of it. The 2005 AL rate merely ties the lowest value observed in the previous eleven seasons (3.11 in 1995), and it is only about 1.25 standard deviations below the mean — not very conclusive. The 2005 NL rate is even closer to its mean (less than half a standard deviation away), and it is higher than the rate observed in five of the eleven seasons from 1994 through 2004.
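
If you want to check the MEAN and STD rows of the table, or the standard-deviation comparisons above, here is a short Python sketch using the rates exactly as listed (HR per 100 AB); the sample standard deviation reproduces the STD row:

    from statistics import mean, stdev

    # HR per 100 AB, 1994 through 2004, from the table above
    nl = [2.78, 2.78, 2.86, 2.80, 2.89, 3.25, 3.39, 3.35, 2.96, 3.06, 3.21]
    al = [3.21, 3.11, 3.47, 3.17, 3.19, 3.37, 3.42, 3.21, 3.17, 3.19, 3.31]
    rate_2005 = {"NL": 2.93, "AL": 3.11}

    for label, rates in (("NL", nl), ("AL", al)):
        m, s = mean(rates), stdev(rates)
        z = (m - rate_2005[label]) / s  # how far below the 1994-2004 mean was 2005?
        print(label, round(m, 2), round(s, 2), round(z, 2))

    # NL 3.03 0.23 0.43  -- less than half a standard deviation below the mean
    # AL 3.26 0.12 1.25  -- about one and a quarter standard deviations below the mean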

Long-term variability

Finally, it is worth pointing out that compared to the historical changes in home run rate, the 2004-2005 decrease is minuscule. Again using the data from Baseball-Reference.com, here's a picture that shows the HR/AB rate for the NL and AL from 1900 through 2005. (The error bars are my own, derived using the same analysis that led to the ±50 value above.)
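
In case you're wondering how a ±√(HR) uncertainty on a season's home run count turns into an error bar on HR/AB: divide by the number of at-bats. Here is a rough Python sketch using the 2005 AL numbers; note that the AB total below is back-calculated from the rounded 3.11% rate, not the official figure.

    import math

    hr = 2421                         # 2005 AL home runs
    ab = round(hr / 0.0311)           # roughly 78,000 AB, inferred from the 3.11% rate

    rate = 100 * hr / ab              # HR per 100 AB, about 3.11
    error = 100 * math.sqrt(hr) / ab  # one-sigma error bar, about 0.06 HR per 100 AB

    print(round(rate, 2), round(error, 2))  # 3.11 0.06

That ±0.06 is noticeably smaller than the ±0.12 year-to-year scatter in the AL table, which is consistent with my earlier suspicion that the counting-statistics estimate is only a lower bound on the real variability.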

OK, so here's your challenge: point to the place on the graph where steroids were first introduced to major league baseball. Here's a helpful hint: anabolic steroids were first used in weightlifting in about 1959.

If you're one of those fans who insist that steroids are the main reason for the modern power surge, I suppose you might argue that we started to see their influence in the late 1970s. However, the increase in home run rate during that period is far from unprecedented; HR/AB increased more in the 1920s and 1930s, and again in the 1940s and 1950s, than it did in the 1980s and 1990s. Granted, I'm glossing over a huge drop in 1988 caused by a change in the baseball, but the point is still valid. It doesn't prove that it wasn't steroids, but it does show that the game evolves over time, and that massive increases in home run production can be driven by factors other than steroids.

I suppose another possibility is that widespread steroid use started in 1994, since there was a large jump in HR/AB at that time. I find this somewhat hard to believe, though. First of all, it seems unlikely that widespread steroid use started suddenly between the 1993 and 1994 seasons. This is something that is done in relative secrecy, after all, so wouldn't it spread more slowly than that? And second, steroid use was already well-established and well-known in many sports by 1994, so why would baseball players wait so long?

At any rate, no matter when you think the "pre-steroid" era ended, it's clear that the 2005 policy has done almost nothing to get us back to those levels. Maybe the testing hasn't scared anybody into stopping, or maybe the long-term benefits of steroid use take time to wear off. I think the most likely answer is that the impact of steroids has been wildly overblown.

posted by Amateur to commentary at 02:42 PM - 0 comments
