Regression Towards the Mean



Regression towards the mean is a fundamental, yet at first sight puzzling, statistical phenomenon that arises between two correlated variables. It is a natural and inherent consequence of correlation generally being imperfect.

The effect of regression towards the mean was recognized in the late nineteenth century by Francis Galton (1822-1911) when investigating the relationship between the heights of parents and those of their adult children (see Bland and Altman 1994; Stigler 1986). Such height data are positively correlated; tall parents tend to produce tall adult children.

The diagram shows an ellipse that represents a cloud of correlated data points. For simplicity, the X and Y values are assumed here to have equal means and equal standard deviations. For Galton, the X and Y values would be parents' height and adult offspring's height, but they could be any correlated variables, for example a measure of crime on some scale for a sample of areas at one time, X, and at a later time, Y. The tilt of the ellipse shows that high values earlier are associated with high values later and vice versa; that is, positive correlation exists. Also on the diagram is the line of equality, running diagonally from bottom left to top right along the major axis of the ellipse. (Equality means that Y = X, so the slope is 1.) Any point below this line indicates that the value of Y is smaller than that of X, whereas a point above indicates that it is greater. If the correlation were perfect, the ellipse would narrow and become identical with the line of equality.

Also on the diagram is a line of shallower slope that gives the mean of Y for a given X. This is the conditional mean of Y. One can see that the conditional mean of Y given X is not the line of equality, because a vertical slice through the ellipse shows that the bulk of the distribution lies above the line of equality for an X-value below the mean of X, whereas it lies below the line for an X-value above the mean of X. In fact, for the situation described, that is, with the standard deviations of X and Y equal, the line of the conditional mean has a slope equal to the (Pearson) correlation coefficient. The expected Y-value for a given X-value, in other words the conditional mean, is therefore above the line of equality for X below its mean and below the line of equality for X above its mean. Consequently, Y-values tend to be closer to the overall Y-mean than the corresponding X-values are to the X-mean, the effect being greater the weaker the correlation.
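The behaviour of the conditional mean can be checked with a small simulation. The following is only a sketch under stated assumptions: the data are taken to be standardized and bivariate normal, the correlation of 0.5 and the sample size are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random


def simulate_pairs(r, n, seed):
    """Draw n (x, y) pairs with mean 0, SD 1 and correlation r (assumed model)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives y unit variance and correlation r.
        y = r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def mean(values):
    return sum(values) / len(values)


def conditional_mean_slope(pairs):
    """Least-squares slope of y on x, i.e. the slope of the conditional-mean line."""
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


R = 0.5  # illustrative correlation, not a value from the text
pairs = simulate_pairs(R, n=50_000, seed=1)
slope = conditional_mean_slope(pairs)

# Vertical "slices" through the cloud: one well above and one well below the X-mean.
high = [(x, y) for x, y in pairs if x > 1.0]
low = [(x, y) for x, y in pairs if x < -1.0]
high_x, high_y = mean([x for x, _ in high]), mean([y for _, y in high])
low_x, low_y = mean([x for x, _ in low]), mean([y for _, y in low])

print(f"slope of conditional mean: {slope:.3f} (correlation used: {R})")
print(f"high slice: mean X = {high_x:.2f}, mean Y = {high_y:.2f}")
print(f"low slice:  mean X = {low_x:.2f}, mean Y = {low_y:.2f}")
```

In the high slice the mean of Y falls below the mean of X, and in the low slice it lies above, which is the pattern described for points relative to the line of equality.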

This is precisely what Galton found: the heights of adult children tended to be closer to the mean of the population than their parents' heights were; that is, they regressed towards the mean. Note that this does not make everyone the same height in the end; the distribution can remain stable generation after generation.
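That regression towards the mean is compatible with a stable distribution can be illustrated with a toy model. The sketch below assumes, purely for illustration, that each offspring height equals the population mean plus the correlation times the parent's deviation, plus independent random variation sized to keep the overall spread unchanged. The mean, standard deviation, and correlation are hypothetical numbers, not Galton's data.

```python
import math
import random


def next_generation(heights, pop_mean, pop_sd, r, rng):
    """Offspring heights under an assumed linear model with fresh random variation."""
    # Noise SD chosen so that r^2 * sd^2 + noise_sd^2 = sd^2 (spread is preserved).
    noise_sd = pop_sd * math.sqrt(1.0 - r * r)
    return [pop_mean + r * (h - pop_mean) + rng.gauss(0.0, noise_sd) for h in heights]


def std_dev(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


POP_MEAN, POP_SD, R = 170.0, 7.0, 0.5  # hypothetical values in centimeters
rng = random.Random(42)

generation = [rng.gauss(POP_MEAN, POP_SD) for _ in range(20_000)]
spreads = [std_dev(generation)]
for _ in range(5):
    generation = next_generation(generation, POP_MEAN, POP_SD, R, rng)
    spreads.append(std_dev(generation))

for i, s in enumerate(spreads):
    print(f"generation {i}: standard deviation = {s:.2f} cm")
```

Each individual's expected height is pulled towards the mean, but the fresh variation added in every generation replaces the spread that regression removes, so the standard deviation stays roughly constant.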

A similar situation applies in the more general case in which neither the means nor the standard deviations of the X and Y variables are equal to each other, such as when successive generations are getting taller on average and becoming more variable in absolute terms, that is, in centimeters. In this more general case the elliptical cloud of data points will be shifted up and will have greater vertical extension, owing to the greater standard deviation of the Y variable. The major axis of the elliptical cloud will no longer be the line of equality, but it will still represent the line that the ellipse shrinks towards as the correlation becomes perfect. As in the earlier case, the line of the conditional mean is still at a shallower slope than the major axis, and so the same effect occurs: there is regression towards the mean, such that the expected Y-value will be fewer Y-standard deviations from the Y-mean than the X-value is X-standard deviations from the X-mean. In fact, for an X-value Zx standard deviations from the X-mean, the expected Y-value will be (1 - correlation coefficient) multiplied by Zx standard deviations in Y from the major axis line, and equivalently only the correlation coefficient multiplied by Zx standard deviations in Y from the Y-mean. So, for example, if the X-value is 1.5 standard deviations above its mean and the correlation coefficient between X and Y is 0.2, then the expected Y-value is only 0.3 standard deviations above its mean, that is, 1.2 standard deviations fewer than the original 1.5.
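The arithmetic of the general case can be written as a short calculation. This is a minimal sketch: the function names are hypothetical, and the height figures in the second example are invented for illustration. Only the 1.5 and 0.2 values come from the text.

```python
def expected_z_of_y(z_x, r):
    """Expected number of Y-standard deviations from the Y-mean, given Zx and r."""
    return r * z_x


def expected_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Conditional mean of Y given X when means and SDs differ (linear case)."""
    z_x = (x - mean_x) / sd_x
    return mean_y + expected_z_of_y(z_x, r) * sd_y


# The example from the text: X is 1.5 SDs above its mean and the correlation is 0.2.
z_x, r = 1.5, 0.2
z_y = expected_z_of_y(z_x, r)
shortfall = z_x - z_y  # distance below the major axis line: (1 - r) * Zx
print(f"expected Y is {z_y:.2f} SDs above its mean, {shortfall:.2f} fewer than X")

# A hypothetical height example with unequal means and SDs (centimeters).
child = expected_y(x=182.0, mean_x=170.0, sd_x=6.0, mean_y=172.0, sd_y=8.0, r=0.5)
print(f"expected offspring height: {child:.1f} cm")
```

In the hypothetical height example the parent is 2 standard deviations above the parental mean, while the expected offspring height is only 1 standard deviation above the offspring mean.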

The statistical method of regression owes its name to Galton's discovery of this effect. Indeed, the regression line is simply the line of the conditional mean, exactly as discussed above. It is a surprise to some that the regression line does not run along the major axis of the cloud of data points. The equation for the line of the conditional mean can be determined mathematically (see, for example, Miller and Miller 2004, which treats the more general case).
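The difference between the regression line and the major axis can be seen numerically. The sketch below again assumes standardized bivariate normal data with an illustrative correlation of 0.5. It compares the least-squares slope with the slope of the major axis, computed here from the orientation of the principal axis of the sample covariance matrix. The function names are hypothetical.

```python
import math
import random


def simulate_pairs(r, n, seed):
    """Draw n (x, y) pairs with mean 0, SD 1 and correlation r (assumed model)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def second_moments(pairs):
    """Return the sample variances of x and y and their covariance."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / n
    syy = sum((y - my) ** 2 for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / n
    return sxx, syy, sxy


def regression_slope(sxx, sxy):
    """Slope of the least-squares regression of y on x (the conditional-mean line)."""
    return sxy / sxx


def major_axis_slope(sxx, syy, sxy):
    """Slope of the major axis of the data ellipse (first principal axis)."""
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(angle)


sxx, syy, sxy = second_moments(simulate_pairs(r=0.5, n=50_000, seed=7))
ols = regression_slope(sxx, sxy)
major = major_axis_slope(sxx, syy, sxy)
print(f"regression slope: {ols:.3f}")
print(f"major axis slope: {major:.3f}")
```

With equal standard deviations the major axis slope comes out close to 1, the line of equality, while the regression slope comes out close to the correlation coefficient.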

A consequence of regression towards the mean is that if an intervention is applied to a group with high values beforehand (for example, a bad initial state), and the control is another group with lower values beforehand (a better initial state), one is likely to find that the intervention appears to work even if it really has no superior effect. This is because the higher measurements are expected to become lower regardless of any intervention. It is therefore vital that comparisons are made like with like.
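A simulated example shows how an ineffective intervention can appear to work. In this sketch the "before" and "after" scores are standardized and correlated, and nothing changes between the two occasions, so any apparent effect is due to regression towards the mean alone. The correlation of 0.6, the selection threshold of 1.0, and all names are hypothetical choices for illustration.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2024)
R = 0.6          # assumed before/after correlation
THRESHOLD = 1.0  # units with a "before" score above this receive the intervention

treated_changes = []
control_changes = []
for _ in range(20_000):
    before = rng.gauss(0.0, 1.0)
    # The "after" score has no true shift: the intervention does nothing at all.
    after = R * before + math.sqrt(1.0 - R * R) * rng.gauss(0.0, 1.0)
    if before > THRESHOLD:
        treated_changes.append(after - before)
    else:
        control_changes.append(after - before)

treated_change = mean(treated_changes)
control_change = mean(control_changes)
print(f"mean change, treated (high before): {treated_change:+.2f}")
print(f"mean change, control (lower before): {control_change:+.2f}")
```

The treated group improves on average and the control group worsens slightly, even though no effect was built into the simulation. Selecting both groups from units with comparable "before" values removes this artefact.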

While it is possible to envisage situations that are more complex than those described above, so that the conditional mean is consequently no longer a simple straight line, one should never assume that the effect described is nonexistent.


Bland, J. Martin, and Douglas G. Altman. 1994. Regression Towards the Mean. British Medical Journal 308 (June 4): 1499.

Miller, Irwin, and Marylees Miller. 2004. John E. Freund's Mathematical Statistics with Applications. 7th ed. London: Prentice Hall.

Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Belknap.

Paul R. Marchant