Correlation Coefficient



Correlation refers to a quantitative relationship between two variables that can be measured on either ordinal or continuous scales. Correlation does not imply causation; it implies only an association between two variables. The strength of a correlation is indicated by the correlation coefficient.

The correlation coefficient is a statistic that is calculated from sample data and is used to estimate the corresponding population correlation coefficient. Correlation coefficients generally take values between −1 and +1. A positive value implies a positive association between variables (i.e., high values of one variable are associated with high values of the other), while a negative value implies a negative association between variables (i.e., high values of one variable are associated with low values of the other). Thus, a coefficient of −1 means the variables are perfectly negatively related, while +1 means a perfect positive relation. A coefficient of 0 means there is no linear (or, for rank-based coefficients, monotonic) association; the variables may still be related in some other way.

For hypothesis testing, the null hypothesis that the population correlation coefficient rho is 0 is rejected if the sample statistic is unlikely to have been drawn from a population with a true rho of 0. If the sample coefficient is exactly 0, the null hypothesis is not rejected. For a nonzero coefficient, the probability of rejecting the null hypothesis increases both as the coefficient moves farther from 0 and as the sample size increases.
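The effect of sample size on the test can be illustrated with a short Python sketch. It relies on the Fisher z-transformation, a large-sample approximation that is an assumption of this example and is not a method named in this entry; the function name `approx_p_value` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z approximation)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    if n <= 3:
        raise ValueError("the approximation needs more than 3 paired observations")
    # Under H0, atanh(r) * sqrt(n - 3) is approximately standard normal.
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same sample coefficient becomes stronger evidence as n grows.
for n in (10, 30, 100):
    print(n, round(approx_p_value(0.3, n), 4))
```

With the same sample coefficient, the approximate p-value shrinks as the number of paired observations grows, which is the behavior described above.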

There are several correlation coefficients. The two most widely used are Pearson's product-moment correlation, a parametric statistic, and Spearman's rank correlation, a nonparametric statistic.

The Pearson product-moment correlation coefficient (r) quantifies the linear relationship between variables in terms of their actual raw values. Its use assumes that the relationship is linear and that the variables are normally distributed.

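The Pearson correlation coefficient for two variables X and Y is defined as the covariance of X and Y divided by the product of the standard deviations of the individual variables:

$$ r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where s_X and s_Y are the sample standard deviations, \bar{x} and \bar{y} are the sample means, and n is the number of paired observations (x_i, y_i).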
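As a minimal sketch of this definition, the following Python function computes the sample covariance and the two sample standard deviations and divides one by the product of the others. The function name `pearson_r` and the data are illustrative assumptions, not part of the original entry.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations all use the n - 1 divisor,
    # so the divisor cancels in the final ratio.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov_xy / (sd_x * sd_y)


# Made-up data for illustration only.
print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```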

The value of the correlation coefficient can be strongly influenced by one outlying point. For interpretation, r² represents the proportion of the variance in one variable that is "explained" by the other variable.
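The influence of a single outlying point can be seen in a small Python sketch. The data are invented for illustration, and the helper `pearson_r` is a hypothetical name; the example simply computes r and r² with and without one extreme observation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Six made-up points with almost no linear trend.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
r_without = pearson_r(x, y)

# The same points plus one extreme observation.
r_with = pearson_r(x + [20], y + [20.0])

print("without outlier: r =", round(r_without, 3), " r^2 =", round(r_without ** 2, 3))
print("with outlier:    r =", round(r_with, 3), " r^2 =", round(r_with ** 2, 3))
```

In this constructed case the added point alone moves the coefficient from near 0 to near +1, so a scatter plot is worth inspecting before r or r² is interpreted.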

The Spearman rank correlation coefficient (r_s) is used for ordinal variables (i.e., any data that can be ranked) and requires less stringent assumptions about the distributions of the variables of interest. It measures the strength of the relationship between the ranks of the data; it can therefore detect a relationship that is monotonic but nonlinear.

The Spearman rank correlation is computed with the same formula as the Pearson correlation coefficient, applied to the ranks of the data rather than to the raw values. The rank correlation coefficient is affected by the number of ties between data points. If there are no ties in rankings, the Spearman coefficient can be expressed more simply as:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where d_i is the difference in ranks between x_i and y_i, and n is the number of paired observations. If more than half the ranks are tied, the Spearman coefficient is unreliable.
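The relationship between the two forms of the Spearman coefficient can be checked with a short Python sketch. All function names here are hypothetical, the data are made up, and tied values are assigned the average of the ranks they span, which is a common convention assumed for this example.

```python
import math


def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_r(x, y):
    """General form: the Pearson formula applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def spearman_shortcut(x, y):
    """Simplified form 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); intended for data without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# No ties: both forms agree.
x, y = [10, 20, 30, 40, 50], [1, 3, 2, 5, 4]
print(spearman_r(x, y), spearman_shortcut(x, y))

# Monotonic but nonlinear: the rank correlation is perfect, Pearson's r is not.
curve = [math.exp(v) for v in [1, 2, 3, 4, 5]]
print(spearman_r([1, 2, 3, 4, 5], curve), pearson_r([1, 2, 3, 4, 5], curve))

# With ties, the simplified form no longer matches the general form.
print(spearman_r([1, 2, 2, 3], [1, 2, 3, 4]), spearman_shortcut([1, 2, 2, 3], [1, 2, 3, 4]))
```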

One example of the use of correlation coefficients is a study of the effects of mercury exposure at a thermometer factory. The study found significant correlations between mercury levels in the air and mercury levels in urine (r = 0.92), blood (r = 0.79), and hair (r = 0.42).

George Wells

(see also: Probability Model; Statistics for Public Health)

Bibliography

Elihu, D. R.; Nechama, P.; and Menachem, L. (1982). "Mercury Exposure and Effects at a Thermometer Factory." Scandinavian Journal of Work, Environment and Health 8 (Supp. 1):161–166.

Encyclopedia of Public Health Wells, George