Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman in 1904. It is a measure of correlation that captures the strength of association between two variables without making any assumptions about the frequency distributions of the underlying variables.

The computation of the Spearman rank correlation coefficient requires first that the values of the two variables be assigned ranks. When the data are not initially ranked, the first step is to rank the two variables (X and Y) separately. The Spearman coefficient is then the ordinary Pearson correlation coefficient r computed between the ranked values of X and Y. Equivalently, when there are no tied ranks, we take for each case i the difference d_i between the rank of X_i and the rank of Y_i, and calculate the coefficient from these differences. This coefficient of rank correlation measures the degree of association between the two sets of ranks. The formula of this statistic is

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where r_s = the Spearman rank correlation coefficient, d_i = the difference between the ranks of the i-th pair of corresponding values of the variables X and Y, and n = the number of pairs of values in the sample.
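
As a minimal sketch of the computation described above, the Python below ranks two samples and applies the rank-difference formula r_s = 1 - 6 Σ d_i² / (n(n² - 1)), which is exact only when there are no tied ranks. The function names and the data are hypothetical, chosen for illustration; giving tied values the mean of the ranks they span is a common convention that the text itself does not specify.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_from_differences(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); exact only without ties."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up illustrative data (not from the article).
hours = [2, 4, 5, 7, 9]
score = [50, 65, 60, 80, 90]
r_s = spearman_from_differences(hours, score)  # sum of d^2 is 2, so r_s is about 0.9
```

Here only the second and third cases swap places between the two rankings, so the sum of squared rank differences is 2 and the coefficient stays close to 1.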

This coefficient has two notable properties. First, the values of the Spearman correlation coefficient always lie between -1 and 1. When the value of the coefficient is 1, there is a perfect positive correlation or direct correlation. That is, large values of one variable, for example, X, are associated with large values of the other, for example, Y, and small values of the X variable are associated with small values of the Y variable. When the coefficient is -1, there is a perfect negative correlation or indirect correlation. In this case, large values of the X variable are associated with small values of the Y variable and vice versa. When the value is equal to zero, there is no rank correlation between the two variables. This statistic shows only that two variables X and Y correlate positively or negatively; it does not offer any indication that one variable affects the other.

The second property of the Spearman rank correlation, and of its parametric counterpart, is that it is a pure number without units or dimensions. In using this coefficient we deal with two sets of ranks assigned to the variables X and Y. The original observations on the variables may be ranks, or they may be numerical values ranked by magnitude.

In addition to interpreting the magnitude and the direction of a correlation coefficient, the significance of a given value of correlation should also be tested. The null hypothesis states that no correlation exists and that whatever value of correlation is found between the two examined variables is due to sampling error.

A modern approach to testing the hypothesis that the value of the Spearman coefficient is significantly different from zero is to use a permutation test: the coefficient is recomputed over permutations of one variable's ranks, and the probability that it would be greater than or equal to the observed r_s under the null hypothesis is calculated from the results. This method is preferable to the traditional ones in most cases, even when the dataset is large, because modern computers can generate the required permutations quickly.
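
The permutation approach can be sketched as follows. This is an illustrative Monte Carlo version under stated assumptions, not a definitive implementation: it samples random permutations rather than enumerating all of them, assumes no tied values, and computes a one-sided probability. The function names, the default of 10,000 permutations, and the add-one correction are choices made here, not taken from the text.

```python
import random


def rank(values):
    """1-based ranks, assuming no tied values (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(rank_x, rank_y):
    """Rank-difference formula for r_s, valid when there are no ties."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def permutation_p_value(x, y, n_permutations=10_000, seed=0):
    """Estimate P(r_s >= observed r_s) when the ranks of y are shuffled."""
    rng = random.Random(seed)
    rank_x, rank_y = rank(x), rank(y)
    observed = spearman(rank_x, rank_y)
    shuffled = rank_y[:]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise in the comparison.
        if spearman(rank_x, shuffled) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

For a two-sided test, one would compare absolute values of the coefficient instead.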

The traditional approach for determining significance is still widely used. It involves comparing the calculated r_s with published tables of critical values for various levels of significance. It requires only that the tables contain the pertinent values for the desired sample sizes.

An alternative approach for large samples is the approximation to the Student’s t-distribution with n - 2 degrees of freedom, given by the following formula:

$$ t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}} $$

When the sample size (n) is less than 10, the above approximation is not appropriate. For values of n less than 10, table 1 shows the critical values of r_s required for significance at the 0.05 level for both a nondirectional and a directional test.

Table 1
Critical values of Spearman rank correlation coefficient for N < 10
N	Nondirectional test	Directional test
5	1.00	0.90
6	0.89	0.83
7	0.79	0.72
8	0.72	0.62
9	0.70	0.60
SOURCE: http://faculty.vassar.edu/lowry/corr_rank.html.
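
The two decision procedures can be sketched in Python. The t formula used here, t = r_s sqrt((n - 2) / (1 - r_s²)) with n - 2 degrees of freedom, is the standard large-sample approximation and is assumed to be the one the text refers to. The dictionary copies the table 1 values, and the function names are hypothetical. The small-sample check compares magnitudes only, so for a directional test it assumes that the observed sign matches the predicted direction.

```python
import math

# Critical values of r_s at the 0.05 level, copied from table 1.
# Each entry maps n to (nondirectional, directional).
CRITICAL_VALUES_05 = {
    5: (1.00, 0.90),
    6: (0.89, 0.83),
    7: (0.79, 0.72),
    8: (0.72, 0.62),
    9: (0.70, 0.60),
}


def t_statistic(r_s, n):
    """t = r_s * sqrt((n - 2) / (1 - r_s^2)), with n - 2 degrees of freedom."""
    if abs(r_s) >= 1:
        raise ValueError("t is undefined when |r_s| = 1")
    return r_s * math.sqrt((n - 2) / (1 - r_s ** 2))


def significant_small_sample(r_s, n, directional=False):
    """Compare |r_s| with the table 1 critical value (n from 5 to 9 only)."""
    nondirectional_cv, directional_cv = CRITICAL_VALUES_05[n]
    critical = directional_cv if directional else nondirectional_cv
    return abs(r_s) >= critical
```

For example, r_s = 0.85 with n = 6 reaches the directional critical value (0.83) but not the nondirectional one (0.89).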

APPLICATIONS AND SHORTCOMINGS

The Spearman rank correlation coefficient can be used when the normality assumption of the two examined variables’ distribution is violated. It also can be used when the data are ordinal, that is, when only the rank order of the observations is meaningful. It may be a better indicator that a relationship exists between two variables when the relationship is nonlinear but monotonic, even for variables with numerical values, when the Pearson correlation coefficient indicates a low or zero linear relationship. A generalization of the Spearman coefficient is useful when there are three or more conditions, a number of subjects are all observed in each of them, and we predict that the observations will have a particular order.
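
To illustrate the point about nonlinear relationships, the sketch below compares the two coefficients on made-up data in which Y increases exponentially with X. The data and function names are hypothetical, and ties are assumed absent. The Spearman coefficient is computed as the Pearson coefficient of the ranks.

```python
import math


def pearson(x, y):
    """Ordinary product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Made-up data: y grows exponentially with x, so the relation is
# perfectly monotonic but far from linear.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)                 # noticeably below 1
r_spearman = pearson(ranks(x), ranks(y))  # equal to 1 for a monotonic relation
```

Because the ranks of X and Y coincide exactly, the rank-based coefficient reports a perfect association, while the Pearson coefficient is pulled below 1 by the curvature.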

As with other nonparametric or distribution-free procedures, the Spearman rank correlation coefficient is generally less powerful than its parametric counterpart, the Pearson product-moment correlation coefficient, when the assumptions of the latter are met. In addition, a relationship found with this coefficient provides no evidence of causality between the two variables.

ALTERNATIVE CORRELATION COEFFICIENTS

There are other examples of correlation coefficients, including the Pearson product-moment correlation coefficient, which is used for making inferences about the population correlation coefficient, assuming that the two variables are jointly normally distributed. When this assumption cannot be justified, then a nonparametric measure such as the Spearman correlation coefficient is more appropriate. The Pearson correlation coefficient measures the linear association between two variables that have been measured on interval or ratio scales. The formula that determines the Pearson product-moment correlation coefficient (r_XY) is

$$ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

where \bar{X} and \bar{Y} are the sample means of X and Y.

The significance of r_XY, where the null hypothesis states that no correlation exists between the two variables X and Y (r_XY = 0), is found by the following t-statistic, with (n - 2) degrees of freedom:

$$ t = \frac{r_{XY} \sqrt{n - 2}}{\sqrt{1 - r_{XY}^2}} $$

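Another measure of degree of concordance that is closely related to the Spearman correlation coefficient is the Kendall tau rank correlation coefficient, given by the formula

$$ \tau = \frac{2S}{n(n - 1)} $$

where S is the number of pairs of cases ranked in the same order on both variables minus the number ranked in opposite orders, and n is the sample size.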

Similar to the Spearman coefficient, Kendall’s tau lies between -1 and 1. When it is equal to +1, there is complete concordance between the two rankings, whereas when it is equal to -1, there is complete disagreement. The Kendall tau (τ) coefficient uses the same data as the Spearman correlation coefficient (r_s) but is computed differently, so the two coefficients generally do not take the same value.

In order to calculate the tau (τ) coefficient we go through the following steps.

We rank the values of variable X from 1 to n; similarly, we rank the values of variable Y, both in ascending order.
We pair each ranked value X_i with its corresponding Y_i.
We compute the variable S = Σ c_ij, summed over all pairs of cases i < j, where c_ij = 1 if X_i and X_j are ranked in the same order as Y_i and Y_j, and c_ij = -1 in the opposite case.
If there are no tied values among the X_i and Y_i (or there are few compared to the sample size n), then Kendall’s tau coefficient is equal to τ = 2S / [n(n - 1)].
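
The steps above can be sketched in Python as follows. The function name and data are hypothetical, and the sketch assumes no tied values, as in the final step. It works directly on the raw values, since comparing values pairwise gives the same ordering as comparing their ranks.

```python
from itertools import combinations


def kendall_tau(x, y):
    """tau = 2 * S / (n * (n - 1)), assuming no tied values in x or y."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        # The product is positive when the pair is ordered the same way on
        # both variables (c_ij = 1) and negative otherwise (c_ij = -1).
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            s += 1
        elif product < 0:
            s -= 1
    return 2 * s / (n * (n - 1))


# Made-up rankings: only the second and third cases are swapped,
# so 9 of the 10 pairs agree and S = 9 - 1 = 8.
tau = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])
```

This pairwise loop takes time proportional to n², which is adequate for small samples.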

An advantage of the Kendall tau coefficient compared to the Spearman correlation coefficient is that the former can be generalized in order to determine the Kendall partial correlation coefficient, which is the counterpart of the Pearson partial correlation coefficient in cases where nonparametric statistics are appropriate.

Another correlation coefficient, the correlation coefficient C of rank matrices (or “double-entrance matrix”), examines the degree of dependence between two variables X and Y, regardless of whether these variables are ranked or not, continuous or discontinuous, or normally distributed or not. It can be calculated by the formula:

$$ C = \sqrt{\frac{\chi^2}{\chi^2 + n}} $$

where χ² is the chi-squared statistic computed from the matrix and n is the total number of observations. The statistic χ² follows the chi-squared distribution with v = (r - 1)(q - 1) degrees of freedom, where r and q are the numbers of rows and columns, respectively, of the rank matrix under examination. In order to test the null hypothesis that C = 0, or that the two variables X and Y do not have a significant relationship, we calculate χ² and χ²_{α,v}, which is the critical value at significance level α. The latter can be found in the tables of the χ² distribution for v = (r - 1)(q - 1) degrees of freedom. If χ² > χ²_{α,v}, then the null hypothesis is rejected.
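
The test described above can be sketched in Python under an explicit assumption: that C is Pearson's contingency coefficient, C = sqrt(χ² / (χ² + n)), with χ² computed from the observed counts and n the total count. The function names and the counts are hypothetical, and the critical value 3.841 is the tabulated 0.05 value of the chi-squared distribution for 1 degree of freedom.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for an r x q table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def contingency_coefficient(table):
    """Return (C, chi-squared, degrees of freedom), assuming C = sqrt(chi2 / (chi2 + n))."""
    stat, n = chi_squared(table)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return math.sqrt(stat / (stat + n)), stat, dof


# Made-up 2 x 2 counts, for illustration only.
counts = [[20, 10], [10, 20]]
c, stat, dof = contingency_coefficient(counts)

# 3.841 is the tabulated 0.05 critical value of chi-squared for 1 degree of freedom.
reject_null = stat > 3.841
```

For larger matrices the critical value must be looked up for the corresponding (r - 1)(q - 1) degrees of freedom.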