Statistical reliability refers to the consistency of measurement. According to Jum C. Nunnally and Ira H. Bernstein in their 1994 publication Psychometric Theory, in classical test theory reliability is defined mathematically as the ratio of true-score variance to total variance. A true score can be thought of as the score an individual would receive if the construct of interest were measured without error. If a researcher developed a perfect way to measure shyness, then each individual’s shyness score would be a true score. In the social sciences, constructs are virtually always measured with error, and as such true scores are unknowns that must be estimated by observed scores.
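The variance-ratio definition can be illustrated with a minimal sketch. The function name and the variance values below are hypothetical, and the sketch assumes the classical test theory condition that true scores and errors are uncorrelated, so that total variance is the sum of true-score variance and error variance.

```python
def reliability_from_variances(true_var: float, error_var: float) -> float:
    """Return the ratio of true-score variance to total variance.

    Assumes true scores and errors are uncorrelated, so that
    total variance = true-score variance + error variance.
    """
    if true_var < 0 or error_var < 0:
        raise ValueError("variances must be non-negative")
    total_var = true_var + error_var
    if total_var == 0:
        raise ValueError("total variance must be positive")
    return true_var / total_var


# Hypothetical values: true-score variance 8,400 and error variance 1,600.
print(round(reliability_from_variances(8400.0, 1600.0), 2))
```

In practice the true-score variance is unknown, which is why reliability must be estimated from observed scores rather than computed this way.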
The expected standard deviation of scores for an individual taking a test many times is known as the standard error of measurement (SEM). The SEM of a score is estimated by
$$\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}$$
where $s_x$ is the standard deviation of observed scores on the test and $r_{xx}$ is the estimate of score reliability. The expectation is that if an individual takes a test many times, 68 percent of scores will fall within one SEM and 95 percent of scores will fall within two SEMs of the observed score. Assume that Jamie scored 1,000 on a test with an observed standard deviation of 100 and a score reliability estimate of .84. The SEM for Jamie would be
$$\mathrm{SEM} = 100 \sqrt{1 - .84} = 100(.40) = 40$$
If Jamie took the achievement test many times, one would expect approximately 68 percent of the resulting scores to fall between 960 and 1,040, and approximately 95 percent of the scores to fall between 920 and 1,080.
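The arithmetic in Jamie's example can be checked with a short sketch. It assumes the usual classical test theory estimate, SEM = standard deviation times the square root of one minus the reliability estimate. The function names are hypothetical and chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Estimate the SEM as sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, n_sems: int) -> tuple[float, float]:
    """Return the interval observed +/- n_sems * sem."""
    return observed - n_sems * sem, observed + n_sems * sem


# Values from the example: observed score 1,000, SD 100, reliability .84.
sem = standard_error_of_measurement(100.0, 0.84)
for n_sems in (1, 2):
    low, high = score_band(1000.0, sem, n_sems)
    print(n_sems, round(low), round(high))
```

The one-SEM and two-SEM bands correspond to the approximate 68 percent and 95 percent ranges described above.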
Reliability can refer to several different types of estimates of consistency. Alternative form reliability refers to the degree of consistency between two equivalent forms of a measure. Split-half reliability refers to the consistency between two randomly created sections of a single test. The internal consistency statistic coefficient α (also referred to as Cronbach’s α) and the related Kuder-Richardson 20 (or “KR-20”) can be thought of as the mean of all possible split-half estimates. Test-retest reliability is the consistency of scores on separate administrations of the same test. Determining an appropriate delay between the initial and follow-up administrations can be difficult. For example, if the chosen delay is too long, expected changes (e.g., learning) in the respondents may contribute to a low test-retest correlation, which may be mistaken for evidence of invalidity. If the delay is too short, respondents’ remembering and repeating their responses from the first test may contribute to a high test-retest correlation, which may be mistaken for evidence of validity. Ideally in this context, respondents would receive exactly the same score on the test and the retest. If scores tend to go up or down from the test to the retest, the correlation between the test and the retest will be reduced unless all scores rise or fall by the same amount.
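As an illustration of an internal consistency estimate, the sketch below computes coefficient α from a small table of item scores. The article does not state the formula, so the sketch assumes the standard variance-based expression, α = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Compute coefficient alpha.

    `scores` holds one row per respondent and one column per item.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two respondents and two items")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five people to three items.
responses = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]
print(round(cronbach_alpha(responses), 3))
```

When every item orders respondents identically, as in a table whose columns are exact copies of one another, the estimate reaches its maximum of 1.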
Inter-rater reliability refers to the extent to which judges or raters agree with one another. As an example, assume two judges rate teachers on the extent to which they promote a mastery orientation among students. The consistency of the judges can be estimated by correlating the scores of both judges. A 1965 study by John E. Overall found that the effective reliability of the judging process (i.e., of both judges together) can also be estimated; effective reliability will generally be better than the reliability of either judge in isolation. One problem in estimating inter-rater reliability is that the judges’ scores may reflect both random error and systematic differences in responding. This problem reflects a fundamental concern about the representativeness of judges, and, according to Richard J. Shavelson and Noreen M. Webb in their 1991 publication Generalizability Theory: A Primer, it is addressed in an extension of classical test theory known as generalizability theory.
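One common way to estimate the effective reliability of a panel of judges is the Spearman-Brown formula, which steps up the mean inter-judge correlation according to the number of judges. Whether this matches the exact procedure in Overall (1965) is an assumption here; the function name and input values are hypothetical.

```python
def effective_reliability(mean_inter_judge_r: float, n_judges: int) -> float:
    """Spearman-Brown estimate of the reliability of n_judges combined."""
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    r = mean_inter_judge_r
    denominator = 1 + (n_judges - 1) * r
    if denominator <= 0:
        raise ValueError("correlation is too negative for this formula")
    return n_judges * r / denominator


# Hypothetical case: two judges whose ratings correlate .60.
print(round(effective_reliability(0.60, 2), 2))
```

For any positive correlation below 1, the combined estimate exceeds the single-judge correlation, which is consistent with the observation that effective reliability is generally better than that of either judge alone.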
The assertion “this measure is reliable” reflects a common misunderstanding about the nature of reliability. Reliability is a property of scores, not of measures, and it is therefore sample and context dependent. A test yielding good reliability estimates with one sample might yield poor reliability estimates with a sample from a different population, or even with the same sample under different testing conditions. The decision about whether to adopt an existing measure should therefore be based on more than a simple examination of score reliability in a few studies. Further, score reliability should be examined on each administration regardless of the measure’s empirical history (Thompson, 2002).
Reliability and validity are not necessarily related. Validity concerns the extent to which a measure actually measures what it is supposed to measure. While reliability does set a maximum limit on the correlation that can be observed between two measures, it is possible for scores to be perfectly reliable on a measure that has no validity or for scores to have no reliability on a measure that is perfectly valid. As an example of the latter, suppose that students in a pharmacology class are given a test of content knowledge at the beginning of the semester and at the end of the semester. Assume that on the pretest the students know nothing and therefore respond randomly, while on the posttest they all receive As. Estimating reliability via the split-half method or coefficient α on the pretest would suggest very little consistency, as would the test-retest method. The test itself, however, would seem to be measuring content taught during the course (indicating good validity).
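The ceiling that reliability places on an observed correlation can be sketched numerically. The article does not give the formula, so the sketch assumes the standard classical test theory result that the observed correlation between two measures cannot exceed the square root of the product of their score reliabilities. The function name and values are hypothetical.

```python
import math


def max_observable_correlation(reliability_x: float, reliability_y: float) -> float:
    """Upper bound on the observed correlation between two sets of scores."""
    for value in (reliability_x, reliability_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(reliability_x * reliability_y)


# Hypothetical reliabilities of .81 and .64 for the two measures.
print(round(max_observable_correlation(0.81, 0.64), 2))
```

If either set of scores has zero reliability, the bound is zero, so no correlation with another measure can be expected regardless of what the measure is intended to capture.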
Nunnally, Jum C., and Ira H. Bernstein. 1994. Psychometric Theory. 3rd ed. New York: McGraw-Hill.
Overall, John E. 1965. Reliability of Composite Ratings. Educational and Psychological Measurement 25: 1011-1022.
Shavelson, Richard J., and Noreen M. Webb. 1991. Generalizability Theory: A Primer. Newbury Park, CA: Sage Publications.
Thompson, Bruce, ed. 2002. Score Reliability: Contemporary Thinking on Reliability Issues. Newbury Park, CA: Sage Publications.
Jeffrey C. Valentine