# RELIABILITY

The reliability of a measured variable has two components, consistency and stability. *Consistency* is the degree to which two measures of the same concept provide the same assessment at the same time; consistency is based on "cross-sectional" research. *Stability* is the degree to which a measure of a concept remains unchanged across time; stability is based on "longitudinal" research. Let us illustrate consistency and stability on the measurement of height.

## THE MEASUREMENT OF HEIGHT

As an example, we often measure how tall people are. Height, how tall someone is, is a measure of distance. In order to measure distance, we establish an arbitrary standard. A common arbitrary standard for measuring distance is the "yardstick." The yardstick is 36 inches long, and is broken down into feet, inches, and fractions of inches. Another common measuring rod is the "meterstick." The meterstick is 100 centimeters long, and is broken down into decimeters, centimeters, and millimeters. If we know how tall someone is in inches, we can calculate how tall he or she is in centimeters, and vice versa. For example, rounding to two decimal places, a 70-inch-tall person is 177.80 centimeters tall (1 inch = 2.54 centimeters; 70 × 2.54 = 177.80). Conversely, rounding to two decimal places, if we know that someone is 160 centimeters tall, we also know that that person is 62.99 inches tall (1 centimeter ≈ 0.3937 inches; 160 × 0.3937 ≈ 62.99).
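The conversion arithmetic above can be sketched in a few lines. This is only an illustration; the function names are hypothetical. It uses the exact definition 1 inch = 2.54 centimeters in both directions, so 160 centimeters comes out at about 62.99 inches. A conversion factor rounded to 0.39 before multiplying would understate that result.

```python
CM_PER_INCH = 2.54  # exact by definition of the inch


def inches_to_cm(inches: float) -> float:
    """Convert a height in inches to centimeters."""
    return inches * CM_PER_INCH


def cm_to_inches(cm: float) -> float:
    """Convert a height in centimeters to inches."""
    return cm / CM_PER_INCH


print(round(inches_to_cm(70), 2))   # 70-inch person, in centimeters
print(round(cm_to_inches(160), 2))  # 160-centimeter person, in inches
```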

Indeed, the yardstick and the meterstick are highly consistent. With reasonable attention to proper measurement protocol, the correlation between height as measured by the yardstick and height as measured by the meterstick across a sample with sufficient variation in height would be very high. For all intents and purposes, the yardstick and the meterstick are interchangeable; the researcher need not establish their consistency. This leads to the *principle of consistency:*

If two measures of the same concept are perfectly consistent, they provide identical results. When this is so, the use of multiple measures is needlessly repetitive.

In this situation, the researcher need only use either the yardstick or the meterstick; using both sticks provides no additional information.

When babies are born, they are usually 18–24 inches "tall." Parents (and developmental researchers) often measure how tall babies and children are as time passes. This over-time height measurement is a stability assessment. Ordinarily, children grow a rough average of 3 inches per year for 15 years, resulting in most 15-year-olds being between 63 and 69 inches tall. Then female height growth stops while male height growth continues. By the time females are 18 years old, they average about 66 inches tall, while males of the same age average about 72 inches. Their heights then remain roughly stable for the remainder of their adult lives. This leads to the *principle of stability:*

A measure of a concept is perfectly stable when it provides identical results at different points in time. When this is so, repeated measurement over time is needlessly repetitive.

Height measurement from year to year provides useful information for children but not for adults. This is because the children grow taller with the passage of time, but adults do not. Elderly people who suffer from osteoporosis (loss of bone density) will become shorter, but this decline in height is slight, compared to their growth when they were children.

Let us now turn to a discussion of how the principles of consistency and stability apply to the measurement of sociological concepts. We will first discuss the protocols of good sociological measurement. Then we will discuss the implications of these protocols for the assessment of the consistency of self-esteem and the stability of alienation.

## PROTOCOLS OF GOOD MEASUREMENT

**The Measurement of Self-Esteem.** Researchers often attempt to measure how well people feel about themselves. Many decades ago, Charles Horton Cooley (1902) and George Herbert Mead (1934) theorized about the concept "self-esteem." In offering the "looking glass self," Cooley assumed that people develop a definition of themselves by evaluating what they believe others think of them. Mead differentiated between what a person actually is and what that person believes about himself or herself.

Rosenberg (1965) wished to measure self-esteem as conceptualized by Cooley and Mead. He did so by creating ten questionnaire items, each of which he believed would provide an empirical measure of the concept of self-esteem. His measurement attempt will be discussed in detail later in this paper. For now, let us assume that each of these items positively but imperfectly represents self-esteem. The positive representation implies that the concept "self-esteem" has a positive causal effect on each item. The imperfectness of the representation implies that there are other factors that also cause each item. Under this condition, none of these ten different measures of self-esteem was nearly as consistent as the yardstick and the meterstick. That is, the correlations among these ten questionnaire items were far from perfect. When this is so, the use of multiple measures is more consistent than the use of any single measure alone. Thus, the *converse principle of consistency:*

If multiple measures of the same concept provide imperfect assessments of the same concept, then the use of multiple measures is more consistent than the use of any single measure alone.

Commonly, items presumed to measure sociological concepts do so imperfectly. Therefore, sociological researchers often turn to the use of multiple items in social surveys as indexes to represent concepts. Combining their substantive knowledge of the literature with their clinical knowledge of people who exhibit various aspects of the concept, these researchers design items to represent each of these aspects. Then researchers are faced with the tasks of evaluating the consistency and stability of the items as measures of their respective concepts. In order to do this, good researchers employ a set of protocols of good measurement. We now turn to a discussion of these good measurement protocols.

Good measurement of sociological concepts satisfies the following criteria:

- Clear definitions of concepts.
- Multiple items.
- Clear definitions of items.
- Strong positive interitem correlation.
- Score construction.
- Known groups validity.
- Construct validity.
- Consistency.
- Stability.

These protocols represent the state of the art not only in sociology (Mueller 1997), but also in a wide variety of other scientific disciplines. A computer search of the literature revealed more than five hundred articles citing these protocols in the 1990s, including: political science (Jackman and Miller 1996); psychology (Hendrix and Schumm 1990); nursing research (Budd et al. 1997); the family (Grandin and Lupri 1997); sports and leisure (Riemer and Chelladurai 1998); computer information systems (McTavish 1997); management (Szulanski 1996); gerontology (Wright 1991); genetics (Tambs et al. 1995); social work (Moran et al. 1995); higher education (Aguirre et al. 1993); market research (Lam and Woo 1997); and preventive medicine (Saunders et al. 1997).

Let us briefly discuss each of these protocols in turn. Then we will focus the attention of the remainder of this paper on the two major focuses of reliability, consistency and stability.

*Clear Definitions of Concepts.* Good measurement protocol requires that each concept be clearly defined and clearly differentiated from every other concept. Good measurement protocol can document that an ambiguous concept is, indeed, ambiguous. Moreover, such protocol may suggest points of theoretical clarification. However, there is no substitute for clear theoretical thinking augmented by a thorough knowledge of the literature and a clinical immersion in the domain of content.

*Multiple Items.* Good measurement protocol requires that each aspect of a concept be assessed using multiple items. A single item, taken alone, suffers from measurement error. That is, the item is, in part, a representation of its respective concept. However, this same item may be a representation of other concepts, of systematic measurement error, and of random content. These other contents are called "error"; they reduce the degree to which the item accurately represents the concept it is designed to measure empirically. The rationale for the use of multiple items revolves around minimizing this measurement inaccuracy. That is, all items designed to measure a concept contain inaccuracies. If a single item is used to measure the concept, the researcher is, in essence, stuck with the specific inaccuracies of the single item. However, if multiple items designed to measure the same concept are used, the inaccuracies of one item may be offset by different inaccuracies of the other items.
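The offsetting of item-specific inaccuracies can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: each simulated item is the underlying concept plus independent random error, and the error size, sample size, and seed are arbitrary choices made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people=2000, n_items=10, error_sd=1.5, seed=0):
    """Return (r of one item with the concept, r of the summed score with the concept)."""
    rng = random.Random(seed)
    concept = [rng.gauss(0, 1) for _ in range(n_people)]
    # Assumption: every item reflects the concept plus its own independent error.
    items = [[c + rng.gauss(0, error_sd) for c in concept] for _ in range(n_items)]
    score = [sum(responses) for responses in zip(*items)]
    return pearson(concept, items[0]), pearson(concept, score)


single_r, score_r = simulate()
print(round(single_r, 2), round(score_r, 2))
```

Under these assumptions the summed score tracks the concept more closely than any single item does, because the independent errors partly cancel when the items are added.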

*Clear Definitions of Items.* Good measurement protocol requires that each item be designed to measure one and only one concept. The response categories should be constructed so that the higher the code of the response category, the more positive the respondent's attitude on that concept.

*Strong Positive Interitem Correlation.* When multiple items are designed to measure a single variable, the standard of the research industry has long been that the items should be coded in such a way that the higher the score, the more positive the empirical representation on the concept. Good measurement protocol requires strong positive intercorrelations among items designed to measure a concept. Ordinarily, these intercorrelations are presented in a correlation matrix. A visual inspection of the correlation matrix will be revealing. An item that correlates strongly (e.g., *r* > .4) with other items will generally emerge as a strong contributor to the reliability of the resulting score; an item that has a zero correlation with other items will not add to the reliability of the score; and an item that inversely correlates with other items (assuming that it has been coded such that the higher the score on the item, the higher the measure of the concept) will detract from the reliability of the score.
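As a rough sketch of the visual inspection described above, the following computes an interitem correlation matrix and each item's mean correlation with the other items. The three items and their responses are hypothetical, invented only to show one item that correlates inversely with the rest.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(items):
    """items maps an item name to its list of responses (one per respondent)."""
    names = list(items)
    return {a: {b: pearson(items[a], items[b]) for b in names} for a in names}


def mean_correlation_with_others(matrix, name):
    """Average correlation of one item with every other item."""
    others = [r for other, r in matrix[name].items() if other != name]
    return sum(others) / len(others)


# Hypothetical responses from six respondents; item_c runs against the others.
items = {
    "item_a": [1, 2, 3, 4, 5, 5],
    "item_b": [2, 2, 3, 4, 4, 5],
    "item_c": [5, 4, 3, 3, 2, 1],
}
matrix = correlation_matrix(items)
for name in items:
    print(name, round(mean_correlation_with_others(matrix, name), 2))
```

An item whose mean correlation with the others is negative, like the hypothetical `item_c`, is a candidate for recoding or removal.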

To the author's knowledge, the sole exception to this principle was articulated by Curtis and Jackson (1962, p. 199), who argued that "two equally valid indicators of the same concept may . . . be strongly related to one another, or they may be totally unrelated (or negatively related)." The difficulty with the Curtis and Jackson position is that it effectively invalidates the most powerful empirical argument that can be made for multiple items representing a single dimension: that of the equivalence established using convergence. Instead, the author would argue that if two items are unrelated or negatively related to one another, they represent different dimensions, reflect a method artifact, or both. For a more detailed discussion of this matter, see Zeller and Carmines (1980, pp. 77–136) or Carmines and Zeller (1979).

Factor analysis is the major statistical technique designed to describe a matrix of item intercorrelations. As such, factor analysis enables researchers to (1) describe a large number of items in terms of a small number of factors and (2) select those items which best represent each of the identified concepts (see Bohrnstedt 1970, p. 96; and Zeller and Carmines 1980, pp. 19–46). Items that have high factor loadings on a factor that represents a concept are then retained. These items are then used to construct a score to represent the concept.

In evaluating the degree to which a large set of items represents a small number of theoretical concepts, the application of factor analytic techniques is as much an art as it is a science. This is because there are numerous ambiguities in the measurement setting. The researcher defines one or more concepts and explores the degree to which the factors coincide with the hypothesized concepts. For each item, the researcher wants to know the degree to which it is a function of the concept it was designed to measure, other concepts, method artifacts, and random error.

*Score Construction.* Once the number of factors and which items define which factors has been established, the researcher needs to create scores. One score should be created to represent each concept empirically for each respondent. If the items defining a concept have roughly equal variances, the simplest way to create a score is to sum the items defining the concept. In practice, researchers can tolerate moderate variation in the item variances. For example, if the item variances for a set of Likert items range from, say, .8 to 1.4, summing the items seems to make the most sense. However, if the variation in the items is severe (say from .5 to 2.5), then the researcher should first create standardized scores using the following formula: *z* = (score − mean)/standard deviation. The standardized scores have equal variances (i.e., 1); the sum of these standardized scores will create each desired score.
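The score construction rule just described might be sketched as follows. The cutoff separating "moderate" from "severe" variation in item variances is not specified in the text, so the `max_variance_ratio` parameter and its default of 2.0 are assumptions made purely for illustration, as are the function names.

```python
import statistics


def z_scores(values):
    """Standardize one item: z = (score - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def build_scores(items, max_variance_ratio=2.0):
    """Sum items per respondent, standardizing first if variances differ severely.

    items: list of per-item response lists (one value per respondent).
    max_variance_ratio: hypothetical cutoff for "severe" variation in variances.
    """
    variances = [statistics.variance(column) for column in items]
    if max(variances) / min(variances) > max_variance_ratio:
        items = [z_scores(column) for column in items]
    return [sum(row) for row in zip(*items)]


# Equal variances: raw responses are summed.
print(build_scores([[1, 2, 3], [3, 2, 1]]))
# Severely unequal variances: items are standardized before summing.
print(build_scores([[1, 2, 3], [10, 20, 30]]))
```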

*Known Groups Validity.* Once scores have been constructed, comparisons of scores between groups known to be high and low on the dimensions of the concept should be made. Known groups validity is established if groups known to be high on the concept have substantially higher scores than groups known to be low on the concept.

*Construct Validity.* Construct validity is intimately related to theory testing. Construct validity involves (1) specifying theoretical relationships among concepts, (2) assessing empirical relationships among scores, and (3) interpreting how the evidence clarifies the validity of any particular measure. For more information on this concept, see Carmines and Zeller (1979, pp. 22–26).

*Consistency.* Good measurement protocol requires that the consistency among items designed to measure a concept should be strong. This means that any two items designed to measure the same concept should correlate positively and strongly. We will apply the principle of consistency to Rosenberg's attempt to consistently measure the concept of self-esteem.

*Stability.* Good measurement protocol requires that, if a concept does not change over time, a score designed to measure that concept also does not change over time. The trick of stability is that the researcher ordinarily does not know whether there is a change in the value of the concept over time. We will apply the principle of stability to the attempt by R. A. Zeller, A. G. Neal, and H. T. Groat (1980) to stably measure the concept of alienation.

When these protocols of good measurement are not followed, the researcher increases the risk of torpedoing the best of conceptual schemes and sentencing them to the intellectual trash heap, whether they belong there or not. High-tech statistical tools, such as structural equation modeling (SEM), make requirements that, by definition, are not present in the measurement development situation (Bollen 1989; Bollen and Long 1993; Hayduk 1987; and Hoyle 1995). That is, SEM requires both strong theory and strong measurement a priori. Indeed, SEM demands that the researcher know beforehand:

- How many factors there are.
- Which items represent which factors.

But these are precisely the major questions that the researcher wants to answer! The end result of the factor analysis should be that the researcher has inferred how many factors are represented by the items, and which items define which factors.

We now turn to a discussion of the consistency of self-esteem.

## CONSISTENCY OF SELF-ESTEEM

Good measurement protocol requires that the consistency be strong among items designed to measure each dimension of a concept. This means that any two items designed to measure the same concept should correlate positively and strongly. Often, however, different measures of the same concept have relatively modest positive intercorrelations.

In constructing the self-esteem scale, Rosenberg created ten items using the response categories "Never true," "Seldom true," "Sometimes true," "Often true," and "Almost always true." Five of these were positive items; these items made a positive statement about self-esteem. The other five were negative items; these items made a negative statement about self-esteem. The response categories for the positive items were assigned the values 1 through 5, respectively, such that the higher the score, the higher that respondent's self-esteem was inferred to be. These positively stated items were:

- "I feel that I have a number of good qualities."
- "I feel that I'm a person of worth, at least on an equal plane with others."
- "I take a positive attitude toward myself."
- "I am able to do things as well as most other people."
- "On the whole, I am satisfied with myself."

For the five negatively phrased items, stronger agreement indicated lower self-esteem. These items had the same response categories as above, but the assignment of values was reversed: the response categories were assigned the values 5 through 1, respectively. For example, a "Never true" response to the item "I wish I could have more respect for myself" was assigned a 5 and an "Almost always true" response to that item was assigned a 1. These five negatively stated items were:

- "I wish I could have more respect for myself."
- "I feel I do not have much to be proud of."
- "I certainly feel useless at times."
- "All in all, I'm inclined to feel that I am a failure."
- "At times I think I am no good at all."

Given the reverse scoring for these items, a higher score indicated higher self-esteem. In order to create a self-esteem scale, the scores were summed into a value that ranged from 10 representing the lowest measured self-esteem possible to 50 for the highest possible measured self-esteem.
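The scoring scheme described above can be sketched as follows. The item numbering (1 through 5 for the positive items, 6 through 10 for the negative items) is a hypothetical convention adopted for this illustration, not Rosenberg's own ordering.

```python
RESPONSES = [
    "Never true",
    "Seldom true",
    "Sometimes true",
    "Often true",
    "Almost always true",
]
# Hypothetical numbering: items 6-10 are the negatively stated items.
NEGATIVE_ITEMS = {6, 7, 8, 9, 10}


def item_value(response: str, negative: bool) -> int:
    """Code a response 1-5, reversing the values for negatively stated items."""
    value = RESPONSES.index(response) + 1
    return 6 - value if negative else value


def self_esteem_score(answers: dict) -> int:
    """Sum the coded values; answers maps item number -> response label."""
    return sum(
        item_value(response, item in NEGATIVE_ITEMS)
        for item, response in answers.items()
    )


highest = {i: "Never true" if i in NEGATIVE_ITEMS else "Almost always true"
           for i in range(1, 11)}
print(self_esteem_score(highest))
```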

How consistent are these items? We suggest the following as a *consistency rule of thumb* for a variable to be used in sociological research:

- If *r* is above .8, the score is highly consistent.
- If *r* is between .6 and .8, the score is modestly consistent.
- If *r* is less than .6, the score may not be used in research.
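For readers who want the rule of thumb in executable form, a minimal sketch follows. The function name is hypothetical, and treating the boundary values .6 and .8 as "modestly consistent" is an assumption, since the rule does not say how to classify them.

```python
def consistency_label(r: float) -> str:
    """Classify a consistency coefficient using the rule of thumb above."""
    if r > 0.8:
        return "highly consistent"
    if r >= 0.6:  # assumption: the boundaries .6 and .8 count as modest
        return "modestly consistent"
    return "not usable in research"


for r in (0.32, 0.68, 0.81):
    print(r, consistency_label(r))
```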

In the author's research (Zeller and Carmines 1976), interitem correlations among the ten Rosenberg items designed to measure self-esteem ranged from a low of .05 to a high of .58 with a mean *r* of .32. These intercorrelations do not meet this rule of thumb. When properly analyzed, however, they will. We now turn to a discussion of the strategy for this analysis that will address this criterion of consistency.

*Split-Half Consistency.* The "split-half" approach to estimating the consistency of items designed to measure a concept is to divide the items into two subscores and calculate the correlation between those subscores. For example, the ten items can be divided into two subscores of five items each. The resulting split-half correlation between the two subscores provides an estimate of consistency. If the average interitem correlation equals .3, dividing the ten items into two five-item subscores and summing the responses within each subscore would yield a split-half correlation of .68.

However, it is well known that, given items whose intercorrelations are equal (e.g., *r* = .3), the greater the number of items, the higher the consistency of a score resulting from those items. Thus, a ten-item score will have more consistency than a five-item score when both scores are made up of items that intercorrelate .3. The split-half reliability correlation, however, does not represent the ten-item score; it is the correlation between two subscores made up of five items each. Therefore, this split-half correlation will be lower than the actual consistency of the ten-item score.

Two researchers, Spearman (1910) and Brown (1910), independently recognized and solved this statistical estimation problem. Specifically, they noted that the split-half reliability correlation can be adjusted to project what the consistency of a ten-item score would have been if it had been calculated on the basis of two ten-item subscores instead of two five-item subscores. They shared attribution for this solution and called the result the *Spearman-Brown Prophecy*. It is presented in formula (1):

$$r_{xx''} = \frac{2\,r_{xx'}}{1 + r_{xx'}} \tag{1}$$

where $r_{xx''}$ is the Spearman-Brown projected consistency of the full score and $r_{xx'}$ is the split-half correlation.

Using the example from above, we can see that the Spearman-Brown Prophecy formula projects the consistency of the entire ten-item scale using formula (1) as follows:

$$r_{xx''} = \frac{2(.68)}{1 + .68} = \frac{1.36}{1.68} = .81$$

This .81 is an unbiased estimate of the consistency of the total score. Applying the above rule of thumb, such a scale is quite consistent and can be used in sociological research.
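The two steps of this example can be checked numerically. The sketch below assumes equally intercorrelated items with equal variances, under which the correlation between two *k*-item sums works out to *kr* / (1 + (*k* − 1)*r*); that expression is a derivation added here for illustration, and the function names are hypothetical.

```python
def split_half_r(items_per_half: int, mean_r: float) -> float:
    """Correlation between two summed halves of equally intercorrelated items.

    Assumes equal item variances and a common interitem correlation mean_r.
    """
    k = items_per_half
    return k * mean_r / (1 + (k - 1) * mean_r)


def spearman_brown(split_half: float) -> float:
    """Project a split-half correlation to the full-length score."""
    return 2 * split_half / (1 + split_half)


half_r = split_half_r(5, 0.3)
projected = spearman_brown(half_r)
print(round(half_r, 2), round(projected, 2))
```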

In actual research, intercorrelations among score items vary substantially. In the self-esteem example, item intercorrelations varied from .05 to .58. Moreover, the researcher must decide which items to assign to which subscales. One assignment of items to subscales will produce a different reliability estimate than another. When this occurs, the split-half reliability correlation between the two subscales is beset with the *problem of equivalence*: Which items are assigned to which subscales? We now turn to a way of handling variations in intercorrelations among items.

*Equivalence Consistency.* The researcher could assign the even-numbered items to one subscore and the odd-numbered items to the other; or items 1, 2, 3, 4, and 5 to one subscore and 6, 7, 8, 9, and 10 to the other; or items 1, 4, 5, 8, and 10 to one subscore and 2, 3, 6, 7, and 9 to the other. There are many combinations of assignments that could be made. Which one should the researcher use?
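To give a sense of how many assignments are possible, the sketch below enumerates every way of dividing ten items into two five-item subscores. It is an illustration only: it assumes an even number of items, and it fixes item 1 in the first half so that each split is counted once rather than twice.

```python
from itertools import combinations


def split_halves(n_items: int):
    """List every division of items 1..n_items into two equal halves."""
    items = range(1, n_items + 1)
    splits = []
    for first in combinations(items, n_items // 2):
        if 1 in first:  # avoid counting (A, B) and (B, A) as different splits
            second = tuple(i for i in items if i not in first)
            splits.append((first, second))
    return splits


splits = split_halves(10)
print(len(splits))  # number of distinct split-half assignments
```

For ten items this yields 126 distinct splits, each of which could produce a somewhat different split-half correlation.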

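Lee Cronbach (1951) solved this dilemma by creating *Cronbach's Alpha*. Cronbach's Alpha uses the average of all possible split-half reliability correlations that are Spearman-Brown projected to the number of items in the score. This is presented in formula (2):

$$\alpha = \frac{N\bar{r}}{1 + \bar{r}(N - 1)} \tag{2}$$

where $\alpha$ is Cronbach's Alpha, $N$ is the number of items, and $\bar{r}$ is the mean interitem correlation.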

Applying formula (2) to the ten-item score designed to measure self-esteem where the mean interitem correlation is .3, we get:

$$\alpha = \frac{(10)(.3)}{1 + (.3)(10 - 1)} = \frac{3}{3.7} = .81$$

Thus, Cronbach's Alpha produces the same value that we obtained when we calculated a split-half correlation and applied formula (1), the Spearman-Brown Prophecy formula. This occurred because all the items were, we assumed, equally correlated with each other.
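The agreement between the two approaches can be verified directly. This sketch computes alpha from the number of items and the mean interitem correlation and compares it with the Spearman-Brown projection of the split-half correlation. It assumes equal interitem correlations and equal item variances, and the function names are hypothetical.

```python
def cronbach_alpha(n_items: int, mean_r: float) -> float:
    """Cronbach's Alpha from the number of items and mean interitem correlation."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))


def spearman_brown(split_half: float) -> float:
    """Project a split-half correlation to the full-length score."""
    return 2 * split_half / (1 + split_half)


alpha_10 = cronbach_alpha(10, 0.3)
# With equal intercorrelations, the correlation between two five-item halves
# has the same algebraic form as alpha for five items.
split_half = cronbach_alpha(5, 0.3)
projected = spearman_brown(split_half)
print(round(alpha_10, 2), round(projected, 2))
```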

Both the number of items and the average interitem correlations influence the value of Cronbach's Alpha as follows:

- As the number of equally intercorrelated items increases, Cronbach's Alpha increases.
- As the average intercorrelation among the same number of items increases, Cronbach's Alpha increases.

We now turn to the implications of these two patterns:

*Number of Items.* The researcher often faces the question "How many items should I use to measure a concept?" The oversimplified answer to this question is, "More!" The more equally intercorrelated items a researcher uses to measure a concept, the higher the reliability will be.

The trick, of course, is that the items must be equally intercorrelated. In most research situations, items designed to measure a concept are not equally correlated. Some items will intercorrelate more strongly with the set of items than others. When this occurs, the researcher's judgment must be employed to decide how much of a reduction in interitem correlation offsets the increase in the number of items in the score. At a minimum, the researcher does not want to add an item which decreases the Cronbach's Alpha consistency of a scale. Standard computer software provides an option which allows the researcher to examine the Cronbach's Alpha if any item is removed from the score. When the alpha with the item removed is higher than the alpha when that item is included, there is consistency justification for the removal of that item from the scale.
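The "alpha if item removed" check can be sketched from raw responses. Note that this sketch uses the variance form of Cronbach's Alpha, α = [k / (k − 1)] × [1 − (sum of item variances) / (variance of the total score)], rather than the mean-correlation form used elsewhere in this discussion. The response data are hypothetical, and a real analysis would normally rely on a statistical package.

```python
import statistics


def cronbach_alpha(items):
    """Variance-form alpha; items is a list of per-item response lists."""
    k = len(items)
    item_variance_sum = sum(statistics.variance(column) for column in items)
    totals = [sum(row) for row in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))


def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn (needs 3+ items)."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Hypothetical data: five respondents, four items; the last item runs
# against the other three.
items = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
    [5, 4, 3, 2, 1],
]
overall = cronbach_alpha(items)
deleted = alpha_if_deleted(items)
print(round(overall, 2), [round(a, 2) for a in deleted])
```

In this hypothetical example, alpha rises sharply when the fourth item is dropped, which is the pattern that would justify removing an item on consistency grounds.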

This question can be posed in terms of how many items the researcher needs to meet specific alpha reliability thresholds given the mean interitem correlations. Table 1 addresses this concern. In Table 1, three alpha reliability thresholds (.7, .8, and .9) and eight mean interitem correlations (.1 through .8) are specified. We then solved formula (2) algebraically for the number of items needed to achieve each threshold, given each mean interitem correlation, using formula (3):

$$N = \frac{\alpha(1 - \bar{r})}{\bar{r}(1 - \alpha)} \tag{3}$$

where fractional values of $N$ are rounded up to the next whole item.

Table 1. Number of items needed for various alphas with various mean correlations

| Mean *r* | Cronbach's Alpha = .7 | Cronbach's Alpha = .8 | Cronbach's Alpha = .9 |
|---|---|---|---|
| .1 | 21 | 36 | 81 |
| .2 | 10 | 16 | 36 |
| .3 | 6 | 10 | 21 |
| .4 | 4 | 6 | 14 |
| .5 | 3 | 4 | 9 |
| .6 | 2 | 3 | 6 |
| .7 | 1 | 2 | 4 |
| .8 | 1 | 1 | 3 |

Using formula (3), the body of Table 1 presents the number of items needed for each Cronbach's Alpha threshold for each mean interitem correlation. For example, if the mean item intercorrelation is .2, sixteen items will be needed in order to achieve a Cronbach's Alpha of .8.

An examination of Table 1 reveals that when the mean interitem correlation is equal to .5, only three items are needed for an alpha of .7, four items for an alpha of .8, and nine items for an alpha of .9. If the mean interitem correlation is .3, six, ten, and twenty-one items are needed for alphas of .7, .8, and .9, respectively. Moreover, if the mean interitem correlation is .1, twenty-one, thirty-six, and eighty-one items are needed for alphas of .7, .8, and .9, respectively.
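The body of Table 1 can be regenerated by solving the alpha formula for the number of items, *N* = α(1 − *r*) / [*r*(1 − α)], and rounding up to the next whole item. The sketch below does this; the function name is hypothetical, and the small rounding step before taking the ceiling is there only to guard against floating-point noise.

```python
import math


def items_needed(target_alpha: float, mean_r: float) -> int:
    """Smallest number of items reaching target_alpha at a given mean r."""
    exact = target_alpha * (1 - mean_r) / (mean_r * (1 - target_alpha))
    # Round away floating-point noise (e.g., 21.000000000000004) before ceiling.
    return math.ceil(round(exact, 9))


for mean_r in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    row = [items_needed(alpha, mean_r) for alpha in (0.7, 0.8, 0.9)]
    print(mean_r, row)
```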

Thus, weak interitem correlations can be used to achieve consistency thresholds when many items are used. This is what ordinarily occurs in academic achievement tests. An exam of eighty-one items with a mean interitem correlation of .1 reaches the highly consistent .9 alpha; and an exam of only thirty-six items with a mean interitem correlation of .1 is a reasonably consistent .8 alpha. At the same time, strong interitem correlations reach these thresholds with a small number of items. This harkens back to the observation that if two measures correlate strongly, the researcher merely picks the most convenient measure and uses it with little concern for consistency reliability.

However, the law of diminishing returns suggests that at some point, additional items with the same average intercorrelation with other items will not provide sufficient value in exchange for the effort to be expended to include those additional items. When the number of items is small, an additional equally correlated item adds substantial enough value to the reliability of the score to warrant the effort needed to include it.

Table 2 presents Cronbach's Alphas for various numbers of items with various mean interitem correlations. An examination of Table 2 illustrates the law of diminishing returns. When the mean interitem correlation is .9, the alpha is .98 with five items; adding additional items does not, indeed cannot, increase the consistency much. This is so because the maximum consistency is a perfect 1.0. When the mean interitem correlation is .3, the alpha of .68 with five items is only marginally consistent. However, the alpha increases to an acceptable .81 when ten items are used and to a highly consistent .9 when twenty items are used. Finally, the alpha for five items with a mean interitem correlation of .1 is .37. In order for a score made up of such items to be adequately consistent, the number of such items must be increased substantially.
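The diminishing returns described above can be made concrete by computing how much alpha rises when one more equally correlated item is added. This is a sketch using the mean-correlation form of alpha; the function names are hypothetical.

```python
def cronbach_alpha(n_items: int, mean_r: float) -> float:
    """Cronbach's Alpha from the number of items and mean interitem correlation."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))


def gain_from_one_more_item(n_items: int, mean_r: float) -> float:
    """Increase in alpha from adding one more equally correlated item."""
    return cronbach_alpha(n_items + 1, mean_r) - cronbach_alpha(n_items, mean_r)


for n in (5, 10, 20, 50):
    print(n, round(cronbach_alpha(n, 0.3), 2),
          round(gain_from_one_more_item(n, 0.3), 3))
```

With a mean interitem correlation of .3, the gain from a sixth item is much larger than the gain from a twenty-first or a fifty-first item.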

Table 2. Cronbach's Alpha for various numbers of items with various mean correlations

| Mean *r* | 5 items | 10 items | 20 items | 30 items | 50 items |
|---|---|---|---|---|---|
| .1 | .37 | .53 | .69 | .77 | .850 |
| .2 | .56 | .71 | .83 | .88 | .930 |
| .3 | .68 | .81 | .90 | .93 | .960 |
| .4 | .77 | .87 | .93 | .95 | .970 |
| .5 | .83 | .91 | .95 | .97 | .980 |
| .6 | .88 | .94 | .97 | .98 | .990 |
| .7 | .92 | .96 | .98 | .99 | .990 |
| .8 | .95 | .98 | .987 | .992 | .995 |
| .9 | .98 | .99 | .994 | .996 | .998 |

Cronbach's Alpha can be calculated using formula (2) above. Standard statistical computer software packages can also be used for this purpose. However, care must be taken in using these packages to assure that all the items and only the items that define a specific score be included in the calculations. Indeed, the attentive researcher will want to produce the Cronbach's Alpha by hand, using formula (2), and by computer. When these two measures are identical, the researcher can take comfort that both are likely to have been done properly.

As a postscript on this discussion, we note that the Cronbach's Alpha consistency of Rosenberg's ten-item self-esteem score calculated on the data presented in Zeller and Carmines (1980, p. 92) was equal to a reasonably consistent .83. More advanced procedures which take into account which items are more highly correlated with the total score, such as theta and omega, have been omitted from this discussion. For a discussion of theta and omega, see Carmines and Zeller (1979, pp. 60–62) or Zeller and Carmines (1980, pp. 60–63). Rosenberg's self-esteem scale continues to attract academic attention (e.g., Gray-Little et al. 1997).

## STABILITY OF ALIENATION

**The Measurement of Alienation.** The concept of alienation is one of the major "unit ideas" of sociology (Nisbet 1966). But the concept is so imbued with different meanings that some have come to question its usefulness as a sociological concept (Lee 1972). Seeman (1959) believed that the conceptual confusion surrounding the study of alienation can be addressed by construing it as multidimensional. Neal and Rettig (1967) have operationalized Seeman's original conceptualizations. Following the protocols of good measurement described above, Neal and Groat (1974) theoretically defined and empirically confirmed powerlessness, normlessness, meaninglessness, and social isolation as the four dimensions of alienation. Specifically, they constructed items designed to measure each of the four dimensions of alienation, gathered data, conducted factor analyses, noted that the observed factor structure coincided with the conceptual dimensions, created factor-based scores, and conducted substantive analyses.

R. A. Zeller, A. G. Neal, and H. T. Groat (1980) conducted a consistency and stability analysis. Data on the same sample in 1963 and 1971 revealed that reliabilities ranged from .64 to .83 in 1963 and from .65 to .88 in 1971. The authors needed accurate consistency estimates because they wished to minimize the correction for attenuation. Correction for attenuation will be discussed shortly. Zeller and colleagues wished to describe the amount of stability in the dimensions of alienation over the turbulent years between 1963 and 1971. Specifically, they wished to assess the degree to which those who had high levels of alienation in 1963 would also have high levels of alienation in 1971. In order to do so, they created scores for each of the four dimensions of alienation in both 1963 and 1971. For each score, the correlation between the 1963 and the 1971 value represented the "stability" of that dimension over that time period. High correlations would suggest substantial stability in which respondents were alienated over that time period; low correlations would suggest substantial change.

**Correction for Attenuation Due to Measurement Inconsistency.** In order to assess the stability of the dimensions of alienation over time, Zeller et al. (1980) calculated correlation coefficients between the scale scores for each dimension of alienation. They found stability coefficients ranging from .40 for normlessness to .53 for social isolation. It would appear that there was substantial stability over the eight years under investigation. Before we jump to any conclusions, however, we must consider that measurement inconsistency attenuates (i.e., reduces) the observed correlation from what it would have been if each concept had been perfectly measured at each time point. That is, they needed to correct their stability correlations for measurement inconsistency. Formula (4) presents the correction for attenuation:

$$r_{x_t y_t} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}} \tag{4}$$

where $r_{x_t y_t}$ is the correlation corrected for attenuation, $r_{xy}$ is the observed correlation between the two scores, and $r_{xx}$ and $r_{yy}$ are the consistency estimates of the scores at the two time points.

Let us apply the correction for attenuation to Zeller and his colleagues' meaninglessness score. The meaninglessness stability correlation = .52; meaninglessness had an omega consistency of .64 in 1963 and of .65 in 1971. Substituting these estimates into formula (4), we get:
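The substitution can be sketched in a few lines of Python. This is a minimal illustration that assumes formula (4) is the standard correction (observed correlation divided by the square root of the product of the two consistency estimates); the function name `correct_for_attenuation` is hypothetical.

```python
import math


def correct_for_attenuation(r_observed, rel_1, rel_2):
    """Divide an observed correlation by the square root of the product
    of the two reliability (consistency) estimates."""
    if not (0 < rel_1 <= 1 and 0 < rel_2 <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(rel_1 * rel_2)


# Values quoted in the text for the meaninglessness score:
# stability correlation .52, consistency .64 (1963) and .65 (1971).
corrected = correct_for_attenuation(0.52, 0.64, 0.65)
print(round(corrected, 2))  # prints 0.81
```

Under this form, the corrected stability of meaninglessness is about .81, noticeably higher than the observed .52.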

Similar analyses were conducted on the other dimensions of alienation.

This analysis led Zeller and colleagues (1980, pp. 1202–1203) to conclude that their data "indicate substantial stability in the dimensions of alienation over an eight-year period." They believe that their longitudinal data "have provided evidence to suggest that operationalizing dimensions of alienation is not only feasible, but may be accomplished with a high degree of confidence in the (consistency) reliability of the measuring instruments. The obtained stability of alienation scores over a long period of time lends credence to the search for the causal, antecedent conditions."

**Method Artifacts in Longitudinal Research.** There are several method artifacts that can artificially attenuate or inflate the estimation of stability. As noted above, score inconsistency attenuates the stability estimate. Memory tends to inflate the stability estimate. That is, if, at time 2, respondents remember what they answered at time 1 and wish to present themselves as being stable in their answers, they will make the same response to the item at time 2 that they made at time 1. We do not believe that this "memory effect" operated to any great degree in the analysis by Zeller and colleagues, because we doubt that respondents would remember their specific response to a specific questionnaire item for eight years. However, when the interval between time 1 and time 2 is relatively short, memory becomes a problem.

A conventional wisdom in stability research is that the interval of time that elapses between time 1 and time 2 should be long enough that respondents will not remember their specific answers to specific items, but short enough that very little change (i.e., instability) takes place in the interim. We believe, on the contrary, that part of what we wish to estimate in stability research is how much change actually takes place. Given our perspective, it does not matter how much time elapses between time 1 and time 2.

Still, the threat of artifactual deflations and inflations to the stability estimate is real. Consider the effect of item-specific variance. The respondent may answer an item in a "stable" fashion over time not because of the stability of the concept it measures, but because of some idiosyncratic nuance of the item. Idiosyncratic nuances of items unrelated to the concept the item is designed to measure are systematic, not random, error. As such, idiosyncratic nuances of items threaten to inflate the stability estimate. We now turn to a statistically advanced discussion of the identification and removal of item specific variance from stability estimation. This section requires a working knowledge of path analysis as described in Asher [(1976), 1983].

## COMBINING CONSISTENCY AND STABILITY INTO A MEASUREMENT MODEL

The path model presented in Figure 1 combines consistency and stability into a measurement path model. In this measurement model, *X*1 and *X*2 represent the value of the concept at time 1 and time 2; *P*21 is the theoretical causal path from *X*1 to *X*2; it represents stability, the theoretical effect of *X*1 on *X*2. This and the other effects in this model can be thought of as path coefficients. The *x*ij represent the observations; specifically, *x*21 is item 2 at time 1; *x*32 is item 3 at time 2. The *p*ij are the epistemic correlations, the effect of the concept on each respective measure; specifically, *p*21 is the effect of *X*1 on item 2 at time 1; *p*32 is the effect of *X*2 on item 3 at time 2. The *p*ijt are the item-specific effects over time, the effects of an item at time 1 on that same item at time 2; specifically, *p*11t is the effect of *x*11 on *x*12 over and above the effect mediated through the concept. For a more complete discussion of epistemic correlations, see Blalock (1969).

Figure 2 presents this consistency and stability measurement model where the stability effect is *P*21 = .8, the epistemic correlations are *p*ij = .7, and the item-specific effects are *p*ijt = .3. These are approximately the effect parameters for the meaninglessness measurement over time in Zeller and colleagues (1980).

Table 3 presents the correlation matrix that results from applying the rules of path analysis [Asher (1976) 1983] to Figure 2. Specifically, within the measurement model, the correlation between *x*11 and *x*21 is equal to the product of the path from *X*1 to *x*11 times the path from *X*1 to *x*21. That is, *r* = (*p*11)(*p*21) = (.7)(.7) = .49. In the same way, all the time 1 measures intercorrelate .49; all the time 2 measures correlate .49.

The correlation between the time 1 and time 2 items must differentiate between the correlations between different items over time and the correlations between the same item over time. Let us first address correlations between different items over time.

The correlation between *x*11 (item 1 at time 1) and *x*22 (item 2 at time 2) is equal to the product of the path from *X*1 to *x*11 times the path from *X*1 to *X*2 times the path from *X*2 to *x*22. That is, *r* = (*p*11)(*P*21)(*p*22) = (.7)(.8)(.7) = .39. In the same way, all the time 1–time 2 correlations *among different items* are .39.

The correlation between *x*11 (item 1 at time 1) and *x*12 (item 1 at time 2) is equal to the product of the path from *X*1 to *x*11 times the path from *X*1 to *X*2 times the path from *X*2 to *x*12, plus *p*11t. That is, *r* = (*p*11)(*P*21)(*p*12) + *p*11t = (.7)(.8)(.7) + .3 = .69. In the same way, all the time 1–time 2 correlations *for the same item* are .69.
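The three tracing rules just described (same wave, different items across waves, same item across waves) can be collected into one small routine. The sketch below is illustrative only: the function name `implied_correlations` and its parameter names are hypothetical, and it assumes every item has the same epistemic correlation and the same item-specific effect, as in Figure 2.

```python
def implied_correlations(stability, loading, item_specific, n_items):
    """Return the (2 * n_items) x (2 * n_items) correlation matrix implied by
    the two-wave model, ordered as time 1 items followed by time 2 items."""
    size = 2 * n_items
    matrix = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            same_wave = (i < n_items) == (j < n_items)
            if same_wave:
                # Two items share one concept: product of their loadings.
                r = loading * loading
            else:
                # Across waves the path runs item <- X1 -> X2 -> item.
                r = loading * stability * loading
                if i % n_items == j % n_items:
                    # Same item at both waves: add the item-specific path.
                    r += item_specific
            matrix[i][j] = r
    return matrix


table_3 = implied_correlations(stability=0.8, loading=0.7,
                               item_specific=0.3, n_items=4)
print(round(table_3[0][1], 3))  # x11 with x21: prints 0.49
print(round(table_3[0][5], 3))  # x11 with x22: prints 0.392
print(round(table_3[0][4], 3))  # x11 with x12: prints 0.692
```

The three printed values are the three distinct off-diagonal entries of the correlation matrix discussed in the text.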

Using formula 2, we can solve for Cronbach's alpha, at both time 1 and time 2, as follows:
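Assuming formula 2 is the usual expression of alpha in terms of the number of items $N$ and the mean inter-item correlation $\bar{r}$ (both symbols are labels chosen here), the computation for four items that intercorrelate .49 runs as follows:

$$\alpha = \frac{N\,\bar{r}}{1 + \bar{r}\,(N - 1)} = \frac{4(.49)}{1 + .49(3)} = \frac{1.96}{2.47} \approx .79$$

The same value applies at both time points because the within-time correlations are identical.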

Using the criteria described above, the score is modestly, and almost highly, consistent.

The correlation between two scores can be calculated using the correlation matrix with the following formula:
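One standard way to write the correlation between two unit-weighted sum scores of standardized items is given below, offered as a sketch of what formula 5 expresses. The symbols are labels chosen for illustration: $R_{11}$ and $R_{22}$ are the within-time correlation matrices, diagonals included, and $R_{12}$ is the block of time 1 by time 2 correlations.

$$r_{S_1 S_2} = \frac{\sum_{i}\sum_{j} [R_{12}]_{ij}}{\sqrt{\left(\sum_{i}\sum_{j} [R_{11}]_{ij}\right)\left(\sum_{i}\sum_{j} [R_{22}]_{ij}\right)}}$$

The numerator is the covariance between the two sums, and each term under the root is the variance of one sum.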

Applying formula 5 to the data in Table 3, we get:

and

Correcting this correlation for attenuation using formula 4, we get:

Thus, the stability coefficient is .96. But we specified this stability coefficient to be .8 in Figure 2! What is wrong with our procedures? Why did we overstate the stability of the model? We overstated the model's stability because we included the item specific effects as stability effects. That is, the observed correlations between the same item at time 1 and time 2 represented the effects of both stability and item-specific variance. We need to remove the item specific variance from our estimation of the stability coefficient.
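The overstatement can be reproduced numerically. The sketch below assumes that formula 5 is the sum-score correlation (sum of the cross-time block divided by the root of the product of the two within-time block sums) and that formula 2 is alpha computed from the mean inter-item correlation; the helper names are hypothetical, and the three constants are the distinct entries of Table 3.

```python
import math

N_ITEMS = 4
WITHIN = 0.49      # different items, same time point
SAME_ITEM = 0.692  # same item, time 1 with time 2
DIFF_ITEM = 0.392  # different items, time 1 with time 2

# Within-time block (diagonal of 1s included) and cross-time block.
within_block = [[1.0 if i == j else WITHIN for j in range(N_ITEMS)]
                for i in range(N_ITEMS)]
cross_block = [[SAME_ITEM if i == j else DIFF_ITEM for j in range(N_ITEMS)]
               for i in range(N_ITEMS)]


def block_sum(block):
    """Sum every entry of a square block of correlations."""
    return sum(sum(row) for row in block)


def standardized_alpha(block):
    """Alpha from the mean off-diagonal correlation of a within-time block."""
    n = len(block)
    mean_r = (block_sum(block) - n) / (n * (n - 1))
    return n * mean_r / (1 + mean_r * (n - 1))


# The time 1 and time 2 within-time blocks are identical in this model.
alpha = standardized_alpha(within_block)
r_scores = block_sum(cross_block) / math.sqrt(
    block_sum(within_block) * block_sum(within_block))
corrected = r_scores / math.sqrt(alpha * alpha)

print(round(alpha, 3), round(r_scores, 3), round(corrected, 3))
```

Carried at full precision, the corrected value is about .953; rounding the score correlation and alpha to two decimals first (.76 and .79) gives the .96 quoted in the text. Either way it is well above the .8 built into the model.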

We can estimate the item-specific effects by subtracting the mean of the correlations of different items at time 1 compared to time 2 (mean *r* = .39) from the mean of the correlations of the same item at time 1 compared to time 2 (mean *r* = .69). Then we use only the variance that is not item-specific in the same item correlations across time (*r* = .69 − .30 = .39) as our estimate of what these correlations would have been if there had been no item-specific variance.

**Table 3.** Correlation Matrix among Four Measures at Two Points in Time

| Item | x11 | x21 | x31 | x41 | x12 | x22 | x32 | x42 |
|---|---|---|---|---|---|---|---|---|
| x11 | — | .490 | .490 | .490 | .692 | .392 | .392 | .392 |
| x21 | | — | .490 | .490 | .392 | .692 | .392 | .392 |
| x31 | | | — | .490 | .392 | .392 | .692 | .392 |
| x41 | | | | — | .392 | .392 | .392 | .692 |
| x12 | | | | | — | .490 | .490 | .490 |
| x22 | | | | | | — | .490 | .490 |
| x32 | | | | | | | — | .490 |
| x42 | | | | | | | | — |

We now reapply formula 5 to the adjusted data in Table 3 to get:

and

Correcting this correlation for attenuation using formula 4, we get:

Thus, the stability coefficient corrected for the removal of item specific variance is .80; despite rounding, this is equal to the .8 which was the stability coefficient specified in Figure 2.
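The removal step can be sketched as follows. This is an illustration under stated assumptions: the item-specific effect is estimated as the mean same-item correlation minus the mean different-item correlation across time, the score correlation and alpha take their usual sum-score forms, and the function name `remove_item_specific` is hypothetical.

```python
import math

N_ITEMS = 4
WITHIN = 0.49  # within-time correlation between different items
# Cross-time block of Table 3: same-item correlations sit on the diagonal.
cross_block = [[0.692 if i == j else 0.392 for j in range(N_ITEMS)]
               for i in range(N_ITEMS)]


def remove_item_specific(cross):
    """Subtract the estimated item-specific effect from the same-item
    (diagonal) entries of a time 1 by time 2 correlation block."""
    n = len(cross)
    same = sum(cross[i][i] for i in range(n)) / n
    diff_total = sum(cross[i][j] for i in range(n) for j in range(n) if i != j)
    diff = diff_total / (n * (n - 1))
    item_specific = same - diff
    adjusted = [[cross[i][j] - item_specific if i == j else cross[i][j]
                 for j in range(n)] for i in range(n)]
    return adjusted, item_specific


adjusted_block, item_specific = remove_item_specific(cross_block)

# Variance of a four-item sum score and its alpha; both time points share
# the same within-time correlations, so one value serves for each.
score_variance = N_ITEMS + N_ITEMS * (N_ITEMS - 1) * WITHIN
alpha = N_ITEMS * WITHIN / (1 + WITHIN * (N_ITEMS - 1))

r_adjusted = sum(sum(row) for row in adjusted_block) / score_variance
stability = r_adjusted / math.sqrt(alpha * alpha)
print(round(item_specific, 2), round(r_adjusted, 3), round(stability, 2))
```

With the Table 3 values this gives an item-specific estimate of about .30, an adjusted score correlation of about .63, and a corrected stability of .80, the value specified in Figure 2.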

Estimating the stability of concepts measured by scores across time is complex. A simple correlation between a measure at time 1 and the same measure at time 2 is subject to a variety of influences. First, this stability coefficient is attenuated due to inconsistency of the item. We can address this by using multiple measures. Multiple measures allow us to estimate the consistency of the score and to correct the stability coefficient for the attenuation that occurs because the score is not perfectly consistent.

Second, the stability coefficient is artificially inflated because of item-specific variance. We can address this by estimating the size of the item-specific variance and removing it from the correlation matrix. Then we can correlate the score at time 1 and time 2 on the correlation matrix (with the item specific variance having been removed). This correlation, corrected for attenuation, is an unbiased estimate of the actual stability. For the past twenty years, the author has been searching in vain for someone who will solve for the model presented in Figure 2 from the correlation matrix presented in Table 3 using SEM techniques. Many have claimed to be able to do so, but so far, to my knowledge, no one has succeeded in doing so.

## CONCLUSION

Thirty years ago, Hauser (1969, pp. 127–128) noted that "it is inadequate measurement, more than inadequate concept or hypothesis, that has plagued social researchers and prevented fuller explanations of the variances with which they are confounded." We have come a long way since then. The scientific community has given greater attention to the measurement properties of the variables we use. Our capacity to conduct numerous alternative data analyses on large and well-documented data sets has been substantially enhanced. At the same time, nature is jealous of her secrets and there are many land mines buried along the paths we need to follow (or blaze) in order to make sense of our social scene. Moreover, there are many who seek shortcuts to sociological knowledge. Hence, we continue to address the challenges of establishing the consistency and stability of our measures.

### references

Aguirre, A., R. Martinez, and A. Hernandez 1993 "Majority and Minority Faculty Perceptions in Academe." *Research in Higher Education* 34(3):371–385.

Asher, H. B. (1976) 1983 *Causal Modeling*. Beverly Hills, Calif.: Sage.

Blalock, H. M. 1969 "Multiple Indicators and the Causal Approach to Measurement Error." *American Journal of Sociology* 75:264–272.

Bohrnstedt, G. W. 1970 "Reliability and Validity Assessment in Attitude Measurement." In G. F. Summers, ed., *Attitude Measurement*. Chicago: Rand McNally.

Bollen, K. 1989 *Structural Equations with Latent Variables*. New York: John Wiley.

——, and S. Long 1993 *Testing Structural Models*. Newbury Park, Calif.: Sage.

Brown, W. 1910 "Some Experimental Results in the Correlation of Mental Abilities." *British Journal of Psychology* 3:269–322.

Budd, K. W., D. Ross-Alaolmolki, and R. A. Zeller 1997 "Psychometric Analysis of the Self-Coherence Survey." *Archives of Psychiatric Nursing* 11(5):276–281.

Carmines, E. G., and R. A. Zeller 1979 *Reliability and Validity Assessment*. Beverly Hills, Calif.: Sage.

Cooley, C. H. 1902 *Human Nature and the Social Order*. New York: Scribner's.

Cronbach, L. J. 1951 "Coefficient Alpha and the Internal Structure of Tests." *Psychometrika* 16:297–334.

Curtis, R. F., and E. F. Jackson 1962 "Multiple Indicators in Survey Research." *American Journal of Sociology* 68:195–204.

Fredman, L., M. P. Daly, and A. M. Lazur 1995 "Burden among White and Black Caregivers to Elderly Adults." *Journals of Gerontology Series-B-Psychological Sciences and Social Sciences* 50(2):S110–S118.

Grandin, E., and E. Lupri 1997 "Intimate Violence in Canada and the United States." *Journal of Family Violence* 12(4):417–443.

Gray-Little, B., V. S. Williams, and T. D. Hancock 1997 "An Item Response Theory Analysis of the Rosenberg Self-Esteem Scale." *Personality and Social Psychology Bulletin* 23(5):443–451.

Hauser, P. 1969 "Comments on Coleman's Paper." Pp. 122–128 in R. Bierstedt, ed., *A Design for Sociology: Scope, Objectives, and Methods*. Philadelphia: American Academy of Political and Social Science.

Hayduk, L. 1987 *Structural Equation Modeling with LISREL*. Baltimore, Md.: Johns Hopkins University Press.

Hendrix, C., and W. Schumm 1990 "Reliability and Validity of Abusive Violence Scale." *Psychological Reports* 66(3):1251–1258.

Hoyle, R. 1995 *Structural Equation Modeling: Concepts, Issues, and Applications*. Newbury Park, Calif.: Sage.

Jackman, R. W., and R. A. Miller 1996 "A Renaissance of Political Culture?" *American Journal of Political Science* 40(3):632–659.

Lam, S. S., and K. S. Woo 1997 "Measuring Service Quality." *Journal of Marketing Research* 39(2):381–396.

Lee, A. M. 1972 "An Obituary for Alienation." *Social Problems* 20(1):121–127.

McTavish, D. G. 1997 "Scale Validity—A Computer Content Analysis Approach." *Social Science Computer Review* 15(4):379–393.

Mead, G. H. 1934 *Mind, Self, and Society*. Chicago: University of Chicago Press.

Moran, J. R., D. Frans, and P. A. Gibson 1995 "A Comparison of Beginning MSW and MBA Students on Their Aptitudes for Human-Service Management." *Journal of Social Work Education* 31(1):95–105.

Mueller, C. 1997 "International Press Coverage of East German Protest Events, 1989." *American Sociological Review* 62(5):820–832.

Neal, A. G., and H. T. Groat 1974 "Social Class Correlates of Stability and Change in Levels of Alienation: A Longitudinal Study." *Sociological Quarterly* 15(4):548–558.

——, and S. Rettig 1967 "On the Multidimensionality of Alienation." *American Sociological Review* 32(1):54–64.

Nisbet, R. A. 1966 *The Sociological Tradition*. New York: Basic Books.

Riemer, H. A., and P. Chelladurai 1998 "Development of the Athlete Satisfaction Questionnaire." *Journal of Sport and Exercise Psychology* 20(2):127–156.

Rosenberg, M. 1965 *Society and the Adolescent Self Image*. Princeton, N.J.: Princeton University Press.

Seeman, M. 1959 "On the Meaning of Alienation." *American Sociological Review* 24:772–782.

Spearman, C. 1910 "Correlation Calculated from Faulty Data." *British Journal of Psychology* 3:271–295.

Szulanski, G. 1996 "Exploring Internal Stickiness." *Strategic Management Journal* 17:27–43.

Tambs, D., J. R. Harris, and P. Magnus 1995 "Sex-Specific Causal Factors and Effects of Common Environment for Symptoms of Anxiety and Depression in Twins." *Behavior Genetics* 25(1):33–44.

Wright, L. K. 1991 "The Impact of Alzheimer's Disease on the Marital Relationship." *Gerontologist* 31(2):224–237.

Zeller, R. A., and E. G. Carmines 1980 *Measurement in the Social Sciences: The Link between Theory and Data*. New York: Cambridge University Press.

—— 1976 "Factor Scaling, External Consistency, and the Measurement of Theoretical Constructs." *Political Methodology* 3:215–252.

Zeller, R. A., A. G. Neal, and H. T. Groat 1980 "On the Reliability and Stability of Alienation Measures: A Longitudinal Analysis." *Social Forces* 58:1195–1204.

Richard A. Zeller

## Reliability


The term *reliability* can be used to indicate a virtue in a person, a feature of scientific knowledge, or the quality of a product, process, or system. Personal unreliability makes an individual difficult to trust. Unreliability in science calls the scientific enterprise into question. Lack of reliability in technology or engineering undermines utility and public confidence and perhaps commercial success. In all cases the pursuit of reliability is a conscious goal.

## Scientific Reliability as Replication

Reliability in science takes its primary form as replicability. Experiments and other research must be performed and then communicated in such a way that they can be replicated by others; otherwise the results cannot become part of the edifice of science. Both replicability in principle and actual replication by diverse members of the scientific community are central to the processes of science that make the knowledge produced by science uniquely reliable and able to be trusted both within the community and by nonscientists.

Replication is easier to achieve in some scientific domains than in others, but when it fails, the science is judged unreliable. Historically replication was established first in physics and chemistry, and so in the physical sciences especially lack of replicability can become newsworthy. For example, the inability of other scientists to replicate the experiments on which Stanley Pons and Martin Fleischmann based their announcement of the discovery of cold fusion in 1989 doomed the credibility of their claims.

As Harry Collins and Trevor Pinch (1998) have shown in case studies, the replication of particular experiments often depends on the phenomenon of "golden hands." Not all experimenters are equally skilled at setting up and performing experiments, and subtle differences can be more relevant than it is possible to articulate clearly in the methods section of a research article.

In science another version of replicability is associated with peer review. Peer review procedures for scientific publication and for decision making about grants in effect depend on two or more persons coming to the same conclusion about the value of a report or proposal. Assessments must be replicated among independent professionals to support reliable decisions. Several evaluations of the peer review process in various disciplines have been performed (Peters and Ceci 1982). Many of those reports suggest that the system is unreliable because reviewers often fail to agree on the quality of a scientific article. Unreliability in this process undermines the internal quality controls of science, thus hampering progress. It also raises epistemological questions about the constitution of truth.

For instance, even if two reviewers judge a paper to be of high quality, both may be mistaken because they failed to spot a statistical error. In this sense reliability (agreement between reviewers) does not constitute validity (internal consistency or the absence of obvious errors of logic) (Wood, Roberts, and Howell 2004). However, on another level the negotiation of scientific claims within the scientific community is an integral part of determining what is true. Thus, in this sense reliability is a way of making or legitimating truth claims. These issues are made more complex by the role of editors in synthesizing disparate claims by reviewers and the question of whether reliability can be assessed by the metric of agreement between reviewers.

Another example of the issue of replicability in science is associated with the development of the *Diagnostic and Statistical Manual of Mental Disorders (DSM)* in psychiatry. Before this compendium of standardized descriptions of mental disorders was published, diagnoses of psychological illnesses lacked reliability. For example, if three physicians independently saw a patient with a psychological illness, it was unlikely that they would make the same diagnosis. Indeed, this remained the case through the publication of the original *DSM* in 1952 and *DSM-II* in 1968. It was only with the increasing detail and sophistication of *DSM-III,* published in 1980, that the psychiatric community began to achieve a significant measure of reliability in its diagnostic practices and psychiatry became more respected as a science.

This case suggests the connection between reliability and professionalization (the formation of a specialized academic discipline) because replicability was made possible only after a community of practitioners developed a shared conceptual language and a methodology that were sufficiently nuanced to communicate and establish likes as likes. Reliability as a way of establishing truth through replication thus is a product of both material reality and the way peers conceptualize the world and are able to replicate that conceptualization among themselves.

## Functional Reliability in Engineering

Engineering or technological reliability is the probability that a product, process, or system will perform as intended or expected. Issues include the expected level of reliability, the cost-benefit trade-offs in improving reliability, and the consequences of failure. When these issues involve persons other than those inventing or tinkering with the relevant products, processes, or systems, with consequences for public safety, health, or welfare, ethical issues become prominent. Just as in science, reliability, in this case in the form of functional reliability, is a precondition for the integration of a particular technological device into the accepted or trusted edifice of the built environment.

Any technological product, process, or system is designed to perform one or more specified functions. In principle, the performance of the system can be defined mathematically and the demands placed on the system can be specified. Because uncertainties are associated with all aspects of systems in the real world, these descriptors should be defined in terms of uncertainties and reliability should be computed as the probability of intended performance. Because most systems have effects beyond their stated output (radiation, accidents, behavior modification, etc.), a comprehensive model must include all possible outcomes. Because complicated models all are based on extrapolations of the basic principles, the fundamental concepts are described in this entry.

The demands placed on a system include environmental and operational loads, which for simplicity will be designated here as a single demand, *D.* The capacity of the system to absorb those loads and perform its function is designated *C,* for capacity. The satisfactory operation of the system simply entails that the capacity be at least as large as the demand. This is expressed mathematically as $S = C - D \geq 0$, in which *S* represents satisfactory performance. In probability terms this becomes $P(S) = P(C - D \geq 0)$.

Each of these basic quantities can be described probabilistically by its probability function: $F_D(d)$ for demand and $F_C(c)$ for capacity. It is usually a safe assumption that the capacity (a function of the physical system) and the demand (a function of the operating environment) are statistically independent. In this case the reliability of the system is given by

$$P(S) = \int_{-\infty}^{\infty} F_D(x)\, f_C(x)\, dx$$

in which $f_C(x)$ is the probability density function of the capacity (the derivative of the capacity probability function, $F_C(x)$) if the capacity is a continuous variable and otherwise is the probability mass function of the capacity (analogous to a histogram). In words, the preceding equation indicates that one should assign a probability that the capacity is a particular value ($f_C(x)$) and then multiply by the probability that the demand is no greater than that value of capacity ($F_D(x)$). This process then is repeated for all possible values of the demand and the capacity, and the results are added (that is what the integration function does for continuous variables). The integrand of the equation above is shown in Figure 1.
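The integration described in words above can be carried out numerically. The sketch below is a minimal illustration, not a general reliability tool: it assumes, purely for the example, that capacity and demand are independent and normally distributed, and the means and standard deviations are hypothetical. For that special case a closed-form answer exists, which the code uses as a check.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Probability that a normal variable is no greater than x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def reliability(cap_mean, cap_sd, dem_mean, dem_sd, steps=20000):
    """Midpoint-rule estimate of the integral of F_D(x) * f_C(x) dx,
    taken over capacity values within 10 standard deviations of its mean."""
    lo = cap_mean - 10.0 * cap_sd
    hi = cap_mean + 10.0 * cap_sd
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += normal_cdf(x, dem_mean, dem_sd) * normal_pdf(x, cap_mean, cap_sd)
    return total * width


# Hypothetical numbers: capacity ~ N(100, 10), demand ~ N(70, 15).
estimate = reliability(100.0, 10.0, 70.0, 15.0)
# For independent normals, D - C is normal, so P(D - C <= 0) is closed-form.
exact = normal_cdf(0.0, 70.0 - 100.0, math.sqrt(10.0**2 + 15.0**2))
print(round(estimate, 4), round(exact, 4))
```

The two printed values should agree to several decimal places. With other distributions, only `normal_pdf` and `normal_cdf` would need to be replaced; the integration loop stays the same.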

TIME DEPENDENCY. Most systems are not designed to be used just once but instead to perform over an intended period. In this case, the demand and the capacity become time-dependent variables and the probability of satisfactory performance is interpreted as being over an intended design lifetime. The formulation of the previous section then is interpreted as being at a single point in time, and the results are integrated over the lifetime.

Most technology displays a characteristic failure curve that is relatively steep at the beginning of the design lifetime, during which time initial defects are discovered. The failure rate then decreases to a steady-state value that exists over most of the design lifetime of the technology. As the technology nears the end of its useful lifetime, the failure rate again rises as parts begin to wear out.

When failure is due to relatively rare events such as environmental hazards, unusual parts wear, and abnormal use, simplified time-dependent models can be developed on the basis of the independent occurrence of these unusual events. These models usually are based on the Poisson process model, which is the simplest among the time-dependent processes that are referred to as stochastic processes. The Poisson model assumes that the occurrence of each event is independent of the past history of performance of the technology.
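Under the Poisson assumption, the probability that no failure-causing event occurs during a period has a simple closed form. The sketch below assumes a constant mean event rate and assumes that any single event causes failure; the function name and the numbers are hypothetical.

```python
import math


def poisson_reliability(rate_per_year, years):
    """Probability of zero failure-causing events in `years`, assuming the
    events follow a Poisson process with a constant mean rate."""
    if rate_per_year < 0 or years < 0:
        raise ValueError("rate and duration must be non-negative")
    return math.exp(-rate_per_year * years)


# Hypothetical hazard: one damaging event per 100 years on average,
# evaluated over a 50-year design lifetime.
print(round(poisson_reliability(0.01, 50), 3))  # prints 0.607
```

Because the process has no memory, the same function applies to any 50-year window, regardless of how long the system has already survived.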

Systems reliability adds another level to this analysis. A system is a technology that is composed of multiple parts. Usually it is necessary that the parts work together properly for the system as a whole to function as desired. Systems theory builds on the theory described above to consider multiple capacities and demands, and many theories and models have been developed to analyze the risks of systems (Haimes 1998). Because systems analysis can be complicated, formalized approaches such as decision tree analysis (Clemen 1996) and event tree and fault tree analysis have been developed (Page 1989). Approximate analyses use the concepts of systems reaching a discrete number of undesirable states that are referred to as limit states. One then evaluates the probability of reaching those states by using approximate analyses such as the first-order, second-moment (FOSM) method, in which the limit state is approximated by a straight line and the full probability descriptors of the demands and capacities are approximated by the first and second moments of the probability function, which usually are the average and the standard deviation (Melchers 1999).
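For the simplest case, a linear limit state $S = C - D$ with independent capacity and demand, the first-order, second-moment idea reduces to a reliability index built from means and standard deviations alone. The symbols $\mu$, $\sigma$, and $\beta$ below are labels chosen here, and the last step holds exactly only when both variables are normally distributed; otherwise it is an approximation.

$$\beta = \frac{\mu_C - \mu_D}{\sqrt{\sigma_C^2 + \sigma_D^2}}, \qquad P(\text{failure}) = P(S < 0) \approx \Phi(-\beta)$$

Here $\Phi$ is the standard normal distribution function. A larger $\beta$ means the mean margin sits more standard deviations away from the limit state.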

Software reliability can be used to illustrate some of the issues mentioned here. Newly engineered software is notoriously unreliable. After in-house testing and even after beta (user) testing in the field or market, "patches" regularly have to be introduced as new problems arise. Sometimes those problems arise because of a lack of correctness in the underlying code, and at other times because of a lack of robustness in the overall design. Software engineers also can fail to appreciate the ways users may choose to utilize a particular piece of software, and hackers and others may try to exploit weaknesses in ways that undermine reliability. As software illustrates, the pursuit of functional reliability in engineering and technology is a never-ending quest with ethical implications.

## Ethics of Reliability

Despite its ethical importance in science and technology, reliability has been subject to little extended ethical analysis. With regard to persons, in which case the virtue of reliability manifests itself as trustworthiness, there has been more discussion. However, the following comments on the ethics of reliability in general are only preliminary observations.

First, as has been suggested in this entry, technological reliability is what makes engineered artifice the basis for improved material well-being. It is for this reason that a few technical professional ethics codes include the promotion of reliability as an explicit obligation. For example, in the Code of Ethics (developed 1948) of the American Society for Quality (founded in 1946), the third fundamental principle commits a member to promote "the safety and reliability of products for public use." However, although in some instances unreliability in products may be attributed to a failure of intention, in other cases it is caused by evolutionary changes in nature (e.g., the evolution of antibiotic-resistant bacteria), economic change (as occurs when parts cease to be available for cars or other vehicles) or unintended consequences. Indeed, unintended consequences are one of the most common ways to conceptualize breakdowns in technological reliability as engineered devices bring about unexpected scenarios. This both raises questions about the degree to which reliability can be an ethical obligation and suggests the need for engineers to consider the wider ramifications of technology in their analyses of reliability and to build flexibility into their designs.

Another instance in which reliability has been adopted explicitly as an ethical concept related to technology occurred at a Poynter Journalism Values and Ethics in New Media Conference in 1997. That conference drafted an ethics code that included the following recommended "Online Reliability Statement":

This site strives to provide accurate, reliable information to its users. We pledge to:

Ensure information on our Web site has been edited to a standard equal to our print or broadcast standards.

Notify our online users if newsworthy materials are posted from outside our site and may not have been edited or reviewed to meet our standards for reliability.

Update all our databases for timeliness, accuracy and relevance.

Warn users when they are leaving our site that they may be entering a site that has not embraced the content reliability protocol.

The idea here is that professional standards of reliability in the print media need to be transported consciously into a new technological media framework. Similar statements about the need for commitment to reliability in information delivery related in one way or another to technology have been discussed with regard to both medicine and computers.

With regard to science, replicability generally is thought of as a self-regulating process that serves both as a method for epistemological quality control and as a way to prevent scientific misconduct, including fabrication, falsification, and plagiarism. Thus, it is a mechanism for nurturing trust within the scientific community. The dominant perception that scientists deal with absolute certainties often undermines public trust in science when scientists openly communicate uncertainties in their research or when a scientific finding of high public concern is disputed and eventually overturned ("In Science We Trust" 2001).

The notion of reliability as replicability also manifests a certain hierarchy of values or axiology in the pursuit of knowledge. Alvin Weinberg (1971) has noted that physics serves as the ideal science (of which other sciences are more or less distorted images) because of the universalizability and replicability of its findings. It most closely approximates deeply entrenched Western beliefs about truth as timeless and noncontextual. However, this ingrained cultural deference to this ideal of science can lead to misunderstandings of science and unrealistic expectations about its contributions to complex political decisions.

Questions also might be raised about the issue of reliability in ethics itself. The human sciences, including ethical inquiry, proceed by means of dialectical and hermeneutical processes that are different from the models of the engineering construction of reliable artifacts or the scientific construction of reliable knowledge claims. In the popular imagination ethical and other value claims often are treated as matters of religious commitment, subjective preference, or legalistic requirements. However, a more nuanced appreciation of the process of ethical argumentation can point to possibilities for reliability.

Substantive agreement and reliability can be found, for instance, in some common documents, such as the Universal Declaration of Human Rights of 1948. Procedural reliability is manifested in the democratic considerations of ethics and other values that also are able to proceed toward common interest solutions through reasonable argumentation, tolerance, compromise, and openness of mind, procedures not dissimilar to those involved in the pursuit of an always provisional scientific truth.

Thus, the test for reliability in ethics may not be replicability, but it also may not be as distant from the actual workings of science as is maintained by many people. Indeed, when it comes to practical affairs, the desirable trait for both science and ethics may not be replicability so much as something more akin to the functional reliability of technology. That is, reliable science and ethics, much like reliable technologies, help human beings navigate toward common goods within complex situations marked by uncertainties and pluralities.

ROSS B. COROTIS, CARL MITCHAM, ADAM BRIGGLE

SEE ALSO *Uncertainty*.

## BIBLIOGRAPHY

Clemen, Robert. (1996). *Making Hard Decisions: An Introduction to Decision Analysis.* Pacific Grove, CA: Duxbury Press.

Collins, Harry, and Trevor Pinch. (1998). *The Golem: What You Should Know about Science,* 2nd rev. edition. New York: Cambridge University Press. This controversial book, which argues through a series of seven case studies for the influence of social factors in the assessment of experiments, originally was published in 1993.

Haimes, Yacov. (1998). *Risk Modeling, Assessment, and Management.* New York: Wiley.

"In Science We Trust." (2001). *Nature Medicine* 7(8): 871.

Melchers, Robert. (1999). *Structural Reliability Analysis and Prediction.* New York: Wiley.

Page, Lavon. (1989). *Probability for Engineering.* New York: Computer Science Press.

Peters, Douglas P., and Stephen J. Ceci. (1982). "Peer Review Practices of Psychological Journals: The Fate of Published Articles, Submitted Again." *Behavioral and Brain Sciences* 5: 187–255. Finds that eight of twelve articles resubmitted after they had been published were rejected by the same journal.

Weinberg, Alvin. (1971). "The Axiology of Science." *American Scientist* 58(6): 612–617.

Wood, Michael; Martyn Roberts; and Barbara Howell. (2004). "The Reliability of Peer Reviews of Papers on Information Systems." *Journal of Information Science* 30(1): 2–11.

### INTERNET RESOURCES

American Society for Quality. "Code of Ethics." Available from http://www.asq.org. The 1948 code of ethics can be found on their site at http://www.asq.org/join/about/ethics.html

The Poynter Institute. (1997). "Online Content Reliability Guidelines." Includes the "Online Reliability Statement." Available from http://legacy.poynter.org/research/me/nme/me_samprot.htm#relability.

## reliability

**reliability** When sociologists enquire as to the reliability of data, or of a measurement procedure, they are questioning whether the same results would be produced if the research procedure were to be repeated. Reliability embraces two principal forms of repetition: temporal reliability (the same result is obtained when the measurement is repeated at a later time); and comparative reliability (the same result is obtained when two different forms of a test are used, the same test is applied by different researchers, or the same test is applied to two different samples taken from the same population). Reliability raises many technical problems for the sociologist. For example, once someone has been interviewed, a repeat interview may be contaminated by the earlier experience.
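Temporal reliability is often summarized by correlating the scores from the two occasions. The following is a minimal sketch of that idea, not a procedure prescribed by this entry: it assumes the Pearson correlation coefficient as the summary statistic, and the function name `pearson_r` and the scores for the five respondents are hypothetical, invented purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for the same five respondents, measured twice.
first_wave = [12, 15, 9, 20, 17]
second_wave = [13, 14, 10, 19, 18]

retest_r = pearson_r(first_wave, second_wave)
print(f"test-retest correlation: {retest_r:.3f}")
```

A value close to 1 would suggest that respondents keep roughly the same relative position on both occasions. The same function could be applied to comparative reliability by passing scores from two forms of a test, or from two researchers, instead of two points in time.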

Reliability is usually contrasted with validity, that is, whether or not a measurement procedure actually measures what the researcher supposes it does. However, the two are not perfectly independent. One may have a highly reliable measure which is not valid: for example, we might measure IQ by standing subjects on a weighing machine and reading off the numbers, an extremely reliable procedure in both senses; but body-weight is hardly a valid indicator of IQ. A highly unreliable measure, on the other hand, cannot be valid. See also VARIABLE.

## reliability

**reliability** **1.** The ability of a computer system to perform its required functions for a given period of time. It is often quoted in terms of percentage of uptime, but may be more usefully expressed as MTBF (mean time between failures). See also hardware reliability, repair time.

**2.** of software. See software reliability.
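The two ways of quoting reliability mentioned in sense 1 can be related with a short calculation. This is a sketch under stated assumptions: the entry gives no formulas, so the code uses the commonly cited conventions that MTBF is operating time divided by the number of failures and that steady-state availability is MTBF / (MTBF + mean repair time). The function names and the figures are hypothetical.

```python
def mtbf(operating_hours, failure_count):
    """Mean time between failures: operating time divided by failures observed."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return operating_hours / failure_count


def availability(mtbf_hours, mean_repair_hours):
    """Fraction of time the system is up, assuming MTBF / (MTBF + repair time)."""
    if mtbf_hours <= 0 or mean_repair_hours < 0:
        raise ValueError("MTBF must be positive and repair time non-negative")
    return mtbf_hours / (mtbf_hours + mean_repair_hours)


# Hypothetical figures: 8,700 operating hours, 3 failures, 20 hours per repair.
system_mtbf = mtbf(8700, 3)
uptime_fraction = availability(system_mtbf, 20)
print(f"MTBF: {system_mtbf:.0f} hours")
print(f"uptime: {uptime_fraction:.2%}")
```

Under these assumptions, two systems with the same uptime percentage can differ widely in MTBF (many short outages versus a few long ones), which is one reason the MTBF figure may be the more useful of the two.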