## Multiple Indicator Models

A primary goal of sociology (and science in general) is to provide accurate estimates of the causal relationship between concepts of central interest to the discipline. Thus, for example, sociologists might examine the causal link between the amount of money people make and how satisfied they are with their lives in general. But in assessing the causal relationships between concepts such as income and life satisfaction, researchers are subject to making errors stemming from a multitude of sources. In this article, we will focus on one common and especially large source of errors in making causal inferences in sociology and other social and behavioral sciences: "measurement errors." Such errors will produce biased estimates (under- or overestimates) of the true causal relationship between concepts.

*Multiple indicator models* are a method of testing and correcting for errors made in measuring a concept or "latent construct." Multiple indicators consist of two or more "alternative" measures of the same concept (e.g., two different ways of asking how satisfied you are with your life). Before examining models that use multiple indicators to assess and correct for measurement error, however, the reader should become familiar with the terms and logic underlying such models. "Latent constructs" (also described as "latent variables") are unobservable phenomena (e.g., internal states such as the amount of "satisfaction" a person experiences) or more concrete concepts (e.g., income) that represent the hypothetical "true" or "actual" score persons would provide if we could measure a given concept without any error (e.g., a person's actual income or actual satisfaction, as opposed to their reported income or reported satisfaction).

Additionally, an "indicator" is simply another name for the empirical measure ("observed" score) of a given construct. For instance, researchers might measure income through a single "indicator" on a questionnaire that asks people "How much money do you earn per year?" with the response options being, say, five possible income levels ranging from "income greater than $200,000 per year" (coded as "5") to "income less than $10,000 per year" (coded as "1"). Similarly, researchers might provide a single indicator for life satisfaction by asking persons to respond to the statement "I am satisfied with my life." The response options here might be: "strongly agree" (coded as "5"), "agree" (coded as "4"), "neither agree nor disagree" (coded as "3"), "disagree" (coded as "2"), and "strongly disagree" (coded as "1").

Social scientists might expect to find a positive association between the above measures of income and life satisfaction. In fact, a review of empirical studies suggests a correlation coefficient ranging from .1 to .3 (Larson 1978; Haring et al. 1984). Correlation coefficients can have values between 1.0 and -1.0, with 0 indicating no association, and 1.0 or -1.0 indicating a perfect relationship. Thus, for example, a correlation of 1.0 for income and life satisfaction would imply that we can perfectly predict a person's life satisfaction score by knowing that person's income. In other words, individuals with the highest income (i.e., scored as a "5") consistently have the highest life satisfaction (i.e., scored as a "5"); those with the next-highest income (i.e., "4") consistently have the next-highest life satisfaction (i.e., "4"), and so on. Conversely, a -1.0 suggests the opposite relationship. That is, people with the highest income consistently have the lowest life satisfaction; those with the second-highest income consistently have the second-lowest life satisfaction, and so on.

Furthermore, a correlation coefficient of .2, as possibly found, say, between income and life satisfaction, suggests a relatively weak association. A coefficient of this size would indicate that individuals with higher incomes only *tend* to have higher life satisfaction. Hence, we should expect to find many exceptions to this "average" pattern. (More technically, one can square a correlation coefficient to obtain the amount of variance that one variable explains in another. Accordingly, squaring the *r* = .2 correlation between income and life satisfaction produces an *r*² of .04; that is, income explains 4 percent of the variance in life satisfaction.) In sum, given a correlation of .2, we can predict life satisfaction better by knowing someone's income than if we did not have this information (we reduce our errors by 4 percent) but we will still make a lot of errors in our prediction (96 percent of the variance in life satisfaction remains unexplained).
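The arithmetic behind "variance explained" can be sketched in a few lines of Python. The function name `variance_explained` is our own label for illustration, and the value .2 is simply the illustrative correlation used above.

```python
def variance_explained(r: float) -> float:
    """Return the proportion of variance one variable explains in another (r squared)."""
    return r ** 2


r = 0.2  # illustrative income / life satisfaction correlation from the text
explained = variance_explained(r)
unexplained = 1.0 - explained

print(f"r = {r:.2f}: explained = {explained:.0%}, unexplained = {unexplained:.0%}")
```

Squaring makes weak correlations look even weaker: halving a correlation cuts the explained variance to one quarter.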

These errors in prediction stem in part from the less-than-perfect measures of income and life satisfaction being analyzed (a topic covered in the next section). However, they also occur because there are many other causes of life satisfaction (e.g., physical health) in addition to income. The more of these additional causes there are, and the stronger their effects (i.e., the stronger their correlation with life satisfaction), the weaker the ability of a single construct such as income to predict life satisfaction. The same principles apply, of course, to any construct used to predict other constructs (e.g., using people's level of stress to predict the amount of aggression they will display).

Correlation coefficients and "path coefficients" are part of the "language" of causal modeling, including multiple indicator models. Like a correlation coefficient, a path coefficient describes the strength of the relationship between two variables. One can interpret a (standardized) path coefficient in a manner roughly similar to a correlation coefficient. Readers will increase their understanding of the material to follow if they familiarize themselves with these measures of strength of association. (For more information on interpreting correlation and path coefficients, see Blalock [1979]; Kline [1998].)

## RELIABILITY AND VALIDITY OF MEASURES

As noted earlier, measurement errors can bias estimates of the true causal associations between constructs of interest to sociologists and other researchers. Accordingly, it is important to use measures of high quality. Specialists in the field of measurement (often labeled "psychometricians") describe high-quality measures as having strong *reliability* and *validity* (Nunnally and Bernstein 1994). Reliability concerns the consistency with which an indicator measures a given construct; validity assesses whether one is measuring what one intends to measure or something else.

**Reliability.** A common method of assessing reliability is to determine the strength of the correlation (consistency) between alternative (multiple) indicators of the same construct (for example, the correlation between two different measures of life satisfaction). A correlation of 1.0 would suggest that both indicators are perfectly reliable measures of life satisfaction. Conversely, a correlation of, say, .3 would suggest that the two indicators are not very reliable measures of life satisfaction.

Given the subjective nature of indicators of life satisfaction (and many other constructs found in the social and behavioral sciences), we should not be surprised to find fairly low correlations (consistency) among their many possible multiple indicators. The ambiguity inherent in agreeing or disagreeing with statements like "I am satisfied with my life," "The conditions of my life are excellent," and "In most ways my life is close to my ideal" should introduce considerable measurement error. Furthermore, we might anticipate that much of this measurement error would be *random*. That is, relative to respondents' actual ("true") scores for life satisfaction, they would provide answers (observed scores) to the subjective questions regarding life satisfaction that are likely to be nonsystematic. For example, depending on the degree of ambiguity in the multiple indicators for a subjective construct like life satisfaction, a person is likely to display a random pattern of giving too high and too low scores relative to the person's true score across the set of measures.

This "noise" (unreliability due to random measurement error) will reduce the correlation of a given indicator with another indicator of the same construct. Indeed, "pure" noise (e.g., completely random responses to questions concerning respondents' life satisfaction) should not correlate with anything (i.e., *r* = 0). To the extent that researchers can reduce random noise in the indicators (e.g., by attempting to create as clearly worded self-report measures of a construct as possible) the reliability and corresponding correlations among multiple indicators should increase. Even where researchers are careful, however, to select the best available indicators of constructs that represent subjective states (like life satisfaction), correlations between indicators frequently do not exceed *r*'s of .3 to .5.

Not only does less-than-perfect reliability reduce correlations among multiple indicators of a given construct, but, more importantly, this random measurement error also reduces the degree to which indicators for one latent construct correlate with the indicators for another latent construct. That is, unreliable measures (such as each of the multiple indicators of life satisfaction) will *underestimate* the true causal linkages between constructs of interest (e.g., the effect of income on life satisfaction). These biased estimates can, of course, have adverse consequences for advancing our scientific knowledge (e.g., perceiving income as a less important source of life satisfaction than it might actually be). (Although unreliable measures will always underestimate the true relationship between two constructs, the bias of unreliable measures is more complex in "multivariate" situations where important control variables may exhibit as much unreliability as [or more unreliability than] do the predictor and outcome variables of interest.)
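A small simulation can illustrate this attenuation. The sketch below is only an illustration under assumed values: a hypothetical true standardized effect of .4, a perfectly measured income score, and a life satisfaction indicator that tracks its true score with a hypothetical path of .5. The helper `pearson` and all variable names are our own.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000
true_effect = 0.4        # assumed effect of true income on true life satisfaction
loading = 0.5            # assumed path from true to measured life satisfaction

# Standardized true scores: income, then life satisfaction driven partly by income.
income = [rng.gauss(0, 1) for _ in range(n)]
ls_true = [true_effect * i + math.sqrt(1 - true_effect ** 2) * rng.gauss(0, 1)
           for i in income]
# Observed indicator = true score plus random measurement error.
ls_observed = [loading * t + math.sqrt(1 - loading ** 2) * rng.gauss(0, 1)
               for t in ls_true]

r_true = pearson(income, ls_true)
r_observed = pearson(income, ls_observed)
print(f"correlation with true score:     {r_true:.2f}")
print(f"correlation with noisy indicator: {r_observed:.2f}")
```

Under these assumptions the correlation computed from the noisy indicator should come out at roughly half the correlation computed from the true scores, because the random error dilutes the association without removing it.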

Psychometricians have long been aware of this problem of "attenuated correlations" from unreliable measures. In response, traditional practice is to combine each of the multiple indicators for a given construct into a single *composite* scale (e.g., sum a person's score across each life satisfaction indicator). The random errors contained in each individual indicator tend to "cancel each other out" in the composite scale (cf. Nunnally and Bernstein 1994), and the overall reliability (typically measured with Cronbach's alpha, on a scale ranging from 0 to 1.0) can improve substantially relative to the reliability of individual items within the scale. Although composite scales are a definite step in the right direction, they are still less than perfectly reliable, and often much less. Consequently, researchers are still faced with the problem of biased estimates of the causal linkages among constructs of interest.
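To see how a composite scale gains reliability as indicators are added, one can use the standardized form of Cronbach's alpha, which depends only on the number of items and their average inter-item correlation. The sketch below assumes equally weighted, standardized items and a hypothetical average correlation of .4; the function name is our own.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


mean_r = 0.4  # hypothetical average correlation among the indicators
for k in (1, 3, 5, 10):
    print(f"{k:2d} items: alpha = {standardized_alpha(k, mean_r):.2f}")
```

Alpha rises with each added item but never reaches 1.0 as long as the items are imperfectly correlated, which is why composite scales reduce the bias without removing it.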

**Validity.** Unreliable indicators are not the only source of measurement error that can bias estimates of causal linkages among constructs. Invalid indicators can also create biased estimates. As we shall see in subsequent sections, bias from invalid measures stems from different sources and is more complex and difficult to detect and control than bias from unreliable measures.

Valid measures require at least modest reliability (i.e., correlations among indicators of a given construct cannot be *r* = 0); but reliable measures are not necessarily valid measures. One can have multiple indicators that are moderately to highly reliable (e.g., *r*'s = .5 to .8), but they may not measure the construct they are supposed to measure (i.e., may not be valid). For example, life satisfaction indicators may display at least moderate reliability, but no one would claim that they are valid measures of, say, a person's physical health.

This example helps clarify some differences between reliability and validity, but at the risk of obscuring the difficulty that researchers typically encounter in establishing valid measures of many latent constructs. Continuing with our example, researchers may select multiple indicators of life satisfaction that critics could never plausibly argue actually measure physical health. Critics might make a very plausible argument, however, that some or all the indicators of life satisfaction also measure a more closely related concept, such as "optimism."

Note, too, that if the life satisfaction indicators do, in fact, also measure optimism, then the correlation that income has with the life satisfaction indicators could stem entirely from income's causal links with an optimistic personality, rather than from income's effect on life satisfaction itself. In other words, in this hypothetical situation, invalid ("contaminated") measures of life satisfaction could lead to *over*estimating income's effect on life satisfaction (though, as we will see in later sections, one can construct examples where invalid measures of life satisfaction could also lead to *under*estimating income's effect).

Given the many subjective, loosely defined constructs that form the core concepts of sociology and other social and behavioral sciences, the issue of what the indicators are actually measuring (i.e., their validity) is a common and often serious problem. Clearly, our scientific knowledge is not advanced where researchers claim a relationship between constructs using invalid measures of one or more of the constructs.

The sections, below, on single-indicator and multiple-indicator models will elaborate on the bias introduced by measurement error stemming from unreliability and invalidity, and how to use the multiple indicators and "path analysis" of structural equation modeling (SEM) to test and correct for the bias. This discussion continues to use the example of estimating the effect of income on life satisfaction in the face of measurement error.

## SINGLE-INDICATOR MODELS

Figure 1 depicts the hypothesized causal link (solid arrow labeled "*x*") between the latent (unobservable) constructs of income and life satisfaction (represented with circles), and the hypothesized causal link (solid arrows labeled "*a*" and "*b*") between each latent construct (circle) and its respective empirical (observed) indicator (box). The *D* (disturbances) in Figure 1 represents all potential causes of life satisfaction that the researcher has not included in the causal model (stressful life events, personality characteristics, family relationships, etc.). The *E*'s in Figure 1 represent random measurement error (and any unspecified latent constructs that have a "unique" effect on a given indicator).

Because each latent construct (circled "*I*" and "*LS*") in Figure 1 has only one indicator (boxed "*I*1" or "*LS*1") to measure the respective construct, researchers describe the diagram as a causal model with *single* (as opposed to multiple) indicators. Additionally, Figure 1 displays a dashed, double-headed arrow (labeled "*r*1") between the box for income and the box for life satisfaction. This dashed, double-headed arrow represents the empirical or *observed* correlation between the empirical indicators of income and life satisfaction. Following the logic diagrammed in Figure 1, this observed correlation is the product of actual income affecting both measured income through path *a*, and actual life satisfaction through path *x*. In turn, actual life satisfaction affects measured life satisfaction via path *b*. Stated in more formal "path equations," *a* × *x* × *b* = *r*1.

Note that researchers can never directly observe the *true* causal effect (i.e., path *x*) of actual income (*I*) on actual life satisfaction (*LS*). Researchers can only *infer* such a relationship based on the *observed* correlation (*r*1) between the empirical indicators for income and life satisfaction, *I*1 and *LS*1. In other words, social scientists use an observed correlation (*r*1) to estimate an unobservable true causal effect (path *x*).

Notice also that in the presence of random measurement error, the observed correlation *r*1 will always be an *under*estimate of the unobservable path *x* representing the hypothesized true effect of income on life satisfaction. Of course, researchers hope that *r*1 will equal path *x*. But *r*1 will only equal *x* if our empirical measures of income and life satisfaction (*I*1 and *LS*1) are *perfectly reliable*—that is, have no random measurement error.

The phrase "perfectly reliable measures" implies that each person's observed score on the income and life satisfaction indicators reflects exactly that person's actual or "true score" for income and life satisfaction. If the indicators for income and life satisfaction are indeed perfect measures, then researchers can attach a (standardized) path coefficient of 1.0 to each path (i.e., *a* and *b*) between the latent constructs and their indicators. Likewise, researchers can attach a path coefficient of 0 to each path (i.e., *d* and *e*) representing the effects of random measurement errors (*E*1 and *E*2) on the respective indicators for income and life satisfaction.

The path coefficient of 1.0 for *a* and *b* signifies a perfect relationship between the latent construct (i.e., true score) and the measure or indicator for the latent construct (i.e., recorded score). In other words, there is no "slippage" between the actual amount of income or life satisfaction people have and the amount of income or life satisfaction that a researcher records for each person (i.e., there is no random measurement error). Therefore, people who truly have the highest income will report the most income, those who truly have the lowest income will report the lowest income, and so on. Likewise, people who, in fact, have the most life satisfaction will always record a life satisfaction score (e.g., "5") higher than those a little less satisfied (e.g., "4"), individuals a little less satisfied will always record a life satisfaction score higher than those a little bit less satisfied yet (e.g., "3"); and so on.

Under the assumption that the measures of income and life satisfaction are perfectly reliable, social scientists can use the *observed* correlation (*r*1) between the indicators *I*1 and *LS*1 to estimate the *true* causal effect (path *x*) of actual income (*I*) on actual life satisfaction (*LS*). Specifically, *r*1 = *a* × *x* × *b*; hence, *r*1 = 1.0 × *x* × 1.0, or *r*1 = *x*. Thus, if the observed correlation between income and life satisfaction (i.e., *r*1) is, say, .2, then the true (unobservable) causal effect of income on life satisfaction (i.e., path *x*) would also be .2. (For more detailed explanations of how to interpret and calculate path coefficients, see Sullivan and Feldman [1979]; Loehlin [1998].)

Of course, even if researchers were to measure income and life satisfaction with perfect reliability (i.e., paths *a* and *b* each equal 1.0), there are other possible errors ("misspecifications") in the model shown in Figure 1 that could bias researchers' estimates of how strong an effect income truly has on life satisfaction. That is, Figure 1 does not depict other possible "misspecifications" in the model such as "reverse causal order" (e.g., amount of life satisfaction determines a person's income) or "spuriousness" (e.g., education determines both income and life satisfaction and hence only makes it appear that income causes life satisfaction). (For more details on these additional sources of potential misspecification, see also Blalock [1979].)

How realistic is the assumption of perfect measurement in single indicator causal models? The answer is, "It depends." For example, to assume a path coefficient of 1.0 for path *a* in Figure 1 would not be too unrealistic (the actual coefficient is likely a little less than 1.0: say, .90 or .95). That is, we would not expect many errors in measuring a person's true income. Likewise, we would expect few measurement errors in recording, say, a person's age, sex, or race. As noted earlier, however, the measurement of subjective states (including satisfaction with life) is likely to occur with considerable error. Therefore, researchers would likely *under*estimate the true causal link between income and life satisfaction if they assumed no random measurement error for the life satisfaction indicator (that is, if they assumed a coefficient of 1.0 for path *b* in Figure 1).

How badly researchers underestimate the true causal effect (i.e., path *x*) would depend, of course, on how far below 1.0 the value for path *b* falls. For the sake of illustration, assume that path *b* equals .5. Assume also that income is perfectly measured (i.e., path *a* = 1.0) and the observed correlation (*r*1) between the indicators for income and life satisfaction is .2. Under these conditions, 1.0 × *x* × .5 = .2, and *x* = .4. In other words, researchers who report the observed correlation between income and life satisfaction (*r*1 = .2) would substantially *under*estimate the strength of the true effect (*x* = .4) that income has on life satisfaction. Recall that an *r* of .2 represents an *r*² of .04, or 4 percent of the variance in life satisfaction explained by income, whereas an *r* of .4 represents an *r*² of .16, or 16 percent of the variance explained by income. Accounting for 16 percent of the variance in life satisfaction represents a much stronger effect for income than does 4 percent explained variance. Based on this hypothetical example, income goes from a weak to a relatively strong predictor of life satisfaction, if we have the appropriate information to allow us to correct for the bias of the unreliable measure of life satisfaction.
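The correction in this hypothetical example amounts to rearranging the path equation *a* × *x* × *b* = *r*1 to solve for *x*. A minimal sketch follows; the function name `corrected_effect` is ours, and the values are those assumed in the example above.

```python
def corrected_effect(r_observed: float, a: float, b: float) -> float:
    """Solve the path equation a * x * b = r_observed for the true effect x."""
    return r_observed / (a * b)


r1 = 0.2   # observed correlation between the two indicators
a = 1.0    # assumed path from true income to measured income
b = 0.5    # assumed path from true to measured life satisfaction

x = corrected_effect(r1, a, b)
print(f"observed r = {r1:.2f} (variance explained: {r1 ** 2:.0%})")
print(f"corrected x = {x:.2f} (variance explained: {x ** 2:.0%})")
```

The correction is only as good as the values supplied for *a* and *b*; a different assumed value for *b* would yield a different corrected effect.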

But how do researchers know what values to assign to the unobservable paths (such as *a* and *b* in Figure 1) linking a given latent construct to its single indicator? Unless logic, theory, or prior empirical evidence suggests that the constructs in a given model are measured with little error (i.e., the path between the circle and the box = 1.0) or indicate what might be an appropriate value less than 1.0 to assign for the path between the circle and box, researchers must turn from *single* indicator models to *multiple* indicator models. As noted earlier, multiple indicator models allow one to make corrections for measurement errors that would otherwise bias estimates of the causal relationships between constructs.

## MULTIPLE INDICATOR MODELS

For researchers to claim that they are using multiple indicator models, at least one of the concepts in a causal model must have more than one indicator. "Multiple indicators" means simply that a causal model contains alternative measures of the same thing (same latent construct). Figure 2 depicts a multiple indicator model in which income still has a single indicator but life satisfaction now has three indicators (i.e., three alternative measures of the same life satisfaction latent construct). In addition to the original indicator for life satisfaction ("I am satisfied with my life"), there are two new indicators: "In most ways my life is close to my ideal" and "The conditions of my life are excellent." (Recall that the possible response categories range from "strongly agree" to "strongly disagree." See also Diener et al. [1985] for a fuller description of this measurement scale.)

As in Figure 1, the dashed, double-headed arrows represent the *observed* correlations between each pair of indicators. Recall that these observed correlations stem from the assumed operation of the latent constructs inferred in the causal model. Specifically, an increase in actual income (*I*) should produce an increase in measured income (*I*1) through (causal) path *a*. Moreover, an increase in actual income should also produce an increase in actual life satisfaction (*LS*) through (causal) path *x*, which in turn should produce an increase in each of the measures of life satisfaction (*LS*1, *LS*2, and *LS*3) through (causal) paths *b*, *c*, and *d*. In other words, these hypothesized causal pathways should produce observed correlations between all possible pairs of the measured variables (*I*1, *LS*1, *LS*2, and *LS*3). (We are assuming here that the underlying measurement model is one in which the unobserved constructs have a causal effect on their respective indicators. There are some types of multiple indicators, however, where a more plausible measurement model would suggest the reverse causal order; that is, the multiple indicators each drive the latent construct common to the set of indicators. See Kline [1998] for a discussion of these "cause" indicator measurement models, as opposed to the more traditional "effect" indicator measurement model described here.)

The use of a single indicator for income means that researchers must use logic, theory, or prior empirical evidence to assign a path coefficient to represent "slippage" between the true score (*I*) and the measured score (*I*1). For income, path *a* = 1.0 (i.e., no random measurement error) seems like a reasonable estimate, and makes it easier to illustrate the path equations. (Although a coefficient of, say, .95 might be more realistic, whether we use 1.0 or .95 will make little difference in the calculations that follow.) As noted earlier, however, indicators of life satisfaction are not as easily assigned path coefficients of 1.0. That is, there is likely to be considerable random error (unreliability) in measuring a subjective state such as life satisfaction. Fortunately, however, the use of *multiple indicators* for the life satisfaction construct permits researchers to provide reasonable estimates of random measurement error based on the *empirical data in the current study*, namely, the observed correlations (i.e., consistency) among the multiple indicators. Because measurement error can vary so much from one research setting to another, it is always preferable to provide estimates of reliability based on the current rather than previous empirical studies. Likewise, reliability estimates based on the current study are much better than those estimates obtained from logic or theory, unless the latter sources can provide a compelling case for a highly reliable single indicator (such as the measure of income used in the present example).

If the multiple indicator model in Figure 2 is correctly specified, then the observed correlations among the several pairs of indicators should provide researchers with information to calculate estimates for the hypothesized (unobservable) causal paths *b, c*, and *d* (i.e., estimates of how much "slippage" there is between actual life satisfaction and each measure of life satisfaction). Researchers can use hand calculations involving simple algebra to estimate the causal paths for such simple multiple indicator models as depicted in Figure 2 (for examples, see Sullivan and Feldman 1979 and Loehlin 1998). But more complicated models are best left to "structural equation modeling" (SEM) computer software programs such as LISREL (Linear Structural Relationships; Joreskog and Sorbom 1993), EQS (Equations; Bentler 1995), or AMOS (Analysis of Moment Structures; Arbuckle 1997). Kline (1998) provides a particularly excellent and comprehensive introduction to the topic. There are two annotated bibliographies that represent almost all work related to SEM up to about 1996 (Austin and Wolfe 1991; Austin and Calderon 1996). Marcoulides and Schumacker (1996) and Schumacker and Marcoulides (1998) cover even more recent advances. Smallwaters software company has a Web site that gives a wealth of information, including other relevant Web sites: http://www.smallwaters.com/weblinks.

In essence, these SEM computer programs go through a series of trial-and-error "iterations" in which different values are substituted for the hypothesized causal paths (in Figure 2, paths *b*, *c*, *d*, and *x*). (Recall that we assigned a value of 1.0 for path *a*, so the SEM program does not have to estimate a value for this hypothesized path.) Ultimately, the program reaches ("converges on") a "final solution." This solution will *reproduce* as closely as possible the *observed correlations* (in Figure 2, *r*1, *r*2, *r*3, *r*4, *r*5, and *r*6) among the indicators in the proposed causal model. In the *final solution* depicted in Figure 2, the path estimates for *b*, *c*, *d*, and *x* (when combined with the "assigned" or "fixed" value for path *a*) exactly reproduce the observed correlations. (Note that the *final* solution will reproduce the observed correlations better than will the *initial* solutions, unless, of course, the SEM program finds the best solution on its first attempt, which is not likely in most "real-world" data analyses.)

More technically, the SEM program builds a series of (simultaneous) equations that represent the various hypothesized causal paths that determine each observed correlation. In Figure 2, the six observed correlations involve the following "path" equations:

- *r*1 = .20 (*I*1 and *LS*1): *a* × *x* × *b* = .20
- *r*2 = .28 (*I*1 and *LS*2): *a* × *x* × *c* = .28
- *r*3 = .28 (*I*1 and *LS*3): *a* × *x* × *d* = .28
- *r*4 = .35 (*LS*1 and *LS*2): *b* × *c* = .35
- *r*5 = .35 (*LS*1 and *LS*3): *b* × *d* = .35
- *r*6 = .49 (*LS*2 and *LS*3): *c* × *d* = .49

The SEM program then uses the *known* values (the observed correlations and, in the causal model for Figure 2, the fixed, predetermined value of 1.0 for path *a*) to simultaneously solve the set of equations to obtain a value for each of the causal paths that initially have *unknown* values.
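For a model this simple, the path equations can also be solved by hand with a little algebra. One possible route (a sketch, assuming all paths are positive) multiplies two of the life satisfaction equations and divides by the third: since *r*4 × *r*5 = *b*² × *c* × *d* and *r*6 = *c* × *d*, the ratio gives *b*². The variable names below mirror the path labels in Figure 2.

```python
import math

# Observed correlations from Figure 2.
r1, r2, r3 = 0.20, 0.28, 0.28   # income indicator with LS1, LS2, LS3
r4, r5, r6 = 0.35, 0.35, 0.49   # LS1-LS2, LS1-LS3, LS2-LS3
a = 1.0                          # fixed in advance (income assumed perfectly measured)

# Each loading is the square root of (product of its two correlations) / (the third).
b = math.sqrt(r4 * r5 / r6)
c = math.sqrt(r4 * r6 / r5)
d = math.sqrt(r5 * r6 / r4)

# With b known, the first path equation a * x * b = r1 gives x.
x = r1 / (a * b)

print(f"b = {b:.2f}, c = {c:.2f}, d = {d:.2f}, x = {x:.2f}")

# The other two income equations supply independent checks on x.
print(f"x from r2: {r2 / (a * c):.2f}, x from r3: {r3 / (a * d):.2f}")
```

Because three equations each yield an estimate of *x*, the model has more information than it strictly needs; in real data those estimates would rarely agree exactly, which is why the programs described here search for a best compromise.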

Except in artificial examples (like Figure 2), however, the SEM program is unlikely to obtain final values for the causal paths such that the path equations exactly reproduce the observed correlations. More specifically, the program will attempt through its iterative trial-and-error procedures to find a final value for each of the causal paths *b, c, d*, and *x* that will *minimize* the "average discrepancy" across each of the six model-*implied* correlations (i.e., predicted by the path equations) versus the six empirically *observed* correlations.
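The iterative search can be illustrated with a deliberately simple stand-in. The sketch below uses plain gradient descent on the sum of squared discrepancies; this is our own simplification for illustration, not the estimation method of LISREL, EQS, or AMOS, which typically rely on more sophisticated estimators such as maximum likelihood. The starting value, step size, and iteration count are arbitrary assumptions.

```python
# Observed correlations from Figure 2, keyed by the two paths whose product
# should reproduce them (path a is fixed at 1.0, so it drops out).
OBSERVED = {
    ("x", "b"): 0.20,  # r1: I1 with LS1
    ("x", "c"): 0.28,  # r2: I1 with LS2
    ("x", "d"): 0.28,  # r3: I1 with LS3
    ("b", "c"): 0.35,  # r4: LS1 with LS2
    ("b", "d"): 0.35,  # r5: LS1 with LS3
    ("c", "d"): 0.49,  # r6: LS2 with LS3
}


def discrepancy(paths):
    """Sum of squared gaps between implied and observed correlations."""
    return sum((paths[p] * paths[q] - r) ** 2 for (p, q), r in OBSERVED.items())


def fit(start=0.5, step=0.05, iterations=5000):
    """Nudge each path downhill on the discrepancy, one small step at a time."""
    paths = {name: start for name in ("x", "b", "c", "d")}
    for _ in range(iterations):
        gradient = {name: 0.0 for name in paths}
        for (p, q), r in OBSERVED.items():
            gap = paths[p] * paths[q] - r
            gradient[p] += 2 * gap * paths[q]
            gradient[q] += 2 * gap * paths[p]
        for name in paths:
            paths[name] -= step * gradient[name]
    return paths


estimates = fit()
print({name: round(value, 3) for name, value in estimates.items()})
print("remaining discrepancy:", round(discrepancy(estimates), 8))
```

Each pass adjusts all four paths a little in whatever direction shrinks the overall discrepancy, which mirrors the trial-and-error logic described above even though production software uses different fitting functions.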

For example, to reproduce the observed correlation (*r*4 = .35) between *LS*1 and *LS*2 (recall that empirical correlations among indicators of subjective states like life satisfaction often range between .3 and .5), the SEM program would have to start with values for causal paths *b* and *c* (i.e., estimates of "slippage" between actual life satisfaction and measured life satisfaction) considerably lower than 1.0. In other words, the software program would need to allow for some random measurement error. If, instead, the initial solution of the SEM program assumed perfect reliability, there would be a substantial discrepancy between at least some of the implied versus observed correlations among the indicators. That is, multiplying the path *b* = 1.0 by the path *c* = 1.0 (i.e., assuming perfect reliability of each indicator) would imply an observed correlation of 1.0 (i.e., perfect consistency) between *LS*1 and *LS*2, an implied (i.e., predicted) correlation that far exceeds the *r*4 = .35 correlation we actually observe. (Keep in mind that, as depicted in Figure 2, the *best values* for *b* and *c* are .5 and .7, respectively, which the SEM program will eventually converge on as it "iterates" to a final solution.)

If, at the next iteration, the SEM program were to substitute equal values for *b* and *c* of about .59 each (to allow for less than perfect reliability), this solution would reproduce the observed correlation of .35 between *LS*1 and *LS*2 (i.e., .59 × .59 ≈ .35). But using a value of .59 for both *b* and *c* would *not* allow the program to reproduce the observed correlations for *LS*1 and *LS*2 with *I*1 (the indicator for income). To obtain the observed correlation of .20 between *I*1 and *LS*1, the program needs to multiply the paths *a*, *x*, and *b*. Accordingly, *r*1 (.20) should equal *a* × *x* × *b*; that is, 1.0 × *x* × .59 = .20. Solving for *x*, the program would obtain a path value of about .34. Likewise, to obtain the observed correlation of .28 between *I*1 and *LS*2, the program needs to multiply the paths *a*, *x*, and *c*. Accordingly, *r*2 (.28) must equal *a* × *x* × *c*; that is, 1.0 × *x* × .59 = .28. Solving for *x*, the program would obtain a path value of about .47. In other words, the program cannot find a solution for the preceding two equations that uses the *same value* for *x*. That is, for the first equation *x* ≈ .34, but for the second equation *x* ≈ .47.

Given the SEM program's need to come up with a *unique* (i.e., single) value for *x* (and for all the other causal paths the program must estimate), a possible compromise might be to use a value of .41. Substituting this value into the preceding two equations (*a* × *x* × *b* and *a* × *x* × *c*) would provide an *implied* correlation of about .24 (1.0 × .41 × .59 ≈ .24) for both equations. Comparing this implied correlation with the *observed* correlations of .20 and .28 for *I*1/*LS*1 and *I*1/*LS*2, respectively, the discrepancy in these two situations is ±.04.

Although this is not a large discrepancy between the implied and the observed correlations, the SEM program can do better (at least in this hypothetical example). If the SEM program subsequently estimates values of .50 and .70 for causal paths *b* and *c*, respectively, then it is possible to use the same value of *x* (specifically, .40) for each path equation involving *I*1/*LS*1 and *I*1/*LS*2, and reproduce exactly the observed correlations for *r*1 (i.e., 1.0 × .40 × .50 = .20) and *r*2 (i.e., 1.0 × .40 × .70 = .28). By using these estimated values of .5 and .7 for paths *b* and *c* in place of the values of about .6 each tried earlier, the program can also reproduce exactly the observed correlation (*r*4 = .35) between *LS*1 and *LS*2; that is, .5 × .7 = .35. Furthermore, by using an estimated value of .7 for causal path *d*, the program can exactly reproduce *all* the remaining observed correlations in Figure 2 that involve path *d* (*r*3 = .28, *r*5 = .35, and *r*6 = .49), that is, *a* × *x* × *d*, *b* × *d*, and *c* × *d*, respectively. (We leave to the reader the task of solving the equations.)

In sum, by using the fixed (a priori) value of *a* = 1.0 and the estimated values of *x* = .4, *b* = .5, *c* = .7, and *d* = .7, the six implied correlations exactly match the six observed correlations depicted in Figure 2. In other words, the hypothesized causal paths in our model provide a "perfect fit" to the "data" (empirical correlations).
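The path-tracing arithmetic summarized above can be checked with a short script. This is only an illustrative sketch, not the estimation routine an SEM program actually runs. The function names and dictionary layout are assumptions made for the example, and the value of *d* = .70 in the trial solution is assumed because the text does not discuss *d* at that step.

```python
# Observed correlations among the four indicators in Figure 2.
OBSERVED = {
    ("I1", "LS1"): 0.20,   # r1
    ("I1", "LS2"): 0.28,   # r2
    ("I1", "LS3"): 0.28,   # r3
    ("LS1", "LS2"): 0.35,  # r4
    ("LS1", "LS3"): 0.35,  # r5
    ("LS2", "LS3"): 0.49,  # r6
}


def implied_correlations(a, x, b, c, d):
    """Multiply the causal paths that connect each pair of indicators."""
    return {
        ("I1", "LS1"): a * x * b,
        ("I1", "LS2"): a * x * c,
        ("I1", "LS3"): a * x * d,
        ("LS1", "LS2"): b * c,
        ("LS1", "LS3"): b * d,
        ("LS2", "LS3"): c * d,
    }


def discrepancies(paths):
    """Implied minus observed correlation for every pair, rounded to 2 places."""
    implied = implied_correlations(**paths)
    return {pair: round(implied[pair] - OBSERVED[pair], 2) for pair in OBSERVED}


# Compromise solution from the text (d = .70 is an assumption for this sketch).
trial = dict(a=1.0, x=0.41, b=0.59, c=0.59, d=0.70)
# Final solution from the text.
final = dict(a=1.0, x=0.40, b=0.50, c=0.70, d=0.70)

for label, paths in (("trial", trial), ("final", final)):
    print(label)
    for pair, gap in discrepancies(paths).items():
        print(f"  {pair[0]}/{pair[1]}: {gap:+.2f}")
```

Under the trial values the income correlations miss by about .04 in opposite directions, whereas the final values leave no discrepancy for any of the six pairs.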

Reproducing the observed correlations among indicators does not, however, establish that the proposed model is correct. In the logic of hypothesis testing, one can only disconfirm models, not prove them. Indeed, there is generally a large number of alternative models, often with entirely different causal structures, that would reproduce the observable correlations just as well as the original model specified (see Kim and Mueller 1978 for examples). It should be apparent, therefore, that social scientists must provide rigorous logic and theory in building multiple indicator models, that is, in providing support for one model among a wide variety of possible models. In other words, multiple indicator procedures require that researchers think very carefully about how measures are linked to latent constructs, and how latent constructs are linked to other latent constructs.

Additionally, it is highly desirable that a model contain more observable correlations among indicators than unobservable causal paths to be estimated; that is, the model should be "overidentified." For example, Figure 2 has *six* observed correlations (*r*1, *r*2, *r*3, *r*4, *r*5, and *r*6) but only *four* hypothesized causal paths (*x*, *b*, *c*, and *d*) to estimate. Thus, Figure 2 is overidentified, with *two* "degrees of freedom" (df). By having an excess of observed correlations versus hypothesized causal paths (i.e., by having at least one and preferably many degrees of freedom), a researcher can provide tests of "model fit" to assess the probability that there exist alternative causal paths not specified in the original model. (Where the fit between the implied vs. observed correlations is poor, researchers typically seek to revise their causal model to better fit the data.)

Conversely, "just-identified" models will contain exactly as many observable correlations as hypothesized causal paths to be estimated (i.e., will have 0 degrees of freedom). Such models will *always* produce estimates (solutions) for the causal paths that *exactly* reproduce the observable correlations—no matter how badly misspecified the proposed causal pathways may be. In other words, "perfect" fit is inevitable and provides no useful information regarding whether the model is correctly specified or not. This result occurs because, unlike an overidentified model, a just-identified model does not have any degrees of freedom with which to detect alternative causal pathways to those specified in the original model. Accordingly, just-identified models are not very interesting to SEM practitioners.

Finally, the worst possible model is one that is "underidentified," that is, has fewer observable correlations than unobservable causal paths to be estimated. Such models can provide no single (unique) solutions for the unobservable paths. In other words, an infinite variety of alternative estimates for the causal paths is possible. For example, if we restricted Figure 2 to include only *LS*1 and *LS*2 (i.e., dropping *LS*3 and *I*1), the resulting two-indicator model of life satisfaction would be underidentified. That is, we would have *two* causal paths to estimate, from the latent construct to each of the two indicators (i.e., paths *b* and *c*), but only *one* observed correlation (*r*4 = .35). Under this situation, there is no *unique* solution. We can literally substitute an infinite set of values for paths *b* and *c* to exactly reproduce the observed correlation (e.g., given *b* × *c* = .35, we can use .7 × .5 or .5 × .7, or two values slightly less than .6 each, and so on).
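The lack of a unique solution in the two-indicator case can be made concrete with a few lines of code. In this sketch the helper name `matching_c` and the list of candidate values for *b* are assumptions chosen for illustration; any value of *b* between .35 and 1.0 would work equally well.

```python
R4 = 0.35  # observed correlation between LS1 and LS2


def matching_c(b, r=R4):
    """Return the value of path c for which b * c equals the observed correlation."""
    return r / b


# Arbitrary candidate values for path b (illustrative only).
candidates = [0.5, 0.59, 0.7, 0.9]
solutions = [(b, matching_c(b)) for b in candidates]

for b, c in solutions:
    print(f"b = {b:.2f}, c = {c:.3f}, b * c = {b * c:.2f}")
```

Every pair reproduces the single observed correlation, so the data give no basis for preferring one pair over another.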

The number of indicators per latent construct helps determine whether a model will be overidentified or not. In general, one should have at least three and preferably four or more indicators per latent construct—unless one can assume a single indicator, such as income in Figure 2, has little measurement error. Adding more indicators for a latent construct rapidly increases the "overidentifying" pieces of empirical information (i.e., degrees of freedom). That is to say, observable correlations (between indicators) grow faster than the unobservable causal paths (between a given latent construct and indicator) to be estimated.

For example, adding a fourth indicator for life satisfaction in Figure 2 would require estimating *one* additional causal path (linking the life satisfaction latent construct to the fourth indicator), but would also produce *four* more observed correlations (*LS*4 with *LS*3, *LS*2, *LS*1, and *I*1). The modified model would thus have *three* more degrees of freedom, and correspondingly greater *power* to determine how well the model fits the empirical data (observed correlations). Including a fifth indicator for life satisfaction would produce *four* more degrees of freedom, and even more power to detect a misspecified model. (The issue of model identification is more complicated than outlined here. The requirement that an identified model have at least as many observed correlations as causal paths to be estimated is a necessary but not sufficient condition; cf. Kline [1998].)
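The counting rule used in these examples is simple enough to express as a small function. The sketch below covers only the necessary condition described in the text (observed correlations minus paths to be estimated); it is not a full identification check, and the function names are assumptions made for the example.

```python
def degrees_of_freedom(n_indicators, n_free_paths):
    """Observed correlations among the indicators minus the paths to be estimated."""
    n_correlations = n_indicators * (n_indicators - 1) // 2
    return n_correlations - n_free_paths


def figure2_variant(n_ls_indicators):
    """Figure 2 with a chosen number of life satisfaction indicators.

    Income keeps its single indicator with path a fixed at 1.0, so the free
    paths are x plus one path per life satisfaction indicator.
    """
    n_indicators = 1 + n_ls_indicators
    n_free_paths = 1 + n_ls_indicators
    return degrees_of_freedom(n_indicators, n_free_paths)


for k in (2, 3, 4, 5):
    print(f"{k} life satisfaction indicators: df = {figure2_variant(k)}")

# Two-indicator model with no income indicator: one correlation, two paths.
print("LS1 and LS2 only: df =", degrees_of_freedom(2, 2))
```

The counts move from 2 to 5 to 9 degrees of freedom as the third, fourth, and fifth indicators are included, which matches the gains of three and four described above. A negative count marks an underidentified model.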

Some additional points regarding multiple indicator models require clarification. For instance, in "real life" a researcher would never encounter such a perfect reproduction of the (noncausal) observable correlations from the unobservable (causal) paths as Figure 2 depicts. (We are assuming here that the causal model tested in "real life," like that model tested in Figure 2, is "overidentified." Recall that a "just-identified" model always exactly reproduces the observed correlations.) Indeed, even if the researcher's model is correctly specified, the researcher should expect at least *some* minor discrepancies in comparing the observed correlations among indicators with the correlations among indicators predicted (implied) by the hypothesized causal paths.

Researchers can dismiss as "sampling error" (i.e., "chance") any discrepancies that are not too large (given a specific sample size). At some point, however, the discrepancies do become too large to dismiss as "chance." At that point, researchers may determine that they have not specified a proper model. Poor model fit is strong grounds for reevaluating and respecifying the original causal model—typically by adding and (less often) subtracting causal paths to obtain a better fitting model. Just because an overidentified model can detect that a model is misspecified does not mean, however, that it is easy to then tell where the misspecification is occurring. Finding and correcting misspecification is a complex "art form" that we cannot describe here (but see Kline 1998 for an overview).

The next section will describe how nonrandom measurement error can create a misfitting multiple indicator model and corresponding bias in estimates of causal pathways. Additionally, we will demonstrate how a just-identified model, in contrast to an overidentified model, will fail to detect and thus correct for this misspecification, resulting in considerable bias in estimating the true effect of income on life satisfaction.

## MULTIPLE-INDICATOR MODELS WITH NONRANDOM MEASUREMENT ERROR

We have discussed how poor quality measures—low reliability and low validity—can bias the estimates of the true effects of one latent construct on another. Our specific modeling examples (Figures 1 and 2), however, have focused on the bias introduced by unreliable measures only. That is, our causal diagrams have assumed that all measurement error is *random*. For example, Figure 2 depicts the error terms (*E*'s) for each of three multiple indicators of life satisfaction to be *unconnected*. Such random measurement error can occur for any number of reasons: ambiguous questions, coding errors, respondent fatigue, and so forth. But none of these sources of measurement error should *increase* correlations among the multiple indicators. Indeed, as noted in previous sections, random measurement error should *reduce* the correlations (consistency) among multiple indicators of a given latent construct—less-than-perfect correlations that researchers can then use to estimate reliability and thereby correct for the bias that would otherwise occur in underestimating the true effect of one latent construct on another (e.g., income on life satisfaction).

Conversely, where measurement error *increases* correlations among indicators, social scientists describe it as systematic or *nonrandom*. Under these conditions, the measurement errors of two or more indicators have a common source (a latent construct) other than or in addition to the concept that the indicators were supposed to measure. The focus now becomes the *validity* of measures. Are you measuring what you claim to measure or something else? (See the section above entitled "Reliability and Validity of Measures" for a more general discussion.)

Failure to include nonrandom—linked or "correlated"—errors in a multiple-indicator model will bias the estimates of other causal paths in the model. Figure 3 depicts such a linkage of error terms through the personality variable "optimism" (O). Based on the hypothetical model in Figure 3, the observed correlation (*r*6) between the indicators *LS*2 and *LS*3 would not be entirely the consequence of the effects of life satisfaction (*LS*) operating through the causal paths *c* and *d*. In fact, part of this observed correlation would occur as a consequence of the causal paths *e* and *f* (which in this hypothetical example, we have constrained to be equal). In other words, the indicators *LS*2 and *LS*3 measure some of life satisfaction but also optimism. That is, they measure something in addition to what they were intended to measure. Stated in still other words, the two indicators are not "pure" (completely valid) measures of life satisfaction because they are "contaminated" by also tapping optimism.

Note that *r*6 is the only observed correlation that differs in strength in comparing Figures 2 and 3. For Figure 2, this correlation equals .49; for Figure 3, this correlation equals .85. The higher observed correlation for *r*6 in Figure 3 stems from the "inflated" correlation produced by the effects of optimism through paths *e* (.6) and *f* (.6). Note, also, that .6 × .6 equals .36. If we add .36 to the original observed correlation for *r*6 (i.e., .49) in Figure 2, we obtain the observed correlation of .85 in Figure 3. All other path estimates and observed correlations remain the same across the two figures. Furthermore, like Figure 2, Figure 3 depicts path estimates for a final solution that exactly reproduce all observed correlations.

Note, too, that if we had not added the causal paths (*e* and *f*) to represent the hypothesized effect of optimism on two measures of life satisfaction, we could not obtain a "good fit" for the observed correlations in Figure 3. Indeed, *without* these additional paths, the SEM computer program would have to increase the path coefficients for *c* and *d* (say, to about .92 each) in order to reproduce the observed correlation of .85 for *r*6. But then the program would fail to reproduce the observed correlations involving *LS*2 and *LS*3 with the other indicators in the model (*LS*1 and *I*1). For example, the observed correlation between *I*1 and *LS*2 (*r*2 = .28) would now be *overestimated*, based on the product of the causal paths (*a* × *x* × *c*) that the model in Figure 3 suggests determines *r*2. That is, 1.0 × .4 × .92 results in an implied (predicted) correlation of about .37, which leaves a discrepancy of .09 relative to the correlation (.28) actually observed.

In the absence of good fit—as a consequence of not allowing paths *e* and *f* to be included in the model to be tested (i.e., not specifying the "correct" model)—the SEM program might continue to "iterate" by substituting new estimates for *c* and the other causal paths in Figure 3—*b*, *d*, and *x*—which initially have unknown values. No matter what values are estimated for each of these causal paths, however, the program will not be able to eliminate a discrepancy between the implied and the observed correlations for at least some of the pairs of indicators—a discrepancy that, as noted previously, the program will attempt to minimize based on the criterion of finding path estimates that result in the lowest average discrepancy summed across all possible comparisons of implied versus observed correlations in the model.

Indeed, a misspecification in one part of a model (in this instance, failure to model nonrandom measurement error in the *LS*2 and *LS*3 indicators) generally "reverberates" throughout the causal model, as the SEM program attempts to make adjustments in all path coefficients to minimize the average discrepancy. In other words, estimates for each path coefficient are most often at least slightly biased by misspecified models (the greater the misspecification, the greater the bias). Most importantly, the failure to model (and thus correct for) nonrandom measurement error will typically result in the SEM program's estimating a value for the true effect of one construct on another (in the present example, path *x*, the true effect of income on life satisfaction) that will be biased (either over- or underestimated). For example, when we use an SEM program (EQS) to calculate estimates for the hypothesized causal paths in Figure 3 (excluding the paths for the contamination of optimism), we obtain a final solution with the following estimates: *b* = .38, *c* = .92, *d* = .92, and *x* = .31. Thus, in the present example, the failure to model nonrandom measurement error has led to an underestimate (*x* = .31) of income's true effect (*x* = .40) on life satisfaction. Interestingly, by far the largest discrepancy in fit occurs between *I*1 and *LS*1 (the implied correlation is .08 less than the observed correlation), not between *LS*2 and *LS*3 (where there is 0 discrepancy), demonstrating that misfit in one part of the model can produce misfit (reverberate) elsewhere in the model.
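The pattern of misfit reported for the EQS solution can be verified by hand, or with a sketch like the following. The script does not estimate anything; it only plugs the path values quoted in the text into the path-tracing rules for Figure 3. The function and variable names are assumptions made for the example.

```python
# Observed correlations in Figure 3 (only r6 differs from Figure 2).
OBSERVED = {
    "I1-LS1": 0.20, "I1-LS2": 0.28, "I1-LS3": 0.28,
    "LS1-LS2": 0.35, "LS1-LS3": 0.35, "LS2-LS3": 0.85,
}


def implied(x, b, c, d, e=0.0, f=0.0, a=1.0):
    """Implied correlations for Figure 3; e and f are the optimism paths."""
    return {
        "I1-LS1": a * x * b,
        "I1-LS2": a * x * c,
        "I1-LS3": a * x * d,
        "LS1-LS2": b * c,
        "LS1-LS3": b * d,
        # LS2 and LS3 share two common causes: life satisfaction and optimism.
        "LS2-LS3": c * d + e * f,
    }


def misfit(paths):
    """Implied minus observed correlation for each pair, rounded to 2 places."""
    model = implied(**paths)
    return {pair: round(model[pair] - OBSERVED[pair], 2) for pair in OBSERVED}


# Correctly specified model, with the optimism paths included.
correct = misfit(dict(x=0.40, b=0.50, c=0.70, d=0.70, e=0.60, f=0.60))
# Estimates reported in the text when the optimism paths are left out.
omitted = misfit(dict(x=0.31, b=0.38, c=0.92, d=0.92))

worst = max(omitted, key=lambda pair: abs(omitted[pair]))
print("correct model:", correct)
print("optimism omitted:", omitted)
print("largest misfit:", worst, omitted[worst])
```

With the optimism paths omitted, the largest gap falls on the income and *LS*1 pair rather than on the contaminated pair itself, which is the "reverberation" the text describes.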

The hypothetical model depicted in Figure 3 is necessarily artificial (to provide simple illustrations of SEM principles). In real-life situations, it is unlikely (though still possible) that sources of contamination, when present, would impact only some of the multiple indicators for a given construct (e.g., only two of the three measures of life satisfaction). Furthermore, researchers would be on shaky ground if they were to actually claim that the correlated measurement error modeled in the present example stemmed from an identifiable source such as optimism. (Indeed, to avoid specifically labeling the source of the inflated correlation between *LS*2 and *LS*3, researchers would most likely model a curved, double-headed arrow between *E*3 and *E*4 in Figure 3, and have the SEM program simply estimate the appropriate value, .36, for this new path representing correlated measurement error.) To make a legitimate claim that a construct like optimism contaminates a construct such as life satisfaction, researchers would also need to provide traditional measures of the suspected source of contamination (e.g., include established multiple indicators for optimism).

As a final example of the need to use *overidentified* multiple indicator models in order to detect misspecified models, consider a modification of Figure 3 in which we have access to the single indicator for income (*I*1) but have only two indicators for life satisfaction available to us—*LS*2 and *LS*3. Assume, further, that we are unaware that optimism contaminates the two life satisfaction indicators. Note that this new model is *just identified*. That is, there are three observed correlations (*r*2, *r*3, and *r*6) and three hypothesized causal paths to estimate (*x*, *c*, and *d*). Accordingly, this model will perfectly fit the data, despite the fact that we have not modeled (i.e., misspecified) the correlated measurement error (.36) that we've built into the .85 observed correlation (*r*6) between *LS*2 and *LS*3.

More specifically, in order to reproduce the .85 correlation, the SEM program will increase the estimate of reliability for the *LS*2 and *LS*3 life satisfaction indicators by increasing the *c* and *d* paths from their original (true) value of .7 each to new (biased) values of about .92 each. To reproduce the observed correlation of .28 for both *I*1/*LS*2 and *I*1/*LS*3 (*r*2 and *r*3), the program can simply decrease the *x* path (i.e., the estimate of the true effect of income on life satisfaction) from its original (true) value of .40 to a new (biased) value of about .30; that is, for both path equations *a* × *x* × *c* and *a* × *x* × *d*, 1.0 × .3 × .92 ≈ .28.

In sum, our just-identified model perfectly fits the data despite the fact that our model has failed to include (i.e., has misspecified) the correlated measurement error created by optimism contaminating the two life satisfaction indicators. More importantly, the failure to model the nonrandom measurement error in these two indicators has led us to a biased underestimate of the true effect of income on life satisfaction: Income explains .3² = 9% of the variance in life satisfaction for the current (biased) model, but explains .4² = 16% of the variance in life satisfaction for the earlier (correctly specified) model.
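Because the just-identified model has three equations (*r*2 = *x* × *c*, *r*3 = *x* × *d*, *r*6 = *c* × *d*, with *a* fixed at 1.0) in three unknowns, its solution can be written in closed form. The sketch below solves those equations directly. This is an algebraic shortcut for illustration, not the iterative procedure an SEM program uses, and the function name is an assumption.

```python
import math


def solve_just_identified(r2, r3, r6):
    """Solve r2 = x*c, r3 = x*d, r6 = c*d for the three paths (a fixed at 1.0)."""
    x = math.sqrt(r2 * r3 / r6)
    c = math.sqrt(r6 * r2 / r3)
    d = math.sqrt(r6 * r3 / r2)
    return x, c, d


# Contaminated data: r6 = .85 includes the .36 produced by optimism.
biased = solve_just_identified(0.28, 0.28, 0.85)
# For comparison, the uncontaminated value r6 = .49 from Figure 2.
unbiased = solve_just_identified(0.28, 0.28, 0.49)

for label, (x, c, d) in (("r6 = .85", biased), ("r6 = .49", unbiased)):
    print(f"{label}: x = {x:.2f}, c = {c:.2f}, d = {d:.2f}, "
          f"variance explained = {x ** 2:.0%}")
```

Both sets of path values reproduce their three correlations exactly, so fit alone cannot reveal that the first set understates the effect of income.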

As noted earlier, the debilitating effects of nonrandom measurement error are extremely diverse in their potential forms and correspondingly complex in their potential bias in under- or overestimating causal links among latent constructs. The present discussion only touches on the issues and possible multiple indicator models for detecting and correcting this type of measurement error. The reader is encouraged to consult other sources that provide additional details (starting with Kline 1998).

## STRENGTHS AND WEAKNESSES OF MULTIPLE-INDICATOR MODELS USING SEM

As we have seen, by incorporating multiple indicators of constructs, structural equation modeling procedures allow more rigorous tests of causal models than possible when using single indicators. With sufficient multiple indicators to provide an overidentified model, SEM procedures are particularly powerful in detecting and correcting for the bias created by less than perfectly reliable and valid measures of latent constructs of interest. In other words, by building a more accurate *measurement model*—that is, providing a more correct specification of the causal linkages between latent constructs and their indicators—SEM researchers can obtain more accurate estimates of the causal linkages in the *structural model*—that is, the effect of one latent variable on another.

A correct specification of the measurement model is, then, a means to the ultimate goal of obtaining accurate path estimates (also described as "parameter estimates") for the structural model (e.g., obtaining the true effect of income on life satisfaction by correcting for measurement error). Some research, however, makes the measurement model the central focus (e.g., whether the three indicators of life satisfaction correlate with each other in a manner consistent with their measuring a single latent construct). In this situation, the SEM procedure is known as "confirmatory factor analysis" (CFA). The CFA uses the typical SEM principle of positing a measurement model in which latent constructs (now described as "factors") impact multiple indicators (the path coefficient now described as a "factor loading"). Unlike a "full-blown" SEM analysis, however, the CFA does not specify causal links among the latent constructs (factors), that is, does not specify a structural model (e.g., there is no interest in whether income affects life satisfaction). Instead, the CFA treats potential relationships among any two factors as simply a "correlation" (typically represented in diagrams with a curved, double-headed arrow between the factors); that is, the causal ordering among latent constructs is *not specified*.

Researchers have developed especially powerful CFA procedures for establishing the validity of measures through significant enhancements to traditional construct validity techniques based on multitrait/multimethods (MTMM), second-order factor structures, and testing the invariance of measurement models across diverse subsamples (again, see Kline 1998 for an introduction). CFA has also been at the vanguard in developing "nonlinear" factor analysis to help reduce the presence of "spurious" or "superfluous" latent constructs (referred to as "methods" or "difficulty" factors) that can occur as a consequence of using multiple indicators that are highly skewed (i.e., non-normally distributed) (Muthen 1993, 1998). These nonlinear factor analysis procedures also provide a bridge to recent advances in psychometric techniques based on item-response theory (cf. Reise et al. 1993; Waller et al. 1996).

CFA provides access to such powerful psychometric procedures as a consequence of its enormous flexibility in specifying the structure of a measurement model. But this strength is also its weakness. CFA requires that the researcher specify all aspects of the measurement model a priori. Accordingly, one must have considerable knowledge (based on logic, theory, or prior empirical evidence) of what the underlying structure of the measurement model is likely to be. In the absence of such knowledge, measurement models are likely to be badly misspecified. Although SEM has available particularly sophisticated techniques for detecting misspecification, simulation studies indicate that these techniques do not do well when the hypothesized model is at some distance from the true model (cf. Kline 1998).

Under these circumstances of less certainty about the underlying measurement model, researchers often use "exploratory factor analysis" (EFA). Unlike CFA, EFA requires no a priori specification of the measurement model (though it still works best if researchers have some idea of at least the likely number of factors that underlie the set of indicators submitted to the program). Individual indicators are free to "load on" any factor. Which indicators load on which factor is, essentially, a consequence of the EFA using the empirical correlations among the multiple indicators to seek out "clusters" of items that have the highest correlations with each other, and designating a factor to represent the underlying latent construct that is creating (is the "common cause" of) the correlations among items. Of course, designating these factors is easier said than done, given that the EFA can continue to extract additional factors, contingent on how stringent the standards are for what constitutes indicators that are "more" versus "less" correlated with each other. (For example, does a set of items with correlations of, say, .5 with each other constitute the same or a different "cluster" relative to other indicators that correlate .4 with the first cluster and .5 with each other? Extracting more factors will tend to separate these clusters into different factors; extracting fewer factors is more likely to combine the clusters into one factor.) The issue of the number of factors to extract is, then, a major dilemma in using EFA, with a number of criteria available, including the preferred method, where possible, of specifying a priori the number of factors to extract (cf. Floyd and Widaman 1995).

A potential weakness with EFA is that it works best with relatively "simple" measurement models. EFA falters in situations where the reasons for the "clustering" of indicators (i.e., interitem correlations) stem from complex sources (see Bollen 1989 for an example). As an aspect of this inability to handle complex measurement models, EFA cannot match CFA in the sophistication of its tests for validity of measures. On the other hand, because CFA is more sensitive to even slight misspecifications, it is often more difficult to obtain satisfactory model "fit" with CFA.

Exploratory and confirmatory factor analysis can complement each other (but see Floyd and Widaman 1995 for caveats). Researchers may initially develop their measurement models based on EFA, then follow that analysis with CFA (preferably on a second random subsample, in order to avoid modeling "sampling error"). In this context, the EFA serves as a crude "first cut," which, it is hoped, results in a measurement model that is close enough to the "true model" that a subsequent CFA can further refine the initial model, using the more sophisticated procedures available with the confirmatory procedure.

Even though researchers use EFA and CFA to focus exclusively on *measurement models*, these psychometric studies are really a preamble to subsequent work in which the measures previously developed are now used to test causal linkages among constructs in *structural models*. Most of these subsequent tests of causal linkages among constructs (e.g., the effect of income on life satisfaction) use data analytic procedures, such as ordinary least squares (OLS) regression or analysis of variance (ANOVA) that assume no underlying latent structure. Accordingly, these traditional analyses of structural models work with measured (observable) variables only, can accommodate only a *single indicator* per construct of interest and, therefore, must assume (in essence) that each single indicator is perfectly reliable and valid. In other words, in the absence of a multiple indicators measurement model for each construct, there is no way to make adjustments for less-than-perfect reliability or validity.

In an attempt to enhance the reliability of the single-indicator measures that traditional data analyses procedures require, researchers often combine into a single *composite* scale the set of multiple indicators (e.g., the three indicators of life satisfaction) that prior EFA and CFA research has established as measuring a given latent construct (factor). Although, as noted earlier, a single-indicator composite scale (e.g., summing a respondent's score on each of the three life satisfaction indicators) is generally preferable to a noncomposite single indicator (at least when measuring more subjective and abstract phenomena), a composite scale is still less than perfectly reliable (often, much less). Accordingly, using such an enhanced (more reliable) measure will still result in biased parameter estimates in the structural model.

Validity also remains a concern. Even if prior psychometric work on measures has rigorously addressed validity (though this is often not the case), there is always a question of how valid the measure of a given construct is in the context of other constructs in the current structural model (e.g., what sort of "cross-contamination" of measures may be occurring among the constructs that might bias estimates of their causal linkages). SEM remains a viable alternative to the more traditional and widely used data analysis procedures for assessing causal effects in structural models. Indeed, SEM is so flexible in specifying the structural, as well as measurement part of a causal model, that it can literally "mimic" most of the traditional, single indicator, data analysis procedures based on the general linear model (e.g., Analysis of Variance [ANOVA], Multivariate Analysis of Variance [MANOVA], Ordinary Least Squares [OLS] regression), as well as the newer developments in these statistical techniques (e.g., Hierarchical Linear Modeling [HLM]). SEM can even incorporate the single-indicator measurement models (i.e., based on observable variables only) that these other data analysis procedures use. The power to adjust for random and nonrandom measurement error (unreliability and invalidity) is realized, however, only if the SEM includes a multiple-indicator (latent variable) measurement model.

A multiple-indicator measurement model also provides SEM procedures with a particularly powerful tool for modeling (and thus adjusting for) correlated measurement errors that inevitably occur in analyses of *longitudinal* data (cf. Bollen 1989). Likewise, multiple-indicator models allow SEM programs to combine latent growth curves and multiple subgroups of cohorts in a "cohort-sequential" design through which researchers can examine potential cohort effects and developmental sequences over longer time frames than the period of data collection in the study. By also forming subgroups based on their pattern of attrition (i.e., at what wave of data collection respondents drop out of the study), SEM researchers can analyze the "extended" growth curves while adjusting for the potential bias from nonrandom missing data (Duncan and Duncan 1995), a combination of important features that non-SEM methods cannot match. Furthermore, although SEM is rarely used with *experimental* research designs, multiple-indicator models provide exceptionally powerful methods of incorporating "manipulation checks" (i.e., whether the experimental treatment manipulates the independent variable that was intended) and various "experimental artifacts" (e.g., "demand characteristics") as alternative causal pathways to the outcome variable of interest (cf. Blalock 1985).

Although this article has focused on strengths of the SEM multiple indicator measurement model, the advantages of SEM extend beyond its ability to model random and nonrandom measurement error. Particularly noteworthy in this regard is SEM's power to assess both contemporaneous and lagged *reciprocal* effects using nonexperimental longitudinal data. Likewise, SEM has an especially elegant way of incorporating maximum likelihood full-information imputation procedures for handling missing data (e.g., Arbuckle 1996; Schafer and Olsen 1998), as an alternative to conventional listwise, pairwise, and mean substitution options. Simulation studies show that the full-information procedures for missing data, relative to standard methods, reduce both the bias and inefficiency that can otherwise occur in estimating factor loadings in measurement models and causal paths in structural models.

So given all the apparent advantages of SEM, why might researchers hesitate to use this technique? As noted earlier in discussing limitations of confirmatory factor analysis, SEM requires that a field of study be mature enough in its development so that researchers can build the a priori causal models that SEM demands as input. SEM also requires relatively large sample sizes to work properly. Although recommendations vary, a minimum sample size of 100 to 200 would appear necessary in order to have a reasonable chance of getting solutions that "converge" or that provide parameter (path) estimates that are not "out of range" (e.g., a negative variance estimate) or otherwise implausible. Experts often also suggest that the "power" to obtain stable parameter estimates (i.e., the ability to detect path coefficients that differ from zero) requires that the ratio of subjects to parameters estimated be between 5:1 and 20:1 (the minimum ratio increasing as variables become less normally distributed). In other words, more complex models demand even larger sample sizes.
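The arithmetic behind these rules of thumb can be made concrete with a brief sketch. The function name and the choice of 100 as the default floor are assumptions made for illustration (the text gives a range of 100 to 200); the 5:1 and 20:1 ratios are those cited above. The result is a rough guide only, not a substitute for a formal power analysis.

```python
def sample_size_range(n_free_parameters, low_ratio=5, high_ratio=20, floor=100):
    """Return the (lower, upper) sample sizes implied by the rule of thumb.

    n_free_parameters: number of parameters the model estimates.
    low_ratio, high_ratio: subjects-to-parameters ratios (5:1 and 20:1).
    floor: assumed absolute minimum N, taken here as 100.
    """
    lower = max(floor, low_ratio * n_free_parameters)
    upper = max(floor, high_ratio * n_free_parameters)
    return lower, upper


# A modest model estimating 12 parameters versus a larger one estimating 30.
print(sample_size_range(12))  # (100, 240)
print(sample_size_range(30))  # (150, 600)
```

The upper figure applies when indicators depart more strongly from normality, which is why the range widens quickly as parameters are added.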

Apart from the issue of needing a larger *N* to accommodate more complex causal models, there also is some agreement among SEM experts that models should be kept simpler in order to have a reasonable chance to get the models to fit well (cf. Floyd and Widaman 1995). In other words, SEM works better where the total number of empirical indicators in a model is not extremely large (say 30 or less). Accordingly, if researchers wish to test causal models with many predictors (latent constructs) or wish to model fewer constructs with many indicators per construct, SEM may not be the most viable option. Alternatively, one might consider "trimming" the proposed model of constructs and indicators through more "exploratory" SEM procedures of testing and respecifying models, possibly buttressed by preliminary model trimming runs with statistical techniques that can accommodate a larger number of predictors (e.g., OLS regression), and with the final model hopefully tested (confirmed) on a second random subsample. Combining pairs or larger "parcels" from a common set of multiple indicators is another option for reducing the number of total indicators, and has the added benefit of providing measures that have more normal distributions and fewer correlated errors from idiosyncratic "content overlap." (One needs to be careful, however, of eliminating too many indicators, given that MacCallum et al. [1996] have shown that the power to detect good-fitting models is substantially reduced as degrees of freedom decline.)
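The trade-off between parceling and degrees of freedom can be sketched as follows. Both helper functions are hypothetical illustrations. The degrees-of-freedom count assumes a simple confirmatory factor model in which each indicator loads on exactly one factor, one loading per factor is fixed to set the scale, measurement errors are uncorrelated, and all factors are free to correlate. Other specifications would give different counts.

```python
def make_parcels(item_scores, parcel_size=2):
    """Average consecutive groups of items into parcels for one respondent."""
    parcels = []
    for i in range(0, len(item_scores), parcel_size):
        group = item_scores[i:i + parcel_size]
        parcels.append(sum(group) / len(group))
    return parcels


def cfa_degrees_of_freedom(n_indicators, n_factors):
    """Degrees of freedom for the simple CFA described in the lead-in."""
    # Unique variances and covariances among the observed indicators.
    moments = n_indicators * (n_indicators + 1) // 2
    # Free loadings (p - k) + error variances (p) + factor variances (k)
    # + factor covariances k(k - 1)/2, which simplifies to 2p + k(k - 1)/2.
    free_parameters = 2 * n_indicators + n_factors * (n_factors - 1) // 2
    return moments - free_parameters


# One respondent's answers to four items measuring the same construct.
print(make_parcels([5, 4, 3, 3]))  # [4.5, 3.0]

# Three constructs with four indicators each, before and after pairing.
print(cfa_degrees_of_freedom(12, 3))  # 51
print(cfa_degrees_of_freedom(6, 3))   # 6
```

Under these assumptions, pairing the indicators halves their number but cuts the degrees of freedom from 51 to 6, which illustrates why aggressive trimming can undermine the power to evaluate model fit.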

An additional limitation on using SEM has been the complex set of procedures (based on matrix algebra) one has had to go through to input the model to be tested—accompanied by output that was also less than user-friendly. Fortunately, newer versions of several SEM programs can now use causal diagrams for both input and output of the structural and measurement models. Indeed, these simple diagramming options threaten to surpass the user-friendliness of the more popular mainstream software packages (e.g., Statistical Package for the Social Sciences [SPSS]) that implement the more traditional statistical procedures. This statement is not meant to be sanguine, however, about the knowledge and experience required, and the care one must exercise, in using SEM packages. In this regard, Kline (1998) has a particularly excellent summary of "how to fool yourself with SEM." The reader is also encouraged to read Joreskog (1993) on how to use SEM to build more complex models from simpler models.

As the preceding discussion implies, using multiple-indicator models requires more thought and more complicated procedures than does using more common data analytic procedures. However, given the serious distortions that measurement errors can produce in estimating the true causal links among concepts, the extra effort in using multiple-indicator models can pay large dividends.

## References

Arbuckle, J. L. 1996 "Full Information Estimation in the Presence of Incomplete Data." In G. Marcoulides and R. Schumacker, eds., *Advanced Structural Equation Modeling: Issues and Techniques*. Mahwah, N.J.: Lawrence Erlbaum.

——1997 *Amos User's Guide Version 3.6*. Chicago: SPSS.

Austin, J., and R. Calderon 1996 "Theoretical and Technical Contributions to Structural Equation Modeling: An Updated Annotated Bibliography." *Structural Equation Modeling* 3:105–175.

——, and L. Wolfe 1991 "Annotated Bibliography of Structural Equation Modeling: Technical Work." *British Journal of Mathematical and Statistical Psychology* 44:93–152.

Bentler, P. M. 1995 *EQS: Structural Equations Program Manual*. Encino, Calif.: Multivariate Software.

Blalock, H. M. 1979 *Social Statistics*, 2nd ed. New York: McGraw-Hill.

——, ed. 1985 *Causal Models in Panel and Experimental Designs*. New York: Aldine.

Bollen, K. A. 1989 *Structural Equations with Latent Variables*. New York: John Wiley.

Diener, E., R. A. Emmons, R. J. Larsen, and S. Griffin 1985 "The Satisfaction with Life Scale." *Journal of Personality Assessment* 49:71–75.

Duncan, T. E., and S. C. Duncan 1995 "Modeling the Processes of Development via Latent Variable Growth Curve Methodology." *Structural Equation Modeling* 2:187–213.

Floyd, F. J., and K. F. Widaman 1995 "Factor Analysis in the Development and Refinement of Clinical Assessment Instruments." *Psychological Assessment* 7:286–299.

Haring, M. J., W. A. Stock, and M. A. Okun 1984 "A Research Synthesis of Gender and Social Class as Correlates of Subjective Well-Being." *Human Relations* 37: 645–657.

Joreskog, K. G. 1993 "Testing Structural Equation Models." In K. A. Bollen and J. S. Long, eds., *Testing Structural Equation Models*. Thousand Oaks, Calif.: Sage.

——, and D. Sorbom 1993 *Lisrel 8: User's Reference Guide*. Chicago: Scientific Software.

Kim, J., and C. W. Mueller 1978 *Introduction to Factor Analysis: What It Is and How to Do It*. Beverly Hills, Calif.: Sage.

Kline, R. B. 1998 *Principles and Practice of Structural Equation Modeling*. New York: Guilford.

Larson, R. 1978 "Thirty Years of Research on the Subjective Well-Being of Older Americans." *Journal of Gerontology* 33:109–125.

Loehlin, J. C. 1998 *Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis*, 3rd ed. Hillsdale, N.J.: Lawrence Erlbaum.

Marcoulides, G. A., and R. E. Schumacker, eds. 1996 *Advanced Structural Equation Modeling: Issues and Techniques*. Mahwah, N.J.: Lawrence Erlbaum.

Muthen, B. O. 1993 "Goodness of Fit with Categorical and Other Nonnormal Variables." In K. A. Bollen and J. S. Long, eds., *Testing Structural Equation Models*. Thousand Oaks, Calif.: Sage.

——1998 *Mplus User's Guide: The Comprehensive Modeling Program for Applied Researchers*. Los Angeles, Calif.: Muthen and Muthen.

Nunnally, J. C., and I. H. Bernstein 1994 *Psychometric Theory*, 3rd ed. New York: McGraw-Hill.

Randall, R. E., and G. A. Marcoulides, eds. 1998 *Interaction and Nonlinear Effects in Structural Equation Modeling*. Mahwah, N.J.: Lawrence Erlbaum.

Reise, S. P., K. F. Widaman, and R. H. Pugh 1993 "Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance." *Psychological Bulletin* 114:552–566.

Schafer, J. L., and M. K. Olsen 1998 "Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective." Unpublished manuscript, Pennsylvania State University. http://www.stat.psu.edu/~jls/index.html/#res.

Sullivan, J. L., and S. Feldman 1979 *Multiple Indicators: An Introduction*. Beverly Hills, Calif.: Sage.

Waller, N. G., A. Tellegen, R. P. McDonald, and D. T. Lykken 1996 "Exploring Nonlinear Models in Personality Assessment: Development and Preliminary Validation of a Negative Emotionality Scale." *Journal of Personality* 64:545–576.

Kyle Kercher