Path analysis is a widely used technique for modeling plausible sets of causal relations among three or more observed variables. In the social sciences it has been applied especially in sociology and also in psychology (most notably in child or lifespan development and other longitudinal research); elsewhere, it has proven useful in biology, particularly in genetics (including behavioral genetics).
Path models are typically represented in the form of path diagrams, but can also be modeled as a set of regression equations. A path model can include any number of independent (or exogenous) variables, any number of dependent (endogenous) variables, and any number of intermediate variables, which are both dependent on some variables and predictive of others. In a path diagram, each variable is represented. The hypothesized links among variables are shown by arrows, representing predictive or correlational relations.
In most cases all of the exogenous variables are modeled with all possible correlations among them represented. A failure to include such a correlation would in effect be a hypothesis that that correlation equals zero, which is rarely applicable to measured variables. Endogenous variables (those with predictive arrows, or paths, leading to them) cannot be included in correlational relations. Typically each endogenous variable will have one additional path leading to it from an unspecified source, representing all sources of variance in the endogenous variable that are not already modeled, called the residual or the disturbance. The absence of this arrow indicates a hypothesis that all of the variance is accounted for in the model which, again, is rarely the case. These residual variance sources are unmeasured exogenous variables that can be correlated with each other or with observed exogenous variables.
When the path model has been established, the next step is to estimate the path coefficients. In conventional multiple linear regression, two or more independent (or exogenous) variables are modeled as predicting one dependent (endogenous) variable. The coefficients derived in multiple regression are partial regression coefficients: the regression of the dependent variable on each independent variable, holding the other independent variables constant. Path analysis is much the same. Each path coefficient is a partial regression coefficient: the regression of the specified endogenous variable on the specified “upstream” variable, controlling for the other variables that have paths leading to the endogenous variable. The path coefficients are therefore interpreted as the change in the downstream variable per unit change in the upstream variable, holding constant the other variables with paths leading to that downstream variable. Certain hypotheses involving the path coefficients can be tested as in regression, such as the null hypothesis that the path coefficient equals zero, which is tested by the ratio of the coefficient to its standard error. In fact, if the path model is a recursive system, such that the predictions of downstream variables from upstream can be depicted in a block triangular matrix (where the matrix elements are the effects of the upstream variables [columns] on the downstream variables [rows], so that no variable is both upstream and downstream of a given other variable), then the path analysis can be conducted entirely as a series of sequential ordinary least squares regression analyses. In more complex models, other algorithms such as maximum likelihood are needed.
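The sequential-regression approach for a recursive model can be illustrated with a minimal sketch. The three-variable model (x leading to m, and both x and m leading to y), the simulated data, the true coefficient values, and the helper name `ols_paths` are all assumptions made for illustration; they do not come from any particular study or software package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical recursive model: x -> m, x -> y, m -> y.
# The coefficients 0.5, 0.3, and 0.4 are arbitrary illustrative values.
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + rng.normal(size=n)


def ols_paths(outcome, *predictors):
    """Regress outcome on the predictors by ordinary least squares.

    Returns the partial regression coefficients, one per predictor,
    with the intercept dropped.
    """
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1:]


# One regression per endogenous variable, working downstream.
(a,) = ols_paths(m, x)   # path x -> m
c, b = ols_paths(y, x, m)  # paths x -> y and m -> y

print("x -> m:", round(a, 3))
print("x -> y:", round(c, 3))
print("m -> y:", round(b, 3))
```

Each endogenous variable is regressed only on the variables with paths leading to it, so the estimates for y are partial coefficients that control for the other predictor of y.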
More commonly the path model is estimated in software for structural equation modeling. Structural equation modeling is an extension of path analysis, in which the paths of interest are typically among latent (unmeasured) variables, or factors, with an explicit measurement model linking the factors to observed variables. As a special case of structural equation models, path models can easily be fit in the more sophisticated software.
An important benefit of such software is that it calculates a number of indices of fit of the model. A fitted path model allows for the calculation, from the various path coefficients and estimated variances, covariances, and residual covariances, of a model-implied variance/covariance matrix of the original variables—that is, a covariance matrix that is consistent with the fitted model. Broadly speaking, the fit of the model is an assessment of how well the model, with its estimated coefficients, implies a covariance matrix that matches the original matrix from the data. If there is no significant discrepancy, as measured by a chi-squared statistic, then it may be concluded that the path model is consistent with the data. Note that this does not necessarily indicate that the model is an accurate depiction of causation in reality, but only that it is not inconsistent with reality as indicated by the covariance matrix.
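As a rough sketch of how a model-implied covariance matrix and the chi-squared statistic can be obtained, the example below fits a restricted three-variable model (x leading to m, m leading to y, with no direct path from x to y) to a made-up covariance matrix `S`. The matrix, the sample size `n_obs`, and the variable ordering are assumptions for illustration. The implied matrix is computed with the common formulation Σ = (I − B)⁻¹ Ψ (I − B)⁻ᵀ, and the statistic as (N − 1) times the maximum likelihood discrepancy; some programs multiply by N instead.

```python
import numpy as np

# Hypothetical observed covariance matrix, variables ordered [x, m, y].
S = np.array([
    [1.00, 0.50, 0.50],
    [0.50, 1.25, 0.65],
    [0.50, 0.65, 1.41],
])
n_obs = 200  # assumed sample size

# Path coefficients for the restricted model x -> m -> y.
# In a recursive model with uncorrelated disturbances these
# equation-by-equation estimates coincide with the ML estimates.
a = S[0, 1] / S[0, 0]  # regression of m on x
b = S[1, 2] / S[1, 1]  # regression of y on m

# B[row, col] holds the effect of the column variable on the row variable.
B = np.zeros((3, 3))
B[1, 0] = a
B[2, 1] = b

# Variance of x and the two residual (disturbance) variances.
Psi = np.diag([S[0, 0], S[1, 1] - a * S[0, 1], S[2, 2] - b * S[1, 2]])

# Model-implied covariance matrix.
inv = np.linalg.inv(np.eye(3) - B)
Sigma = inv @ Psi @ inv.T

# Maximum likelihood discrepancy and chi-squared statistic.
p = S.shape[0]
F = (np.log(np.linalg.det(Sigma)) - np.log(np.linalg.det(S))
     + np.trace(S @ np.linalg.inv(Sigma)) - p)
chi2 = (n_obs - 1) * F
df = p * (p + 1) // 2 - 5  # 6 distinct moments minus 5 free parameters

print(np.round(Sigma, 3))
print("chi-squared:", round(chi2, 2), "df:", df)
```

In this sketch the model reproduces every element of `S` except the covariance between x and y, which it implies to be a·b·Var(x); that single discrepancy is what the one-degree-of-freedom chi-squared statistic measures.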
There has been, however, much debate over the utility of the chi-squared statistic. It is widely agreed that a non-significant chi-squared results in a failure to reject the model. However, there may be cases where the chi-squared statistic is sensitive to small deviations between the actual and implied covariance matrices that are not of practical importance to the researcher; this is especially true when the sample size is large. As a result, numerous statistics have been developed to assess approximate or close fit. Among the more prominent of these are the Comparative Fit Index, the Tucker-Lewis Index, and the Root Mean Squared Error of Approximation.
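The three indices named above can be computed from the chi-squared statistics of the fitted model and of a baseline (independence) model. The sketch below uses commonly cited formulas; the function names and the example values are hypothetical, and software packages differ in details such as whether N or N − 1 appears in the RMSEA denominator.

```python
import math


def rmsea(chi2, df, n_obs):
    """Root Mean Squared Error of Approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index of a model (m) against a baseline (b)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis Index of a model (m) against a baseline (b)."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)


# Made-up values for illustration only.
chi2_model, df_model = 12.0, 5
chi2_base, df_base = 250.0, 10
n_obs = 300

print("RMSEA:", round(rmsea(chi2_model, df_model, n_obs), 3))
print("CFI:  ", round(cfi(chi2_model, df_model, chi2_base, df_base), 3))
print("TLI:  ", round(tli(chi2_model, df_model, chi2_base, df_base), 3))
```

All three indices are built from the excess of chi-squared over its degrees of freedom, which is why they are less sensitive than the raw test to the trivially small discrepancies that large samples detect.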
The test of model fit is one of the primary results of fitting a path model. Other hypotheses of interest in path analysis frequently involve constraints on the path coefficients: for example, that two path coefficients are equal to each other, or that a set of three coefficients are all equal to zero. These can readily be tested in structural equation modeling software by the estimation of nested models. In this situation the fit of a full model (without the constraints) is compared with that of a restricted model (with the constraints applied in the estimation process). Two such models are nested if the restricted model can be created strictly by imposing constraints on the full model. If the full model fits the data well, then the difference between the chi-squared statistics for the two models is itself distributed as a chi-squared, with degrees of freedom equal to the number of constraints applied. A significant chi-squared difference indicates that the restricted model fits significantly less well than the full model.
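The nested-model comparison amounts to a short calculation once both models have been fitted. The sketch below is illustrative only: the function names and the fit statistics are hypothetical, and the upper-tail probability is computed with a closed-form series that is valid for integer degrees of freedom, so that no statistics library is required.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def chi2_difference_test(chi2_full, df_full, chi2_restricted, df_restricted):
    """Return the chi-squared difference, its df, and the p-value."""
    delta = chi2_restricted - chi2_full
    delta_df = df_restricted - df_full  # number of constraints imposed
    return delta, delta_df, chi2_sf(delta, delta_df)


# Made-up fit statistics: the restricted model adds two constraints.
delta, delta_df, p_value = chi2_difference_test(3.2, 2, 12.9, 4)
print("difference:", round(delta, 2), "df:", delta_df, "p:", round(p_value, 4))
```

The restricted model always has at least as large a chi-squared and more degrees of freedom than the full model, so both differences are non-negative when the models are properly nested.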
Path analysis has found utility in a number of areas of the social sciences. Two areas in which it is most prominently featured are behavioral genetics and longitudinal research. In behavioral genetics, one of the most common tools for evaluating heritability is the ACE model, a path model which allows partitioning of variability in twin studies into additive genetic effects, common (shared) environmental effects, and non-shared environmental effects. In child development or lifespan development (or any longitudinal study), path analysis is well suited because the causal direction between variables is typically unambiguous: the hypothesized causation runs from the temporally prior variable. Finally, path analysis can be especially useful in intervention studies to identify mechanisms that may mediate the effects of the intervention on the outcomes.
SEE ALSO Research, Longitudinal; Structural Equation Models
Patrick S. Malone
A hypothetical example is provided in the diagram below, which indicates the causal connections between the four variables of father's occupation, father's education, respondent's education, and respondent's occupation. In this model, social origins are placed before respondent's educational achievement, which is in turn placed before his or her occupational attainment. This technique attaches quantitative estimates to the causal connections in question, although it does not actually establish causality, since the pattern of relationships between the variables is entirely dependent on the researcher's own judgements about the likely causality among the variables. Where it is impossible to specify directionality, variables are deemed to be correlates (as in the case of social background and parental education above), and the link between them is conventionally described by a curved arrow having two heads.
The principal advantage of path analysis is that it allows the researcher to estimate the relative influence of variables within a causal network. The obvious disadvantage is that the model depends upon the researcher's own conception of the likely causal sequences involved, and since this cannot be validated or invalidated by the analysis, misleading path diagrams are sometimes produced. See also MULTIVARIATE ANALYSIS.