A study is valid when it actually measures what it claims to measure and when there are no logical errors in drawing conclusions from the data. There are many labels for different types of validity, but all concern threats that undermine the meaningfulness of research. Some early writers simply equated validity with establishing that a construct’s scale correlated with a dependent variable in the intended manner; indeed, a scale might be considered valid as a measure of anything with which it correlated (Guilford 1946). Types of validity were codified in 1954 by the American Psychological Association (APA), which identified four categories: content validity, construct validity, concurrent validity, and predictive validity. Each type corresponded to a different research purpose: content validity had to do with subject-matter content testing; construct validity with measuring abstract concepts like IQ; concurrent validity with devising new scales or tests to replace existing ones; and predictive validity with devising indicators of future performance. A 1966 update to the APA typology combined the last two types under the label criterion-related validity. Later, Lorrie Shepard (1993) was among those who argued that both criterion and content validity were subtypes of construct validity, leaving construct validity as the only type.
The unified view of validity supported the notion that only rarely could a researcher establish validity with reference to a single earlier type. Moreover, Lee Cronbach’s (1971, p. 447) earlier argument that validity could not be established for a test or a scale, but only for interpretations that researchers might make from a test or a scale, also became widely accepted in the current era. Some researchers, such as Samuel Messick (1989), accept construct validity as the only type of validity but argue for multiple standards for assessing it, including relevant content based on sound theory or rationale, internally consistent items, external correlation with related measures, generalizability across populations and time, and explicit attention to social consequences (e.g., racial bias). In a nutshell, since about the mid-twentieth century, the concept of validation has evolved from establishing correlation with a dependent variable to the idea that researchers must validate each interpretation of each scale, test, or instrument, and do so in multiple ways that, taken together, form the whole of what validity testing is about.
The outline below largely accepts the unified view of validity, centering on construct validity, but adds to it separate coverage in three areas: (1) content validity, focusing on the labeling of constructs; (2) internal validity, focusing on research design bias; and (3) statistical validity, focusing on meeting the assumptions of empirical procedures. While all three might be (and by some are) considered subtypes of construct validity, they do not fall neatly in its two major subdomains—convergent and discriminant validity—and so have been treated here separately.
Under construct validity (or factorial validity), a good construct has a theoretical basis that is translated through clear operational definitions involving measurable indicators. A poor construct may be characterized by lack of theoretical agreement on content, or by flawed operationalization such that its indicators may be construed as measuring one thing by one researcher and another thing by a second researcher. To the extent that a proposed construct is at odds with the existing literature on related hypothesized relationships using other measures, its construct validity is suspect. The more a construct is used, in more settings and with more outcomes consistent with theory, the greater its construct validity.
Researchers should establish both main types of construct validity: convergent and discriminant. Convergent validity is assessed by the correlation among items that make up the scale or the instrument measuring a construct (internal-consistency validity); by the correlation of the given scale with measures of the same construct using scales and instruments proposed by other researchers and, preferably, already accepted in the field (criterion validity); and by the correlation of relationships involving the given scale across samples (e.g., racial tolerance using subject data versus spousal data) or across methods (e.g., survey data versus archival data).
Internal-consistency validity seeks to establish at least moderate correlation among the indicators for a concept. Cronbach’s alpha is commonly used, with .60 considered acceptable for exploratory purposes, .70 adequate for confirmatory purposes, and .80 good for confirmatory purposes. Other tests used to demonstrate convergent validity include demonstrating a simple factor structure, employing the one-parameter logistic models developed by Georg Rasch (1960), or using the average variance extracted (AVE) method developed by Claus Fornell and David Larcker (1981).
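As a concrete illustration, Cronbach’s alpha for a k-item scale is k/(k − 1) × (1 − Σ item variances / variance of the summed scale). The following is a minimal sketch in plain Python; the items and respondent data are invented for illustration and are not from the text:

```python
def cronbach_alpha(items):
    """Cronbach's alpha. items: list of equal-length lists, one per scale item."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance, used consistently throughout
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]  # summed scale
    return (k / (k - 1)) * (1 - item_vars / var(totals))

# Three hypothetical 5-point Likert items answered by six respondents.
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [5, 4, 2, 4, 3, 5],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # prints 0.87 -- "good" by the thresholds above
```

By the conventions cited above, an alpha of .87 would be adequate even for confirmatory purposes.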
Criterion validity (or concurrent validity) has to do with the correlation between measurement items and accepted measures. Ideally, the criteria are direct, objective measures of what is being assessed (e.g., how well does survey-reported voting correlate with actual voting in voting records?), but correlation with well-accepted related scales is an alternative criterion.
External validity has to do with possible bias in the process of generalizing conclusions from a sample to a population, to other subject populations, to other settings, or to other time periods. The questions raised include: “Are findings using the construct scale consistent across samples?” and “To what population does the researcher wish to generalize conclusions, and is there something unique about the study sample’s subjects—the place where they lived and worked, the setting in which they were involved, the times of the study—that would prevent valid generalization?” When a sample is nonrandom in unknown ways, the likelihood of external validity is low, as in the case of convenience samples. External validity may be increased by cross-validation, where the researcher develops the instrument on a calibration sample and then tests it on an independent validation sample.
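The cross-validation step can be sketched as follows. Everything here is an illustrative assumption rather than part of the original text: the simulated data, the 50/50 split, and the use of a simple Pearson correlation between scale and criterion as the relationship being checked across samples:

```python
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hypothetical data: scale and criterion both partly track a latent trait.
trait = [random.gauss(0, 1) for _ in range(200)]
scale = [t + random.gauss(0, 0.5) for t in trait]
criterion = [t + random.gauss(0, 0.5) for t in trait]

# Random split into a calibration half and an independent validation half.
idx = list(range(200))
random.shuffle(idx)
calib, valid = idx[:100], idx[100:]

r_calib = pearson_r([scale[i] for i in calib], [criterion[i] for i in calib])
r_valid = pearson_r([scale[i] for i in valid], [criterion[i] for i in valid])
# If r_valid is close to r_calib, the scale-criterion relationship
# generalizes beyond the sample on which the instrument was developed.
```

The design point is that the validation half plays no role in developing the instrument, so agreement between the two correlations is evidence of generalizability rather than of overfitting.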
Discriminant validity, the second major type of construct validity, refers to the principle that the indicators for different constructs should not be so highly correlated as to lead one to conclude that they measure the same thing. This could happen if there is definitional overlap between constructs.
Discriminant validity analysis may include correlational methods, factor methods (Straub 1989), the AVE method, and structural equation modeling (SEM) approaches. In confirmatory factor analysis within SEM, if goodness-of-fit measures for the measurement model are adequate, the researcher concludes that the constructs in the model differ. A more rigorous and widely accepted SEM-based alternative is to run the model twice: once unconstrained and once with the correlation between the constructs constrained to 1.0. If the two models do not differ significantly on a chi-square difference test, the researcher cannot conclude that the constructs differ (Bagozzi et al. 1991).
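The decision logic of the chi-square difference test can be sketched as follows. The fit statistics are invented for illustration; in practice they come from SEM software, and 3.84 is the standard chi-square critical value at the .05 level for one degree of freedom:

```python
CRITICAL_05_DF1 = 3.84  # chi-square critical value, alpha = .05, df = 1

def discriminant_supported(chi2_unconstrained, df_unconstrained,
                           chi2_constrained, df_constrained):
    """Constrained model fixes the inter-construct correlation at 1.0.

    Returns True when the constrained model fits significantly worse,
    i.e., when the constructs can be judged distinct.
    """
    delta_chi2 = chi2_constrained - chi2_unconstrained
    delta_df = df_constrained - df_unconstrained
    assert delta_df == 1, "sketch handles the single-constraint case only"
    return delta_chi2 > CRITICAL_05_DF1

# Hypothetical fit statistics for the two nested models:
print(discriminant_supported(48.2, 26, 61.7, 27))  # True: constructs differ
print(discriminant_supported(48.2, 26, 50.1, 27))  # False: cannot conclude they differ
```

Note the asymmetry: a non-significant difference does not prove the constructs are identical; it only fails to support their distinctness.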
Content validity (or face validity) exists when items measure the full domain indicated by their label and description. A naming fallacy exists when indicators display construct validity, yet the label attached to the concept is inappropriate (e.g., satisfaction with outcomes is measured but is labeled as effectiveness of outcomes). A domain fallacy exists when indicators are restricted in value (e.g., “monetary incentives” may be the label of an indicator in a small group simulation, but the indicator would be more accurately labeled “small monetary incentives” due to restricted range; the label “large monetary incentives” may have a very different effect).
Internal validity concerns defending against sources of bias arising in research design. When there is lack of internal validity, variables other than the independent(s) being studied may be responsible for part or all of the observed effect on the dependent variable(s). If there is no causal phenomenon under study, internal validity is not at issue.
Common issues related to internal validity are:
Hawthorne effect (experimenter expectation): The expectations or actions of the investigator may contaminate the outcomes.
Mortality bias: Attrition of subjects later in the research process may render the final sample no longer representative.
Selection bias: The subjects may not reflect a random sample, and when multiple groups are studied, there can be differential selection of the groups associated with differential biases with regard to history, maturation, testing, mortality, regression, and instrumentation (i.e., selection may combine differentially with other threats to validity).
Evaluation apprehension: Study sponsorship, phrasing of the questions, and other steps taken by the researcher may not suffice to mitigate the natural apprehension of subjects, encouraging a bias toward responses the researcher is thought to want to hear.
Special problems involving control groups include:
Control awareness: If the control group is aware it is not receiving the experimental treatment, it may exhibit compensatory rivalry, resentful demoralization, or other traits that may contaminate study results.
Compensatory equalization of treatments: Researchers may compensate for the control group’s lack of benefit from treatment by providing some other benefit, such as alternative experiences, thereby introducing unmeasured variables.
Unintended treatments: Researcher attention, the status of the testing locale, and other testing experiences may constitute unmeasured variables.
Likewise, special problems exist for before-after and time series studies:
Instrumentation change: Measurement of variables may shift in before-after studies, as when the observers, through experience, become more adept at measurement.
History: Intervening events that are not part of the study may occur between measurement intervals, affecting results.
Maturation: Invalid inferences may be made when the maturation of subjects between intervals has an effect.
Regression toward the mean: If subjects are chosen because they are above or below the mean, there is a statistical tendency that they will be closer to the mean on remeasurement, regardless of the intervention.
Test experience: The before-study impacts the after-study in its own right, or multiple measurement of a concept leads to familiarity with the items and hence a history or fatigue effect.
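The regression-toward-the-mean threat listed above can be demonstrated with a small simulation. The score model (a stable true score plus transient measurement error) and all numbers are illustrative assumptions:

```python
import random

random.seed(1)
# Hypothetical model: observed score = stable true score + transient noise.
true_scores = [random.gauss(100, 10) for _ in range(10000)]
pretest = [t + random.gauss(0, 10) for t in true_scores]
posttest = [t + random.gauss(0, 10) for t in true_scores]

# Select "extreme" subjects: pretest at least one observed SD above the mean.
mean_pre = sum(pretest) / len(pretest)
sd_pre = (sum((x - mean_pre) ** 2 for x in pretest) / len(pretest)) ** 0.5
chosen = [i for i, x in enumerate(pretest) if x > mean_pre + sd_pre]

mean_chosen_pre = sum(pretest[i] for i in chosen) / len(chosen)
mean_chosen_post = sum(posttest[i] for i in chosen) / len(chosen)
# With no intervention at all, the selected group's mean falls back
# toward the population mean of 100 on remeasurement, because part of
# their extreme pretest scores was transient error.
```

A naive before-after comparison on such a selected group would credit the "improvement" (or decline) to the intervention when it is purely statistical.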
Statistical validity concerns basing conclusions on a proper use of statistics, and in particular whether the assumptions of statistical procedures are met (e.g., normality, homoscedasticity, independence, and other traits may be required). Statistical invalidity also occurs when the researcher has not properly specified the model, has not taken interaction and nonlinear effects into account, or has misinterpreted the causal direction of relationships.
When significance tests are employed, they may be invalid if data are not randomly sampled, if an inappropriate alpha level has been selected (e.g., .05 is common in social science but is too liberal for medical research), if the test has an inadequate power level, or if a post hoc “shotgun” approach is used in which large numbers of relationships are examined without taking into account that multiple a posteriori tests require a higher operational alpha significance level to achieve the same nominal level.
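One standard way to raise the operational alpha for multiple a posteriori tests is the Bonferroni correction, with the Šidák correction as a sharper variant for independent tests; neither is named in the text, so this is an illustrative sketch rather than the author’s own prescription:

```python
def bonferroni_alpha(nominal_alpha, n_tests):
    """Per-test threshold keeping the familywise error rate <= nominal_alpha."""
    return nominal_alpha / n_tests

def sidak_alpha(nominal_alpha, n_tests):
    """Exact familywise control when the tests are independent."""
    return 1 - (1 - nominal_alpha) ** (1 / n_tests)

# A post hoc "shotgun" scan of 20 relationships at a nominal .05 level
# requires each individual test to meet a much stricter threshold:
print(round(bonferroni_alpha(0.05, 20), 4))  # prints 0.0025
print(round(sidak_alpha(0.05, 20), 4))
```

Without such an adjustment, examining 20 relationships at .05 apiece would be expected to yield about one spuriously "significant" result by chance alone.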
SEE ALSO Regression Analysis; Regression Towards the Mean; Sample Attrition; Scales; Selection Bias; Structural Equation Models; Test Statistics
American Psychological Association. 1954. Technical Recommendations for Psychological Tests and Diagnostic Techniques. Psychological Bulletin 51 (2, suppl.): 201–238.
American Psychological Association. 1966. Standards for Educational and Psychological Tests and Manuals. Washington, DC: Author.
Bagozzi, Richard P., Youjae Yi, and Lynn W. Phillips. 1991. Assessing Construct Validity in Organizational Research. Administrative Science Quarterly 36 (3): 421–458.
Campbell, Donald T., and Julian C. Stanley. 1963. Experimental and Quasi-experimental Designs for Research. Chicago: Rand-McNally.
Carmines, Edward G., and Richard A. Zeller. 1979. Reliability and Validity Assessment. Newbury Park, CA: Sage.
Cook, Thomas D., and Donald T. Campbell. 1979. Quasi-experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin.
Cronbach, Lee J. 1971. Test Validation. In Educational Measurement, ed. Robert L. Thorndike, 2nd ed., 443–507. Washington, DC: American Council on Education.
Fornell, Claus, and David F. Larcker. 1981. Evaluating Structural Equation Models with Unobservable Variables and Measurement Error. Journal of Marketing Research 18 (1): 39–50.
Guilford, Joy P. 1946. New Standards for Test Evaluation. Educational and Psychological Measurement 6 (5): 427–439.
Messick, Samuel. 1989. Validity. In Educational Measurement, ed. Robert L. Linn, 13–103. 3rd ed. New York: American Council on Education and Macmillan.
Rasch, Georg. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research. (Expanded ed. Chicago: University of Chicago Press, 1980.)
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Shepard, Lorrie A. 1993. Evaluating Test Validity. In Review of Research in Education, ed. Linda Darling-Hammond, vol. 19, 405–450. Washington, DC: American Educational Research Association.
Straub, Detmar W. 1989. Validating Instruments in MIS Research. MIS Quarterly 13 (2): 147–166.
G. David Garson