
Secondary Data Analysis and Data Archives


The creation and growth of publicly accessible data archives (or data banks) have revolutionized the way sociologists conduct research. These resources have made possible a variety of secondary analyses, often utilizing the data in ways never anticipated by their creators. Traditionally, secondary data analysis involves the use of an available data resource by researchers to study a problem different from the one treated in the original analysis. For example, a researcher might conduct a survey of workers' reactions to technological change and analyze those data to evaluate whether the workers welcome or resist such change in the workplace. As a matter of secondary interest, the researcher also collects data on workers' perceptions of the internal labor-market structures of their firms. She then lends those data to a colleague who studies the determinants of (workers' perceptions of) job-ladder length and complexity in order to understand workers' views on prospects for upward mobility in their places of employment. The latter investigation is a secondary analysis.

More recently, however, the definition of a secondary analysis has expanded as more data sets have been explicitly constructed with multiple purposes and multiple users in mind. The creators, or principal investigators, exercise control over the content of a data set but are responsive to a variety of constituencies that are likely to use that resource. The creators may undertake analyses of the data, addressing questions of intellectual interest to themselves while simultaneously releasing the data to the public or depositing the data resource in an archive. Data archives are depositories where data produced by a number of investigators are available for secondary analyses. The data bank generally takes responsibility for providing documentation on the data sets and other information needed for their use. The term "data archive" also refers more generally to any source of data already produced that an investigator may uncover in the course of an investigation, such as government or business records housed in libraries. For example, the U.S. government archives thousands of government documents yearly in libraries around the world. The data in those documents cover a wide variety of topics and are often useful in sociological investigations. It remains the responsibility of the analyst to configure the data in a way that is useful to his or her investigation. This entry illustrates these expanded opportunities by describing one key data archive and indicating the extent and breadth of data resources that this and other archives include. It then describes the process of conducting secondary analyses from resources such as these.


One of the most important data archives for social scientists is the Interuniversity Consortium for Political and Social Research (ICPSR) at the University of Michigan, Ann Arbor. The ICPSR publishes an annual Guide to Resources and Services (much of this description was taken from the 1996–1997 volume). Additional information is available at the ICPSR Web site. The consortium was founded in 1962 as a partnership between the Survey Research Center at the University of Michigan and twenty-one U.S. universities. In 1997 the holdings included over 3,500 titles, some of them capturing several panels of data on the same respondents or several waves of data involving comparable information. These titles are available to researchers at member institutions. The consortium charges fees on a sliding scale to academic institutions for membership privileges; researchers whose institutions are not members can obtain data for a fee. In 1997, over four hundred institutions in the United States, Canada, and countries throughout the world were members. While ICPSR originated as a service to political analysts, it currently serves a broad spectrum of the social sciences, including economics, sociology, geography, psychology, and history, and its data resources have been used by researchers in education, social work, foreign policy, criminal justice, and urban affairs.

Although ICPSR provides training in research and statistical methods and helps members in the effective use of computing resources, its central function is the archiving, processing, and distribution of machine-readable data of interest to social scientists. Although data capturing elements of the U.S. political process are well represented in its holdings, data are available on consumer attitudes, educational processes and attainment, health care utilization, social indicators of the quality of American life, employment conditions, workers' views on technology, and criminal behavior. The data come from over 130 countries, include both contemporary and historical censuses, and are not confined to the individual level but also provide information on the characteristics of nations and organizational attributes. ICPSR actively seeks out high-quality data sets, and the user fees finance additional data acquisition as well as other operations. It also encourages investigators to deposit their data holdings in the archives to make them available to researchers for secondary analyses. Researchers whose data production efforts are funded by federal agencies such as the National Science Foundation are required to make their data publicly available after their grants have expired, and ICPSR is a logical depository for many data sets produced in the social sciences.

ICPSR maintains over ninety serial data holdings, including the earlier waves of the National Longitudinal Surveys of Labor Market Experience (NLS) (discussed below), the Survey of Income and Program Participation, the General Social Surveys, National Crime Surveys, the Panel Study of Income Dynamics, the Detroit Area Studies, the U.S. Census of Population and Housing, and the American National Elections Studies. These serial holdings include longitudinal surveys (in which the same respondents are interviewed repeatedly over time) such as the NLS and the Panel Study of Income Dynamics. These resources are particularly useful in determining the impact of earlier life events on later life outcomes, since the causal orders of all events measured on the data sets are clearly indicated. The holdings also include sets of cross-sectional studies conducted at regular intervals, such as the Detroit Area Studies and the General Social Surveys (GSS). These studies contain different cross sections from the same populations over time and are useful in charting trends in the attitudes of the respective populations over time, assuming that the same questions are repeated. Sources, such as the GSS, that ask the same questions over several years allow the researcher to pool samples across those years and obtain larger numbers of cases that are useful in multivariate analyses.
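The pooling of repeated cross sections described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records, the field name "favors," and the years are hypothetical and are not drawn from the GSS or any other archived study. The sketch assumes that the item was asked with identical wording in each year.

```python
from collections import Counter

# Hypothetical cross sections: one list of respondent records per survey year.
# The field name "favors" and all values are illustrative, not GSS data.
cross_sections = {
    1990: [{"favors": 1}, {"favors": 0}, {"favors": 0}],
    1991: [{"favors": 1}, {"favors": 1}, {"favors": 0}, {"favors": 0}],
}

def pool(sections):
    """Stack yearly samples into one file, tagging each case with its year."""
    pooled = []
    for year, cases in sorted(sections.items()):
        for case in cases:
            pooled.append({"year": year, **case})
    return pooled

def share_by_year(pooled, item):
    """Proportion answering 1 on `item` in each year (a simple trend series)."""
    totals, hits = Counter(), Counter()
    for case in pooled:
        totals[case["year"]] += 1
        hits[case["year"]] += case[item]
    return {year: hits[year] / totals[year] for year in totals}

pooled = pool(cross_sections)
share = share_by_year(pooled, "favors")
print(len(pooled))   # number of cases available after pooling
print(share)         # per-year proportions for charting a trend
```

Keeping the survey year on every pooled case is a deliberate choice: it lets the analyst chart trends by year and, in a multivariate model, control for differences between survey years rather than treating the pooled file as a single sample.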

To illustrate one data set, consider the National Longitudinal Surveys of Labor Market Experience. These surveys are produced by the Center for Human Resource Research (CHRR) at Ohio State University. The CHRR produces a yearly NLS Handbook, and much of the following information regarding the NLS was taken from the 1998 NLS Handbook. These surveys began in 1966 with a study of older men aged 45–59 and a survey of young men aged 14–24, continued in 1967 with a survey of mature women aged 30–44, and were followed up with a survey of young women aged 14–24 in 1968. In 1979, CHRR began a survey of over 12,000 youths aged 14–22, known as the NLSY79. In 1997, CHRR surveyed a new cohort of over 9,000 youths aged 12–16, called the NLSY97, and is continuing with yearly surveys of this cohort. The six major surveys contain a wealth of data on labor-force experience (e.g., labor-force and employment status, work history, and earnings) as well as investment in education and training, marital status, household composition and fertility, background material on respondents' parents, work-related attitudes, health, alcohol and drug use, and region of residence.

Each of these cohorts has been followed at varying intervals since the surveys' inceptions. For example, the Young Women were surveyed nineteen times between 1968 and 1997. The NLSY79 respondents were surveyed every year until 1994, when surveys in even-numbered years began. The Older Men were surveyed every year until 1983, and they or their widows were resurveyed in 1990. Data production for the Older Men and Young Men is complete; data production for the Mature Women and Young Women is ongoing biennially. In 1986 the NLS added a survey of the children of the NLSY79 cohort's women; that survey describes the social, cognitive, and physiological development of those children and, given the longitudinal nature of the data on the mothers, allows an explanation of these child outcomes in terms of maternal background and current maternal characteristics. Surveys of the children occur in even-numbered years; this accumulated longitudinal database on child outcomes allows important inferences regarding the process of child development, with the numbers of children surveyed far exceeding those in most other sources. This additional resource has expanded NLSY79's usefulness to other disciplines, including psychology, and to other researchers interested in child development.

The NLS data sets are produced with the cooperation of CHRR, NORC (formerly the National Opinion Research Center) at the University of Chicago, and the U.S. Bureau of the Census. For example, for NLSY79, the CHRR takes responsibility for questionnaire construction, documentation, and data dissemination, while NORC has handled sample design, fieldwork, and data reduction. The Census Bureau has handled sample design, fieldwork, and data reduction for the four original cohorts. All data are available on CD-ROM from CHRR. Waves of data prior to 1993 are also available from ICPSR, as was noted above.

Social scientists from several disciplines, including sociology, economics, and industrial relations, have found the NLS to be a critical resource for the study of earnings and income attainment, human capital investment, job searches, fertility, racial and sex discrimination, and the determinants of labor supply. Inferences from these studies have been useful in regard to theory as well as policy formation. Other topics the data resource can usefully inform include family structure and processes, child outcomes, and aging processes. The CHRR estimates that by 1998 over 3,000 articles, books, working papers, and dissertations were produced using the NLS data. The 1998 NLS Handbook provides a wealth of detail regarding the designs of the surveys, survey procedures, variables, and CD availability. It also describes the extensive documentation available on the NLS data sets and lists references to key Web sites, including one that contains NLS publications. This handbook is indispensable for any researcher considering a secondary analysis using NLS data. The CHRR at Ohio State University disseminates the data and provides documentation and assistance to users with questions about the data sets. This summary gives a glimpse of the tremendous potential for secondary analyses of NLS data; this potential is multiplied many times over when one considers the number of other data sets available to researchers.

Because of the increase in resources devoted to survey research in sociology and related social sciences, the ICPSR holdings containing surveys of individuals have grown rapidly. However, ICPSR also archives data produced at varying levels of aggregation, thus facilitating secondary analyses in which the theoretically appropriate units of analysis are countries or organizations. For example, ICPSR archives the World Tables of Economic and Social Indicators, 1950–1992, provided by the World Bank. These data contain economic and social indicators from 183 countries, with the indicators including measures such as gross national product, value of imports and exports, gross national savings, value added across major industrial categories, net direct foreign investment, public long-term debt, international reserves excluding gold, and gold holdings at the London market price. Demographic and social variables include population, total fertility rate, crude birthrate, percentage of the labor force in agriculture, percentage of the labor force that is female, and primary and secondary school enrollment rates. An older data set, also from the World Bank, contains similar measures from 1950 to 1981 as well as additional indicators not included in the data set covering the 1950–1992 period. Because these are also longitudinal data sets, there is the potential for pooling across time variation in these measures across the countries so that cross-sectional and longitudinal variations can be studied simultaneously.

ICPSR also maintains a small number of holdings useful for studying organizational processes. For example, a 1972 study of industrial location decisions obtained from the Economic Behavior Program of the Survey Research Center at the University of Michigan surveyed 173 industrial plants in Detroit, Chicago, and Atlanta. The interviewees were organizational informants such as president, vice president, general manager, and public relations director. The items included reasons for the location of the plant and the advantages and disadvantages of a location; other constructs measured included duration of plant operations, levels of sales and production, production problems, and plans for future expansion.

More recent arguments, however, have suggested that although sociology has invested considerably in surveys of individuals, it has invested insufficiently in surveys of organizations (Freeman 1986; see also Parcel et al. 1991). Kalleberg et al. (1996) present results from the National Study of Organizations, a National Science Foundation–sponsored study of a representative cross section of organizations that addresses their structures, contexts, and personnel practices. Although they demonstrate the utility of this design for addressing some questions regarding organizational functioning, these data cannot address issues of organizational change. A possible solution would be to produce a longitudinal database of organizations. The characteristics of a representative sample of organizations would be produced across time, analogous to the panel data sets of individual characteristics described above. Such a resource would enable researchers to study processes of organizational change with models that allow a clear causal ordering of variables. This type of resource also would permit analyses of pooled cross sections. Most important, the resource would allow organizational theories to be subjected to tests based on a representative sample of organizations, in contrast to the purposive samples that are used more frequently. To date, the resources have not been sufficient to approach the panel design suggested above. Clearly, the capacity to conduct secondary analyses at the organizational level is in its infancy relative to studies of individual-level processes and phenomena.

Finally, ICPSR also archives a variety of data sets that make possible historical analyses of social, economic, and political processes. For example, it archives the Annual Time Series Statistics for the United States, 1929–1968, which includes 280 variables for most of that period, although only 127 variables are available for the period 1947–1968. Available data include population characteristics, measures of political characteristics of the U.S. Congress, business and consumer expenditures, and expenditures by various federal government departments. ICPSR also archives Political Systems Performance Data for France, Sweden, and the United States, 1950–1965, in which the central constructs measured include size of public debt, gross national product (GNP), energy consumption, income tax rates, birthrates and death rates, labor force and unemployment, voting behavior, urbanization, and agricultural growth. Each of these historical data sources makes possible time series analyses of the macro-level phenomena they measure.

Additional major archives include the Roper Center for Public Opinion Research at the University of Connecticut and the Louis Harris Data Center at the University of North Carolina at Chapel Hill. Kiecolt and Nathan (1985) provide additional information on the major archives, and Stewart (1984) outlines the extensive holdings in U.S. Government Document Depositories, especially the products of the U.S. Bureau of the Census. Other important archives with which ICPSR maintains a relationship, several of them in Europe, include the Norwegian Social Science Data Services, the Australian Social Science Data Archives, and the Zentralarchiv für empirische Sozialforschung (ZA) at the University of Cologne. There is the potential for member institutions to obtain from ICPSR data contained in those local archives as well. The International Social Survey Program (ISSP) has worked toward coordinating survey research internationally by asking common questions cross-nationally in given years, facilitating cross-cultural analyses of social phenomena. For example, in 1990 social surveys in Austria, West Germany, Great Britain, Hungary, Ireland, Israel, Italy, the Netherlands, and Norway all included questions on work, including the consequences of unemployment, union activities, working conditions, and preferred job characteristics. A comparable module in 1987 focused on social inequality in Australia, Austria, West Germany, Great Britain, Hungary, Italy, and the United States. The 1993 module focused on nature, the environment, recycling, and the role of science in solving environmental problems. Data from the ISSP are available from ICPSR.


The key advantage of secondary data analysis is also the key disadvantage: The researcher gains access to a wealth of information, usually far in excess of what he or she could have produced with individual resources, but in exchange must accept the myriad operational decisions that the investigators who produced the data have made. On the positive side, the researcher frequently is able to take advantage of a national sample of respondents or data produced on national populations when individual resources would have supported only local primary data production. The numbers of cases available in secondary resources often far outstrip the sample sizes individual investigators could have afforded to produce; these large sample sizes enhance the precision of parameter estimates and allow forms of multivariate analyses that smaller sample sizes preclude. A secondary analyst also can take advantage of the significant expertise concentrated in the large survey organizations that produce data sets for secondary analysis. This collective expertise usually exceeds that of any single investigator. Despite these advantages, the researcher must carefully match the requirements of the research project to the characteristics of the data set. When the match is close, the use of secondary data will enhance the research effort by making use of existing resources and taking advantage of the time, money, and expertise of others devoted to data production. If the match is poor, the research project will fail because the data will not address the questions posed.

Because many secondary analyses are conducted on survey data, effective use of secondary survey sources frequently depends on knowledge of sample design, question wording, questionnaire construction, and measurement. Ideally, the researcher conceptualizes precisely what he or she wishes to do with the data in the analysis, since analytic requirements must be met by existing data. If the research questions posed are longitudinal in nature, the researcher must be sure that the survey questions are measured at time points that mirror the researcher's assumptions of causal order.

The researcher also must be certain that the survey samples all the respondents relevant to the problem. For example, analyses of racial differences in socioeconomic outcomes must use data sets in which racial minorities are oversampled to ensure adequate numbers of cases for analysis. The researcher also must be certain that a data set contains sufficient cases for the analysis she or he intends to perform. Kiecolt and Nathan (1985) stress the challenges for trend and cross-cultural studies that result from changes in sampling procedures over time. For example, suppose a researcher wants to ascertain whether more people support a voucher system for public education in 2000 compared with 1990. Changes in the sampling frame over the decade may introduce variations into survey responses that would not otherwise exist. These variations can be in either direction, and hypotheses regarding their direction are a function of the nature of sampling changes. Gallup surveys have increased their coverage of noninstitutionalized civilian adult populations over time, with the result that there has been an artifactual decrease in the levels of education they report (Kiecolt and Nathan 1985, pp. 62–63), since the later surveys have progressively included groups with lower levels of schooling. Sampling changes also can occur over time because of changes in geographic boundaries. Cities change boundaries owing to annexation of areas, and Metropolitan Statistical Areas (MSAs, formerly Standard Metropolitan Statistical Areas [SMSAs]) are created over time as increased numbers of counties meet the population and economic criteria for defining MSAs.

The most common problem in conducting secondary analyses, however, occurs in the questionnaire coverage of items needed to construct appropriate measures. It is likely that the original survey was constructed with one purpose and asked adequate numbers and forms of questions regarding the constructs central to that problem but gave only cursory attention to other items. A secondary researcher must evaluate carefully whether the questions that involve his or her area of central interest are adequate for measurement and for analytic tasks. The biggest fear of a secondary researcher is that some variables needed for proper model specification have been omitted. Omitted variables pose potentially severe problems of misspecification in estimating the parameters of the variables that are included in the models. In these cases the researcher must decide whether an adequate proxy (or substitute) variable exists on the data set, whether the research problem can be reformulated so that omission of that construct is less critical, or whether the initially chosen data set is unsuitable and another must be sought. Researchers can also purchase time on major social surveys such as the GSS administered by NORC. This strategy enables researchers with adequate financial resources to be certain that the questions needed to investigate the issues of interest to them will be included in a national survey. This strategy mixes primary data production with secondary analysis of a multipurpose data set. The entire data resource then becomes available to other secondary analysts.
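The misspecification that results from an omitted variable can be illustrated with a small simulation. The sketch below is not taken from any study cited in this entry; the coefficients (1.0 and 2.0), the strength of the association between the two predictors (0.5), and the sample size are all assumptions chosen for illustration. When the omitted variable x2 is correlated with the included variable x1, the slope from the regression of y on x1 alone is pulled away from the true effect by approximately the effect of x2 multiplied by the slope of x2 on x1.

```python
import random

def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 5000
beta1, beta2 = 1.0, 2.0  # assumed true effects of x1 and x2 on y

# x2 is the variable missing from the secondary data set; it is built to be
# correlated with x1 so that leaving it out matters.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
y = [beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

short = slope(x1, y)    # estimate when x2 is omitted
delta = slope(x1, x2)   # association between omitted and included variable

print("true effect of x1:      ", beta1)
print("estimate omitting x2:   ", round(short, 3))
print("beta1 + beta2 * delta:  ", round(beta1 + beta2 * delta, 3))
```

The same arithmetic suggests why a proxy variable helps: the more of x2's association with x1 the proxy absorbs, the smaller the remaining distortion in the estimate for x1.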

Other challenges for secondary analysts occur as a function of the particular form of secondary analysis used. For example, Kiecolt and Nathan (1985) note that survey researchers who produce series of cross sections of data that are useful in studying trends may "improve" the wording of questions over time. In regard to the problem of voucher systems in public education, the researcher may observe increased percentages of survey respondents favoring this option over the period covered by the surveys but still may have difficulty eliminating the possibility that question wording in the later survey or surveys may have encouraged a more positive response. Such changes also can occur if the wording of the question remains the same over time but the nature of the response categories changes. Secondary analysts who conduct cross-cultural comparisons must be sensitive to the fact that the same question can mean different things in different cultures, thus interfering with their ability to compare the same social phenomenon cross-culturally.

Dale et al. (1988) note that in-depth studies of specific populations may be most realistic with national samples that provide sufficient cases for analyses of the subgroups while allowing the researcher to place those data within a broader empirical context. It is also possible that surveys produced by different survey organizations will produce different results even when question wording, response categories, and sampling procedures remain the same (Kiecolt and Nathan 1985, p. 67). A secondary analyst must be certain that the survey organization or individual responsible for producing the data set exercised appropriate care in constructing the data resource. As was noted above, detailed familiarity with the documentation describing the data set production procedures is essential, as is a codebook indicating frequencies on categorical variables, appropriate ranges for continuous variables, and codes for missing data.
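One way to put a codebook to work is to audit an extract against it before any analysis begins. The following is a minimal sketch, assuming a hypothetical codebook in which each variable has a valid range and a set of missing-data codes; the variable names, ranges, and codes (-9, -8) are invented for illustration and do not describe any particular archive's conventions.

```python
# Hypothetical codebook: valid range and missing-data codes per variable.
CODEBOOK = {
    "age": {"min": 14, "max": 22, "missing": {-9}},
    "earnings": {"min": 0, "max": 500000, "missing": {-9, -8}},
}

def audit(records, codebook):
    """Count valid, missing, and out-of-range values for each variable."""
    report = {v: {"valid": 0, "missing": 0, "out_of_range": 0} for v in codebook}
    for rec in records:
        for var, rule in codebook.items():
            value = rec.get(var)
            if value is None or value in rule["missing"]:
                report[var]["missing"] += 1
            elif rule["min"] <= value <= rule["max"]:
                report[var]["valid"] += 1
            else:
                report[var]["out_of_range"] += 1
    return report

# Illustrative extract: one clean case, one case with an impossible age and
# missing earnings, and one case with a missing age.
records = [
    {"age": 17, "earnings": 12000},
    {"age": 35, "earnings": -9},
    {"age": -9, "earnings": 8000},
]
report = audit(records, CODEBOOK)
for variable, counts in report.items():
    print(variable, counts)
```

Checking for missing-data codes before the range test matters: a code such as -9 would otherwise be counted as an out-of-range value, or worse, averaged into a continuous variable as if it were a real response.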

There is often an interactive nature to the process of conducting a secondary data analysis. While the researcher's theoretical interests may be reasonably well formulated when he or she identifies a useful data set, the variables present in the data resource may suggest additional empirical opportunities of theoretical interest that the researcher had not previously considered. Also, familiarity with data resources can facilitate the formulation of empirical investigations that otherwise might not be initiated. Once a researcher is familiar with the features of a particular secondary source, accessing additional variables for the analysis of a related problem may require less investment than would accessing a new data resource. However, there is general agreement that data availability should never dictate the nature of a research question. Although it is legitimate for a researcher to use his or her awareness of data resources to recognize that analyses of problems of long-standing interest are now empirically possible, "data dredging" has a deservedly negative connotation and does not result in the advancement of social science. Hyman's (1972) classic treatment of secondary analyses of survey data richly chronicles the experiences of a number of sociologists as they interactively considered the matching of theoretical interests and data availability in formulating and conducting secondary analyses. He also describes a number of ways in which secondary analysts can configure existing data to test hypotheses.

Recent developments in technology have streamlined several steps in secondary analyses that formerly were time-consuming and labor-intensive. Many secondary data sets are now available on CD-ROM (compact disc read-only memory); the NLS data discussed above are only one example. With many computers having attached CD readers, analysts can read the disks and extract from them the variables and cases they wish to study. Often the disks also contain searching devices that enable researchers to locate variables of interest easily. These "search engines" simultaneously enable analysts to select a sample and obtain the variables needed on each case. These capabilities entirely bypass older technologies involving nine-track tapes containing data. In tape-based technologies, analysts had to write original computer programs to extract the needed variables and cases. A typical analyst no longer depends on a centralized computing facility for storing, mounting, and reading magnetic tapes.

The next steps in secondary analysis differ only slightly from the steps that investigators who produce primary data undertake. In both cases, data must be cleaned to remove coding errors that might result in erroneous findings. Similarly, both investigators need to address problems with missing data. The primary data producer is close enough to the actual data production not only to identify such problems but also to resolve many of them appropriately. For example, if the researcher is studying a single organization and notes that a respondent has failed to report his or her earnings, the researcher, knowing the respondent's occupation, may be able to obtain data from the organization that approximates that respondent's earnings closely. The secondary analyst would not have access to the original organization but might approximate the missing data by searching for other respondents who reported the same occupation but who also reported earnings. Variations on this theme involve the imputation of missing data by using mathematical functions of observed data to derive reasonable inferences about values that are missing (Little and Rubin 1987, 1990; Jinn and Sedransk 1989).
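The approach of approximating missing earnings from other respondents in the same occupation can be sketched as a simple group-mean imputation. This is an illustrative simplification, not the model-based procedures of Little and Rubin; the field names and values are hypothetical. Cases with no same-occupation donor are left missing, and every filled value is flagged so that later analyses can distinguish observed from imputed data.

```python
from statistics import mean

def impute_by_group(records, group_key, target):
    """Fill missing `target` values with the mean among cases in the same group."""
    observed = {}
    for rec in records:
        if rec[target] is not None:
            observed.setdefault(rec[group_key], []).append(rec[target])
    group_means = {g: mean(vals) for g, vals in observed.items()}

    completed = []
    for rec in records:
        new = dict(rec)  # copy so the original records are left untouched
        new["imputed"] = rec[target] is None and rec[group_key] in group_means
        if new["imputed"]:
            new[target] = group_means[rec[group_key]]
        completed.append(new)
    return completed

# Hypothetical respondents; None marks unreported earnings.
respondents = [
    {"occupation": "clerk", "earnings": 20000},
    {"occupation": "clerk", "earnings": 24000},
    {"occupation": "clerk", "earnings": None},
    {"occupation": "welder", "earnings": 30000},
    {"occupation": "pilot", "earnings": None},  # no donor in this occupation
]
completed = impute_by_group(respondents, "occupation", "earnings")
for row in completed:
    print(row)
```

Substituting a group mean makes the imputed cases look more alike than real respondents would be, so the variability of earnings is understated; that limitation is one reason analysts turn to the more formal imputation methods cited above.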

Both types of investigator have to be familiar with the descriptive properties of their data. For a primary investigator, observing distributions of respective variables as well as their central tendencies should be an outgrowth of data production itself. A secondary analyst has less familiarity with the data someone else produces but is under the same obligation to become familiar with the descriptive properties of the data in a detailed way. For both researchers, good decisions involving measurement of variables and model specification for multivariate analyses depend on knowledge of the descriptive properties of the data.

Within the respective multipurpose data sets, research traditions often arise from the sometimes unique suitability of certain resources for addressing given problems. These traditions derive from the fact that several investigators have access to the data simultaneously, a feature that distinguishes secondary data analysis from analyses undertaken by different primary investigators, each of whom has a unique data set. For example, in the late 1980s and into the 1990s, the NLSY79 with Mother and Child Supplements was virtually unique in combining a large sample size, longitudinal data on maternal familial and work histories, observed child outcomes, and oversamplings of racial minorities. Problems tracing the impact of maternal events on child outcomes are addressable with this data resource in a way that they were not with other resources. Investigators with an interest in these issues use the data and exchange information regarding strategies for measuring constructs and data analysis and then exchange their findings. Over time, bodies of findings emerge from common data sources where the findings are contributed by a number of secondary investigators, although the particular problems, theoretical frameworks, and empirical strategies represented in each one may differ markedly. As was suggested above, multipurpose data sets frequently allow secondary analyses by researchers from several disciplines. The products of these investigations bear the stamps of their respective disciplines. In addition, the NLSY79 with Mother and Child Supplements has served as a model for the Michigan Panel Study of Income Dynamics (PSID) in its 1997 Child Development Supplement on the PSID respondents. This new data resource, which combines longitudinal data on parents and developmental assessments of children from birth to age 12, will enable replication of key findings produced with the NLSY79 child data set as well as the production of new findings. For example, both data sets contain age-appropriate cognitive assessments for children, permitting findings produced with the NLSY79 child data set to be replicated with the PSID Child Development Supplement. The PSID, however, contains data on how children spend their time. These variables should allow researchers to understand the effects of children's time use on several developmental outcomes, something that the NLSY79 child data do not permit.

The wealth of secondary data sources also permits investigators to use more than one data source to pursue a particular line of inquiry. No single data set is perfect. Researchers can analyze several data sets, all with key measures but each with unique strengths, to check interpretations of findings and evaluate alternative explanations. McLanahan and Sandefur (1994) use this approach in their study of the effects of single parenthood on the offspring's academic success and social adjustment. Their data sources include the NLSY79, the PSID, and the High School and Beyond Study. The result is a stronger set of findings than any one of those sources could have produced alone.

Another model for conducting secondary research is suggested by researchers who use census data produced by the U.S. Department of Commerce. Census holdings cover not only information on the general U.S. population but also data on businesses, housing units, governments, and agricultural enterprises. Researchers who use these sources singly or in combination must be familiar with the questionnaires used to produce the data and with the relevant features of sample coverage. While some census data are available on machine-readable tape, other data exist only in printed form. In these cases, the researcher must configure the needed data into a form suitable for analyses, in many cases a rectangular file in which cases form row entries and variables form column entries. Data produced on cities from the County and City Data Books, for example, allow a variety of analyses that involve the relationships among urban social and economic characteristics. In these analyses, the unit of analysis is probably an aggregate unit such as a county or city, illustrating the applicability of secondary analysis to problems conceptualized at a level of aggregation higher than that of the individual.
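The rectangular layout described above can be sketched in a few lines of Python. This is only an illustrative sketch: the city names, variable names, and values are hypothetical placeholders rather than figures from any census publication, and the helper functions to_rectangular and write_csv are invented for this example.

```python
import csv
import io

# Hypothetical records as they might be keyed in from a printed table.
# City names, variable names, and values are placeholders, not real census data.
records = [
    {"city": "City A", "variable": "population", "value": 120000},
    {"city": "City A", "variable": "median_income", "value": 41000},
    {"city": "City B", "variable": "population", "value": 85000},
    {"city": "City B", "variable": "median_income", "value": 38500},
]


def to_rectangular(records, case_key="city"):
    """Pivot (case, variable, value) records into one row per case."""
    rows = {}        # case identifier -> {column name: value}
    variables = []   # variable names in order of first appearance
    for rec in records:
        row = rows.setdefault(rec[case_key], {case_key: rec[case_key]})
        row[rec["variable"]] = rec["value"]
        if rec["variable"] not in variables:
            variables.append(rec["variable"])
    header = [case_key] + variables
    # Cases form the rows and variables form the columns; gaps become None.
    table = [[row.get(col) for col in header] for row in rows.values()]
    return header, table


def write_csv(header, table):
    """Serialize the rectangular file as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(table)
    return buf.getvalue()


header, table = to_rectangular(records)
print(write_csv(header, table))
```

Each row of the resulting file is an aggregate unit (here a city), so the same layout serves analyses in which the unit of analysis is a county or city rather than an individual.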

Another advantage of secondary analyses is the potential for those most interested in a particular set of findings to replicate them by using the same data and to introduce additional variables or alternative operationalizations as a method for evaluating the robustness of the first secondary investigator's findings. A classic example is the investigation by Beck et al. (1978) of differences in earnings attainment processes by economic sector. Hauser's (1980) reanalysis of those data suggested that most of the differences in sectoral earnings reported in the original study were a function of coding decisions for low-earnings respondents, since the differences disappeared when the code for low earnings was changed. Despite this criticism, the impact of the original investigation has been enormous, with many additional investigators exploring the structure and implications of economic sectors. The point, of course, is that such debate is more likely to occur when researchers have access to common data sets, although gracious investigators often lend their data resources to interested critics. Hauser (1980) acknowledges that Beck et al. shared their original data, although he could have obtained the original data set from ICPSR.
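The general logic of checking whether a finding depends on a coding decision can be sketched as follows. The earnings values, the two sector groupings, and the floor codes below are hypothetical and chosen only for illustration; the fragment does not reconstruct the procedures used by Beck et al. or by Hauser.

```python
import math

# Hypothetical annual earnings for respondents in two sectors.
# These numbers are invented for illustration only.
sector_a = [0, 500, 18000, 22000, 25000]
sector_b = [15000, 19000, 21000, 24000, 26000]


def mean_log_earnings(earnings, floor):
    """Mean of log earnings after recoding values below `floor` up to `floor`."""
    return sum(math.log(max(e, floor)) for e in earnings) / len(earnings)


def sector_gap(floor):
    """Difference in mean log earnings (sector B minus sector A) for a given floor code."""
    return mean_log_earnings(sector_b, floor) - mean_log_earnings(sector_a, floor)


# Re-estimate the same quantity under two alternative codes for low earners.
gap_low_floor = sector_gap(floor=1)
gap_high_floor = sector_gap(floor=1000)

print(f"gap with floor=1:    {gap_low_floor:.3f}")
print(f"gap with floor=1000: {gap_high_floor:.3f}")
```

With these invented values the estimated gap shrinks when low earners are coded at the higher floor, which is the kind of sensitivity a replicating analyst would want to report.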

Secondary data sets can be augmented with additional data to enrich the data resource and allow the derivation of additional theoretical and empirical insights. Contextual analysis, or the investigation of whether social context influences social outcomes, is a key example. Parcel and Mueller (1983) used the 1975 and 1976 panels from the PSID to study racial and sex differences in earnings attainment. To evaluate the impact of occupational, industrial, and local labor-market conditions on workers' earnings, they augmented the PSID data with archival data from U.S. Census and Dictionary of Occupational Titles sources that were based on the occupations, industries, and local markets of respective PSID respondents. Illustrative contextual indicators included occupational complexity, industrial profitability, and local-market manufacturing-sector productivity. Analyses then suggested how these contextual, as well as individual-level, indicators affected workers' earnings differently depending on ascriptive statuses. Computer software is now available to correct for problems in estimating models that use contextual data.
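Attaching contextual indicators to individual records amounts to a lookup on a shared key such as occupation. The sketch below assumes hypothetical respondent records and a hypothetical occupational indicator; the identifiers, occupations, values, and the attach_context helper are invented for illustration and are not drawn from the PSID or the Dictionary of Occupational Titles.

```python
# Hypothetical individual-level records standing in for survey respondents.
respondents = [
    {"id": 1, "occupation": "machinist", "earnings": 30000},
    {"id": 2, "occupation": "clerk", "earnings": 22000},
    {"id": 3, "occupation": "welder", "earnings": 28000},
]

# Hypothetical contextual indicators keyed by occupation.
occupation_context = {
    "machinist": {"occ_complexity": 6.1},
    "clerk": {"occ_complexity": 3.4},
}


def attach_context(individuals, context, key, fields):
    """Return copies of the individual records with contextual fields added.

    Records whose key has no match in `context` receive None for each field,
    so unmatched cases stay visible instead of being silently dropped.
    """
    merged = []
    for person in individuals:
        ctx = context.get(person[key], {})
        row = dict(person)  # copy so the original record is left unchanged
        for field in fields:
            row[field] = ctx.get(field)
        merged.append(row)
    return merged


augmented = attach_context(
    respondents, occupation_context, key="occupation", fields=["occ_complexity"]
)
for row in augmented:
    print(row)
```

Keeping unmatched respondents in the file, with a missing value for the contextual indicator, lets the analyst decide explicitly how to handle incomplete contextual coverage.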

The potential for many sociologists to use secondary analysis to conduct studies of theoretical and practical importance probably has contributed to a change in productivity standards in sociology, particularly in certain subfields. The fact that certain issues can be addressed by using existing data can result in enormous savings in time relative to the time that would be required if primary data had to be produced. Research-oriented departments either implicitly or explicitly take this into account in assigning rewards such as salaries, tenure, and promotion. The potential for secondary analyses thus may create pressures toward increased scientific productivity; whether these pressures work generally for the good of social science or against it may be a matter of debate.

It is undeniable that progress in addressing some of the most important problems in social science has been facilitated greatly by the existence of multipurpose data sets and secondary resources. It is also true that the resources needed to produce and disseminate these data are considerable and that the existence and continuation of these resources are vulnerable to changes in political climate and priorities when those priorities influence resource allocation. It is critical that such decisions on resource allocation, particularly those made at the level of the federal government, recognize the important role that secondary resources have played in furthering both basic social science and applications informing social policy.

(see also: Census, Social Indicators, Survey Research)


Beck, E. M., Patrick Horan, and Charles W. Tolbert II 1978 "Stratification in a Dual Economy: A Sectoral Model of Earnings Determination." American Sociological Review 43:704–720.

Dale, Angela, Sara Arber, and Michael Proctor 1988 Doing Secondary Analysis. London: Unwin Hyman.

Freeman, John 1986 "Data Quality and the Development of Organizational Social Science: An Editorial Essay." Administrative Science Quarterly 31:298–303.

Guide to Resources and Services, 1996–1997. Ann Arbor, Mich.: Interuniversity Consortium for Political and Social Research, University of Michigan.

Hauser, Robert 1980 "Comment on 'Stratification in a Dual Economy.'" American Sociological Review 45:702–712.

Hyman, Herbert H. 1972 Secondary Analysis of Sample Surveys: Principles, Procedures, and Potentialities. New York: Wiley.

Jinn, J. H. and J. Sedransk 1989 "Effect on Secondary Data Analysis of Common Imputation Methods." Sociological Methodology 19:213–241.

Kalleberg, Arne L., David Knoke, Peter V. Marsden, and Joe L. Spaeth 1996 Organizations in America: Analyzing Their Structures and Human Resource Practices. Thousand Oaks, Calif.: Sage.

Kiecolt, K. Jill, and Laura E. Nathan 1985 "Secondary Analysis of Survey Data." Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-053. Beverly Hills, Calif.: Sage.

Little, Roderick J. A. and Donald B. Rubin 1987 Statistical Analysis with Missing Data. New York: John Wiley and Sons.

—— 1990 "The Analysis of Social Science Data with Missing Values." Pp. 375–409 in John Fox and J. Scott Long, eds., Modern Methods of Data Analysis. Newbury Park, Calif.: Sage Publications.

McLanahan, Sara, and Gary Sandefur 1994 Growing Up with a Single Parent: What Hurts, What Helps. Cambridge, Mass.: Harvard University Press.

NLS Handbook 1998. Columbus, Ohio: Center for Human Resource Research, Ohio State University.

Parcel, Toby L., and Charles W. Mueller 1983 Ascription and Labor Markets: Race and Sex Differences in Earnings. New York: Academic Press.

——, Robert L. Kaufman, and Leeann Jolly 1991 "Going Up the Ladder: Multiplicity Sampling to Create Linked Macro-Micro Organizational Samples." Sociological Methodology 21:43–80.

Stewart, David W. 1984 Secondary Research: Information Sources and Methods. Beverly Hills, Calif.: Sage.

U.S. Department of Commerce [various years] County and City Data Book. Washington, D.C.: U.S. Government Printing Office.

Toby L. Parcel
