Record Linkage

Record linkage is the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area). In 1946, H. L. Dunn of the United States National Bureau of Statistics introduced the term in this way: "Each person in the world creates a Book of Life. This Book starts with birth and ends with death. Record linkage is the name of the process of assembling the pages of this Book into a volume" (Dunn, 1946). Computerized record linkage was first undertaken by the Canadian geneticist Howard Newcombe and his associates in 1959. Newcombe recognized the full implications of extending the principle to the arrangement of personal files into family histories. Computerized record linkage has the advantages of quality control, speed, consistency, reproducibility of results, and the ability to handle large volumes of data. Newcombe published a handbook on its practical implementation in 1988.

Sir Donald Acheson established the Oxford Record Linkage Study in Oxford, England, in 1962. This medical record linkage system connects birth, morbidity, and mortality data for an entire community. This type of system links morbidity and mortality data and provides information for studies of health care utilization and for descriptive epidemiology of disease as analyzed by characteristics of time, place, and event.

There are several different approaches to linkage. At the crudest level, linkage may be based on exact agreement on one or more variables; this is referred to as deterministic linkage. Decision tables, a hierarchy of rules, and a variety of different sets of matching criteria may also be used to bring record pairs together. Although a "unique numerical identifier," such as a health card number, can be used, this number may have been issued more than once, changed over time, or recorded incorrectly. Checking and verifying associated names is therefore prudent when using numerical identifiers.
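A hierarchy of deterministic rules of the kind described above can be sketched as follows. This is only an illustration: the `Person` fields, the `normalize` helper, and the two rules are assumptions chosen for the example, not rules taken from any particular linkage system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout, for illustration only.
    health_number: Optional[str]
    surname: str
    given_name: str
    birth_date: str  # ISO format, e.g. "1950-03-14"


def normalize(name: str) -> str:
    """Upper-case and keep letters only, so "O'Brien" equals "OBRIEN"."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def deterministic_match(a: Person, b: Person) -> Optional[str]:
    """Return the name of the first rule that links a and b, or None."""
    same_surname = normalize(a.surname) == normalize(b.surname)
    same_given = normalize(a.given_name) == normalize(b.given_name)
    same_dob = a.birth_date == b.birth_date

    # Rule 1: the identifier agrees, and the surname is checked as well
    # because a number may have been reissued or recorded incorrectly.
    if a.health_number and a.health_number == b.health_number and same_surname:
        return "id+surname"
    # Rule 2: no usable identifier, so fall back to full name and birth date.
    if same_surname and same_given and same_dob:
        return "name+dob"
    return None
```

Ordering the rules from most to least trusted means that an agreeing identifier alone never creates a link; the name check guards against the identifier problems noted above.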

A mathematical theory of probabilistic linkage was developed by I. P. Fellegi and A. B. Sunter in 1969. In the generalized record linkage software subsequently developed, linkage has three main phases: searching, decision-making, and grouping. Conceptually, each record on one file is compared to each record on another file to form record pairs of all possible comparisons. In practice, in the searching phase, the files are blocked using identifiers (e.g., the phonetic code of the surname and gender code) to limit the number of potential pairs of records compared. In the decision-making phase, evidence contained in different records is compared to determine the probability, or "weight," that the records relate to the same entity. Agreement on a rare name such as "Quigley," for example, carries more weight than agreement on a common name such as "Smith." For convenience, record pairs are commonly classified into three categories: (1) definite "linked" pairs; (2) definite "nonlinked" pairs; and (3) "possible" links, where the inference cannot be made without further evidence (see Figure 1). In the final grouping phase, a group of appropriate records relating to the same individual or entity is formed. Records may have just one link to another record, or they may have several links. Two major types of errors may be made in classifying a record pair: the pair may be falsely linked (records that refer to different entities are joined), or it may be falsely nonlinked (records that in fact refer to the same entity are left unjoined).
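The searching and decision-making phases described above can be sketched in a few lines. The sketch below is a simplified illustration under stated assumptions, not the method of any specific software: the blocking key (first letter of surname plus sex code, standing in for a phonetic code), the agreement probabilities, and the two thresholds are all hypothetical values chosen for the example. Field weights use the log-likelihood-ratio form associated with the Fellegi and Sunter approach.

```python
import math
from collections import defaultdict


def block_key(rec):
    # Assumed blocking key: first letter of surname plus sex code.
    # Production systems typically use a phonetic code of the surname.
    return (rec["surname"][0].upper(), rec["sex"])


def candidate_pairs(file_a, file_b):
    """Searching phase: compare only records that share a block key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[block_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


def field_weight(agree, m, u):
    """Weight for one field: log2 of the likelihood ratio.

    m = P(field agrees | same entity), u = P(field agrees | different entities).
    """
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))


M = 0.95        # assumed m-probability for every field
U_YEAR = 0.02   # assumed chance that two unrelated people share a birth year


def pair_weight(rec_a, rec_b, surname_freq):
    """Decision-making phase: sum the field weights for a record pair."""
    # u for the surname is its relative frequency, so rare names weigh more.
    u_name = surname_freq.get(rec_a["surname"], 0.001)
    name_agree = rec_a["surname"] == rec_b["surname"]
    year_agree = rec_a["birth_year"] == rec_b["birth_year"]
    return field_weight(name_agree, M, u_name) + field_weight(year_agree, M, U_YEAR)


def classify(weight, lower=5.0, upper=15.0):
    """Three-way classification with hypothetical thresholds."""
    if weight >= upper:
        return "linked"
    if weight <= lower:
        return "nonlinked"
    return "possible"
```

With surname frequencies such as `{"Quigley": 0.0001, "Smith": 0.01}`, agreement on "Quigley" contributes a larger weight than agreement on "Smith," which is the behavior the text describes. The grouping phase is omitted here; it would collect linked pairs into sets of records for the same entity.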

The potential for linkage varies greatly between countries according to how information is collected and identified. The National Death Index in the United States and the Canadian Mortality Data Base have facilitated linkages at a national level. National birth and cancer data are also available in Canada.

Agencies need to develop explicit policies and mechanisms for the review and approval process for record linkage projects so that no individual will be harmed in the linkage process, either by false linkages or by the release of confidential information. Distinctions should be made for linkages done for statistical research purposes, where only aggregate statistics are released. Where possible, informed individual consent should be obtained, and the nature of the "public good" to be served should be assessed and reviewed.

Record linkage is an important tool in creating data required for examining the health of the public and of the health care system itself. It can be used to improve data holdings, data collection, quality assessment, and the dissemination of information. Data sources can be examined to eliminate duplicate records, to identify underreporting and missing cases (e.g., census population counts), to create person-oriented health statistics, and to generate disease registries and health surveillance systems. Some cancer registries link various data sources (e.g., hospital admissions, pathology and clinical reports, and death registrations) to generate their registries.

Figure 1

Record linkage is also used to create health indicators. For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services. If infant death records are matched to birth records, it is possible to use birth variables, such as birth weight and gestational age, along with mortality data, such as cause of death, in analyzing the data.
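Once infant death records have been matched to birth records, an indicator such as infant mortality by birth weight is a simple tabulation. The sketch below assumes the linkage step has already attached a `birth_id` to each death record; the field names and the two-group split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def birth_weight_group(grams):
    # Under 2,500 g is the conventional definition of low birth weight.
    return "low" if grams < 2500 else "normal"


def infant_mortality_by_weight(births, linked_deaths):
    """Infant deaths per 1,000 live births, by birth-weight group.

    births:        dicts with 'birth_id' and 'weight_g'
    linked_deaths: dicts with 'birth_id' assigned by a prior linkage step
    """
    group_of = {b["birth_id"]: birth_weight_group(b["weight_g"]) for b in births}
    n_births = Counter(group_of.values())
    # Deaths whose birth_id is not in the birth file are left out of the rates.
    n_deaths = Counter(
        group_of[d["birth_id"]] for d in linked_deaths if d["birth_id"] in group_of
    )
    return {g: 1000 * n_deaths[g] / n_births[g] for g in n_births}
```

Deaths that cannot be matched to a birth are excluded from the rates in this sketch; in practice, the number of such unlinked records would be reported alongside the results as a check on linkage quality.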

Linkages can help in follow-up studies of cohorts or other groups to determine factors such as vital status, residential status, or health outcomes. Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to ascertain cause of death and/or the occurrence of cancer.

In addition, record linkage can aid in developing recommendations about regulatory standards at the national and international levels. A good example can be seen in the work of the United Nations Scientific Committee on the Effects of Atomic Radiation, which provides evaluations of the sources of ionizing radiation and the effects of exposures. This committee assesses the consequences to human health of a wide variety of doses of ionizing radiation and estimates the dose people receive all over the world from natural and man-made radiation sources. Linkage of a variety of data sources is required, including health, exposure, and outcome information (e.g., cancer and deaths).

Martha E. Fair

(see also: Confidentiality; Data Sources and Collection Methods; Epidemiology; Information Technology; Informed Consent; Privacy; Registries; Statistics for Public Health; Vital Statistics)


Baldwin, J. A.; Acheson, E. D.; and Graham, W. J., eds. (1987). Textbook of Medical Record Linkage. Oxford, UK: Oxford University Press.

Chong, N. (1998). "Computerized Record Linkage in Cancer Registries." In Automated Data Collection in Cancer Registration, eds. R. J. Black, L. Simonato, H. H. Storm, and E. Démaret. Lyon: IARC, Technical Reports No. 32:7–11.

Duncan, G. T.; Jabine, T. B.; and de Wolf, V. A., eds. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Washington, DC: National Academy Press.

Dunn, H. L. (1946). "Record Linkage." American Journal of Public Health 36:1412–1416.

Federal Committee on Statistical Methodology (1997). Record Linkage Techniques 1997: Proceedings of an International Workshop and Exposition. Washington, DC: U.S. Office of Management and Budget.

Fellegi, I. P., and Sunter, A. B. (1969). "A Theory of Record Linkage." Journal of the American Statistical Association 64:1183–1210.

Howe, G. R. (1998). "Use of Computerized Record Linkage in Cohort Studies." Epidemiologic Reviews 20:112–121.

Newcombe, H. B. (1988). Handbook of Record Linkage Methods for Health and Statistical Studies, Administration and Business. Oxford, UK: Oxford University Press.

Newcombe, H. B.; Fair, M. E.; and Lalonde, P. (1992). "The Use of Names for Linking Personal Records." Journal of the American Statistical Association 87:1193–1208.

Newcombe, H. B.; Kennedy, J. M.; Axford, S. J.; and James, A. P. (1959). "Automatic Linkage of Vital Records." Science 130:954–959.

Smith, M. E., and Newcombe, H. B. (1980). "Automated Follow-up Facilities in Canada for Monitoring Delayed Health Effects." American Journal of Public Health 70(12):1261–1268.

Statistics Canada (2000). Generalized Record Linkage System. Concepts, Research and General Systems. Ottawa: Author.