Record linkage

Record linkage

Record linkage (RL) refers to the task of finding entries that refer to the same entity across different data sources (e.g., files, books, websites, databases, etc.). Record linkage is an appropriate technique when you have to join data sets that do not already share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as cross-linked (which is not necessarily the same as that data set containing Linked Data).

Record linkage is a useful tool when performing data mining tasks, where the data originated from different sources or different organizations. Most commonly, performing RL on data sets involves joining records of persons based on name, since no National identification number or similar is recorded in the data. In mathematical graph theory, record linkage can be seen as a technique of resolving bipartite graphs.


Naming conventions

Record linkage is the term used by statisticians, epidemiologists, and historians, among others. Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Computer scientists often refer to it as "data matching" or as the "object identity problem". Other names used to describe the same concept include "entity resolution", "identity resolution", "entity disambiguation", "duplicate detection", "record matching", "instance identification", "deduplication", "coreference resolution", "reference reconciliation", and "database hardening". This profusion of terminology has led to few cross-references between these research communities.[1][2]


There are several approaches to record linkage. The most straightforward is a rules-based approach, in which reasonable rules are developed and then refined as common exceptions are found. The advantage to this approach is that it is possible to get a good deal of accuracy without needing a lot of labeled data to train or test the rules on. The disadvantage is that to obtain very high accuracy, more and more exceptions and special cases need to be handled, and eventually the list of rules gets too complex to be built by hand.

A very popular approach has been Probabilistic Record Linkage (PRL). In this approach, a large set of pairs of records are human-labeled as being matching or differing pairs. Then statistics are calculated from the agreement of fields on matching and differing records to determine weights on each field. During execution, the agreement or disagreement weight for each field is added to get a combined score that represents the probability that the records refer to the same entity. Often there is one threshold above which a pair is considered a match, and another threshold below which it is considered not to be a match. Between the two thresholds a pair is considered to be "possibly a match", and dealt with accordingly (e.g., human reviewed, linked, or not linked, depending on the application).

In recent years, a variety of machine learning techniques have been used in record linkage. It has been recognized that Probabilistic Record Linkage is equivalent to the "Naive Bayes" algorithm in the field of machine learning, and suffers from the same assumption of the independence of its features, which is typically not true. Higher accuracy can often be achieved by using various other machine learning techniques, including a single-layer Perceptron.

Regardless of whether rule-based, PRL, or machine learning techniques are used, normalization of the data is very important. Names are often spelled differently in different sources (e.g., "Wm. Smith", "William Smith", "William J. Smith", "Bill Smith", etc.); dates can be recorded various ways ("1/2/73", "1973.1.2", "Jan 2, 1973"); and places can be recorded differently as well ("Berkeley, CA", "Berkeley, Alameda, California, USA", etc.). By normalizing these into a common format and using comparison techniques that handle additional variation, much more consistency can be achieved, resulting in higher accuracy in any record linkage technique.

Record linkage typically involves two main steps: blocking and scoring. The blocking phase attempts to find groups of records that are similar in some way. For example, a query might select all individuals with the same surname and ZIP code. Of course, many of these individuals would not be the same real person (false positives), and some records for the same individual might have different values for the surname (due to typos, married name, etc.) or ZIP code (e.g., because they moved). Thus robust systems often use multiple blocking passes to group data in various ways in order to come up with groups of records that should be compared to each other.

Once a set of potential matches is identified, a scoring algorithm is used to calculate a score indicating how likely the two records are to really represent the same real entity. By using labeled test data, score thresholds can be selected that yield a desired trade-off between recall and precision. (Recall is the percent of true matching pairs that get a score above a given threshold, and precision is the percent of pairs with a score above a given threshold that are really a match).

History of RL theory

The initial idea goes back to Halbert L. Dunn in 1946.[3] In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.

In 1969, Ivan Fellegi and Alan Sunter formalized these ideas and proved that the probabilistic decision rule they described was optimal when the comparison attributes are conditionally independent. Their pioneering work "A Theory For Record Linkage"[4] is, still today, the mathematical tool for many record linkage applications.

Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter (FS) theory. Several researchers have reported that the conditional independence assumption of the FS algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality.[citation needed]

Mathematical model

In an application with two files, A and B, denote the rows (records) by α(a) in file A and β(b) in file B. Assign K characteristics to each record. The set of records that represent identical entities is defined by

 M = \left\{ (a,b); a=b; a \in A; b \in B \right\}

and the complement of set M, namely set U representing different entities is defined as

 U = \{ (a,b); a \neq b; a \in A, b \in B \} .

A vector, γ is defined, that contains the coded agreements and disagreements on each characteristic:

 \gamma \left[ \alpha ( a ), \beta ( b ) \right] = \{ \gamma^{1} \left[ \alpha ( a ) , \beta ( b ) \right] ,...,	\gamma^{K} \left[ \alpha ( a ), \beta ( b ) \right] \}

where K is a subscript for the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector γ given (a, b) \in M, (a, b) \in U are defined as

 m(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in M \right\} =
 \sum_{(a, b) \in M} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
                 P \left[ (a, b) | M\right]


 u(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in U \right\} =
 \sum_{(a, b) \in U} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
                 P \left[ (a, b) | U\right],


Applications in historical research

Record linkage is important to social history research since most data sets, such as census records and parish registers were recorded long before the invention of National identification numbers. When old sources are digitized, linking of data sets is a prerequisite for longitudinal study. This process is often further complicated by lack of standard spelling of names, family names that change according to place of dwelling, changing of administrative boundaries, and problems of checking the data against other sources. Record Linkage was among the most prominent themes in the History and computing field in the 1980s, but has since been subject to less attention in research.

Applications in medical practice and research

Record linkage is an important tool in creating data required for examining the health of the public and of the health care system itself. It can be used to improve data holdings, data collection, quality assessment, and the dissemination of information. Data sources can be examined to eliminate duplicate records, to identify under-reporting and missing cases (e.g., census population counts), to create person-oriented health statistics, and to generate disease registries and health surveillance systems. Some cancer registries link various data sources (e.g., hospital admissions, pathology and clinical reports, and death registrations) to generate their registries. Record linkage is also used to create health indicators. For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services. If infant death records are matched to birth records, it is possible to use birth variables, such as birth weight and gestational age, along with mortality data, such as cause of death, in analyzing the data. Linkages can help in follow-up studies of cohorts or other groups to determine factors such as vital status, residential status, or health outcomes. Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to obtain the cause of death and/or cancer. An example of a successful and long-standing record linkage system allowing for population-based medical research is the Rochester Epidemiology Project based in Rochester, Minnesota.

See also


  1. ^ Cristen, P & T: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3) p.9
  2. ^ Elmagarmid, Ahmed; Panagiotis G. Ipeirotis, Vassilios Verykios (January 2007). "Duplicate Record Detection: A Survey" (PDF). IEEE Transactions on Knowledge and Data Engineering 19 (1): pp. 1–16. doi:10.1109/TKDE.2007.9. Retrieved 2009-03-30. 
  3. ^ Dunn, Halbert L. (December 1946). "Record Linkage" (PDF). American Journal of Public Health 36 (12): pp. 1412–1416. doi:10.2105/AJPH.36.12.1412. Retrieved 2008-05-31. 
  4. ^ Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage". Journal of the American Statistical Association 64 (328): pp. 1183–1210. doi:10.2307/2286061. JSTOR 2286061. 

External links

Software implementations

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • record linkage — the means by which information about health events from several different sources (e.g. hospital attendance, vaccination, and consultation with general practitioners) are all related to a specific individual in a common file or more usually a… …   Medical dictionary

  • record linkage — the means by which information about health events from several different sources (e.g. hospital attendance, vaccination, and consultation with general practitioners) are all related to a specific individual in a common file or more usually a… …   The new mediacal dictionary

  • Linkage — The tendency for genes and other genetic markers to be inherited together because of their location near one another on the same chromosome. A gene is a functional physical unit of heredity that can be passed from parent to child. All genes in… …   Medical dictionary

  • Product Data Record — The Product Record, or Product Data Record, is an Information technology management concept used to refer to data associated with the entire lifecycle of a product from its conception, through design and manufacture, to service and disposal. It… …   Wikipedia

  • Distance de Jaro — Winkler La distance de Jaro Winkler mesure la similarité entre deux chaînes de caractères. Il s agit d une variante proposée en 1999 par William E. Winkler, découlant de la distance de Jaro (1989, Matthew A. Jaro) qui est principalement utilisée… …   Wikipédia en Français

  • Distance de Jaro-Winkler — La distance de Jaro Winkler mesure la similarité entre deux chaînes de caractères. Il s agit d une variante proposée en 1999 par William E. Winkler, découlant de la distance de Jaro (1989, Matthew A. Jaro) qui est principalement utilisée dans la… …   Wikipédia en Français

  • Jaro-Winkler — Distance de Jaro Winkler La distance de Jaro Winkler mesure la similarité entre deux chaînes de caractères. Il s agit d une variante proposée en 1999 par William E. Winkler, découlant de la distance de Jaro (1989, Matthew A. Jaro) qui est… …   Wikipédia en Français

  • Интеграция данных — включает объединение данных, находящихся в различных источниках и предоставление данных пользователям в унифицированном виде. Этот процесс становится существенным как в коммерческих задачах (когда двум похожим компаниям необходимо объединить их… …   Википедия

  • Jaro-Winkler distance — The Jaro Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicate detection). The higher the Jaro… …   Wikipedia

  • David Reardon — David C. Reardon is the American director of the Elliot Institute and an advocate in favor of legislating strict barriers to abortion.[1] Reardon is the author of a number of articles and five books examining the controversial issue of mental… …   Wikipedia