Approximate string matching


Approximate string matching

In computing, approximate string matching is the technique of finding approximate matches to a pattern in a string.

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. The usual primitive operations are:Fact|date=February 2007

* "insertion" (e.g., changing "cot" to "coat"),
* "deletion" (e.g. changing "coat" to "cot"), and
* "substitution" (e.g. changing "coat" to "cost").

Some approximate matchers also treat "transposition", in which the positions of two letters in the string are swapped, to be a primitive operation.Fact|date=February 2007 Changing "cost" to "cots" is an example of a transposition.

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern.Fact|date=February 2007 For example, if the pattern is "coil", "foil" differs by one substitution, "coils" by one insertion, "oil" by one deletion, and "foal" by two substitutions. If all operations count as a single unit of cost and the limit is set to one, "foil", "coils", and "oil" will count as matches while "foal" will not.

Other matchers specify the number of operations of each type separately,Fact|date=February 2007 while still others set a total cost but allow different weights to be assigned to different operations.Fact|date=February 2007 Some matchers allow separate assignments of limits and weights to individual groups in the pattern.

Most approximate matchers used for text processing are regular expression matchers.Fact|date=February 2007 The distance between a candidate and the pattern is therefore computed as the minimum distance between the candidate and a fixed string matching the regular expression. Thus, if the pattern is "co.l", using the POSIX notation in which a dot matches any single character, both "coal" and "coil" are exact matches, while "soil" differs by one substitution.

The most common application of approximate matchers until recently has been spell checking.Fact|date=February 2007 With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application.Fact|date=February 2007 Approximate matching is also used to identify pieces of music from small snatches and in spam filtering.Fact|date=February 2007

References

* "Pattern Matching Algorithms", Alberto Apostolico & Zvi Galil, Oxford University Press, UK, 1997.

See also

* Fuzzy string searching
* Levenshtein distance
* Needleman-Wunsch algorithm
* Soundex
* Agrep
* Zsh


Wikimedia Foundation. 2010.

Look at other dictionaries:

  • String metric — String metrics (also known as similarity metrics) are a class of textual based metrics resulting in a similarity or dissimilarity (distance) score between two pairs of text strings for approximate matching or comparison and in fuzzy string… …   Wikipedia

  • Fuzzy string searching — Approximate string search is the name that is used for a category of techniques for finding strings that approximately match some given pattern string. It may also be known as approximate or inexact matching. Approximate string searching has two… …   Wikipedia

  • Compressed pattern matching — In computer science Compressed Pattern Matching or CPM is the process of searching for pattern in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less… …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia

  • List of terms relating to algorithms and data structures — The [http://www.nist.gov/dads/ NIST Dictionary of Algorithms and Data Structures] is a reference work maintained by the U.S. National Institute of Standards and Technology. It defines a large number of terms relating to algorithms and data… …   Wikipedia

  • Damerau–Levenshtein distance — In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the …   Wikipedia

  • Список терминов, относящихся к алгоритмам и структурам данных —   Это служебный список статей, созданный для координации работ по развитию темы.   Данное предупреждение не устанавливается на информационные списки и глоссарии …   Википедия

  • Список терминов — Список терминов, относящихся к алгоритмам и структурам данных   Это сл …   Википедия

  • Agrep — (approximate grep) is a fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.It selects the best suited algorithm for the… …   Wikipedia

  • BK-tree — A BK tree is a metric tree suggested by Burkhard and Keller ref|BK73 specifically adapted to discrete metric spaces.For simplicity, let us consider integer discrete metric varrho(x,y). Then, BK tree is defined in the following way. An arbitrary… …   Wikipedia