String metric

String metric

String metrics (also known as similarity metrics) are a class of textual based metrics resulting in a similarity or dissimilarity (distance) score between two pairs of text strings for approximate matching or comparison and in fuzzy string searching.For example the strings "Sam" and "Samuel" can be considered (although not the same) to a degree similar. A string metric provides a floating point number indicating an algorithm-specific indication of similarity.

The most widely known (although rudimentary) string metric is Levenshtein Distance (also known as Edit Distance), which operates between two input strings, returning a score equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

A widespread example of a string metric is DNA sequence analysis and RNA analysis, which are performed by optimised string metrics to identify matching sequences.

String metrics are used heavily in information integration and are currently used in fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database deduplication, data mining, Web interfaces, e.g. Ajax-style suggestions as you type, data integration, semantic knowledge integration, etc..

List of string metrics

* Hamming distance
* Levenshtein distance and Damerau–Levenshtein distance
* Needleman-Wunsch distance or Sellers algorithm
* Smith-Waterman distance
** FASTA
** Blast
* Gotoh distance or Smith-Waterman-Gotoh distance
** Monge Elkan distance
* Block distance or L1 distance or City block distance
* Jaro distance metric
* Jaro-Winkler
* Soundex distance metric
* Matching coefficient
* Dice's coefficient
* Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
* Overlap coefficient
* Euclidean distance or L2 distance
* Cosine similarity
* Variational distance
* Hellinger distance or Bhattacharyya distance
* Information radius (Jensen-Shannon divergence)
* Harmonic mean
* Skew divergence
* Confusion probability
* Tau metric, an approximation of the Kullback-Leibler divergence
* Fellegi and Sunters metric (SFS)
* TFIDF or TF/IDF
* Maximal matches
* Monge-Elkan

ee also

* String matching
* SimMetrics - an implementation
* [http://www.speech.cs.cmu.edu/ Carnegy Mellon University open source library]

External links

*http://www.dcs.shef.ac.uk/~sam/stringmetrics.html A fairly complete overview


Wikimedia Foundation. 2010.

См. также в других словарях:

  • String (computer science) — In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet. In computer programming, a string is traditionally a sequence of… …   Wikipedia

  • String art — String art, created with thread and paper. A string art representing a projection of the 8 dimen …   Wikipedia

  • String theory — This article is about the branch of theoretical physics. For other uses, see String theory (disambiguation). String theory …   Wikipedia

  • String frame and Einstein frame — In string theory, there are two different definitions of the metric tensor, both related by a Weyl rescaling given by the dilaton. One metric gives the Einstein frame and the other gives the string frame …   Wikipedia

  • String Quartet No. 11 (Beethoven) — Ludwig van Beethoven s opus 95, his String Quartet No. 11 in F minor, is his last before his exalted late string quartets. It is commonly referred to as the Serioso, stemming from his title Quartett [o] Serioso at the beginning and the tempo… …   Wikipedia

  • String Quartet No. 1 (Carter) — The First String Quartet by American composer Elliott Carter (1908 ) was written during a year spent in the Arizona desert from 1950 51. To some extent, it can be said that this was his first major breakthrough work as a composer.A primary… …   Wikipedia

  • Fuzzy string searching — Approximate string search is the name that is used for a category of techniques for finding strings that approximately match some given pattern string. It may also be known as approximate or inexact matching. Approximate string searching has two… …   Wikipedia

  • Mirror symmetry (string theory) — In physics and mathematics, mirror symmetry is a relation that can exist between two Calabi Yau manifolds. It happens, usually for two such six dimensional manifolds, that the shapes may look very different geometrically, but nevertheless they… …   Wikipedia

  • Damerau–Levenshtein distance — In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia


Поделиться ссылкой на выделенное

Прямая ссылка:
Нажмите правой клавишей мыши и выберите «Копировать ссылку»