Dice's coefficient

Dice's coefficient, named after Lee Raymond Dice[1] and also known as the Dice coefficient, is a similarity measure over sets:

s = \frac{2 | X \cap Y |}{| X | + | Y |}

It is identical to the Sørensen similarity index, and is occasionally referred to as the Sørensen–Dice coefficient. It is similar in form to the Jaccard index, but the two differ in some of their properties.
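As a rough illustration of how close the two measures are, the sketch below computes both over Python sets. The function names `dice` and `jaccard` and the sample sets are chosen here for illustration only, and raising an error for two empty sets is an assumption (the formula is 0/0 in that case). The check at the end relies on the identity J = S / (2 − S), which follows from the two definitions by algebra.

```python
import math


def dice(x: set, y: set) -> float:
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: treat the 0/0 case as undefined rather than pick a value.
        raise ValueError("Dice's coefficient is undefined for two empty sets")
    return 2 * len(x & y) / total


def jaccard(x: set, y: set) -> float:
    """Jaccard index: |X ∩ Y| / |X ∪ Y|."""
    union = len(x | y)
    if union == 0:
        raise ValueError("the Jaccard index is undefined for two empty sets")
    return len(x & y) / union


# Illustrative sets: they share two of their elements.
x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}

s = dice(x, y)      # 2 * 2 / (3 + 4)
j = jaccard(x, y)   # 2 / 5

# The two scores are tied together by J = S / (2 - S).
relation_holds = math.isclose(j, s / (2 - s))
print(s, j, relation_holds)
```

Because one score is a monotone function of the other, both rank pairs of sets in the same order; they differ in the numeric values they assign.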

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

d = 1 -  \frac{2 | X \cap Y |}{| X | + | Y |}

is not a proper distance metric, as it does not satisfy the triangle inequality. The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, while the distance between the third and each of the others is one-third, so the direct distance of 1 exceeds the sum 1/3 + 1/3 = 2/3 obtained by going through {a,b}.
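The counterexample can be checked mechanically. The following is a minimal sketch, using exact fractions to avoid rounding; the helper names `dice_distance` and `jaccard_distance` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction


def dice_distance(x: set, y: set) -> Fraction:
    """d = 1 - 2|X ∩ Y| / (|X| + |Y|), computed exactly."""
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


def jaccard_distance(x: set, y: set) -> Fraction:
    """1 - |X ∩ Y| / |X ∪ Y|, computed exactly, for comparison."""
    return 1 - Fraction(len(x & y), len(x | y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}

# Dice: the direct distance is 1, the detour through {a, b} is 1/3 + 1/3.
dice_direct = dice_distance(a, b)
dice_detour = dice_distance(a, ab) + dice_distance(ab, b)
print(dice_direct > dice_detour)  # True: the triangle inequality fails

# Jaccard on the same three sets: 1 versus 1/2 + 1/2, so the inequality holds.
jaccard_direct = jaccard_distance(a, b)
jaccard_detour = jaccard_distance(a, ab) + jaccard_distance(ab, b)
print(jaccard_direct <= jaccard_detour)  # True
```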

Similarly to Jaccard, the set operations can be expressed in terms of vector operations over binary vectors A and B:

s_v = \frac{2 | A \cdot B |}{| A |^2 + | B |^2}

which gives the same result as the set formula when A and B are binary vectors, and also provides a more general similarity measure over arbitrary vectors.
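A small sketch of the vector form, using plain Python lists so that it stays self-contained. The function name `dice_vector`, the five-element universe, and the sample vectors are assumptions made for illustration; the absolute value around the dot product follows the formula as written above.

```python
def dice_vector(a: list, b: list) -> float:
    """s_v = 2|A · B| / (|A|^2 + |B|^2) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    denom = sum(p * p for p in a) + sum(q * q for q in b)
    if denom == 0:
        # Assumption: two zero vectors give 0/0, treated here as undefined.
        raise ValueError("undefined for two zero vectors")
    return 2 * abs(dot) / denom


# Binary encoding of X = {a, b, c} and Y = {b, c, d, e}
# over the ordered universe (a, b, c, d, e).
x_vec = [1, 1, 1, 0, 0]
y_vec = [0, 1, 1, 1, 1]
binary_score = dice_vector(x_vec, y_vec)   # 2 * 2 / (3 + 4), same as the set formula

# The same function accepts real-valued vectors.
real_score = dice_vector([1.0, 2.0], [2.0, 4.0])  # 2 * 10 / (5 + 20)
print(binary_score, real_score)
```

For binary vectors the dot product counts the shared elements and each squared norm counts the elements of one set, which is why the result agrees with the set definition.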

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities.[2]

When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:[3]

s = \frac{2 n_t}{n_x + n_y}

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x, and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula gives s = (2 · 1) / (4 + 4) = 0.25.
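The bigram calculation can be sketched in a few lines of Python. This follows the worked example in treating each string's bigrams as a set, so a bigram that occurs twice in one string is counted once; that choice, the function names `bigrams` and `dice_bigram`, and the error raised when neither string has any bigrams are assumptions of this sketch rather than part of the cited definition.

```python
def bigrams(text: str) -> set:
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_bigram(x: str, y: str) -> float:
    """s = 2 * n_t / (n_x + n_y) over the bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: strings shorter than two characters have no bigrams,
        # so the score is treated as undefined.
        raise ValueError("neither string contains a bigram")
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))       # ['ac', 'ch', 'ht', 'na']
print(dice_bigram("night", "nacht"))  # 0.25
```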

Notes

  1. ^ Dice, Lee R. (1945). "Measures of the Amount of Ecologic Association Between Species". Ecology 26 (3): 297–302. doi:10.2307/1932409. JSTOR 1932409. 
  2. ^ van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths. ISBN 3642122744. http://www.dcs.gla.ac.uk/Keith/Preface.html. 
  3. ^ Kondrak, Grzegorz; Marcu, Daniel; and Knight, Kevin (2003). "Cognates Can Improve Statistical Translation Models". Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. pp. 46–48. http://aclweb.org/anthology/N/N03/N03-2016.pdf. 

Wikimedia Foundation. 2010.
