Semantic similarity

Semantic similarity, or semantic relatedness, is a concept whereby a set of documents or terms within term lists is assigned a metric based on the likeness of their meaning or semantic content.

Concretely, this can be achieved, for instance, by defining a topological similarity, using ontologies to define a distance between words, or by statistical means, such as a vector space model that correlates words and textual contexts from a suitable text corpus (co-occurrence). A naive topological metric for terms arranged as nodes in a directed acyclic graph, such as a hierarchy, would be the minimal distance, counted in separating edges, between the two term nodes.
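The naive edge-counting metric can be sketched as follows. This is only a minimal illustration, assuming a toy hierarchy stored as a mapping from each term to its direct parents; the PARENTS table, the term names, and the function names are hypothetical and not taken from any real ontology. Edge direction is ignored when counting separating edges.

```python
from collections import deque

# Hypothetical toy hierarchy: each term maps to its direct parents.
PARENTS = {
    "animal": [],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "sparrow": ["bird"],
}

def build_undirected(parents):
    """Return an adjacency map that ignores edge direction."""
    adj = {term: set() for term in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            adj[child].add(parent)
            adj[parent].add(child)
    return adj

def edge_distance(parents, a, b):
    """Minimal number of edges separating a and b (breadth-first search)."""
    adj = build_undirected(parents)
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected
```

A smaller distance is read as higher similarity; in this toy hierarchy "dog" and "cat" are two edges apart, while "dog" and "sparrow" are four edges apart.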

Contents

1 Taxonomy

2 Visualisation

3 Applications

3.1 Biomedical Informatics

3.2 GeoInformatics

3.3 Linguistics

4 Measures

4.1 Topological similarity

4.1.1 Edge-based

4.1.2 Node-based

4.1.3 Pairwise

4.1.4 Groupwise

4.2 Statistical similarity

5 Software

6 Web Services

7 See also

8 Notes

9 References

10 External links

Taxonomy

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not.^[1] However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity/relatedness, and 0 signifies little-to-none.
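Both score ranges can be illustrated with cosine similarity between term vectors. Cosine similarity is used here as an assumed example of a vector-based score, not as the measure the literature above necessarily has in mind, and the function name is hypothetical. Vectors with negative components can yield scores anywhere between -1 and 1, whereas non-negative vectors, such as raw co-occurrence counts, yield scores between 0 and 1.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1.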

Visualisation

An intuitive way of visualising the semantic similarity of terms is to group closely related terms together and to space more distantly related ones further apart. This is also common, if sometimes subconscious, practice in mind maps and concept maps.

Applications

Biomedical Informatics

Semantic similarity measures have been applied and developed in biomedical ontologies,^[2] ^[3] namely the Gene Ontology (GO). They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds^[4] and diseases.^[5]

These comparisons can be done using tools freely available on the web:

ProteInOn can be used to find interacting proteins, to find assigned GO terms, and to calculate the functional semantic similarity of UniProt proteins. It can also retrieve the information content of GO terms and calculate their functional semantic similarity.

CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.

CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:

SIM-DL similarity server can be used to compute similarities between concepts stored in geographic feature type ontologies.

Geo-Net-PT Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.

Linguistics

Several metrics use WordNet: (+) humanly constructed; (−) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

Edge-based: which use the edges and their types as the data source;

Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent;

Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent.

Some examples:

Edge-based

Pekar, Viktor; Staab, Steffen (2002). Taxonomy learning. 1. pp. 1. doi:10.3115/1072228.1072318.

Cheng, J; Cline, M; Martin, J; Finkelstein, D; Awad, T; Kulp, D; Siani-Rose, MA (2004). "A knowledge-based clustering algorithm driven by Gene Ontology". Journal of biopharmaceutical statistics 14 (3): 687–700. doi:10.1081/BIP-200025659. PMID 15468759.

Wu, H; Su, Z; Mao, F; Olman, V; Xu, Y (2005). "Prediction of functional modules based on comparative genome analysis and Gene Ontology application". Nucleic Acids Research 33 (9): 2822–37. doi:10.1093/nar/gki573. PMC 1130488. PMID 15901854. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1130488.

Del Pozo, Angela; Pazos, Florencio; Valencia, Alfonso (2008). "Defining functional distances over Gene Ontology". BMC Bioinformatics 9: 50. doi:10.1186/1471-2105-9-50. PMC 2375122. PMID 18221506. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2375122.

IntelliGO: Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". BMC Bioinformatics 11: 588. doi:10.1186/1471-2105-11-588. PMID 21122125.

Node-based

Resnik ^[6] (based on the notion of information content)

Lin ^[7]

Jiang and Conrath ^[8]

DiShIn Disjunctive Shared Information between Ontology Concepts ^[9]

An alternative: GraSM (Graph-based Similarity Measure) ^[10]
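The first three node-based measures listed above can be sketched with their commonly stated formulas. With IC(c) = -log p(c) and MICA denoting the most informative common ancestor of two terms, Resnik similarity is IC(MICA), Lin similarity is 2·IC(MICA) / (IC(c1) + IC(c2)), and the Jiang and Conrath measure is usually given as a distance, IC(c1) + IC(c2) - 2·IC(MICA). The toy taxonomy, the probabilities, and all function names below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy (term -> direct parents) and usage probabilities.
PARENTS = {
    "animal": [],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
}
PROB = {"animal": 1.0, "mammal": 0.5, "bird": 0.5, "dog": 0.25, "cat": 0.25}

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def ic(term):
    """Information content: negative log of the term's probability."""
    return -math.log(PROB[term])

def mica_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))

def resnik(a, b):
    return mica_ic(a, b)

def lin(a, b):
    denom = ic(a) + ic(b)
    # Two root terms carry no information; treat them as identical.
    return 2 * mica_ic(a, b) / denom if denom > 0 else 1.0

def jiang_conrath_distance(a, b):
    return ic(a) + ic(b) - 2 * mica_ic(a, b)
```

In this toy setting "dog" and "cat" share the ancestor "mammal", so their Resnik similarity is log 2 and their Lin similarity is 0.5, while "dog" and "bird" share only the root and score 0 on both.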

Pairwise

maximum of the pairwise similarities

composite average in which only the best-matching pairs are considered (best-match average)
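The two pairwise strategies above can be sketched as follows, assuming each instance is described by a non-empty list of terms and that some term-level similarity function is already available. The best-match average is written here in one common formulation (the mean of the best matches, taken in both directions and then averaged); other variants exist. The toy score table and all names are hypothetical.

```python
def pairwise_max(terms_a, terms_b, term_sim):
    """Maximum similarity over all cross pairs of terms."""
    return max(term_sim(a, b) for a in terms_a for b in terms_b)

def best_match_average(terms_a, terms_b, term_sim):
    """Average of each term's best match, taken in both directions."""
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

# Hypothetical symmetric term-level scores; identical terms score 1.0.
TOY_SCORES = {
    frozenset(["t1", "t2"]): 0.8,
    frozenset(["t1", "t3"]): 0.2,
    frozenset(["t2", "t3"]): 0.4,
}

def toy_sim(a, b):
    return 1.0 if a == b else TOY_SCORES[frozenset([a, b])]
```

The maximum rewards a single strong match, whereas the best-match average is lowered by terms in either instance that have no good counterpart in the other.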

Groupwise

Jaccard index

simGIC ^[11]

simLP

simUI
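Two of the groupwise measures can be sketched on sets of terms. The sketch assumes each instance is represented by the set of its annotated terms extended with their ancestors; under that assumption simUI corresponds to the plain Jaccard index over these sets, and simGIC to a Jaccard index in which each term is weighted by its information content. The annotation sets, the IC values, and the function names are hypothetical, and simLP is not covered.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def sim_gic(set_a, set_b, ic):
    """Jaccard index with each term weighted by its information content."""
    union_weight = sum(ic[t] for t in set_a | set_b)
    shared_weight = sum(ic[t] for t in set_a & set_b)
    return shared_weight / union_weight if union_weight else 0.0

# Hypothetical ancestor-extended annotation sets and IC values.
GENE_X = {"root", "binding", "dna_binding"}
GENE_Y = {"root", "binding", "rna_binding"}
IC = {"root": 0.0, "binding": 1.0, "dna_binding": 3.0, "rna_binding": 3.0}
```

Here the unweighted index gives 0.5, while the weighted one gives 1/7, because the shared terms are general and carry little information.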

Statistical similarity

LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times

PMI (Pointwise mutual information) (+) large vocabulary, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents

SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents

GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times

ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times

NGD (Normalized Google distance) (+) large vocabulary, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi 2007; see References). ^[12]

ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP

n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed; Dijkstra's shortest path algorithm is then used to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.

VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocabulary, can compare multi-word terms; (−) performance depends on choosing specific dimensions

BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation

SimRank
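Two of the count-based measures in the list above, PMI and NGD, can be sketched directly from their standard definitions. PMI is log2 of p(x, y) / (p(x) p(y)), and NGD is (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f gives the number of pages or contexts containing the terms and N is the total number indexed. The function names are hypothetical, and the sketch assumes all counts are positive and obtained from some corpus or search engine; no real search API is called.

```python
import math

def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information (base 2) estimated from raw counts."""
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log2(p_xy / (p_x * p_y))

def ngd(count_x, count_y, count_xy, total):
    """Normalized Google distance computed from page counts."""
    log_x, log_y = math.log(count_x), math.log(count_y)
    log_xy = math.log(count_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total) - min(log_x, log_y))
```

PMI is positive when two terms co-occur more often than independence would predict and zero when they are independent. NGD is a distance, so terms that always occur together have a value of 0.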

Software

WordNet-Similarity, an open source package for computing the similarity and relatedness of concepts found in WordNet

UMLS-Similarity, an open source package for computing the similarity and relatedness of concepts found in the Unified Medical Language System (UMLS)

Web Services

Measures of Semantic Relatedness (MSR)

WordNet-Similarity, a web interface to WordNet-Similarity

UMLS-Similarity, a web interface to UMLS-Similarity

See also

Terminology extraction

Coherence (linguistics)

Analogy

Semantic differential

Notes

^ Budanitsky, Alexander; Hirst, Graeme (2001). "Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures". Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics. Pittsburgh

^ Pesquita, Catia; Faria, Daniel; Falcão, André O.; Lord, Phillip; Couto, Francisco M. (2009). Bourne, Philip E., ed. "Semantic Similarity in Biomedical Ontologies". PLoS Computational Biology 5 (7): e1000443. doi:10.1371/journal.pcbi.1000443. PMC 2712090. PMID 19649320. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2712090.

^ Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Dominique (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". BMC Bioinformatics 11: 588. doi:10.1186/1471-2105-11-588. PMID 21122125.

^ Ferreira, João D.; Couto, Francisco M. (2010). Mitchell, John B. O., ed. "Semantic Similarity for Automatic Classification of Chemical Compounds". PLoS Computational Biology 6 (9): e1000937. doi:10.1371/journal.pcbi.1000937. PMC 2944781. PMID 20885779. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2944781.

^ Köhler, S; Schulz, MH; Krawitz, P; Bauer, S; Dolken, S; Ott, CE; Mundlos, C; Horn, D et al. (2009). "Clinical diagnostics in human genetics with semantic similarity searches in ontologies". American journal of human genetics 85 (4): 457–64. doi:10.1016/j.ajhg.2009.09.003. PMC 2756558. PMID 19800049. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2756558.

^ Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (IJCAI'95), Chris S. Mellish (Ed.), Vol. 1. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 448-453

^ Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296-304

^ J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In International Conference Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997

^ Couto, F. & Silva, M. (2011), Disjunctive Shared Information between Ontology Concepts: application to Gene Ontology. Journal of Biomedical Semantics, 2:5

^ Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152

^ Catia Pesquita, Daniel Faria, Hugo Bastos, António Ferreira, André O. Falcão, Francisco Couto. 2008. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9 (Suppl 5), S4

^ Cilibrasi, R.L. & Vitanyi, P.M.B. (2007). The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering, 19(3), 370-383.

References

Benabderrahmane, S., Smail-Tabbone, M., Poch, O., Napoli, A., & Devignes, M.-D. (2010). IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinformatics, 11:588.

Cilibrasi, R.L. & Vitanyi, P.M.B. (2007). The Google Similarity Distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370-383.

Dumais, S. (2003). Data-driven approaches to information access. Cognitive Science, 27(3), 491-524.

Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.

Janowicz, K., Raubal, M. and Kuhn, W. The semantics of similarity in geographic information retrieval. Journal of Spatial Information Science, No 2 (2011), pp. 29-57.

Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Tx: The Cognitive Science Society, Inc.

Kaur, I. & Hornof, A.J. (2005). A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior. Proceedings of the Conference on Human Factors in Computing, CHI 2005 (pp. 51–60).

Landauer, T. K.; Dumais, S. T. (1997). "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge". Psychological Review 104 (2): 211–240.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254–1259). Austin, Tx: The Cognitive Science Society, Inc.

Lemaire, B., & Denhiére, G. (2004). Incremental construction of an associative network from a corpus. In K. D. Forbus & D. Gentner & T. Regier (Eds.), 26th Annual Meeting of the Cognitive Science Society, CogSci2004. Hillsdale, NJ: Lawrence Erlbaum Publisher.

Lindsey, R., Veksler, V.D., Grintsvayg, A., Gray, W.D. (2007). The Effects of Corpus Selection on Measuring Semantic Relatedness. Proceedings of the 8th International Conference on Cognitive Modeling, Ann Arbor, MI.

Navigli, R., Lapata, M. (2010). "An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation". IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4), IEEE Press, 2010, pp. 678–692.

Navigli, R., Lapata, M. (2007). Graph Connectivity Measures for Unsupervised Word Sense Disambiguation, Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, January 6-12th, 2007, pp. 1683–1688.

Pirolli, P. (2005). Rational analyses of information foraging on the Web. Cognitive Science, 29(3), 343-373.

Pirolli, P., & Fu, W.-T. (2003). SNIF-ACT: A model of information foraging on the World Wide Web. Lecture Notes in Computer Science, 2702, 45-54.

Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.

Veksler, V.D. & Gray, W.D. (2006). Test Case Selection for Evaluating Measures of Semantic Distance. Proceedings of the 28th Annual Meeting of the Cognitive Science Society, CogSci2006.

Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. In: M. Song and Y. Wu; Handbook of Research on Text and Web Mining Technologies; IGI Global. [ISBN 978-1-59904-990-8] (the use of NGD and noW for term and URI clustering)

Couto, F., Silva, M., & Coutinho, P. (2003). Implementation of a functional semantic similarity measure between gene-products. DI/FCUL TR 03–29, University of Lisbon

Couto, F., Silva, M., & Coutinho, P. (2005). Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors. In Proc. Of the ACM Conference in Information and Knowledge Management (CIKM)

Couto, F., Silva, M., & Coutinho, P. (2007). Measuring semantic similarity between Gene Ontology terms. Data and Knowledge Engineering, 61:137–152

Pesquita, C., Faria, D., Falcão, A., Lord, P., & Couto, F. (2009). Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5:e1000443

Ferreira, J. & Couto, F. (2010). Semantic similarity for automatic classification of chemical compounds. PLoS Computational Biology, 6(9):e1000937

Dong, H., Hussain, F., & Chang, E. (2011). A Context-aware Semantic Similarity Model for Ontology Environments. Concurrency and Computation: Practice and Experience, 23(5), pp. 505-524

External links

List of related literature

WordNet::Similarity (using WordNet as an ontology)

WordNet Explorer (interactive graphic WordNet database editor)

Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)

Survey on Semantic Similarity Measures (C. d'Amato, S. Staab, N. Fanizzi, EKAW 2008, Springer-Verlag)

Algorithm, Implementation and Application of the SIM-DL Similarity Server (Introduction to the SIM-DL Similarity Server)

