Semantic similarity

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts, or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.
For example, "car" is similar to "bus", but is also related to "road" and "driving".
Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms or concepts. For example, a naive metric for comparing concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. Proposed semantic similarity and relatedness measures are evaluated in two main ways. The first uses datasets designed by experts and composed of word pairs with estimated degrees of semantic similarity or relatedness. The second integrates the measures into specific applications such as information retrieval, recommender systems, and natural language processing.
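The naive shortest-path metric described above can be sketched in a few lines. This is only an illustration: the taxonomy, the concept names, and the helper functions are invented for the example, and the "is a" edges are treated as traversable in both directions.

```python
from collections import deque

# Toy "is a" taxonomy (child -> parents); the concepts are illustrative only.
IS_A = {
    "car": ["motor_vehicle"],
    "bus": ["motor_vehicle"],
    "motor_vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["entity"],
    "road": ["infrastructure"],
    "infrastructure": ["entity"],
}


def build_undirected(is_a):
    """Treat every 'is a' edge as traversable in both directions."""
    graph = {}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_undirected(IS_A)
print(shortest_path_length(graph, "car", "bus"))      # 2
print(shortest_path_length(graph, "car", "bicycle"))  # 3
print(shortest_path_length(graph, "car", "road"))     # 5
```

A shorter path means more similar concepts. In this toy graph "car" is closer to "bus" than to "road", which matches the similarity versus relatedness example given earlier.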

Terminology

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity.

Visualization

An intuitive way of visualizing the semantic similarity of terms is by grouping together terms which are closely related and spacing wider apart the ones which are distantly related. This is also common in practice for mind maps and concept maps.
A more direct way of visualizing the semantic similarity of two linguistic items can be seen with the Semantic Folding approach. In this approach a linguistic item such as a term or a text can be represented by generating a pixel for each of its active semantic features in e.g. a 128 x 128 grid. This allows for a direct visual comparison of the semantics of two items by comparing image representations of their respective feature sets.
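A rough sketch of the comparison idea follows. It assumes that a fingerprint is stored as a set of active positions on the 128 x 128 grid and that two fingerprints are compared by counting shared active pixels. Both choices are assumptions made for illustration, as are the feature indices and function names; the text above only states that the images are compared visually.

```python
GRID_SIDE = 128  # grid size mentioned in the text


def to_cells(active_features):
    """Map flat feature indices to (row, column) pixels on the grid."""
    return {(i // GRID_SIDE, i % GRID_SIDE) for i in active_features}


def overlap(cells_a, cells_b):
    """Number of pixels that are active in both fingerprints."""
    return len(cells_a & cells_b)


# Hypothetical active-feature indices, invented for this example.
car = to_cells({3, 130, 131, 4000, 9000})
bus = to_cells({3, 130, 4000, 7777})
poem = to_cells({12, 500, 16000})

print(overlap(car, bus))   # 3
print(overlap(car, poem))  # 0
```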

Applications

Biomedical informatics

Semantic similarity measures have been applied and developed in biomedical ontologies. They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as diseases. These comparisons can be done using tools freely available on the web:

ProteInOn can be used to find interacting proteins, find assigned GO terms, and calculate the functional semantic similarity of UniProt proteins, as well as to get the information content and calculate the functional semantic similarity of GO terms.
CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.
CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.

GeoInformatics

Similarity is also applied to find similar geographic features or feature types:

SIM-DL similarity server can be used to compute similarities between concepts stored in geographic feature type ontologies.
Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.
The semantic network can be used to compute the semantic similarity of tags in OpenStreetMap.

Computational linguistics

Several metrics use WordNet, a manually constructed lexical database of English words. Despite the advantages of human supervision in constructing the database, because the words are not learned automatically, the database cannot measure relatedness between multi-word terms, and its vocabulary is non-incremental.

Natural language processing

Natural language processing is a field of computer science and linguistics. Sentiment analysis, natural language understanding, and machine translation are a few of the major areas where it is used. For example, given one information resource on the internet, it is often of immediate interest to find similar resources. The Semantic Web provides semantic extensions to find similar data by content and not just by arbitrary descriptors. Deep learning methods have become an accurate way to gauge semantic similarity between two text passages, in which each passage is first embedded into a continuous vector representation.
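Once two passages have been embedded as vectors, a common way to compare them is cosine similarity, which yields a score between -1 and 1. The sketch below assumes that choice; the text does not prescribe a particular comparison function, and the four-dimensional vectors are made up rather than produced by any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up embeddings standing in for the output of an embedding model.
passage_a = [0.9, 0.1, 0.3, 0.0]
passage_b = [0.8, 0.2, 0.4, 0.1]
passage_c = [-0.1, 0.9, -0.2, 0.7]

print(cosine_similarity(passage_a, passage_b))  # close to 1: similar direction
print(cosine_similarity(passage_a, passage_c))  # slightly negative: dissimilar
```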

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

Edge-based: which use the edges and their types as the data source;
Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent

Some examples:

Edge-based

Pekar et al.
Cheng and Cline
Wu et al.
Del Pozo et al.
IntelliGO: Benabderrahmane et al.

Node-based

Resnik
* based on the notion of information content. The information content of a concept is the negative logarithm of the probability of finding the concept in a given corpus.
* only considers the information content of the lowest common subsumer. A lowest common subsumer is a concept in a lexical taxonomy that has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
Lin
* based on Resnik's similarity.
* considers the information content of lowest common subsumer and the two compared concepts.
Maguitman, Menczer, Roinestad and Vespignani
* Generalizes Lin's similarity to arbitrary ontologies.
Jiang and Conrath
* based on Resnik's similarity.
* considers the information content of lowest common subsumer and the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
Random walks on semantic networks
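The information-content measures in this list (Resnik, Lin, Jiang and Conrath) can be sketched on the cat/dog example. Everything concrete here is an assumption made for illustration: the tiny tree-shaped taxonomy, the corpus counts, and the function names. The formulas are the usual textbook forms: IC(c) = -log p(c), Resnik = IC(lcs), Lin = 2 IC(lcs) / (IC(a) + IC(b)), and Jiang-Conrath distance = IC(a) + IC(b) - 2 IC(lcs).

```python
import math

# Toy taxonomy (child -> parent) following the cat/dog example in the text.
PARENT = {
    "cat": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}

# Made-up corpus counts of direct mentions of each concept (assumption).
DIRECT_COUNTS = {"cat": 30, "dog": 50, "mammal": 10,
                 "sparrow": 20, "bird": 10, "animal": 5}


def subsumers(concept):
    """Return the concept followed by its ancestors, most specific first."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


# A mention of a concept also counts as a mention of every concept subsuming it.
CUMULATIVE = dict.fromkeys(DIRECT_COUNTS, 0)
for name, count in DIRECT_COUNTS.items():
    for ancestor in subsumers(name):
        CUMULATIVE[ancestor] += count
TOTAL = CUMULATIVE["animal"]  # the root subsumes everything


def information_content(concept):
    """IC(c) = -log p(c), written here as log(1 / p(c))."""
    return math.log(TOTAL / CUMULATIVE[concept])


def lowest_common_subsumer(a, b):
    subsumers_of_b = set(subsumers(b))
    for candidate in subsumers(a):  # most specific first
        if candidate in subsumers_of_b:
            return candidate
    raise ValueError("concepts share no subsumer")


def resnik(a, b):
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    denominator = information_content(a) + information_content(b)
    if denominator == 0:  # only happens when both concepts are the root
        return 1.0
    return 2 * resnik(a, b) / denominator


def jiang_conrath_distance(a, b):
    return information_content(a) + information_content(b) - 2 * resnik(a, b)


for pair in [("cat", "dog"), ("cat", "sparrow")]:
    print(pair, lowest_common_subsumer(*pair), round(resnik(*pair), 3),
          round(lin(*pair), 3), round(jiang_conrath_distance(*pair), 3))
```

With these counts, cat and dog share the subsumer mammal and get a positive Resnik score, while cat and sparrow only share the root, whose information content is zero.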

Node-and-Relation-Content-based

applicable to ontology
consider properties of nodes
consider types of relations
based on eTVSM
based on Resnik's similarity

Pairwise

maximum of the pairwise similarities
composite average in which only the best-matching pairs are considered
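The two pairwise strategies above can be sketched as follows. The concept-level scores in TOY_SIM and the identifiers t1 to t4 are made up, and the symmetric form of the best-match average (averaging both directions) is one common variant assumed here, not something the text specifies.

```python
# Made-up similarity scores between hypothetical concepts t1..t4.
TOY_SIM = {
    frozenset(("t1", "t2")): 0.8,
    frozenset(("t1", "t4")): 0.1,
    frozenset(("t3", "t2")): 0.2,
    frozenset(("t3", "t4")): 0.5,
}


def toy_sim(a, b):
    """Concept-level similarity looked up from the toy table."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def pairwise_max(terms_a, terms_b, sim):
    """Instance similarity = best score over all concept pairs."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def best_match_average(terms_a, terms_b, sim):
    """Average the best match of every concept, in both directions."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2


# Two hypothetical instances, each annotated with two concepts.
instance_x = ["t1", "t3"]
instance_y = ["t2", "t4"]

print(pairwise_max(instance_x, instance_y, toy_sim))
print(best_match_average(instance_x, instance_y, toy_sim))
```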

Groupwise

Jaccard index
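As a minimal illustration of a groupwise measure, the Jaccard index compares the two sets of concepts directly: the size of their intersection divided by the size of their union. The concept identifiers are hypothetical, and returning 1.0 for two empty sets is a convention chosen for this sketch.

```python
def jaccard_index(concepts_a, concepts_b):
    """|A intersect B| / |A union B| for two collections of concepts."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical concept annotations of two instances.
print(jaccard_index({"t1", "t2", "t3"}, {"t2", "t3", "t4"}))  # 0.5
```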

Statistical similarity

Statistical similarity approaches can be learned from data, or predefined.
Similarity learning can often outperform predefined similarity measures.
Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.
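As one small illustration of estimating relatedness from corpus statistics, pointwise mutual information (PMI, listed below) compares how often two words co-occur with how often they would co-occur by chance: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). In this sketch the five-sentence corpus is invented, and probabilities are estimated as the fraction of sentences containing a word, which is a simplifying assumption.

```python
import math

# Tiny invented corpus; each string is treated as one document.
CORPUS = [
    "the car drove down the road",
    "the bus drove down the road",
    "the car stopped behind the bus",
    "a poem about the sea",
    "the sea was calm",
]


def pmi(x, y, documents):
    """Pointwise mutual information of two words co-occurring in a document."""
    docs = [set(d.split()) for d in documents]
    n = len(docs)
    n_x = sum(x in d for d in docs)
    n_y = sum(y in d for d in docs)
    n_xy = sum(x in d and y in d for d in docs)
    if n_xy == 0:
        return float("-inf")  # the words never co-occur in this corpus
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))


print(pmi("car", "road", CORPUS))  # positive: co-occur more than chance
print(pmi("car", "sea", CORPUS))   # -inf: never co-occur
```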

LSA (latent semantic analysis): vector-based, adds vectors to measure multi-word terms; non-incremental vocabulary, long pre-processing times
PMI (pointwise mutual information): large vocabulary, because it can use any search engine; cannot measure relatedness between whole sentences or documents
SOC-PMI (second-order co-occurrence PMI): sorts lists of important neighbor words from a large corpus; cannot measure relatedness between whole sentences or documents
GLSA (generalized latent semantic analysis): vector-based, adds vectors to measure multi-word terms; non-incremental vocabulary, long pre-processing times
ICAN: incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; cannot measure relatedness between multi-word terms, long pre-processing times
NGD (normalized Google distance; Cilibrasi & Vitanyi): large vocabulary, because it can use any search engine; can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required
TSS: large vocabulary, because it uses online tweets from Twitter to compute the similarity; has a high temporal resolution that allows it to capture high-frequency events; open source
NCD (normalized compression distance)
based on Wikipedia and the ODP
which indexes terms using salient concepts found in their immediate context.
noW: a distance metric based on the hierarchical structure of Wikipedia, inspired by a game. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics in the graph.
incremental vocabulary, can compare multi-word terms; performance depends on choosing specific dimensions
SimRank
Sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with a taxonomy. Cross-lingual similarity is currently also possible thanks to the multilingual and unified extension.
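Of the measures listed above, NGD has a compact closed form that is easy to sketch. In its usual formulation it needs only three hit counts and an index size: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The counts below are hypothetical and do not come from any search engine, and the function name and guard clauses are choices made for this example.

```python
import math


def normalized_google_distance(f_x, f_y, f_xy, n):
    """NGD from hit counts f(x), f(y), f(x, y) and the index size n."""
    if f_x <= 0 or f_y <= 0:
        raise ValueError("both terms must occur at least once")
    if n <= max(f_x, f_y):
        raise ValueError("n must exceed the individual hit counts")
    if f_xy <= 0:
        return float("inf")  # the terms never occur together
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


# Hypothetical page counts, invented for illustration.
print(normalized_google_distance(1000, 1000, 1000, 1_000_000))  # 0.0
print(normalized_google_distance(1000, 2000, 100, 1_000_000))
```

Smaller values mean the terms tend to appear on the same pages. Two terms that always occur together get a distance of 0.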

Semantics-based similarity

Marker Passing: Combining lexical decomposition for automated ontology creation with marker passing, the approach of Fähndrich et al. introduces a new type of semantic similarity measure. Here, markers carrying an amount of activation are passed from the two target concepts. This activation might increase or decrease depending on the weight of the relations with which the concepts are connected. This combines edge-based and node-based approaches and includes connectionist reasoning with symbolic information.
Good Common Subsumer-based semantic similarity measure

Gold standards

Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The gold standard to date is an old list of 65 word pairs for which humans have judged the word similarity.

RG65
MC30
WordSim353
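Evaluation against such datasets is often reported as a rank correlation between the scores of a measure and the human judgements. The sketch below assumes Spearman's rank correlation for that purpose, which the text does not specify. The four judgement values and the measure scores are made up and are not taken from RG65, MC30, or WordSim353.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up human judgements and measure scores for four word pairs.
human = [3.9, 3.1, 0.4, 1.2]
measure_a = [0.95, 0.70, 0.10, 0.30]  # same ordering as the humans
measure_b = [0.10, 0.20, 0.90, 0.80]  # reversed ordering

print(spearman(human, measure_a))
print(spearman(human, measure_b))
```

A measure that orders the word pairs exactly as the human raters do scores 1, and one that reverses the ordering scores -1.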

Survey articles

Conference article: C. d'Amato, S. Staab, N. Fanizzi. 2008. In Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns, pages 48–63. Acitrezza, Italy, Springer-Verlag.
Journal article on the more general topic of relatedness, also including similarity: Z. Zhang, A. Gentile, F. Ciravegna. 2013. Natural Language Engineering 19, 411–479, Cambridge University Press.
Book: S. Harispe, S. Ranwez, S. Janaqi, J. Montmain. 2015. Morgan & Claypool Publishers.
