Edit distance
In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, edit distance can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. The Levenshtein distance is the most common metric, and the term is often used interchangeably with edit distance.
Types of edit distance
Different types of edit distance allow different sets of string operations. For instance:
- The Levenshtein distance allows deletion, insertion and substitution.
- The Longest common subsequence distance allows only insertion and deletion, not substitution.
- The Hamming distance allows only substitution, so it applies only to strings of the same length (see the sketch after this list).
- The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters.
- The Jaro distance allows only transposition.
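To make the effect of the operation set concrete, the sketch below computes the Hamming distance, which permits only substitutions; the function name hamming_distance is illustrative rather than taken from any particular library.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ.
    Only substitutions are counted, so unequal lengths are an error."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

# "flaw" and "lawn" differ in every position, so their Hamming distance is 4,
# even though their Levenshtein distance is only 2 (delete "f", append "n").
print(hamming_distance("flaw", "lawn"))  # 4
```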
Formal definition and properties
Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the binary alphabet {0, 1}), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:
- Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε → x, using ε to denote the empty string.
- Deletion of a single symbol changes uxv to uv (x → ε).
- Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x → y).
In Levenshtein's original definition, each of these operations has unit cost, so the Levenshtein distance is equal to the minimum number of operations required to transform a into b. A more general definition associates non-negative weight functions w_ins(x), w_del(x) and w_sub(x, y) with the operations.
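As a concrete illustration of the three primitive operations, here is a minimal sketch; the helper names insert, delete and substitute are ours, and positions are zero-based indices into the string.

```python
def insert(s: str, i: int, x: str) -> str:
    """Insertion: turn uv into uxv by inserting symbol x before position i."""
    return s[:i] + x + s[i:]

def delete(s: str, i: int) -> str:
    """Deletion: turn uxv into uv by removing the symbol at position i."""
    return s[:i] + s[i + 1:]

def substitute(s: str, i: int, y: str) -> str:
    """Substitution: turn uxv into uyv by replacing the symbol at position i with y."""
    return s[:i] + y + s[i + 1:]

print(insert("uv", 1, "x"))       # uxv
print(delete("uxv", 1))           # uv
print(substitute("uxv", 1, "y"))  # uyv
```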
Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv.
For the task of correcting OCR output, merge and split operations have been used, which replace a single character with a pair of characters or vice versa.
Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost. Similarly, by only allowing substitutions, Hamming distance is obtained; this must be restricted to equal-length strings.
Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.
Example
The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:
- kitten → sitten (substitution of "s" for "k")
- sitten → sittin (substitution of "i" for "e")
- sittin → sitting (insertion of "g" at the end)
The LCS distance (insertions and deletions only) gives a different distance and minimal edit script:
- kitten → itten (deletion of "k" at 0)
- itten → sitten (insertion of "s" at 0)
- sitten → sittn (deletion of "e" at 4)
- sittn → sittin (insertion of "i" at 4)
- sittin → sitting (insertion of "g" at 6)
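The five-step script above can be checked numerically using the identity LCS distance = |a| + |b| − 2·|LCS(a, b)|; the sketch below is illustrative, with helper names of our choosing.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by the standard DP table."""
    L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[len(a)][len(b)]

def lcs_distance(a: str, b: str) -> int:
    """Every symbol outside the LCS must be deleted from one string or
    inserted into the other, giving |a| + |b| - 2*|LCS(a, b)|."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(lcs_distance("kitten", "sitting"))  # 5, matching the five-step script above
```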
Properties
Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:
- Every edit operation has positive cost;
- for every operation, there is an inverse operation with equal cost.
Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.
Other useful properties of unit-cost edit distances include:
- LCS distance is bounded above by the sum of lengths of a pair of strings.
- LCS distance is an upper bound on Levenshtein distance.
- For strings of the same length, Hamming distance is an upper bound on Levenshtein distance.
- When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w). This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.
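The prefix/suffix property justifies a common preprocessing step: strip the shared prefix and suffix, then run the quadratic-time algorithm only on the differing middles. A minimal sketch (the function name is ours):

```python
def trim_common_affixes(a: str, b: str) -> tuple[str, str]:
    """Strip a shared prefix and a shared suffix; by the property above,
    the edit distance of the remainders equals that of the originals."""
    # Shared prefix.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    a, b = a[i:], b[i:]
    # Shared suffix.
    j = 0
    while j < len(a) and j < len(b) and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:len(a) - j], b[:len(b) - j]

# Only the differing middles need to be fed to the quadratic-time algorithm.
print(trim_common_affixes("unhappiness", "unhelpfulness"))  # ('appi', 'elpful')
```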
Computation
Common algorithm
Using Levenshtein's original operations, the edit distance between a = a_1 … a_m and b = b_1 … b_n is given by d_mn, defined by the recurrence

\[
\begin{aligned}
d_{i0} &= \sum_{k=1}^{i} w_{\mathrm{del}}(a_k), && \text{for } 1 \le i \le m,\\
d_{0j} &= \sum_{k=1}^{j} w_{\mathrm{ins}}(b_k), && \text{for } 1 \le j \le n,\\
d_{ij} &= \begin{cases}
  d_{i-1,\,j-1} & \text{for } a_i = b_j,\\
  \min\begin{cases}
    d_{i-1,\,j} + w_{\mathrm{del}}(a_i)\\
    d_{i,\,j-1} + w_{\mathrm{ins}}(b_j)\\
    d_{i-1,\,j-1} + w_{\mathrm{sub}}(a_i, b_j)
  \end{cases} & \text{for } a_i \ne b_j,
\end{cases} && \text{for } 1 \le i \le m,\ 1 \le j \le n,
\end{aligned}
\]

with d_00 = 0. This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.
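The recurrence can be transcribed almost literally into code. The sketch below assumes unit costs (w_ins = w_del = w_sub = 1) and deliberately keeps the naive recursion discussed next, so it is exponential in the worst case; the function names are ours.

```python
def levenshtein_naive(a: str, b: str) -> int:
    """Literal evaluation of the recurrence with unit costs.
    Exponential time; shown only to mirror the definition above."""
    def d(i: int, j: int) -> int:
        if i == 0:
            return j                       # only insertions remain
        if j == 0:
            return i                       # only deletions remain
        if a[i - 1] == b[j - 1]:
            return d(i - 1, j - 1)         # equal symbols cost nothing
        return 1 + min(d(i - 1, j),        # delete a[i-1]
                       d(i, j - 1),        # insert b[j-1]
                       d(i - 1, j - 1))    # substitute a[i-1] with b[j-1]
    return d(len(a), len(b))

print(levenshtein_naive("kitten", "sitting"))  # 3
```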
The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer, although it has a history of multiple invention.
After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm, starting at the bottom-right entry d_mn.
This algorithm has a time complexity of Θ(mn), where m and n are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m, n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations. A linear-space solution to this problem is offered by Hirschberg's algorithm.
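A sketch of the two-row variant mentioned above, again with unit costs; as noted, it yields the distance but not the edit script. The function name is ours.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    """Wagner–Fischer dynamic programming keeping only two rows of the table,
    so space is proportional to the shorter string."""
    if len(a) < len(b):
        a, b = b, a                        # iterate rows over the longer string
    previous = list(range(len(b) + 1))     # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]                      # d(i, 0): i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1])
            else:
                current.append(1 + min(previous[j],       # deletion
                                       current[j - 1],    # insertion
                                       previous[j - 1]))  # substitution
        previous = current
    return previous[len(b)]

print(levenshtein_two_rows("kitten", "sitting"))  # 3
```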
Improved algorithms
Improving on the Wagner–Fischer algorithm described above, Ukkonen describes several variants, one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s · min(m, n)), where m and n are the lengths of the strings. Space complexity is O(s²) or O(s), depending on whether the edit sequence needs to be read off. Further improvements by Landau, Myers, and Schmidt give an O(s² + max(m, n)) time algorithm.
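Below is a simplified sketch of the diagonal-band idea (not Ukkonen's algorithm itself): only cells within s of the main diagonal are filled in, which is safe because any alignment of cost at most s never strays farther than s from the diagonal. The function name and the min(d, s + 1) return convention are our choices.

```python
def banded_levenshtein(a: str, b: str, s: int) -> int:
    """Fill only the table cells with |i - j| <= s.  If the true distance d
    is at most s, the result equals d; otherwise the result is s + 1,
    signalling "more than s"."""
    bound = s + 1                          # stands in for "too far"
    m, n = len(a), len(b)
    if abs(m - n) > s:
        return bound                       # the band cannot reach the corner
    previous = [j if j <= s else bound for j in range(n + 1)]
    for i in range(1, m + 1):
        current = [bound] * (n + 1)
        if i <= s:
            current[0] = i
        for j in range(max(1, i - s), min(n, i + s) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(previous[j - 1] + cost,   # match / substitution
                       previous[j] + 1,          # deletion
                       current[j - 1] + 1)       # insertion
            current[j] = min(best, bound)        # clamp values above the threshold
        previous = current
    return previous[n]

print(banded_levenshtein("kitten", "sitting", 5))  # 3: within the band, exact
print(banded_levenshtein("kitten", "sitting", 1))  # 2 = s + 1: distance exceeds 1
```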
Applications
Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. Beyond computing the distance between a single pair of strings, various algorithms solve related problems:
- Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
- Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string p, called the pattern, and a constant k; it then builds a deterministic finite state automaton that finds, in an arbitrary string s, a substring whose edit distance to p is at most k (a dynamic-programming sketch of this formulation follows the list). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
- Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.
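For the approximate-matching formulation referenced in the list above, a common dynamic-programming sketch (in the style usually attributed to Sellers, not Ukkonen's automaton or the bitap algorithm) sets the first row of the table to zero so a match may start at any text position; the function name and the choice to return end positions are ours.

```python
def approximate_matches(pattern: str, text: str, k: int) -> list[int]:
    """Positions j such that some substring of text ending at position j
    has edit distance at most k to the pattern."""
    m = len(pattern)
    column = list(range(m + 1))        # distances against an empty text prefix
    ends = []
    for j, tc in enumerate(text, start=1):
        diagonal = column[0]           # cell up-and-left of the current one
        column[0] = 0                  # a match may start at any position
        for i, pc in enumerate(pattern, start=1):
            above_left, old = diagonal, column[i]
            cost = 0 if pc == tc else 1
            column[i] = min(above_left + cost,   # match / substitution
                            column[i - 1] + 1,   # unmatched pattern symbol
                            old + 1)             # skipped text symbol
            diagonal = old
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_matches("survey", "surgery", 2))  # [5, 6, 7]
```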
Language edit distance
A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language: it is the minimum edit distance that can be attained between the given string and any string belonging to the language. Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.