Suffix array

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.
Suffix arrays were introduced by Manber & Myers (1990) as a simple, space-efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name PAT array.
Li, Li & Huo (2016) gave the first in-place O(n) time suffix array construction algorithm that is optimal both in time and space, where in-place means that the algorithm only needs O(1) additional space beyond the input string and the output suffix array.
Enhanced suffix arrays are suffix arrays with additional tables that reproduce the full functionality of suffix trees while preserving the same time and memory complexity.
The suffix array for a subset of all suffixes of a string is called a sparse suffix array. Multiple probabilistic algorithms have been developed to minimize the additional memory usage, including an algorithm that is optimal in both time and memory.

Definition

Let S = S[1]S[2]...S[n] be an n-string and let S[i, j] denote the substring of S ranging from i to j inclusive.
The suffix array A of S is defined to be an array of integers providing the starting positions of the suffixes of S in lexicographical order. That is, an entry A[i] contains the starting position of the i-th smallest suffix in S, and thus for all 1 < i ≤ n: S[A[i-1], n] < S[A[i], n].
Each suffix of S shows up in A exactly once. Suffixes are simple strings. These strings are sorted before their starting positions are saved in A.

Example

Consider the text S = banana$ to be indexed:

i	1	2	3	4	5	6	7
S[i]	b	a	n	a	n	a	$

The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character. The text has the following suffixes:

Suffix	i
banana$	1
anana$	2
nana$	3
ana$	4
na$	5
a$	6
$	7

These suffixes can be sorted in ascending order:

Suffix	i
$	7
a$	6
ana$	4
anana$	2
banana$	1
na$	5
nana$	3

The suffix array A contains the starting positions of these sorted suffixes:

i =	1	2	3	4	5	6	7
A[i] =	7	6	4	2	1	5	3

The suffix array with the suffixes written out vertically underneath for clarity:

i =	1	2	3	4	5	6	7
A[i] =	7	6	4	2	1	5	3
1	$	a	a	a	b	n	n
2		$	n	n	a	a	a
3			a	a	n	$	n
4			$	n	a		a
5				a	n		$
6				$	a
7					$

For example, A[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.
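The definition can be checked mechanically against this example. The following Python sketch is illustrative only: the helper name is_suffix_array is hypothetical, positions are kept 1-based to match the tables above, and it relies on Python's built-in string comparison, under which $ sorts before the lowercase letters.

```python
def is_suffix_array(S, A):
    """Check that A (1-based positions) is the suffix array of S."""
    n = len(S)
    # Every starting position 1..n must appear exactly once.
    if sorted(A) != list(range(1, n + 1)):
        return False
    # Consecutive entries must refer to strictly increasing suffixes.
    suffixes = [S[a - 1:] for a in A]
    return all(suffixes[i - 1] < suffixes[i] for i in range(1, n))


S = "banana$"
A = [7, 6, 4, 2, 1, 5, 3]
print(is_suffix_array(S, A))   # True
print(S[A[3 - 1] - 1:])        # third entry is 4, i.e. the suffix ana$
```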

Correspondence to suffix trees

Suffix arrays are closely related to suffix trees:

Suffix arrays can be constructed by performing a depth-first traversal of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character.
A suffix tree can be constructed in linear time by using a combination of suffix array and LCP array. For a description of the algorithm, see the corresponding section in the LCP array article.
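The first correspondence can be illustrated with a small sketch. As a simplifying assumption, the code below uses an uncompressed suffix trie (nested dictionaries, quadratic space) as a stand-in for a real suffix tree; the function name and the LEAF marker are hypothetical. Visiting the edges in lexicographical order and collecting the leaf labels yields the suffix array.

```python
def suffix_array_via_trie(S):
    """Build an uncompressed suffix trie of S and read the suffix array off it.

    Assumes S ends with a unique sentinel such as "$", so every suffix ends
    at its own leaf. Positions are 1-based, as in the example tables.
    """
    LEAF = ""  # hypothetical marker key storing a leaf's suffix start
    root = {}
    for start in range(len(S)):
        node = root
        for ch in S[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1

    order = []

    def dfs(node):
        # Visit outgoing edges in lexicographical order of their character.
        for ch in sorted(node):
            if ch == LEAF:
                order.append(node[LEAF])
            else:
                dfs(node[ch])

    dfs(root)
    return order


print(suffix_array_via_trie("banana$"))  # [7, 6, 4, 2, 1, 5, 3]
```

The recursion depth grows with the length of the text, so this sketch is only suitable for short strings.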

It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information and solves the same problem in the same time complexity.
Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms and improved cache locality.

Space efficiency

Suffix arrays were introduced by Manber & Myers (1990) in order to improve over the space requirements of suffix trees: suffix arrays store n integers. Assuming an integer requires 4 bytes, a suffix array requires 4n bytes in total. This is significantly less than the 20n bytes which are required by a careful suffix tree implementation.
However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires O(n log n) space, whereas the original text over an alphabet of size σ only requires O(n log σ) bits.
For a human genome with σ = 4 and n = 3.4 × 10^9, the suffix array would therefore occupy about 16 times more memory than the genome itself.
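The factor of 16 can be reproduced with a short calculation. The sketch below assumes a genome length of about 3.4 × 10^9 characters over a four-letter alphabet, and that each suffix array entry and each character is stored in the minimum whole number of bits.

```python
import math

n = 3.4e9        # approximate length of a human genome in characters
sigma = 4        # alphabet size (A, C, G, T)

bits_per_entry = math.ceil(math.log2(n))      # bits to address any position
bits_per_char = math.ceil(math.log2(sigma))   # bits to encode one character

suffix_array_bits = n * bits_per_entry
text_bits = n * bits_per_char

print(bits_per_entry, bits_per_char)          # 32 2
print(suffix_array_bits / text_bits)          # 16.0
```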
Such discrepancies motivated a trend towards compressed suffix arrays and BWT-based compressed full-text indices such as the FM-index. These data structures require only space within the size of the text or even less.

Construction algorithms

A suffix tree can be built in O(n) and can be converted into a suffix array by traversing the tree depth-first, also in O(n), so there exist algorithms that can build a suffix array in O(n).
A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms require O(n log n) suffix comparisons, but a suffix comparison runs in O(n) time, so the overall runtime of this approach is O(n^2 log n).
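A minimal sketch of the naive approach, using Python's built-in comparison sort; the function name is hypothetical and positions are 1-based to match the example tables. Note that the key function materializes every suffix as a separate string, so this version also needs quadratic working memory.

```python
def naive_suffix_array(S):
    """Return the suffix array of S with 1-based starting positions."""
    # sorted() performs O(n log n) comparisons; each one compares two
    # suffixes character by character, which can take O(n) time.
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])


print(naive_suffix_array("banana$"))  # [7, 6, 4, 2, 1, 5, 3]
```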
More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:

minimal asymptotic complexity
lightweight in space, meaning little or no working memory besides the text and the suffix array itself is needed
fast in practice

One of the first algorithms to achieve all goals is the SA-IS algorithm of Nong, Zhang & Chan (2009). The algorithm is also rather simple and can be enhanced to simultaneously construct the LCP array. The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation outperforms most other linear or super-linear construction approaches.
Besides time and space requirements, suffix array construction algorithms are also differentiated by their supported alphabet: constant alphabets, where the alphabet size is bounded by a constant; integer alphabets, where characters are integers in a range depending on n; and general alphabets, where only character comparisons are allowed.
Most suffix array construction algorithms are based on one of the following approaches:

Prefix doubling algorithms are based on a strategy of Karp, Miller & Rosenberg (1972). The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix.
Recursive algorithms follow the approach of the suffix tree construction algorithm by Farach (1997) to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array.
Induced copying algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by Puglisi, Smyth & Turpin (2007).
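Of the three approaches, prefix doubling is the simplest to sketch. The following Python code is an illustrative, unoptimized version rather than any particular published implementation: the function name is hypothetical, it re-sorts with a general comparison sort in every round (giving O(n log^2 n) time instead of the O(n log n) achievable with radix sorting), and it returns 1-based positions.

```python
def suffix_array_prefix_doubling(S):
    """Return the suffix array of S (1-based positions) by prefix doubling."""
    n = len(S)
    rank = [ord(c) for c in S]          # initial order: by first character
    sa = list(range(n))
    k = 1
    while True:
        # Sort key: rank of the first k characters, then rank of the next
        # k characters (-1 when the suffix is too short to have them).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:   # all ranks distinct: done
            break
        k *= 2
    return [i + 1 for i in sa]


print(suffix_array_prefix_doubling("banana$"))  # [7, 6, 4, 2, 1, 5, 3]
```

After the round with prefix length k, the ranks order the suffixes by their first 2k characters, so at most about log2(n) rounds are needed.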

A well-known recursive algorithm for integer alphabets is the DC3 / skew algorithm of Kärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel and external memory suffix array construction algorithms.
Recent work by Salson et al. (2010) proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is O(n log n), it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.
In practical open source work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm. This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses induced copying combined with the Itoh-Tanaka algorithm.

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern within the string. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two binary searches. The first search locates the starting position of the interval, and the second one determines the end position:

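def search(S, A, P):
    """Return (s, r) such that A[s:r] holds the starting position of every
    suffix of S that begins with the pattern P.

    A stores 1-based positions, as in the example above; s and r are
    0-based indices into the Python list A.
    """
    def suffixAt(pos):
        return S[pos - 1:]          # suffix of S starting at 1-based pos

    n = len(A)
    # First binary search: find the start of the interval.
    l, r = 0, n
    while l < r:
        mid = (l + r) // 2          # integer division, rounding down
        if P > suffixAt(A[mid]):
            l = mid + 1
        else:
            r = mid
    s = l
    # Second binary search: find the end of the interval.
    r = n
    while l < r:
        mid = (l + r) // 2
        if not suffixAt(A[mid]).startswith(P):
            r = mid
        else:
            l = mid + 1
    return (s, r)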

Finding the substring pattern P of length m in the string S of length n takes O(m log n) time, given that a single suffix comparison needs to compare m characters. Manber & Myers (1990) describe how this bound can be improved to O(m + log n) time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters when it is already known that these are part of the longest common prefix of the pattern and the current search interval. Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of O(m), as known from suffix trees.
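The basic bound of O(m log n) can be made visible by truncating every suffix to the pattern length before comparing it. The sketch below is one possible way to do this with Python's standard bisect module; the class TruncatedSuffixes and the function find_occurrences are hypothetical names, and A is assumed to hold 1-based positions as in the example tables. It does not implement the LCP-based refinements.

```python
from bisect import bisect_left, bisect_right


class TruncatedSuffixes:
    """Read-only view: item k is the first m characters of the k-th smallest suffix."""

    def __init__(self, S, A, m):
        self.S, self.A, self.m = S, A, m

    def __len__(self):
        return len(self.A)

    def __getitem__(self, k):
        start = self.A[k] - 1              # A holds 1-based positions
        return self.S[start:start + self.m]


def find_occurrences(S, A, P):
    """Return the sorted 1-based positions at which P occurs in S."""
    view = TruncatedSuffixes(S, A, len(P))
    lo = bisect_left(view, P)              # first suffix whose prefix is >= P
    hi = bisect_right(view, P)             # first suffix whose prefix is > P
    return sorted(A[lo:hi])


S = "banana$"
A = [7, 6, 4, 2, 1, 5, 3]
print(find_occurrences(S, A, "ana"))       # [2, 4]
print(find_occurrences(S, A, "nan"))       # [3]
print(find_occurrences(S, A, "x"))         # []
```

Each of the two binary searches inspects O(log n) entries and compares at most m characters per entry.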
Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other characters, then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: BWT[i] = S[A[i] - 1], where for A[i] = 1 the preceding character is taken to be the last character S[n].
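As a small illustration, the BWT of the example text can be read off its suffix array by taking, for each sorted suffix, the character that precedes it in the text. The function name below is hypothetical, and the sketch assumes 1-based positions in A and a unique, smallest sentinel at the end of S.

```python
def bwt_from_suffix_array(S, A):
    """Return the BWT of S given its suffix array A (1-based positions).

    Assumes S ends with a unique sentinel that is smaller than every other
    character.
    """
    # The character preceding the suffix at 1-based position a is S[a - 1]
    # in 1-based terms, i.e. S[a - 2] in Python's 0-based indexing.
    # For a = 1 the index is -1, which wraps around to the sentinel.
    return "".join(S[a - 2] for a in A)


S = "banana$"
A = [7, 6, 4, 2, 1, 5, 3]
print(bwt_from_suffix_array(S, A))  # annb$aa
```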
Suffix arrays can also be used to look up substrings in example-based machine translation, demanding much less storage than a full phrase table as used in statistical machine translation.
Many additional applications of the suffix array require the LCP array. Some of these are detailed in the application section of the latter.
