Induction of regular languages

In computational learning theory, induction of regular languages refers to the task of learning a formal description of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way, approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction.

Example

A regular language is defined as a set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the last of these leads to the shortest notation, it is introduced and used here. Given a set Σ of symbols, a regular expression can be any of

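∅ (denoting the empty set of strings),
ε (denoting the singleton set containing just the empty string),
a (where a is any character in Σ; denoting the singleton set containing just the single-character string a),
r+s (where r and s are, in turn, simpler regular expressions; denoting the union of their sets),
r⋅s (denoting the set of all concatenations of a string from r's set with a string from s's set),
r⁺ (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or
r^* (like r⁺, but also including the empty string, seen as a 0-fold repetition).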

For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits, while 1⋅(0+1)^*⋅0 denotes the set of all even binary numbers without leading zeros.
Given a set of strings, the task of regular language induction is to come up with a regular expression that denotes a set containing all of them.
As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0^*, corresponding to the informal characterization "a 1 followed by arbitrarily many 0s".
However, (0+1)^* and 1+(1⋅0)+(1⋅0⋅0) are two other regular expressions, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings; they are called the trivial overgeneralization and undergeneralization, respectively.
Some approaches work in an extended setting in which a set of "negative example" strings is also given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples.
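
The consistency requirement can be checked mechanically. The following is a minimal sketch, assuming the positive sample 1, 10, 100 and borrowing the negative strings 11, 1001, and 0 from the lattice discussion; the helper name `consistent` is illustrative, and the article's notation is transcribed into Python `re` syntax (`+` becomes `|`, `⋅` becomes juxtaposition).

```python
import re


def consistent(pattern, positives, negatives=()):
    """True if pattern accepts every positive and rejects every negative string."""
    rx = re.compile(pattern)
    accepts_all = all(rx.fullmatch(s) is not None for s in positives)
    rejects_all = all(rx.fullmatch(s) is None for s in negatives)
    return accepts_all and rejects_all


positives = ["1", "10", "100"]   # assumed example sample
negatives = ["11", "1001", "0"]  # negative examples used for illustration

natural = r"10*"                 # 1 followed by arbitrarily many 0s
overgeneral = r"[01]*"           # trivial overgeneralization over {0,1}
undergeneral = r"1|10|100"       # trivial undergeneralization
```

All three candidates accept the positive sample, but only the first and the last also reject the negative strings; the trivial overgeneralization accepts everything.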

Lattice of automata

Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively.
Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation.
For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton A_a,b,c,d, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(A_a,b,c,d / E) yields the partially ordered set shown in the picture.
Each node's language is denoted by a regular expression. The language may be recognized by quotient automata with respect to different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's.
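
The construction can be sketched in a few lines, assuming the example sample 1, 10, 100 and representing the undergeneralized automaton as a prefix tree whose states are the prefixes themselves. The function names below are illustrative and not taken from Dupont et al.

```python
def prefix_tree_acceptor(sample):
    """Undergeneralized automaton: one state per prefix of a sample string."""
    states = sorted({s[:i] for s in sample for i in range(len(s) + 1)},
                    key=lambda p: (len(p), p))
    delta = {(p[:-1], p[-1]): p for p in states if p}
    return states, delta, set(sample)


def partitions(items):
    """Yield every partition of a list of states as a list of blocks."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put head into each existing block in turn, or into a new block.
        for i in range(len(part)):
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part


def quotient_accepts(word, delta, finals, partition, start=""):
    """Run the (possibly nondeterministic) quotient automaton on word."""
    block_of = {q: i for i, block in enumerate(partition) for q in block}
    current = {block_of[start]}
    for ch in word:
        current = {block_of[t] for (s, a), t in delta.items()
                   if a == ch and block_of[s] in current}
    return any(block_of[f] in current for f in finals)


states, delta, finals = prefix_tree_acceptor(["1", "10", "100"])
all_partitions = list(partitions(states))  # every equivalence on 4 states
bottom = [[q] for q in states]             # no states merged
top = [states]                             # all states merged
```

With four states there are 15 partitions. The `bottom` partition accepts exactly the sample, while `top` accepts every string over the symbols that occur in it.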
If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and those that do not.
Most interesting are those automata immediately below the border.
In the picture, separation borders are shown for the negative example strings 11, 1001, and 0.
Coste and Nicolas present their own search method within the lattice, which they relate to Mitchell's version space paradigm.
To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples.
Later, they investigate several ordering relations on the set of all possible state fusions.
Kudo and Shimbo use the representation by automaton factorizations to give a unified framework for the following approaches:

k-reversible languages and the "tail clustering" follow-up approach,
successor automata and the predecessor-successor method, and
pumping-based approaches.

Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization.

Approaches

k-reversible languages

Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k.
Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a₀,...,a_k ∈ Σ ∀s₁, s₂ ∈ Q: δ^*(s₁, a₀...a_k) = δ^*(s₂, a₀...a_k) ⇒ s₁ = s₂, where δ^* means the homomorphic extension of δ to arbitrary words.
Angluin gives a cubic algorithm for learning the smallest k-reversible language from a given set of input words; for k=0, the algorithm even has almost linear complexity.
The required state uniqueness after k+1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton.
This algorithm has been used to learn simple parts of English syntax;
later, an incremental version has been provided.
Another approach based on k-reversible automata is the tail clustering method.
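
The state merging behind the case k = 0 can be sketched as follows. This is a naive fixpoint loop, not the almost-linear algorithm mentioned above. It assumes a non-empty sample and assumes that a zero-reversible automaton has a single accepting state and is deterministic both forwards and backwards. The names `learn_zero_reversible` and `accepts` are illustrative.

```python
def learn_zero_reversible(sample):
    """Merge states of the prefix tree until it is zero-reversible (sketch)."""
    sample = list(sample)
    states = {s[:i] for s in sample for i in range(len(s) + 1)}
    parent = {q: q for q in states}  # union-find forest over prefix states

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    def union(p, q):
        p, q = find(p), find(q)
        if p == q:
            return False
        parent[q] = p
        return True

    edges = [(p[:-1], p[-1], p) for p in states if p]
    for f in sample[1:]:             # assumption: one accepting state
        union(sample[0], f)

    changed = True
    while changed:                   # repeat until a full pass merges nothing
        changed = False
        fwd, bwd = {}, {}
        for src, a, dst in edges:
            s, d = find(src), find(dst)
            # Same source and symbol must lead to the same target.
            if (s, a) in fwd and union(fwd[(s, a)], d):
                changed = True
            fwd[(s, a)] = find(d)
            s, d = find(src), find(dst)
            # Same target and symbol must come from the same source.
            if (d, a) in bwd and union(bwd[(d, a)], s):
                changed = True
            bwd[(d, a)] = find(s)

    delta = {(find(s), a): find(d) for s, a, d in edges}
    return delta, find(""), find(sample[0])


def accepts(automaton, word):
    delta, state, final = automaton
    for ch in word:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state == final


learned = learn_zero_reversible(["1", "10", "100"])
```

On the sample 1, 10, 100 the three accepting prefix states collapse into one state with a loop on 0, so the learned automaton generalizes to a 1 followed by arbitrarily many 0s.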

Successor automata

From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states.
For example, the singleton input set leads to an automaton corresponding to the regular expression ^*.
An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene ⁺ and then includes for each character the set of its possible predecessors in its state.
Successor automata can learn exactly the class of local languages.
Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate homomorphism is provided.
In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method.
The learnability of local languages can be reduced to that of k-reversible languages.
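
A successor automaton in the sense described above can be sketched as three sets: permitted first characters, permitted last characters, and permitted adjacent pairs. The input string `aabbaabb` and the function names are hypothetical choices for illustration; the sketch also assumes that the empty string is rejected.

```python
def successor_automaton(sample):
    """Collect first characters, last characters, and adjacent character pairs."""
    firsts = {s[0] for s in sample if s}
    lasts = {s[-1] for s in sample if s}
    pairs = {(s[i], s[i + 1]) for s in sample for i in range(len(s) - 1)}
    return firsts, lasts, pairs


def accepts(automaton, word):
    """A word is accepted if its ends and all adjacent pairs were observed."""
    firsts, lasts, pairs = automaton
    if not word:
        return False  # assumption: the empty string is not in the language
    return (word[0] in firsts and word[-1] in lasts
            and all((word[i], word[i + 1]) in pairs
                    for i in range(len(word) - 1)))


automaton = successor_automaton(["aabbaabb"])  # hypothetical input
```

Because membership depends only on the two ends and on adjacent pairs, the accepted set is a local language, which matches the class stated above.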

Early approaches

Chomsky and Miller used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw,..., uv^kw also belong to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma.
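
The membership queries described above can be sketched as follows. The oracle `member` is a hypothetical callable standing in for whoever answers the queries, and the target language 1⋅0^* is an assumption made only for the demonstration.

```python
import re


def pump_queries(u, v, w, k, member):
    """Map each repetition count i in 0..k to the oracle's answer for u v^i w."""
    return {i: bool(member(u + v * i + w)) for i in range(k + 1)}


def supports_cycle(u, v, w, k, member):
    """True if every pumped string up to k repetitions is in the language."""
    return all(pump_queries(u, v, w, k, member).values())


def member(s):
    # Hypothetical oracle for the language "1 followed by arbitrarily many 0s".
    return re.fullmatch(r"10*", s) is not None
```

For the input string 10 split as u = 1, v = 0, w = ε, every query succeeds, which supports building a cycle on 0. Splitting it as u = ε, v = 1, w = 0 fails already for uw = 0.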

Cover automata

Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language.
Given such a language F, they search for a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ^≤l = F, where l is the length of the longest string in F, and Σ^≤l denotes the set of all strings not longer than l.
If such a cover automaton exists, F is uniquely determined by A and l.
For example, F = has and a cover automaton corresponding to the regular expression ^*⋅a⋅d.
For two strings x and y, Câmpeanu et al. define x ~ y if xz∈F ⇔ yz∈F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O algorithm to construct from F a cover automaton A of minimal state count.
Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata.
Păun et al. improve the time complexity to O.
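
The covering condition can be checked by brute force for small alphabets. In this sketch the automaton is stood in for by an arbitrary acceptance predicate, and the finite language {a, aa, aaa} over the alphabet {a, b} is a made-up example; the name `is_cover` is illustrative.

```python
import re
from itertools import product


def is_cover(accepts, finite_language, alphabet):
    """Check that the accepted strings of length at most l are exactly F."""
    l = max(len(s) for s in finite_language)
    bounded = {"".join(t)
               for n in range(l + 1)
               for t in product(alphabet, repeat=n)}
    return {s for s in bounded if accepts(s)} == set(finite_language)


F = {"a", "aa", "aaa"}  # hypothetical finite language, so l = 3


def a_plus(s):
    return re.fullmatch(r"a+", s) is not None


def a_star(s):
    return re.fullmatch(r"a*", s) is not None
```

Here `a_plus` covers F although it accepts infinitely many longer strings, whereas `a_star` does not, because it also accepts the empty string.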

Residual automata

For a set S of strings and a string u, the Brzozowski derivative u⁻¹S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u, formally: u⁻¹S = { v ∈ Σ^* : uv ∈ S }, cf. picture.
Denis et al. define a residual automaton to be a nondeterministic finite automaton A where each state q corresponds to a Brzozowski derivative of its accepted language L(A), formally: ∀q∈Q ∃u∈Σ^*: L(A,q) = u⁻¹L(A), where L(A,q) denotes the language accepted from q as start state.
They show that each regular language is generated by a uniquely determined minimal residual automaton. Its states are ∪-indecomposable Brzozowski derivatives, and it may be exponentially smaller than the minimal deterministic automaton.
Moreover, they show that residual automata for regular languages cannot be learned in polynomial time, even assuming optimal sample inputs.
They give a learning algorithm for residual automata and prove that it learns the automaton from its characteristic sample of positive and negative input strings.
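
For a finite set of strings, the Brzozowski derivative can be computed directly from its definition. The sketch below uses the strings 1, 10, 100 as an assumed example; `derivative` and `residuals` are illustrative names, and the code does not implement the learning algorithm of Denis et al.

```python
def derivative(u, strings):
    """Brzozowski derivative: the rests of all strings that start with u."""
    return {s[len(u):] for s in strings if s.startswith(u)}


def residuals(strings):
    """All distinct non-empty derivatives, taken over the prefixes of the set."""
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    return {frozenset(derivative(u, strings)) for u in prefixes}


S = {"1", "10", "100"}  # assumed example set
```

Cutting off the prefix 1 leaves the rests ε, 0, and 00, while the prefix 0 leaves nothing, since no string in S starts with 0.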

Reduced regular expressions

Brill defines a reduced regular expression to be any of

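a (where a is any character in Σ; denoting the singleton set containing just the single-character string a),
¬a (denoting any single character in Σ other than a),
• (denoting any single character in Σ),
a^*, (¬a)^*, or •^* (denoting arbitrarily many, possibly zero, repetitions of characters matching a, ¬a, or •, respectively), or
r⋅s (where r and s are, in turn, simpler reduced regular expressions; denoting the set of all concatenations of a string from r's set with a string from s's set).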

Given an input set of strings, he builds step by step a tree with each branch labelled by a reduced regular expression accepting a prefix of some input strings, and each node labelled with the set of lengths of accepted prefixes.
He aims at learning correction rules for English spelling errors,
rather than at theoretical considerations about learnability of language classes.
Consequently, he uses heuristics to prune the tree-buildup, leading to a considerable improvement in run time.
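
The limited shape of reduced regular expressions makes them easy to translate into an ordinary regex engine. The triple encoding `(kind, char, starred)` below is a hypothetical representation chosen for this sketch and is not Brill's data structure; it only illustrates the syntax, not the tree-building or pruning.

```python
import re


def to_python_regex(reduced):
    """Translate a list of (kind, char, starred) atoms into Python re syntax."""
    parts = []
    for kind, char, starred in reduced:
        if kind == "char":       # a
            atom = re.escape(char)
        elif kind == "not":      # ¬a
            atom = "[^" + re.escape(char) + "]"
        elif kind == "any":      # •
            atom = "."
        else:
            raise ValueError("unknown atom kind: " + kind)
        parts.append(atom + ("*" if starred else ""))
    return "".join(parts)        # concatenation r⋅s


def matches(reduced, word):
    # DOTALL lets "." match any single character, including a newline.
    return re.fullmatch(to_python_regex(reduced), word, re.DOTALL) is not None


# t ⋅ (¬h)^* ⋅ • in the hypothetical encoding
example = [("char", "t", False), ("not", "h", True), ("any", None, False)]
```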

Applications

Finding common patterns in DNA and RNA structure descriptions
Modelling natural language acquisition by humans
Learning of structural descriptions from structured example documents, in particular Document Type Definitions from SGML documents
Learning the structure of music pieces
Obtaining compact representations of finite languages
Classifying and retrieving documents
Generating context-dependent correction rules for English grammatical errors
