Automatic taxonomy construction


Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence.
Among other things, a taxonomy can be used to organize and index knowledge, for example as a library classification system or a search engine taxonomy, so that users can more easily find the information they are looking for. Taxonomies are typically tree-structured and divide a domain into categories, called taxa, based on the values of shared properties.
Manually developing and maintaining a taxonomy is a labor-intensive task requiring significant time and resources, including familiarity with or expertise in the taxonomy's domain. In addition, domain modelers have their own points of view, which inevitably, even if unintentionally, work their way into the taxonomy. ATC uses artificial intelligence techniques to generate a taxonomy for a domain automatically in order to avoid these problems.

Approaches

There are several approaches to ATC. One approach is to use rules to detect patterns in the corpus and use those patterns to infer relations such as hyponymy. Other approaches use machine learning techniques such as Bayesian inference and artificial neural networks.

Keyword extraction

One approach to building a taxonomy is to automatically gather the keywords from a domain using keyword extraction, then analyze the relationships between them, and then arrange them as a taxonomy based on those relationships.
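The three steps above (gather keywords, analyze their relationships, arrange them) can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular ATC system: keywords are chosen by raw term frequency, and the relationship test is a document co-occurrence heuristic in which a term is treated as broader than another if it appears in nearly every document that contains the other. The function names, the tiny stopword list, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "that", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(docs, top_n=5):
    """Step 1: return the top_n most frequent non-stopword terms in docs."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]


def subsumption_pairs(docs, keywords, threshold=0.8):
    """Step 2: propose (broader, narrower) pairs from document co-occurrence.

    x is proposed as broader than y when x occurs in at least `threshold`
    of the documents containing y, but y does not occur in every document
    containing x. This heuristic is an assumption of the sketch.
    """
    doc_sets = [set(tokenize(d)) for d in docs]
    containing = {k: {i for i, s in enumerate(doc_sets) if k in s} for k in keywords}
    pairs = []
    for x in keywords:
        for y in keywords:
            if x == y or not containing[x] or not containing[y]:
                continue
            both = containing[x] & containing[y]
            p_x_given_y = len(both) / len(containing[y])
            p_y_given_x = len(both) / len(containing[x])
            if p_x_given_y >= threshold and p_y_given_x < 1.0:
                pairs.append((x, y))
    return pairs


docs = [
    "the dog is an animal",
    "the cat is an animal",
    "an animal needs food",
    "a dog is an animal that barks",
]
keywords = extract_keywords(docs, top_n=3)
# Step 3 would arrange the proposed pairs into a tree, broader terms on top.
print(subsumption_pairs(docs, keywords))
```

On this toy corpus, animal appears in every document that mentions dog or cat, so it is proposed as the broader term for both. The sketch does no stemming or phrase detection, so "dog" and "dogs" would be counted as unrelated terms.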

Hyponymy and "is-a" relations

In ATC programs, one of the most important tasks is the discovery of hypernym and hyponym relations among words. One way to do that from a body of text is to search for certain phrases like "is a" and "such as".
In linguistics, is-a relations are called hyponymy. Words that describe categories are called hypernyms and words that are examples of categories are hyponyms. For example, dog is a hypernym and Fido is one of its hyponyms. A word can be both a hyponym and a hypernym. So, dog is a hyponym of mammal and also a hypernym of Fido.
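The phrase search described above can be sketched with two regular expressions, one for "X is a Y" and one for "Y such as X1, X2 and X3". This is a simplified illustration: the pattern names and the function `extract_is_a_pairs` are hypothetical, the patterns only handle single-word terms, and no lemmatization is done, so "dogs" and "dog" are treated as different words. A practical system would typically work on parsed noun phrases and use many more patterns.

```python
import re

# "X is a Y" / "X is an Y": X is the hyponym, Y the hypernym.
IS_A = re.compile(r"\b(\w+)\s+is\s+an?\s+(\w+)", re.IGNORECASE)

# "Y such as X1, X2 and X3": Y is the hypernym, each Xi a hyponym.
SUCH_AS = re.compile(
    r"\b(\w+),?\s+such\s+as\s+"
    r"((?:\w+(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+))*\w+)",
    re.IGNORECASE,
)

# Separators inside the enumerated list: commas, "and", "or".
LIST_SEP = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", re.IGNORECASE)


def extract_is_a_pairs(text):
    """Return a set of lowercase (hyponym, hypernym) pairs found in text."""
    pairs = set()
    for hypo, hyper in IS_A.findall(text):
        pairs.add((hypo.lower(), hyper.lower()))
    for hyper, items in SUCH_AS.findall(text):
        for hypo in LIST_SEP.split(items):
            pairs.add((hypo.lower(), hyper.lower()))
    return pairs


text = "Fido is a dog. A dog is a mammal. Mammals such as dogs, cats and horses are animals."
for hypo, hyper in sorted(extract_is_a_pairs(text)):
    print(f"{hypo} is-a {hyper}")
```

Surface patterns like these are noisy: a sentence such as "This is a test" would also yield a pair, so extracted candidates usually need filtering before they are added to a taxonomy.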
Taxonomies are often represented as is-a hierarchies where each level is more specific than the level above it. For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dog and cat, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered instances of a concept. For example, Fido is an instance of the concept dog, and Fluffy is-a cat.
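An is-a hierarchy of this kind can be modeled as a tree in which every term points to its single parent. The sketch below is one possible representation, with a hypothetical `Taxonomy` class; it stores the example from the text and shows that is-a is transitive, so Fido is also a mammal and an animal. It assumes each term has exactly one parent, which holds for a strict tree but not for taxonomies that allow multiple inheritance.

```python
from collections import defaultdict


class Taxonomy:
    """A tree-structured is-a hierarchy (one parent per term)."""

    def __init__(self):
        self.parent = {}                    # child -> parent
        self.children = defaultdict(list)   # parent -> [children]

    def add_is_a(self, child, parent):
        """Record that `child` is-a `parent`."""
        if child in self.parent:
            raise ValueError(f"{child!r} already has a parent")
        self.parent[child] = parent
        self.children[parent].append(child)

    def ancestors(self, term):
        """Return the chain of hypernyms from nearest to most general."""
        chain, seen = [], {term}
        while term in self.parent:
            term = self.parent[term]
            if term in seen:                # guard against accidental cycles
                raise ValueError("cycle in taxonomy")
            seen.add(term)
            chain.append(term)
        return chain

    def is_a(self, term, category):
        """True if `category` is a direct or indirect hypernym of `term`."""
        return category in self.ancestors(term)

    def render(self, root, depth=0):
        """Return the subtree under `root` as indented lines."""
        lines = ["  " * depth + root]
        for child in self.children.get(root, []):
            lines.extend(self.render(child, depth + 1))
        return lines


taxonomy = Taxonomy()
for child, parent in [("mammal", "animal"), ("dog", "mammal"), ("cat", "mammal"),
                      ("Fido", "dog"), ("Fluffy", "cat")]:
    taxonomy.add_is_a(child, parent)

print("\n".join(taxonomy.render("animal")))
print(taxonomy.is_a("Fido", "animal"))
```

Storing only the direct parent keeps the structure small; more general relations are recovered by walking up the tree rather than being stored explicitly.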

Applications

ATC can be used to build taxonomies for search engines, to improve search results.
ATC systems are a key component of ontology learning, and have been used to automatically generate large ontologies for domains such as insurance and finance. They have also been used to enhance existing large networks such as WordNet to make them more complete and consistent.

ATC software

Other names

Other names for automatic taxonomy construction include: