Text simplification


Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so that its grammar and structure are greatly simplified while the underlying meaning and information remain the same. It is an important area of research because natural human languages ordinarily contain large vocabularies and complex compound constructions that are not easily processed automatically. To reduce linguistic diversity, semantic compression can be employed to limit and simplify the set of words used in a given text.

Example

Text simplification is illustrated with an example from Siddharthan. The first sentence contains two relative clauses and one conjoined verb phrase. A text simplification system aims to simplify the first sentence to the second sentence.

One approach to text simplification is lexical simplification via lexical substitution, a two-step process that consists of identifying complex words and replacing them with simpler synonyms. A key challenge is identifying the complex words, which is performed by a machine learning classifier trained on labelled data. Classical methods ask labellers to apply a binary label to each word, simple or complex; an improvement is to ask labellers to sort words in order of complexity, which yields more consistent labels.
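
The two-step process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular system: the frequency table, the synonym table, the threshold, and the function names are all hypothetical, and a simple frequency cutoff stands in for the trained classifier that would normally identify complex words.

```python
import re

# Hypothetical toy resources. A real system would learn word complexity
# from labelled data and draw synonyms from a thesaurus or language model.
WORD_FREQUENCY = {
    "the": 5000, "to": 5000, "will": 3000, "new": 2000, "work": 1500,
    "use": 900, "help": 850, "start": 800, "team": 600, "tools": 500,
    "utilize": 40, "facilitate": 30, "commence": 20,
}
SYNONYMS = {
    "utilize": ["use", "employ"],
    "facilitate": ["help", "ease"],
    "commence": ["start", "begin"],
}
COMPLEXITY_THRESHOLD = 100  # assumed cutoff: rarer words count as complex


def is_complex(word):
    """Step 1: flag a word as complex (frequency stands in for a classifier)."""
    return WORD_FREQUENCY.get(word.lower(), 0) < COMPLEXITY_THRESHOLD


def simplest_synonym(word):
    """Step 2: pick the most frequent synonym, or keep the word unchanged."""
    candidates = SYNONYMS.get(word.lower(), [])
    if not candidates:
        return word
    best = max(candidates, key=lambda c: WORD_FREQUENCY.get(c, 0))
    # Substitute only when the candidate is actually more common.
    if WORD_FREQUENCY.get(best, 0) <= WORD_FREQUENCY.get(word.lower(), 0):
        return word
    return best


def simplify(text):
    """Replace each complex word in the text with a simpler synonym."""
    def replace(match):
        word = match.group(0)
        if not is_complex(word):
            return word
        substitute = simplest_synonym(word)
        if substitute == word:
            return word
        # Preserve sentence-initial capitalisation of the original word.
        return substitute.capitalize() if word[0].isupper() else substitute

    return re.sub(r"[A-Za-z]+", replace, text)


print(simplify("The team will utilize new tools to facilitate work."))
```

Words with no entry in the synonym table are left untouched even when they fall below the threshold, so the sketch never substitutes a word it knows nothing about. It also ignores context and inflection, which a practical lexical substitution system would need to handle.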