Russian National Corpus


The Russian National Corpus (RNC) is a corpus of the Russian language that has been partially accessible through an online query interface since April 29, 2004. It is being created by the Institute of Russian Language of the Russian Academy of Sciences.
The corpus currently contains more than 600 million word forms that are automatically lemmatized and tagged for part of speech (POS) and grammemes; that is, every possible morphological analysis of an orthographic form is assigned to it. Lemmata, POS tags, grammatical features, and their combinations are searchable. In addition, a subcorpus of 6 million word forms has manually resolved homonymy.
The subcorpus with resolved morphological homonymy is also automatically marked for stress (accentuated). The whole corpus carries searchable lexical-semantic tagging, which covers morphosemantic POS subclasses, lexical-semantic characteristics proper, and derivation.
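The ambiguous tagging described above, where a word form keeps all of its candidate analyses and a query matches if any one of them fits, can be illustrated with a minimal sketch. This is not the RNC's actual implementation or API; the class names, the `search` function, and the tag labels (`V`, `S`, `praet`, `pl`, `gen`, `sg`, `nom`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate morphological analysis (hypothetical structure)."""
    lemma: str
    pos: str
    grammemes: frozenset


@dataclass
class Token:
    """An orthographic form with ALL of its candidate analyses."""
    form: str
    analyses: list


def matches(token, lemma=None, pos=None, grammemes=()):
    """True if at least one single analysis satisfies every criterion."""
    wanted = frozenset(grammemes)
    return any(
        (lemma is None or a.lemma == lemma)
        and (pos is None or a.pos == pos)
        and wanted <= a.grammemes
        for a in token.analyses
    )


def search(tokens, **query):
    """Return the positions of tokens that match the query."""
    return [i for i, t in enumerate(tokens) if matches(t, **query)]


# Toy data: "стали" is ambiguous between a verb form and a noun form.
tokens = [
    Token("стали", [
        Analysis("стать", "V", frozenset({"praet", "pl"})),
        Analysis("сталь", "S", frozenset({"gen", "sg"})),
    ]),
    Token("дом", [Analysis("дом", "S", frozenset({"nom", "sg"}))]),
]

noun_hits = search(tokens, pos="S")                      # [0, 1]
verb_hits = search(tokens, lemma="стать")                # [0]
mixed_hits = search(tokens, pos="V", grammemes=["sg"])   # []
```

In this sketch the ambiguous form is returned for both the noun query and the verb query, which is the expected behaviour when homonymy is not resolved. The last query returns nothing because no single analysis is both a verb and singular. In a disambiguated subcorpus each token would keep only one analysis.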
The RNC also includes the following subcorpora:
All texts carry tags with metatextual information: the author, the author's birth date, the creation date, the text size, and the text genres. All of these categories can be browsed and searched separately. Users can also define their own subcorpus and restrict searches for combinations of lemmata, POS/grammeme tags, and semantic tags to that subset.
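The idea of a user-defined subcorpus can be sketched as a metadata filter applied before the search. The code below is an illustration only, not the RNC's interface: the `Text` record, its field names, the helper functions, and the toy texts are all hypothetical, and each text is simplified to a flat list of lemmata.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """A corpus text with metatextual tags (hypothetical field names)."""
    author: str
    author_birth_year: int
    created_year: int
    genres: set
    lemmas: list  # simplification: one lemma per token

    @property
    def size(self):
        """Text size in tokens."""
        return len(self.lemmas)


def define_subcorpus(texts, predicate):
    """Keep only the texts whose metadata satisfy the predicate."""
    return [t for t in texts if predicate(t)]


def count_lemma(texts, lemma):
    """Count occurrences of a lemma within the given texts."""
    return sum(t.lemmas.count(lemma) for t in texts)


# Invented toy texts, used only to exercise the functions.
texts = [
    Text("Author A", 1900, 1935, {"fiction"}, ["дом", "стоять", "дом"]),
    Text("Author B", 1950, 1990, {"journalism"}, ["дом", "город"]),
]

# Subcorpus: fiction created before 1950.
early_fiction = define_subcorpus(
    texts, lambda t: "fiction" in t.genres and t.created_year < 1950
)

hits_in_subcorpus = count_lemma(early_fiction, "дом")  # 2
hits_in_corpus = count_lemma(texts, "дом")             # 3
```

Filtering on metadata first and searching second keeps the two steps independent, so the same query can be rerun against differently defined subsets.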