Word lists by frequency

Word lists by frequency are lists of a language's words grouped by frequency of occurrence within a given text corpus, either by levels or as a ranked list, and are used for vocabulary acquisition. A word list by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a checklist to ensure that common words are not left out. The major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, and gigantic analyses were still being done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles has accelerated research in the field.
In computational linguistics, a frequency list is a sorted list of words together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

Type	Occurrences	Rank
the	3789654	1st
he	2098762	2nd

king	57897	1,356th
boy	56975	1,357th

stringyfy	5	34,589th

transducionalify	1	123,567th
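
The definition above (occurrence counts sorted in descending order, with the rank read off as the position in the list) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `frequency_list`, the naive regular-expression tokeniser, and the sample sentence are assumptions made for the example, not part of any published frequency list.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (type, occurrences, rank) rows sorted by descending count."""
    # Naive tokenisation (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by count descending, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # The rank is simply the 1-based position in the sorted list.
    return [(word, n, rank) for rank, (word, n) in enumerate(ordered, start=1)]


sample = "the king and the boy saw the king"
for row in frequency_list(sample):
    print(row)
```

Real corpora need a more careful definition of the word unit than this tokeniser provides.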

Methodology

Factors

Nation noted the considerable help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:

corpus representativeness
word frequency and range
treatment of word families
treatment of idioms and fixed expressions
range of information
various other criteria

Corpora

Traditional written corpus

Most currently available studies are based on written text corpora, which are more readily available and easier to process.

SUBTLEX movement

However, it has been proposed to tap into the large number of subtitles available online in order to analyse large amounts of speech. A long critical evaluation of the traditional textual analysis approach supported a move from written corpora toward the analysis of spoken-language corpora, made possible by the open film subtitles available online. This has been followed by a handful of follow-up studies, providing valuable frequency counts for various languages. Indeed, within five years the SUBTLEX movement completed full studies for French, American English, Dutch, Chinese, Spanish, Greek, Vietnamese, Brazilian Portuguese and European Portuguese, Albanian and Polish. SUBTLEX-IT provides raw data only.

Lexical unit

In any case, the basic "word" unit must be defined. For Latin scripts, a word is usually a run of one or more characters delimited by spaces or punctuation. Exceptions arise, however, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus possible, impossible and possibility belong to the same word family, represented by the base word *possib*. For statistical purposes, the occurrences of all these words are summed under the base form *possib*, so that a concept can be ranked as well as its individual forms. Other languages present specific difficulties. Chinese, for example, does not use spaces between words, so a given string of several characters can be interpreted either as a sequence of single-character words or as one multi-character word.
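
Summing the members of a word family under a base form can be sketched as follows. The mapping `WORD_FAMILIES` is a hypothetical, hand-written table covering only the example in the text; a real study would rely on a lemmatiser or a curated family list, and the function name `family_counts` is likewise an assumption.

```python
from collections import Counter

# Hypothetical mapping from word forms to a family label (illustration only).
WORD_FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}


def family_counts(tokens, families=WORD_FAMILIES):
    """Count tokens, summing members of a word family under its base form."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        # Words without a known family are counted under their own form.
        counts[families.get(word, word)] += 1
    return counts


tokens = ["Possible", "impossible", "possibility", "can't", "possible"]
print(family_counts(tokens).most_common())
```

Here "can't" is kept as a single unit only because the tokens are supplied already split; deciding how to split such forms is exactly the definitional choice discussed above.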

Statistics

It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
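German linguists define the Häufigkeitsklasse (frequency class) $N$ of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item:
$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most frequent item}}\right)\right\rfloor$$
where $\lfloor \cdot \rfloor$ is the floor function. The most common item belongs to frequency class 0 and any item that is approximately half as frequent belongs in class 1. For example, the misspelled word outragious, with 76 occurrences against the 3,789,654 occurrences of the most frequent word in the example list above, has a ratio of 76/3789654 and belongs in class 16.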
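
The frequency class can be computed directly. The sketch below assumes the usual definition, class = floor(0.5 − log2(frequency / frequency of the most frequent item)), which agrees with the values given in the text (class 0 for the top item, class 1 for an item half as frequent, class 16 for a ratio of 76/3789654). The function name `frequency_class` is chosen for illustration.

```python
import math


def frequency_class(freq, max_freq):
    """Frequency class, assuming floor(0.5 - log2(freq / max_freq))."""
    return math.floor(0.5 - math.log2(freq / max_freq))


print(frequency_class(3789654, 3789654))  # most frequent item -> 0
print(frequency_class(76, 3789654))       # ratio from the text -> 16
```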
Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
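
As a rough illustration of the idea, the sketch below replaces any token rarer than a threshold with its nearest sufficiently frequent hypernym. Everything in it is assumed for the example: the toy `FREQUENCIES` table, the one-parent `HYPERNYMS` map standing in for a semantic network, the `min_count` threshold, and the function name `compress`. It is not a description of any particular published system.

```python
# Hypothetical toy data: corpus frequencies and a one-parent hypernym map.
FREQUENCIES = {"the": 90000, "animal": 9000, "ran": 7000, "dog": 5000, "borzoi": 3}
HYPERNYMS = {"borzoi": "dog", "dog": "animal"}


def compress(tokens, frequencies, hypernyms, min_count=100):
    """Replace tokens rarer than min_count with a more frequent hypernym."""
    result = []
    for token in tokens:
        seen = set()
        # Climb the hypernym chain until the term is frequent enough,
        # has no parent, or a cycle is detected.
        while (frequencies.get(token, 0) < min_count
               and token in hypernyms
               and token not in seen):
            seen.add(token)
            token = hypernyms[token]
        result.append(token)
    return result


print(compress(["the", "borzoi", "ran"], FREQUENCIES, HYPERNYMS))
```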

Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors. Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion".

Effects of word frequency

Word frequency is known to have various effects. Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often. Lexical access is also positively influenced by high word frequency, a phenomenon called the word frequency effect. The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.

Languages

Below is a review of available resources.

English

Word counting dates back to Hellenistic times. Thorndike & Lorge, assisted by their colleagues, counted 18,000,000 running words to provide the first large-scale frequency list in 1944, before modern computers made such projects far easier.

Traditional lists

These all suffer from their age. In particular, they lack words relating to technology: "blog", for example, which in 2014 ranked #7665 in frequency in the Corpus of Contemporary American English, was first attested in 1999 and does not appear in any of these three lists.
;The Teachers Word Book of 30,000 words
The TWB contains 30,000 lemmas or ~13,000 word families. A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age and subsequent changes in the language have reduced its applicability.
;The General Service List
The GSL contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rates of occurrence of the different meanings and parts of speech of each headword are provided. Various criteria other than frequency and range were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise. This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.
;The American Heritage Word Frequency Book
A corpus of 5 million running words, drawn from written texts used in United States schools. Its value lies in its focus on school teaching materials and in its tagging of each word's frequency by school grade and by subject area.
;The Brown LOB and related corpora
These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists.

French

;Traditional datasets
A review of the traditional datasets has been published.
An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules. It is claimed that 70 grammatical words constitute 50% of communicative sentences, while 3,680 words give about 95 to 98% coverage. A list of 3,000 frequent words is available.
The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet. Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".
More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open license CC-by-sa-4.0.
;Subtlex
Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. A completely new count was made based on online film subtitles.

Spanish

There have been several studies of Spanish word frequency.

Chinese

Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency. American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in Why Johnny Can't Read Chinese. As frequency toolkits, Da and the Taiwanese Ministry of Education provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China's TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, a rich study of Chinese word and character frequencies was recently made.

Other

Most frequently used words in different languages based on Wikipedia or combined corpora.
