Common Voice


Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review the recordings of other users. The transcribed sentences are collected in a voice database released under CC0, a public domain dedication. This allows developers to use the database for voice-to-text applications without restrictions or costs.

Aims

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.

Voice database

The English Common Voice database is the second-largest freely accessible voice database after LibriSpeech. By the time the first data were published on 29 November 2017, more than 20,000 users worldwide had recorded 400,000 validated sentences, with a total length of 500 hours.
In February 2019, the first batch of languages was released for use. It covered 18 languages, including English, French, German and Mandarin Chinese, as well as less widely spoken languages such as Welsh and Kabyle. In total, the release contained almost 1,400 hours of recorded voice data from more than 42,000 contributors.