Speech corpus


A speech corpus is a database of speech audio files and text transcriptions.
In speech technology, speech corpora are used, among other things, to create acoustic models.
In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.
A single such database is called a corpus; the plural of corpus is corpora.
There are two types of speech corpora:
  1. Read speech – which includes:
     * Book excerpts
     * Broadcast news
     * Lists of words
     * Sequences of numbers
  2. Spontaneous speech – which includes:
     * Dialogs – between two or more people
     * Narratives – a person telling a story
     * Map-tasks – one person explains a route on a map to another
     * Appointment-tasks – two people try to find a common meeting time based on individual schedules
A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
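
To make the definition concrete, here is a minimal sketch of how a speech corpus could be represented in code: each entry pairs an audio file with its text transcription and records whether the speech is read or spontaneous. The class name `Utterance`, its fields, the file paths and the sample transcripts are all hypothetical, chosen only for illustration; real corpora define their own formats and metadata.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry (hypothetical layout, not a standard format)."""
    audio_path: str      # illustrative path to the speech audio file
    transcript: str      # text transcription of that audio
    style: str           # "read" or "spontaneous"
    task: str            # e.g. "sequence of numbers", "map-task"
    native: bool = True  # False for non-native (foreign-accented) speech


# A tiny invented corpus; paths and transcripts are placeholders.
corpus = [
    Utterance("audio/0001.wav", "one two three four",
              "read", "sequence of numbers"),
    Utterance("audio/0002.wav", "turn left at the old mill",
              "spontaneous", "map-task"),
    Utterance("audio/0003.wav", "i can meet on tuesday",
              "spontaneous", "appointment-task", native=False),
]


def count_by_style(utterances):
    """Count how many entries are read versus spontaneous speech."""
    return Counter(u.style for u in utterances)


def non_native(utterances):
    """Select the entries spoken by non-native speakers."""
    return [u for u in utterances if not u.native]


print(count_by_style(corpus))
print([u.audio_path for u in non_native(corpus)])
```

Keeping the speech type and the native/non-native flag as plain fields makes it simple to carve out a subset, for example only the spontaneous, non-native entries, before using the data for acoustic modelling or linguistic analysis.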