Speech corpus


A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions in a format that can be used to create acoustic models (which can then be used with a speech recognition engine).
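
As a rough illustration, the pairing of audio files with text transcriptions can be sketched as a simple manifest. The `Utterance` class, the `load_manifest` helper, the file paths, and the tab-separated layout are all assumptions made for this example; they are not a format defined by any particular corpus or recognition engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One corpus entry: an audio file and its text transcription."""
    audio_path: str
    transcript: str


def load_manifest(text: str) -> List[Utterance]:
    """Parse lines of the hypothetical form: audio path, a tab, transcript."""
    utterances = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        audio_path, transcript = line.split("\t", 1)
        utterances.append(Utterance(audio_path, transcript.strip()))
    return utterances


# Hypothetical two-entry manifest, used only to show the pairing.
manifest = "wav/0001.wav\tone two three\nwav/0002.wav\thello world\n"
corpus = load_manifest(manifest)
```

An acoustic-model training tool would typically consume pairs like these, although the exact file layout it expects depends on the tool.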

The word corpus is singular and refers to one such database; its plural is corpora.

There are two types of speech corpora:

*(1) Read Speech - which includes:

:* Book excerpts
:* Broadcast news
:* Lists of words
:* Sequences of numbers

*(2) Spontaneous Speech - which includes:

:* Dialogs - between two or more people (includes meetings)
:* Narratives - a person telling a story
:* Map-tasks - one person explains a route on a map to another
:* Appointment-tasks - two people try to find a common meeting time based on individual schedules

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
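
The categories above can be recorded as metadata on each recording, which makes it possible to select a subset such as read speech or non-native speech. The following is a minimal sketch; `SpeechStyle`, `Recording`, `select`, and the example file names are hypothetical names introduced for illustration and do not come from any existing corpus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SpeechStyle(Enum):
    READ = "read"                # e.g. book excerpts, word lists
    SPONTANEOUS = "spontaneous"  # e.g. dialogs, narratives, map-tasks


@dataclass
class Recording:
    audio_path: str
    style: SpeechStyle
    native_speaker: bool


def select(recordings: List[Recording],
           style: Optional[SpeechStyle] = None,
           native: Optional[bool] = None) -> List[Recording]:
    """Return recordings matching the given style and/or nativeness."""
    result = []
    for rec in recordings:
        if style is not None and rec.style != style:
            continue
        if native is not None and rec.native_speaker != native:
            continue
        result.append(rec)
    return result


# Hypothetical entries, used only to demonstrate the filter.
recordings = [
    Recording("news_01.wav", SpeechStyle.READ, True),
    Recording("dialog_01.wav", SpeechStyle.SPONTANEOUS, True),
    Recording("digits_07.wav", SpeechStyle.READ, False),
]
non_native = select(recordings, native=False)
read_speech = select(recordings, style=SpeechStyle.READ)
```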

External links

* [http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html BAS – Bavarian Archive for Speech Signals]
* [http://buckeyecorpus.osu.edu/ Buckeye Corpus] - The Buckeye Corpus of Conversational Speech
* [http://www.ece.msstate.edu/research/isip/projects/switchboard/ Switchboard] - ISIP's Switchboard database
* [http://www.voxforge.org/ VoxForge - open source speech corpora]


Wikimedia Foundation. 2010.
