Corpora and databases

  • British National Corpus Corpora page
    Information about the BNC, and links to other English corpora sites.
  • UCREL Corpus Holdings
    A large selection of links to corpora of written and spoken languages (chiefly English), from the University of Lancaster.
  • Child Language Data Exchange System (CHILDES)
    The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitised audio and video.
  • UCL Speech Data database
    A page with links to the UCL Speaker Database, SCRIBE, EUROM, and the UCL Dysfluency Database.
  • EUSTACE (Edinburgh University Speech Timing Archive and Corpus of English)
    4,608 sentences of spoken English provided online by Edinburgh's Centre for Speech Technology Research.
  • The IViE (Intonational Variability in English) project
    Homepage for an Oxford University-based project investigating intonational variability in British and Irish English.
  • UCLA Phonetics Lab Language Archive
    For over half a century, the UCLA Phonetics Laboratory has collected recordings of hundreds of languages from around the world, providing source materials for phonetic and phonological research. This website, funded by the US National Science Foundation, aims to make the Lab's materials more easily accessible, serving the interests of scholars, speakers, and language learners everywhere.
  • ToBI corpus
    Dozens of sound files (.wav format, available by anonymous FTP) to accompany the guidelines for the use of Ohio State's ToBI (Tones and Break Indices) intonational labelling system. More information can be found on the ToBI homepage.
  • Fromkin Speech Error Database
    The 2000 version of the database, provided in XML format by the Max Planck Institute for Psycholinguistics, Nijmegen.
  • WebCorp
    The University of Liverpool's WebCorp is a suite of tools which allows access to the World Wide Web as a corpus. It can be used by anyone who has an interest in language and how particular words and phrases are used, especially ones which are too new or too rare to appear in any dictionary or standard corpus.
  • The Rosetta Project
    'A global collaboration of language specialists and native speakers [whose] goal is a meaningful survey and near permanent archive of 1,000 languages. Our intention is to create a unique platform for comparative linguistic research and education as well as a functional linguistic tool that might help in the recovery or revitalization of lost languages in unknown futures.'
  • IDEA (International Dialects of English Archive)
    Created in 1998 as a resource for actors, this archive comprises recordings of native speakers of English from various parts of the world, and of English spoken in various non-native accents.
  • Current Corpora at CSLU
    Long list of corpora provided by Oregon Health and Science University's Center for Spoken Language Understanding.
  • W3 Corpora
    Web access to linguistic corpora provided by the University of Essex (no longer maintained).
  • Legal Language Corpora: summary
    Information on and links to corpora made up of legal texts.
  • Corpus of late 18th C prose
    c. 300,000 words of north-western English letters on practical subjects (1761-89), collected by the University of Manchester.
  • The English Lexicon Project
    A database containing a variety of lexical characteristics and experimental measurement data for over 40,000 English words.
  • World Atlas of Language Structures Online
    A large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of more than 40 authors (many of them the leading authorities on the subject).
  • Syntactic Structures of the World's Languages
    A searchable database that allows users to discover which properties (morphological, syntactic, and semantic) characterize a language, as well as how these properties relate across languages.