Spoken corpora


In the Linguistic Convergence Laboratory we create spoken corpora: collections of spoken language that give users access to audio recordings of texts as well as their written glosses. Audio access allows researchers to study languages at different levels without having to rely on someone else’s transcription. Users should keep in mind that the search function for these corpora relies on a standardized glossing of the texts, so any study of linguistic features in the spoken corpora requires both working with the text and listening to all the examples used.

The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.

An important aspect of the Laboratory’s spoken corpora is the addition of sociolinguistic metadata about the speakers, including their age, sex, education, place of residence, and command of other languages.

The spoken corpora are developed in cooperation with researchers from other universities and institutes. The Laboratory is open to the development of new resources along the lines of those already released.

Dialect corpora

Khislavichi dialect

Tokens: 260 793

Luzhnikovo dialect

Tokens: 68 666

Lukh and Teza river basins dialects

Tokens: 146 350

Malinino dialect

Tokens: 138 943

Nekhochi dialect

Tokens: 88 965

Opochetsky dialects

Tokens: 68 741

Rogovatka dialect

Tokens: 100 047

Spiridonova Buda dialect

Tokens: 70 565

Ustja River Basin dialects

Tokens: 959 782

Corpora of bilingual Russian

Bashkortostan Russian

Tokens: ND

Beserman Russian

Tokens: 97 216

Chuvash Russian

Tokens: 46 307

Daghestanian Russian

Tokens: 227 885

Karelian Russian

Tokens: 74 014

Yakut Russian

Tokens: 15 139

Romani Russian

Tokens: 41 767

Corpora of minority languages of Russia

Abaza

Tokens: 3 636

Adyghe

Tokens: ND

Bashkir

Tokens: ~25 000

Kabardian

Tokens: ND

Khakas

Tokens: ~58 000

Pushkino-Mikhalevskaja, Velsky District, Arkhangelskaja oblastj
by Michael Daniel

Dictionaries


Dictionaries contain audio and text data from several villages of Daghestan. The wordlists are based on the Jena proposal for a unified comparative lexicon of the languages of Daghestan and include both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.

Mehweb

Entries: 1 132

Rutul

Entries: 738

Tukita

Entries: 1 175

Kina, Rutulsky District, Daghestan
by Timur Maisak

Other projects


In addition to dictionaries and corpora, the Laboratory also develops databases and atlases containing lexical, grammatical and sociolinguistic data from many villages of Daghestan.

DagSwadesh

Swadesh-100 wordlists from the languages of Daghestan

DagLoans

Daghestanian loans database

MultiDag

Atlas of multilingualism in Daghestan

TALD

Typological Atlas of the Languages of Daghestan

Karata area, Akhvakhsky District, Daghestan
by Timofey Mukhin