Spoken corpora

At the Linguistic Convergence Laboratory we create spoken corpora, which give users access to the audio recordings of texts as well as their transcriptions. Access to the audio allows researchers to study languages at different levels without relying on the annotator’s transcription alone. Note that the search function in these corpora is based on a standardized glossing of the texts, so a study of linguistic features should not rely on the transcription alone: it also requires listening to every example used.

The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.

An important aspect of the Laboratory’s spoken corpora is the availability of sociolinguistic metadata about the speakers, including their age, gender, education, place of residence, and command of other languages.

The spoken corpora are developed in cooperation with researchers from other universities and institutions. The Laboratory is open to the development of new resources.

Don river dialects

Tokens: 69 098

Ilmen Lake district dialects

Tokens: 134 207

Keba dialect

Tokens: 54 535

Khislavichi dialect

Tokens: 260 793

Lukh and Teza river basins dialects

Tokens: 146 350

Luzhnikovo dialect

Tokens: 68 666

Malinino dialect

Tokens: 138 943

Manturovo dialect

Tokens: 113 837

Middle Northern Dvina dialects

Tokens: 68 010

Middle Pinega dialects

Tokens: 43 270

Middle Pyoza dialect

Tokens: 79 566

Mikhaylov Corpus

Tokens: 47 579

Nekhochi dialect

Tokens: 88 965

Opochetsky dialects

Tokens: 68 741

Popovka Corpus

Tokens: 36 617

Rogovatka dialect

Tokens: 100 047

Shetnevo and Makeevo dialect

Tokens: 95 335

Spiridonova Buda dialect

Tokens: 70 565

Svishni and Trostnoe dialects

Tokens: 24 414

Tserkovnoe dialect

Tokens: 39 469

Upper Pinega and Vyya river basins dialect

Tokens: 70 803

Ustja River Basin dialects

Tokens: 959 782

Veegora dialect

Tokens: 91 514

Zvenigorod dialect

Tokens: 68 324

Bashkortostan Russian

Tokens: 93 127

Beserman Russian

Tokens: 97 216

Chuvash Russian

Tokens: 46 307

Daghestanian Russian

Tokens: 376 717

Karelian Russian

Tokens: 578 646

Mari Russian

Tokens: 69 109

Romani Russian

Tokens: 41 767

Yakut Russian

Tokens: 15 139


Tokens: 3 636


Tokens: 28 202

Besleney Kabardian

Tokens: 7 955


Tokens: 1 603

Itsari Dargwa

Tokens: 2 535

Kadar Dargwa

Tokens: 6 366


Tokens: 57 633

Meadow Mari

Tokens: 11 647

Muira Dargwa

Tokens: 6 935

Standard Dargwa

Tokens: 6 382 427

Tanti Dargwa

Tokens: 2 683

Tsnal Lezgi

Tokens: 5 113

West Circassian

Tokens: 9 128

Pushkino-Mikhalevskaja, Velsky District, Arkhangelskaja oblastj by Michael Daniel


The dictionaries contain audio and text data from several villages of Daghestan. Their wordlists are based primarily on the Jena proposal for a unified comparative lexicon of the languages of Daghestan, and include both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.

Dargwa varieties

Tokens: 7 917


Tokens: 10 291

Mehweb Dargwa

Tokens: 1 132


Tokens: 738


Tokens: 1 175

Kina, Rutulsky District, Daghestan by Timur Maisak

Other resources

In addition to dictionaries and corpora, the Laboratory also develops databases and atlases containing lexical, grammatical and sociolinguistic data from many villages of Daghestan.

Atlas of Multilingualism in Dagestan

Atlas of Rutul dialects


Daghestanian loans database



Prosody of Russian Dialects

The Andic Dictionaries Examples Database

Typological Atlas of the Languages of Daghestan

Karata area, Akhvakhsky District, Daghestan by Timofey Mukhin