Spoken corpora

In the Linguistic Convergence Laboratory we create spoken corpora: collections of spoken language that give users access to audio recordings of texts as well as their written glosses. Audio access allows researchers to study languages at different levels without having to rely on another’s transcription. Users should keep in mind that the search function for these corpora relies on a standardized glossing of the texts, so any study of linguistic features in the spoken corpora requires both working with the text and listening to all the examples used.

The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.

An important aspect of the Laboratory’s spoken corpora is the addition of sociolinguistic metadata about the speakers, including their age, sex, education, place of residence, and command of other languages.

The spoken corpora are developed in cooperation with researchers from other universities and institutes. The Laboratory is open to the development of new resources along the lines of those already released.

Dialect corpora

Keba dialect

Tokens: 54 535

Khislavichi dialect

Tokens: 260 793

Lukh and Teza river basins dialects

Tokens: 146 350

Luzhnikovo dialect

Tokens: 68 666

Malinino dialect

Tokens: 138 943

Middle Pyoza dialect

Tokens: 79 566

Nekhochi dialect

Tokens: 88 965

Opochetsky dialects

Tokens: 68 741

Rogovatka dialect

Tokens: 100 047

Shetnevo and Makeevo dialect

Tokens: 58 003

Spiridonova Buda dialect

Tokens: 70 565

Tserkovnoe dialect

Tokens: 19 960

Upper Pinega and Vyya river basins dialect

Tokens: 70 803

Ustja River Basin dialects

Tokens: 959 782

Zvenigorod dialect

Tokens: 68 324

Manturovo dialect

Tokens: 113 837

Don river dialects

Tokens: 50 375

Middle Northern Dvina dialects

Tokens: 68 010

Corpora of bilingual Russian

Bashkortostan Russian

Tokens: 93 127

Beserman Russian

Tokens: 97 216

Chuvash Russian

Tokens: 46 307

Daghestanian Russian

Tokens: 376 717

Karelian Russian

Tokens: 578 646

Romani Russian

Tokens: 41 767

Yakut Russian

Tokens: 15 139

Mari Russian

Tokens: 69 109

Corpora of minority languages of Russia


Tokens: 3 636


Tokens: 9 128


Tokens: ~25 000


Tokens: 7 955


Tokens: ~58 000

Meadow Mari

Tokens: 11 647

Standard Dargwa

Tokens: 35 569

Muira Dargwa

Tokens: 7 470

Kadar Dargwa

Tokens: 12 654

Tsnal Lezgi

Tokens: 5 113

Pushkino-Mikhalevskaja, Velsky District, Arkhangelskaja oblastj
by Michael Daniel


Dictionaries contain audio and text data from several villages of Daghestan. The wordlists are based primarily on the Jena proposal for a unified comparative lexicon of the languages of Daghestan and are intended to cover both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.


Entries: 1 132


Entries: 738


Entries: 1 175

Dargwa varieties

Entries: 7 917

Kina, Rutulsky District, Daghestan
by Timur Maisak

Other projects

In addition to dictionaries and corpora, the Laboratory also develops databases and atlases containing lexical, grammatical and sociolinguistic data from many villages of Daghestan.


Swadesh-100 wordlists from the languages of Daghestan


Daghestanian loans database


Atlas of multilingualism in Daghestan


Typological Atlas of the languages of Daghestan

Karata area, Akhvakhsky District, Daghestan
by Timofey Mukhin