Spoken corpora

In the Linguistic Convergence Laboratory we create spoken corpora: collections of spoken language that give users access to audio recordings of texts as well as their written glosses. Audio access allows researchers to study languages at different levels without having to rely on another’s transcription. Users should understand that the search function for these corpora is made possible by a standardized glossing of texts, so any study of linguistic features in the spoken corpora requires both working with the text and listening to all the examples used.

The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.

An important aspect of the Laboratory’s spoken corpora is the sociolinguistic metadata about the speakers, including their age, sex, education, place of residence, and command of other languages.

The spoken corpora are developed in cooperation with researchers from other universities and institutes. The Laboratory is open to the development of new resources along the lines of those already released.

Dialect corpora

Khislavichi dialect

Tokens: 260 793

Luzhnikovo dialect

Tokens: 68 666

Lukh and Teza river basins dialects

Tokens: 146 350

Malinino dialect

Tokens: 138 943

Nekhochi dialect

Tokens: 88 965

Opochetsky dialects

Tokens: 68 741

Rogovatka dialect

Tokens: 100 047

Spiridonova Buda dialect

Tokens: 70 565

Ustja River Basin dialects

Tokens: 959 782

Zvenigorod dialect

Tokens: 68 324

Corpora of bilingual Russian

Bashkortostan Russian

Tokens: ND

Beserman Russian

Tokens: 97 216

Chuvash Russian

Tokens: 46 307

Daghestanian Russian

Tokens: 227 885

Karelian Russian

Tokens: 74 014

Yakut Russian

Tokens: 15 139

Romani Russian

Tokens: 41 767

Corpora of minority languages of Russia


Tokens: 3 636


Tokens: ND


Tokens: ~25 000


Tokens: ND


Tokens: ~58 000

Pushkino-Mikhalevskaja, Velsky District, Arkhangelskaja oblastj
by Michael Daniel


Dictionaries contain audio and text data from several villages of Daghestan. The wordlists are based primarily on the Jena proposal for a unified comparative lexicon of the languages of Daghestan and include both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.


Entries: 1 132


Entries: 738


Entries: 1 175

Kina, Rutulsky District, Daghestan
by Timur Maisak

Other projects

In addition to dictionaries and corpora, the Laboratory also develops databases and atlases containing lexical, grammatical and sociolinguistic data from many villages of Daghestan.


Swadesh-100 wordlists from the languages of Daghestan


Daghestanian loans database


Atlas of multilingualism in Daghestan


Typological Atlas of the languages of Daghestan

Karata area, Akhvakhsky District, Daghestan
by Timofey Mukhin