Spoken corpora
At the Linguistic Convergence Laboratory we create spoken corpora, which give users access to audio recordings of texts as well as their transcriptions. Audio access allows researchers to study languages at different levels without having to rely on the annotator’s transcription alone. Keep in mind that the search function in these corpora is based on a standardized glossing of the texts, so a study of linguistic features cannot rely on the search results alone; it also requires listening to every example used.
The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.
An important aspect of the Laboratory’s spoken corpora is the availability of sociolinguistic metadata about the speakers, including their age, gender, education, place of residence, and command of other languages.
The spoken corpora are developed in cooperation with researchers from other universities and institutions. The Laboratory is open to the development of new resources.
Don river dialects
Tokens: 69 098
Ilmen Lake district dialects
Tokens: 134 207
Keba dialect
Tokens: 54 535
Khislavichi dialect
Tokens: 260 793
Lukh and Teza river basins dialects
Tokens: 146 350
Luzhnikovo dialect
Tokens: 68 666
Malinino dialect
Tokens: 138 943
Manturovo dialect
Tokens: 113 837
Middle Northern Dvina dialects
Tokens: 68 010
Middle Pinega dialects
Tokens: 43 270
Middle Pyoza dialect
Tokens: 79 566
Mikhaylov Corpus
Tokens: 47 579
Nekhochi dialect
Tokens: 88 965
Opochetsky dialects
Tokens: 68 741
Popovka dialect
Tokens: 36 617
Rogovatka dialect
Tokens: 100 047
Shetnevo and Makeevo dialect
Tokens: 95 335
Spiridonova Buda dialect
Tokens: 70 565
Svishni and Trostnoe dialects
Tokens: 24 414
Tserkovnoe dialect
Tokens: 39 469
Upper Pinega and Vyya river basins dialect
Tokens: 70 803
Ustja River Basin dialects
Tokens: 959 782
Veegora dialect
Tokens: 91 514
Zvenigorod dialect
Tokens: 68 324
Bashkortostan Russian
Tokens: 93 127
Beserman Russian
Tokens: 97 216
Chuvash Russian
Tokens: 46 307
Daghestanian Russian
Tokens: 376 717
Karelian Russian
Tokens: 578 646
Khanty Russian
Tokens: 40 225
Mari Russian
Tokens: 69 109
Romani Russian
Tokens: 41 767
Yakut Russian
Tokens: 15 139
Abaza
Tokens: 3 636
Bashkir
Tokens: 28 202
Besleney Kabardian
Tokens: 7 955
Botlikh
Tokens: 1 603
Itsari Dargwa
Tokens: 2 535
Kadar Dargwa
Tokens: 6 366
Khakas
Tokens: 57 633
Meadow Mari
Tokens: 11 647
Muira Dargwa
Tokens: 6 935
Standard Dargwa
Tokens: 6 382 427
Tanti Dargwa
Tokens: 2 683
Tsnal Lezgi
Tokens: 5 113
West Circassian
Tokens: 9 128
Dictionaries
The dictionaries contain audio and text data from several villages of Daghestan. Their wordlists are primarily based on the Jena proposal for a unified comparative lexicon of the languages of Daghestan, and include both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.
Dargwa varieties
Tokens: 7 917
Khwarshi
Tokens: 10 291
Kina Rutul
Tokens: 738
Mehweb Dargwa
Tokens: 1 132
Tukita
Tokens: 1 175
Zilo Andi
Tokens: 738
Other resources
In addition to dictionaries and corpora, the Laboratory develops databases and atlases containing lexical, grammatical, and sociolinguistic data from many villages of Daghestan.