In the Linguistic Convergence Laboratory we create spoken corpora: collections of spoken language that give users access to audio recordings of texts as well as the written gloss. Audio access allows researchers to study languages at different levels without having to rely on another’s transcription. Users should keep in mind that the search function for these corpora relies on a standardized glossing of the texts, so any study of linguistic features in the spoken corpora requires both working with the text and listening to every example used.
The Laboratory develops corpora of dialectal, regional, and bilingual speech varieties, predominantly those spoken in rural areas.
An important aspect of the Laboratory’s spoken corpora is the addition of sociolinguistic metadata about the speakers, including their age, sex, education, place of residence, and command of other languages.
The spoken corpora are developed in cooperation with researchers from other universities and institutes. The Laboratory is open to the development of new resources along the lines of those already released.
Keba dialect
Tokens: 54 535
Khislavichi dialect
Tokens: 260 793
Lukh and Teza river basins dialects
Tokens: 146 350
Luzhnikovo dialect
Tokens: 68 666
Malinino dialect
Tokens: 138 943
Middle Pyoza dialect
Tokens: 79 566
Nekhochi dialect
Tokens: 88 965
Opochetsky dialects
Tokens: 68 741
Rogovatka dialect
Tokens: 100 047
Shetnevo and Makeevo dialect
Tokens: 58 003
Spiridonova Buda dialect
Tokens: 70 565
Tserkovnoe dialect
Tokens: 19 960
Upper Pinega and Vyya river basins dialect
Tokens: 70 803
Ustja River Basin dialects
Tokens: 959 782
Zvenigorod dialect
Tokens: 68 324
Manturovo dialect
Tokens: 113 837
Don river dialects
Tokens: 50 375
Middle Northern Dvina dialects
Tokens: 68 010
Bashkortostan Russian
Tokens: 93 127
Beserman Russian
Tokens: 97 216
Chuvash Russian
Tokens: 46 307
Daghestanian Russian
Tokens: 376 717
Karelian Russian
Tokens: 578 646
Romani Russian
Tokens: 41 767
Yakut Russian
Tokens: 15 139
Mari Russian
Tokens: 69 109
Abaza
Tokens: 3 636
Adyghe
Tokens: 9 128
Bashkir
Tokens: ~25 000
Kabardian
Tokens: 7 955
Khakas
Tokens: ~58 000
Meadow Mari
Tokens: 11 647
Standard Dargwa
Tokens: 35 569
Muira Dargwa
Tokens: 7 470
Kadar Dargwa
Tokens: 12 654
Tsnal Lezgi
Tokens: 5 113
Dictionaries contain audio and text data from several villages of Daghestan. The wordlists are based on the Jena proposal for a unified comparative lexicon of the languages of Daghestan and include both the Swadesh list and Kibrik and Kodzasov’s thesaurus for Daghestanian languages, together with some additional items.
Mehweb
Entries: 1 132
Rutul
Entries: 738
Tukita
Entries: 1 175
Dargwa varieties
Entries: 7 917
In addition to dictionaries and corpora, the Laboratory also develops databases and atlases containing lexical, grammatical and sociolinguistic data from many villages of Daghestan.
DagSwadesh
Swadesh-100 wordlists from the languages of Daghestan
DagLoans
Daghestanian loans database
MultiDag
Atlas of multilingualism in Daghestan
TALD
Typological Atlas of the languages of Daghestan