
Code-Switching Corpus

Code-switching, the alternation of two languages within a discourse, is a distinctive feature of bilingual language behaviour. In Russia, bilingualism is especially common in the republics where the language of the titular nation is recognized at the state level alongside Russian. Previous studies of code-switching (for example, Poplack 1980) have shown that recordings of spontaneous bilingual speech are an important source of data for studying the phenomenon. Nevertheless, there are practically no bilingual speech corpora in the public domain, which can be explained by the difficulty of creating such a corpus (Gullberg et al. 2009). This corpus is based on audio data gathered in Yakutsk in December 2018, during a conversation between Yakut-Russian bilinguals playing the board game “Monopoly”. It consists of audio files and audio-aligned transcripts annotated for a number of attributes.

The corpus annotation strategy was developed by A. Petukhova in the course of a project implemented at the School of Linguistics, NRU HSE.

The following were annotated for each speaker: utterances (elementary discourse units, EDUs); words in Russian; words in Yakut (together with their translation into Russian); and the syntactic category of the constituents at the code-switching boundary (see “List of syntactic categories”). Utterances made outside the conversation (for example, during a telephone call) and utterances made while reading the game cards or the rules of the game aloud were marked separately. Because the informants were given a Russian-language version of Monopoly, the language of the game was fixed in advance and the speakers could not freely choose the language of play. Utterances related to the game are therefore not assigned to either language. Utterances in Yakut are provided with a translation.

The Yakut-Russian Code-Switching Corpus project is supported by the International Language Convergence Laboratory of the Higher School of Economics. The corpus was created within the framework of the Basic Research Program of the National Research University "Higher School of Economics" (NRU HSE) and with the aid of a grant under the "5-100" programme of state support for the leading universities of the Russian Federation.



This corpus is one of a group that uses the search platform tsakorpus. Instructions with a description of the general technical capabilities of the search function in a corpus of this type can be found in the “Information” section (link with a question mark in the upper right corner of the search page). Below are a few rules specific to this corpus.

The corpus search is divided into three annotation levels: Yakut, intersentential (code-switching between utterances), and intrasentential (code-switching within an utterance). On the Yakut level, you can search for an exact word form within an EDU, for an exact word form in the translation of an EDU, and for syntactic categories (see “List of syntactic categories”). For example, to find all words starting with the letter k that have the category N (noun), enter k* in the “Word” field and N in the “Tag” field.
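The logic of the combined wildcard and tag query described above can be illustrated with a short Python sketch. It is only a conceptual illustration, not the way tsakorpus implements search: the function name `search`, the `TOKENS` list, and the placeholder word forms in it are all assumptions made up for this example and are not corpus data.

```python
from fnmatch import fnmatchcase

# Hypothetical (word form, tag) pairs; placeholder strings, NOT corpus data.
TOKENS = [
    ("kaba", "N"),
    ("kede", "V"),
    ("taba", "N"),
    ("kofo", "N"),
]

def search(tokens, word_pattern="*", tag=None):
    """Keep tokens whose word form matches the wildcard pattern
    and, if a tag is given, whose tag is exactly that tag."""
    return [
        (word, token_tag)
        for word, token_tag in tokens
        if fnmatchcase(word, word_pattern) and (tag is None or token_tag == tag)
    ]

# Counterpart of entering k* in the "Word" field and N in the "Tag" field.
hits = search(TOKENS, word_pattern="k*", tag="N")
print(hits)
```

Both conditions must hold at once, so a word that starts with k but carries a different tag is not returned.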

When searching on the intersentential and intrasentential levels, only the first search field, “Word”, is used. It accepts three sentential tags: y (Yakut), r (Russian), and e (English). For example, to find all sentences with code-switching to Yakut, select the Intersentential layer and enter y in the “Word” field.

The search results are arranged so that for each sentence you can see which EDUs it contains, how they are translated (if they are in Yakut), what tag they have, and the language from which the switch was made (if a switch occurred).
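To make the idea of a switch "from" one language "to" another concrete, here is a minimal Python sketch that finds switch points in a sequence of utterance-level language tags. The tag sequence is invented for illustration, and the function `find_switches` is hypothetical; it does not describe how the corpus annotation or the search platform is actually implemented.

```python
def find_switches(tags):
    """Return (index, from_lang, to_lang) for every position
    where the language tag differs from the previous one."""
    switches = []
    for i in range(1, len(tags)):
        if tags[i] != tags[i - 1]:
            switches.append((i, tags[i - 1], tags[i]))
    return switches

# Hypothetical sequence of utterance tags (y = Yakut, r = Russian).
tags = ["r", "r", "y", "y", "r"]

switches = find_switches(tags)

# Switches into Yakut only, analogous to searching for y
# on the intersentential level.
to_yakut = [s for s in switches if s[2] == "y"]
print(switches)
print(to_yakut)
```

Each result records both the source and the target language, which mirrors the information shown for a switch in the search results.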

Each sentence is aligned with the audio. To listen to a sentence, click on the top level of the annotation.

List of syntactic categories

ADJP – adjective phrase
ADVP – adverb phrase
CONJ – conjunctive clause
COORD – coordinate clause
IC – independent clause
MODP – modal particle
NP – noun phrase
PP – prepositional phrase
QP – question particle
STOP – utterance interruption
VP – verb phrase
DE – discourse entity
REP – utterance repetition
The language to which the code is switched
Other notation
= – utterance interruption
game term – game term
{out of the conversation phrase} – utterance made outside of the conversation (for example, during a telephone conversation)

Corpus Composition

The corpus consists of 10 texts with a total of more than 15,000 words. The total duration of the audio data is 146 minutes.

The participants in the conversation are four bilingual native speakers of Yakut born in 1999–2000 (three men and one woman). A similar language background was the main criterion for selecting the participants: Yakut as a first language, a good command of Russian, and residence in the same city. The conversation was recorded during a game of “Monopoly” (an economic strategy board game in which players use their initial capital to drive the other players into bankruptcy) and mostly revolved around the course of the game, although the speakers also switched to personal topics.

Project participants

Anna Petukhova – recording, transcription, annotation scheme development, text annotation.

Elena Sokur – technical solution.


You may contact us with questions about the Corpus:
Anna Petukhova: annap2.71828tukhova@gmail.com

Or with questions about the search platform:
Elena Sokur: elena.o.sokur@gmail.com

How to cite the corpus

If you use data from the Yakut-Russian Corpus of Code-Switching in your research, please cite it as follows:

A. A. Petukhova, E. O. Sokur. 2021. Yakut-Russian Corpus of Code-Switching, Moscow: International Laboratory of Language Convergence, Higher School of Economics. (Available online at: http://lingconlab.ru/cs_yakut, accessed on .)


Gullberg et al. 2009 – M. Gullberg, P. Indefrey, P. Muysken. Research techniques for the study of code-switching // Cambridge University Press. 2009. P. 21-39.

Poplack 1980 – S. Poplack. Sometimes I'll start a sentence in Spanish Y TERMINO EN ESPAÑOL: toward a typology of code-switching // Linguistics. 1980. Vol. 18. P. 581-618.