The Spoken corpus of the dialects of Khakas contains transcribed and annotated texts synchronized with sound. The texts were recorded in the 21st century from speakers born in 1916-1985, during several expeditions from Moscow to the Republic of Khakassia. All texts are translated into Russian. The texts were analyzed with an automatic parser, then edited and synchronized with the sound using the ELAN software.
This corpus is related to the project «Electronic Corpus of Khakas language». Follow the link to see more information about the aims and methods of the project (Russian only, for now).
We use the symbols of the Cyrillic Khakas alphabet to transcribe our texts because the parser was designed to analyze Literary Khakas. Regular phonetic dialect features are mostly ignored, but we tried to show the morphological and morphonological features.
In the morpheme segmentation line, stems are given in phonological notation (transcription) and affixes in morphonological notation. Each affix is represented by a single form that unites its allomorphs with regular alternations. For example, the word form туралар ‘houses’ appears in the morpheme segmentation line as тура‑ЛАр. The plural marker has the allomorphs ‑лар, ‑лер, ‑нар, ‑нер, ‑тар, ‑тер, and in the glossing line they all correspond to the single morpheme ‑ЛАр. Morphonemes without alternations are written with the same symbols as phonemes.
See more on the subject in the paper: Anna V. Dybo, Philip S. Krylov, Vera S. Maltseva, Aleksandra V. Sheimovich. Segmental rules in the automatic parser for the Khakass corpus. In: Ural-Altaic Studies. N 1 (32), 2019. P. 48-69 (in Russian) https://iling-ran.ru/library/ural-altaic/ua2019_32.pdf
Consonant morphonemes | Vowel morphonemes |
П: б/п/м | А: е/а |
К: ғ/г/х/к | Ы: i/ы |
Г: ғ/г/х/к/Ø | О: о/ö |
Т: т/д | |
Д: т/д/н | |
С: с/з | |
Л: л/н/т | |
L: л/н | |
Н: н/т | |
Ч: ч/ҷ |
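As a rough illustration of how a morphonological affix notation relates to its surface allomorphs, the table can be read as a mapping from each morphoneme to its possible realizations. The Python sketch below is a toy under stated assumptions, not the corpus parser: the names `MORPHONEMES` and `expand` are hypothetical, Ø is modeled as an empty string, the vowel values are typed with Cyrillic і and ӧ (an assumption about the intended characters), and the function simply enumerates every combination without applying vowel harmony or consonant assimilation, so it over-generates for most affixes.

```python
from itertools import product

# Morphoneme -> possible surface realizations, copied from the table above.
# Assumption: "Ø" (zero) is represented as an empty string.
MORPHONEMES = {
    "П": ["б", "п", "м"],
    "К": ["ғ", "г", "х", "к"],
    "Г": ["ғ", "г", "х", "к", ""],
    "Т": ["т", "д"],
    "Д": ["т", "д", "н"],
    "С": ["с", "з"],
    "Л": ["л", "н", "т"],
    "L": ["л", "н"],          # Latin L, as in the table
    "Н": ["н", "т"],
    "Ч": ["ч", "ҷ"],
    "А": ["е", "а"],
    "Ы": ["і", "ы"],
    "О": ["о", "ӧ"],
}


def expand(affix):
    """Return all candidate surface forms of a morphonological affix.

    Symbols that are not in MORPHONEMES (e.g. lowercase letters) are kept
    as they are. No phonological context is checked, so the result is a
    superset of the real allomorphs.
    """
    choices = [MORPHONEMES.get(symbol, [symbol]) for symbol in affix]
    return sorted({"".join(combo) for combo in product(*choices)})


if __name__ == "__main__":
    print(expand("ЛАр"))
```

For the plural marker ЛАр, `expand("ЛАр")` produces exactly the six allomorphs named in the description of the segmentation line: лар, лер, нар, нер, тар, тер. For affixes with more alternating symbols the list will contain forms that never occur, because the choice of allomorph depends on the stem.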
This corpus belongs to a group of corpora built on the search platform tsakorpus. A more general instruction covering the common technical properties of these corpora can be found in the “Help” section (look for the button marked with a question mark in the top right corner of the search page). The current text describes the rules and conventions that are specific to this corpus.
In this field, you can enter specific word forms that you want to find.
For example, ибде ‘at home’, килген ‘he came’.
This field should be used if you need to find all forms of a given word (lexeme or lemma).
For example, if you enter “иб” ‘home’, the search results will show all sentences where any form of this noun is used, e.g., иб ‘home’, ибні [home-ACC] ‘home (direct object)’, ибінде [home-3pos-Loc] ‘at his home’, etc.
Lemmas should be entered in this field in their base form, that is, in the form used in dictionaries. For nouns, adjectives, adverbs, pronouns and numerals, the base form is the same as the stem (e.g., иб ‘house’, кічіг ‘small’, ам ‘now’, син ‘you’, ікі ‘two’). For verbs, in accordance with lexicographic practice, the infinitive form with the suffixes ‑Ар‑ГА is used, e.g., тоғынарға ‘to work’, килерге ‘to come’, etc.
For the case forms (including locatives) of the pronouns ол ‘that’ and пу ‘this’, the lemmas ол and пу are used, although dictionaries list their case forms as separate lexemes. The only exceptions are the substantivized forms with the 3rd person possessive marker, ан(ы)зы, мын(ы)зы, пунызы, which we consider to be separate lexemes. All forms of a personal pronoun (with the same person and number) also belong to one lemma: мин ‘I’, син ‘you’, олар ‘they’, etc.
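The difference between searching by word form and searching by lemma can be sketched with a few tokens from the examples above. This is only a toy model in Python: the `Token` class, the `TOKENS` list and both search functions are hypothetical names, and the lemma assignments are assumptions made for illustration rather than actual corpus annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    wordform: str  # the form as it appears in the text
    lemma: str     # the dictionary (base) form


# Toy data built from the examples in this section (assumed annotation).
TOKENS = [
    Token("иб", "иб"),
    Token("ибні", "иб"),
    Token("ибінде", "иб"),
    Token("ибде", "иб"),
    Token("килген", "килерге"),  # verbs are lemmatized as infinitives
]


def search_word(tokens, wordform):
    """Return only the tokens whose surface form equals the query."""
    return [t for t in tokens if t.wordform == wordform]


def search_lemma(tokens, lemma):
    """Return every token that belongs to the given lemma."""
    return [t for t in tokens if t.lemma == lemma]


if __name__ == "__main__":
    print([t.wordform for t in search_word(TOKENS, "ибде")])
    print([t.wordform for t in search_lemma(TOKENS, "иб")])
```

A word-form query for ибде returns only that form, while a lemma query for иб returns all four noun forms; a verb such as килген is reached through its infinitive lemma килерге.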
This field can be used to build search queries based on part-of-speech tags and grammatical categories. To use it, press the button immediately to the right of the “Grammar” search field; a pop-up window will appear where you can choose from the available grammatical tags. To select a marker, left-click it, and it will be highlighted. To cancel the selection, click the marker again, and the highlighting will disappear.
The part-of-speech markers used in our corpus are explained in the following table.
a verb (including participle and converb), takes all the inflective markers.
a nominal (noun, adjective, pronoun, numeral, postposition), doesn’t take negation, tense, aspect and mood markers.
Some nominals don’t take case markers but do take personal markers (e.g., осхас ‘resembling’). We plan to group such lexemes into a separate part of speech after further corpus research.
i1 – an invariable which can combine with endoclitics (including the particle -ох/-ӧх/-ӧк, which absorbs the last vowel of the stem). For example, піди ‘so’. This category unites most adverbs, including the grammaticalized forms of converbs.
i – an invariable which can’t combine with endoclitics (particle, conjunction, interjection)
In both the “Gloss” field and the “Grammar” field you can only search for forms with non-zero markers. The only exception is the imperative singular form, which is the bare form of the verb. It can be found by selecting the options “imp” and “2sg” in the “Grammar” field.
Markers with the number 1 or 2 (excluding “dur1”) are located closer to the stem than the same markers without numbers, and are used mainly as word-formation markers. (Markers with the number 1 are closer to the stem than markers with the number 2.)
Pl | ЛАр | non-predicative plural | иблер ‘homes’, парғаннарына ‘for those who went’ |
PredPl | ЛАр | predicative plural | парғаннар ‘they went’ |
Gen1 | НЫң, ДЫң | genitive | пістіңнер ‘ours’ |
Loc1 | ТА | locative | аалдағылар ‘those who are (living) in village’ |
“All1” and “Abl1” are very rare; you can only find them in some grammaticalized forms combined with other cases.
Synchronically, the combination of “Gen1” and “3pos” is a cumulative marker ни (dialectal variant Ди), so we separate them with a dot rather than a hyphen. Example: сілерни / сілерди ‘yours’.
All cases have allomorphs which are used with the possessive singular markers.
Most cases have dialectal variants. The ablative and the instrumental cases use one morpheme in some dialects.
Acc | НЫ, ДЫ, н | accusative | суғны / суғды ‘(drink) water’, суғын ‘(drink) his water’ |
Gen | НЫң, ДЫң, нЫң | genitive | азахтың ‘of leg’, азағының ‘of his leg’ |
Dat | ГА, (н)А | dative | ирге ‘to a man’, иріме ‘to my husband’ |
Loc | ТА, (н)ТА | locative | ибде ‘in the house’, ибінде ‘in his house’ |
All | САр, СА, САрЫ, нСАр, (н)СА, (н)САрЫ | allative | ибзер/ ибзері / ибзе ‘towards a house’, ибінзер /ибінзері / ибінзе ‘towards his house’ |
Abl | ДАң, нАң | ablative | аалнаң / аалдаң ‘from a village’, аалынаң ‘from his village’ |
Instr | ДАң, НАң, нАң, БАң, (н)БАң, мАң, (н)мАң | instrumental | малтынаң / малтыдаң / малтыбаң ‘by an axe’, абамнаң / абаммаң ‘with my dad’ |
Prol | ЧА, (н)ЧА | prolative (equative) | чолӌа ‘on a road’, соонӌа ‘following him’ |
Delib | нАңАр, ДАңАр(Ы) | deliberative | аннаңар ‘because’, кибірлердеңері ‘about the traditions’ |
1pos.sg | (Ы)м | 1st person singular possession (‘I’) | хызым ‘my daughter’ |
1pos.pl | (Ы)ПЫс | 1st person plural possession (‘we’) | хызыбыс ‘our daughter’ |
2pos.sg | (Ы)ң | 2nd person singular possession (‘you’) | |
2pos.pl | (Ы)ңар | 2nd person plural possession (‘you’) | іӌеңер ‘your mother’ |
3pos | (з)Ы | 3rd person possession (‘he’, ‘she’, ‘it’, ‘they’) | аал пазы ‘village’s beginning’ |
3pos1 | (з)Ы | 3rd person possession (inner position) | аал пазындағылар ‘those who are (living) in the beginning of the village’ |
Perf | (Ы)бЫс | perfective | парыбысхан ‘he’s gone’ |
Perf0 | (Ы)с | perfective near the particle | чоохтаныпласчам ‘I speak almost every time’ |
Prosp.dial | АК, иК | prospective | парахча ‘is going to go’ |
Dur | чАт | durative | полчатсын ‘let it be’ |
Dur1 | А(р), и(р), ит | durative / present for the verbs парарға ‘go’, килерге ‘come’ | кили ‘comes now’ |
Iter | АдЫр, идЫр | iterative / present | тідирлер ‘they say’ |
RPast | ТЫ | recent past | килді ‘came (not long ago)’ |
Pres | чА | present | узупча ‘he sleeps’ |
Indir | ТЫр | evidential (indirective) | партыр ‘he went (they say)’ |
Evid | осхас | evidential (analytical form) | тіпчен осхас ‘he says (the speaker didn’t hear it himself)’ |
Affirm | ЧЫК | affirmative, subjunctive and other meanings | парарӌых ‘would come (if smth happened)’ |
Imp | | imperative; takes a special set of personal markers | ат ‘shoot’, парим ‘should I go’ |
Cond | СА | conditional | чатса ‘if it lies’ |
Opt | ГАй | optative | халғай ‘let it be left’ |
Simul | (А)АчЫК | simulative, converts the verb to a nominal | талаачых ‘simulating fainting’ |
We do not distinguish participle and finite forms with the same morphemes.
Past | ГАн | past tense | одырған ‘sat’ |
PresPt | чАн | present participle | хомай чуртапчан кізілер ‘badly living people’ |
PresPt1 | ин | present participle with the verbs пар ‘go’ and кил ‘come’ | сӱр парин остар ‘drive (as now)’ |
Fut | А(р), и(р) | future | килер ‘will come’ |
Neg.Fut | ПАс | negative future | килбес ‘will not come’ |
Hab | ЧА(ң) | habitual (past as finite form and present as non-finite form) | тоғынӌаң ‘worked (usually)’ |
Assum | ГАдАГ | assumptive («it seems that…») | хайтпаадағ ‘won’t happen (normally)’ |
Cunc | ГАлАК | cunctative («not yet…») | пысхалах ‘is not yet ripe’ |
ConvP | (Ы)п | consecutive converb | алып алып, парыбысхан ‘having bought, went away’ |
ConvA | А, и | simultaneous converb | чара парарға ‘to go separating’ |
Neg.Conv | Пи(н), ПААн | negative form of converb | хурғатпин тартырарға ‘to grind without drying’ |
1sg | (Ы)м, СЫм, ПЫн, им | 1st singular person marker | парам ‘I will go’ |
1pl | ПЫс, иБЫс | 1st plural person marker | парарбыс ‘we’ll go’ |
2sg | (Ы)ң, СЫң | 2nd singular person marker | парғаң ‘you went’ |
2pl | ңар, САр, (Ы)ңАр | 2nd plural person marker | парғазар ‘you (pl) went’ |
3 | Ø, СЫн | 3rd person marker (marked form only with imperative; it’s not possible to distinguish zero marker and the absence of marker in the word automatically) | ползын ‘let it be’ |
1.incl | Аң | inclusive imperative singular («I and you (sg)») | параң ‘let’s (two of us) go!’ |
1pl.incl | АңАр, АлАр | inclusive imperative plural («I and you (pl)») | параңар / паралар ‘let’s (all) go!’ |
Neg | ПА | negation | парба ‘don’t go’ |
Distr | (К)лА | distributive | тастағлаабыс ‘we threw (many things)’ |
NF | Ø / (Ы)п | word-formation marker from ConvP, which is used in some synthetic and analytical forms | пар-Ø-ча ‘goes’, сана-п-ча ‘counts’ |
Compl | тіп | complementizer (separate word) | парғам чаблах одалирға тіп ‘I went to dig potatoes’ |
Word-formation markers are not separated from the stem by a hyphen, so they can only be found by searching in the “Grammar” field.
Attr | КЫ | attributivizer (of locative and temporal forms) | аалдағы ‘situated in village’, пурунғы ‘prior’ |
Adv | Ли | adverbializer | полосали ‘in stripes’ |
Comit | ЛЫГ | comitative («with…», «having…») | тадылығ ‘tasty’, аттығ ‘on a horse, with a horse’ |
Dimin | (Ы)ӌАК | diminutive | хызыӌах ‘(small) girl’ |
Coll | ОлАң, АлАң | collective numeral | ікӧлең / ікелең ‘twosome, two together’ |
Distr | Ар | distributive numeral | пизер ‘by five’ |
Caus | т, тЫр | causative (also used as passive) | итірбе ‘don’t do (with the help of other)’ |
Pass | (Ы)л | passive | салылған ‘(been) put’ |
Refl | (Ы)н | reflexive | чоохтанча ‘says’ |
Rec | (Ы)с | reciprocal | ылғазып ‘crying together’ |
Many endoclitic particles are written as separate words in Khakas orthography, although many of them show regular phonetic alternations, and some of them are used as enclitics.
Q | па, пе, ма, ме, ба, бе | general question particle | парған ма? ‘(he) came?’ |
qpart | чи | question particle | а тігілер чи? ‘and they?’ |
Foc | ТЫр | focus particles | адың кемдір? ‘what’s your name?’ |
Magn | reduplication of the 1st syllable + п | high degree, superlative | тап-тадылығ ‘very tasty’ |
Emph | за, зе, нооза, нізе, and others | emphatic particle | ылғапча нізе ‘cries indeed’ |
Confpart | ізе | confirmative particle | “ізе” тіпче ‘says «yes»’ |
Indef | ТА, тА | indefinite pronoun particle | хайдағ-да / хайдағ-та ‘some’ |
Ass | ОК | associative | парохтар ‘they are (there) too’ |
Cont | LA | continuative | хырарлача ‘reddens all the time’ |
Add | ТАА | additive particle | мин дее ‘even me’ |
Prec | ТАК | precative particle (polite request in some dialects) | пирдек ‘give, please’ |
The “Gloss” field lets you submit search queries that concern the morphemic structure of word forms. In general, this type of search is functionally similar to the search in the “Grammar” field. In particular, the list of markers that can be viewed by clicking the button next to the “Gloss” field largely overlaps with the list given for the “Grammar” field.
The general principle of search by a gloss and the major differences of this type of search from the grammatical search are described in the “Help” section (the button with a question mark in the top right corner of the search window). The key features of the gloss-based search that are specific for this corpus are given below.
All the dialectal markers have the .dial label, both the morpheme variants (Acc.dial) and the markers which are not used in literary Khakas (Prosp.dial).
Gloss-based search does not include the word forms where there is no morphemic border between the marker in question and the stem. For instance, the dative case form of the pronoun син ‘you’ is сегее / сағаа / сее, which is not segmentable into morphemes and glossed “you.DAT”. This form will be among the hits of the grammatical query “dat”, but will not be included in the occurrences corresponding to the gloss-based search “STEM-DAT”.
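The contrast between the two kinds of query can be sketched in Python as follows. This is a simplified model, not the tsakorpus implementation: the `Analysis` class, the two search functions and the toy annotations are hypothetical, and the tags and glosses are compared case-insensitively by assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    wordform: str
    gloss: str        # gloss line, with morphemes separated by hyphens
    tags: frozenset   # grammatical tags assigned to the whole word


# Toy annotations (assumed): a segmentable dative and a fused pronoun form.
ANALYSES = [
    Analysis("ирге", "STEM-Dat", frozenset({"dat"})),
    Analysis("сағаа", "you.DAT", frozenset({"dat"})),
]


def grammar_search(items, tag):
    """Find words carrying the tag, regardless of how they are segmented."""
    return [a.wordform for a in items if tag.lower() in a.tags]


def gloss_search(items, gloss):
    """Find words where the gloss is a separate, hyphen-delimited morpheme."""
    wanted = gloss.lower()
    return [
        a.wordform
        for a in items
        if wanted in [slot.lower() for slot in a.gloss.split("-")]
    ]


if __name__ == "__main__":
    print(grammar_search(ANALYSES, "dat"))  # both forms
    print(gloss_search(ANALYSES, "Dat"))    # only the segmentable form
```

In this model the grammatical query returns both ирге and сағаа, while the gloss query returns only ирге, because the fused gloss “you.DAT” contains no separate dative slot.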
The gloss-based query can be constructed using either specific glosses or the options CASE, CASE1, POSS, PRTCP, CONV, PERSON. These labels specify a group of morphemes rather than a specific gloss. CASE stands for any case marker, CASE1 stands for any case marker in inner position, POSS stands for any possessive marker, PRTCP stands for any participle marker, CONV stands for any converb marker, and PERSON stands for any person marker.
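One way to picture how group labels work is to treat each label as a set of concrete glosses and compare a query with a gloss line slot by slot. The Python sketch below rests on several assumptions: the membership of each group is inferred from the marker tables in this text and may differ from the real configuration, only three of the six groups are filled in, the function names are hypothetical, and the query must match the whole gloss line, which is stricter than the actual search.

```python
# Group label -> set of lowercase glosses (assumed membership, for illustration).
GROUPS = {
    "CASE": {"acc", "gen", "dat", "loc", "all", "abl", "instr", "prol", "delib"},
    "CASE1": {"gen1", "loc1", "all1", "abl1"},
    "POSS": {"1pos.sg", "1pos.pl", "2pos.sg", "2pos.pl", "3pos"},
}


def slot_matches(query_slot, gloss_slot, position):
    """Check one query slot against one slot of the gloss line."""
    if query_slot == "STEM":
        return position == 0  # the stem is assumed to come first
    if query_slot in GROUPS:
        return gloss_slot.lower() in GROUPS[query_slot]
    return query_slot.lower() == gloss_slot.lower()


def gloss_matches(query, gloss):
    """Return True if every slot of the query matches the gloss line."""
    query_slots = query.split("-")
    gloss_slots = gloss.split("-")
    if len(query_slots) != len(gloss_slots):
        return False
    return all(
        slot_matches(q, g, i)
        for i, (q, g) in enumerate(zip(query_slots, gloss_slots))
    )


if __name__ == "__main__":
    print(gloss_matches("STEM-POSS-CASE", "home-3pos-Loc"))
    print(gloss_matches("STEM-CASE", "home-3pos-Loc"))
```

With these assumptions, the query STEM-POSS-CASE matches the gloss line home-3pos-Loc (as in ибінде ‘at his home’), while STEM-CASE does not, because the possessive slot is unaccounted for.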
The corpus contains:
- 23 texts of Askiz dialect, collected in the Kazanovka village in 2001-2002 during the expedition of the Linguistic Department of the Russian State University for Humanities, headed by Nina Sumbatova. It contains about 13 000 tokens, the duration is 2h 18 min.
- 27 texts of Belty dialect, collected in 2011 in villages Butrachty, Chylany, Karagay by Anna Dybo and Elvira Kyrzhinakova. It contains about 45 000 tokens, the duration is 9 h 22 min.
Texts on other dialects (Kacha, Kyzyl, Shor) will be added.
The processing of the texts for inclusion in the corpus was conducted in 2017. This work was carried out by Vera Maltseva.
The Spoken corpus of the dialects of Khakas is a project supported by the Linguistic Convergence Laboratory of the National Research University Higher School of Economics. It is conducted within the framework of the Basic Research Program at the National Research University 'Higher School of Economics' (HSE) and supported as part of the Russian Academic Excellence Project '5-100'.
This project is a part of a large project on the documentation of the Khakas language http://khakas.altaica.ru. The following people are involved:
Anna Dybo – project supervisor, automatic parser creation
Elvira Sultrekova (Kyrzhinakova) – transcription and translation of most texts (texts collected in Kazanovka village in 2001-2002 were mostly transcribed and translated by the members of the expedition with the help of the people of the village).
Alexandra Sheymovich – dictionary (conversion of The Khakas-Russian dictionary
Phil Krylov – programming of the automatic parser
Vera Maltseva – automatic parser creation, text processing in ELAN (correction of the parser’s results, annotation of sound)
Elena Tenkova – macros for writing the results of automatic parsing of texts into ELAN files
You may contact us with questions about the Corpus:
Vera Maltseva: malt.wh@gmail.com
Or with questions about the search platform:
Elena Sokur: elena.o.sokur@gmail.com
If you use data from the Spoken corpus of the dialects of Khakas in your research, please cite as follows: