Corpus search is based on CQL syntax, accessed through a graphical user interface. Every request is automatically converted into CQL and shown in the CQL Search field. Corpus data is divided into two basic units: words and fragments (roughly equivalent to sentences). The purpose of a corpus search is to find all fragments that contain a given word.
Each informant in the corpus is described with a set of parameters, such as gender, year of birth, education etc. The data can be used in two ways:
The data about informants is anonymised, so each informant is identified by a code made up of letters and digits.
Every string entered in a search field is interpreted as a regular expression. Regular expressions are a set of rules that extend what a single search request can match. For example, the query string ca[tpn] finds cat, cap and can. The pattern ca.* means: find every string that starts with ca and continues with any combination of symbols. The result may include cat, call, cataclysm, capability and so on.
Regular expressions are an extremely useful tool, and no specialised knowledge is needed to use them. More about regular expressions:
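The two patterns above can be tried outside the corpus with Python's standard `re` module. This is only a sketch: the word list is invented for illustration, and it assumes that a pattern has to match the whole token, which may differ from how the corpus engine applies it.

```python
import re

# Hypothetical word list, used only to illustrate the two patterns.
words = ["cat", "cap", "can", "car", "call", "cataclysm", "capability", "scan"]


def matching_words(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [word for word in candidates if compiled.fullmatch(word)]


# ca[tpn]: "ca" followed by exactly one of t, p or n.
print(matching_words("ca[tpn]", words))

# ca.*: "ca" followed by any sequence of symbols (including none).
print(matching_words("ca.*", words))
```

Note that car and scan are rejected by the first pattern, and scan by the second, because the match is anchored to the whole word.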
In this mode the user has a number of fields at their disposal:
Tag | Definition |
---|---|
NOUN | noun |
ADJF | full form of adjective |
ADJS | short form of adjective |
COMP | comparative adjective |
VERB | verb |
INFN | infinitive verb |
PRTF | full form of participle |
PRTS | short form of participle |
GRND | gerund |
NUMR | numeral |
ADVB | adverb |
NPRO | pronoun |
PRED | predicative |
PREP | preposition |
CONJ | conjunction |
PRCL | particle |
INTJ | interjection |
Tag | Definition |
---|---|
anim | animate |
inan | inanimate |
Tag | Definition |
---|---|
masc | masculine |
femn | feminine |
neut | neuter |
ms-f | common gender |
Tag | Definition |
---|---|
sing | singular |
plur | plural |
Sgtm | singularia tantum |
Pltm | pluralia tantum |
Fixd | immutable word |
Tag | Definition |
---|---|
nomn | nominative |
gent | genitive |
datv | dative |
accs | accusative |
ablt | ablative |
loct | locative |
voct | vocative |
gen2 | second genitive |
acc2 | second accusative |
loc2 | second locative |
Tag | Definition |
---|---|
perf | perfective |
impf | imperfective |
Tag | Definition |
---|---|
tran | transitive |
intr | intransitive |
Refl | reflexive |
Tag | Definition |
---|---|
1per | first person |
2per | second person |
3per | third person |
Tag | Definition |
---|---|
pres | present tense |
past | past tense |
futr | future tense |
Tag | Definition |
---|---|
indc | indicative |
impr | imperative |
Tag | Definition |
---|---|
incl | inclusive |
excl | exclusive |
Tag | Definition |
---|---|
actv | active |
pssv | passive |
Tag | Definition |
---|---|
LATN | the word consists of Latin letters |
PNCT | punctuation mark |
NUMB | numeral token |
intg | integer number |
real | float number |
ROMN | roman numeral |
UNKN | unknown word, token failed to parse |
Below the search line there are checkboxes that modify how the string entered in the corresponding field is interpreted. They are: starts with, ends with and case sensitive (the last one turns off the default case-insensitive matching, e.g. when a user needs only tokens that start with a capital letter).
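One plausible reading of these checkboxes, expressed with Python's `re` module. The function `token_matches` and its `mode` parameter are hypothetical names introduced here, not part of the corpus interface, and the sketch does not model what happens when starts with and ends with are ticked together.

```python
import re


def token_matches(token, text, mode="exact", case_sensitive=False):
    """Check a token against a search string under one checkbox setting.

    mode is a hypothetical stand-in for the checkboxes:
    "exact" (nothing ticked), "starts_with" or "ends_with".
    """
    if mode == "starts_with":
        pattern = text + ".*"      # text, then anything
    elif mode == "ends_with":
        pattern = ".*" + text      # anything, then text
    elif mode == "exact":
        pattern = text
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Matching ignores case unless "case sensitive" is ticked.
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(pattern, token, flags) is not None


print(token_matches("Moscow", "mos", mode="starts_with"))
print(token_matches("Moscow", "mos", mode="starts_with", case_sensitive=True))
```

With case sensitive ticked, the search string Mos would still match Moscow, while mos would not.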
To search for word sequences, add an extra query line using the button on the right. Using the tokens between form field, a user may specify the minimum and maximum distance (counted in words) between the elements of a search sequence.
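In CQL, a distance constraint of this kind is usually written as an unconstrained token `[]` with a repetition range between the two search tokens. The sketch below builds such a query string; the helper names are hypothetical, and the attribute name `word` is an assumption (it is the conventional CWB name for the word form, but the corpus may use a different one).

```python
def token(word):
    """Format one CQL token constraint on the (assumed) `word` attribute."""
    return f'[word="{word}"]'


def sequence_query(first, second, min_between=0, max_between=0):
    """Build a CQL query for two tokens separated by min..max other tokens."""
    if min_between < 0 or max_between < min_between:
        raise ValueError("expected 0 <= min_between <= max_between")
    if max_between == 0:
        # Adjacent tokens: no gap element is needed.
        return f"{token(first)} {token(second)}"
    # []{m,n} stands for between m and n arbitrary tokens.
    gap = f"[]{{{min_between},{max_between}}}"
    return f"{token(first)} {gap} {token(second)}"


print(sequence_query("very", "good", 0, 2))
```

The resulting string can be compared with what the interface shows in the CQL Search field for the same settings.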
To limit the search to particular informants, a user may click the Filter button and specify the relevant metadata fields.
Watch out! Filters do not work if no search query is specified.
Advanced search requires at least a basic knowledge of CQL. A user can construct a CQL request directly in the CQL Search field. If the search expression is incorrect, no results are returned. See the brief CQL tutorial here.
To use CQL syntax, a user should take the encoding format of the data into account. The corpus is encoded in CWB format, which is based on the concepts of positional attributes (referring to individual words) and structural attributes (referring to fragments of a text). A CWB-encoded corpus looks like a quasi-XML file. Two positional attributes are used in the corpus:
Structural tags and their attributes:
utterance - fragment:
meta - metadata:
The results page consists of two elements: the top frame, which contains the search summary and navigation icons, and the result section, which presents all fragments containing the searched word or sequence of words.
Navigation icons are used to change the way results are presented. There are:
Search results are displayed as a table, with one segment (as text and audio) per row. A user can listen to the audio and save it to disk. The match is highlighted in red. Hovering over a word displays additional grammatical information in a pop-up window. To see the context of a fragment, a user can click the corresponding button in primary mode or in KWIC mode.
Query results can be saved in TSV (Tab-Separated Values) format via the save button; a separate button saves only the selected results. TSV is a text format in which one line corresponds to one fragment. Each row contains 13 columns separated by tabs. The columns include: fragment id, fragment start and end time, informant metadata, left context, matching token and right context. A TSV file can be opened in any spreadsheet editor, such as Excel, Calc or Google Sheets.
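A saved file can also be read programmatically. The following is a minimal sketch using Python's standard `csv` module. It relies only on the 13-column layout described above; the function name, the sample row and the absence of a header line are assumptions made for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = 13  # per the format description above


def read_fragments(handle):
    """Read tab-separated rows and check that each one has 13 columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {line_number}: expected {EXPECTED_COLUMNS} columns, got {len(row)}"
            )
        rows.append(row)
    return rows


# Placeholder row standing in for a real export (values are made up).
sample = io.StringIO("\t".join(f"field{i}" for i in range(1, 14)) + "\n")
rows = read_fragments(sample)
print(len(rows), len(rows[0]))
```

For a real export, the `io.StringIO` object would be replaced by a file opened with `open(path, encoding="utf-8", newline="")`, where the path and the encoding depend on how the file was saved.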
Still have questions? Feel free to contact the corpus developers!