Help

Basic concepts

Search engine

Corpus search is based on CQL syntax, accessed through a graphical user interface. Every request is automatically converted into CQL and shown in the CQL Search field. Corpus data is divided into two basic units: words and fragments (a fragment is roughly equivalent to a sentence). The point of searching the corpus is to find all fragments that contain a given word.

Metadata

Each informant in the corpus is described by a set of parameters, such as gender, year of birth, education, etc. This data can be used in two ways:

in searching: a query can be restricted by a particular metadata field, e.g. the age of informants;
in search results: information about the author of a statement can be displayed.

The data about informants is anonymised: each informant is identified by a code consisting of letters and digits.

Regular expressions

Every string entered in a search field is interpreted as a regular expression. Regular expressions are a set of rules that extend the ways a search request can be interpreted. For example, the regexp query string ca[tpn] finds cat, cap and can. The pattern ca.* stands for: find every string that starts with ca and continues with any combination of symbols. The result may include cat, call, cataclysm, capability and so on.
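The two patterns above can be tried out with a short Python sketch. It is only an illustration: the word list is made up, Python's re dialect may differ in details from the one used by the corpus engine, and whole-token matching (re.fullmatch) is an assumption about how the search fields behave.

```python
import re

# Made-up word list, used only to illustrate the two patterns.
words = ["cat", "cap", "can", "car", "call", "cataclysm", "capability", "dog", "scat"]

def matching(pattern, candidates):
    """Return the candidates that the pattern matches as a whole word."""
    compiled = re.compile(pattern)
    return [w for w in candidates if compiled.fullmatch(w)]

# ca[tpn]: "ca" followed by exactly one of the letters t, p or n.
set_matches = matching(r"ca[tpn]", words)
# ca.*: "ca" followed by any sequence of symbols (possibly empty).
prefix_matches = matching(r"ca.*", words)

print(set_matches)     # ['cat', 'cap', 'can']
print(prefix_matches)  # ['cat', 'cap', 'can', 'car', 'call', 'cataclysm', 'capability']
```

Note that scat is not matched by either pattern, because under the whole-token assumption the match has to begin at the start of the word.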

Regular expressions are an extremely useful tool, and a user does not need any special knowledge to use them. More about regular expressions:

Search

Basic search

In this mode, the user has a number of fields at their disposal:

Token. It searches for an exact word form, e.g. the query dogs returns every fragment in which the token dogs appears;
Lemma. It finds tokens by their normalized word form, e.g. buy returns occurrences of the tokens buy, buys, bought etc.;
Tag (grammatical features). It specifies the grammatical characteristics of a token. E.g. to retrieve all pronouns in the corpus, enter NPRO.* See the tagset of grammatical features in the tables below:

Part of speech tags

Tag	Definition
NOUN	noun
ADJF	full form of adjective
ADJS	short form of adjective
COMP	comparative adjective
VERB	verb
INFN	infinitive verb
PRTF	full form of participle
PRTS	short form of participle
GRND	gerund (adverbial participle)
NUMBR	numeral
ADVB	adverb
NPRO	pronoun
PRED	predicative
PREP	preposition
CONJ	conjunction
PRCL	particle
INTJ	interjection

Animacy

Tag	Definition
anim	animate
inan	inanimate

Gender

Tag	Definition
masc	masculine
femn	feminine
neut	neuter
ms-f	common gender

Number

Tag	Definition
sing	singular
plur	plural
Sgtm	singularia tantum
Pltm	pluralia tantum
Fixd	immutable word

Case

Tag	Definition
nomn	nominative
gent	genitive
datv	dative
accs	accusative
ablt	ablative
loct	locative
voct	vocative
gen2	second genitive
acc2	second accusative
loc2	second locative

Aspect

Tag	Definition
perf	perfective
impf	imperfective

Transitivity

Tag	Definition
tran	transitive
intr	intransitive
Refl	reflexive

Person

Tag	Definition
1per	first person
2per	second person
3per	third person

Tense

Tag	Definition
pres	present tense
past	past tense
futr	future tense

Mood

Tag	Definition
indc	indicative
impr	imperative

Inclusivity

Tag	Definition
incl	inclusive
excl	exclusive

Voice

Tag	Definition
actv	active
pssv	passive

Specific tags

Tag	Definition
LATN	the word consists of Latin letters
PNCT	punctuation mark
NUMB	numeral token
intg	integer number
real	float number
ROMN	roman numeral
UNKN	unknown word, token failed to parse

Below the search line there are checkboxes that modify how the string entered in the corresponding field is interpreted. They are: starts with, ends with and case sensitive (turns off the default case-insensitive matching, e.g. when a user needs only tokens that start with a capital letter).

To search for word sequences, add an extra query line using the button on the right. Using the form field tokens between, a user may specify the minimum and maximum distance (counted in words) between the elements of a search sequence.
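As a rough illustration of how such a form could map onto a CQL request, here is a minimal Python sketch. The helper names token_pattern and sequence_query are hypothetical, and the exact conversion performed by the interface is not documented here. The sketch assumes the positional attributes word and lemma, the standard CQP flag %c for case-insensitive matching, and that checking both starts with and ends with yields a "contains" pattern.

```python
def token_pattern(attribute, value, starts_with=False, ends_with=False,
                  case_sensitive=False):
    """Build one CQL token expression, e.g. [word="dog.*"%c] (hypothetical helper)."""
    # The value is kept as typed, because search fields take regular expressions.
    if starts_with:
        value = value + ".*"   # anything may follow the entered string
    if ends_with:
        value = ".*" + value   # anything may precede the entered string
    # Assumption: matching ignores case unless "case sensitive" is checked.
    flag = "" if case_sensitive else "%c"
    return f'[{attribute}="{value}"{flag}]'

def sequence_query(first, second, min_between=0, max_between=0):
    """Join two token expressions, allowing min..max arbitrary tokens between them."""
    if max_between == 0:
        return f"{first} {second}"
    # [] matches any token; {m,n} repeats it between m and n times.
    return f"{first} []{{{min_between},{max_between}}} {second}"

first = token_pattern("lemma", "buy")
second = token_pattern("word", "dog", starts_with=True, case_sensitive=True)
query = sequence_query(first, second, min_between=0, max_between=3)
print(query)  # [lemma="buy"%c] []{0,3} [word="dog.*"]
```

The request actually generated by the interface can always be checked in the CQL Search field.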

To limit the search to particular informants, a user may click the Filter button and specify the relevant metadata fields.

Watch out! Filters do not work if no search query is specified.

Advanced search

Advanced search requires at least a minimal knowledge of CQL. A user can write a CQL request directly in the CQL Search field. If a search expression is incorrect, no results are returned. See the brief CQL tutorial here.

To use CQL syntax, a user should take the encoding format of the data into account. The corpus is encoded in CWB format, which is based on the concepts of positional attributes (referring to individual words) and structural attributes (referring to fragments of a text). A CWB-encoded corpus looks like a quasi-XML file. Two positional attributes are used in the CWB corpus:

word - word form;
lemma;

Structural tags and their attributes:

utterance - fragment:
- utterance_file - the name of the file corresponding to fragment;
- utterance_from - fragment start time, calculated in milliseconds from the beginning of the file;
- utterance_to - fragment end time, calculated in milliseconds from the beginning of the file.
meta - metadata:
- meta_id - string id of an informant;
- ...and all metadata units given by the field data providers.
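To make the distinction between positional and structural attributes more concrete, the following Python sketch reads a tiny, made-up sample in a quasi-XML layout. The sample text, the file name rec01.wav and the function read_fragments are invented for illustration. The sketch assumes that tag attributes such as file, from and to surface as utterance_file, utterance_from and utterance_to, and that each token line holds the word form and the lemma separated by a tab.

```python
import re

# Hypothetical sample: structural attributes appear as tags, positional
# attributes as tab-separated columns (word, lemma), one token per line.
SAMPLE = """<utterance file="rec01.wav" from="0" to="1500">
Dogs\tdog
bark\tbark
</utterance>
<utterance file="rec01.wav" from="1500" to="3200">
She\tshe
bought\tbuy
bread\tbread
</utterance>
"""

ATTR = re.compile(r'(\w+)="([^"]*)"')

def read_fragments(text):
    """Collect fragments with their structural attributes and tokens."""
    fragments, current = [], None
    for line in text.splitlines():
        if line.startswith("<utterance"):
            # file="..." becomes utterance_file, and so on.
            attrs = {f"utterance_{k}": v for k, v in ATTR.findall(line)}
            current = {"attrs": attrs, "tokens": []}
        elif line.startswith("</utterance"):
            fragments.append(current)
            current = None
        elif line and current is not None:
            word, lemma = line.split("\t")
            current["tokens"].append({"word": word, "lemma": lemma})
    return fragments

fragments = read_fragments(SAMPLE)
# Find every fragment that contains a token with the lemma "buy".
hits = [f for f in fragments if any(t["lemma"] == "buy" for t in f["tokens"])]
for f in hits:
    print(f["attrs"]["utterance_from"], " ".join(t["word"] for t in f["tokens"]))
# prints: 1500 She bought bread
```

The lookup at the end mirrors what a lemma search does: the condition is checked on individual tokens, while the result is reported per fragment.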

Result page

The result page consists of two elements: the top frame, containing the search summary and navigation icons, and the result section, where all fragments containing the searched word or sequence of words are presented.

Navigation icons are used to change the way results are presented. They are:

- change the result display mode. Two modes are available: primary (used by default) and KWIC (Key Word in Context);
- show/hide metadata of a speaker;
- save selected fragments in a TSV (tab-separated values) file;
- save the whole response in TSV.

Search results are displayed as a table, one segment (as text and audio) per row. A user can listen to the audio and save it to disk. The matched query is highlighted in red. Hovering over a word displays additional grammatical information in a pop-up window. To see the context of a fragment, a user can click the corresponding icon in primary mode or the corresponding button in KWIC mode.

Query results can be saved in TSV (tab-separated values) format using the save buttons: one saves the whole response, another saves only the selected results. TSV is a text format in which one line corresponds to one fragment. Each row contains 13 columns separated by tab characters. Columns include: fragment id, fragment start and end time, informant metadata, left context, matching token and right context. A TSV file can be opened in any spreadsheet editor, such as Excel, Calc or Google Sheets.
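A saved file can also be processed with a script. The Python sketch below reads rows with the standard csv module and keeps only those with the 13 columns mentioned above. The function read_results and the sample row are invented for illustration, and the sketch assumes that the last three columns are the left context, the matching token and the right context, in that order. The real column order should be checked against an exported file.

```python
import csv
import io

EXPECTED_COLUMNS = 13  # number of columns stated in the description above

def read_results(handle):
    """Yield (left context, match, right context) from each well-formed row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != EXPECTED_COLUMNS:
            continue  # skip blank or malformed lines
        # Assumption: the contexts and the match occupy the last three columns.
        yield row[-3], row[-2], row[-1]

# Made-up row: id, start, end, seven placeholder metadata cells, then contexts.
sample_row = ["u1", "1500", "3200"] + ["-"] * 7 + ["She", "bought", "bread"]
sample = "\t".join(sample_row) + "\n"

concordance = list(read_results(io.StringIO(sample)))
print(concordance)  # [('She', 'bought', 'bread')]
```

For a real export, io.StringIO(sample) would be replaced by a file opened with open(path, newline="", encoding="utf-8"). If the export starts with a header row, that row would need to be skipped separately.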

Still have questions? Feel free to contact the corpus developers!