The RuThes thesaurus is a hierarchy of concepts viewed as units of thought. A concept is associated with the set of language expressions that refer to it in texts. Each concept should be distinguishable from related concepts, and these distinctions should be expressed through its specific set of relations or through its associated language expressions, called text entries.
Words and phrases whose meanings refer to the same concept in the thesaurus are called ontological synonyms. Ontological synonyms can include sense-related words that belong to different parts of speech (e.g., the noun privatization and the verb privatize). Language expressions of different linguistic styles, technical terms, and general lexical units can also be presented as ontological synonyms of the same concept. For example, the concept OIL INDUSTRY has the following text entries: нефтяная промышленность [oil industry] - neutral, нефтянка ("neftyanka") - slang, нефтепром ("nefteprom") - abbreviation. Free multiword expressions may be included in synonym sets as well.
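The idea of ontological synonymy can be illustrated with a minimal Python sketch. It is not the RuThes data format: the class names, the part-of-speech and style labels, and the concept name PRIVATIZATION are assumptions made for illustration; only the OIL INDUSTRY entries and the privatization / privatize pair come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TextEntry:
    text: str               # expression as it occurs in texts
    pos: str                # hypothetical part-of-speech label
    style: str = "neutral"  # hypothetical style label


@dataclass
class Concept:
    name: str
    entries: List[TextEntry] = field(default_factory=list)


def build_index(concepts: List[Concept]) -> Dict[str, Set[str]]:
    """Map each lowercased text entry to the names of the concepts it expresses."""
    index: Dict[str, Set[str]] = {}
    for concept in concepts:
        for entry in concept.entries:
            index.setdefault(entry.text.lower(), set()).add(concept.name)
    return index


def ontological_synonyms(a: str, b: str, index: Dict[str, Set[str]]) -> bool:
    """True if both expressions are text entries of at least one common concept."""
    return bool(index.get(a.lower(), set()) & index.get(b.lower(), set()))


oil = Concept("OIL INDUSTRY", [
    TextEntry("нефтяная промышленность", "NP"),
    TextEntry("нефтянка", "N", "slang"),
    TextEntry("нефтепром", "N", "abbreviation"),
])
# The concept name PRIVATIZATION is assumed; the source gives only the word pair.
privatization = Concept("PRIVATIZATION", [
    TextEntry("privatization", "N"),
    TextEntry("privatize", "V"),
])
index = build_index([oil, privatization])
```

In this sketch, `ontological_synonyms("privatize", "privatization", index)` holds even though the two words belong to different parts of speech, because both are attached to the same concept.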
Each concept should have a clear, univocal, and concise name. Such names often help to express and delimit the denotational scope of the concept. In addition, names facilitate the analysis of the results of automatic document analysis.
The relations in RuThes are only conceptual, not lexical (unlike, for example, antonymy or derivational links in wordnets). They are modeled on the relations of traditional information-retrieval thesauri (Z39.19, 2005), which were designed to describe very broad, unstructured domains, but are defined in a more formal, ontological way. The set of conceptual relations includes:
- the class-subclass relation;
- the part-whole relation, applied with the following restriction: the existence of the concept-part must be strictly tied to the concept-whole. For example, trees grow in many places, not only in forests, so the concept TREE cannot be linked directly to the concept FOREST with the part-whole relation; an additional concept FOREST TREE has to be introduced instead;
- the relation of external ontological dependence, which holds when the existence of one concept depends on the existence of another (for example, forests depend on the existence of trees) (Guarino, Welty, 2002). In RuThes this relation is denoted as an association with indexes: asc1 points to the main concept, and asc2 points to the dependent concept;
- symmetric associations between concepts, which can be established in a limited number of cases.
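The relation set listed above can be sketched as a small labeled graph in Python. This is only an illustration under stated assumptions: apart from asc1 and asc2, the relation labels (`subclass_of`, `part_of`, `assoc`, and their inverses) are hypothetical names, and the links given to FOREST TREE are one reading of the TREE / FOREST example rather than actual thesaurus data.

```python
from collections import defaultdict

# Inverse label for every relation; all labels except asc1/asc2 are hypothetical.
INVERSE = {
    "subclass_of": "superclass_of",
    "superclass_of": "subclass_of",
    "part_of": "has_part",
    "has_part": "part_of",
    "asc1": "asc2",    # asc1 points from the dependent concept to the main one
    "asc2": "asc1",    # asc2 points from the main concept to the dependent one
    "assoc": "assoc",  # symmetric association
}


class ConceptGraph:
    def __init__(self):
        # concept name -> set of (relation label, related concept name)
        self.edges = defaultdict(set)

    def add(self, source, relation, target):
        """Store a relation together with its inverse."""
        self.edges[source].add((relation, target))
        self.edges[target].add((INVERSE[relation], source))

    def related(self, concept, relation):
        """Sorted names of concepts linked to `concept` by `relation`."""
        return sorted(t for r, t in self.edges.get(concept, ()) if r == relation)


g = ConceptGraph()
# TREE is not a part of FOREST; the intermediate concept carries the part-whole link.
g.add("FOREST TREE", "subclass_of", "TREE")
g.add("FOREST TREE", "part_of", "FOREST")
# Forests depend on the existence of trees: FOREST is dependent, TREE is main.
g.add("FOREST", "asc1", "TREE")
```

Storing the inverse with every relation keeps asc1 and asc2 consistent: once FOREST is linked to TREE by asc1, TREE is automatically linked to FOREST by asc2.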
The main idea behind this set of relations is to describe the most essential and reliable relations between concepts, those that remain relevant across the various contexts in which a concept is mentioned. This set of relations also makes it possible to describe domain terminologies or domain-specific ontologies and to combine descriptions of lexical and domain-specific knowledge in the same resource.
Thus, RuThes has considerable similarities with the well-known WordNet thesaurus: the inclusion of concepts based on the senses of real text units, the representation of lexical senses, and detailed coverage of word senses. At the same time, the differences include the attachment of different parts of speech to the same concept, the formulation of concept names, the attention paid to multiword expressions, the set of conceptual relations, etc.
At present, RuThes includes 54 thousand concepts, 158 thousand unique text entries (75 thousand single words), 178 thousand concept-text entry relations, and more than 215 thousand conceptual relations. The published version of RuThes, RuThes-lite 2.0, contains 115 thousand text entries. It was extracted from the full RuThes on the basis of words and phrases used in current Russian news flows, excluding several specific domains (Loukachevitch et al., 2014).
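As a rough illustration of what these counts imply, the short Python calculation below derives two averages from the rounded figures quoted above. The variable names are assumptions, and the results are only approximations because the inputs are rounded to thousands.

```python
# Rounded counts quoted in the description of the full RuThes.
concepts = 54_000
unique_text_entries = 158_000
concept_entry_links = 178_000

# Average number of text entries attached to a concept.
entries_per_concept = concept_entry_links / concepts
# Average number of concepts a text entry is attached to (a rough ambiguity measure).
concepts_per_entry = concept_entry_links / unique_text_entries

print(f"{entries_per_concept:.1f} text entries per concept on average")
print(f"{concepts_per_entry:.2f} concepts per text entry on average")
```

Under these figures a concept has about 3.3 text entries on average, and a text entry is attached to about 1.13 concepts on average.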
1. Loukachevitch, Natalia, and Boris Dobrov. "RuThes linguistic ontology vs. Russian wordnets." Proceedings of the Global WordNet Conference GWC-2014. 2014.
2. Loukachevitch, Natalia, Boris Dobrov, and Ilia Chetviorkin. "RuThes-Lite, a publicly available version of thesaurus of Russian language RuThes." Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”, Bekasovo, Russia. 2014.
3. Loukachevitch, Natalia, and Ilia Chetviorkin. "Determining the most frequent senses using Russian linguistic ontology RuThes." Proceedings of the Workshop on Semantic Resources and Semantic Annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015, Vilnius, 11th May, 2015. No. 112. Linköping University Electronic Press, 2015.
4. Loukachevitch, N. V., G. Lashevich, A. A. Gerasimova, V. V. Ivanov, and B. V. Dobrov. "Creating Russian WordNet by Conversion." Proceedings of the Conference on Computational Linguistics and Intellectual Technologies Dialog-2016. 2016, pp. 405-415.
5. Loukachevitch, N., and G. Lashevich. "Multiword expressions in Russian Thesauri RuThes and RuWordNet." Proceedings of the AINL FRUCT 2016. 2016, pp. 66-71.
6. Loukachevitch, Natalia, and Boris Dobrov. "The Sociopolitical Thesaurus as a resource for automatic document processing in Russian." Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 21.2 (2015): 237-262.