RuThes Linguistic Ontology

The RuThes thesaurus is a hierarchy of concepts viewed as units of thought. A concept is associated with the set of language expressions that refer to it in texts. Each concept should be distinguishable from related concepts, and these distinctions should be expressed through a specific set of relations or through the language expressions associated with the concept, called text entries.

Words and phrases whose meanings refer to the same concept in the thesaurus are called ontological synonyms. Ontological synonyms can comprise sense-related words belonging to different parts of speech (e.g., the noun privatization and the verb privatize). Language expressions that belong to different linguistic styles, as well as technical terms and general lexical units, can also be presented as ontological synonyms of the same concept. For example, the concept OIL INDUSTRY has the following text entries: "oil industry" (neutral), "neftyanka" (slang), and "nefteprom" (abbreviation). Free multiword expressions may be included in synonym sets as well.
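The link between concepts and text entries can be illustrated with a minimal Python sketch. It is not the RuThes data format or API: the class names, the `note` labels, the helper functions, and the concept name PRIVATIZATION are assumptions made for illustration. Only the OIL INDUSTRY entries and the privatization / privatize pair come from the description above.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TextEntry:
    text: str       # word or phrase as it occurs in texts
    note: str = ""  # illustrative label (style or part of speech)


@dataclass
class Concept:
    name: str       # unique concept name
    entries: list[TextEntry] = field(default_factory=list)


def build_index(concepts):
    """Map each lowercased text entry to the names of the concepts it refers to."""
    index = defaultdict(set)
    for concept in concepts:
        for entry in concept.entries:
            index[entry.text.lower()].add(concept.name)
    return index


def ontological_synonyms(a, b, index):
    """Two expressions are ontological synonyms if they share at least one concept."""
    return bool(index.get(a.lower(), set()) & index.get(b.lower(), set()))


# Example data; the concept name PRIVATIZATION is hypothetical.
concepts = [
    Concept("OIL INDUSTRY", [
        TextEntry("oil industry", "neutral"),
        TextEntry("neftyanka", "slang"),
        TextEntry("nefteprom", "abbreviation"),
    ]),
    Concept("PRIVATIZATION", [
        TextEntry("privatization", "noun"),
        TextEntry("privatize", "verb"),
    ]),
]

index = build_index(concepts)
print(ontological_synonyms("privatization", "privatize", index))
print(ontological_synonyms("privatize", "oil industry", index))
```

The index maps a text entry to a set of concept names rather than to a single concept, because an ambiguous word may be attached to several concepts.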

Each concept should have a clear, unambiguous, and concise name. Such names often help to express and delimit the denotational scope of the concept. In addition, names make the results of automatic document analysis easier to inspect.

The relations in RuThes are purely conceptual, not lexical (unlike the antonymy or derivational links found in wordnets). They are constructed as more formal, ontological versions of the relations used in traditional information-retrieval thesauri (Z39.19, 2005), which were designed to describe very broad, unstructured domains. The set of conceptual relations includes:

The main idea behind this set of relations is to describe the most essential and reliable relations between concepts, those that hold across the various contexts in which a concept is mentioned. This set of relations also makes it possible to describe domain terminologies or domain-specific ontologies, and to combine descriptions of lexical and domain-specific knowledge in the same resource.

Thus, RuThes has considerable similarities with the well-known WordNet thesaurus: concepts are based on the senses of real text units, lexical senses are represented explicitly, and word senses are covered in detail. At the same time, the differences include the attachment of different parts of speech to the same concept, the formulation of concept names, the attention paid to multiword expressions, the set of conceptual relations, etc.

At present RuThes includes 54 thousand concepts, 158 thousand unique text entries (75 thousand of them single words), 178 thousand concept-text entry relations, and more than 215 thousand conceptual relations. The published version of RuThes, RuThes-lite 2.0, contains 115 thousand text entries. It was extracted from the full RuThes by selecting words and phrases used in current Russian news streams and excluding several specific domains (Loukachevitch et al., 2014).


