Formant Combinatorics in Slovenian

Basic Info

Description

The project explores the combinatorics of word-formational formants in order to describe the word-formational and semantic extension mechanisms of Slovenian on contemporary language material, including all contemporary Slovenian dictionaries and contemporary corpora, using state-of-the-art research methods from linguistics and language technologies. Slovenian, like other Slavic languages, is characterized by an extremely rich morphemic structure of words, which results from multistage formation. For example, in the first stage the adjective mlad ‘young’ yields the noun mladost ‘youth’; in the second stage this yields the adjective mladosten ‘youthful’; in the third stage, the noun mladostnik ‘adolescent’; and in the fourth stage, the possessive adjective mladostnikov ‘adolescent’s’. The example shows the compatibility of four suffixal formants: -ost + -en + -ik + -ov. Compatibility of formants is understood as the ability of different word-formational formants to co-exist within multistage formation, taking into account the aspect of semantic extension. The project will establish a new research field in Slovenian linguistics: morphotactics.
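As a minimal illustration, the derivation chain from the example can be written down as data and the sequence of suffixal formants read off from it. The names `CHAIN`, `formant_sequence` and `adjacent_pairs` are hypothetical and not part of any project resource. The surface forms are stored rather than computed, because they are not plain concatenations (mladosten + -ik gives mladostnik, not mladostenik).

```python
# Derivation chain from the example: (derived word, formant added at this stage).
# The base adjective carries no formant, hence None.
CHAIN = [
    ("mlad", None),
    ("mladost", "-ost"),
    ("mladosten", "-en"),
    ("mladostnik", "-ik"),
    ("mladostnikov", "-ov"),
]


def formant_sequence(chain):
    """Return the formants in the order in which they were attached."""
    return [formant for _, formant in chain if formant is not None]


def adjacent_pairs(formants):
    """Return the pairs of formants that directly follow one another."""
    return list(zip(formants, formants[1:]))


if __name__ == "__main__":
    sequence = formant_sequence(CHAIN)
    print(" + ".join(sequence))
    print(adjacent_pairs(sequence))
```

The adjacent pairs are the smallest unit on which compatibility between two formants can be observed; longer combinations can be built from them.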

 

The language technology objective of the project is to create the first training set and the first language technology application for automatic morpheme segmentation of Slovenian words. This is of key importance for the development of semantic language resources and language technologies for Slovenian, and it is important for linguistics as well. The objective also underlines the strategic importance of the project, since the development of language technologies is a necessity for all modern languages. At the same time, the project results will enable integration into the international research community, including through comparisons with research on morphologically (word-formationally, genetically) more closely related languages (other Slavic languages) and less closely related languages (non-Slavic languages). Understanding morphological segmentation also has considerable potential for improving deep learning models (see Hofmann et al. 2020b), on which most modern language technology tools are based. The results will have applied value in many subsequent applied projects, for example in speech synthesizers, which matter both for the inclusion of language users from vulnerable social groups and for the development of smart devices, robotics, etc.


Project steps

PHASE 1: DIGITIZATION AND FORMALIZATION OF EXISTING PRINTED WORD-FORMATIONAL RESOURCES FOR SLOVENIAN

(October 2021 - January 2022)

  • coding data in a formalism created for this purpose
  • creating the schema of the word-formational database, which will be built on the basis of existing sources
  • extraction of statistically relevant data about frequency distribution and combinatorics of suffixal formants in Slovenian
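The extraction of frequency and combinatorics data named in the last step could be sketched as follows. This is only a sketch under stated assumptions: the dictionary `ENTRIES` is a hypothetical stand-in for the word-formational database and holds only the words from the project description, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical database excerpt: word -> suffixal formants in order of attachment.
ENTRIES = {
    "mladost": ["-ost"],
    "mladosten": ["-ost", "-en"],
    "mladostnik": ["-ost", "-en", "-ik"],
    "mladostnikov": ["-ost", "-en", "-ik", "-ov"],
}


def formant_frequencies(entries):
    """Count in how many derivation chains each formant occurs."""
    counts = Counter()
    for formants in entries.values():
        counts.update(formants)
    return counts


def pair_frequencies(entries):
    """Count how often one formant is directly followed by another."""
    counts = Counter()
    for formants in entries.values():
        counts.update(zip(formants, formants[1:]))
    return counts


if __name__ == "__main__":
    for formant, n in formant_frequencies(ENTRIES).most_common():
        print(formant, n)
    for (first, second), n in pair_frequencies(ENTRIES).most_common():
        print(first, "+", second, n)
```

Counting adjacent pairs rather than whole chains keeps the table small and makes unattested combinations easy to spot as pairs with zero frequency.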

 

PHASE 2: REVIEW AND EVALUATION OF THE EXTRACTED SET OF COMBINATIONS OF WORD-FORMATIONAL FORMANTS

(January - June 2022)

  • creation of the final set of combinations of word-formational formants
  • inventory of typical phonetic alternations at morpheme boundaries
  • inventory of morpheme reductions

 

 

PHASE 3: USE OF OTHER DICTIONARY SOURCES

(January - September 2022)

  • identification of derivatives of the same types in other dictionary sources
  • inventorying new types of derivatives and combinations of word-formational formants

 

 

PHASE 4: ANALYSIS OF THE OBTAINED DATA

(April 2022 - September 2023)

  • analysis of the frequencies and combinatorics of suffixal formants in other dictionary resources
  • analysis of semantic functions of word-formational formants and combinatorial restrictions that stem from them
  • analysis of the combinatorics of individual components of derivatives with regard to their origin (borrowed vs. non-borrowed)
  • analysis of the combinatorics of individual components of derivatives with regard to connotation of individual components
  • analysis of formation of feminine forms of nouns, especially the regularities of distribution of competing formants

 

PHASE 5: USE OF CORPUS RESOURCES

(October 2022 - September 2023)

  • identification of derivatives of the same types in corpus resources
  • inventorying new types of derivatives and combinations of word-formational formants
  • comparison of the combinatorics of suffixal formants in the material obtained earlier with that in specialized corpora such as Janes
  • linguistic analysis of the obtained data

 

PHASE 6: PREPARATION OF THE TRAINING SET FOR MORPHEMIC ANALYSIS OF WORDS AND USE OF MACHINE LEARNING

(October 2022 - February 2024)

  • preparation of the training set for automatic morphemic analysis of random vocabulary
  • machine learning on the basis of these data, including testing deep neural network-based methods
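One common way to turn segmented words into training data for a sequence labeling model is to give each character a boundary label. The sketch below assumes a simple B/I scheme (B marks the first character of a morpheme, I any other character); this scheme and the function names are assumptions for illustration, not the format used in the project.

```python
def to_bi_labels(segments):
    """Convert a list of morphemes into (word, per-character B/I labels)."""
    if any(len(segment) == 0 for segment in segments):
        raise ValueError("empty morpheme in segmentation")
    labels = []
    for segment in segments:
        # First character of each morpheme is B, the rest are I.
        labels.extend(["B"] + ["I"] * (len(segment) - 1))
    return "".join(segments), labels


def from_bi_labels(word, labels):
    """Rebuild the list of morphemes from a word and its B/I labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels differ in length")
    segments = []
    for char, label in zip(word, labels):
        if label == "B" or not segments:
            segments.append(char)   # start a new morpheme
        else:
            segments[-1] += char    # continue the current morpheme
    return segments


if __name__ == "__main__":
    word, labels = to_bi_labels(["mlad", "ost"])
    print(word, "".join(labels))
    print(from_bi_labels(word, labels))
```

A model trained on such pairs predicts one label per character, and `from_bi_labels` converts its output back into morphemes for evaluation.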

 

PHASE 7: PUBLICATION OF THE SYNTHESIS OF THE WORD-FORMATIONAL ANALYSIS FINDINGS

(October 2022 - September 2024)

  • synthesis of the analysis of the data obtained in dictionary resources and corpora
  • evaluation of the results of automatic morphemization of random words
  • presentation of the research results at conferences
  • preparation of a monograph which will present the combinatorics of word-formational formants on contemporary language material and thus the functioning of word-formational and semantic extension mechanisms in Slovenian