Slovenian word-prevalence: an online mega-study of word knowledge
Principal Investigator at ZRC SAZU
Andrej Perdih, PhD
Original Title
Slovenian word-prevalence: an online mega-study of word knowledge
Project Team
Andrej Perdih, PhD, Matic Pavlič, PhD, Janoš Ježovnik, PhD, Artur Stepanov, PhD, Nataša Gliha Komac, PhD, Dejan Gabrovšek, PhD, Tina Pogorelčnik, Klara Trpkova Bergant
Project ID
J6-50199
Duration
1 October 2023–30 September 2026
Lead Partner
-
Financial Source
Slovenian Research and Innovation Agency
Partners
Faculty of Education, University of Ljubljana, University of Nova Gorica, University Medical Centre Ljubljana
The goal of the project is to determine word-prevalence data for Slovenian by means of a mega-study based on lexical decision and word–picture matching tasks. The project will provide a crucial subjective psycholinguistic norm, namely prevalence, defined as the percentage of the population that knows a word. Ratings of 10,000–20,000 Slovenian words will be collected from 4,000–8,000 native speakers through these two tasks and converted into standardized, freely available, and reliably evaluated norms of word prevalence.
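The definition of prevalence given above can be illustrated with a minimal sketch. It assumes that each response is reduced to a pair of a word and a yes/no judgment; the function name `word_prevalence` and the sample responses are hypothetical and are not taken from the project's data or software.

```python
from collections import defaultdict

def word_prevalence(responses):
    """Return {word: percentage of respondents who indicated they know it}."""
    known = defaultdict(int)  # number of "I know this word" responses per word
    seen = defaultdict(int)   # number of all responses per word
    for word, knows_word in responses:
        seen[word] += 1
        if knows_word:
            known[word] += 1
    return {word: 100.0 * known[word] / seen[word] for word in seen}

# Hypothetical responses: (word, did the respondent say they know it?)
responses = [
    ("miza", True), ("miza", True), ("miza", True), ("miza", True),
    ("klopotec", True), ("klopotec", False), ("klopotec", True), ("klopotec", False),
]
prevalence = word_prevalence(responses)
print(prevalence)
```

With real data, the raw percentage would presumably be weighted or transformed before publication; the sketch only shows the basic ratio.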
Word prevalence allows researchers to select word stimuli more carefully and in line with their research aims. First, by ranking words according to word prevalence (combined with word frequency), it is possible to delineate word difficulty ranges that can be used in selecting stimuli for psycholinguistic studies with factorial designs as well as for clinical use of diagnostic tests (such as receptive vocabulary tests). Word prevalence can also be used to predict differences in word processing efficiency. Second, word prevalence can serve as an estimate of the difficulty of words in vocabulary tests, and it is likely to be of interest to researchers developing algorithms for assessing the difficulty of texts. Third, word prevalence is useful in selecting vocabulary when preparing materials for teaching and learning a language as an L1 or L2. Finally, word frequency is currently one of the main criteria for selecting headwords in general (monolingual or bilingual) dictionaries; in the low-frequency ranges, it will be extremely useful to supplement this criterion with word prevalence.
To achieve this goal, we will first prepare the experimental protocol for the mega-study; that is, we will build language datasets, such as a word list and a nonword list, and define the socio-demographic metadata to be collected from the respondents. The questionnaire will then be promoted to obtain responses from a large number of adult L1 speakers of Slovenian, and it will run for one year. The responses will be analyzed to determine how respondents’ age, sex, place where they grew up, education, number of languages spoken, and occupation affect word prevalence. We will also obtain information about which words are better known by Slovenian speakers and how corpus frequency, word length, and other variables correlate with word prevalence. Furthermore, a methodology will be developed for including word-prevalence data in dictionary compilation.
To tackle unforeseen challenges that may arise during the project, we have appointed an international independent observer and advisor with vast experience gained in the recent Catalan word-prevalence research project.
PHASE A1 – PREPARING THE WORD LIST
- Define the number of words to be tested
- Obtain the word list from the Dictionary of the Slovenian Standard Language, second edition (2014)
- Obtain frequency data from the Gigafida 2.0 corpus
- Define the corpus frequency threshold
- Remove words with frequency below/above the threshold
- Select the words to be tested
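The frequency-based filtering steps above could look roughly like the following sketch. The function name `select_candidates`, the threshold values, and the frequency counts are assumptions made for illustration; they are not actual Gigafida 2.0 figures or thresholds chosen by the project.

```python
def select_candidates(headwords, corpus_frequency, min_freq, max_freq):
    """Keep headwords whose corpus frequency lies within [min_freq, max_freq]."""
    selected = []
    for word in headwords:
        freq = corpus_frequency.get(word, 0)  # words missing from the corpus count as 0
        if min_freq <= freq <= max_freq:
            selected.append(word)
    return selected

# Hypothetical headwords and made-up corpus counts
headwords = ["miza", "klopotec", "in", "zmikavt"]
corpus_frequency = {"miza": 12000, "klopotec": 300, "in": 25000000}

candidates = select_candidates(
    headwords, corpus_frequency, min_freq=100, max_freq=1000000
)
print(candidates)
```

In this toy example the very frequent word and the word without a corpus count fall outside the range, so only the two mid-range words remain as candidates.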
PHASE A2 – PREPARING THE PSEUDO/NONWORD LIST
- Define the characteristics of pseudowords and nonwords
- Define the number of pseudowords and nonwords to be tested
- Define proportions of pseudo/nonwords according to their phonological structure
- Generate the list of pseudo/nonword candidates
- Remove pseudo/nonwords that are too similar to real Slovenian words or their morphological forms
- Remove pseudo/nonwords that are real English words
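The two removal steps above can be sketched as follows. The sketch assumes that "too similar" is measured by Levenshtein edit distance with a hypothetical cut-off of two edits; the project may use a different similarity measure. All word lists and the names `edit_distance` and `filter_pseudowords` are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def filter_pseudowords(candidates, slovenian_forms, english_words, min_distance=2):
    """Drop candidates that are English words or lie closer than
    min_distance edits to any real Slovenian word form."""
    kept = []
    for candidate in candidates:
        if candidate in english_words:
            continue
        if any(edit_distance(candidate, form) < min_distance
               for form in slovenian_forms):
            continue
        kept.append(candidate)
    return kept

# Hypothetical inputs
candidates = ["mizo", "mila", "table", "brolka"]
slovenian_forms = {"miza", "mize", "mizo", "hiša", "hiše"}
english_words = {"table", "house"}

kept = filter_pseudowords(candidates, slovenian_forms, english_words)
print(kept)
```

Comparing every candidate with every word form is quadratic, so a full-size lexicon would likely call for an indexed lookup; the brute-force loop is kept here for readability.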
PHASE A3 – DEFINING SOCIO-DEMOGRAPHIC USER METADATA FOR THE QUESTIONNAIRE
- Define the socio-demographic data
- Prepare the questionnaire
PHASE A4 – PREPARING THE QUESTIONNAIRE SOFTWARE
- Import the word list and pseudo/nonword list
- Adjust socio-demographic metadata fields
- Adjust the strategy for which words are offered to respondents
- Implement hyperlinks from words to the Dictionary of the Slovenian Standard Language, second edition, on the Fran dictionary portal
- Prepare and import Slovenian text for the website
- Install the software onto the web server
- Test the questionnaire and provide feedback
- Implement changes based on the feedback
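One simple strategy for deciding which items are offered to a respondent is random sampling of words and nonwords for each session. The sketch below is only an assumption about what such a strategy might look like; the function `build_session`, the list sizes, and the items themselves are hypothetical and do not describe the project's questionnaire software.

```python
import random

def build_session(words, nonwords, n_words, n_nonwords, seed=None):
    """Draw a random mix of words and nonwords for one respondent.

    Returns a shuffled list of (item, is_real_word) pairs.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible for testing
    items = [(word, True) for word in rng.sample(words, n_words)]
    items += [(nonword, False) for nonword in rng.sample(nonwords, n_nonwords)]
    rng.shuffle(items)  # mix words and nonwords so the order is not predictable
    return items

# Hypothetical item pools
words = ["miza", "hiša", "klopotec", "okno", "reka"]
nonwords = ["brolka", "tepnik", "vudra"]

session = build_session(words, nonwords, n_words=3, n_nonwords=2, seed=1)
for item, is_real_word in session:
    print(item, is_real_word)
```

A production strategy would probably also balance how often each word has already been shown, which this sketch does not attempt.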
PHASE A5 – PREPARING THE QUESTIONNAIRE PROMOTION STRATEGY
- Define the preferred target groups
- Prepare the detailed promotion strategy
- Prepare questionnaire promotion materials (e-mails, social media texts, etc.)
PHASE A6 – RUNNING THE QUESTIONNAIRE
- Launch the questionnaire
- Carry out promotion activities
PHASE A7 – DEVELOPING THE CONCEPT OF USING WORD-PREVALENCE DATA IN LEXICOGRAPHY
- Establish a protocol for implementing word-prevalence data in compiling headword lists in dictionary projects
- Define how word-prevalence data influence lexicographic decisions with certain types of headwords
- Define how the determined prevalence of headwords influences their priority in dictionary compilation
PHASE A8 – DATA ANALYSES
- Determine the relevance of answers based on age, native language, outliers, technical issues, etc.
- Prepare the necessary R scripts for statistical analyses
- Model the results with regression analyses
- Interpret the results and correlations among psycholinguistic norms
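The first step above, deciding which answers are relevant, might be sketched as a per-session filter. The project plans to use R for its analyses; Python is used here only for illustration. The `Session` fields, the minimum age of 18, and the false-alarm cut-off of 0.3 are all assumptions, not criteria published by the project.

```python
from dataclasses import dataclass

@dataclass
class Session:
    session_id: str
    age: int
    native_slovenian: bool
    nonword_yes: int    # nonwords wrongly accepted as real words
    nonword_total: int  # nonwords shown in the session

def is_usable(session, min_age=18, max_false_alarm_rate=0.3):
    """Keep adult native speakers who rarely accept nonwords as words."""
    if session.age < min_age or not session.native_slovenian:
        return False
    if session.nonword_total == 0:
        return False  # no nonwords seen, so response quality cannot be judged
    false_alarm_rate = session.nonword_yes / session.nonword_total
    return false_alarm_rate <= max_false_alarm_rate

# Hypothetical sessions
sessions = [
    Session("s1", 34, True, 1, 20),
    Session("s2", 16, True, 0, 20),   # under the assumed age limit
    Session("s3", 41, False, 2, 20),  # not a native speaker
    Session("s4", 29, True, 12, 20),  # accepts too many nonwords
]
usable_ids = [s.session_id for s in sessions if is_usable(s)]
print(usable_ids)
```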
PHASE A9 – IMPLEMENTING WORD-PREVALENCE DATA IN DICTIONARY COMPILATION
- Import word-prevalence data into the dictionary database
- Improve the inclusion/exclusion decision process for border-case lemmas based on word-prevalence data
- Adjust entry compilation order based on word-prevalence data
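As a rough illustration of the last two steps, the sketch below orders lemmas by prevalence and flags lemmas in a middle band as border cases for editorial review. It assumes that better-known words are compiled first and that border cases are defined by a prevalence range; both are assumptions, as are the function names and all values.

```python
def compilation_order(prevalence):
    """Sort lemmas from most to least widely known.

    Ties are broken by plain string order, which is not Slovenian collation.
    """
    return sorted(prevalence, key=lambda lemma: (-prevalence[lemma], lemma))

def border_cases(prevalence, low, high):
    """Return lemmas whose prevalence falls inside [low, high]."""
    return sorted(lemma for lemma, pct in prevalence.items() if low <= pct <= high)

# Made-up prevalence percentages
prevalence = {"miza": 99.5, "klopotec": 72.0, "zmikavt": 41.0, "abak": 41.0}

order = compilation_order(prevalence)
to_review = border_cases(prevalence, low=30.0, high=50.0)
print(order)
print(to_review)
```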
PHASE A10, B5 – DATA PUBLICATION
- Export the raw data for responses and sessions, as well as the statistically processed data into a tab-delimited text format
- Add relevant metadata descriptions to the data
- Publish the data under open license on a repository; e.g., the Clarin.SI repository for linguistic data and NLP tools
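The export step above can be sketched with Python's standard `csv` module, which writes tab-delimited text when given a tab delimiter. The column names and rows are hypothetical; the actual layout of the published files is not specified here.

```python
import csv
import io

def write_tsv(rows, fieldnames, stream):
    """Write dictionaries as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        stream, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical processed data
rows = [
    {"word": "miza", "n_responses": 4, "prevalence_pct": 100.0},
    {"word": "klopotec", "n_responses": 4, "prevalence_pct": 50.0},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_tsv(rows, ["word", "n_responses", "prevalence_pct"], buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

When writing to disk, the file would be opened with `encoding="utf-8"` and `newline=""` so that Slovenian characters and line endings are preserved.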
PHASE A11, B6 – PROJECT OVERSIGHT AND DISSEMINATING RESULTS
- Coordinate cooperation between the project leader and the external advisor
- Disseminate results through research articles and conference presentations
PHASE B1 – PREPARING THE WORD–PICTURE DATASET
- Define the number of target words (and their phonological fillers) to be tested
- Obtain the target and filler word–picture pairs from the Franček database
- Select the word–picture pairs to be tested
PHASE B2 – PREPARING THE QUESTIONNAIRE SOFTWARE FOR WORD–PICTURE PAIRS
- Import the word–picture data
- Adjust socio-demographic metadata fields
- Adjust the strategy for which words are offered to respondents
- Implement hyperlinks from words to the Dictionary of the Slovenian Standard Language, second edition, on the Fran dictionary portal
- Prepare and import Slovenian text for the website
- Install the software onto the web server
- Test the questionnaire and provide feedback
- Implement changes based on the feedback
PHASE B3 – RUNNING THE QUESTIONNAIRE ON WORD–PICTURE PAIRS
- Launch the questionnaire
- Carry out promotion activities
PHASE B4 – DATA ANALYSES
- Determine the relevance of answers based on age, native language, outliers, technical issues, etc.
- Prepare the necessary R scripts for statistical analyses
- Model the results with regression analyses
- Interpret the results
PHASE B5 – cf. A10, B5
PHASE B6 – cf. A11, B6
Web application
Besedomat. Web application, vocabulary test.
Media publications
RTV Slovenija. Jezikovni pogovori. https://365.rtvslo.si/podkast/jezikovni-pogovori/175087721.
Delo. Beseda tedna. Izbor. https://www.delo.si/magazin/zanimivosti/izbor.
Lectures and presentations
Perdih, Andrej, Pavlič, Matic, Pogorelčnik, Tina. Koliko besed poznaš? Besedomat: množična raziskava razširjenosti slovenskih besed [How many words do you know? Besedomat: a mega-study of the prevalence of Slovenian words]. Lecture, Lingvistični krožek, Faculty of Arts, University of Ljubljana, 11 November 2024.