### Comparing word properties across languages: A case study on ratings for arousal and valence

Annika Tjuka, Robert Forkel, and Mattis List

Words in the World conference (November 27, 2021)

---

# The challenge

----

Psychologists and linguists collect an increasing amount of data for a growing number of languages to describe various properties of words and concepts.

----

But no resource exists yet where one could compare different properties of words across languages.

----

Therefore, we created the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe), which combines data from psychology and linguistics.

<img src="https://pad.gwdg.de/uploads/d21977cc-5973-4cb6-8385-74a2fa068e98.png" alt="drawing" width="400"/>

---

# Data overview

<img src="https://pad.gwdg.de/uploads/8b3840a9-cdb4-44c6-9027-4a65fb764e88.jpeg" alt="drawing" width="600"/>

----

## Numbers

In its current version (v0.2), NoRaRe includes **65** unique word and concept properties derived from **98** different data sets across **40** languages.

----

## Norms

Data that are determined by taking samples from a total quantity. They are collected and applied predominantly in the field of psychology.

- word frequency
- lexical decision

----

## Ratings

Data that are based on participant judgments of a given word in a particular language, either on a scale or on other measures.

- age-of-acquisition
- psychological states (e.g., valence, arousal)
- sensory modality (e.g., haptic, visual)

----

## Relations

Data that offer information on the relation between two words or concepts. They are collected in the fields of comparative linguistics and Natural Language Processing (NLP).

- colexifications (e.g., CLICS)
- stability rankings
- associations (e.g., WordNet)

----

## Access

Web interface: https://digling.org/norare/

GitHub: https://github.com/concepticon/norare-data

---

# Application

----

## Case study

A comparison of ratings for arousal and valence of words on a 9-point scale across English, Dutch, and Spanish.

----

## Material

- English ([Scott et al. 2019](https://doi.org/10.3758/s13428-018-1099-3))
- Dutch ([Moors et al. 2013](https://doi.org/10.3758/s13428-012-0243-8))
- Spanish ([Stadthagen-González et al. 2017](https://doi.org/10.3758/s13428-015-0700-2))

----

## Results

Arousal Ratings

----

![](https://pad.gwdg.de/uploads/cbfea110-c957-4ccb-9c2d-28b4f2e62a4f.png)

----

## Results

Valence Ratings

----

![](https://pad.gwdg.de/uploads/c58130fb-8998-480c-b4d9-2bbdb2a0150e.png)

----

- The correlation strength varied for arousal:
    - highest Pearson coefficient in the Dutch-Spanish pair (_R_=0.63)
    - lowest Pearson coefficient in the English-Spanish pair (_R_=0.32).

----

- For valence, the Pearson coefficient was above 0.8 in all language pairs.

----

- The findings for arousal differ from Jackson et al. (2019), who found that closely related languages have more similar emotion semantics.

---

# Discussion

----

The biggest challenge of our project was to transform a large number of different data sets so that they are comparable.

----

Especially for cross-linguistic studies, the NoRaRe database is the perfect starting point and properties can be compared easily across languages.

----

The results of the comparison of arousal and valence ratings across three languages showed that our approach provides important insights about the comparability of word properties.

----

Yet, there are also limitations to our approach.

----

- Although many studies collected data on the same properties, such as concreteness or imageability, most of them use different scales (e.g., 5-, 7-, or 9-point scales).

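----

One common way to put ratings from different scales on a shared footing is linear (min-max) rescaling onto the range 0 to 1. The sketch below is only an illustration of that idea, not the procedure used in NoRaRe; the function name `rescale` and the example values are hypothetical.

```python
def rescale(rating: float, scale_min: float, scale_max: float) -> float:
    """Map a rating from [scale_min, scale_max] onto [0, 1] (hypothetical helper)."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (rating - scale_min) / (scale_max - scale_min)


# Illustrative values only: the midpoints of a 5-, 7-, and 9-point scale
# all land on the same rescaled value.
midpoints = [
    rescale(3, 1, 5),  # 5-point scale
    rescale(4, 1, 7),  # 7-point scale
    rescale(5, 1, 9),  # 9-point scale
]
print(midpoints)
```

Linear rescaling leaves Pearson correlations unchanged, so it matters mainly when comparing means or plotting ratings from different scales on shared axes.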
----

- The number of items that can be compared is limited by the concepts provided in Concepticon (List et al. 2021).

----

However, this is only the beginning and we will extend the data in the future.

---

<img src="https://pad.gwdg.de/uploads/01b71c50-6d65-4656-828d-8964044704fa.jpeg" alt="drawing" width="75"/>
<img src="https://pad.gwdg.de/uploads/695f4580-16f7-4cf3-8924-ce3f164a6d32.jpeg" alt="drawing" width="80"/>
<img src="https://pad.gwdg.de/uploads/a8f45403-1e99-4da5-8c73-1881972245cc.jpeg" alt="drawing" width="80"/>

:::info
Tjuka, Annika, Robert Forkel & Johann-Mattis List. 2021. [Linking norms, ratings, and relations of words and concepts across multiple language varieties](https://doi.org/10.3758/s13428-021-01650-1). _Behavior Research Methods_.
:::

---

# Tutorials

----

:::info
Tjuka, Annika. "Adding data sets to NoRaRe: A guide for beginners," [Blog post] in _Computer-Assisted Language Comparison in Practice_, 11/08/2021, https://calc.hypotheses.org/2890.
:::

----

:::info
Tjuka, Annika. "Comparing NoRaRe data sets: Calculation of correlations and creation of plots in R," [Blog post] in _Computer-Assisted Language Comparison in Practice_, 01/11/2021, https://calc.hypotheses.org/3109.
:::

---

# Thank you

Contact: [@AnnikaTjuka](https://twitter.com/AnnikaTjuka)

tjuka@shh.mpg.de