<img src="https://pad.gwdg.de/uploads/upload_540d3443f66bbad09b68f4019abe7796.png" alt="drawing" width="300"/>

### Norms, Ratings, and Relations for multiple languages

Annika Tjuka

in collaboration with Robert Forkel and Mattis List

DLCE Department Meeting (July 27, 2021)

---

# The challenge

----

Psychologists and linguists collect an increasing amount of data for a growing number of languages to describe various properties of words and concepts.

----

But no resource exists yet where one could compare different properties of words across languages.

----

Therefore, we created the Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe), which combines data from psychology and linguistics.

---

# Data overview

<img src="https://pad.gwdg.de/uploads/upload_a516697b7526b4f285ad173e25c8ddb0.jpeg" alt="drawing" width="600"/>

----

## Numbers

In its current version (v0.2), NoRaRe includes **65** unique word and concept properties derived from **98** different data sets across **40** languages.

----

## Norms

Data that are determined by taking samples from a total quantity. They are collected and applied predominantly in the field of psychology.

- word frequency
- lexical decision

----

## Ratings

Data that are based on participant judgments of a given word in a particular language, either on a scale or on other measures.

- age-of-acquisition
- discrete emotions (e.g., anger, disgust)
- sensory modality (e.g., haptic, visual)

----

## Relations

Data that offer information on the relation between two words or concepts. They are collected in the fields of comparative linguistics and Natural Language Processing (NLP).

- colexifications (e.g., CLICS)
- stability rankings
- associations (e.g., WordNet)

----

## Comparability

We were confronted with a wide range of different data types and formats.

----

### How to structure your table headers

![](https://pad.gwdg.de/uploads/upload_34b986abd9b847622dae67f55d56cfe0.png)

<sub><sub>For R users: Wickham, Hadley. 2014. [Tidy Data](https://doi.org/10.18637/jss.v059.i10). _Journal of Statistical Software_.</sub></sub>

---

# Workflows

----

We established three workflows to account for the different structures that we found in data on norms, ratings, and relations of word and concept properties.

----

![](https://pad.gwdg.de/uploads/upload_d033ee8885436a44286a858ee1243922.png)

----

## Manual workflow

We used the well-established data curation workflow from the Concepticon project to link small to moderately large data sets (< 2,000 items).

For details, see the CALC blog (e.g., [Tjuka 2021](https://calc.hypotheses.org/2680); [Tjuka 2020](https://calc.hypotheses.org/2225)).

----

## Automated workflow

To make the word properties offered by large data sets (> 2,000 items) accessible, we developed a new algorithm, implemented in Python, for linking to Concepticon concept sets.

----

## Semi-automated workflow

This workflow uses the APIs provided by individual online databases (e.g., OmegaWiki, BabelNet) to query the data. We then manually choose which of the three or more possible matches is the preferred one.

---

# Access

----

We provide a web interface for a convenient overview:

- https://digling.org/norare/

----

![](https://pad.gwdg.de/uploads/upload_9158c1c591bafdffa03a8b4ba825534a.png)

----

The data and the Python package `pynorare` are curated on GitHub:

- https://github.com/concepticon/norare-data
- https://github.com/concepticon/pynorare

---

# Application

----

## Case study

A comparison of word frequencies across English, German, and Chinese.

----

## Material

SUBTLEX data for:

- English ([Brysbaert and New 2009](https://doi.org/10.3758/BRM.41.4.977))
- German ([Brysbaert et al. 2011](https://doi.org/10.1027/1618-3169/a000123))
- Chinese ([Cai and Brysbaert 2010](https://doi.org/10.1371/journal.pone.0010729))

----

## Hypotheses

1. Related languages have more similar frequencies across a set of shared concepts than non-related languages.
2. In related languages, fewer concepts show a large difference between frequencies than in non-related languages.

----

## Results English-German

<img src="https://pad.gwdg.de/uploads/upload_964b74790719ea6952189a8f0228d150.png" alt="drawing" width="600"/>

----

## Results English-Chinese

<img src="https://pad.gwdg.de/uploads/upload_35215998fe1de81d62c7f61aed1f1d52.png" alt="drawing" width="600"/>

----

:::info
Tjuka, Annika. 2020. [General patterns and language variation: Word frequencies across English, German, and Chinese](https://www.aclweb.org/anthology/2020.cogalex-1.3). _Proceedings of the Workshop on the Cognitive Aspects of the Lexicon at ACL_.
:::

---

# Discussion

----

The biggest challenge of our project was to transform a large number of different data sets so that they are comparable.

----

Especially for cross-linguistic studies, the NoRaRe database is the perfect starting point, and properties such as frequencies can be compared easily across languages.

----

Yet, there are also limitations to our approach.

----

- In contrast to Linked Data resources such as WordNet, which provide the concrete meanings of words in a given language, the Concepticon offers standardized concepts that indicate the denotation range of a given elicitation gloss.

----

- The number of concepts in the Concepticon limits the number of comparable items.

----

However, this is only the beginning, and we will extend the data in the future.

----

:::info
Tjuka, Annika, Robert Forkel & Johann-Mattis List. 2021. [Linking norms, ratings, and relations of words and concepts across multiple language varieties](https://doi.org/10.31234/osf.io/tgw3z). _Behavior Research Methods_.
:::

---

## Thank you