## Automated Identification of Borrowings in Multilingual Wordlists

Johann-Mattis List

Max Planck Institute for Evolutionary Anthropology

Leipzig

---

## Background

---

### Background

* Few linguistic phenomena are as pervasive as language contact.
* There has been some progress in computational approaches to historical language comparison.
* No real progress has been made with respect to automated borrowing detection.

---

### Background

* In a review study from 2019, I discuss most available methods.
* The phylogeny-based methods (MLN approach) are rather unreliable.
* Sequence-based methods can detect borrowings across language families, but they only work on language pairs.

---

### Background

* As long as we cannot find reliable ways to identify borrowings based on correspondences and the like, it seems better to do what we CAN do: apply shallow borrowing detection methods to major contact areas to derive fresh statistics on borrowability.

---

## Borrowing Detection in Multilingual Wordlists

---

### A New Method for Borrowing Detection

1. select larger wordlists from a language contact area
2. use automatic cognate detection to identify cognates inside each language family
3. compare identified cognate sets *across* language families to find *xenolog clusters* (etymologically related words which have experienced borrowing in their history)

---

### A New Method for Borrowing Detection

* The method cannot identify borrowing directions.
* But it identifies clusters of borrowed words across different language families.
* It offers additional possibilities to analyze the data after xenologs have been identified.

---

## Case Study on South-East Asian Languages

---

### Data Selection

* lexical data for various SEA languages converted to CLDF (Forkel et al. 2018, https://cldf.clld.org)
* data aggregation is done via Concepticon (List et al. 2021, https://concepticon.clld.org)
* normalization of phonetic transcriptions is done via CLTS (List et al.
2021b, https://clts.clld.org)

---

### Data Selection

![](https://pad.gwdg.de/uploads/upload_83579cb777cd12ef8479a15322d95890.png)

---

### Data Selection

* Sino-Tibetan
  * Sinitic languages (12)
  * Bai (1)
  * Loloish (1)
* Hmong-Mien (23)
  * Mienic (4)
  * Hmongic (19)
* Tai-Kadai
  * Sui (3)
  * Zhuang (7)

---

### Data Annotation

![Edictor Example](https://pad.gwdg.de/uploads/upload_1764763789573ef2d6c9b4b9d8232a0c.png)

---

### Analysis

* New LingRex package (https://github.com/lingpy/lingrex) in version 1.1.0
* New CLDF dataset (https://github.com/lexibank/seabor) that offers the data in CLDF format and also runs the commands for analysis
* New plotting routines
* EDICTOR dataset with manual cognate and xenolog judgments

---

### Evaluation of the Workflow

method | precision | recall | f-score
--- | --- | --- | ---
automated cognate detection | 0.8728 | 0.8860 | 0.8794
automated borrowing detection | 0.9091 | 0.8397 | 0.8730

---

### Borrowability

Concept List | Non-Borrowed | Items
--- | --- | ---
Swadesh (1955) | 0.81 | 78
No Swadesh | 0.73 | 172
Leipzig-Jakarta | 0.81 | 61
No Leipzig-Jakarta | 0.74 | 189
All items | 0.76 | 250

---

### Borrowability Significance

Concept List | Significance | Difference
--- | --- | ---
Swadesh (1955) | 0.03 | 0.08
Swadesh (1952) | 0.01 | 0.08
Leipzig-Jakarta | 0.07 | 0.07

---

### Admixture Plots

![](https://pad.gwdg.de/uploads/upload_e8f1897e8608884085170f1dd1977c87.png)

---

### Individual Examples: Correct (Right)

![](https://pad.gwdg.de/uploads/upload_620bf0ec132ae9f0ec415b1fccd03581.png)

---

### Individual Examples: Flower

![](https://pad.gwdg.de/uploads/upload_bc32117e4a0000723e0f3cc6386d6163.png)

---

### Individual Examples: Name

![](https://pad.gwdg.de/uploads/upload_b9b08d43734ad430966d88fe819b850b.png)

---

## Concluding Remarks

---

### Concluding Remarks

* the method is not necessarily spectacular, but it can help us to work on the efficient annotation of xenologs in a large number of languages
of the world
* the workflow is fully replicable, and the data annotation is transparent
* the results regarding the significance of the Leipzig-Jakarta list merit further evaluation

---

### Final Remarks

* This study is joint work with Robert Forkel, and we will submit the paper to a post-publication review journal next week, hoping for helpful comments.
* Code is already online: https://github.com/lexibank/seabor/

---

## Thank you for your attention!
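
---

### Appendix: Sketch of the Workflow

The three steps of the new method (select wordlists, find cognates inside each family, compare cognate sets across families) can be illustrated with a minimal sketch. This is not the LingRex implementation: plain normalized edit distance stands in for the sequence comparison methods in LingPy and LingRex, the function names and thresholds are hypothetical, and the word forms are invented placeholders rather than data from the study.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def distance(a, b):
    """Edit distance normalized by the longer form (0 means identical)."""
    return edit_distance(a, b) / max(len(a), len(b))


def cluster(items, linked):
    """Single-linkage clustering: connected components of the `linked` relation."""
    clusters = []
    for item in items:
        merged, rest = [item], []
        for c in clusters:
            if any(linked(item, other) for other in c):
                merged.extend(c)
            else:
                rest.append(c)
        clusters = rest + [merged]
    return clusters


def family_internal_cognates(words, threshold=0.5):
    """Step 2: cluster words for the same concept inside each family."""
    groups = {}
    for w in words:
        groups.setdefault((w["family"], w["concept"]), []).append(w)
    cogsets = []
    for members in groups.values():
        cogsets += cluster(
            members, lambda a, b: distance(a["form"], b["form"]) <= threshold
        )
    return cogsets


def xenolog_clusters(cogsets, threshold=0.3):
    """Step 3: link cognate sets of the same concept across families."""
    def linked(s, t):
        if s[0]["concept"] != t[0]["concept"] or s[0]["family"] == t[0]["family"]:
            return False
        return any(distance(a["form"], b["form"]) <= threshold for a in s for b in t)

    # Only groups that join at least two cognate sets span several families.
    return [g for g in cluster(cogsets, linked) if len(g) > 1]


# Step 1: a toy wordlist (invented forms, hypothetical families A and B).
WORDS = [
    {"language": "A1", "family": "A", "concept": "flower", "form": "pana"},
    {"language": "A2", "family": "A", "concept": "flower", "form": "pan"},
    {"language": "B1", "family": "B", "concept": "flower", "form": "pana"},
    {"language": "B2", "family": "B", "concept": "flower", "form": "tiko"},
    {"language": "A1", "family": "A", "concept": "name", "form": "miu"},
    {"language": "B1", "family": "B", "concept": "name", "form": "sorak"},
]

xenologs = xenolog_clusters(family_internal_cognates(WORDS))
for group in xenologs:
    languages = sorted(w["language"] for cogset in group for w in cogset)
    print(group[0][0]["concept"], languages)
```

With these invented forms, the only cluster reported is the one for "flower", linking A1, A2, and B1. As in the method itself, the sketch says nothing about the direction of borrowing.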
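
---

### Appendix: Checking the F-Scores

The f-scores in the evaluation table can be recomputed from the reported precision and recall. The sketch assumes that the f-score is the usual harmonic mean of the two values; the slides do not state the definition, so this is an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-score), copied from the evaluation table.
RESULTS = {
    "automated cognate detection": (0.8728, 0.8860, 0.8794),
    "automated borrowing detection": (0.9091, 0.8397, 0.8730),
}

for method, (p, r, reported) in RESULTS.items():
    print(f"{method}: computed {f_score(p, r):.4f}, reported {reported:.4f}")
```

Under this assumption, both recomputed values agree with the reported ones to within 0.0001.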