# Research Software for Historical Language Comparison

Johann-Mattis List

Max Planck Institute for Evolutionary Anthropology

---

## Outline

* Background
* Tools for Digitization
* Tools for Standardization
* Tools for Annotation
* Outlook

---

# Background

---

## Background

Language comparison opens windows

* to our past
* to our culture
* to our cognition

---

## Background

To compare languages, we need data which is

* digitized,
* standardized, and
* annotated.

---

## Background

**Digitization** refers to the representation of the data in a numerical (tabular) format that allows us to *model* the data in various ways.

---

## Background

**Standardization** refers to the *unification* of data representations so that they conform to common constructs, allowing us to compare data from different sources directly with each other.

---

## Background

**Annotation** refers to the *enrichment* of data by adding information to individual data points. This information is typically inferred manually by experts.

---

## Background

**Annotation** can be divided into:

A. Modeling

B. Representation

C. Implementation

---

## Background

**Modeling** refers to the development of models that we can use to describe our data.

---

## Background

**Representation** refers to the decisions we make to represent the model in concrete digital form with respect to our data.

---

## Background

**Implementation** refers to the concrete decisions we make to add information to our data in a machine- and human-readable way.

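
---

## Background

The three steps can be illustrated with a minimal sketch in plain Python. The toy data, the lookup tables, and the function names `digitize`, `standardize`, and `annotate` are assumptions made for illustration only; they are not part of any of the tools presented in this talk.

```python
import csv
import io

# Illustrative toy data, not taken from any published dataset.
RAW = """language,concept,form
English,hand,hænd
German,Hand,hant
Dutch,hand,ɦɑnt
"""

# Hypothetical lookup tables that link source labels to common constructs
# (Glottocodes for languages, an upper-case gloss for the concept).
LANGUAGE_IDS = {"English": "stan1293", "German": "stan1295", "Dutch": "dutc1256"}
CONCEPT_GLOSSES = {"hand": "HAND"}

# Hypothetical expert judgment: all three words belong to cognate set 1.
COGNATES = {
    ("stan1293", "HAND"): 1,
    ("stan1295", "HAND"): 1,
    ("dutc1256", "HAND"): 1,
}


def digitize(text):
    """Read the raw table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def standardize(rows):
    """Replace source-specific labels with shared identifiers."""
    return [
        {
            "glottocode": LANGUAGE_IDS[row["language"]],
            "concept": CONCEPT_GLOSSES[row["concept"].lower()],
            "form": row["form"],
        }
        for row in rows
    ]


def annotate(rows, cognate_sets):
    """Attach a cognate set ID to each row."""
    return [
        dict(row, cogid=cognate_sets[row["glottocode"], row["concept"]])
        for row in rows
    ]


wordlist = annotate(standardize(digitize(RAW)), COGNATES)
for row in wordlist:
    print(row)
```

Keeping the three steps in separate functions mirrors the distinction made above: each step can be checked on its own before the next one is applied.
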
---

# Tools for Digitization

---

## Tools for Digitization

* no dedicated tools available
* instead, targeted solutions tailored to each source
* test-driven data curation as the main paradigm

---

# Tools for Standardization

---

## Tools for Standardization

* reference catalogs help to link data to common constructs
* tabular data with metadata (CSVW) is used for our Cross-Linguistic Data Formats (CLDF)
* targeted software libraries for CLDF help to create and validate CLDF datasets

---

## Tools for Standardization

### Reference Catalogs

* Glottolog (https://glottolog.org) for languages
* Concepticon (https://concepticon.clld.org) for concepts
* CLTS (https://clts.clld.org) for transcriptions

---

## Tools for Standardization

### Example of Standardized Data

* https://tppsr.clld.org

---

## Tools for Annotation

* LingPy (https://lingpy.org): Python package for automated tasks in computational historical linguistics
* CL Toolkit (https://github.com/cldf/cltoolkit): Python package for processing CLDF data

---

## Tools for Annotation

* EDICTOR (https://digling.org/edictor): JavaScript application for the curation of etymological data
* MIS*L (https://lingpy.org/misol/): JavaScript application for the modeling of sound change

---

## Tools for Annotation

### Examples of Annotated Data

* https://github.com/lexibank/lexibank-analysed
* https://clics.clld.org

---

# Outlook

---

## Outlook

* finding the right balance between machine-friendly and human-friendly software and data is our major challenge
* so far, we have had good experiences with targeted web-based applications complemented by Python code

---

# Thank you for listening!