Project: CLDF: Cross-Linguistic Data Formats

The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.


Publications from this Project:

  • Lexibank, a public repository of standardized wordlists with computed phonological and lexical features.

    List JM, Forkel R, Greenhill SJ, Rzymski C, Englisch J & Gray RD. 2022. Lexibank, a public repository of standardized wordlists with computed phonological and lexical features. Scientific Data, 9(1): 316.

    the past decades have seen substantial growth in digital data on the world’s languages. at the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a …

    Abstract PDF 10.1038/s41597-022-01432-0

  • Managing Historical Linguistic Data for Computational Phylogenetics and Computer-Assisted Language Comparison.

    Tresoldi T, Rzymski C, Forkel R, Greenhill SJ, List JM, & Gray R. 2022. Managing historical linguistic data for computational phylogenetics and computer-assisted language comparison. In Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, & Lauren B. Collister (Eds). Open Handbook of Linguistic Data Management.

    Computational phylogenetics is a relatively recent branch of historical linguistics that uses quantitative techniques to investigate the history of related languages. As the classical comparative method is less explicit on the techniques for constructing phylogenies of language families (see discussion in Jacques & List 2019), such a new approach can complement traditional techniques for sub-grouping based on shared innovations (Ross & Durie 1996).

    Abstract PDF 10.7551/mitpress/12200.001.0001

  • Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics.

    Forkel R, List J-M, Greenhill SJ, Bank S, Rzymski C, Cysouw M, Hammarström H, Haspelmath M & Kaiping GA & Gray RD. 2018. Cross-linguistic Data Formats, advancing data sharing and reuse in comparative linguistics. Scientific Data, 5:180205.

    The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for …

    Abstract PDF 10.1038/sdata.2018.205 Website Overview