Projects I'm involved in:

Grambank - a database of structural (typological) features of language

Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for 2,467 languages. The aim is to eventually reach as many as 3,500 languages. The database can be used to investigate deep language prehistory, the geographical distribution of features, language universals, and the functional interaction of structural features.
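Because most features are binary, one simple way to compare two languages is to count the features on which they differ. The following is a minimal sketch of that idea, not Grambank's own method: the feature IDs (`F1` to `F4`), the language names, and all values are invented for illustration, and `None` is assumed to mark a feature that could not be coded from the grammar.

```python
from typing import Dict, Optional

# Toy data: feature IDs, language names and values are all invented.
# None marks a feature that could not be coded for that language.
Coding = Dict[str, Optional[int]]

languages: Dict[str, Coding] = {
    "lang_a": {"F1": 1, "F2": 0, "F3": 1, "F4": None},
    "lang_b": {"F1": 1, "F2": 1, "F3": 0, "F4": 0},
    "lang_c": {"F1": 0, "F2": 1, "F3": None, "F4": 0},
}


def feature_distance(a: Coding, b: Coding) -> Optional[float]:
    """Share of features coded in both languages on which they differ."""
    shared = [f for f in a if f in b and a[f] is not None and b[f] is not None]
    if not shared:
        return None  # no overlap, so no distance can be computed
    mismatches = sum(1 for f in shared if a[f] != b[f])
    return mismatches / len(shared)


for first, second in [("lang_a", "lang_b"), ("lang_b", "lang_c"), ("lang_a", "lang_c")]:
    print(first, second, feature_distance(languages[first], languages[second]))
```

Features left uncoded in either language are skipped rather than counted as mismatches, so sparsely described languages are not penalised for missing data.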

Kinbank - Database of kinship terminology

Kinbank is a database of kinship terminologies to be used for exploring cross-linguistic diversity in kinship organisation. The database includes 1,229 languages and a set of 100 core kin types spanning grandparents to grandchildren, as well as parent’s siblings and parent’s siblings’ children. A major advantage of Kinbank is its focused language family sampling, together with sampling based on occurrence in existing anthropological databases, which allows us to test the relationship between languages and behaviour. This sampling allows the use of phylogenetic methods to reconstruct the states of proto-kinship, account for common ancestry in models of kinship change, and test for correlated evolution between linguistic and behavioural patterns.

Lexibank - a public repository of standardized wordlists with computed phonological and lexical features

The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization, which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets, from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2,000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

GELATO - GEnes and LAnguages TOgether

The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic, cultural and environmental diversity. Population genetic samples are assigned to existing Glottocodes following ethnolinguistic criteria, and the data are filtered following the guidance of geneticists, linguists, cultural anthropologists and historians. The dataset provides detailed summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.

CLICS3 - Database of Cross-Linguistic Colexifications

The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³ - the third installment of CLICS - exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.
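A colexification is a case where one language uses the same word form for two different concepts. The following minimal sketch shows how such patterns could be counted from a wordlist. It is a simplification for illustration and not the CLICS pipeline itself: the wordlist, the language names, and the function `count_colexifications` are all invented, and forms are compared by exact string identity.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist of (language, concept, form) rows. All entries are invented.
wordlist = [
    ("lang_a", "TREE", "ki"),
    ("lang_a", "WOOD", "ki"),
    ("lang_a", "FIRE", "api"),
    ("lang_b", "TREE", "pu"),
    ("lang_b", "WOOD", "pu"),
    ("lang_b", "FIRE", "pu"),
    ("lang_c", "TREE", "ma"),
    ("lang_c", "WOOD", "lo"),
    ("lang_c", "FIRE", "lo"),
]


def count_colexifications(rows):
    """Map each concept pair to the set of languages that colexify it."""
    # Step 1: collect the concepts expressed by each form within a language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)

    # Step 2: every pair of concepts sharing a form is one colexification.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)
    return languages_by_pair


colex = count_colexifications(wordlist)
for pair, langs in sorted(colex.items()):
    print(pair, len(langs))
```

Sorting the concepts before pairing them means each pair has one canonical order, so `("TREE", "WOOD")` and `("WOOD", "TREE")` are never counted separately.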

CLDF - Cross-Linguistic Data Formats

The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and are therefore not amenable to comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists and structural datasets) and a framework to incorporate more data types (e.g. parallel texts and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.
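To give a feel for what a standardized word list buys you, here is a minimal sketch that reads a forms table with Python's standard library and groups the forms by language. The column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`) follow what I understand to be the CLDF word list defaults and should be treated as assumptions, the rows are invented, and the helper `forms_by_language` is hypothetical. The sketch does not use the software package mentioned above.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a forms table on disk; all rows are invented for illustration.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
1,lang_a,hand,lima
2,lang_a,eye,mata
3,lang_b,hand,rima
4,lang_b,eye,mata
"""


def forms_by_language(handle):
    """Return {language: {concept: form}} built from a forms table."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(handle):
        grouped[row["Language_ID"]][row["Parameter_ID"]] = row["Form"]
    return dict(grouped)


wordlists = forms_by_language(io.StringIO(FORMS_CSV))
print(wordlists["lang_b"]["hand"])  # rima
```

Because every dataset in the same format shares these column names, the same few lines work unchanged across datasets, which is the practical point of a common format.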

Pulotu - Database of Austronesian Religions

Pulotu, the Proto-Polynesian word for the abode of the gods, is a database of supernatural beliefs and practices across Austronesian cultures. The database includes 137 Austronesian cultures and 63 variables on religion, history, society, and the natural environment. This database is specifically designed to test evolutionary hypotheses of religious belief and practice, with a primary focus on the traditional state of cultures. A major advantage of Pulotu is that robust language phylogenies are available for Austronesian cultures. This enables the use of phylogenetic comparative methods, which make it possible to reconstruct the states of proto-cultures, account for common ancestry in cross-cultural analysis, and test for correlated evolution between traits.

D-PLACE - Database of Places, Language, Culture and Environment

From the foods we eat, to who we can marry, to the types of games we teach our children, the diversity of cultural practices in the world is astounding. Yet, our ability to visualize and understand this diversity is often limited by the ways it traditionally has been documented and shared: on a culture-by-culture basis, in locally-told stories or difficult-to-access books and articles. D-PLACE represents an attempt to bring together this dispersed corpus of information.

Trans-New Guinea Online - a database of the Trans-New Guinea language family and friends

The Trans-New Guinea language family currently occupies most of the interior of New Guinea. This family is possibly the third largest in the world with 400 languages and is tentatively thought to have originated with root-crop agriculture around 10,000 years ago. However, vanishingly little is known about this family’s history.

POLLEX - Polynesian Lexicon Project Online

The Polynesian Lexicon Project Online is a large-scale comparative dictionary of Polynesian languages.

The Polynesian lexicon project, POLLEX, was initiated in 1965 by Bruce Biggs in order to provide a large-scale comparative dictionary of Polynesian languages. Since then, POLLEX has grown to include over 55,000 reflexes of more than 4,700 reconstructed forms in 68 languages. These data have enabled many fundamental advances in Polynesian linguistics and prehistory. At almost half a century old, POLLEX is one of the longest-standing databases of linguistic information, and has moved through various incarnations, from typewriter and edge-punched cards, through microfiche to mainframe computer.

ABVD - Austronesian Basic Vocabulary Database

The Austronesian Basic Vocabulary Database is the world’s largest cross-linguistic database of the Pacific. It contains ~300,000 lexical items from ~1,600 languages spoken throughout the Pacific region.