Dr. Simon J. Greenhill

I research why and how people created all the amazing languages around us, and what they tell us about human prehistory.

I use (mainly) Bayesian phylogenetic methods to tackle these questions and have investigated everything from how the Austronesian peoples settled the Pacific, to modelling the co-evolution of linguistic structure. And I have built a number of large-scale databases to help answer these questions.

You can find me on Twitter or Mastodon, or at the University of Auckland.



    Languages of Barrier Islands, Sumatra - Description, History and Typology

    The Barrier Islands Languages project is funded by an Australian Research Council Discovery Project grant (DP230102019) titled “Languages of Barrier Islands, Sumatra: Description, History and Typology”.

    This project will run between 2023 and 2027 and investigate under-/undocumented Austronesian languages of the Barrier Islands, including Mentawai, (Simaluaya) Nias, Semeulue and Sikule, as well as neighbouring Northwest Sumatra languages, such as Simalungun Batak. It will generate new knowledge about the languages, cultures and societies of the region and make it freely available to the public.

    Research will uncover past migration patterns in Southeast Asia, advance language theory, such as linguistic typology and language change, and support the computational modelling of Austronesian for future language technologies.

    Grambank - a database of structural (typological) features of language

    Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for 2,467 languages. The aim is to eventually reach as many as 3,500 languages. The database can be used to investigate deep language prehistory, the geographical distribution of features, language universals and the functional interaction of structural features.

    Kinbank - Database of Kinship terminology

    Kinbank is a database of kinship terminologies to be used for exploring cross-linguistic diversity in kinship organisation. The database includes 1,229 languages and a set of 100 core kin types between grandparents and grandchildren, and between parent’s siblings and parent’s siblings’ children. A major advantage of Kinbank is the focused language family sampling and sampling based on occurrence in existing anthropological databases (e.g. d-place.org), allowing us to test the relationship between languages and behaviour. This allows the use of phylogenetic methods to reconstruct the states of proto-kinship, account for common ancestry in models of kinship change, and test for correlated evolution between linguistic and behavioural patterns.

    Lexibank - a public repository of standardized wordlists with computed phonological and lexical features

    The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization, which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

    GeLaTo - GEnes and LAnguages TOgether

    The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic, cultural and environmental diversity. Population genetic samples are assigned to existing Glottocodes following ethnolinguistic criteria, and the data is filtered following the indications of geneticists, linguists, cultural anthropologists and historians. The dataset provides elaborated summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.

    CLICS³ - Database of Cross-Linguistic Colexifications

    The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³ - the third installment of CLICS - exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.