Dr. Simon J. Greenhill

I research why and how people created all the amazing languages around us, and what they tell us about human prehistory.

I use (mainly) Bayesian phylogenetic methods to tackle these questions and have investigated everything from how the Austronesian peoples settled the Pacific, to modelling the co-evolution of linguistic structure. And I have built a number of large-scale databases to help answer these questions.

You can find me on Twitter or Mastodon, or at the University of Auckland.


  • Bayesian phylogenetic analysis of Philippine languages supports a rapid migration of Malayo-Polynesian languages.

    King B, Greenhill SJ, Reid LA, Ross M, Walworth M, & Gray R. 2024. Bayesian phylogenetic analysis of Philippine languages supports a rapid migration of Malayo-Polynesian languages. Scientific Reports, 14, 14967.

    The Philippines are central to understanding the expansion of the Austronesian language family from its homeland in Taiwan. It remains unknown to what extent the distribution of Malayo-Polynesian languages has been shaped by back migrations and language leveling events following the initial Out-of-Taiwan expansion. Other aspects of language history, including the effect of language switching from non-Austronesian languages, also remain poorly understood. Here we apply Bayesian phylogenetic methods to a core-vocabulary dataset of Philippine languages. Our analysis strongly supports a sister …

  • Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships.

    Greenhill SJ. In Press. Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships. In Evans N & Fedden S (Eds). The Oxford Guide to the Papuan Languages. Oxford University Press: Oxford.

    The Trans‐New Guinea language family is one of the world’s largest language families. Strikingly, it is also one of the world’s least studied. There is ongoing debate about which of many languages should be included in Trans‐New Guinea and how these relate to each other. Resolving this debate is hard due to the complexities of studying New Guinea languages, and a lack of adequate data suitable for detailed historical linguistic work. These difficulties have led to suggestions that the only way forward is to wait for low‐level descriptive field‐work and detailed bottom‐up historical …

  • The evolutionary dynamics of how languages signal who does what to whom.

    Shcherbakova O, Blasi DE, Gast V, Skirgård H, Gray RD, & Greenhill SJ. 2024. The evolutionary dynamics of how languages signal who does what to whom. Scientific Reports, 14, 7259.

    Languages vary in how they signal “who does what to whom”. Three main strategies to indicate the participant roles of “who” and “whom” are case, verbal indexing, and rigid word order. Languages that disambiguate these roles with case tend to have either verb-final or flexible word order. Most previous studies that found these patterns used limited language samples and overlooked the causal mechanisms that could jointly explain the association between all three features. Here we analyze grammatical data from a Grambank sample of 1705 languages with phylogenetic causal graph methods. Our results …

  • Variation in phoneme inventories: quantifying the problem and improving comparability.

    Anderson C, Tresoldi T, Greenhill SJ, Forkel R, Gray RD & List JML. 2023. Variation in phoneme inventories: quantifying the problem and improving comparability. Journal of Language Evolution, 11, lzad011.

    For over a century, the phoneme has played a central role in linguistic research. In recent years, collections of phoneme inventories, originally designed for cross-linguistic purposes, have increasingly been used in comparative studies involving neighbouring disciplines. Despite the extended application of this type of data, there has been no research into its comparability or tests of its reliability. In this study, we carry out a systematic comparison of nine popular phoneme inventory collections. We render them comparable by linking them to standardised formats for the handling of …

    Languages of Barrier Islands, Sumatra - Description, History and Typology

    The Barrier Islands Languages project is funded by an Australian Research Council Discovery Project grant (DP230102019) titled “Languages of Barrier Islands, Sumatra: Description, History and Typology”.

    This project runs from 2023 to 2027 and investigates undocumented and under-documented Austronesian languages of the Barrier Islands, including Mentawai, (Simaluaya) Nias, Simeulue and Sikule, as well as neighbouring Northwest Sumatra languages, such as Simalungun Batak. It will generate new knowledge about the languages, cultures and societies of the region and make it freely available to the public.

    The research will uncover past migration patterns in Southeast Asia, advance linguistic theory in areas such as typology and language change, and support the computational modelling of Austronesian languages for future language technologies.

    Grambank - a database of structural (typological) features of language

    Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for 2,467 languages. The aim is to eventually reach as many as 3,500 languages. The database can be used to investigate deep language prehistory, the geographical distribution of features, language universals and the functional interaction of structural features.

    Kinbank - Database of Kinship terminology

    Kinbank is a database of kinship terminologies to be used for exploring cross-linguistic diversity in kinship organisation. The database includes 1229 languages and a set of 100 core kin types, covering the generations from grandparents to grandchildren as well as parents’ siblings and parents’ siblings’ children. A major advantage of Kinbank is that languages are sampled by language family and by their occurrence in existing anthropological databases (e.g. d-place.org), allowing us to test the relationship between languages and behaviour. This allows the use of phylogenetic methods to reconstruct the states of proto-kinship, account for common ancestry in models of kinship change, and test for correlated evolution between linguistic and behavioural patterns.

    Lexibank - a public repository of standardized wordlists with computed phonological and lexical features

    The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

    GeLaTo - GEnes and LAnguages TOgether

    The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic, cultural and environmental diversity. Population genetic samples are assigned to existing Glottocodes following ethnolinguistic criteria, and the data are filtered following the guidance of geneticists, linguists, cultural anthropologists and historians. The dataset provides detailed summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.

    CLICS³ - Database of Cross-Linguistic Colexifications

    The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.