Dr. Simon J. Greenhill


I research why and how people created all the amazing languages around us, and what they tell us about human prehistory.

I use (mainly) Bayesian phylogenetic methods to tackle these questions and have investigated everything from how the Austronesian peoples settled the Pacific to modelling the co-evolution of linguistic structure. I have also built a number of large-scale databases to help answer these questions.

You can find me on Twitter or Mastodon, or at the University of Auckland.


  • Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships.

    Greenhill SJ. In Press. Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships. In Evans N & Fedden S (Eds). The Oxford Guide to the Papuan Languages. Oxford University Press: Oxford.

    The Trans‐New Guinea language family is one of the world’s largest language families. Strikingly, it is also one of the world’s least studied. There is ongoing debate about which of many languages should be included in Trans‐New Guinea and how these relate to each other. Resolving this debate is hard due to the complexities of studying New Guinea languages, and a lack of adequate data suitable for detailed historical linguistic work. These difficulties have led to suggestions that the only way forward is to wait for low‐level descriptive field‐work and detailed bottom‐up historical …

    DOI: 10.31235/osf.io/628cv

  • The evolutionary dynamics of how languages signal who does what to whom.

    Shcherbakova O, Blasi DE, Gast V, Skirgård H, Gray RD, & Greenhill SJ. 2024. The evolutionary dynamics of how languages signal who does what to whom. Scientific Reports, 14, 7259.

    Languages vary in how they signal “who does what to whom”. Three main strategies to indicate the participant roles of “who” and “whom” are case, verbal indexing, and rigid word order. Languages that disambiguate these roles with case tend to have either verb-final or flexible word order. Most previous studies that found these patterns used limited language samples and overlooked the causal mechanisms that could jointly explain the association between all three features. Here we analyze grammatical data from a Grambank sample of 1705 languages with phylogenetic causal graph methods. Our results …

    DOI: 10.1038/s41598-024-51542-5

  • Variation in phoneme inventories: quantifying the problem and improving comparability.

    Anderson C, Tresoldi T, Greenhill SJ, Forkel R, Gray RD & List JML. 2023. Variation in phoneme inventories: quantifying the problem and improving comparability. Journal of Language Evolution, 11, lzad011.

    For over a century, the phoneme has played a central role in linguistic research. In recent years, collections of phoneme inventories, originally designed for cross-linguistic purposes, have increasingly been used in comparative studies involving neighbouring disciplines. Despite the extended application of this type of data, there has been no research into its comparability or tests of its reliability. In this study, we carry out a systematic comparison of nine popular phoneme inventory collections. We render them comparable by linking them to standardised formats for the handling of …

    DOI: 10.1093/jole/lzad011

  • Societies of strangers do not speak grammatically simpler languages.

    Shcherbakova O, Michaelis SM, Haynie HJ, Passmore S, Gast V, Gray RD, Greenhill SJ, Blasi DE, & Skirgård H. 2023. Societies of strangers do not speak grammatically simpler languages. Science Advances, 9 (33), eadf7704.

    Many recent proposals claim that languages adapt to their environments. The linguistic niche hypothesis claims that languages with numerous native speakers and substantial proportions of nonnative speakers (societies of strangers) tend to lose grammatical distinctions. In contrast, languages in small, isolated communities should maintain or expand their grammatical markers. Here, we test these claims using a global dataset of grammatical structures, Grambank. We model the impact of the number of native speakers, the proportion of nonnative speakers, the number of linguistic neighbors, and the …

    DOI: 10.1126/sciadv.adf7704


    Languages of Barrier Islands, Sumatra - Description, History and Typology

    The Barrier Islands Languages project is funded by an Australian Research Council Discovery Project grant (DP230102019) titled “Languages of Barrier Islands, Sumatra: Description, History and Typology”.

    This project will run from 2023 to 2027 and investigate under-/undocumented Austronesian languages of the Barrier Islands, including Mentawai, (Simaluaya) Nias, Semeulue and Sikule, as well as neighbouring Northwest Sumatra languages, such as Simalungun Batak. It will generate new knowledge about the languages, cultures and societies of the region and make it freely available to the public.

    Research will uncover past migration patterns in Southeast Asia, advance language theory, such as linguistic typology and language change, and support the computational modelling of Austronesian for future language technologies.

    Grambank - a database of structural (typological) features of language

    Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for 2,467 languages. The aim is to eventually reach as many as 3,500 languages. The database can be used to investigate deep language prehistory, the geographical distribution of features, language universals and the functional interaction of structural features.

    Kinbank - Database of Kinship terminology

    Kinbank is a database of kinship terminologies to be used for exploring cross-linguistic diversity in kinship organisation. The database includes 1229 languages and a set of 100 core kin types between Grandparents and Grandchildren, and between Parent’s siblings, and Parent’s siblings’ children. A major advantage of Kinbank is the focused language family sampling and sampling based on occurrence in existing anthropological databases (e.g. d-place.org), allowing us to test the relationship between languages and behaviour. This allows the use of phylogenetic methods to reconstruct the states of proto-kinship, account for common ancestry in models of kinship change, and test for correlated evolution between linguistic and behavioural patterns.

    Lexibank - a public repository of standardized wordlists with computed phonological and lexical features

    The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization, which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

    GeLaTo - GEnes and LAnguages TOgether

    The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic, cultural and environmental diversity. Population genetic samples are assigned to existing GlottoCodes, following ethnolinguistic criteria: the data is filtered following the indication of geneticists, linguists, cultural anthropologists and historians. The dataset provides elaborated summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.

    CLICS3 - Database of Cross-Linguistic Colexifications

    The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.