Dr. Simon J. Greenhill

I study language and cultural evolution, focusing on why and how humans created the remarkable diversity of languages we see today and what they reveal about our shared human prehistory.
Using Bayesian phylogenetic methods, I explore questions such as how the Austronesian peoples navigated and settled the Pacific and how linguistic structures evolve and co-adapt over time. To support this work, I’ve developed large-scale databases that help uncover the patterns and processes driving the evolution of language and culture, offering deeper insights into the story of humanity.
You can find me on Twitter, Mastodon, or Bsky, or at the University of Auckland.
Projected speaker numbers and dormancy risks of Canada’s Indigenous languages.
Boissonneault M, Tallman A, Gast V & Greenhill SJ. 2025. Projected speaker numbers and dormancy risks of Canada’s Indigenous languages. Royal Society Open Science, 12, 241091.
UNESCO launched the International Decade of Indigenous Languages in 2022 to draw attention to the impending loss of nearly half of the world’s linguistic diversity. However, how the speaker numbers and dormancy risks of these languages will evolve remains largely unexplored. Here, we use Canadian census data and probabilistic population projection to estimate changes in speaker numbers and dormancy risks of 27 Indigenous languages. Our model suggests that speaker numbers could, over the period 2001–2101, decline by more than 90% in 16 languages and that dormancy risks could surpass 50% among …
DOI: 10.1098/rsos.241091
Methods in Malayo-Polynesian comparative-historical linguistics.
Ross M, & Greenhill SJ. 2024. Methods in Malayo-Polynesian comparative-historical linguistics. In Adelaar A & Schapper A (Eds) The Oxford Guide to the Malayo-Polynesian Languages of Southeast Asia. Oxford: Oxford University Press.
This chapter considers methodological issues in the classification and subgrouping of Malayo-Polynesian (MP) languages. It compares applications of the traditional comparative method and newer Bayesian phylogenetics in Austronesian historical linguistics, arguing that they present complementary rather than competitive approaches. The comparison can be used to illuminate contentious points in the MP tree. The chapter discusses the application of methods to MP by alluding to the higher-order phylogeny of Austronesian on which most Austronesianist historical linguists agree.
DOI: 10.1093/oso/9780198807353.003.0003
Bayesian phylogenetic analysis of Philippine languages supports a rapid migration of Malayo Polynesian languages.
King B, Greenhill SJ, Reid LA, Ross M, Walworth M, & Gray R. 2024. Bayesian phylogenetic analysis of Philippine languages supports a rapid migration of Malayo Polynesian languages. Scientific Reports, 14, 14967.
The Philippines are central to understanding the expansion of the Austronesian language family from its homeland in Taiwan. It remains unknown to what extent the distribution of Malayo-Polynesian languages has been shaped by back migrations and language leveling events following the initial Out-of-Taiwan expansion. Other aspects of language history, including the effect of language switching from non-Austronesian languages, also remain poorly understood. Here we apply Bayesian phylogenetic methods to a core-vocabulary dataset of Philippine languages. Our analysis strongly supports a sister …
DOI: 10.1038/s41598-024-65810-x
Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships.
Greenhill SJ. In Press. Tentatively tracing Trans‐New Guinea: A phylogenetic evaluation of potential deeper relationships. In Evans N & Fedden S (Eds). The Oxford Guide to the Papuan Languages. Oxford University Press: Oxford.
The Trans‐New Guinea language family is one of the world’s largest language families. Strikingly it is also one of the world’s least studied. There is ongoing debate about which of many languages should be included in Trans‐New Guinea and how these relate to each other. Resolving this debate is hard due to the complexities of studying New Guinea languages, and a lack of adequate data suitable for detailed historical linguistic work. These difficulties have led to suggestions that the only way forward is to wait for low‐level descriptive field‐work and detailed bottom‐up historical …
DOI: 10.31235/osf.io/628cv
Languages of Barrier Islands, Sumatra - Description, History and Typology
The Barrier Islands Languages project is funded by an Australian Research Council Discovery Project grant (DP230102019) titled “Languages of Barrier Islands, Sumatra: Description, History and Typology”.
The project runs from 2023 to 2027 and investigates under-documented and undocumented Austronesian languages of the Barrier Islands, including Mentawai, (Simaluaya) Nias, Semeulue and Sikule, as well as neighbouring Northwest Sumatra languages, such as Simalungun Batak. It will generate new knowledge about the languages, cultures and societies of the region and make it freely available to the public.
The research will uncover past migration patterns in Southeast Asia, advance linguistic theory in areas such as typology and language change, and support the computational modelling of Austronesian for future language technologies.
Grambank - a database of structural (typological) features of language
Grambank is a database of structural (typological) features of language. It consists of 195 logically independent features (most of them binary) spanning all subdomains of morphosyntax. The Grambank feature questionnaire has been filled in, based on reference grammars, for 2,467 languages. The aim is to eventually reach as many as 3,500 languages. The database can be used to investigate deep language prehistory, the geographical distribution of features, language universals and the functional interaction of structural features.
Kinbank - Database of Kinship terminology
Kinbank is a database of kinship terminologies to be used for exploring cross-linguistic diversity in kinship organisation. The database includes 1,229 languages and a set of 100 core kin types, spanning grandparents to grandchildren as well as parents’ siblings and parents’ siblings’ children. A major advantage of Kinbank is its focused language family sampling and its sampling based on occurrence in existing anthropological databases (e.g. d-place.org), which allows us to test the relationship between languages and behaviour. This in turn allows the use of phylogenetic methods to reconstruct the states of proto-kinship, account for common ancestry in models of kinship change, and test for correlated evolution between linguistic and behavioural patterns.
Lexibank - a public repository of standardized wordlists with computed phonological and lexical features
The past decades have seen substantial growth in digital data on the world’s languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, most published datasets lack standardization, which makes their comparison difficult. Here, we present a new approach to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that make these datasets more Findable, Accessible, Interoperable, and Reusable (FAIR). We test the Lexibank workflow on 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2,000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.
GELATO - GEnes and LAnguages TOgether
The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic, cultural and environmental diversity. Population genetic samples are assigned to existing GlottoCodes, following ethnolinguistic criteria: the data is filtered following the indication of geneticists, linguists, cultural anthropologists and historians. The dataset provides elaborated summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.
CLICS3 - Database of Cross-Linguistic Colexifications
The original Database of Cross-Linguistic Colexifications (CLICS) has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.