Here’s a simple little script to query PubMed for a Digital Object Identifier (DOI) and print the matching citation.

Usage is quite simple: find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:

lurch:~ python 10.1038/nature02029

… and via the magic of webservices and XML, and with a bit of luck, you’ll get something like this back:

Language-tree divergence times support the Anatolian theory of Indo-European origin.

Gray, RD, Atkinson, QD
Nature 2003, 426 (6965):435-9

Languages, like genes, provide vital clues about human history. The origin of
the Indo-European language family is “the most intensively studied, yet still
most recalcitrant, problem of historical linguistics”. Numerous genetic studies
of Indo-European origins have also produced inconclusive results. Here we
analyse linguistic data using computational methods derived from evolutionary
biology. We test two theories of Indo-European origin: the ‘Kurgan expansion’
and the ‘Anatolian farming’ hypotheses. The Kurgan theory centres on possible
archaeological evidence for an expansion into Europe and the Near East by
Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian
theory claims that Indo-European languages expanded with the spread of
agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement
with the Anatolian hypothesis, our analysis of a matrix of 87 languages with
2,449 lexical items produced an estimated age range for the initial Indo-European
divergence of between 7,800 and 9,800 years bp. These results were robust to
changes in coding procedures, calibration points, rooting of the trees and priors
in the bayesian analysis.

The Code:

#!/usr/bin/env python
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007

import urllib
from xml.dom import minidom

def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):
    params = {
        'db': database,
        'tool': tool,
        'email': email,
        'term': query,
    }

    # try to resolve the PubMed ID of the DOI
    url = '' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    # parse XML output from PubMed...
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

    # nothing found, exit
    if len(ids) == 0:
        raise Exception("DoiNotFound")

    # get ID
    id = ids[0].childNodes[0].data

    # remove unwanted parameters
    # and add new ones...
    params.pop('term', None)
    params['id'] = id

    params['retmode'] = 'xml'

    # get citation info:
    url = '' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    return data

def text_output(xml):
    """Makes a simple text output from the XML returned from efetch"""

    xmldoc = minidom.parseString(xml)

    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data

    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data

    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        # collect each formatted name so the join below has something to print
        authorlist.append(author)

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data

    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

    output = []
    output.append(title)
    output.append('')  # empty line
    output.append(', '.join(authorlist))
    output.append('%s %s, %s (%s):%s' % (journal, year, volume, issue, pages))
    output.append('')  # empty line
    output.append(abstract)
    return output

if __name__ == '__main__':
    from sys import argv, exit
    if len(argv) == 1:
        print 'Usage: %s <doi>' % argv[0]
        print ' e.g. %s 10.1038/ng1946' % argv[0]
        exit()

    citation = get_citation_from_doi(argv[1])
    for line in text_output(citation):
        print line
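
The listing above is Python 2. On Python 3, urlencode and urlopen live in urllib.parse and urllib.request, so the same two-step lookup (DOI to PubMed ID, then PubMed ID to citation XML) might look roughly like the sketch below. It is a minimal sketch rather than the original script: the endpoint URLs are an assumption (the standard NCBI E-utilities esearch and efetch addresses), and build_url, first_pubmed_id, fetch_citation_xml and the default tool name are hypothetical names introduced here for illustration.

```python
import urllib.parse
import urllib.request
from xml.dom import minidom

# Assumption: the standard NCBI E-utilities endpoints.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
ESEARCH_URL = EUTILS + "esearch.fcgi"
EFETCH_URL = EUTILS + "efetch.fcgi"


def build_url(base, **params):
    """Append URL-encoded query parameters to an endpoint."""
    return base + "?" + urllib.parse.urlencode(params)


def first_pubmed_id(esearch_xml):
    """Return the first <Id> in an esearch response, or None if there is none."""
    ids = minidom.parseString(esearch_xml).getElementsByTagName("Id")
    if not ids or ids[0].firstChild is None:
        return None
    return ids[0].firstChild.data.strip()


def fetch_citation_xml(doi, email, tool="doi-lookup-sketch"):
    """Resolve a DOI to a PubMed ID, then fetch that record as XML."""
    common = {"db": "pubmed", "tool": tool, "email": email}

    # Step 1: search PubMed using the DOI as the query term.
    search_url = build_url(ESEARCH_URL, term=doi, **common)
    with urllib.request.urlopen(search_url) as response:
        pmid = first_pubmed_id(response.read())
    if pmid is None:
        raise LookupError("no PubMed record found for DOI %s" % doi)

    # Step 2: fetch the full record for that ID.
    fetch_url = build_url(EFETCH_URL, id=pmid, retmode="xml", **common)
    with urllib.request.urlopen(fetch_url) as response:
        return response.read()
```

Calling fetch_citation_xml('10.1038/nature02029', email='you@example.org') should, network and service permitting, return XML of the kind text_output expects. Raising LookupError instead of a bare Exception lets a caller tell "DOI not in PubMed" apart from network failures.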