Here’s a simple little script to query PubMed for a Digital Object Identifier (a DOI).
Usage is quite simple: find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:
[code]
lurch:~ python pythonquery.py 10.1038/nature02029
[/code]
… and via the magic of web services and XML, and with a bit of luck, you’ll get something like this back:
[code]
Language-tree divergence times support the Anatolian theory of Indo-European origin.
Gray, RD, Atkinson, QD
Nature 2003, 426 (6965):435-9
Languages, like genes, provide vital clues about human history. The origin of
the Indo-European language family is “the most intensively studied, yet still
most recalcitrant, problem of historical linguistics”. Numerous genetic studies
of Indo-European origins have also produced inconclusive results. Here we
analyse linguistic data using computational methods derived from evolutionary
biology. We test two theories of Indo-European origin: the ‘Kurgan expansion’
and the ‘Anatolian farming’ hypotheses. The Kurgan theory centres on possible
archaeological evidence for an expansion into Europe and the Near East by
Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian
theory claims that Indo-European languages expanded with the spread of
agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement
with the Anatolian hypothesis, our analysis of a matrix of 87 languages with
2,449 lexical items produced an estimated age range for the initial Indo-European
divergence of between 7,800 and 9,800 years bp. These results were robust to
changes in coding procedures, calibration points, rooting of the trees and priors
in the bayesian analysis.
[/code]
The Code:
[python]
#!/usr/bin/env python
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/
# Written for Python 2 (print statement, urllib.urlencode).
import urllib
from xml.dom import minidom


def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):
    """Look up a DOI in PubMed and return the efetch XML for the first hit"""
    params = {
        'db': database,
        'tool': tool,
        'email': email,
        'term': query,
        'usehistory': 'y',
        'retmax': 1
    }
    # try to resolve the PubMed ID of the DOI
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()
    # parse XML output from PubMed...
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')
    # nothing found, exit
    if len(ids) == 0:
        raise Exception("DoiNotFound")
    # get ID
    id = ids[0].childNodes[0].data
    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones...
    params['id'] = id
    params['retmode'] = 'xml'
    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()
    return data


def text_output(xml):
    """Makes a simple text output from the XML returned from efetch"""
    xmldoc = minidom.parseString(xml)
    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data
    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data
    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        authorlist.append(author)
    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data
    output = []
    output.append(title)
    output.append('')  # empty line
    output.append(', '.join(authorlist))
    output.append('%s %s, %s (%s):%s' % (journal, year, volume, issue, pages))
    output.append('')  # empty line
    output.append(abstract)
    return output


if __name__ == '__main__':
    from sys import argv, exit
    if len(argv) == 1:
        print 'Usage: %s <DOI>' % argv[0]
        print ' e.g. %s 10.1038/ng1946' % argv[0]
        exit()
    citation = get_citation_from_doi(argv[1])
    for line in text_output(citation):
        print line
[/python]
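The script above is Python 2. For anyone on Python 3, here is a rough sketch of how the lookup step might be ported. It is not part of the original script: the helper names (`build_esearch_url`, `first_text`, `parse_first_id`) are made up for illustration, and the sample XML is a hand-written stand-in for an esearch reply, not real PubMed output. The `first_text` helper also returns a default when a tag is missing, which is where the "bit of luck" comes in, since the original indexes `[0]` and fails if a record lacks a field.

```python
# Python 3 sketch of the DOI -> PubMed ID lookup (assumption: not the
# author's code; helper names below are hypothetical).
from urllib.parse import urlencode
from xml.dom import minidom

# Base URL copied from the script above.
EUTILS = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_esearch_url(doi, email='YOUR EMAIL GOES HERE',
                      tool='SimonsPythonQuery', database='pubmed'):
    """Build the esearch URL that resolves a DOI to a PubMed ID."""
    params = {
        'db': database,
        'tool': tool,
        'email': email,
        'term': doi,
        'usehistory': 'y',
        'retmax': 1,
    }
    return EUTILS + 'esearch.fcgi?' + urlencode(params)


def first_text(node, tag, default=''):
    """Return the text of the first <tag> under node, or default if absent."""
    matches = node.getElementsByTagName(tag)
    if not matches or not matches[0].childNodes:
        return default
    return matches[0].childNodes[0].data


def parse_first_id(xml):
    """Return the first <Id> in an esearch reply, or None if there is none."""
    doc = minidom.parseString(xml)
    return first_text(doc, 'Id', default=None)


if __name__ == '__main__':
    # Hand-written sample reply; the ID is a placeholder, not a real PMID.
    sample = '<eSearchResult><IdList><Id>12345</Id></IdList></eSearchResult>'
    print(build_esearch_url('10.1038/nature02029'))
    print(parse_first_id(sample))
```

To actually hit the network you would pass the URL to `urllib.request.urlopen(url).read()` and feed the result to `parse_first_id`. Keeping the URL building and XML parsing separate from the fetch makes both easy to check without a connection.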
–Simon
Hi Simon,
I wanted to give your little Python script a run, and I’m getting this error:
File "pythonquery.py", line 93
exit()
^
SyntaxError: invalid syntax
I’m using Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Any ideas why it ain’t working?
Thanks,
Kambiz
Hi
Can you send me the code? There is no indentation in the code on the web page, so in some cases I have to guess where to indent.
Cheers
Paulo