Here's a simple little script to query PubMed for a Digital Object Identifier (a DOI).
Usage is quite simple: find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:
CODE:
lurch:~ python pythonquery.py 10.1038/nature02029
... and via the magic of web services and XML, and with a bit of luck, you'll get something like this back:
CODE:
Language-tree divergence times support the Anatolian theory of Indo-European origin.

Gray, RD, Atkinson, QD
Nature 2003, 426 (6965):435-9

Languages, like genes, provide vital clues about human history. The origin of
the Indo-European language family is "the most intensively studied, yet still
most recalcitrant, problem of historical linguistics". Numerous genetic studies
of Indo-European origins have also produced inconclusive results. Here we
analyse linguistic data using computational methods derived from evolutionary
biology. We test two theories of Indo-European origin: the 'Kurgan expansion'
and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible
archaeological evidence for an expansion into Europe and the Near East by
Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian
theory claims that Indo-European languages expanded with the spread of
agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement
with the Anatolian hypothesis, our analysis of a matrix of 87 languages with
2,449 lexical items produced an estimated age range for the initial Indo-European
divergence of between 7,800 and 9,800 years bp. These results were robust to
changes in coding procedures, calibration points, rooting of the trees and priors
in the bayesian analysis.
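Roughly, the "magic" is two calls to NCBI's E-utilities: esearch turns the DOI into a PubMed ID, and efetch then returns the record for that ID as XML. Here is a minimal sketch of the first step, written for Python 3 (where `urlencode` lives in `urllib.parse`) and runnable without network access. The helper names `build_esearch_url` and `first_pubmed_id`, the `SAMPLE_REPLY` string and the ID inside it are all invented for illustration, and the `https://` scheme is a substitution for the `http://` URL used in the script.

```python
# Python 3 sketch of the DOI -> PubMed ID lookup (no network access needed).
from urllib.parse import urlencode
from xml.dom import minidom

# Assumption: same esearch endpoint as the script, but over https.
ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?'


def build_esearch_url(doi, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery'):
    """Build the esearch URL that asks PubMed for the ID matching a DOI."""
    params = {
        'db': 'pubmed',
        'tool': tool,
        'email': email,
        'term': doi,
        'usehistory': 'y',
        'retmax': 1,
    }
    return ESEARCH + urlencode(params)


def first_pubmed_id(esearch_xml):
    """Return the first <Id> in an esearch reply, or None if there is none."""
    ids = minidom.parseString(esearch_xml).getElementsByTagName('Id')
    return ids[0].childNodes[0].data if ids else None


# Hand-written stand-in for an esearch reply; the ID is made up.
SAMPLE_REPLY = (
    '<eSearchResult><Count>1</Count>'
    '<IdList><Id>12345678</Id></IdList></eSearchResult>'
)

url = build_esearch_url('10.1038/nature02029')
pmid = first_pubmed_id(SAMPLE_REPLY)
print(url)
print(pmid)
```

Note that `urlencode` escapes the slash in the DOI, so the query carries `term=10.1038%2Fnature02029`. The second step is the same pattern again: swap `term` for `id` and send the request to efetch with `retmode=xml`.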
The Code:
PYTHON:
#!/usr/bin/env python

# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/

import urllib
from xml.dom import minidom


def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):
    """Look up a DOI on PubMed and return the efetch XML for the first match"""
    params = {
        'db': database,
        'tool': tool,
        'email': email,
        'term': query,
        'usehistory': 'y',
        'retmax': 1
    }
    # try to resolve the PubMed ID of the DOI
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    # parse XML output from PubMed...
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

    # nothing found, bail out
    # (string exceptions are deprecated, so raise a real exception instead)
    if len(ids) == 0:
        raise ValueError('DoiNotFound: %s' % query)

    # get ID
    id = ids[0].childNodes[0].data

    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones...
    params['id'] = id

    params['retmode'] = 'xml'

    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    return data


def text_output(xml):
    """Makes a simple text output from the XML returned from efetch"""

    xmldoc = minidom.parseString(xml)

    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data

    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data

    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        authorlist.append(author)

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data

    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

    output = []
    output.append(title)
    output.append('')  # empty line
    output.append(', '.join(authorlist))
    output.append('%s %s, %s (%s):%s' % (journal, year, volume, issue, pages))
    output.append('')  # empty line
    output.append(abstract)
    return output


if __name__ == '__main__':
    from sys import argv, exit
    if len(argv) == 1:
        print 'Usage: %s <query>' % argv[0]
        print ' e.g. %s 10.1038/ng1946' % argv[0]
        exit()

    citation = get_citation_from_doi(argv[1])
    for line in text_output(citation):
        print line
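One thing to watch: text_output indexes `[0]` on every `getElementsByTagName` result, so a record that lacks one of those elements (an abstract or an issue number, say) raises an IndexError. Here is a minimal sketch of a more forgiving lookup, in Python 3. The helper name `first_text`, its `default` argument and the trimmed `SAMPLE` record are hypothetical and not part of the script above.

```python
# Python 3 sketch: tolerate missing fields in an efetch record.
from xml.dom import minidom


def first_text(node, tag, default=''):
    """Text of the first <tag> under node, or default if missing or empty."""
    found = node.getElementsByTagName(tag)
    if not found or not found[0].childNodes:
        return default
    return found[0].childNodes[0].data


# Trimmed, hand-written stand-in for an efetch record that has
# no <Issue> and no <AbstractText>.
SAMPLE = (
    '<PubmedArticle><Article>'
    '<Journal><JournalIssue><Volume>426</Volume>'
    '<PubDate><Year>2003</Year></PubDate></JournalIssue>'
    '<Title>Nature</Title></Journal>'
    '<ArticleTitle>An example title.</ArticleTitle>'
    '<Pagination><MedlinePgn>435-9</MedlinePgn></Pagination>'
    '</Article></PubmedArticle>'
)

doc = minidom.parseString(SAMPLE)
issue = first_text(doc, 'Issue', default='?')
abstract = first_text(doc, 'AbstractText', default='(no abstract)')
citation = '%s %s, %s (%s):%s' % (
    first_text(doc, 'Title'), first_text(doc, 'Year'),
    first_text(doc, 'Volume'), issue, first_text(doc, 'MedlinePgn'))
print(citation)
print(abstract)
```

Whether a placeholder such as `?` is better than skipping the field entirely is a matter of taste; the point is only that a missing element no longer stops the script.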
--Simon
Hi Simon,
I wanted to give your little Python script a run, and I'm getting this error:
File "pythonquery.py", line 93
exit()
^
SyntaxError: invalid syntax
I'm using Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Any ideas why it ain't working?
Thanks,
Kambiz