Here’s a simple little script to query PubMed for a Digital Object Identifier (DOI).
Usage is quite simple: find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:
    lurch:~ python pythonquery.py 10.1038/nature02029
… and via the magic of web services and XML, and with a bit of luck, you’ll get something like this back:
    Language-tree divergence times support the Anatolian theory of Indo-European origin.
    Gray, RD, Atkinson, QD
    Nature 2003, 426 (6965):435-9
    Languages, like genes, provide vital clues about human history. The origin of
    the Indo-European language family is "the most intensively studied, yet still
    most recalcitrant, problem of historical linguistics". Numerous genetic studies
    of Indo-European origins have also produced inconclusive results. Here we
    analyse linguistic data using computational methods derived from evolutionary
    biology. We test two theories of Indo-European origin: the 'Kurgan expansion'
    and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible
    archaeological evidence for an expansion into Europe and the Near East by
    Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian
    theory claims that Indo-European languages expanded with the spread of
    agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement
    with the Anatolian hypothesis, our analysis of a matrix of 87 languages with
    2,449 lexical items produced an estimated age range for the initial Indo-European
    divergence of between 7,800 and 9,800 years bp. These results were robust to
    changes in coding procedures, calibration points, rooting of the trees and priors
    in the bayesian analysis.
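For the curious, the web-service part boils down to two GET requests against NCBI's E-utilities: esearch turns the DOI into a PubMed ID, and efetch turns that ID into a citation record. Here is a minimal sketch of how the two URLs are put together, written for Python 3 (the script itself is Python 2). The helper names `esearch_url` and `efetch_url` are made up for illustration, the HTTPS form of the endpoint is an assumption on my part, and the ID `12345` is a placeholder rather than a real PubMed ID.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (HTTPS form assumed here).
EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def esearch_url(doi, email, tool='SimonsPythonQuery', database='pubmed'):
    """Step 1: URL that searches PubMed for the DOI (hypothetical helper)."""
    params = {'db': database, 'tool': tool, 'email': email,
              'term': doi, 'retmax': 1}
    return EUTILS + 'esearch.fcgi?' + urlencode(params)


def efetch_url(pubmed_id, email, tool='SimonsPythonQuery', database='pubmed'):
    """Step 2: URL that fetches the citation as XML (hypothetical helper)."""
    params = {'db': database, 'tool': tool, 'email': email,
              'id': pubmed_id, 'retmode': 'xml'}
    return EUTILS + 'efetch.fcgi?' + urlencode(params)


if __name__ == '__main__':
    # Only prints the URLs; nothing is sent over the network.
    print(esearch_url('10.1038/nature02029', 'you@example.com'))
    print(efetch_url('12345', 'you@example.com'))  # placeholder ID
```

`urlencode` takes care of escaping, so the slash in the DOI and the `@` in the email address are percent-encoded for you.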
The Code:
    #!/usr/bin/env python
    # Simple script to query pubmed for a DOI
    # (c) Simon Greenhill, 2007
    # http://simon.net.nz/

    import urllib
    from xml.dom import minidom

    def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):
        params = {
            'db':database,
            'tool':tool,
            'email':email,
            'term':query,
            'usehistory':'y',
            'retmax':1
        }
        # try to resolve the PubMed ID of the DOI
        url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)
        data = urllib.urlopen(url).read()

        # parse XML output from PubMed...
        xmldoc = minidom.parseString(data)
        ids = xmldoc.getElementsByTagName('Id')

        # nothing found, exit
        if len(ids) == 0:
            raise Exception("DoiNotFound")

        # get ID
        id = ids[0].childNodes[0].data

        # remove unwanted parameters
        params.pop('term')
        params.pop('usehistory')
        params.pop('retmax')
        # and add new ones...
        params['id'] = id
        params['retmode'] = 'xml'

        # get citation info:
        url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
        data = urllib.urlopen(url).read()

        return data

    def text_output(xml):
        """Makes a simple text output from the XML returned from efetch"""

        xmldoc = minidom.parseString(xml)

        title = xmldoc.getElementsByTagName('ArticleTitle')[0]
        title = title.childNodes[0].data

        abstract = xmldoc.getElementsByTagName('AbstractText')[0]
        abstract = abstract.childNodes[0].data

        authors = xmldoc.getElementsByTagName('AuthorList')[0]
        authors = authors.getElementsByTagName('Author')
        authorlist = []
        for author in authors:
            LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
            Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
            author = '%s, %s' % (LastName, Initials)
            authorlist.append(author)

        journalinfo = xmldoc.getElementsByTagName('Journal')[0]
        journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
        journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
        volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
        issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
        year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data

        # this is a bit odd?
        pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

        output = []
        output.append(title)
        output.append('') #empty line
        output.append(', '.join(authorlist))
        output.append( '%s %s, %s (%s):%s' % (journal, year, volume, issue, pages) )
        output.append('') #empty line
        output.append(abstract)
        return output

    if __name__ == '__main__':
        from sys import argv, exit
        if len(argv) == 1:
            print 'Usage: %s <query>' % argv[0]
            print ' e.g. %s 10.1038/ng1946' % argv[0]
            exit()
        citation = get_citation_from_doi(argv[1])
        for line in text_output(citation):
            print line
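One caveat: text_output assumes every tag it looks up is present and non-empty, so a record with no Issue or no abstract dies with an IndexError. Below is a hedged sketch (Python 3) of a more forgiving lookup. `first_text` is a hypothetical helper name, and the sample XML is a hand-made fragment that only mimics a few of the tags the script reads; it is not real PubMed output.

```python
from xml.dom import minidom


def first_text(node, tag, default=''):
    """Return the text of the first <tag> under node, or default if absent or empty."""
    found = node.getElementsByTagName(tag)
    if not found or found[0].firstChild is None:
        return default
    # Join all text-node children in case the text is split over several nodes.
    return ''.join(child.data for child in found[0].childNodes
                   if child.nodeType == child.TEXT_NODE)


# Hand-made fragment for illustration only: note there is no <Issue>
# and no <AbstractText> in it.
SAMPLE = """<PubmedArticle>
  <Journal>
    <JournalIssue><Volume>426</Volume><PubDate><Year>2003</Year></PubDate></JournalIssue>
    <Title>Nature</Title>
  </Journal>
  <ArticleTitle>An example title.</ArticleTitle>
</PubmedArticle>"""

doc = minidom.parseString(SAMPLE)
print(first_text(doc, 'ArticleTitle'))          # tag present: its text
print(first_text(doc, 'Issue', default='?'))    # tag missing: the default
print(repr(first_text(doc, 'AbstractText')))    # tag missing: empty string
```

Swapping the `getElementsByTagName(...)[0].childNodes[0].data` chains for a helper like this would let the script print a partial citation instead of crashing on sparse records.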
–Simon
Hi Simon,
I wanted to give your little Python script a run, and I’m getting this error:
File "pythonquery.py", line 93
exit()
^
SyntaxError: invalid syntax
I’m using Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Any ideas why it ain’t working?
Thanks,
Kambiz
Hi
Can you send me the code? There is no indentation in the webpage code, so in some cases I have to guess where to indent.
Cheers
Paulo