Query PubMed for citation information using a DOI and Python

Here’s a simple little script to query PubMed for a Digital Object Identifier (a DOI)

Usage is quite simple, find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:

lurch:~ python pythonquery.py 10.1038/nature02029

– and via the magic of webservices and XML, and with a bit of luck, you’ll get something like this back:

Language-tree divergence times support the Anatolian theory of Indo-European origin.

Gray, RD, Atkinson, QD
  
Nature 2003, 426 (6965):435-9

Languages, like genes, provide vital clues about human history. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics". Numerous genetic studiesof Indo-European origins have also produced inconclusive results. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the 'Kurgan expansion' and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East byKurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years bp. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the Bayesian analysis.

The Code:


#!/usr/bin/env python
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/

import urllib
from xml.dom import minidom

def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):

    params = {
        'db':database,
        'tool':tool,
        'email':email,
        'term':query,
        'usehistory':'y',
        'retmax':1
    }

    # try to resolve the PubMed ID of the DOI

    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)

    data = urllib.urlopen(url).read()
    # parse XML output from PubMed…
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

    # nothing found, exit
    if len(ids) == 0:
        raise Exception('DoiNotFound')

    # get ID
    id = ids[0].childNodes[0].data
    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones:
    params['id'] = id
    params['retmode'] = 'xml'
    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    return urllib.urlopen(url).read()

def text_output(xml):
    """
    Makes a simple text output from the XML returned from efetch
    """
    xmldoc = minidom.parseString(xml)
    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data
    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data
    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        authorlist.append(author)

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

    output = []
    output.append(title)
    output.append("") #empty line
    output.append(', '.join(authorlist))
    output.append( '%s %s, %s (%s):%s' % (journal, year, volume, issue, pages) )
    output.append("") #empty line
    output.append(abstract)
    return output

if __name__ == '__main__':
      
from sys import argv, exit
      
if len(argv) == 1:
    print('Usage: %s <query>' % argv[0])
    print(' e.g. %s 10.1038/ng1946' % argv[0])
    exit()

citation = get_citation_from_doi(argv[1])
for line in text_output(citation):
    print line