r8spermute

r8spermute – Runs all permutations of r8s constraints.

© Simon J. Greenhill, 2009

This is a small program to assess how robust divergence times are based on the calibrations given to the program r8s.

The program r8s by Mike Sanderson (http://loco.biosci.arizona.edu/r8s/) [1, 2] performs a number of phylogenetic dating methods on a set of phylogenetic trees. In essence, you plug in a number of calibration points from known historical evidence and the program r8s takes your estimated trees, and smoothes" the rates of change observed on those trees, using the calibrations to convert the branches into time.

However, it is often useful to explore the effect that certain calibration has on the various date estimates. For example, in the paper 3, I wanted to estimate the age at which a large language family originated. We had about 10 different calibration points based on historical evidence. Some of these calibrations were, however, more controversial than the others. So, I wrote this program to take a tree (or set of trees) and analyse them under all the combinations of these calibrations. This allowed us to assess the relative effects on the date estimates of the different calibrations.

If you find this program useful, please cite reference 3.

Contact me if you have any problems.

Download:

You can download r8spermute from the bitbucket repository here: download r8spermute

Usage:

You will need two things: 1) A set of trees, 2) A r8s command block. Examples of both of these are provided in the “examples” directory.

Your r8s block file will look something like this:

begin rates;
    blformat lengths=persite nsites=1000 ultrametric=no round=yes;
    collapse;
    mrca node1 Daisy Fiona;
    constrain taxon=node1 min_age=1.8 max_age=2.5;

    mrca node2 Robert Tamara;
    constrain taxon=node2 min_age=1.1 max_age=1.3;

    unfixage taxon=Simon;
    constrain taxon=Simon min_age=0.7 max_age=1.2;

    set num_restarts=5;
    set smoothing=10;
    divtime method=pl algorithm=tn;
    showage shownamed = yes;
    profile taxon=node1 parameter=age;
    profile taxon=node2 parameter=age;
    describe plot=chrono_description;
end;

– thus, we have 3 calibrations: node1, node2, and a constraint on the terminal taxon “Simon”.

We can then run r8spermute like this:

r8spermute.py example/example.trees example/example.r8s

This will start r8s running and you should then see some output like this:

Constraints found: 3
Constraints ignored: 0
Total Constraint Combinations: 7
Running analysis 1 of 7 (0.14%)
  
Staging in: /var/folders/9Z/9Z3O2T2o2RmgRU+F75TSx++++TQ/-Tmp-/tmpXn1lw9
  
Constraints:
  
+ constrain taxon=node1 min_age=1.8 max_age=2.5;
  
Logging in: example/example_100.log
  
r8s version 1.71 (compiled May 16 2006)
  
[-reading file /var/folders/9Z/9Z3O2T2o2RmgRU+F75TSx++++TQ/-Tmp-/tmpXn1lw9]
  
Elapsed time: 4.28s (0.07 minutes)
  
Average run-time so far: 4.28s (0.07 minutes)
  
Estimated run-time left: 25.68s (0.43 minutes)
  

…etc.

This tells us that we’ve got one constraint turned on (+ constrain taxon=node1 min_age=1.8 max_age=2.5;)' in this sub-analysis, as well as various other run-time statistics. And now we wait for this to finish. It could take some time..

When the analysis has finished, we use the program r8spermute_results.py to parse the results:

python r8spermute_results.py example/

This will loop over the log files and extract the relevant information. You probably want to output this information to a file, so run it like this:

python r8spermute_results.py example/ > results.txt

Now, we can look at the results, which will look something like this:

node1 node2 Simon node1 node2
0 0 1 15.22 2.84
0 1 0 54.60 1.30
0 1 1 49.45 1.30
1 0 0 2.50 4.38
1 0 1 2.50 4.18
1 1 0 2.50 1.30
1 1 1 2.50 1.30

It’s probably easiest to load this into a spreadsheet. Each row is an analysis. The first 3 columns (containing 1’s and 0’s) are our constraints and whether they’re turned on (=1) or off (=0) in the analysis. The last two columns “node1” and “node2” are our estimated ages of those nodes, under the calibrations.

So, the first line shows that node1 is estimated to be 15.22 years old and node2 is 2.84 years old when only the calibration called “Simon” is used.

The second line shows that when only node2 is calibrated, the estimate for node1 and node2 is 54.60 and 1.30 respectively.

In contrast, the very last line shows that when all calibrations are turned on, then the age estimates are 2.50 and 1.30 respectively.

References:

1 Sanderson, M. J. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol. Biol. Evol. 14:1218-1231.

2 Sanderson, M. J. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol. Biol. Evol. 19:101-109.

3 Gray, R.D., Drummond, A.J., & Greenhill, S.J. (2009) Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement. Science, 323: 479-483.

Query PubMed for citation information using a DOI and Python

Here’s a simple little script to query PubMed for a Digital Object Identifier (a DOI)

Usage is quite simple, find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:

lurch:~ python pythonquery.py 10.1038/nature02029

– and via the magic of webservices and XML, and with a bit of luck, you’ll get something like this back:

Language-tree divergence times support the Anatolian theory of Indo-European origin.

Gray, RD, Atkinson, QD
  
Nature 2003, 426 (6965):435-9

Languages, like genes, provide vital clues about human history. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics". Numerous genetic studiesof Indo-European origins have also produced inconclusive results. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the 'Kurgan expansion' and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East byKurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years bp. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the Bayesian analysis.

The Code:


#!/usr/bin/env python
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/

import urllib
from xml.dom import minidom

def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):

    params = {
        'db':database,
        'tool':tool,
        'email':email,
        'term':query,
        'usehistory':'y',
        'retmax':1
    }

    # try to resolve the PubMed ID of the DOI

    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)

    data = urllib.urlopen(url).read()
    # parse XML output from PubMed…
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

    # nothing found, exit
    if len(ids) == 0:
        raise Exception('DoiNotFound')

    # get ID
    id = ids[0].childNodes[0].data
    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones:
    params['id'] = id
    params['retmode'] = 'xml'
    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    return urllib.urlopen(url).read()

def text_output(xml):
    """
    Makes a simple text output from the XML returned from efetch
    """
    xmldoc = minidom.parseString(xml)
    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data
    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data
    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        authorlist.append(author)

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

    output = []
    output.append(title)
    output.append("") #empty line
    output.append(', '.join(authorlist))
    output.append( '%s %s, %s (%s):%s' % (journal, year, volume, issue, pages) )
    output.append("") #empty line
    output.append(abstract)
    return output

if __name__ == '__main__':
      
from sys import argv, exit
      
if len(argv) == 1:
    print('Usage: %s <query>' % argv[0])
    print(' e.g. %s 10.1038/ng1946' % argv[0])
    exit()

citation = get_citation_from_doi(argv[1])
for line in text_output(citation):
    print line
All Posts by Category or Tags.