r8spermute

r8spermute – Runs all permutations of r8s constraints.

© Simon J. Greenhill, 2009

This is a small program to assess how robust divergence times are based on the calibrations given to the program r8s.

The program r8s by Mike Sanderson (http://loco.biosci.arizona.edu/r8s/) [1, 2] performs a number of phylogenetic dating methods on a set of phylogenetic trees. In essence, you plug in a number of calibration points from known historical evidence and the program r8s takes your estimated trees, and smoothes” the rates of change observed on those trees, using the calibrations to convert the branches into time.

However, it is often useful to explore the effect that certain calibration has on the various date estimates. For example, in the paper 3, I wanted to estimate the age at which a large language family originated. We had about 10 different calibration points based on historical evidence. Some of these calibrations were, however, more controversial than the others. So, I wrote this program to take a tree (or set of trees) and analyse them under all the combinations of these calibrations. This allowed us to assess the relative effects on the date estimates of the different calibrations.

If you find this program useful, please cite reference 3.

Contact me if you have any problems.

Download:

You can download r8spermute from the bitbucket repository here: download r8spermute

Usage:

You will need two things: 1) A set of trees, 2) A r8s command block. Examples of both of these are provided in the “examples” directory.

Your r8s block file will look something like this:

begin rates;
    blformat lengths=persite nsites=1000 ultrametric=no round=yes;
    collapse;
    mrca node1 Daisy Fiona;
    constrain taxon=node1 min_age=1.8 max_age=2.5;

    mrca node2 Robert Tamara;
    constrain taxon=node2 min_age=1.1 max_age=1.3;

    unfixage taxon=Simon;
    constrain taxon=Simon min_age=0.7 max_age=1.2;

    set num_restarts=5;
    set smoothing=10;
    divtime method=pl algorithm=tn;
    showage shownamed = yes;
    profile taxon=node1 parameter=age;
    profile taxon=node2 parameter=age;
    describe plot=chrono_description;
end;

– thus, we have 3 calibrations: node1, node2, and a constraint on the terminal taxon “Simon”.

We can then run r8spermute like this:

r8spermute.py example/example.trees example/example.r8s

This will start r8s running and you should then see some output like this:

Constraints found: 3
Constraints ignored: 0
Total Constraint Combinations: 7
Running analysis 1 of 7 (0.14%)
  
Staging in: /var/folders/9Z/9Z3O2T2o2RmgRU+F75TSx++++TQ/-Tmp-/tmpXn1lw9
  
Constraints:
  
+ constrain taxon=node1 min_age=1.8 max_age=2.5;
  
Logging in: example/example_100.log
  
r8s version 1.71 (compiled May 16 2006)
  
[-reading file /var/folders/9Z/9Z3O2T2o2RmgRU+F75TSx++++TQ/-Tmp-/tmpXn1lw9]
  
Elapsed time: 4.28s (0.07 minutes)
  
Average run-time so far: 4.28s (0.07 minutes)
  
Estimated run-time left: 25.68s (0.43 minutes)
  

…etc.

This tells us that we’ve got one constraint turned on (+ constrain taxon=node1 min_age=1.8 max_age=2.5;)’ in this sub-analysis, as well as various other run-time statistics. And now we wait for this to finish. It could take some time..

When the analysis has finished, we use the program r8spermute_results.py to parse the results:

python r8spermute_results.py example/

This will loop over the log files and extract the relevant information. You probably want to output this information to a file, so run it like this:

python r8spermute_results.py example/ > results.txt

Now, we can look at the results, which will look something like this:

node1 node2 Simon node1 node2
0 0 1 15.22 2.84
0 1 0 54.60 1.30
0 1 1 49.45 1.30
1 0 0 2.50 4.38
1 0 1 2.50 4.18
1 1 0 2.50 1.30
1 1 1 2.50 1.30

It’s probably easiest to load this into a spreadsheet. Each row is an analysis. The first 3 columns (containing 1’s and 0’s) are our constraints and whether they’re turned on (=1) or off (=0) in the analysis. The last two columns “node1” and “node2” are our estimated ages of those nodes, under the calibrations.

So, the first line shows that node1 is estimated to be 15.22 years old and node2 is 2.84 years old when only the calibration called “Simon” is used.

The second line shows that when only node2 is calibrated, the estimate for node1 and node2 is 54.60 and 1.30 respectively.

In contrast, the very last line shows that when all calibrations are turned on, then the age estimates are 2.50 and 1.30 respectively.

References:

1 Sanderson, M. J. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol. Biol. Evol. 14:1218-1231.

2 Sanderson, M. J. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol. Biol. Evol. 19:101-109.

3 Gray, R.D., Drummond, A.J., & Greenhill, S.J. (2009) Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement. Science, 323: 479-483.

genbank-download

genbank-download

genbank-download is a small script to download nucleotide sequences from genbank using an accession number.

Usage:


> python genbankdownload.py [options] ACCESSION-NUMBER
> 
> **Example Usage**
> python genbankdownload.py J01415
> python genbankdownload.py J01415 > mysequence.xml
> python genbankdownload.py -m fasta J01415 > mysequence.fasta 
> 

The program source-code is available here or you can run:

> easy_install genbank-download

Simple webserver file alteration monitoring using integrit

Intrusion detection?

Over on Mezzoblue, Dave Shea found out that his website had been compromised subtly. The attacker had exploited some (as yet unknown) security hole and quietly modified his website to link to the standard spam sites.

Within a few hours there were tens of posts from people who’d checked their websites and found similar modifications that had been sitting their un-noticed, with people pointing the finger either at old WordPress installations or a guesses that their hosting service had been compromised.

This shows us once again, that any software you run on your website needs to be kept up-to-date immediately, but what shocked me was that so many people out there running websites and are not watching them for file changes. They had no idea that their sites had been hacked until they went and looked for it.

So – in bold: Anyone running a website or webserver of any type needs to watch out for unexpected access and changes.

The easiest way to do this is to use some intrusion detection software (IDS). This sounds complex, but it’s actually quite easy to do. All these programs do is to monitor your files and warn you when they change. This would have immediately spotted this type of attack.

Because of this, I’ve decided to write up an easy guide to simple file alteration monitoring – here it is.

Choose your weapon:

There are plenty of intrusion detection/file modification apps out there – some of the better known ones include AIDE, Samhain and Tripwire. These are all very cool, and highly powerful, but are also quite complex and hard to install, especially on cheap shared hosting.

Therefore, my weapon of choice, for the last few months has been a lightweight and fast application called integrit, so I’m going to tell you how to install it here.

Before we start: Do make sure that you’re not compromised right now, there’s no point running an IDS if you’re already hacked. While you’re at it, make sure everything’s upgraded too.

Step 1: Make a place to store integrit:

Since you’re on shared hosting, you can’t install integrit properly into /usr, but you need to to put it somewhere anyway.

I decided to install it to a directory called “integrit” inside my home dir, so:

mkdir ~/integrit

In the following commands, do remember to replace any mention of ~/integrit with the directory you used.

Step 2: Download and install integrit:

The integrit webpage is at: http://integrit.sourceforge.net/, so go there and get the latest version (currently 4.1), or you could cut’n’ paste this:

cd ~/integrit
wget http://optusnet.dl.sourceforge.net/sourceforge/integrit/integrit-4.1.tar.gz

Once the integrit archive file is in your ~/integrit directory, we need to decompress it and install it:

cd ~/integrit
tar -zxvf integrit-4.1.tar.gz
cd integrit-4.1
./configure
make

Next we need to put the integrit binary somewhere where we can get it, here I’ve just dumped it into the ~/integrit directory, but you could put it in ~/bin or something nicer if you want:

cp integrit ~/integrit

Step 3: Set up integrit:

Look in the integrit-4.1/examples directory and make a config file from the example.

You need three things at least:

  1. The known file database – this is where the integrit database is stored.
  2. The current file database – this is where integrit stores the modified info.
  3. A root directory to monitor – this is the full path to the directory we want to watch.

We can also tell integrit to ignore directories, by listing with an exclamation mark at the start of the line. We want to ignore the ~/integrit directory, and on dreamhost, we’ll need to ignore the webserver log directory (because it changes a lot, and parts of it our user can’t access which will cause errors).

All in all, it’ll look something like this:

# database locations (FULL PATHS!)
known=/home/simon/integrit/src_known.cdb
current=/home/simon/integrit/src_current.cdb

# What do we want to check (no trailing slash!)
root=/home/simon
# ignore the integrit dir:
!/home/simon/integrit
# ignore the webserver logs dir:
!/home/simon/logs
# oh, and the bash logfile
/home/simon/.bash_history

Note: You’ll need to change “simon” to your user name, and “integrit” to where you installed integrit to in step 2.

Step 4: Create integrit database

We need to get integrit to store a list of the current files, and their vital statistics, so run this command:

~/integrit/integrit -C home.conf -u

– where “home.conf” is the integrit configuration file that you generated in step 3.

If all goes well, you’ll see something like this:

integrit: -- integrit, version 4.1 -------
integrit: output : human-readable
integrit: conf file : home.conf
integrit: known db : /home/simon/integrit/src_known.cdb
integrit: current db : /home/simon/integrit/src_current.cdb
integrit: root : /home/simon
integrit: do check : no
integrit: do update : yes
integrit: current-state db RMD160 ------
integrit: 3d6b135343a5031d357b5bb2d7d7dc39c7ab5646 /home/simon8/integrit/src_current.cdb

Once that’s done, copy the newly created database to the known database location:

cp ~/integrit/src_current.cdb ~/integrit/src_known.cdb

Step 5: Test that integrit’s working:

So lets make sure that integrit’s working properly. To do this, we can add an empty file somewhere and see if integrit spots it.

cd ~
touch foo

Now we can run integrit:

~/integrit/integrit -C ~/integrit/home.conf -c

If all goes well, you’ll see something like this:

integrit: -- integrit, version 4.1 -------
integrit: output : human-readable
integrit: conf file : /home/simon8/integrit/home.conf
integrit: known db : /home/simon8/integrit/src_known.cdb
integrit: current db : /home/simon8/integrit/src_current.cdb
integrit: root : /home/simon8
integrit: do check : yes
integrit: do update : no
new: /home/simon8/foo p(664) t(100000) u(767504) g(203016) z(0) m(20070605-162156)
new: /home/simon8/foo s(9c1185a5c5e9fc54612808977ee8f548b2258d31)
integrit: not doing update, so no check for missing files

Notice how integrit’s spotted the foo file that’s not in the database? If one of the files has changed, you’ll get much the same output, with “changed:” instead of “new:”. So – remove the dummy file:

rm ~/foo

Step 7: Get integrit to run daily:

Now, we want to set up a cron job, so that integrit is run automatically for us. First of all, we should make a quick little shell script to run integrit and email the results to us:

#!/bin/bash
DATE=`/bin/date "+%F"`
/home/simon/integrit/integrit -C /home/simon/integrit/home.conf -cu | /usr/bin/mutt -s "integrit -- $DATE" email@example.com

Paste the above into a file called “run_integrit.sh” (a good place to put it would be in your ~/integrit directory), edit the paths to match your setup, and change the email address. Finally, make this file executable:

chmod +x ~/integrit/run_integrit.sh

Now all we have to do is to add that to our crontab.

crontab -e

– and add a line that looks something like this:

59 21 * * * /home/simon/integrit/run_integrit.sh

This will run integrit at 21.59 every day, if you don’t know what that means, then have a google for “crontab tutorial”.

Save the file, and you’re off.

Updating the database after valid changes:

When you’ve changed or added a file yourself, then you’ll need to update your known database with these changes. To do this, just generate a current database, and copy it over the old one. The script I’ve got above will automatically generate a current one, so you can just use that version, or repeat Step 4.

Final considerations:

Note: If you can, you should run integrit (that is BOTH the database files and the binary files) off a “safe” partition, that’s not writable. Unfortunately, most of us on shared hosting don’t have that privilege, so just be aware that if a really clever attacker does get you, then they’re likely to disable or modify the IDS if they can.

A good way of dealing with this is to copy your known file database off the webserver and make sure that the one on the server matches this one every so often.

–Simon

Query PubMed for citation information using a DOI and Python

Here’s a simple little script to query PubMed for a Digital Object Identifier (a DOI)

Usage is quite simple, find a DOI somewhere, e.g. 10.1038/nature02029 (for this groundbreaking paper), and run this:

lurch:~ python pythonquery.py 10.1038/nature02029

– and via the magic of webservices and XML, and with a bit of luck, you’ll get something like this back:

Language-tree divergence times support the Anatolian theory of Indo-European origin.

Gray, RD, Atkinson, QD
  
Nature 2003, 426 (6965):435-9

Languages, like genes, provide vital clues about human history. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics". Numerous genetic studiesof Indo-European origins have also produced inconclusive results. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the 'Kurgan expansion' and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East byKurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000-9,500 years bp. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years bp. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the Bayesian analysis.

The Code:


#!/usr/bin/env python
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/

import urllib
from xml.dom import minidom

def get_citation_from_doi(query, email='YOUR EMAIL GOES HERE', tool='SimonsPythonQuery', database='pubmed'):

    params = {
        'db':database,
        'tool':tool,
        'email':email,
        'term':query,
        'usehistory':'y',
        'retmax':1
    }

    # try to resolve the PubMed ID of the DOI

    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)

    data = urllib.urlopen(url).read()
    # parse XML output from PubMed…
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

    # nothing found, exit
    if len(ids) == 0:
        raise Exception('DoiNotFound')

    # get ID
    id = ids[0].childNodes[0].data
    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones:
    params['id'] = id
    params['retmode'] = 'xml'
    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    return urllib.urlopen(url).read()

def text_output(xml):
    """
    Makes a simple text output from the XML returned from efetch
    """
    xmldoc = minidom.parseString(xml)
    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data
    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data
    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data
        author = '%s, %s' % (LastName, Initials)
        authorlist.append(author)

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    volume = journalinfo.getElementsByTagName('Volume')[0].childNodes[0].data
    issue = journalinfo.getElementsByTagName('Issue')[0].childNodes[0].data
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    # this is a bit odd?
    pages = xmldoc.getElementsByTagName('MedlinePgn')[0].childNodes[0].data

    output = []
    output.append(title)
    output.append("") #empty line
    output.append(', '.join(authorlist))
    output.append( '%s %s, %s (%s):%s' % (journal, year, volume, issue, pages) )
    output.append("") #empty line
    output.append(abstract)
    return output

if __name__ == '__main__':
      
from sys import argv, exit
      
if len(argv) == 1:
    print('Usage: %s <query>' % argv[0])
    print(' e.g. %s 10.1038/ng1946' % argv[0])
    exit()

citation = get_citation_from_doi(argv[1])
for line in text_output(citation):
    print line

benchmarking .htaccess performance

Introduction:

One of the common tips to increasing Apache performance is to turn off the per-directory configuration files (aka .htaccess files) and merge them all into your main Apache server configuration file (httpd.conf).

Jeremy raised an interesting question about when the performance loss caused by using many htaccess files is offset by the ease of maintenance. He’s arguing – and I agree – that it makes sense to keep the configuration locally inside .htaccess files, despite the performance loss as these are easier to maintain.

It’s fairly logical that the multiple .htaccess file route will be slower – for every node in the request URI, the webserver has to look for an .htaccess file and merge the rules found in every one. So, we’re going to have to have a filesystem seek’n’read for every subdirectory.

However, is this a major issue? How much of a performance hit is there? Let’s find out:

Set-up:

Ok. Let’s make two docroots each with the same structure and files.

  1. htdocs_access – the .htaccess version. This has one .htaccess file in the leaf directory.

  2. htdocs_config – the httpd.conf version. This has the same rule as the above, but the rule is in the server-wide httpd.conf file and htaccess support is turned OFF (AllowOverride None).

Next, we need to get the .htaccess/httpd.conf files to do something ( mainly so we can see if Apache’s merged them in ). So, we’ll make a number of files in the last random directory (the leaf node), and give half of them the extension .foo, and the other half .bar. We’ll then tell Apache to process the .bar’s with PHP, and the .foo’s as text. All files will have the same content:

Here’s the (python) code I used to generate this structure:

import os

# where we'll place the generated structure
  
staging = '/Users/simon/server'
htdocs_access = os.path.join(staging, 'htdocs_access')
htdocs_config = os.path.join(staging, 'htdocs_config')

# how deep to go!
dir_depth = 10

# how many files in the leaf node of the dir.
num_files = 50

# what content to put in the files
content = ""

# the actual htaccess file
htaccess = ""
AddHandler application/x-httpd-php .bar
""  
# make directory structure
  
dir = ""
for dirnum in range(0, dir_depth):
    dir = os.path.join( dir, str( dirnum ) )
    hta = os.path.join( htdocs_access, dir )
    htc = os.path.join( htdocs_config, dir )
    os.makedirs( hta )
    os.makedirs( htc )

    # make the files:
    for filenum in range( 0, num_files ):
        # assign the file types -- half .foo, and half .bar
        if filenum % 2 == 0:
            filename = '%d.foo' % filenum
        else:
            filename = '%d.bar' % filenum
        f = open( os.path.join( hta, filename ), 'w+' )
        f.write( content )
        f.close()

        f = open( os.path.join( htc, filename ), 'w+' )
        f.write( content )
        f.close()

    # now, add the .htaccess file inside the lead htdocs_access dir
    f = open( os.path.join( hta, '.htaccess' ), 'w+' )
    f.write( htaccess )
    f.close()

# and we'll place it in the root of the htdocs_config dir as
# httpd.conf to remind ourselves to add it to the httpd.conf file
  
f = open( os.path.join( htdocs_config, 'httpd.conf' ), 'w+' )
f.write( htaccess )
f.close()

Here’s what we end up with:

0/
1/
2/
3/
4/
5/
6/
7/
8/
9/
0.foo
1.bar
10.foo
11.bar
(...etc...)
6.foo
7.bar
8.foo
9.bar

Where htdocs_access has a .htaccess file in 9/ and htdocs_config doesn’t.

Server Configuration:

Here are the two httpd.conf files for the configurations:

htdocs_config httpd.conf:

ServerRoot "/usr/local/apache2"
PidFile logs/httpd.pid
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
DirectoryIndex index.html
AccessFileName .htaccess
HostnameLookups Off
AcceptMutex fcntl
StartServers 5
MinSpareServers 5
MaxSpareServers 5
MaxClients 100
MaxRequestsPerChild 10

User nobody
Group #-1
DocumentRoot "/Users/simon/server/htdocs_config"
LoadModule php5_module modules/libphp5.so
Listen 8111
<directory>
    Options Indexes FollowSymLinks
    AllowOverride None
    AddHandler application/x-httpd-php .bar
</directory>

htdocs_access httpd.conf:

ServerRoot "/usr/local/apache2"
PidFile logs/httpd.pid
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
DirectoryIndex index.html
AccessFileName .htaccess
HostnameLookups Off
AcceptMutex fcntl
StartServers 5
MinSpareServers 5
MaxSpareServers 5
MaxClients 100
MaxRequestsPerChild 10

### Section 2: 'Main' server configuration
  
User nobody
Group #-1
DocumentRoot "/Users/simon/server/htdocs_access"
LoadModule php5_module modules/libphp5.so
Listen 8111
<directory>
    Options Indexes FollowSymLinks
    AllowOverride All
</directory>

Results:

Benchmarking was done with “ab” the Apache Benchmark program, which was set to access one page 1,000 times with 10 concurrencies. Each configuration was benchmarked five times in random order (to minimise the effect of any running background processes etc).

<th colspan="2">
  htdocs_config
</th>

<th colspan="2">
  htdocs_access
</th>
<th>
  Time Taken (s):
</th>

<th>
  Requests per Second:
</th>

<th>
  Time Taken (s):
</th>

<th>
  Requests per Second:
</th>
<td>
  12.683213
</td>

<td>
  78.84
</td>

<td>
  13.21618
</td>

<td>
  75.66
</td>
<td>
  12.854491
</td>

<td>
  77.79
</td>

<td>
  13.574916
</td>

<td>
  73.67
</td>
<td>
  11.777676
</td>

<td>
  84.91
</td>

<td>
  13.163296
</td>

<td>
  75.97
</td>
<td>
  13.668398
</td>

<td>
  73.16
</td>

<td>
  12.26475
</td>

<td>
  81.53
</td>
<td>
  13.76753
</td>

<td>
  76.47
</td>

<td>
  13.264527
</td>

<td>
  75.39
</td>
<th>
  12.9
</th>

<th>
  78.23
</th>

<th>
  13.1
</th>

<th>
  76.4
</th>

So – we’re looking at a difference of around 2.3% extra requests per second when htaccess files are disabled. This is really quite trivial, and should only be worried about when you’re really loaded.

Issues:

There are a number of areas where this could be improved:

  • Try different directory depths i.e. the more nested the directory is, the slower it should be under the .htaccess scenario. In contrast, if there’s only 2 or 3 levels then it should be faster.
  • Have multiple .htaccess files in the intermediate nodes to see how Apache handles the merging of these files. Here we’ve just used one .htaccess file, and we should probably see further slowdowns if Apache has to merge some complicated rule sets.
  • Access different files – I just requested one file repeatedly, so we might be getting a lot of interference from any caching systems (harddrive, ram, php caches etc) that I forgot about. Additionally, requesting multiple URI’s is a more realistic test case for a webserver.

Processing mod_security audit logs with logwatch

This is a quick and dirty LogWatch script to process mod_security audit logs. I had a google, and couldn’t find any released scripts, so here’s my one.

#!/usr/bin/perl -w
my $Detail = $ENV{'LOGWATCH_DETAIL_LEVEL'} || 0;
while (defined($ThisLine = <stdin>)) {
    chomp($ThisLine);
    if ( $ThisLine =~ m/^Request:/ )
    {
        print $ThisLine."\n";
    }
    if( $ThisLine =~ m/^mod_security-message/ )
    {
        print $ThisLine."\n";
        print "\n";
    }
    </stdin>
}
exit(0);

On Debian, you should just be able to drop the above script into /usr/share/logwatch/scripts/services (chmod +x too!), and then add this to /etc/logwatch/conf/logfiles/mod_security.conf:

–Simon

process a url query string in PHP

PHP Function to parse a url and extract its arguments.

  
  
function process_url( $url ) {
    $processed_url = parse_url( $url );
    $query_string = ;
    # split into arguments and values
    $query_string = explode( '&', $processed_url[ 'query' ]);
    $args = array( ); // return array
    foreach( $query_string as $chunk ) {
        $chunk = explode('=', $chunk );
        # it's only really worth keeping if the parameter has an argument.
        if ( count( $chunk ) == 2 ) {
            list( $key, $val ) = $chunk;
            $args[ $key ] = urldecode( $val );
        }
    }
    return $args;
}

$url = 'http://www.google.co.nz/search?q=simon+rocks!&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official';

$result = process_url( $url );

print_r( $result );

Will result in something like this:

  
Array 
(
    [q] => simon rocks!
    [start] => 0  
    [ie] => utf-8
    [oe] => utf-8
    [client] => firefox-a
    [rls] => org.mozilla:en-US:official
)
  

Protecting MySQL from SQL Injection Attacks with PHP.

This is intended as a brief guide to protecting your MySQL database from SQL injection attacks. Unfortunately, a large amount of the code that I’ve seen written by people on forums, and in countless crappy PHP tutorials lurking around on the net, and in the many websites that display the magic breeding slashed-quote (''' see below) show that many people just do not understand what’s going on and how to protect themselves against SQL injection attacks.

In fact, the only reason that many websites are ‘protected’ is due to magic quotes, and given that this is due to be disabled in the forthcoming PHP6, then there’s going to be some major problems cropping up.

I’ll talk about the problem of SQL injection, the half-hearted attempt to fix it with these ‘magic quotes’, and what you should really be doing EVERY TIME you send user inputted data to your database.

The Problem What is SQL Injection:

As the name suggests, SQL Injection is quite simply, when the user injects SQL into your application. How does this happen? Say we have a nice simple login form that takes a username and password, and checks if that’s in the database. If it is, the user is logged into an admin section or something. The code for this could look something like this:

 
// user and password come from a simple POST'd form
  
$user = $_POST[ 'user' ];
$password = $_POST[ 'password' ];

$query = "SELECT name, age, credit_card FROM usertable WHERE username = '$user' AND password = '$password' ";

$result = mysql_query( $query );

// check if mysql found anything, and get the record if it did
  
if ( mysql_num_rows( $result ) > 0 ) {
    $data = mysql_fetch_assoc( $result );
    echo 'Hello '.$user.'!';
    echo 'Your credit card number is '.$data[ 'credit_card' ].'';
} else {
    echo 'Incorrect Username or Password! Go Away!';
}

Ok. This works, BUT it’s about as safe as juggling with scalpels. If I enter “simon” as my username, and “secret” as my password, then the query that goes to MySQL looks like this:

SELECT name, age, credit_card FROM usertable WHERE username = 'simon' AND password = 'secret'

and I get logged in quite happily. Fantastic.

The problem comes when I start entering other characters. Let’s say that the next user who tries to login is Peter O’Reilly. Naturally he’ll want a username something like PeterO’Reilly. If we plug that into our query we get this:

SELECT name, age, credit_card FROM usertable WHERE username = 'PeterO'Reilly' AND password = 'secret' 

MySQL blasts along quite happily and hits username='PeterO’, and then it gets this “Reilly” thing which it doesn’t know what to do with and this happens:

ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Reilly' and password="secret"; " at line 1

Nice. We have a broken website, Pete can’t login, and he’s left with misgivings about our web programming skills.

Even worse ' what happens if I enter my password as this?

' or 1=1 ; --

The query that gets sent to MySQL will look like this:

SELECT name, age, credit_card FROM usertable WHERE username = 'simon' AND password = " or 1=1 ; " '

What does this mean? It tells MySQL to find all rows with a username equal to “simon” and a password equal to an empty string OR “1=1”. To represent that a bit more logically:

( username = “simon” and password = "” ) || ( 1 = 1 )

Now, 1=1 is always going to be true, so this is equal to:

( false ) || ( true )

Which means that ALL the records in the table will get returned. Our login processer above is going to log me on with someone else’s credentials – in fact, those of the first record returned.

Keep in mind, however, that we don’t need to escape numbers, and we shouldn’t put quote marks around them (it’s not standard SQL)- if a variable is a number, then it’ll be fine.

The crap attempt to fix it ( magic quotes ):

How to fix this? We need to be escape these quote characters ( both single and double quotes, as well as backslashes). This is done by putting a slash in front of them, e.g. so a ' becomes a ', and MySQL can work out that that quote mark is “protected” by the slash, and is part of the value and ignores it. So, Peter’s attempt to login becomes:

SELECT name, age, credit_card FROM usertable WHERE username = 'PeterO\'Reilly' AND password = 'secret';

and my attempt to hack my way in becomes:

  
SELECT name, age, credit_card FROM usertable WHERE username = 'simon' AND password = '\' or 1=1 ; -- ';
  

MySQL now thinks my password is the string

' or 1=1 ;

and I won’t be able to login.

So where do we get these slashes? Since around PHP version 3.06, PHP tries to do this for you, with a setting called “magic_quotes”. What this does is to automatically add slashes to anything coming in via HTTP get or post requests and via cookies. You can also do this manually using the addslashes() function.

But do not rely on this!

  • It could be turned off or on, or not present in your version (and it won’t be in PHP6). Therefore YOU CANNOT RELY ON IT, AND HAVE TO HANDLE THIS YOURSELF.
  • This means that if you rely on magic quotes, then your code is not portable. Unfortunately, escaping things with a slash is one of those irritating non-standard MySQL features. Most other databases which follow the SQL standards (like Postgres), escape things with another single quote ( O’Reilly => O"Reilly ).
  • It’s crappy. It doesn’t work well with extended characters, and these can be used to get around the slashes. See Chris Shiflett’s discussion of this problem
  • It’s irritating ' it pollutes your data with stuff that the user didn’t enter. This is the major cause of the magic breeding slashed-quote.

(Aside) The Magic Breeding Slashed-Quote:

I’m sure you’ve all seen websites that have this really annoying habit of messing up their user's post's quote marks ( just like that ). I’m calling this the magic breeding slashed quote, because these things propagate like crazy. What’s happening here is that the hard working web developer is adding slashes to the data they send to their database ' great! BUT, they’re not checking for magic quotes, so PHP is escaping the ' once to ', and then when the website runs addslashes() again, php sees a backslash AND a single quote which need escaping ( remember that the three characters that get escaped are , ‘, and " ). This therefore becomes \'. MySQL comes along and sees an escaped backslash AND an escaped single quote.

Here’s what’s happening:

  1. user input: O’Reilly
  2. magic quotes: O'Reilly
  3. addslashes: O\'Reilly
  4. MySQL processes this to: O'Reilly

and we end up with O'Reilly stored when we really want O’Reilly.

I’ve actually seen applications which quite happily store O'Reilly, and stripslashes() before they display the data ' this is just blindingly stupid.

Fixing it.

So, we need a way of escaping data that isn’t crappy, isn’t as prone to character set issues, and isn’t prone to magic breeding slashed quotes.

There are two ways to do this.

  • Use better slashes ( PHP4, old mysql client library using the mysql_* functions )
  • Use a better technique ' bound parameters ( PHP5 with the new mysqli_* client library)

Using better slashes ' mysql_real_escape_string( ) (PHP4, mysql_* )

If you’re using the old mysql client library ( i.e. the mysql_* functions ), then you have to use the hideously named mysql_real_escape_string() function. This takes into account the character set of the database connection and should handle things appropriately.

Note: mysql_real_escape_string needs an active database connection, or anything sent to it will disappear ( WTF? ), or it will generate an error.

BUT we still need to check for the evil magic_quotes setting, which and remove it. We can do this with the get_magic_quotes_gpc() function ( “gpc” refers to Get, Post, and Cookies which magic quotes operates on ).

So ' something like this:

// remove the pesky slashes from magic quotes if it's turned on
  
function clean_string( $value, $DB ) {
    if ( get_magic_quotes_gpc() ) {
        $value = stripslashes( $value );
    }
    // escape things properly
    return mysql_real_escape_string( $value, $DB );
}

$string = "O'Reilly";
// where $db is your active database connection resource id.
$safe_string = clean_string( $string, $db );

There’s a function described in the PHP manual called quote_smart, that handles this and handles both strings and integers:

// Quote variable to make safe
function quote_smart($value) {
    if ( get_magic_quotes_gpc() ) {
        $value = stripslashes( $value );
    }
    // Quote if not a number or a numeric string
    if ( !is_numeric( $value ) ) {
        $value = "'" . mysql_real_escape_string($value) . "'";
    }
    return $value;
}

Note that you’ll need to implement this yourself, and you’ll have to rewrite your queries to not have quotes in them e.g.:


$variable = "O'Reilly";
  
$variable = quote_smart( $variable );
  
// note that we haven't surrounded $variable with quote marks in
  
// the query below since quote_smart does that for us.
  
$query = "SELECT x, y, z FROM tablename WHERE user = $variable";
  

However, this leaving quote marks out of the query irritates me enough, that I generally just type-cast anything which should be a number to a number:

function clean_int($i) {
    if ( is_numeric( $i ) ) {
        return ( int ) $i;
    }
    // return False if we don't get a number
    else {
        return False;
    }
}

Warning:

This is NOT foolproof. In fact, if the attacker can change the character set on the fly, then this whole system can be avoided. Ilia Alshanetsky has an excellent write up on this.

Fixing it with better technique ' bound parameters ( PHP5, MySQLi ):

So ' the best solution? Use bound parameters. To use these you’ll need to be using the improved mysqli library that comes with PHP5. This technique differs slightly in that you define a query “template” first with placeholders, and then “bind” the parameters to it, and the mysqli library takes care of the appropriate escaping for us:

  
$variable = "O'Reilly";
  
// prepare the query
  
$query = $mysqli->prepare( "SELECT x, y, z FROM tablename WHERE user = ?" );

// bind a parameter ' here the first parameter is a short string that specifies the type that the
  
// subsequent arguments should be:
  
// 's' means a string
  
// 'd' means a double
  
// 'i' means an integer
  
// 'b' is a blob
  
$query->bind_param( 's', $variable );

// execute query:
  
$query->execute( );

// so if we had a more complex query, which updated the user info // with "favorite_color" (a string), "age" ( an integer ) and
  
// "description", a blob:

$query = $mysqli->prepare( "UPDATE tablename SET favorite_color = ?, age = ?, description = ? WHERE user = ?" );
  
// we would have a bind looking like this:
  
$query->bind_param( 'sibs', 'red', 27, $some_blob, $variable );
  
$query->execute();
  

Another benefit of this method is that it’s faster to transfer data to the db server. Harrison Fisk has a good discussion of these here.

Another thing to keep in mind:

Now, properly using mysql_real_escape_string or prepared statements should keep you pretty safe, but there are a few characters you might also want to watch out for:

The Percentage Sign (%)

The percentage symbol is commonly used by MySQL to perform LIKE queries ' this WON’T get escaped. If your application is doing LIKE comparisons, and your database is large, then it’s worth checking for this specifically to avoid a friendly user entering “%” and making your database grind to a halt ' e.g.

  
$user_input = '%';
  
$query = "SELECT x,y,z FROM tablename WHERE user LIKE '%$user_input%';
  
// becomes LIKE %%% -> and returns all rows in tablename.
  

Edit: 15th August, 2006 '

James Laver has written a nice lightweight database access class for MySQLi which takes care of the binding of parameters for you.

Edit: 9th November, 2006 '

Fixing a link, thanks Peter 🙂

-Simon

All Posts by Category or Tags.