I’ve recently been playing around with the program Mendeley for storing my massive collection of academic papers in PDF format. Mendeley looks to be a really useful bit of software, but at the moment it’s rather horrifically buggy. A major problem I’ve been running into is that it’s quite happy to import duplicate PDFs. This led to much amusement when I set Mendeley to watch my collection of papers and it decided to parse and import all of them again every time it started up. Before long, Mendeley was trying to extract metadata for ~20,000 PDFs…

Cleaning the dupes out isn’t too hard. Here’s how I did it.

  1. Close Mendeley
  2. Find your Mendeley data directory. On the Mac, it’s “~/Library/Application Support/Mendeley Desktop”.
  3. Find the SQLite database in that directory. It’ll be named something like your-email-address@www.mendeley.com.sqlite
  4. Make a backup! (replace “your-email-address\@www.mendeley.com.sqlite” with your database file)

    cp your-email-address\@www.mendeley.com.sqlite backup.sqlite

  5. Access the database:

    sqlite3 your-email-address\@www.mendeley.com.sqlite

  6. You should now be at the SQLite prompt.
  7. At the prompt, type:

    SELECT COUNT(*) as entries, title, year FROM Documents GROUP BY title, year HAVING entries > 1;

    … and you’ll get a list of the entries in your database that have the same paper title and paper year.

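  8. If this looks OK, we can delete the duplicates. This keeps the entry with the highest id for each title and year and removes the rest. Be aware that GROUP BY treats missing (NULL) values as equal, so untitled entries from the same year count as duplicates of each other: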

    DELETE FROM Documents WHERE id NOT IN (SELECT MAX(id) FROM Documents GROUP BY title,year);

  9. Then do a bit of tidying up to reclaim the empty space left by the deleted rows:

    VACUUM;

  10. Restart Mendeley, cross your fingers, and hope it worked!
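
If you’d like to see what the queries in steps 7 and 8 do before letting them loose on your real library, here’s a rough Python sketch that runs them against a throwaway in-memory database. The toy `Documents` table only has the three columns the queries touch (the real one has many more), and the helper names `make_toy_db`, `rows_to_delete` and `delete_duplicates` are my own inventions for illustration, not anything Mendeley provides.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory stand-in for Mendeley's Documents table.

    Assumption: only id, title and year are modelled, because those are
    the only columns the queries in the steps above use.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Documents (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Documents (id, title, year) VALUES (?, ?, ?)",
        [
            (1, "Paper A", 2009),
            (2, "Paper A", 2009),  # duplicate of id 1
            (3, "Paper B", 2010),
            (4, None, 2008),       # import with no title
            (5, None, 2008),       # a second untitled import, same year
        ],
    )
    conn.commit()
    return conn


def rows_to_delete(conn):
    """Preview the rows that the DELETE in step 8 would remove."""
    return conn.execute(
        "SELECT id, title, year FROM Documents "
        "WHERE id NOT IN (SELECT MAX(id) FROM Documents GROUP BY title, year) "
        "ORDER BY id"
    ).fetchall()


def delete_duplicates(conn):
    """Run the DELETE from step 8 and return how many rows it removed."""
    cur = conn.execute(
        "DELETE FROM Documents "
        "WHERE id NOT IN (SELECT MAX(id) FROM Documents GROUP BY title, year)"
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_toy_db()
    for row in rows_to_delete(conn):
        print("would delete:", row)
    print("deleted", delete_duplicates(conn), "rows")
```

The preview is the useful part: in the toy data it flags one of the two untitled entries as well as the real duplicate, because rows with no title and the same year land in the same group. If your own library has lots of untitled imports, it’s worth checking that list before running the DELETE for real.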

If it doesn’t work, or you lose papers you didn’t want to lose, you can copy the backup file (backup.sqlite) over the database file and restart Mendeley again. Hopefully, the Mendeley developers will implement a better way of doing this soon, but until then, use this at your own risk!
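
For the nervous, the backup, delete, vacuum and restore steps can also be bundled into a small script. This is only a sketch under a few assumptions: Mendeley is closed while it runs, the database has a `Documents` table with `id`, `title` and `year` columns as used above, and the function names and the `.backup` suffix are hypothetical choices of mine.

```python
import shutil
import sqlite3
from pathlib import Path

# Same DELETE as in the steps above.
DEDUPE_SQL = (
    "DELETE FROM Documents WHERE id NOT IN "
    "(SELECT MAX(id) FROM Documents GROUP BY title, year)"
)


def dedupe_with_backup(db_path):
    """Copy the database aside, delete duplicates, then VACUUM.

    Returns (backup_path, number_of_rows_deleted).
    """
    db_path = Path(db_path)
    # Hypothetical naming scheme: put the backup next to the original.
    backup_path = db_path.with_name(db_path.name + ".backup")
    shutil.copy2(db_path, backup_path)

    conn = sqlite3.connect(str(db_path))
    try:
        deleted = conn.execute(DEDUPE_SQL).rowcount
        conn.commit()
        # VACUUM cannot run inside a transaction, so it comes after the commit.
        conn.execute("VACUUM")
    finally:
        conn.close()
    return backup_path, deleted


def restore(db_path, backup_path):
    """Put the backup copy back over the database file."""
    shutil.copy2(backup_path, db_path)
```

You would call `dedupe_with_backup` with the path to your own database file, check the result in Mendeley, and call `restore` with the returned backup path if anything looks wrong.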