PhyloGena - An Automated Interactive Phylogenetic Annotation Tool

 

WARNING:

The development of the version of PhyloGena described here was discontinued in 2007.

 

Motivation

The idea was born out of frustration during diatom genome annotation. Like many others before us, we found that simple BLAST searches and pairwise alignments often cannot tell you whether a given ORF is homologous to a known gene. Much better evidence for homology between ORFs is that they are neighbours in a phylogenetic tree. But constructing a useful tree takes time and combines a number of steps: collecting similar sequences with a BLAST search, selecting a subset of the hits, computing a multiple alignment, computing a tree, displaying and manipulating the tree, and frequently going back to the BLAST table or the alignment to remove or add sequences. This becomes very tedious when the BLAST tables are long and there are many ORFs to annotate. A BLAST table can also be hard to read because it contains many cryptic abbreviations, especially for species, and different names for the same gene. Only quite experienced users can tell at a glance which gene from which species was found. (Quickly tell me who is ARATH, EUGRA, or CYPAR!)

 

The birth of PhyloGena

PhyloGena was developed by Kris Hanekamp for his Master's work during 2005 under the supervision of Uta Bohnebeck and Klaus Valentin. Later, Christophe Garnier and Bánk Beszteri made a couple of bug fixes and added new features. Currently it is being improved by Bánk Beszteri and Sascha Zielke (the latter again doing his diploma work on the program development) with additional help from Stephan Frickenhaus.

Short description

PhyloGena is a software tool to facilitate phylogenetic annotation of unknown sequences. It has a user-friendly graphical interface that is meant to be learned intuitively. You can import one, ten, a hundred, or perhaps a thousand (see below) DNA or protein sequences; for each of them the program will search for similar sequences, construct a multiple alignment and then a phylogenetic tree, and show you the results. You can easily manipulate the data sets, add or remove sequences, change parameters etc. This is of great help in identifying the function and phylogenetic affiliation of ORFs and makes annotation of genes or ESTs easier and less error-prone.

Details

The program mimics the steps a human would go through to construct a simple phylogenetic tree for a given ORF. It automatically performs the following steps:

  1. Local BLAST against UniProt/SWISSPROT
  2. Selection of a “meaningful subset” of the BLAST hits (see below)
  3. Construction of a multiple alignment from the selected hits
  4. Construction of a phylogenetic tree from the alignment
  5. Display of the alignment using JalView and of the tree using ATV

Although you can choose among several different selection procedures (and can also implement new ones in Prolog), you can also manually modify the selection the software made and recalculate the multiple alignment and the tree.
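The five steps above, together with the manual override, can be sketched as a small driver function. This is only an illustration in Python, not PhyloGena's actual Java code: `run_pipeline` and its arguments (`search`, `select`, `align`, `build_tree`, `keep`) are hypothetical names, and the toy functions at the bottom merely stand in for the external tools (BLAST, an alignment program, a tree program).

```python
def run_pipeline(query, search, select, align, build_tree, keep=None):
    """Chain the annotation steps; every step is a pluggable callable (hypothetical API)."""
    hits = search(query)                       # step 1: similarity search
    if keep is None:
        chosen = select(hits)                  # step 2: automatic selection rule
    else:
        chosen = [h for h in hits if keep(h)]  # manual choice replaces the rule
    alignment = align([query] + chosen)        # step 3: multiple alignment
    tree = build_tree(alignment)               # step 4: tree construction
    # Step 5 (display) is left to the caller, e.g. an alignment and a tree viewer.
    return {"hits": hits, "chosen": chosen, "alignment": alignment, "tree": tree}


# Toy stand-ins so the sketch runs without any external program.
def toy_search(query):
    return ["hitA", "hitB", "hitC"]


def toy_select(hits):
    return hits[:2]


def toy_align(sequences):
    return list(sequences)


def toy_tree(alignment):
    return "(" + ",".join(alignment) + ");"


automatic = run_pipeline("query", toy_search, toy_select, toy_align, toy_tree)
# Re-run after the user drops one hit by hand; alignment and tree are recomputed.
manual = run_pipeline("query", toy_search, toy_select, toy_align, toy_tree,
                      keep=lambda hit: hit != "hitA")
```

Passing each step as a callable mirrors the idea that selection rules, alignment programs, and tree programs are interchangeable; the real program may organise this quite differently.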

Technical details

  • Database: the program comes with a recent version of UniProt/SwissProt. The user may include other databases, e.g. UniProt/TrEMBL, or create a custom database. Currently, databases in SWISSPROT / EMBL or FASTA format can be used with PhyloGena. However, the non-trivial selection rules (see below) rely on the taxonomic and functional annotation of sequences that can be extracted from UniProt/SWISSPROT, so these rules will not work with other databases. The only selection rule currently available for custom (i.e. non-SWISSPROT) databases is selection of the best X BLAST hits.
  • Similarity search: PhyloGena uses NCBI BLAST to search for sequences similar to the query
  • Selection rules: they are applied to the collection of BLAST hits to extract a meaningful subset of them for further analyses (multiple alignment and tree construction). Currently, four selection rules are implemented:

    1. Selection of the best X BLAST hits
    2. Selection of X hits from each taxon at a user-defined taxonomic level (e.g. kingdom, class, ...)
    3. "Autodepth" selection: selection of as many hits from as many taxa as possible, i.e. choose the deepest possible phylogenetic depth but do not exceed X hits in total and Y hits per taxon
    4. Intelligent branching: find at most X hits showing the best balance between the quality of hits and the phylogenetic depth, with special treatment of some special types of BLAST results

  • Multiple alignments: the selected sequences are aligned using a multiple alignment program. Currently we have implemented ClustalW, POA, DIALIGN, T-Coffee, Kalign and MAFFT; others can follow if required.
  • Trees: Next, a phylogenetic tree is constructed from the multiple alignment. We have implemented NJ and ML from PHYLIP, simple and bootstrapped NJ using QuickTree, and simple or bootstrapped ML using PhyML.
  • JalView: displays the multiple alignment
  • ATV: allows you to look at the tree, zoom in and out, reroot, etc.
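To make the first three selection rules listed above more concrete, here is a minimal Python sketch. It is one possible reading of the rule descriptions, not PhyloGena's own implementation. The `Hit` record, the function names, and the toy hits are all hypothetical, and the sketch assumes each hit carries an E-value and a taxonomic lineage ordered from the broadest rank downwards. The fourth rule ("intelligent branching") is not described in enough detail here to sketch.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Hit(NamedTuple):
    accession: str            # made-up identifier
    evalue: float             # BLAST E-value, lower is better
    lineage: Tuple[str, ...]  # taxonomic lineage, broadest rank first


def best_hits(hits, x):
    """Rule 1: the X hits with the lowest E-values."""
    return sorted(hits, key=lambda h: h.evalue)[:x]


def hits_per_taxon(hits, x, level):
    """Rule 2: up to X best hits from each taxon at lineage index `level`."""
    counts = defaultdict(int)
    selected = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        taxon = hit.lineage[: level + 1]  # a taxon is identified by its path
        if counts[taxon] < x:
            counts[taxon] += 1
            selected.append(hit)
    return selected


def autodepth(hits, x, y):
    """Rule 3 (interpretation): deepest level giving at most X hits, Y per taxon."""
    deepest = max((len(h.lineage) for h in hits), default=0)
    for level in range(deepest - 1, -1, -1):
        selected = hits_per_taxon(hits, y, level)
        if len(selected) <= x:
            return selected
    # No level fits: fall back to the X best hits of the shallowest grouping.
    return best_hits(hits_per_taxon(hits, y, 0), x)


# Made-up hits, for illustration only.
hits = [
    Hit("A1", 1e-50, ("Eukaryota", "Viridiplantae", "Arabidopsis")),
    Hit("A2", 1e-40, ("Eukaryota", "Viridiplantae", "Arabidopsis")),
    Hit("O1", 1e-45, ("Eukaryota", "Viridiplantae", "Oryza")),
    Hit("H1", 1e-30, ("Eukaryota", "Metazoa", "Homo")),
    Hit("E1", 1e-10, ("Bacteria", "Proteobacteria", "Escherichia")),
]

top_two = [h.accession for h in best_hits(hits, 2)]
one_per_kingdom = [h.accession for h in hits_per_taxon(hits, 1, 0)]
auto = [h.accession for h in autodepth(hits, x=3, y=1)]
```

With these toy hits, one hit per genus would give four sequences, which exceeds X = 3, so `autodepth` moves one level up and keeps one hit per second-level taxon instead.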

Special features

  • One can analyse a single DNA or protein sequence (Note: apart from the BLAST search, PhyloGena works with amino acid sequences only. This means you CANNOT use it with RNA-coding sequences!)
  • The program has a modular architecture. On the one hand, it is possible to include custom-made databases or new selection rules without having to touch the Java code. On the other hand, interfaces to other alignment or tree reconstruction programs can be added with minimal effort (of course depending on the input / output formats of these programs) - let us know if you'd like to see a particular addition!
  • The user can interact and make changes at any step. E.g. in case you do not like the tree you can go back to the alignment and remove sequences. The alignment can be exported for more sophisticated phylogenetic analyses. You can also go back to the BLAST table and remove or add hits.



System requirements

PhyloGena was written in Java, i.e., you will need a Java Runtime Environment, version 1.4 or higher, to run it. We have mainly tested it on Windows 2000 and XP and, less extensively, on several Linux systems. You will need around 1 GB of disk space to run the program with UniProt/SWISSPROT (around 5 GB for UniProt/TrEMBL).

We recommend installing PhyloGena on a computer with at least 256 MB of memory; the more the better : )

Download and installation

  • Windows: To install PhyloGena on Windows, download phylogena.zip and follow the instructions provided in the manual (links to both files can be found on the right hand side of the page). At the request of several users, we also provide a "ready-to-go" version, including all external software (NCBI BLAST, ClustalW and PHYLIP) plus UniProt/SWISSPROT zipped together in a file named phylogenaEasy.zip. For the moment, the version of the software found in this file is outdated; if you want to obtain an up-to-date version without having to install SWISSPROT and the tools yourself, as a workaround you could install the easy version first, and then unpack the new phylogena.zip in the same location where you unpacked phylogenaEasy.zip (normally, under C:).
  • LINUX: For LINUX, you'll have to go the hard way: download and install all third-party software and the database(s) you need, and then phylogena.zip itself as described in the manual.
  • Mac: We do not have a Mac version yet - see under drawbacks.
  • The source code of PhyloGena is available under the GPL from the AWI-forge repository.

Citing PhyloGena

A paper describing PhyloGena has been published in Bioinformatics - please cite it if you use the software for your research.

Drawbacks, known problems

  • Currently the output files are VERY large because they contain everything, e.g. all BLAST hits with sequences. Also, they are kept in memory throughout the analysis. Afterwards you can save and reload them. This implies that with 1 GB of memory you can analyse between 200 and 400 ORFs at a time. Sascha Zielke is currently working on a relational database backend to PhyloGena, which should help get rid of these problems.
  • Installation of PhyloGena needs some work at the moment - we will try to provide an installer to ease the process of gathering all required software and database components in the near future.
  • The program comes for free, but without any guarantees of its usefulness or future developments etc.
  • We do not have a Mac version. Although PhyloGena should in principle be easy to port to the Macintosh platform, we don't use Macs and don't know how to port it - contact us if you would be interested in helping us out with this!
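The memory figure in the first item of this list (200 to 400 ORFs per 1 GB) can be turned into a rough planning estimate. The short Python sketch below is only a back-of-the-envelope calculation under assumptions that are not stated above: 1 GB is taken as 1024 MB, memory use is assumed to grow linearly with the number of ORFs, and the baseline footprint of the Java runtime is ignored. The function and constant names are made up for this illustration.

```python
REFERENCE_MEMORY_MB = 1024   # "1 GB", taken here as 1024 MB (assumption)
REFERENCE_ORFS = (200, 400)  # range quoted in the list above


def estimated_orf_capacity(memory_mb):
    """Scale the quoted 200-400 ORF range linearly to another memory size."""
    low, high = REFERENCE_ORFS
    return (memory_mb * low // REFERENCE_MEMORY_MB,
            memory_mb * high // REFERENCE_MEMORY_MB)


# Implied memory footprint per ORF in MB (best case first, worst case second).
per_orf_mb = tuple(REFERENCE_MEMORY_MB / n for n in reversed(REFERENCE_ORFS))

# Estimate for the recommended minimum of 256 MB.
minimum_machine = estimated_orf_capacity(256)
```

Under these assumptions each ORF occupies roughly 2.5 to 5 MB, and a machine with the recommended minimum of 256 MB would handle about 50 to 100 ORFs at a time; real numbers will depend on the length of the BLAST tables.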