Mercator

Multiple Whole-Genome Orthology Map Construction

Introduction

Whole-genome homology maps attempt to identify the evolutionary relationships between and within multiple genomes. The term "syntenic" is often used to describe regions of multiple genomes that are believed to have evolved from the same region in an ancestral genome. However, it has been pointed out that this use of the term is incorrect (Passarge et al. 1999), and thus we will use the terms "homologous", "orthologous", and "paralogous" instead. Ideally, given K genomes, we would like to identify all orthologous genomic regions as well as paralogous regions within each genome and hypothetical ancestral genome. Maps listing these relationships are extremely valuable to researchers performing comparative analyses of genomic sequence. Here we present our initial work in the form of a program called Mercator, which constructs orthology maps between multiple whole genomes.

Methods

Our basic strategy in building homology maps is to use exons that are orthologous in multiple genomes as map "anchors." Given K genomes, the steps in the map construction are as follows:

For each genome, obtain a set of exon annotations. These annotations can be a combination of exon predictions (e.g., Genscan) and annotations that have been experimentally verified (e.g., RefSeq). Ideally, we would like these annotations to be as sensitive as possible. Specificity is not a concern, as incorrect annotations are unlikely to have significant alignments with other gene annotations.
Compare all exons against all exons in other genomes and record significant alignments between exons. Currently, we use BLAT to do this all-vs-all comparison with alignments being performed in protein space.
Construct a graph with each vertex corresponding to an exon and edges between vertices whose corresponding exons have significant alignments.
Identify cliques in this graph. These cliques are potential anchors to be used in the map.
Starting with the largest cliques (those that have exons in all or most of the genomes), join neighboring (adjacent in genomic coordinates, in each genome) cliques to form runs. Smaller cliques that are inconsistent with runs formed by larger cliques are filtered out. After the smallest cliques have been considered, cliques that are not part of a run are discarded.
The extents of each run in each genome are output as orthologous segments. The cliques from each run are used to output the exact genomic coordinates of anchors within each orthologous segment. These anchors can be used by genomic alignment programs (such as MAVID) to do a detailed alignment of each orthologous segment.
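The graph, clique, and run steps above can be illustrated with a small sketch. This is not Mercator's actual code: all function names and the toy data are hypothetical, the clique search uses a basic Bron-Kerbosch enumeration (the text does not say which method Mercator uses), and the run-joining is heavily simplified. It keeps only cliques with exactly one exon in every genome, ignores strand, and does not implement the filtering of smaller cliques against larger ones.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(hits):
    """One vertex per exon, one undirected edge per significant alignment."""
    graph = defaultdict(set)
    for a, b in hits:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(graph), set())
    return cliques

def full_anchors(cliques, exons, n_genomes):
    """Simplification: keep only cliques with exactly one exon per genome."""
    anchors = []
    for clique in cliques:
        by_genome = {exons[e][0]: e for e in clique}
        if len(clique) == n_genomes and len(by_genome) == n_genomes:
            anchors.append(by_genome)
    return anchors

def join_runs(anchors, exons):
    """Chain anchors that are neighbors on the same chromosome in every genome."""
    if not anchors:
        return []
    genomes = sorted(anchors[0])
    rank = {}
    for g in genomes:
        # Order anchors by (chromosome, start) within genome g.
        order = sorted(range(len(anchors)), key=lambda i: exons[anchors[i][g]][1:3])
        rank[g] = {i: (exons[anchors[i][g]][1], r) for r, i in enumerate(order)}
    def adjacent(i, j):
        return all(rank[g][i][0] == rank[g][j][0]
                   and abs(rank[g][i][1] - rank[g][j][1]) == 1
                   for g in genomes)
    runs = []
    # Walk anchors in the order they appear in the first genome.
    for i in sorted(range(len(anchors)), key=lambda i: rank[genomes[0]][i][1]):
        if runs and adjacent(runs[-1][-1], i):
            runs[-1].append(i)
        else:
            runs.append([i])
    return [[anchors[i] for i in run] for run in runs]

def segment_extents(run, exons):
    """Per genome: (chromosome, smallest start, largest end) over the run."""
    extents = {}
    for g in run[0]:
        members = [exons[a[g]] for a in run]
        extents[g] = (members[0][1], min(m[2] for m in members),
                      max(m[3] for m in members))
    return extents

# Toy data (hypothetical): exon id -> (genome, chromosome, start, end).
exons = {
    "a1": ("A", "chr1", 100, 200), "a2": ("A", "chr1", 500, 600),
    "a3": ("A", "chr1", 900, 1000),
    "b1": ("B", "chr4", 50, 150), "b2": ("B", "chr4", 400, 480),
    "b3": ("B", "chr4", 700, 800),
    "c1": ("C", "chr2", 10, 90), "c2": ("C", "chr2", 300, 380),
    "c3": ("C", "chr7", 20, 95),
}
# Three-way hits for exon sets 1, 2, 3, plus one spurious pairwise hit.
hits = [(x + k, y + k) for k in "123" for x, y in combinations("abc", 2)]
hits.append(("a3", "b2"))

cliques = maximal_cliques(build_graph(hits))
anchors = full_anchors(cliques, exons, n_genomes=3)
runs = join_runs(anchors, exons)
for run in runs:
    print(segment_extents(run, exons))
```

In this toy input, the first two anchors are collinear in all three genomes and form one run, while the third anchor lies on a different chromosome in genome C and therefore starts a new run. The spurious pairwise hit forms a two-exon clique that the simplified filter drops.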

Maps/Alignments

Current maps and alignments are found at the Berkeley Comparative Genomics site.
The multiple whole-genome alignment of C. elegans, C. briggsae, and C. remanei for Lall et al. 2006 is available here.
After the sequencing of the rat genome, we constructed three-way human-mouse-rat orthology maps.

Source Code

Mercator is free software under the GNU General Public License and is available as part of Colin Dewey's source code distribution.

Binaries may be made available upon request.

References

Lall, S., Grün, D., Krek, A., Chen, K., Wang, Y., Dewey, C., Sood, P., Colombo, T., Bray, N., MacMenamin, P., Kao, H., Gunsalus, K., Pachter, L., Piano, F., and Rajewsky, N. (2006) A genome-wide map of conserved microRNA targets in C. elegans. Current Biology, in press.
Bray, N. and Pachter, L. (2004) MAVID: Constrained ancestral alignment of multiple sequences. Genome Research, 14:693-699.
Kent, W.J. (2002) BLAT: The BLAST-like alignment tool. Genome Research, 12:656-664.
Passarge, E., Horsthemke, B., and Farber, R.A. (1999) Incorrect use of the term synteny. Nature Genetics, 23:387.
Rat Genome Sequencing Project Consortium. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428:493-521.