Whole-Genome Homology Maps


Introduction

Whole-genome homology maps attempt to identify the evolutionary relationships between and within multiple genomes. The term "syntenic" is often used to describe regions of multiple genomes that are believed to have evolved from the same region in an ancestral genome. However, it has been pointed out that this use of the term is incorrect (Passarge et al. 1999) and thus we will use the terms "homologous", "orthologous", and "paralogous" instead. Ideally, given K genomes, we would like to identify all orthologous genomic regions as well as paralogous regions within each genome and hypothetical ancestral genome. Maps listing these relationships are extremely valuable to researchers performing comparative analyses of genomic sequence. Presented here is initial work on creating an orthology map for the human, mouse, and rat genomes.

Methods

Our basic strategy in building homology maps is to use exons that are orthologous in multiple genomes as map "anchors." Given K genomes, the steps in the map construction are as follows:

For each genome, obtain a set of exon annotations. These annotations can combine exon predictions (e.g. Genscan) with annotations that have been experimentally verified (e.g. RefSeq). Ideally, the annotations should be as sensitive as possible. Specificity is less of a concern, as incorrect annotations are not likely to have significant alignments with annotations in the other genomes.
Compare all exons against all exons in other genomes and record significant alignments between exons. Currently, we use BLAT to do this all-vs-all comparison with alignments being performed in protein space.
Construct a graph with each vertex corresponding to an exon and edges between vertices whose corresponding exons have significant alignments.
Identify cliques in this graph. These cliques are potential anchors to be used in the map.
Join neighboring (adjacent in genomic coordinates, in each genome) cliques to form runs. Cliques that are not part of a run are discarded.
The extents of each run in each genome are output as orthologous segments. The cliques from each run are used to output the exact genomic coordinates of anchors within each orthologous segment. These anchors can be used by genomic alignment programs (such as MAVID) to do a detailed alignment of each orthologous segment.
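
The graph construction and clique identification steps can be sketched as follows. This is a minimal illustration, not the code used to build the maps: the exon identifiers and the list of significant alignments are hypothetical, and since the procedure for finding cliques is not specified above, the sketch assumes the basic Bron-Kerbosch recursion for enumerating maximal cliques.

```python
from collections import defaultdict


def build_graph(hits):
    """One vertex per exon, one undirected edge per significant alignment."""
    graph = defaultdict(set)
    for a, b in hits:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting).

    Assumption: the text does not say which clique algorithm is used.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-explored vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical exon identifiers of the form "genome:exon".
hits = [
    ("human:e1", "mouse:e1"),
    ("human:e1", "rat:e1"),
    ("mouse:e1", "rat:e1"),
    ("human:e2", "mouse:e2"),
]
graph = build_graph(hits)
cliques = maximal_cliques(graph)
for clique in sorted(cliques, key=sorted):
    print(sorted(clique))
```

In this toy input, the three e1 exons align to each other in all pairs and so form one clique, while the e2 exons form a clique of size two because no rat exon aligns to them.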

Downloads

Lines in the map files are of the form:

 [Segment #] [Chrom] [Start] [End] [Strand] ...

where the last 4 fields are repeated for each genome in the map. The fields are tab-delimited. For coordinates on the reverse strand "-", the start coordinate is greater than the end coordinate. Coordinates are 0-based and half-open (the larger of Start and End is one more than the coordinate of the last base included in the segment). Pieces for which no orthologous region could be identified in one of the genomes have "NA" in the fields for the appropriate genomes. The order of the genomes in each line is given by the order of the genomes in the name of the map file.
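
As an illustration of the format described above, a map line could be read along the following lines. This is a sketch under stated assumptions: the names `Segment`, `parse_map_line`, and `forward_interval` are hypothetical, the example line is made up, and the segment number is kept as a string.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_map_line(line):
    """Split a tab-delimited map line into a segment id and per-genome segments.

    A genome with "NA" fields (no orthologous region found) yields None.
    """
    fields = line.rstrip("\n").split("\t")
    segment_id, per_genome = fields[0], fields[1:]
    if len(per_genome) % 4 != 0:
        raise ValueError("expected 4 fields per genome after the segment number")
    segments = []
    for i in range(0, len(per_genome), 4):
        chrom, start, end, strand = per_genome[i:i + 4]
        if "NA" in (chrom, start, end, strand):
            segments.append(None)
        else:
            segments.append(Segment(chrom, int(start), int(end), strand))
    return segment_id, segments


def forward_interval(seg):
    """Return (low, high) as a 0-based, half-open forward-strand interval.

    On the reverse strand Start > End, so the two values are reordered.
    """
    return min(seg.start, seg.end), max(seg.start, seg.end)


# Hypothetical line for a three-genome map; the third genome has no ortholog.
example = "12\tchr1\t100\t250\t+\tchr4\t900\t700\t-\tNA\tNA\tNA\tNA\n"
segment_id, segments = parse_map_line(example)
print(segment_id, segments)
print(forward_interval(segments[1]))  # (700, 900)
```

Because the coordinates are half-open, the length of a segment is simply the difference between the two values returned by `forward_interval`.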

Lines in the anchor files are of the form:

 [Segment #] [Genome1] [Chrom1] [Strand1] [Start1] [End1] [Genome2] [Chrom2] [Strand2] [Start2] [End2]

Each line represents an anchor between two of the genomes (Genome1 & Genome2) in a certain map segment (Segment #). The coordinate conventions are the same as for those in the map file.
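
An anchor line could be read in a similar way. The sketch below assumes that anchor files are tab-delimited like the map files, which is not stated explicitly above; `AnchorEnd` and `parse_anchor_line` are hypothetical names, and the example line is made up.

```python
from typing import NamedTuple


class AnchorEnd(NamedTuple):
    genome: str
    chrom: str
    strand: str
    start: int
    end: int


def parse_anchor_line(line):
    """Split an anchor line into a segment id and the two anchored regions.

    Assumption: fields are tab-delimited, as in the map files.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 fields in an anchor line")
    segment_id = fields[0]
    ends = []
    for offset in (1, 6):  # Genome1 block starts at field 1, Genome2 at field 6
        genome, chrom, strand, start, end = fields[offset:offset + 5]
        ends.append(AnchorEnd(genome, chrom, strand, int(start), int(end)))
    return segment_id, ends[0], ends[1]


# Hypothetical anchor between a human exon and a reverse-strand mouse exon.
example = "12\thuman\tchr1\t+\t120\t180\tmouse\tchr4\t-\t860\t800\n"
anchor_segment, first, second = parse_anchor_line(example)
print(anchor_segment, first, second)
```

Note that the field order within each genome block differs from the map file: here the strand precedes the coordinates.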

Human (NCBI build 33, Apr. 2003), Mouse (MGSC v30, Feb. 2003), Rat (RGSC v3.1, Jun. 2003) Map ftp://ftp.biostat.wisc.edu//pub/cdewey/data/hiv_recombination.tar.gz (Updated 08/20/2003)
- map
- anchors
Human (NCBI build 34, Jul. 2003), Mouse (MGSC v30, Feb. 2003), Rat (RGSC v3.1, Jun. 2003) Map (Updated 09/12/2003)
- map
- anchors
Older maps constructed using a different method can be found here.

Acknowledgements

The map construction was done by Colin Dewey, with assistance from Lior Pachter.

References

Bray, N. and Pachter, L. 2004. MAVID: Constrained ancestral alignment of multiple sequences. Genome Research, 14:693-699.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Research, 12:656-664.
Passarge, E., Horsthemke, B., and Farber, R.A. 1999. Incorrect use of the term synteny. Nature Genetics, 23:387.
