Whole-Genome Homology Maps
Whole-genome homology maps attempt to identify the evolutionary relationships
between and within multiple genomes. The term "syntenic" is often used to
describe regions of multiple genomes that are believed to have evolved from the
same region in an ancestral genome. However, it has been pointed out that this
use of the term is incorrect (Passarge et al. 1999) and thus
we will use the terms "homologous", "orthologous", and "paralogous" instead.
Ideally, given K genomes, we would like to identify all orthologous genomic
regions as well as paralogous regions within each genome and hypothetical
ancestral genome. Maps listing these relationships are extremely valuable to
researchers performing comparative analyses of genomic sequence. Presented
here is initial work on creating an orthology map for the human, mouse, and rat
genomes.
Our basic strategy in building homology maps is to use exons that are
orthologous in multiple genomes as map "anchors." Given K genomes,
the steps in the map construction are as follows:
- For each genome, obtain a set of exon annotations. These
annotations can be a combination of both exon predictions (e.g. Genscan)
and annotations that have been experimentally verified (e.g. RefSeq).
Ideally, we would like to have these annotations be as sensitive as
possible. Specificity is not a concern, as incorrect annotations are
unlikely to have significant alignments with other gene annotations.
- Compare all exons against all exons in the other genomes and record
significant alignments between exons. Currently, we use BLAT (Kent 2002)
for this all-vs-all comparison, with alignments performed in protein space.
- Construct a graph with each vertex corresponding to an exon and
edges between vertices whose corresponding exons have significant
alignments.
- Identify cliques in this graph. These cliques are potential
anchors to be used in the map.
- Join neighboring (adjacent in genomic coordinates, in each
genome) cliques to form runs. Cliques that are not part
of a run are discarded.
- The extents of each run in each genome are output as
orthologous segments. The cliques from each run are used to output
the exact genomic coordinates of anchors within each orthologous
segment. These anchors can be used by genomic alignment programs
(such as MAVID) to do a detailed alignment of
each orthologous segment.
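The graph and clique steps above can be sketched as follows. This is only a minimal illustration, not the pipeline used to build the maps: the exon identifiers, the `build_graph` and `maximal_cliques` helpers, and the choice of a basic Bron-Kerbosch enumeration are all assumptions made for the example. The input is a list of exon pairs that had significant alignments.

```python
from collections import defaultdict


def build_graph(hits):
    """One vertex per exon, one undirected edge per significant alignment."""
    adj = defaultdict(set)
    for exon_a, exon_b in hits:
        if exon_a != exon_b:
            adj[exon_a].add(exon_b)
            adj[exon_b].add(exon_a)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if clique:  # skip the empty clique of an empty graph
                cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical exon identifiers of the form "<genome>:<exon name>".
hits = [
    ("human:e1", "mouse:e1"),
    ("human:e1", "rat:e1"),
    ("mouse:e1", "rat:e1"),
    ("human:e2", "mouse:e2"),
]
anchors = maximal_cliques(build_graph(hits))
```

In this toy input the three mutually aligned exons form one candidate anchor spanning all three genomes, and the remaining human/mouse pair forms a second, two-genome candidate. Joining cliques into runs would then require the genomic coordinates of each exon, which this sketch leaves out.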
Lines in the map files are of the form:
[Segment #] [Chrom] [Start] [End] [Strand] ...
where the last 4 fields are repeated for each genome in the map. The
fields are tab-delimited. For coordinates on the reverse strand "-",
the start coordinate is greater than the end coordinate. Coordinates
are 0-based and half-open (the larger of Start and End is one more
than the coordinate of the last base included in the segment). Pieces for
which no orthologous region could be identified in one of the genomes
have "NA" in the fields for the appropriate genomes. The order of the
genomes in each line is given by the order of the genomes in the name
of the map file.
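As an illustration of the map file conventions described above, a line might be read as in the sketch below. The `Interval` type, the `parse_map_line` function, the genome names, and the example line are hypothetical; the genome order is assumed to be supplied by the caller, who takes it from the map file name.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str

    def length(self) -> int:
        # Half-open coordinates: the larger of start and end is one past
        # the last base, so the length is the absolute difference.
        return abs(self.end - self.start)


def parse_map_line(line, genomes):
    """Split one tab-delimited map line into a segment ID and per-genome intervals."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 1 + 4 * len(genomes):
        raise ValueError("expected %d fields, got %d"
                         % (1 + 4 * len(genomes), len(fields)))
    segment = fields[0]
    intervals = {}
    for i, genome in enumerate(genomes):
        chrom, start, end, strand = fields[1 + 4 * i: 5 + 4 * i]
        if "NA" in (chrom, start, end, strand):
            # No orthologous region was identified in this genome.
            intervals[genome] = None
        else:
            intervals[genome] = Interval(chrom, int(start), int(end), strand)
    return segment, intervals


# Made-up example line; the values are not taken from a real map.
example = "12\tchr1\t1000\t5000\t+\tchr4\t9000\t6500\t-\tNA\tNA\tNA\tNA"
segment, intervals = parse_map_line(example, ["human", "mouse", "rat"])
```

Note that the reverse-strand interval keeps its start greater than its end, as in the file; code that needs ascending coordinates can take the minimum and maximum of the two values.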
Lines in the anchor files are of the form:
[Segment #] [Genome1] [Chrom1] [Strand1] [Start1] [End1] [Genome2] [Chrom2] [Strand2] [Start2] [End2]
Each line represents an anchor between two of the genomes (Genome1 &
Genome2) in a certain map segment (Segment #). The coordinate
conventions are the same as for those in the map file.
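A similar sketch for the anchor files groups anchors by map segment. The function names and example lines are hypothetical. Because the delimiter of the anchor files is not stated here, the sketch splits on any whitespace, which is an assumption. Unlike the map file, each anchor side lists the strand before the start and end coordinates.

```python
from collections import defaultdict


def parse_anchor_line(line):
    """Parse one anchor line into (segment, first side, second side)."""
    fields = line.split()  # assumption: whitespace-delimited fields
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))

    def side(offset):
        genome, chrom, strand, start, end = fields[offset:offset + 5]
        return {"genome": genome, "chrom": chrom, "strand": strand,
                "start": int(start), "end": int(end)}

    return fields[0], side(1), side(6)


def anchors_by_segment(lines):
    """Collect the pairwise anchors belonging to each map segment."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            segment, first, second = parse_anchor_line(line)
            grouped[segment].append((first, second))
    return dict(grouped)


# Made-up example lines; the values are not taken from a real anchor file.
lines = [
    "12\thuman\tchr1\t+\t1200\t1350\tmouse\tchr4\t-\t8800\t8650",
    "12\thuman\tchr1\t+\t1200\t1350\trat\tchr5\t+\t300\t450",
    "13\thuman\tchr2\t+\t40\t190\tmouse\tchr6\t+\t70\t220",
]
grouped = anchors_by_segment(lines)
```

Grouping by segment makes it straightforward to hand all anchors of one orthologous segment to an alignment program together.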
- Human (NCBI build 33, Apr. 2003),
Mouse (MGSC v30, Feb. 2003),
Rat (RGSC v3.1, Jun. 2003) Map
ftp://ftp.biostat.wisc.edu//pub/cdewey/data/hiv_recombination.tar.gz
(Updated 08/20/2003)
- Human (NCBI build 34, Jul. 2003),
Mouse (MGSC v30, Feb. 2003),
Rat (RGSC v3.1, Jun. 2003) Map
(Updated 09/12/2003)
- Older maps constructed using a different method can be found
here.
- Bray, N. and Pachter, L. 2004.
MAVID: Constrained ancestral alignment of multiple sequences.
Genome Research, 14:693-699.
- Kent, W.J. 2002.
BLAT: The BLAST-like alignment tool.
Genome Research, 12:656-664.
- Passarge, E., Horsthemke, B., and Farber, R.A. 1999.
Incorrect use of the term synteny.
Nature Genetics, 23:387.