KDD Cup 2002 Tasks
Task 1: Information Extraction from Biomedical Articles
Biomedical information exists in both the research literature and various structured databases. Some of these databases serve, in part, as a distillation of what is described in the literature. Such databases exist for genes and proteins in general, and also for more specific areas, such as the genome of a specific organism.
For this KDD Challenge Cup task, we examined the work performed by one group of curators for FlyBase (http://www.flybase.org/), a publicly available database on the genetics and molecular biology of Drosophila (fruit flies). This task focused on helping to automate the work of curating biomedical databases by identifying which papers need to be curated for Drosophila gene expression information.
The FlyBase criterion for curation for gene expression is:
For the Challenge Task, we asked the contestants to develop a system that does the following:
Thus for each paper containing experimental evidence of gene expression, we asked that a system return a check-list for each gene indicating whether it has associated RNA and/or protein products.
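The per-paper check-list described above could be represented along the following lines. This is only a minimal sketch: the names `GeneProducts`, `build_checklist`, and `needs_curation`, the gene symbols, and the input format of (gene, product type) pairs are all hypothetical, and the actual submission format used in the contest is not given here.

```python
from dataclasses import dataclass


@dataclass
class GeneProducts:
    """Check-list entry for one gene mentioned in a paper (hypothetical layout)."""
    rna: bool = False      # paper reports experimental evidence for an RNA product
    protein: bool = False  # paper reports experimental evidence for a protein product


def build_checklist(evidence):
    """Fold (gene, product_type) evidence pairs into one entry per gene.

    `evidence` is an assumed input: an iterable of (gene_symbol, kind) pairs
    where kind is "rna" or "protein".
    """
    checklist = {}
    for gene, kind in evidence:
        entry = checklist.setdefault(gene, GeneProducts())
        if kind == "rna":
            entry.rna = True
        elif kind == "protein":
            entry.protein = True
        else:
            raise ValueError(f"unknown product type: {kind!r}")
    return checklist


def needs_curation(checklist):
    """Flag a paper when at least one gene has some product evidence."""
    return any(e.rna or e.protein for e in checklist.values())


# Illustrative evidence pairs; the gene symbols are examples only.
paper_evidence = [("hb", "rna"), ("hb", "protein"), ("eve", "rna")]
paper_checklist = build_checklist(paper_evidence)
```

Keeping one entry per gene, rather than one flag per paper, mirrors the two levels of the task: deciding whether a paper qualifies at all, and then stating which products are supported for each gene.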
Task 2: Yeast Gene Regulation Prediction
There are now experimental methods that allow biologists to measure some aspect of cellular "activity" for thousands of genes or proteins at a time. A key problem that often arises in such experiments is in interpreting or annotating these thousands of measurements. This KDD Cup task focused on using data mining methods to capture the regularities of genes that are characterized by similar activity in a given high-throughput experiment.
To facilitate objective evaluation, this task did not involve interpreting or annotating experiments directly. Instead, it involved devising models that, when trained to classify the measurements of some instances (i.e., genes), can accurately predict the response of held-aside test instances.
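The held-aside evaluation idea can be sketched as follows. Everything in this example is an assumption made for illustration: the gene identifiers, the two class labels, the 30% test fraction, and the use of plain accuracy as the score. The trivial majority-class predictor stands in for whatever model a contestant might train; it is not a method described by the task.

```python
import random


def hold_aside_split(gene_ids, test_fraction=0.3, seed=0):
    """Shuffle gene identifiers and hold a fraction aside as the test set."""
    rng = random.Random(seed)
    shuffled = sorted(gene_ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def majority_label(labels):
    """Most frequent label; ties go to the alphabetically first label."""
    return max(sorted(set(labels)), key=labels.count)


def accuracy(predicted, actual):
    """Fraction of test genes whose predicted label matches the measured one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy labels: gene id -> measured response class (illustrative, not real data).
responses = {f"gene{i}": ("change" if i % 5 == 0 else "no_change") for i in range(20)}

train_ids, test_ids = hold_aside_split(responses)
baseline = majority_label([responses[g] for g in train_ids])
predictions = [baseline for _ in test_ids]
score = accuracy(predictions, [responses[g] for g in test_ids])
```

Because the test labels are never consulted while choosing `baseline`, the score reflects how well the learned rule generalizes to unseen genes, which is what makes the evaluation objective.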
The training and test data came from recent experiments with a
A subset of the genes was held aside as a test set. For the Challenge Task, we asked the contestants to develop a system that does the following:
The data sources for this task include nominal (categorical) features describing gene localization and function, abstracts from the scientific literature (MEDLINE), and a table of protein-protein interactions that relate the products of pairs of genes.
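One way the three data sources might be combined into a single feature representation per gene is sketched below. The function name `build_features`, the input layouts, the feature-key naming scheme, and the toy genes are all hypothetical; the actual file formats distributed for the task are not described here.

```python
from collections import defaultdict


def build_features(nominal, abstracts, interactions):
    """Merge three assumed data sources into one sparse feature dict per gene."""
    # Index the interaction table so each gene can look up its partners.
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)

    features = {}
    for gene in nominal:
        f = {}
        # Nominal attributes become one-hot "attribute=value" indicators.
        for attr, value in nominal[gene].items():
            f[f"{attr}={value}"] = 1
        # Abstract text becomes bag-of-words counts.
        for text in abstracts.get(gene, []):
            for word in text.lower().split():
                key = f"word:{word}"
                f[key] = f.get(key, 0) + 1
        # Interaction partners contribute their nominal attributes as well.
        for partner in sorted(partners[gene]):
            for attr, value in nominal.get(partner, {}).items():
                key = f"partner_{attr}={value}"
                f[key] = f.get(key, 0) + 1
        features[gene] = f
    return features


# Toy inputs for illustration only; none of this is real task data.
nominal = {
    "geneA": {"localization": "nucleus", "function": "transcription"},
    "geneB": {"localization": "cytoplasm", "function": "transport"},
}
abstracts = {"geneA": ["Nuclear import of a transcription factor"]}
interactions = [("geneA", "geneB")]

gene_features = build_features(nominal, abstracts, interactions)
```

Propagating partner attributes is just one possible use of the interaction table; it lets a gene with sparse annotations borrow signal from the genes whose products it interacts with.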
The Hidden System
The experimental data (the target values) for this task were generated by Guang Yao and Dr. Chris Bradfield from the McArdle Laboratory for Cancer Research at the University of Wisconsin. During the KDD Cup competition, the nature of the system being measured was kept secret. Now it can be revealed...
Yao and Bradfield were measuring the activity of a yeast model of the AHR (Aryl Hydrocarbon Receptor) signaling pathway. This pathway plays a key role in how cells respond to certain toxic chemicals (among other things), and it is similar to pathways that control how cells respond to a variety of other environmental stimuli.
AHR is a protein that can act as a transcription factor. When a cell is exposed to certain toxic chemicals, such as dioxin, the AHR system acts to turn on/off the expression of certain genes. The complete inventory of gene products (proteins) that are involved in the signaling pathway is not known. The goal of Yao and Bradfield's experiment was to identify the set of genes that can play a role in the pathway.
Although the gene for the AH receptor itself is not native to yeast, Yao and Bradfield transformed 4500+ strains in the yeast deletion library by inserting into each the AHR gene along with a reporter system that enabled them to measure how active AHR signaling was in any given strain.
The result of this experiment was the identification of 134 genes that, when knocked out, cause a significant change in the level of activity of the AHR signaling system.
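The two figures quoted above give a rough sense of how rare a significant response was. The short calculation below treats "4500+" as a lower bound of 4500 strains, which is an assumption, and the remark about a trivial predictor holds only if the prediction target were simply membership in the set of 134 genes, which is also an assumption.

```python
# Figures quoted in the text; "4500+" is treated as a lower bound on strain count.
significant_genes = 134
min_strains = 4500

# With at least 4500 strains, the fraction showing a significant change is at most:
max_fraction = significant_genes / min_strains

# A trivial predictor that always answers "no significant change" would then be
# right on at least this share of strains.
min_trivial_accuracy = 1 - max_fraction

print(f"at most {max_fraction:.1%} of strains; trivial accuracy at least {min_trivial_accuracy:.1%}")
```

Under these assumptions, roughly 3% of knockouts at most produce a significant change, so raw accuracy alone would say little about whether a model has captured the regularities of the responsive genes.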
Mark Craven Last modified: Mon Aug 26 12:32:42 CDT 2002