KDD Cup 2002 Tasks

Task 1: Information Extraction from Biomedical Articles

Biomedical information exists in both the research literature and various structured databases. Some of these databases serve, in part, as a distillation of what is described in the literature. Such databases exist for genes and proteins in general, and also for more specific areas, such as the genome of a specific organism.

For this KDD Challenge Cup task, we examined the work performed by one group of curators for FlyBase (http://www.flybase.org/), a publicly available database on the genetics and molecular biology of Drosophila (fruit flies). This task focused on helping to automate the work of curating biomedical databases by identifying which papers need to be curated for Drosophila gene expression information.

The FlyBase criterion for gene expression curation is:
Does the paper contain experimental evidence for gene expression, specifically, information about the gene products (RNA transcripts or polypeptides or proteins) associated with a given gene?

For the Challenge Task, we asked the contestants to develop a system that does the following:

Given:

  • A set of papers on genetics or molecular biology
  • For each paper, a list of the genes mentioned in that paper

Determine:

  • Whether the paper meets the FlyBase gene-expression curation criterion and, for each gene mentioned, whether the full paper contains experimental evidence for its gene products (RNA and/or protein).

Thus for each paper containing experimental evidence of gene expression, we asked that a system return a check-list for each gene indicating whether it has associated RNA and/or protein products.
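The check-list described above can be pictured as a small record per paper. The following is only an illustrative sketch in Python, not the format used in the competition: the class names, field names, and the rule in `needs_curation` (a paper qualifies if any listed gene has evidence for any product) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GeneProducts:
    """Check-list entry for one gene mentioned in a paper (hypothetical layout)."""
    rna: bool = False      # experimental evidence for RNA transcripts
    protein: bool = False  # experimental evidence for polypeptides or proteins


@dataclass
class PaperResult:
    """System output for one paper (hypothetical layout)."""
    paper_id: str
    genes: Dict[str, GeneProducts] = field(default_factory=dict)

    @property
    def needs_curation(self) -> bool:
        # Assumption: the paper meets the criterion when at least one
        # listed gene has evidence for an RNA or protein product.
        return any(g.rna or g.protein for g in self.genes.values())


def checklist(result: PaperResult) -> Dict[str, Dict[str, bool]]:
    """Flatten a result into {gene: {"rna": ..., "protein": ...}}."""
    return {
        gene: {"rna": p.rna, "protein": p.protein}
        for gene, p in result.genes.items()
    }
```

A paper listing two genes, only one of which has RNA evidence, would be flagged for curation, while its check-list would show the second gene with both entries unset.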

Task 2: Yeast Gene Regulation Prediction

There are now experimental methods that allow biologists to measure some aspect of cellular "activity" for thousands of genes or proteins at a time. A key problem that often arises in such experiments is in interpreting or annotating these thousands of measurements. This KDD Cup task focused on using data mining methods to capture the regularities of genes that are characterized by similar activity in a given high-throughput experiment.

To facilitate objective evaluation, this task did not involve experiment interpretation or annotation directly. Instead, it involved devising models that, when trained to classify the measurements of some instances (i.e., genes), can accurately predict the response of held-aside test instances.

The training and test data came from recent experiments with a set of S. cerevisiae (yeast) strains in which each strain is characterized by a single gene being knocked out. Each instance in the data set represents a single gene, and the target value for an instance is a discretized measurement of how active some (hidden) system in the cell is when this gene is knocked out. The goal of the task is to learn a model that can accurately predict these discretized values. Such a model would be helpful in understanding how various genes are related to the hidden system.

A subset of the genes was held aside as a test set. For the Challenge Task, we asked the contestants to develop a system that does the following:

Given:

  • A list of test set genes
  • Various data sources describing the genes of interest

Determine:

  • For each test set gene, which "activity" class the strain with the gene knocked out falls into.

The data sources for this task include nominal (categorical) features describing gene localization and function, abstracts from the scientific literature (MEDLINE), and a table of protein-protein interactions that relate the products of pairs of genes.
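One way to picture how these three kinds of data might feed a single model is to merge them into one sparse feature dictionary per gene. The sketch below is an assumption-laden illustration in Python, not the method of any contestant: the function name, the input layouts, and the feature-naming scheme are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_features(
    gene: str,
    nominal: Dict[str, Dict[str, str]],
    abstracts: Dict[str, List[str]],
    interactions: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Merge three data sources into one sparse feature dict for a gene.

    All input layouts are assumptions for this sketch:
      nominal      -- gene -> {attribute: categorical value}
      abstracts    -- gene -> list of abstract texts mentioning the gene
      interactions -- pairs of genes whose products interact
    """
    feats: Dict[str, int] = {}

    # Nominal features: one indicator per attribute/value pair.
    for name, value in nominal.get(gene, {}).items():
        feats[f"{name}={value}"] = 1

    # Abstracts: simple bag-of-words counts over whitespace tokens.
    words: Counter = Counter()
    for text in abstracts.get(gene, []):
        words.update(text.lower().split())
    for word, count in words.items():
        feats[f"word:{word}"] = count

    # Interactions: one indicator per interaction partner.
    for a, b in interactions:
        if a == gene:
            feats[f"interacts:{b}"] = 1
        elif b == gene:
            feats[f"interacts:{a}"] = 1

    return feats
```

Dictionaries of this kind can be passed to any classifier that accepts sparse features. The choice to treat interaction partners as plain indicator features is only one option; partner labels or network distances are others.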

The Hidden System

The experimental data (the target values) for this task were generated by Guang Yao and Dr. Chris Bradfield from the McArdle Laboratory for Cancer Research at the University of Wisconsin. During the KDD Cup competition, the nature of the system being measured was kept secret. Now it can be revealed...

Yao and Bradfield were measuring the activity of a yeast model of the AHR (Aryl Hydrocarbon Receptor) signaling pathway. This pathway plays a key role in how cells respond to certain toxic chemicals (among other things), and it is similar to pathways that control how cells respond to a variety of other environmental stimuli.

AHR is a protein that can act as a transcription factor. When a cell is exposed to certain toxic chemicals, such as dioxin, the AHR system acts to turn on/off the expression of certain genes. The complete inventory of gene products (proteins) that are involved in the signaling pathway is not known. The goal of Yao and Bradfield's experiment was to identify the set of genes that can play a role in the pathway.

Although the gene for the AH receptor itself is not native to yeast, Yao and Bradfield transformed 4500+ strains in the yeast deletion library by inserting into each the AHR gene along with a reporter system that enabled them to measure how active AHR signaling was in any given strain.

The result of this experiment was the identification of 134 genes that, when knocked out, cause a significant change in the level of activity of the AHR signaling system.

Mark Craven
Last modified: Mon Aug 26 12:32:42 CDT 2002