KDD Cup 2002 Task 2 Evaluation


This document describes the scoring scheme that will be used to
evaluate competitors in Task 2 of KDD Cup 2002.


BACKGROUND

To motivate the scoring scheme, let us first consider the relationship
of the three classes in the problem.  Recall that each instance in the
data set represents a strain of yeast in which a particular gene has
been "knocked out" (i.e., disabled).  The class labels in the data set
come from two experiments that were run in sequence.  First, the
activity level of the "hidden" system was measured in 4507 of these
knockout yeast strains.  In about 130 of these strains, the activity
of the hidden system differed significantly from that in the baseline
(the wild-type strain).  This experiment is interesting because it
suggests which genes might be involved somehow in regulating the
hidden system.  That is, if a given strain of yeast had a
significantly different level of activity in the hidden system, then
it is likely that the knocked-out gene associated with the strain is
involved in regulating the hidden system.

For those strains where the activity of the hidden system was
significantly different, the activity of a different, "control" system
was measured.  The purpose of measuring the control was to distinguish
the genes that have a specific relationship to the hidden system from
those that affect the hidden system because they play general roles
and affect many functions in the cell.

The goal of the data mining task here is to identify the regularities
and relationships that characterize the genes involved in regulating
the hidden system.  We could do this with a narrow focus:
characterizing the genes that affect the hidden system but not the
control system.  Or we could do this with a broad focus:
characterizing the genes that affect the hidden system without regard
to what happens in the control system.  Both of these perspectives are
valuable, and both will be considered in the scoring scheme for Task 2.

	
TWO CLASS PARTITIONINGS

Competitors in Task 2 will be asked to provide predictions based on
two different binary partitionings of the class labels.  

In the first case, the "positive" class consists of those genes with
the "change" label and the "negative" class consists of those genes
with *either* the "nc" or the "control" label.  This partitioning
corresponds to the narrow characterization of genes affecting the
hidden system described above.

In the second case, the "positive" class consists of those genes
labeled with *either* the "change" or the "control" label, and the
"negative" class consists of those genes labeled with the "nc" label.
This partitioning corresponds to the broad characterization of the
genes affecting the hidden system.
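
The two partitionings can be illustrated with a minimal Python sketch.
The label strings "change", "control", and "nc" come from the task
description; the function names and the four-gene example list are
hypothetical and are not part of any official scoring code.

```python
# Hypothetical helpers; only the label strings come from the task description.

def narrow_partition(label):
    """Narrow view: positive (1) only for genes labeled "change"."""
    return 1 if label == "change" else 0


def broad_partition(label):
    """Broad view: positive (1) for genes labeled "change" or "control"."""
    return 1 if label in ("change", "control") else 0


# Made-up example labels, one per gene.
labels = ["change", "control", "nc", "nc"]
narrow = [narrow_partition(label) for label in labels]
broad = [broad_partition(label) for label in labels]
```

Note that a gene labeled "control" is the only kind whose class differs
between the two partitionings: it is negative in the narrow case and
positive in the broad case.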


SCORING METRIC

For each of the partitions described above, competitors will be asked
to provide their predictions sorted by confidence that the gene
belongs to the positive class.  That is, the first genes in this
sorted list should be those that are most confidently predicted to
belong to the positive class.  The last genes in the list should be
those that are most confidently predicted to belong to the negative
class.

By varying a threshold on these sorted predictions, where the genes
ranked above the threshold are predicted "positive", we can construct a
Receiver Operating Characteristic (ROC) curve.  An ROC curve is a plot
of the true positive rate against the false positive rate for the
different possible thresholds. Here the true positive rate is the
fraction of the positive instances for which the system predicts
"positive". The false positive rate is the fraction of the negative
instances for which the system erroneously predicts "positive".  The
larger the area under the curve (that is, the more closely the curve
follows the left-hand border and then the top border of the ROC
space), the more accurate the system's predictions.  The expected
curve for a system making random predictions is the 45-degree
diagonal line.

The evaluation metric we use will be the area under the ROC curve.  

The ROC curve for a perfect system has an area of 1.  The ROC curve
for a system making random predictions has an expected area of 0.5.
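
As a rough illustration of the metric, the sketch below computes the
area under the ROC curve directly from a ranked list.  It is not the
official scoring code.  It assumes the true classes are given as 1
(positive) and 0 (negative), listed in the competitor's order from
most to least confidently positive, with no tied predictions; the
function name roc_area is hypothetical.  For a tie-free ranking, the
area equals the fraction of (positive, negative) pairs in which the
positive instance is ranked above the negative one, and that is what
the code counts.

```python
def roc_area(ranked_labels):
    """Area under the ROC curve for a tie-free ranked list of 0/1 labels.

    ranked_labels[0] is the true class of the instance most confidently
    predicted to be positive; the last entry is the least confident.
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    positives_seen = 0   # positives ranked above the current position
    correct_pairs = 0    # (positive, negative) pairs in the right order
    for label in ranked_labels:
        if label == 1:
            positives_seen += 1
        else:
            # Every positive already seen outranks this negative.
            correct_pairs += positives_seen
    return correct_pairs / (n_pos * n_neg)
```

For example, roc_area([1, 1, 0, 0]) is 1.0 (a perfect ranking),
roc_area([0, 0, 1, 1]) is 0.0, and roc_area([1, 0, 1, 0]) is 0.75.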


OVERALL SCORING

Competitors will provide two separate sets of predictions for the two
class partitionings. We will construct two ROC curves (one for each
partitioning) and measure the area under each.  The winner will be the
competitor with the greatest summed area for the two curves.  That is:

OverallScore = Area(ROC curve for narrow partition) +
               Area(ROC curve for broad partition)
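
The arithmetic can be sketched as follows, assuming the two areas
have already been computed.  The function name overall_score and the
range check are illustrative assumptions, not part of the official
scoring procedure.

```python
def overall_score(narrow_area, broad_area):
    """Sum of the two ROC areas, following the OverallScore formula."""
    for area in (narrow_area, broad_area):
        if not 0.0 <= area <= 1.0:
            raise ValueError("an ROC area must lie between 0 and 1")
    return narrow_area + broad_area


perfect = overall_score(1.0, 1.0)  # perfect ranking on both partitions
chance = overall_score(0.5, 0.5)   # expected areas for random predictions
```

Since each area lies between 0 and 1, the overall score lies between
0 and 2.  A perfect system scores 2, and a system making random
predictions has an expected score of 1.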


In addition to determining the overall winner, we will also be
interested in determining which competitor has the best result for
each individual partition, and we will report these results at KDD
2002 as well.

Questions about Task 2 and the scoring system should be directed to
Mark Craven (craven@biostat.wisc.edu).