KDD Cup 2002 Task 2 Evaluation

This document describes the scoring scheme that will be used to evaluate competitors in Task 2 of KDD Cup 2002.

BACKGROUND

To motivate the scoring scheme, let us first consider the relationship of the three classes in the problem. Recall that each instance in the data set represents a strain of yeast in which a particular gene has been "knocked out" (i.e. disabled).

The class labels in the data set come from two experiments that were run in sequence. First, the activity level of the "hidden" system was measured in 4507 of these knockout yeast strains. In about 130 of these strains, the activity of the hidden system was significantly different than in the baseline (the wild-type strain). This experiment is interesting because it suggests which genes might be involved somehow in regulating the hidden system. That is, if a given strain of yeast had a significantly different level of activity in the hidden system, then it is likely that the knocked-out gene associated with the strain is involved in regulating the hidden system.

For those strains where the activity of the hidden system was significantly different, the activity of a different, "control" system was measured. The purpose of measuring the control was to distinguish the genes that have a specific relationship to the hidden system from those that affect the hidden system because they play general roles and affect many functions in the cell.

The goal of the data mining task here is to identify the regularities and relationships that characterize the genes involved in regulating the hidden system. We could do this with a narrow focus: characterizing the genes that affect the hidden system but not the control system. Or we could do this with a broad focus: characterizing the genes that affect the hidden system without regard to what happens in the control system. Both of these perspectives are valuable, and both will be considered in the scoring scheme for Task 2.

TWO CLASS PARTITIONINGS

Competitors in Task 2 will be asked to provide predictions based on two different binary partitionings of the class labels.

In the first case, the "positive" class consists of those genes with the "change" label and the "negative" class consists of those genes with *either* the "nc" or the "control" label. This partitioning corresponds to the narrow characterization of genes affecting the hidden system described above.

In the second case, the "positive" class consists of those genes labeled with *either* the "change" or the "control" label, and the "negative" class consists of those genes labeled with the "nc" label. This partitioning corresponds to the broad characterization of the genes affecting the hidden system.

SCORING METRIC

For each of the partitions described above, competitors will be asked to provide their predictions sorted by confidence that the gene belongs to the positive class. That is, the first genes in this sorted list should be those that are most confidently predicted to belong to the positive class. The last genes in the list should be those that are most confidently predicted to belong to the negative class.

By varying a threshold on these sorted predictions we can construct a Receiver Operating Characteristic (ROC) curve. An ROC curve is a plot of the true positive rate against the false positive rate for the different possible thresholds. Here the true positive rate is the fraction of the positive instances for which the system predicts "positive". The false positive rate is the fraction of the negative instances for which the system erroneously predicts "positive". The larger the area under the curve (the more closely the curve follows the left-hand border and then the top border of the ROC space), the more accurate the predictions. The expected curve for a system making random predictions will be a line on the 45-degree diagonal.

The evaluation metric we use will be the area under the ROC curve.
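
As an illustration only, the two partitionings and the area under the ROC curve for a ranked list could be computed roughly as in the sketch below. This is not the organizers' scoring code. The names to_binary, roc_area, NARROW_POSITIVE, and BROAD_POSITIVE are hypothetical; only the labels "change", "control", and "nc" come from the task. The sketch assumes a strict ranking with no ties, and for brevity the toy example reuses one ranking for both partitionings, whereas competitors submit a separate ranking for each.

```python
# Hypothetical helper names; only the three class labels come from the task.
NARROW_POSITIVE = {"change"}             # negatives: "nc" and "control"
BROAD_POSITIVE = {"change", "control"}   # negatives: "nc"


def to_binary(labels, positive_labels):
    """Map each three-way class label to True (positive) or False (negative)."""
    return [label in positive_labels for label in labels]


def roc_area(ranked_is_positive):
    """Area under the ROC curve for a list sorted most-confident-positive first.

    Lowering the threshold one gene at a time moves the curve one step up
    for a positive and one step right for a negative, so the area equals
    the number of (positive, negative) pairs ranked in the correct order
    divided by the total number of such pairs.
    """
    n_pos = sum(ranked_is_positive)
    n_neg = len(ranked_is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    positives_seen = 0
    correctly_ordered_pairs = 0
    for is_positive in ranked_is_positive:
        if is_positive:
            positives_seen += 1
        else:
            # Every positive already seen outranks this negative.
            correctly_ordered_pairs += positives_seen
    return correctly_ordered_pairs / (n_pos * n_neg)


# Toy example (invented): true labels of five genes in ranked order.
ranked_labels = ["change", "control", "nc", "change", "nc"]
narrow_area = roc_area(to_binary(ranked_labels, NARROW_POSITIVE))  # 4 of 6 pairs
broad_area = roc_area(to_binary(ranked_labels, BROAD_POSITIVE))    # 5 of 6 pairs
```

In the toy example the same ranking scores differently under the two partitionings because the "control" gene counts as a negative in the narrow case and as a positive in the broad case.
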
The ROC curve for a perfect system has an area of 1. The ROC curve for a system making random predictions has an expected area of 0.5.

OVERALL SCORING

Competitors will provide two separate sets of predictions for the two class partitionings. We will construct two ROC curves (one for each partitioning) and measure the area under each. The winner will be the competitor with the greatest summed area for the two curves. That is:

    OverallScore = Area(ROC curve for narrow partition) + Area(ROC curve for broad partition)

In addition to determining the overall winner, we will also be interested in determining which competitor has the best result for each individual partition, and we will report these results at KDD 2002 as well.

Questions about Task 2 and the scoring system should be directed to Mark Craven (craven@biostat.wisc.edu).
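
To make the overall scoring rule concrete, the sketch below sums the two areas for each competitor and picks the overall winner and the per-partition winners. The competitor names and area values are invented purely for illustration, and overall_score is a hypothetical helper rather than part of any official scoring tool.

```python
def overall_score(narrow_area, broad_area):
    """OverallScore = area for the narrow partition + area for the broad partition."""
    return narrow_area + broad_area


# Invented example data: (narrow area, broad area) per hypothetical competitor.
areas = {
    "team_a": (0.62, 0.70),
    "team_b": (0.66, 0.64),
}

scores = {team: overall_score(*pair) for team, pair in areas.items()}

overall_winner = max(scores, key=scores.get)                 # greatest summed area
narrow_winner = max(areas, key=lambda team: areas[team][0])  # best narrow-partition area
broad_winner = max(areas, key=lambda team: areas[team][1])   # best broad-partition area
```

Since each area lies between 0 and 1, the overall score lies between 0 and 2, and a system making random predictions on both partitionings has an expected overall score of 1.
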