KDD Cup 2002 Frequently Asked Questions
- Can I compete on just one task in KDD Cup?
Yes, the two tasks will be treated
as separate competitions.
- Do I need to do anything to register for KDD Cup?
Not at this point. You should, however, sign up
for the appropriate mailing lists so that you are kept
informed about any
issues that come up.
Task 1 (Information Extraction) FAQs
For answers to questions about Task 1 see the
mailing list archive.
Task 2 (Gene Regulation) FAQs
- Will KDD Cup entrants be allowed to use other resources
for training their models in addition to those provided in
the training and test sets?
We have attempted to provide the most relevant data that
is readily accessible. We ask that entrants not use other
data sources in their models.
- How big is the test set?
The test set consists of 1489 instances.
- Is the ratio of the different classes the same
in the test set as in the training set?
- Were the training and test set split apart randomly?
Yes, using a stratified method so that the class
ratios would be the same in both sets.
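The answer above says only that the split was stratified; the exact procedure is not described. As a rough illustration of the idea, here is a minimal sketch that splits each class separately so both sets keep about the same class ratios. The function name, the `test_fraction` parameter, and the seed are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels_by_id, test_fraction, seed=0):
    """Split ids into (train, test) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    ids_by_class = defaultdict(list)
    for item_id, label in labels_by_id.items():
        ids_by_class[label].append(item_id)

    train, test = [], []
    for label in sorted(ids_by_class):
        ids = sorted(ids_by_class[label])  # sort first so the seed fully fixes the result
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

Because the cut is made within each class, a rare class cannot end up entirely in one of the two sets by chance, unless it is so small that its share rounds to zero.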
- Do the data files distributed with the training
set describe the test set instances in addition to the
training set instances?
Yes. The abstracts, protein-protein
interactions, localization values, function values and
aliases represent knowledge about all of the genes in yeast.
The test set to be provided will consist solely of a list
of gene identifiers. All of the information required to
instantiate features for the test set instances is in
the data files that were included with the training instances.
- Are the MEDLINE abstracts meant to be used as input data?
Yes, in fact it is probably necessary to use them to
get competitive accuracies.
- Why do the abstracts often contain references to gene names
followed by a "p"? For example, abstract 10022848 references
"sec4p" and "sec15p", but the file gene-abstracts.txt associates
this abstract with the genes "sec4" and "sec15".
The "p" suffix is often used to refer to the protein
encoded by a given gene. For example, "sec4p" denotes the
protein encoded by the gene "sec4". Since the protein is
the "product" of the gene, you can think of references to
"sec4p" as saying something about "sec4".
- Are some of the abstracts more relevant than others?
Yes, certainly. The relevance of the abstracts
probably varies widely.
- Is the relation represented by the protein-protein
interaction table symmetric? Why are some pairs listed both ways, while
most are not?
Yes, the relation is symmetric. Therefore the order
of the pair in each row does not matter. Some pairs
are listed in both orders simply because the original
table had these redundancies and my code that tried to
clean these up overlooked a few cases (it didn't check
- Why are some entries repeated in the protein-protein
interaction table? Do the multiple entries have any
This is the result of another overlooked data cleaning
issue. There is no significance to the fact that some
pairs have multiple entries.
- Why are there some entries in the protein-protein
interaction table in which a gene's product interacts with itself
(i.e. the gene listed in the second column is the same as
the one listed in the first)?
Certain proteins form homodimers, meaning that
two copies of the same protein molecule bind to each other to
form a complex. The instances of reflexive interaction
(e.g. YNL331C, YNL331C) in the data set are putative homodimers.
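Taken together, the three answers above suggest treating each row of the interaction table as an unordered pair, ignoring repeats, and keeping self-pairs. The following is a minimal sketch of that cleanup, assuming the rows have already been parsed into two-element tuples; the function name and the `GENE_A`/`GENE_B` identifiers are placeholders for illustration.

```python
def unique_interactions(pairs):
    """Collapse ordered, possibly repeated pairs into a set of unordered pairs."""
    seen = set()
    for first, second in pairs:
        # Sorting the two ids makes (A, B) and (B, A) the same key.
        seen.add(tuple(sorted((first, second))))
    return seen

rows = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_A"),    # same pair, opposite order
    ("GENE_A", "GENE_B"),    # exact repeat
    ("YNL331C", "YNL331C"),  # putative homodimer, kept as is
]
cleaned = unique_interactions(rows)
```

Self-pairs pass through unchanged, so putative homodimers remain available as a feature.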
- If gene A's protein interacts with gene B's protein
and gene B's protein interacts with gene C's protein,
can we infer that A's protein interacts with C's protein?
No, the interaction relation is not transitive.
You cannot conclude that A's protein physically interacts
with C's protein.
However, it may be reasonable to conclude that A and C
are related in some way.
- Is the list in interactions.txt exhaustive?
No, there are surely actual interactions that are not
represented in the list. Moreover, there are surely some
interactions in the list that are false positives.
- Is it true that one protein may have multiple
functions? If so, why does function.txt list only one
function for each protein?
Yes, a given protein may have multiple functions.
The most recent version of function.txt does list multiple
functions. The original version was incomplete.
- Is there clear (biological) independence between
"functions" and the "protein classes"?
No, in fact there is probably a high degree of dependence.
But these two features do provide somewhat different views
on the functions of various genes.
- Is there noise in the class labels of the instances?
Since the class labels were determined via an
experimental process in a lab, there is likely to be
some noise in them. However, we haven't artificially
added any noise to the labels.
- Has any artificial noise been added to the data?
- Can you give an example of what is meant by
a "hidden system" and a "control system"?
One example of a system we might measure is how
well a particular virus replicates when various genes
have been knocked out. Other examples include the
activity levels of specific metabolic or signaling pathways.
The motivation for having a "control system" in these
experiments is to determine whether a given knockout
seems to be specific in affecting our system of interest
or whether it affects the functioning of the cell broadly.
- In the gene-aliases.txt file, there are some aliases
that are shared by several genes. Is it correct that multiple genes can share
the same alias?
Yes. In some cases (e.g. ASP3) there are multiple
copies of the same gene in the genome. In other cases
(e.g. YFR1) the alias is sometimes loosely used to
refer to any of a family of six genes (YFR1-1, ... YFR1-6).
In yet other cases
(e.g. SAT2) the yeast community seems to have inadvertently
used the same alias for multiple, unrelated genes.
Rik Belew has provided a list of
the overloaded gene aliases in the data set.
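If you want to find the overloaded aliases yourself, one option is to group genes by alias and keep the aliases that map to more than one gene. This is a minimal sketch, assuming the alias file has already been parsed into (gene, alias) pairs; the layout of gene-aliases.txt is not described here, and the function name and gene identifiers below are placeholders.

```python
from collections import defaultdict

def overloaded_aliases(gene_alias_pairs):
    """Return {alias: set of genes} for every alias shared by two or more genes."""
    genes_by_alias = defaultdict(set)
    for gene, alias in gene_alias_pairs:
        genes_by_alias[alias].add(gene)
    return {
        alias: genes
        for alias, genes in genes_by_alias.items()
        if len(genes) > 1
    }

# Placeholder gene ids; only the alias names come from the answer above.
pairs = [("GENE_1", "SAT2"), ("GENE_2", "SAT2"), ("GENE_3", "ASP3")]
shared = overloaded_aliases(pairs)
```

How to treat a shared alias when it appears in an abstract (credit all candidate genes, or none) is a modeling decision left to each entrant.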
- There are many more than 4507 genes referenced in the data set.
Were the 4507 training and test genes selected randomly
from the total complement of genes in yeast?
No, the 4507 strains that were measured in this experiment
represent the strains that are viable when the gene
associated with each is knocked out.
- Where can I find background material on the
problem domain?
- Is it appropriate to think of the control and
hidden systems separately, such that there are 4 possible cases:
| Hidden Change | Control Change | Class   |
| 0             | 0              | NC      |
| 0             | 1              | NC      |
| 1             | 0              | CHANGE  |
| 1             | 1              | CONTROL |
If so, would there be any chance of getting refined codings,
discriminating the two NC cases?
In theory there are four separate cases, but in practice the experiments
were run as follows. Some subset of genes H was identified as having
their knockouts significantly up/down-regulate the hidden system.
Then the control system was measured only for the genes in H.
So in effect we don't have the information to distinguish between the
first two lines in the table above.
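The labeling described above can be written as a small function. This is a sketch of the logic as stated in this answer, not the organizers' code; the function name and the use of `None` for "control system not measured" are assumptions.

```python
from typing import Optional

def knockout_class(hidden_change: bool, control_change: Optional[bool] = None) -> str:
    """Map the two measurements to a class label.

    control_change may be None: the control system was measured only for
    genes whose knockout changed the hidden system.
    """
    if not hidden_change:
        # Both "no hidden change" rows of the table collapse to NC.
        return "NC"
    if control_change is None:
        raise ValueError("control system must be measured when the hidden system changes")
    return "CONTROL" if control_change else "CHANGE"
```

The control measurement is never consulted when the hidden system is unchanged, which is why the two NC cases cannot be told apart.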
- Can you give more detail about how the area under
an ROC curve will be calculated in the evaluation?
Here is pseudocode for this calculation.
    predictions: n ordered test-set predictions
    total_pos: # positive instances in test set
    total_neg: # negative instances in test set

    area = 0.0
    tp_rate = 0.0
    fp_rate = 0.0
    tp_count = 0
    fp_count = 0
    i = 0
    while (i < n && tp_rate < 1.0)
    {
        // remember rates from last point
        last_fp_rate = fp_rate
        last_tp_rate = tp_rate
        // consider the next instance to be another pos prediction
        if class(predictions[i]) == pos
            ++tp_count
        else
            ++fp_count
        // determine coordinates of ROC point
        tp_rate = tp_count / total_pos
        fp_rate = fp_count / total_neg
        // update area
        if (fp_rate > last_fp_rate)
        {
            // use trapezoid rule
            area += 0.5 * (fp_rate - last_fp_rate) *
                    (last_tp_rate + tp_rate)
        }
        ++i
    }
    // account for the rest of the area after tp_rate hits 1.0
    area += (1 - fp_rate) * 1.0
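For entrants who want to score their own rankings on held-out training data, here is a minimal Python sketch of the same calculation. It is a reading of the pseudocode above, not the official evaluation script. It assumes the input is a list of true labels (`True` for positive) already ordered from most to least confident prediction, and the function name `roc_area` is an assumption.

```python
from typing import Sequence

def roc_area(labels: Sequence[bool]) -> float:
    """Area under the ROC curve for true labels ranked by decreasing confidence."""
    total_pos = sum(1 for is_pos in labels if is_pos)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative instance")

    area = 0.0
    tp_count = fp_count = 0
    tp_rate = fp_rate = 0.0
    for is_pos in labels:
        if tp_count == total_pos:  # same stopping rule as tp_rate reaching 1.0
            break
        # Remember the rates from the last ROC point.
        last_tp_rate, last_fp_rate = tp_rate, fp_rate
        if is_pos:
            tp_count += 1
        else:
            fp_count += 1
        tp_rate = tp_count / total_pos
        fp_rate = fp_count / total_neg
        if fp_rate > last_fp_rate:
            # Trapezoid rule between the last point and this one.
            area += 0.5 * (fp_rate - last_fp_rate) * (last_tp_rate + tp_rate)
    # Account for the rest of the area after tp_rate hits 1.0.
    area += (1.0 - fp_rate) * 1.0
    return area
```

A ranking that places every positive ahead of every negative scores 1.0, and the reverse ranking scores 0.0. The sketch makes no special provision for tied prediction scores; like the pseudocode, it processes instances one at a time in the given order.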
- There are some genes in the test set (as well as the
training set) for which there is no data available.
How are we supposed to make predictions for these cases?
These cases represent (putative) genes for which the
corresponding knockouts were actually used in the experiments
that measured the hidden and control systems. These cases were
not removed from the data set because they reflect the nature of the real
problem at hand: little or nothing is known about some genes.
It is up to each competitor to decide how to make predictions
for these cases. One reasonable approach is to assume that
they belong to the most populous class (nc).
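The fallback suggested above can be sketched as follows. The helper name and the toy label list are assumptions for illustration; in practice the labels would come from the training set.

```python
from collections import Counter

def most_populous_class(training_labels):
    """Return the class label that occurs most often in the training labels."""
    if not training_labels:
        raise ValueError("training_labels must not be empty")
    return Counter(training_labels).most_common(1)[0][0]

# Toy labels for illustration only, not the real class distribution.
fallback = most_populous_class(["nc", "nc", "change", "nc", "control"])
```

A gene with no abstracts, interactions, or annotations would then simply receive `fallback` as its predicted class.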
Last modified: Fri Aug 23 14:52:43 CDT 2002