KDD Cup 2002 Frequently Asked Questions

General FAQs

Can I compete on just one task in KDD Cup?

Yes, the two tasks will be treated as separate competitions.
Do I need to do anything to register for KDD Cup?

Not at this point. You should, however, sign up for the appropriate mailing lists so that you are kept informed about any issues that come up.
- mailing list for Task 1
- mailing list for Task 2

Task 1 (Information Extraction) FAQs

For answers to questions about Task 1 see the mailing list archive.

Task 2 (Gene Regulation) FAQs

Will KDD Cup entrants be allowed to use other resources for training their models in addition to those provided in the training and test sets?

We have attempted to provide the most relevant data that is readily accessible. We ask that entrants not use other data sources in their models.
How big is the test set?

The test set consists of 1489 instances.
Is the ratio of the different classes the same in the test set as in the training set?

Yes.
Were the training and test set split apart randomly?

Yes, using a stratified method so that the class ratios would be the same in both sets.
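As a rough illustration of what a stratified split does, the Python sketch below shuffles the instances of each class separately and sends the same fraction of each class to the test set. It is not the procedure actually used to build the KDD Cup sets, which is not described here. The function name, the test_fraction parameter, the seed and the dict input format are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Split instance ids into (train, test), keeping class ratios roughly equal.

    labels: dict mapping instance id -> class label (hypothetical input format).
    """
    rng = random.Random(seed)

    # Group the instance ids by class label.
    by_class = defaultdict(list)
    for instance_id, label in labels.items():
        by_class[label].append(instance_id)

    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        # Send the same fraction of every class to the test set.
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

Because the split is done per class, a rare class cannot end up entirely in one of the two sets by chance, which is the point of stratifying.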
Do the data files distributed with the training set describe the test set instances in addition to the training set instances?

Yes. The abstracts, protein-protein interactions, localization values, function values and aliases represent knowledge about all of the genes in yeast. The test set to be provided will consist solely of a list of gene identifiers. All of the information required to instantiate features for the test set instances is in the data files that were included with the training instances.
Are the MEDLINE abstracts meant to be used as input data?

Yes, in fact it is probably necessary to use them to get competitive accuracies.
Why do the abstracts often contain references to gene names followed by a "p"? For example, abstract 10022848 references "sec4p" and "sec15p", but the file gene-abstracts.txt associates this abstract with the genes "sec4" and "sec15".

The "p" suffix is often used to refer to the protein encoded by a given gene. For example, "sec4p" denotes the protein encoded by the gene "sec4". Since the protein is the "product" of the gene, you can think of references to "sec4p" as saying something about "sec4".
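One way to apply this when matching abstract text to genes is sketched below in Python. The function name and the known_genes set are hypothetical. The sketch strips a trailing "p" only when the remainder is a known gene name, so ordinary words that happen to end in "p" are left alone.

```python
def gene_from_protein_mention(token, known_genes):
    """Return the gene name for a token such as 'sec4p', or None if nothing matches.

    known_genes: set of lower-case gene names (hypothetical input).
    """
    name = token.lower()
    if name in known_genes:
        # The token is already a gene name.
        return name
    if name.endswith("p") and name[:-1] in known_genes:
        # The token names the protein; drop the "p" suffix to get the gene.
        return name[:-1]
    return None
```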
Are some of the abstracts more relevant than others?

Yes, certainly. The relevance of the abstracts probably varies widely.
Is the relation represented by the protein-protein interaction table symmetric? Why are some pairs listed both ways, while most are not?

Yes, the relation is symmetric. Therefore the order of the pair in each row does not matter. Some pairs are listed in both orders simply because the original table had these redundancies and my code that tried to clean these up overlooked a few cases (it didn't check all aliases).
Why are some entries repeated in the protein-protein interaction table? Do the multiple entries have any significance?

This is the result of another overlooked data cleaning issue. There is no significance to the fact that some pairs have multiple entries.
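Since neither the order within a pair nor repeated rows carry any meaning, entrants may want to collapse the table to a set of distinct unordered pairs before using it. The Python sketch below is one minimal way to do that, assuming the pairs have already been read into memory as 2-tuples of gene identifiers; the function name is hypothetical. It does not resolve aliases, so two rows that name the same gene differently will still be treated as distinct.

```python
def canonical_interactions(pairs):
    """Return the set of distinct unordered pairs from an iterable of 2-tuples.

    Each pair is stored as a sorted tuple, so (A, B) and (B, A) become the
    same element and repeated rows disappear. Self-pairs (A, A) are kept.
    """
    return {tuple(sorted(pair)) for pair in pairs}
```

For example, the rows (GENE_B, GENE_A), (GENE_A, GENE_B) and (GENE_A, GENE_B) reduce to the single pair (GENE_A, GENE_B), where GENE_A and GENE_B are placeholder identifiers.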
Why are there some entries in the protein-protein interaction table in which a gene's product interacts with itself (i.e. the gene listed in the second column is the same as the one listed in the first)?

Certain proteins form homodimers, meaning that two copies of the same protein molecule bind to each other to form a complex. The instances of reflexive interaction (e.g. YNL331C, YNL331C) in the data set are putative homodimers.
If gene A's protein interacts with gene B's protein and gene B's protein interacts with gene C's protein, can we infer that A's protein interacts with C's protein?

No, the interaction relation is not transitive. You cannot conclude that A's protein physically interacts with C's protein. However, it may be reasonable to conclude that A and C are related in some way.
Is the list in interactions.txt exhaustive?

No, there are surely actual interactions that are not represented in the list. Moreover, there are surely some interactions in the list that are false positives.
Is it true that one protein may have multiple functions? If so, why does function.txt list only one function for each protein?

Yes, a given protein may have multiple functions. The most recent version of function.txt does list multiple functions. The original version was incomplete.
Is there clear (biological) independence between "functions" and the "protein classes"?

No, in fact there is probably a high degree of dependence. But these two features do provide somewhat different views on the functions of various genes.
Is there noise in the class labels of the instances?

Since the class labels were determined via an experimental process in a lab, there is likely to be some noise in them. However, we haven't artificially added any noise to the labels.
Has any artificial noise been added to the data?

No.
Can you give an example of what is meant by a "hidden system" and a "control system"?

One example of a system we might measure is how well a particular virus replicates when various genes have been knocked out. Other examples include the activity levels of specific metabolic or signaling pathways. The motivation for having a "control system" in these experiments is to determine whether a given knockout seems to be specific in affecting our system of interest or whether it affects the functioning of the cell broadly.
In the gene-aliases.txt file, there are some aliases that are shared by several genes. Is it correct that multiple genes can share the same alias?

Yes. In some cases (e.g. ASP3) there are multiple copies of the same gene in the genome. In other cases (e.g. YFR1) the alias is sometimes loosely used to refer to any of a family of six genes (YFR1-1, ... YFR1-6). In yet other cases (e.g. SAT2) the yeast community seems to have inadvertently used the same alias for multiple, unrelated genes. Rik Belew has provided a list of the overloaded gene aliases in the data set.
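Entrants who want to find the overloaded aliases themselves could do something like the following Python sketch. It assumes the alias data has already been parsed into (gene, alias) pairs; the layout of gene-aliases.txt is not described here, so the parsing step is left out, and the function name is hypothetical.

```python
from collections import defaultdict

def overloaded_aliases(gene_aliases):
    """Map each alias shared by more than one gene to the sorted list of those genes.

    gene_aliases: iterable of (gene, alias) pairs (hypothetical in-memory format).
    """
    genes_by_alias = defaultdict(set)
    for gene, alias in gene_aliases:
        genes_by_alias[alias].add(gene)
    # Keep only the aliases that point at two or more distinct genes.
    return {alias: sorted(genes)
            for alias, genes in genes_by_alias.items()
            if len(genes) > 1}
```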
There are many more than 4507 genes referenced in the data set. Were the 4507 training and test genes selected randomly from the total complement of genes in yeast?

No, the 4507 strains that were measured in this experiment represent the strains that are viable when the gene associated with each is knocked out.
Where can I find background material on the problem domain?

L. Hunter. Molecular Biology for Computer Scientists. In Artificial Intelligence and Molecular Biology, L. Hunter editor, 1993, AAAI Press.
DOE Primer on Molecular Genetics
F. Sherman. An Introduction to the Genetics and Molecular Biology of the Yeast Saccharomyces cerevisiae
Saccharomyces Genome Deletion Project
Is it appropriate to think of the control and hidden systems separately, such that there are 4 possible cases:

Hidden Change   Control Change   Class
0               0                NC
0               1                NC
1               0                CHANGE
1               1                CONTROL
If so, would there be any chance of getting refined codings, discriminating the two NC cases?

In theory there are four separate cases, but in practice the experiments were run as follows. Some subset of genes H was identified as having their knockouts significantly up/down-regulate the hidden system. Then the control system was measured only for the genes in H. So in effect we don't have the information to distinguish between the first two lines in the table above.
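Put as code, the labeling described above amounts to the small Python function below. The function and argument names are hypothetical. The control measurement is only consulted when the hidden system changed, which mirrors the fact that it was only taken for the genes in H.

```python
def class_label(hidden_change, control_change):
    """Return the class for one knockout, following the four-row table.

    hidden_change: True if the knockout changed the hidden system.
    control_change: True if it changed the control system; ignored when
    hidden_change is False, because the control was not measured then.
    """
    if not hidden_change:
        return "NC"
    return "CONTROL" if control_change else "CHANGE"
```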

Can you give more detail about how the area under an ROC curve will be calculated in the evaluation?

Here is pseudocode for this calculation.


Given:
	predictions: n ordered test-set predictions
	total_pos: # positive instances in test set
	total_neg: # negative instances in test set

area = 0.0
tp_rate = 0.0
fp_rate = 0.0
tp_count = 0
fp_count = 0
i = 0
while (i < n && tp_rate < 1.0)
{
	// remember rates from last point
	last_fp_rate = fp_rate
	last_tp_rate = tp_rate

	// consider the next instance to be another pos prediction
	if class(predictions[i]) == pos
		++tp_count
	else
		++fp_count

	// determine coordinates of ROC point
	tp_rate = tp_count / total_pos
	fp_rate = fp_count / total_neg

	// update area
	if (fp_rate > last_fp_rate)
		// use trapezoid rule
		area += 0.5 * (fp_rate - last_fp_rate) * 
		        (last_tp_rate + tp_rate)

	// advance to the next prediction
	++i
}

// account for the rest of the area after tp_rate hits 1.0
area += (1 - fp_rate) * 1.0

Return: area
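
For entrants who want to check their own rankings, here is a Python transcription of the pseudocode above. It is an unofficial sketch, not the evaluation program itself. The function name and the input format (a list of true classes, True for positive, ordered from the instance ranked most likely positive to the least) are assumptions. Like the pseudocode, it assumes a strict ordering and does nothing special for tied predictions.

```python
def roc_area(ordered_labels):
    """Area under the ROC curve for a ranked list of true classes.

    ordered_labels: list of booleans (True = positive), in ranking order,
    most confident positive prediction first (hypothetical input format).
    """
    total_pos = sum(1 for label in ordered_labels if label)
    total_neg = len(ordered_labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative instance")

    area = 0.0
    tp_rate = fp_rate = 0.0
    tp_count = fp_count = 0
    i = 0
    while i < len(ordered_labels) and tp_rate < 1.0:
        # Remember the rates from the last ROC point.
        last_fp_rate, last_tp_rate = fp_rate, tp_rate

        # Treat the next instance as another positive prediction.
        if ordered_labels[i]:
            tp_count += 1
        else:
            fp_count += 1

        # Coordinates of the new ROC point.
        tp_rate = tp_count / total_pos
        fp_rate = fp_count / total_neg

        # Trapezoid rule, applied only when the curve moved to the right.
        if fp_rate > last_fp_rate:
            area += 0.5 * (fp_rate - last_fp_rate) * (last_tp_rate + tp_rate)
        i += 1

    # Account for the rest of the area after tp_rate hits 1.0.
    area += (1.0 - fp_rate) * 1.0
    return area
```

As a sanity check, a ranking that places every positive ahead of every negative gives an area of 1.0, and the alternating ranking positive, negative, positive, negative gives 0.75.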

There are some genes in the test set (as well as the training set) for which there is no data available. How are we supposed to make predictions for these cases?

These cases represent (putative) genes for which the corresponding knockouts were actually used in the experiments that measured the hidden and control systems. These cases were not removed from the data set because they reflect the nature of the real problem at hand: little or nothing is known about some genes. It is up to each competitor to decide how to make predictions for these cases. One reasonable approach is to assume that they belong to the most populous class (nc).

Mark Craven

Last modified: Fri Aug 23 14:52:43 CDT 2002