Craven Group Information Extraction Data Sets 
(version 2, 8/29/02)
(documentation updated 5/2/03)

  Mark Craven (craven@biostat.wisc.edu)
  Soumya Ray
  Marios Skounakis

  Department of Biostatistics & Medical Informatics
  Department of Computer Sciences
  University of Wisconsin


BACKGROUND
----------

For our research in information extraction, we have put together three
data sets from MEDLINE abstracts.  Each of the data sets has been
labeled with instances of a particular binary relation.
The three relations and the databases from which the relation
instances were obtained are as follows:

  subcellular-localization(PROTEIN, LOCATION): This relation characterizes
    the subcellular compartments in which particular yeast proteins localize.
    The relation tuples were gathered from the (now defunct) 
    Yeast Proteome Database (YPD).

  disease-association(GENE, DISEASE): This relation lists associations
    between particular genes and diseases.  The relation tuples were
    gathered from the Online Mendelian Inheritance in Man (OMIM) database.

  protein-interaction(PROTEIN, PROTEIN): This symmetric relation lists 
    pairs of yeast proteins that physically interact.  The relation tuples
    were gathered from the MIPS Comprehensive Yeast Genome Database.

To create data sets for information extraction research, we used the
relation instances (tuples) gathered from the databases above to label
abstracts gathered from MEDLINE.  It is important to note that the
labeling is somewhat noisy since it was done using a completely
automated method.  The method involved searching the abstracts for
sentences that contained co-occurrences of both terms of a given
tuple, and then labeling these sentences as positive instances.
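
The co-occurrence test described above can be sketched roughly as
follows.  This is only an illustration, not the code we used to build
the data sets: the function name label_sentence, the tuple layout, and
the whole-word, case-insensitive matching rule are all assumptions.

```python
import re

def label_sentence(sentence, tuples):
    """Return the tuples whose two elements both occur in the sentence.

    `tuples` is a list of (first_terms, second_terms) pairs; each element
    is a list of allowable terms (a hypothetical layout for illustration).
    """
    text = sentence.lower()

    def mentions(terms):
        # Assumed matching rule: whole-word, case-insensitive, any synonym.
        return any(
            re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
            for term in terms
        )

    return [t for t in tuples if mentions(t[0]) and mentions(t[1])]

tuples = [(["SKI3", "SKI5"], ["nucleus", "nuclei", "nuclear"])]

positive = label_sentence(
    "We cloned and sequenced the SKI3 gene and found that it encodes a "
    "163 kDa protein including a typical nuclear localization signal.",
    tuples,
)
negative = label_sentence(
    "We have studied a mitochondrial inorganic pyrophosphatase (PPase) "
    "in the yeast Saccharomyces cerevisiae.",
    tuples,
)
```

A sentence is treated as a positive instance when the returned list is
non-empty, which is also why a sentence that merely mentions both terms
ends up labeled positive.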

Thus, some of the instances that we treat as positive might not be
considered positive if they were labeled by hand.  The relation
instances themselves represent ground truth (according to the YPD,
OMIM and MIPS databases anyway).  However, it may be the case that
some positive sentences do not really assert a relation instance --
they just happen to reference both of the terms of the instance.
Additionally, there may be unlabeled positive instances in the corpora
since the databases used for labeling are not complete.

This method for assembling what we call "weakly labeled" training data
is described in Craven & Kumlien's ISMB '99 paper.  Earlier versions
of the subcellular-localization and disease-association data sets were
used in Ray & Craven's IJCAI '01 paper.  The data sets have been
cleaned up since then, however.  The sentence segmentation, parses and
labelings have all changed somewhat in the newest version of the data
sets.  Our most recent experiments with the cleaned-up data sets are
described in the IJCAI '03 paper by Skounakis, Craven and Ray.


DIRECTORY CONTENTS
------------------

Each data set is contained in its own tarball that expands out to a
parent directory.  The parent directory contains the following
directories and files:

  all/                     contains all of the instances in the data set

  pos/                     contains only the positive instances

  fold[12345].files        list the instances we use in each of the folds
                           of our 5-fold cross validation experiments

  fold[12345].files.equal  list the instances we use in each of the folds
                           of our 5-fold CV experiments when we have an
                           equal number of positive and negative instances
                           in each fold
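
As a rough sketch, the fold files could be turned into train/test
splits like this.  The layout of a fold file is not spelled out in this
document, so the code assumes one instance file name per non-empty
line; the helper names read_fold and cross_validation_splits are
hypothetical.

```python
from pathlib import Path

def read_fold(path):
    """Read one fold file, assuming one instance file name per line."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

def cross_validation_splits(parent_dir, suffix=""):
    """Yield (train, test) lists of instance names for folds 1 to 5.

    Pass suffix=".equal" to read the balanced fold files instead.
    """
    folds = [
        read_fold(Path(parent_dir) / f"fold{i}.files{suffix}")
        for i in range(1, 6)
    ]
    for i, test in enumerate(folds):
        # Training data is everything outside the held-out fold.
        train = [name for j, fold in enumerate(folds) if j != i for name in fold]
        yield train, test
```

Each fold serves once as the test set while the other four folds are
pooled for training.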


FILE FORMATS
------------

Each file contains a single sentence, which we refer to as an
instance.  Simple heuristics were used to partition abstracts into
sentences, and so there are likely to be a few instances that either
contain multiple sentences or are sentence fragments.

Here is an example file for a negative instance:
-----
"We have studied a mitochondrial inorganic pyrophosphatase (PPase) in the yeast Saccharomyces cerevisiae. "
0 NP_SEGMENT we{PN}
1 VP_SEGMENT have{V} studied{V}
2 NP_SEGMENT a{ART} mitochondrial{N} inorganic{UNK} pyrophosphatase{UNK}
3 PP_SEGMENT in{PREP}
4 NP_SEGMENT the{ART} yeast{N} saccharomyces{UNK} cerevisiae{UNK}
-----

The first line of the file contains the sentence itself.  Subsequent
lines show the representation we use in our models.  These lines show
phrase segments acquired by "flattening" output from the Sundance
Parser (developed by Ellen Riloff's group at the University of Utah).
Each line, which is prefixed by an index number, represents a phrase
type (NP_SEGMENT, VP_SEGMENT, etc.) and words associated with that
phrase.  Each word is followed by a part-of-speech tag (as predicted
by Sundance) in braces.  The "UNK" tag above represents "unknown".
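
A phrase-segment line can be read with a few lines of Python.  This is
a minimal sketch, assuming that words never contain whitespace or a
"{" character; the names TOKEN and parse_segment are ours, not part of
the data sets.

```python
import re

# One token is a word immediately followed by its tag in braces.
TOKEN = re.compile(r"(\S+?)\{([^}]*)\}")

def parse_segment(line):
    """Parse a line such as '3 PP_SEGMENT in{PREP}'.

    Returns (index, phrase_type, [(word, tag), ...]).
    """
    index, phrase_type, *rest = line.split(None, 2)
    tokens = TOKEN.findall(rest[0]) if rest else []
    return int(index), phrase_type, tokens

segment = parse_segment(
    "2 NP_SEGMENT a{ART} mitochondrial{N} inorganic{UNK} pyrophosphatase{UNK}"
)
```

Here segment holds the index 2, the phrase type "NP_SEGMENT", and four
(word, tag) pairs.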

Positive instances differ in that they annotate the tuples that should
be extracted from them.  Here is an example file for a positive instance:

-----
"We cloned and sequenced the SKI3 gene and found that it encodes a 163 kDa protein including a typical nuclear localization signal. "
[SKI3/SKI5/P9677.7/YPR189W,nucleus/nuclei/nuclear]
[4,12]

0 NP_SEGMENT we{PN}
1 VP_SEGMENT cloned{V}
2 CONJ and{CONJ}
3 VP_SEGMENT sequenced{V}
4 NP_SEGMENT:PROTEIN the{ART} ski3{N:PROTEIN} gene{N}
5 CONJ and{CONJ}
6 VP_SEGMENT found{V}
7 C_M that{C_M}
8 NP_SEGMENT it{PN}
9 VP_SEGMENT encodes{V}
10 NP_SEGMENT a{ART} 163{NUM} kda{UNK} protein{N}
11 VP_SEGMENT including{V}
12 NP_SEGMENT:LOCATION a{ART} typical{ADJ} nuclear{ADJ:LOCATION} localization{N} signal{N}
-----

There are several key things to notice.  First, the second line of the
file provides a canonical representation of the tuples represented by
the sentence.  Each tuple is listed inside square brackets and consists of
two parts: a list of all of the allowable terms for the first element
of the tuple (a protein, in this example), and a list of all of the
terms for the second element of the tuple (a location, in this
example).  Different terms are separated by /'s, and the two elements
of the tuple are separated by a comma.
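
The term line can be split apart as sketched below.  The sketch assumes
that terms contain neither commas nor slashes, and that a line holding
several tuples simply repeats the bracketed form; parse_term_tuples is
a hypothetical helper name.

```python
import re

def parse_term_tuples(line):
    """Parse a line such as '[A/B,x/y]' into [(['A', 'B'], ['x', 'y'])]."""
    tuples = []
    for body in re.findall(r"\[([^\]]*)\]", line):
        # The comma separates the two elements; slashes separate terms.
        first, _, second = body.partition(",")
        tuples.append((first.split("/"), second.split("/")))
    return tuples

term_tuples = parse_term_tuples(
    "[SKI3/SKI5/P9677.7/YPR189W,nucleus/nuclei/nuclear]"
)
```

For the example file above, term_tuples contains a single tuple with
four protein terms and three location terms.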

The third line of the file provides a different representation of the
tuples.  Here again, each tuple is listed inside a pair of square brackets, but
in this representation the tuple is described by the indices of the
phrase segments containing the term for the first element, and the
term for the second element.
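
The index line can be resolved against the phrase segments as in the
following sketch.  It assumes that each tuple is a bracketed pair of
non-negative integers; parse_index_tuples and the segments dictionary
are illustrative names only.

```python
import re

def parse_index_tuples(line):
    """Parse a line such as '[4,12]' into [(4, 12)]."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [(int(a), int(b)) for a, b in re.findall(pattern, line)]

# Segment text keyed by index (two segments from the example file above).
segments = {
    4: "NP_SEGMENT:PROTEIN the{ART} ski3{N:PROTEIN} gene{N}",
    12: "NP_SEGMENT:LOCATION a{ART} typical{ADJ} nuclear{ADJ:LOCATION} "
        "localization{N} signal{N}",
}

# Pair up the segment holding the first element with the one holding the second.
pairs = [(segments[a], segments[b]) for a, b in parse_index_tuples("[4,12]")]
```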

Another difference in the files for positive instances is that the
phrase segments containing tuple terms have labels attached to them
indicating the domain of the corresponding term(s).  The specific
terms in these phrase segments that match the tuple also have domain
labels attached to them.  These labels are included in the braces that
follow the terms (e.g. "nuclear{ADJ:LOCATION}" above).  Note that some
terms may involve multiple words.
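
The domain labels can be pulled out of a segment line as sketched
below.  The sketch assumes that a label is always the text after the
first ":" in a phrase type or tag; split_label and domain_terms are
hypothetical helper names.

```python
import re

# One token is a word immediately followed by its tag in braces.
TOKEN = re.compile(r"(\S+?)\{([^}]*)\}")

def split_label(tag):
    """Split 'N:PROTEIN' into ('N', 'PROTEIN'); plain tags give (tag, None)."""
    base, sep, domain = tag.partition(":")
    return base, (domain if sep else None)

def domain_terms(segment_line):
    """Return (segment_domain, [(word, domain), ...]) for one segment line."""
    parts = segment_line.split(None, 2)
    _, segment_domain = split_label(parts[1])
    body = parts[2] if len(parts) > 2 else ""
    words = []
    for word, tag in TOKEN.findall(body):
        _, domain = split_label(tag)
        if domain is not None:
            words.append((word, domain))
    return segment_domain, words

result = domain_terms(
    "12 NP_SEGMENT:LOCATION a{ART} typical{ADJ} nuclear{ADJ:LOCATION} "
    "localization{N} signal{N}"
)
```

For this segment, result is the domain "LOCATION" together with the
single labeled word "nuclear".  A multi-word term would show up as
several labeled words in the same segment.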

For the subcellular-localization data set, the domains are PROTEIN and
LOCATION.  For the disease-association data set, the domains are GENE
and DISEASE.  For the protein-interaction data set, the domains are
PROTEIN1 and PROTEIN2.

Finally, we note that the Sundance parser was provided with a
domain-specific lexicon to help improve the accuracy of its parses.  When it
processes compound terms that are in the lexicon (e.g. "endoplasmic
reticulum"), Sundance outputs these terms as one token with
underscores separating the individual constituents
(e.g. "endoplasmic_reticulum").  This lexicon includes some
subcellular locations.


REFERENCES
----------

@inproceedings{craven.ismb99
   ,author = "M. Craven and J. Kumlien"
   ,title = "Constructing Biological Knowledge Bases by
             Extracting Information from Text Sources"
   ,booktitle = "Proceedings of the Seventh International Conference on
                 Intelligent Systems for Molecular Biology"
   ,year = 1999
   ,publisher = "{AAAI} Press"
   ,address = "Heidelberg, Germany"
   ,pages = "77--86"
}


@inproceedings{ray.ijcai01
  ,author       = "S. Ray and M. Craven"
  ,title        = "Representing Sentence Structure in Hidden {M}arkov Models 
                   for Information Extraction"
  ,booktitle    = "Proceedings of the Seventeenth International Joint 
                   Conference on Artificial Intelligence"
  ,year         = 2001
  ,address      = "Seattle, WA"
  ,publisher    = "Morgan Kaufmann"
  ,pages        = "1273--1279"
}


@inproceedings{skounakis.ijcai03
   ,author = "M. Skounakis and M. Craven and S. Ray"
   ,title = "Hierarchical Hidden {M}arkov Models for
             Information Extraction"
   ,booktitle    = "Proceedings of the Eighteenth International Joint
                   Conference on Artificial Intelligence"
   ,year = 2003
}


@misc{omim
   ,title = "Online {M}endelian Inheritance in Man, {OMIM} ({TM})"
   ,author = "{Center for Medical Genetics, Johns Hopkins University
               and National Center for Biotechnology Information}"
   ,note = "http://www.ncbi.nlm.nih.gov/omim/"
   ,year = 2001
}


@article{mips
  ,author =  "H. W. Mewes and D. Frishman and C. Gruber and B. Geier and 
              D. Haase and A. Kaps and  K. Lemcke and  G. Mannhaupt and 
              F. Pfeiffer and C. Schüller and S. Stocker and B. Weil"
  ,title =    "{MIPS}: a database for genomes and protein sequences"
  ,journal =  "Nucleic Acids Research"
  ,year =     "2000"
  ,volume =   "28"
  ,pages =    "37--40"
}