Craven Group Information Extraction Data Sets
(version 2, 8/29/02)
(documentation updated 5/2/03)

Mark Craven (craven@biostat.wisc.edu)
Soumya Ray
Marios Skounakis

Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin


BACKGROUND
----------

For our research in information extraction, we have put together three
data sets from MEDLINE abstracts. Each data set has been labeled with
instances of a particular binary relation. The three relations, and the
databases from which the relation instances were obtained, are as
follows:

subcellular-localization(PROTEIN, LOCATION):
    This relation characterizes the subcellular compartments in which
    particular yeast proteins localize. The relation tuples were
    gathered from the (now defunct) Yeast Proteome Database (YPD).

disease-association(GENE, DISEASE):
    This relation lists associations between particular genes and
    diseases. The relation tuples were gathered from the Online
    Mendelian Inheritance in Man (OMIM) database.

protein-interaction(PROTEIN, PROTEIN):
    This symmetric relation lists pairs of yeast proteins that
    physically interact. The relation tuples were gathered from the
    MIPS Comprehensive Yeast Genome Database.

To create data sets for information extraction research, we used the
relation instances (tuples) gathered from the databases above to label
abstracts gathered from MEDLINE. Note that the labeling is somewhat
noisy because it was done using a completely automated method. The
method searched the abstracts for sentences in which both terms of a
given tuple co-occur, and labeled those sentences as positive
instances. Thus, some of the instances that we treat as positive might
not be labeled as such by a human annotator. The relation instances
themselves represent ground truth (according to the YPD, OMIM and MIPS
databases anyway).
However, it may be the case that some positive sentences do not really
assert a relation instance; they just happen to mention both terms of
the instance. Additionally, there may be unlabeled positive instances in
the corpora, since the databases used for labeling are not complete.

This method for assembling what we call "weakly labeled" training data
is described in Craven & Kumlien's ISMB '99 paper. Earlier versions of
the subcellular-localization and disease-association data sets were used
in Ray & Craven's IJCAI '01 paper. The data sets have been cleaned up
since then, however: the sentence segmentation, parses and labelings
have all changed somewhat in the newest version. Our most recent
experiments with the cleaned-up data sets are described in the IJCAI '03
paper by Skounakis, Craven and Ray.


DIRECTORY CONTENTS
------------------

Each data set is contained in its own tarball, which expands to a parent
directory. The parent directory contains the following directories and
files:

all/
    contains all of the instances in the data set

pos/
    contains only the positive instances

fold[12345].files
    list the instances we use in each of the folds of our 5-fold
    cross-validation experiments

fold[12345].files.equal
    list the instances we use in each of the folds of our 5-fold
    cross-validation experiments when we have an equal number of
    positive and negative instances in each fold


FILE FORMATS
------------

Each file contains a single sentence, which we refer to as an instance.
Simple heuristics were used to partition abstracts into sentences, so
there are likely to be a few instances that either contain multiple
sentences or are sentence fragments.

Here is an example file for a negative instance:

-----
"We have studied a mitochondrial inorganic pyrophosphatase (PPase) in the yeast Saccharomyces cerevisiae. "
0 NP_SEGMENT we{PN}
1 VP_SEGMENT have{V} studied{V}
2 NP_SEGMENT a{ART} mitochondrial{N} inorganic{UNK} pyrophosphatase{UNK}
3 PP_SEGMENT in{PREP}
4 NP_SEGMENT the{ART} yeast{N} saccharomyces{UNK} cerevisiae{UNK}
-----

The first line of the file contains the sentence itself. Subsequent
lines show the representation we use in our models. These lines show
phrase segments acquired by "flattening" output from the Sundance parser
(developed by Ellen Riloff's group at the University of Utah). Each
line, which is prefixed by an index number, gives a phrase type
(NP_SEGMENT, VP_SEGMENT, etc.) and the words associated with that
phrase. Each word is followed by a part-of-speech tag (as predicted by
Sundance) in braces. The "UNK" tag above stands for "unknown".

Positive instances differ in that they annotate the tuples that should
be extracted from them. Here is an example file for a positive instance:

-----
"We cloned and sequenced the SKI3 gene and found that it encodes a 163 kDa protein including a typical nuclear localization signal. "
[SKI3/SKI5/P9677.7/YPR189W,nucleus/nuclei/nuclear]
[4,12]
0 NP_SEGMENT we{PN}
1 VP_SEGMENT cloned{V}
2 CONJ and{CONJ}
3 VP_SEGMENT sequenced{V}
4 NP_SEGMENT:PROTEIN the{ART} ski3{N:PROTEIN} gene{N}
5 CONJ and{CONJ}
6 VP_SEGMENT found{V}
7 C_M that{C_M}
8 NP_SEGMENT it{PN}
9 VP_SEGMENT encodes{V}
10 NP_SEGMENT a{ART} 163{NUM} kda{UNK} protein{N}
11 VP_SEGMENT including{V}
12 NP_SEGMENT:LOCATION a{ART} typical{ADJ} nuclear{ADJ:LOCATION} localization{N} signal{N}
-----

There are several key things to notice. First, the second line of the
file provides a canonical representation of the tuples represented by
the sentence. Each tuple is listed inside square brackets and consists
of two parts: a list of all of the allowable terms for the first element
of the tuple (a protein, in this example), and a list of all of the
terms for the second element of the tuple (a location, in this example).
Different terms are separated by slashes (/), and the two elements of
the tuple are separated by a comma.

The third line of the file provides a different representation of the
tuples. Here again, each tuple is listed inside a pair of square
brackets, but in this representation the tuple is described by the
indices of the phrase segments containing the term for the first element
and the term for the second element.

Another difference in the files for positive instances is that the
phrase segments containing tuple terms have labels attached to them
indicating the domain of the corresponding term(s). The specific terms
in these phrase segments that match the tuple also have domain labels
attached to them. These labels are included in the braces that follow
the terms (e.g. "nuclear{ADJ:LOCATION}" above). Note that some terms may
involve multiple words.

For the subcellular-localization data set, the domains are PROTEIN and
LOCATION. For the disease-association data set, the domains are GENE and
DISEASE. For the protein-interaction data set, the domains are PROTEIN1
and PROTEIN2.

Finally, we note that the Sundance parser was provided with a
domain-specific lexicon to help improve the accuracy of its parses. This
lexicon includes some subcellular locations. When Sundance processes
compound terms that are in the lexicon (e.g. "endoplasmic reticulum"),
it outputs these terms as one token with underscores separating the
individual constituents (e.g. "endoplasmic_reticulum").


REFERENCES
----------

@inproceedings{craven.ismb99
,author = "M. Craven and J. Kumlien"
,title = "Constructing Biological Knowledge Bases by Extracting Information from Text Sources"
,booktitle = "Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology"
,year = 1999
,publisher = "{AAAI} Press"
,address = "Heidelberg, Germany"
,pages = "77--86"
}

@inproceedings{ray.ijcai01
,author = "S. Ray and M. Craven"
,title = "Representing Sentence Structure in Hidden {M}arkov Models for Information Extraction"
,booktitle = "Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence"
,year = 2001
,address = "Seattle, WA"
,publisher = "Morgan Kaufmann"
,pages = "1273--1279"
}

@inproceedings{skounakis.ijcai03
,author = "M. Skounakis and M. Craven and S. Ray"
,title = "Hierarchical Hidden {M}arkov Models for Information Extraction"
,booktitle = "Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence"
,year = 2003
}

@misc{omim
,title = "Online {M}endelian Inheritance in Man, {OMIM} ({TM})"
,author = "{Center for Medical Genetics, Johns Hopkins University and National Center for Biotechnology Information}"
,note = "http://www.ncbi.nlm.nih.gov/omim/"
,year = 2001
}

@article{mips
,author = "H. W. Mewes and D. Frishman and C. Gruber and B. Geier and D. Haase and A. Kaps and K. Lemcke and G. Mannhaupt and F. Pfeiffer and C. Schüller and S. Stocker and B. Weil"
,title = "{MIPS}: a database for genomes and protein sequences"
,journal = "Nucleic Acids Research"
,year = "2000"
,volume = "28"
,pages = "37--40"
}
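

EXAMPLE: READING AN INSTANCE FILE
---------------------------------

The instance layout described under FILE FORMATS is regular enough to
read with a short script. The Python sketch below is an illustration
only and is not distributed with the data sets. The function names
(parse_word, parse_segment, parse_instance) are hypothetical, and the
code rests on several assumptions: the sentence occupies a single line,
fields on segment lines are separated by whitespace, terms contain no
commas or brackets, and a bracketed group holding two integers is an
index pair rather than a term tuple.

```python
import re

# Matches the contents of each [...] group on a tuple line.
BRACKET_RE = re.compile(r"\[([^\]]*)\]")


def parse_word(token):
    """Split 'nuclear{ADJ:LOCATION}' into ('nuclear', 'ADJ', 'LOCATION').

    The domain is None when the token carries only a part-of-speech tag.
    """
    word, _, tag = token.rstrip("}").rpartition("{")
    pos, _, domain = tag.partition(":")
    return word, pos, domain or None


def parse_segment(line):
    """Parse one phrase-segment line, e.g. '4 NP_SEGMENT:PROTEIN the{ART} ...'."""
    index, phrase, *tokens = line.split()
    phrase_type, _, domain = phrase.partition(":")
    return {
        "index": int(index),
        "type": phrase_type,
        "domain": domain or None,
        "words": [parse_word(t) for t in tokens],
    }


def parse_instance(text):
    """Parse the full text of one instance file into a dictionary."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    instance = {
        "sentence": lines[0].strip('"').strip(),
        "tuples": [],       # (first-element terms, second-element terms)
        "index_pairs": [],  # (segment index, segment index)
        "segments": [],
    }
    for ln in lines[1:]:
        if ln.startswith("["):
            for group in BRACKET_RE.findall(ln):
                first, _, second = group.partition(",")
                if first.strip().isdigit() and second.strip().isdigit():
                    # Assumption: two integers means segment indices.
                    instance["index_pairs"].append((int(first), int(second)))
                else:
                    instance["tuples"].append(
                        (first.split("/"), second.split("/"))
                    )
        else:
            instance["segments"].append(parse_segment(ln))
    # Files without a tuple line are treated as negative instances.
    instance["positive"] = bool(instance["tuples"])
    return instance


if __name__ == "__main__":
    example = "\n".join([
        '"We cloned and sequenced the SKI3 gene. "',
        "[SKI3/SKI5/P9677.7/YPR189W,nucleus/nuclei/nuclear]",
        "[4,12]",
        "4 NP_SEGMENT:PROTEIN the{ART} ski3{N:PROTEIN} gene{N}",
    ])
    parsed = parse_instance(example)
    print(parsed["tuples"])
    print(parsed["segments"][0])
```

Returning plain dictionaries and tuples keeps the sketch free of
dependencies; a real loader would probably also want to check that each
index pair refers to segments that exist in the file.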