This directory contains the "University Computer Science Department" data set
used for the experiments in the publications:

  @article{craven.mlj01
     ,author      = "M. Craven and S. Slattery"
     ,title       = "Relational Learning with Statistical Predicate Invention:
                     Better Models for Hypertext"
     ,journal     = "Machine Learning"
     ,year        = 2001
     ,volume      = 43
     ,number      = "1-2"
     ,pages       = "97--119"
  }

  @inproceedings{slattery.ilp98
     ,author      = "S. Slattery and M. Craven"
     ,title       = "Combining Statistical and Relational Methods for Learning in
                     Hypertext Domains"
     ,booktitle   = "Proceedings of the Eighth International Conference on
                     Inductive Logic Programming"
     ,year        = 1998
  }

This data set consists of Web pages and hyperlinks collected from four
computer science departments: Cornell University, The University of
Texas, The University of Washington, and The University of Wisconsin.

Our experiments used leave-one-university-out cross-validation: in each of
four runs, we trained on data from three of the universities and tested on
data from the university held out.
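
The partitioning scheme above can be sketched in Python as follows.  This is
only an illustration, not code from the original experiments.  The lowercase
identifiers "texas" and "washington" are assumed by analogy with the "cornell"
and "wisconsin" suffixes that appear in the file names in this README, and
the function name is hypothetical.

```python
# Assumed identifiers: "cornell" and "wisconsin" appear in file names in this
# README; "texas" and "washington" are guesses by analogy.
UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]


def leave_one_university_out(universities=UNIVERSITIES):
    """Yield (training_universities, held_out_university) for each fold."""
    for held_out in universities:
        train = [u for u in universities if u != held_out]
        yield train, held_out


for train, held_out in leave_one_university_out():
    print(f"train on {', '.join(train)}; test on {held_out}")
```

Each fold's held-out name is the one that would replace <univ> in the
"sans-<univ>" file names.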

The following files contain data in the input format used by Quinlan's FOIL
code.  We assembled our training/test data sets by concatenating an
appropriate set of these files.

   common.gz
      Type definitions and background relations used in all of our experiments.
      The types defined are "page" and "linkid".  The background relations
      defined in this file are "link-to" which specifies hyperlink connections,
      and "all-words-capitalized" and "has-alphanumeric-word" which are Boolean
      predicates characterizing the anchor text of hyperlinks.

   page-words.sans-<univ>.gz
      This set of files provides a bag-of-words representation of the words
      that occur in pages in the data set.  Each predicate in these files
      specifies a stemmed word, and the instances of the predicate are those
      pages that contain the word.  There are four files in this set, one
      for each choice of held-out university.  The files differ for each
      training/test partition because the vocabulary was pruned by
      considering the frequency of word occurrences only in the training
      set.  The notation "sans-<univ>" names the university held out for
      testing: for example, when Cornell is the university in the test set,
      you should use the file page-words.sans-cornell.gz.

   anchor-words.sans-<univ>.gz
      This is the analogous set of files for words that occur in the anchor
      text of hyperlinks.

   neighborhood-words.sans-<univ>.gz
      This is the analogous set of files for words that occur in the
      "neighboring" text of hyperlinks.  The neighborhood of a hyperlink
      includes words in a single paragraph, list item, table entry, title
      or heading in which the hyperlink is contained.
		              
   page-classes.sans-<univ>.gz
      This set of files contains a set of predicates indicating the class
      of each page in the data set.  For training-set instances, the true
      class labels are used.  For test-set instances, predicted class labels
      are used.  These predictions were made using a method that combined
      statistical text classifiers with a URL-based clustering method.
      IMPORTANT: these files should be used only for the binary target
      relations (department-of, instructors-of, and members-of-project).

   department-of.sans-<univ>.gz
      These are the training and test set instances for the target relation
      "department-of".

   instructors-of.sans-<univ>.gz
      These are the training and test set instances for the target relation
      "instructors-of".

   members-of-project.sans-<univ>.gz
      These are the training and test set instances for the target relation
      "members-of-project".

   student.sans-<univ>.gz 
      These are the training and test set instances for the target relation
      "student".

   course.sans-<univ>.gz 
      These are the training and test set instances for the target relation
      "course".

   research.project.sans-<univ>.gz 
      These are the training and test set instances for the target relation
      "research.project".

   faculty.sans-<univ>.gz 
      These are the training and test set instances for the target relation
      "faculty".
      
To set up an input file for FOIL or another algorithm, concatenate the
appropriate set of files.  For example, to train on Cornell, Texas and
Washington while testing on Wisconsin for the "instructors-of" relation,
concatenate the following files:
   common.gz
   page-words.sans-wisconsin.gz
   anchor-words.sans-wisconsin.gz
   neighborhood-words.sans-wisconsin.gz
   page-classes.sans-wisconsin.gz
   instructors-of.sans-wisconsin.gz

To learn the "student" target relation using the same training/test partition,
concatenate the following files (page-classes is omitted because "student" is
not one of the binary target relations):
   common.gz
   page-words.sans-wisconsin.gz
   anchor-words.sans-wisconsin.gz
   neighborhood-words.sans-wisconsin.gz
   student.sans-wisconsin.gz
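
The two file lists above follow a simple rule, which can be sketched in
Python as follows.  This is a minimal sketch under stated assumptions, not
the procedure used for the original experiments: the helper names
input_files and build_input are hypothetical, and writing a single
uncompressed output file is an assumption about what the learner expects.

```python
import gzip
import shutil
from pathlib import Path

WORD_FILES = ["page-words", "anchor-words", "neighborhood-words"]
# page-classes files are meant only for the binary target relations.
BINARY_TARGETS = {"department-of", "instructors-of", "members-of-project"}


def input_files(target, held_out):
    """Return the file names to concatenate, in order, for one experiment."""
    names = ["common.gz"]
    names += [f"{stem}.sans-{held_out}.gz" for stem in WORD_FILES]
    if target in BINARY_TARGETS:
        names.append(f"page-classes.sans-{held_out}.gz")
    names.append(f"{target}.sans-{held_out}.gz")
    return names


def build_input(data_dir, target, held_out, out_path):
    """Decompress each part and append it to one uncompressed output file."""
    with open(out_path, "wb") as out:
        for name in input_files(target, held_out):
            with gzip.open(Path(data_dir) / name, "rb") as part:
                shutil.copyfileobj(part, out)


print("\n".join(input_files("instructors-of", "wisconsin")))
```

For example, build_input(".", "student", "wisconsin", "student.in") would
write the five "student" files listed above, decompressed and in order, to
the (hypothetically named) file student.in.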


In our experiments with the FOIL-PILFS algorithm that we developed, we
did not give the learner the page-words.sans-<univ>, anchor-words.sans-<univ>,
or neighborhood-words.sans-<univ> relations.  Instead the algorithm had
direct access to the documents representing these features.  The files
containing these documents are in the following subdirectories:

   page-text
      This directory contains the full text of all the Web pages used in our
      experiments.  It has seven subdirectories, one for each class label
      (course, department, faculty, other, research.project, staff and
      student).  Each of these contains one subdirectory per university,
      which holds the actual documents.

      The name of each document is taken from the URL of the document, with
      any '/' characters replaced by '^' characters. These derived names
      are exactly the same as the names used in the background relation files
      above.

      It is VERY IMPORTANT that the MIME header (everything up to the first
      blank line) be discarded from each document before doing any learning.
      The header records when the document was fetched from the Web, and a
      learner could exploit this information to predict class membership,
      which would be cheating.

   anchor-text
      This directory contains a subdirectory for each university. The 
      anchor text for each hyperlink in the web pages for a university
      is contained in that university's subdirectory. These files have
      no MIME header, but can contain HTML tags.

   neighborhood-text
      This directory is structured identically to anchor-text.  Here the
      files contain fragments of hypertext corresponding to the
      "neighborhood" of the hyperlink.  Again, these files do not contain
      MIME headers and can contain HTML tags.
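
The page-text conventions described above (label/university directory
layout, URL-derived document names, and the MIME header that must be
discarded) can be sketched in Python as follows.  This is a hedged
illustration, not code from the original experiments.  All function names
are hypothetical.  Reading files as Latin-1, treating a whitespace-only line
as the blank line that ends the header, and returning a document unchanged
when it has no blank line are all assumptions.

```python
from pathlib import Path


def url_to_doc_name(url):
    """Derive a document name from a URL by replacing '/' with '^'."""
    return url.replace("/", "^")


def strip_mime_header(raw):
    """Drop everything up to and including the first blank line."""
    lines = raw.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == "":
            return "".join(lines[i + 1:])
    return raw  # assumption: no blank line means there is no header


def read_page(path):
    """Read one page-text document with its MIME header removed."""
    # Latin-1 is an assumed encoding; it decodes any byte sequence.
    return strip_mime_header(Path(path).read_text(encoding="latin-1"))


def iter_page_text(root):
    """Yield (label, university, path) for each document under page-text."""
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        for univ_dir in sorted(label_dir.iterdir()):
            if not univ_dir.is_dir():
                continue
            for doc in sorted(univ_dir.iterdir()):
                if doc.is_file():
                    yield label_dir.name, univ_dir.name, doc
```

A caller might loop over iter_page_text("page-text") and pass each path to
read_page, so that no header text ever reaches the learner.  The anchor-text
and neighborhood-text files have no MIME header and should not be passed
through strip_mime_header.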