This directory contains the "University Computer Science Department" data set used for the experiments in the publications:

@article{craven.mlj01
  ,author = "M. Craven and S. Slattery"
  ,title = "Relational Learning with Statistical Predicate Invention: Better Models for Hypertext"
  ,journal = "Machine Learning"
  ,year = 2001
  ,volume = 43
  ,number = "1-2"
  ,pages = "97--119"
}

@inproceedings{slattery.ilp98
  ,author = "S. Slattery and M. Craven"
  ,title = "Combining Statistical and Relational Methods for Learning in Hypertext Domains"
  ,booktitle = "Proceedings of the Eighth International Conference on Inductive Logic Programming"
  ,year = 1998
}

This data set consists of Web pages and hyperlinks collected from four computer science departments: Cornell University, The University of Texas, The University of Washington, and The University of Wisconsin. The methodology used for our experiments was leave-one-university-out cross-validation in which we would train on data from three of the universities and test on data from the university held out.

The following files contain data in the input format used by Quinlan's FOIL code. We assembled our training/test data sets by concatenating an appropriate set of these files.

common.gz
    Type definitions and background relations used in all of our experiments. The types defined are "page" and "linkid". The background relations defined in this file are "link-to" which specifies hyperlink connections, and "all-words-capitalized" and "has-alphanumeric-word" which are Boolean predicates characterizing the anchor text of hyperlinks.

page-words.sans-.gz
    This set of files provides a bag-of-words representation of the words that occur in pages in the data set. Each predicate in these files specifies a stemmed word, and the instances of the predicate are those pages that contain the word. There are four files in this set -- one for each university.
    The files differ for each training/test partition because the vocabulary was pruned by considering the frequency of word occurrences only in the training set. The notation "sans-" means that when Cornell is the university in the test set, you should use the file page-words.sans-cornell.gz.

anchor-words.sans-.gz
    This is the analogous set of files for words that occur in the anchor text of hyperlinks.

neighborhood-words.sans-.gz
    This is the analogous set of files for words that occur in the "neighboring" text of hyperlinks. The neighborhood of a hyperlink includes words in a single paragraph, list item, table entry, title or heading in which the hyperlink is contained.

page-classes.sans-.gz
    This set of files contains a set of predicates indicating the class of each page in the data set. For training-set instances, the true class labels are used. For test-set instances, predicted class labels are used. These predictions were made using a method that combined statistical text classifiers with a URL-based clustering method. IMPORTANT: these files should be used only for the binary target relations (department-of, instructors-of, and members-of-project).

department-of.sans-.gz
    These are the training and test set instances for the target relation "department-of".

instructors-of.sans-.gz
    These are the training and test set instances for the target relation "instructors-of".

members-of-project.sans-.gz
    These are the training and test set instances for the target relation "members-of-project".

student.sans-.gz
    These are the training and test set instances for the target relation "student".

course.sans-.gz
    These are the training and test set instances for the target relation "course".

research.project.sans-.gz
    These are the training and test set instances for the target relation "research.project".

faculty.sans-.gz
    These are the training and test set instances for the target relation "faculty".
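
As a rough illustration of the naming convention above, the following Python sketch lists the files needed for one target relation and one held-out university, then decompresses and concatenates them into a single input file. It is not part of the original distribution. The function names and the data directory argument are hypothetical, the university tokens "texas" and "washington" are assumed by analogy with "cornell" and "wisconsin", and the sketch assumes the learner expects uncompressed text.

```python
import gzip
import shutil
from pathlib import Path

# Assumption: tokens used in file names. "cornell" and "wisconsin" appear
# in this README; "texas" and "washington" are guesses by analogy.
UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

# Binary target relations, the only ones that should use page-classes files.
BINARY_TARGETS = {"department-of", "instructors-of", "members-of-project"}
UNARY_TARGETS = {"student", "course", "research.project", "faculty"}


def input_files(target, held_out):
    """Return the ordered list of file names for one train/test partition."""
    if target not in BINARY_TARGETS | UNARY_TARGETS:
        raise ValueError(f"unknown target relation: {target}")
    names = ["common.gz"]
    for prefix in ("page-words", "anchor-words", "neighborhood-words"):
        names.append(f"{prefix}.sans-{held_out}.gz")
    if target in BINARY_TARGETS:
        # Page class predicates are included only for binary relations.
        names.append(f"page-classes.sans-{held_out}.gz")
    names.append(f"{target}.sans-{held_out}.gz")
    return names


def build_input(data_dir, target, held_out, out_path):
    """Decompress each file and append it to one plain-text input file."""
    with open(out_path, "wb") as out:
        for name in input_files(target, held_out):
            with gzip.open(Path(data_dir) / name, "rb") as src:
                shutil.copyfileobj(src, out)
```

Looping over UNIVERSITIES and calling build_input once per held-out university would produce the four partitions of the leave-one-university-out setup.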

To set up an input file for FOIL or another algorithm, you should concatenate the right set of files together. For example, to train on Cornell, Texas and Washington while testing on Wisconsin for the "instructors-of" relation, you should concatenate the following files:

    common.gz
    page-words.sans-wisconsin.gz
    anchor-words.sans-wisconsin.gz
    neighborhood-words.sans-wisconsin.gz
    page-classes.sans-wisconsin.gz
    instructors-of.sans-wisconsin.gz

To learn the "student" target relation using the same training/test partition, you should concatenate the following files:

    common.gz
    page-words.sans-wisconsin.gz
    anchor-words.sans-wisconsin.gz
    neighborhood-words.sans-wisconsin.gz
    student.sans-wisconsin.gz

In our experiments with the FOIL-PILFS algorithm that we developed, we did not give the learner the page-words.sans-, anchor-words.sans-, or neighborhood-words.sans- relations. Instead the algorithm had direct access to the documents representing these features. The files containing these documents are in the following subdirectories:

page-text
    This directory contains the full text of all the web pages used in our experiments. Seven subdirectories contain the documents for each label (course, department, faculty, other, research.project, staff and student). In each of these directories, there is a directory for each university and it contains the actual documents. The name of each document is taken from the URL of the document, with any '/' characters replaced by '^' characters. These derived names are exactly the same as the names used in the background relation files above.

    It is VERY IMPORTANT that the MIME header (everything up to the first blank line) be discarded from each document before doing any learning. The header contains information about when the document was fetched from the web, and it may be the case that using this information to predict class membership is helpful (and cheating).

anchor-text
    This directory contains a subdirectory for each university.
    The anchor text for each hyperlink in the web pages for a university is contained in that university's subdirectory. These files have no MIME header, but can contain HTML tags.

neighborhood-text
    This directory is structured identically to anchor-text. Here the files contain fragments of hypertext corresponding to the "neighborhood" of the hyperlink. Again these files do not contain MIME headers and can contain HTML tags.
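
The two page-text conventions described above (document names derived from URLs, and the MIME header that must be discarded) can be sketched in Python as follows. This is only an illustration under stated assumptions: both function names are hypothetical, the replacement is assumed to apply to the whole URL string, and a document with no blank line is returned unchanged, which may not be the right choice for every use.

```python
def document_name(url):
    """Derive a document file name by replacing every '/' with '^'.

    Assumption: the replacement applies to the URL string as given.
    """
    return url.replace("/", "^")


def strip_mime_header(raw):
    """Return the document body with the MIME header removed.

    The header is everything up to the first blank line. Line endings are
    normalized to "\n" first, so the returned body also uses "\n".
    """
    text = raw.replace(b"\r\n", b"\n")
    header, sep, body = text.partition(b"\n\n")
    # Assumption: with no blank line there is no recognizable header,
    # so the (normalized) input is returned as-is.
    return body if sep else text
```

Only the files under page-text need strip_mime_header; the anchor-text and neighborhood-text files have no MIME header. The URL passed to document_name in any test should be treated as a made-up example, not as a page from the data set.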