Homework Assignment #2
Due at midnight on Tuesday, 10/25

Part 1

For this part of the homework, you are to implement a k-nearest neighbor learner for both classification and regression.

Your programs should read files that are in the ARFF format. In this format, each instance is described on a single line. The feature values are separated by commas, and the last value on each line is the class label (for classification) or numeric response (for regression) of the instance. Each ARFF file starts with a header section describing the features, class labels, and numeric responses. Lines starting with '%' are comments. See the link above for a brief, but more detailed description of the ARFF format. Your program should handle numeric and nominal attributes, and simple ARFF files (i.e. don't worry about sparse ARFF files and instance weights). Example ARFF files are provided below.

Your programs should should implement a k-nearest neighbor learner according to the following guidelines:

Your programs should be callable from the command line. The first program should be named kNN and should accept three command-line arguments as follows:
kNN <train-set-file> <test-set-file> k
This program should use the training set and the given value of k to make classifications/predictions for every instance in the test set.

The second program should be named kNN-select and should accept five command-line arguments as follows:
kNN-select <train-set-file> <test-set-file> k1 k2 k3
This program should use leave-one-out cross validation with just the training data to select the value of k to use for the test set by evaluating k1 k2 k3 and selecting the one that results in the minimal cross-validated error within the training set. To measure error for regression tasks, you should use mean absolute error.

If you are using a language that is not compiled to machine code (e.g. Java), then you should make scripts called kNN and kNN-select that accept the command-line arguments and invokes the appropriate source-code program and interpreter. See HW #1 for instructions on how to make such a script.

As output, your programs should print the value of k used for the test set on the first line, and then the predictions for the test-set instances. For each instance in the test set, your program should print one line of output with spaces separating the fields. For a classification task, each output line should list the predicted class label, and actual class label. This should be followed by a line listing the number of correctly classified test instances, and the total number of instances in the test set. For a regression task, each output line should list the predicted response, and the actual response value. This should be followed by a line listing the mean absolute error for the test instances, and the total number of instances in the test set.

You should test your code on the following two data sets:

Here are the outputs your first program (kNN) should produce (where k is specified as an argument): Here are the outputs kNN-select program should produce when run with k1=3, k2=5, k3=7: The first data set is for a task that involves predicting wine scores from chemical properties. The second data set is for a task that involves predicting the compartment in a cell that a yeast protein will localize to based on properties of its sequence.

Part 2

For this part you will explore the effect of the k parameter on predictive accuracy.

Part 3

Using the k-d tree and the training set displayed in the figure below, show how the nearest neighbor for x(q) = (7, 10) would be found. Assume that, for all instances in the figure, the features (i.e. coordinates) have integer values. For each step in the search, show the distance to the current node, the best distance found so far, the best node found so far, and the contents of the priority queue. You should use Euclidean distance and assume that the coordinates of the instances in the figure below are those shown in the table below.

kd-tree

Instance x y
a 2 11
b 3 12
c 5 10
d 2 8
e 2 4
f 6 3
g 9 2
h 12 5
i 10 10
j 13 11.5

Submitting Your Work

You should turn in your work electronically using the Canvas course management system. Turn in all source files and your runnable program as well as a file called hw2.pdf that shows your work for Parts 2 and 3. All files should be compressed as one zip file named <Wisc username>_hw2.zip. Upload this zip file as Homework #2 at the course Canvas site.