Project 3: Naive Bayes Classification

For this project, you will build a Naive Bayes classifier which will classify emails into one of two possible categories or classes: spam and not-spam (also known as ham). Your program will learn the difference (as best as it can) between the two classes by examining a collection of pre-classified example emails. Then, your program will see how well it can generalize what it has learned to new emails it has not seen before. The features your classifier will use are the presence or absence of every word in the training set of emails. Multiple occurrences of a word in an email should not affect your classifier --- only whether the word is present or not.

The program operates in two phases. First, in the training phase, your program reads in two different text files containing training examples for the two classes. These two files together are called the training set. Your program will examine the emails in the training set and, for each class, tabulate the number of training examples and, for each word in the vocabulary, the number of that class's emails in which the word appears. (The "vocabulary" is the set of all words that appear in the training set emails.)
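To make the bookkeeping concrete, here is a minimal Python sketch of the training-phase counting. The function name and data layout are illustrative, not required; it assumes each email has already been reduced to a set of distinct lowercase words:

def train(emails):
    # emails: list of sets, each holding the distinct lowercase words
    # of one email belonging to a single class.
    # Returns (number of emails, per-word email counts) for that class.
    doc_counts = {}
    for words in emails:
        for w in words:
            doc_counts[w] = doc_counts.get(w, 0) + 1
    return len(emails), doc_counts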

Second, in the testing phase, your program will read in another two text files containing a new set of examples for each of the two classes. These two files together are called the testing set. Your program will run the classifier on each email in the testing set and classify it into one of the two classes by identifying the MAP hypothesis. Your code will report, for each email in the testing set, the log-probability of the email belonging to each of the two classes, the predicted class, and whether or not the prediction was correct. At the end, your program will report the number of examples in the testing set that were classified correctly.
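In symbols, with presence/absence features the MAP decision rule works out to the following, where $V$ is the vocabulary and $x_w \in \{0,1\}$ indicates whether word $w$ appears in the email (this handout does not pin down how to estimate $P(x_w \mid c)$ from the training counts; Laplace smoothing is one common choice):

\[
\hat{c} \;=\; \operatorname*{arg\,max}_{c \,\in\, \{\text{spam},\,\text{ham}\}} \Bigl[\, \log P(c) \;+\; \sum_{w \in V} \log P(x_w \mid c) \,\Bigr]
\]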

It is not realistic to think that your classifier will perform perfectly. After all, the only information your classifier will have to work with is which of the vocabulary words are present in an email. Even more sophisticated spam classifiers make mistakes, too! Therefore, do not be concerned if the program reports a classifier accuracy below 100%; that does not necessarily imply your program is doing something wrong.

File format

Each file is set up like this:
<SUBJECT>
a single line with the subject of email 1 (might be blank)
</SUBJECT>
<BODY>
line 1 of the body
line 2 of the body
more lines.... possibly blank
</BODY>
<SUBJECT>
a single line with the subject of email 2 (might be blank)
</SUBJECT>
<BODY>
line 1 of the body
line 2 of the body
more lines.... possibly blank
</BODY>
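Here is a minimal Python sketch of a reader for this format (the function name and return structure are my own, not part of the assignment); it collects each email's subject line and body lines together:

def read_emails(filename):
    # Parse one file in the format above into a list of emails,
    # each email being a list of its text lines (subject + body).
    emails, current = [], None
    with open(filename) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "<SUBJECT>":
                current = []                  # start a new email
            elif line in ("</SUBJECT>", "<BODY>"):
                pass                          # tag lines carry no content
            elif line == "</BODY>":
                emails.append(current)        # email is complete
                current = None
            elif current is not None:
                current.append(line)          # subject or body text
    return emails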

Log probabilities

Due to the limits of how floating-point numbers are stored on computers, you should use log-probabilities in your calculations. Multiplying tens of thousands of small per-word probabilities together underflows to zero; adding their logarithms instead gives the same comparison without underflow.
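As a quick sanity check (the specific numbers here are only for illustration):

import math

p = 0.01
product = 1.0
for _ in range(78082):       # one factor per vocabulary word
    product *= p             # underflows long before the loop ends
log_sum = 78082 * math.log(p)
print(product)               # prints 0.0 -- the value is lost
print(log_sum)               # about -359581.0 -- still usable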

Notes on training and testing

For the basic project, ignore any distinction between words in the body and words in the subject. Tokenize the input by splitting on spaces or newlines, convert all words to lowercase, and you should be good to go. Be sure the number of features (vocabulary size) matches my output.
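A sketch of that tokenization in Python (str.split() with no argument splits on any run of whitespace, which matches splitting on spaces and newlines for this data):

def tokenize(lines):
    # Return the set of distinct lowercase words in one email;
    # using a set enforces presence/absence from the start.
    words = set()
    for line in lines:
        words.update(line.lower().split())
    return words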

Output

For each email in the testing set (consisting of two files, the spam testing file and the ham testing file), print one line formatted exactly like this:
TEST 144 66/78082 features true -344.684 -331.764 ham right
I'm asking for this specific format because I will use the "diff" utility to compare your output against the correct output, so the format has to match exactly.

At the very end of your output, report the number of emails in the testing set that were classified correctly.

You can have other things in the output, such as diagnostic information about priors or likelihoods or anything you want that helps you debug your project, as long as the specific output lines above are there.

Sample output

Your program should prompt for filenames in this order: training file for spam, training file for ham, testing file for spam, testing file for ham.
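In Python that could be as simple as the following (the prompt wording is up to you; only the order matters):

train_spam_file = input("Training file for spam: ")
train_ham_file = input("Training file for ham: ")
test_spam_file = input("Testing file for spam: ")
test_ham_file = input("Testing file for ham: ")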

Your output must contain the TEST lines and, at the end, the final count of correctly classified emails; it may also contain any other diagnostic messages you want.

Small data set

train-spam-small.txt train-ham-small.txt test-spam-small.txt test-ham-small.txt 
[ output ] [ output with only the TEST lines ]

Useful statistics: vocab size = 3, priors = 0.6 and 0.4, 2 out of 2 test set emails classified correctly.

Full data set

(Note that the training files are 10 and 17 megabytes in size.)
train-spam.txt train-ham.txt test-spam.txt test-ham.txt 
[ output ] [ output with only the TEST lines ]

Useful statistics: vocab size = 78082, priors = 0.7866966480154788 and 0.21330335198452124, 1179 out of 1551 test set emails classified correctly.


Extra Credit

Before you attempt the extra credit, get the main program working first, and make sure your output matches mine! Then save a copy of your original program before you start modifying it, because I'll still want to run the original version before I start grading the extra credit version.

For extra credit (up to 10 percentage points), add additional features to the model or change the existing features as you see fit to try to improve the accuracy of the classification. You should still use the Naive Bayes setup, so don't muck around with the math, but you can add new features, e.g., the length of the email, words present/absent in the subject versus the body, word counts instead of just presence/absence, etc.

If you attempt the extra credit, turn in a separate copy of your program for this, with a writeup explaining what features you added, why you thought they would help, and sample output showing if they did (report how the classification accuracy changed).

Grading

You will be graded on correctness of your program and the thoroughness of your writeup (for extra credit).