Data Mining Project Page

This project is meant for you to get your hands "dirty" with real-world data and tackle a problem of interest to you. The project is open-ended, so select a topic that interests both you and your group.

The project should be data-focused. Although in statistics you usually try to formulate a hypothesis first and then go about collecting data, often in data mining it is the other way around. There are multitudes of interesting data sets out there that can be collected fairly easily. See below for ideas.

Guidelines for Group Project

Group Size: Groups should consist of 3 people.
Software: You may either do your data analysis using WEKA or write code to implement the desired algorithms. If you'd like to use different software, please see me so we can chat about this option.
Timeline:

The project is extremely open-ended. It should consist of the following:

Ideas for projects and datasets:

There are lots of data sets available online. Pick something that you will enjoy working on, and something where there is a rich source of data available. Take some time in selecting a good data set; feel free to ask me for suggestions. You should also research where and how your data was collected. A good final report will examine biases that may exist in the data as well as any data quality issues.

Resources for getting data into .arff format
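
If your data starts out as a comma-separated file, the conversion to .arff can be scripted. The following is a minimal sketch in Python, not an official WEKA tool: the function name csv_to_arff and the sample weather data are made up for illustration. It assumes the first row is a header, every row has as many fields as the header, columns whose values all parse as numbers are numeric, every other column is nominal, and empty cells are missing values (written as ? in ARFF).

```python
import csv
import io


def _is_number(value):
    """Return True if float() accepts the string."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def _quote(value):
    """Single-quote a name or label that contains ARFF special characters."""
    if any(ch in value for ch in " ,'\"{}%\t"):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return value


def csv_to_arff(csv_text, relation="my_data"):
    """Convert CSV text (with a header row) into ARFF text.

    Illustrative helper: assumes each row has one field per header column.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [[cell.strip() for cell in row] for row in reader if row]

    lines = ["@relation " + _quote(relation), ""]
    for i, name in enumerate(header):
        present = [row[i] for row in rows if row[i] != ""]
        if all(_is_number(v) for v in present):
            kind = "numeric"
        else:
            # Nominal attribute: list every distinct label seen in the column.
            labels = sorted(set(present))
            kind = "{" + ",".join(_quote(v) for v in labels) + "}"
        lines.append("@attribute {} {}".format(_quote(name.strip()), kind))

    lines += ["", "@data"]
    for row in rows:
        # Empty cells become "?", the ARFF marker for a missing value.
        lines.append(",".join("?" if v == "" else _quote(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,,yes\n"
    print(csv_to_arff(sample, relation="weather"))
```

Check the inferred attribute types before loading the file: an ID column of digits, for example, will be declared numeric by this sketch even if you want to treat it as nominal or drop it.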

Project Proposal

Your data mining project proposal is due on Monday, Oct. 1st/Tuesday, Oct. 2nd in class. The proposal should be one page and contain the following elements:

While I'd like you to think through your plan carefully, please understand that this is a proposal, and I expect that your question and your approaches will likely change as the semester progresses.

Checkpoint

I want to make sure that you're starting early and giving yourself enough time to get reasonable results for your project. Therefore, on Mon, Nov. 5th/Tues, Nov. 6th, I'd like each group to hand in a one-page paper explaining the progress you've made so far, any challenges you've run into, and any changes you've made to your original proposal. By then you should at least have put your data into the format required by Weka or the program you're writing, and have run at least one algorithm on your data set.

Project Presentations

Presentations will be in class on Wednesday, Nov. 28th/Thursday, Nov. 29th and Monday, Dec. 3rd/Tuesday, Dec. 4th. Once dates have been selected for each group, they will be listed on the course website.

Group presentations will be 15 minutes long; please leave time for questions within your overall presentation time.

Topics to include:

You will be graded on the following:

Final Report