Data Mining Project Page
This project is meant for you to get your hands "dirty" with real-world data and tackle a problem of interest to you. Since the project is open-ended, select a topic that interests you and your group.
The project should be data-focused. Although in statistics you usually try to formulate a hypothesis first and then go about collecting data, often in data mining it is the other way around. There are multitudes of interesting data sets out there that can be collected fairly easily. See below for ideas.
Guidelines for Group Project
Group Size: Groups should consist of 3 people.
Software: You may either do your data analysis using WEKA or write code to implement the desired algorithms. If you'd like to use different software, please see me so we can chat about this option.
Timeline:
- Project proposal: Due Monday, Oct. 1st/Tuesday, Oct. 2nd.
- Checkpoint: Due Monday, Nov. 5th/Tuesday, Nov. 6th.
- Group Presentations: Wednesday, Nov. 28th/Thursday, Nov. 29th and Monday, Dec. 3rd/Tuesday, Dec. 4th
- Final Paper: Due Wednesday, Dec. 5th for both sections
The project is extremely open-ended. It should consist of the following:
- Find or collect a data set of interest. There are many sources on the web for data sets. I would prefer the data to be of a reasonably large size (this is a data mining class after all), but really large data sets can bog down computers. A lower limit for data size should be n=1000 although I am willing to accept exceptions. See below for places to look for data sets of interest.
- Consider your data carefully. Even if you downloaded it, you should look for information about it. How was it collected? What are the data quality issues? Are there biases inherent in who collected the data or how it was collected? How might this impact the subsequent conclusions?
- Formulate questions that you would like to answer about this data set. What is the dependent variable or variables? What are the predictors?
- Implement your analysis using data mining tools. These should have some relation to what we have learned in the class! Are you doing a classification or clustering task? Can the data be expressed as a network of some kind? Are there interesting visualizations to do? How will you evaluate the performance of your model, or choose between competing models?
- Determine how well the algorithms you selected answered your questions of interest. Evaluate your results. What novel patterns and knowledge did you discover from your data set?
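As one way to approach the evaluation questions in the steps above, here is a minimal sketch of k-fold cross-validation that compares a candidate model against a majority-class baseline. Everything in it is an assumption made for illustration: the toy data is made up, the function names are hypothetical, and 1-nearest-neighbor simply stands in for whatever algorithm your group chooses.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_baseline(train_X, train_y, test_X):
    """Predict the most common training label for every test row."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)

def one_nearest_neighbor(train_X, train_y, test_X):
    """Predict the label of the closest training row (squared Euclidean distance)."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    preds = []
    for x in test_X:
        nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
        preds.append(train_y[nearest])
    return preds

def cross_val_accuracy(model, X, y, k=5, seed=0):
    """Mean accuracy of `model` over k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = model([X[i] for i in train], [y[i] for i in train],
                      [X[i] for i in fold])
        correct = sum(p == y[i] for p, i in zip(preds, fold))
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

# Toy data (made up for illustration): two well-separated clusters.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2), (0.0, 0.3),
     (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.3)]
y = ["a"] * 6 + ["b"] * 4

baseline_acc = cross_val_accuracy(majority_baseline, X, y, k=5)
knn_acc = cross_val_accuracy(one_nearest_neighbor, X, y, k=5)
print(f"majority baseline: {baseline_acc:.2f}, 1-NN: {knn_acc:.2f}")
```

Reporting a simple baseline next to each model makes it easier to judge whether an accuracy figure reflects something learned from the data or just the class balance.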
Ideas for projects and datasets:
There are lots of data sets available online. Pick something that you will enjoy working on, and something where there is a rich source of data available. Take some time in selecting a good data set - feel free to ask me for suggestions. You should take the time to research where and how your data was collected. A good final report will dive into biases that may exist or data quality issues.
- A nice selection of data mining data sets is at the KDNuggets website.
- The University of California at Irvine has put together a large repository of data sets for machine learning at http://archive.ics.uci.edu/ml/
- Another repository of data sets for data mining at the University of Edinburgh.
- Statlib is a general repository for all things statistical, and it has a nice collection of data sets.
- Datasets from Chance Magazine
- Health Data Sets
- Kaggle - This site hosts data mining competitions. Each competition comes with a data set. You can access most datasets without taking part in the competition, but feel free to submit your results if you're so inclined.
- KDD Cup Datasets - KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
- Baseball/Basketball statistics - there are a number of repositories for this type of data.
- US Government datasets - tons of census, voting, demographics datasets available.
- Local businesses may be willing to give you access to their data. Talk to me about this option first; don't contact them directly.
Resources for getting data into .arff format
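If your data starts out as a CSV file, a conversion along the following lines may be enough. This is a minimal sketch, not a complete converter: it assumes a header row, rows of equal length, and empty cells for missing values, and it guesses each attribute's type (numeric vs. nominal) from the values present. The function names and sample data are hypothetical.

```python
import csv
import io

def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def quote(value):
    """Quote names or values that contain spaces or other special characters."""
    if value and all(ch.isalnum() or ch in "_-." for ch in value):
        return value
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def csv_to_arff(csv_text, relation="my_data"):
    """Convert CSV text (header row first) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {quote(relation)}", ""]
    numeric = []
    for col, name in enumerate(header):
        present = [r[col] for r in data if r[col].strip() != ""]
        is_num = all(is_number(v) for v in present)
        numeric.append(is_num)
        if is_num:
            kind = "numeric"
        else:
            # Nominal attribute: list every distinct value seen in the column.
            kind = "{" + ",".join(quote(v) for v in sorted(set(present))) + "}"
        lines.append(f"@attribute {quote(name)} {kind}")
    lines += ["", "@data"]
    for r in data:
        cells = []
        for col, v in enumerate(r):
            if v.strip() == "":
                cells.append("?")          # ARFF marker for a missing value
            elif numeric[col]:
                cells.append(v.strip())
            else:
                cells.append(quote(v))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"

# Made-up sample with one missing age and one missing income.
sample = "age,city,income\n34,Boston,52000\n,New York,61000\n29,Boston,\n"
arff_text = csv_to_arff(sample)
print(arff_text)
```

Check the guessed attribute types by hand before loading the file: a column of numeric codes (zip codes, for example) will be declared numeric here even if it should be treated as nominal.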
Project Proposal
Your data mining project proposal is due on Monday, Oct. 1st/Tuesday, Oct. 2nd in class.
The proposal should be 1 page and contain the following elements:
- List of group members.
- Preferred date for final presentation. (Wednesday, Nov. 28th/Thursday, Nov. 29th and Monday, Dec. 3rd/Tuesday, Dec. 4th)
- What data set is being used - where does the data come from, and what are some characteristics of it (size, missing values, types of attributes)
- Is there a reason you picked this data set? Tell me.
- What is the question (or questions) of interest? Be specific. Generic questions like "I want to look for patterns in stock prices" are bad. Devise a specific question you can answer with the data.
- What methods do you plan to use - understanding that this might change and that we have yet to cover many methods in class.
While I'd like you to think through your plan carefully, please understand that this is a proposal, and I expect that your question and your approaches will likely change as the semester progresses.
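To gather the data set characteristics the proposal asks for (size, missing values, types of attributes), a short script can do a first pass. The sketch below assumes CSV input with a header row and treats empty cells, "?", and "NA" as missing; those markers, the function names, and the sample data are assumptions, so adjust them to match your data.

```python
import csv
import io

MISSING = {"", "?", "NA"}  # assumed missing-value markers; adjust for your data

def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def summarize(csv_text):
    """Count records and, per attribute, report type, missing and distinct values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    attributes = {}
    for name in reader.fieldnames:
        values = [(row[name] or "").strip() for row in rows]
        present = [v for v in values if v not in MISSING]
        numeric = bool(present) and all(is_number(v) for v in present)
        attributes[name] = {
            "type": "numeric" if numeric else "nominal",
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return {"n_records": len(rows), "attributes": attributes}

# Made-up sample for illustration.
sample = "age,city,income\n34,Boston,52000\n?,New York,61000\n29,Boston,NA\n"
summary = summarize(sample)
print("records:", summary["n_records"])
for name, info in summary["attributes"].items():
    print(name, info)
```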
Checkpoint
I want to make sure that you're starting early and giving yourself enough time to get reasonable results for your project. Therefore, on Mon, Nov. 5th/Tues, Nov. 6th, I'd like each group to hand in a one-page paper explaining the progress you've made so far, any challenges you've run into, and any changes you've made to your original proposal. By then you should have at least put your data into the format required by Weka or the program you're writing, and run at least one algorithm on your data set.
Project Presentations
These will be in class on Wednesday, Nov. 28th/Thursday, Nov. 29th and Monday, Dec. 3rd/Tuesday, Dec. 4th. Once dates have been selected for each group they will be listed on the course website.
Group presentations will be 15 minutes long; please include time for questions in your overall presentation time.
Topics to include:
- Data overview: subject, features, number of records
- Data cleaning: how did you handle missing values, did you remove any fields, records? why or why not?
- Algorithms attempted: name the algorithms used and why you selected them
- Results: how well did each algorithm work, which was the best, and most importantly, what did you learn about your data?
- Conclusions & Future Work: what would you do differently in the future (if anything), and what did you learn overall from the project?
You will be graded on the following:
- Content: Did the presentation include valuable material, relevant to the project? Did it include discussions of data, algorithms, results and conclusions?
- Collaboration: Did everyone contribute to the presentation?
- Organization: Was the presentation well organized and easy to follow?
- Presentation: Did the presenters speak clearly? Did they engage the audience? Was it obvious the material had been rehearsed?
Final Report
- Write a report summarizing your data, your question of interest, and your findings. Reference other existing work which has analyzed your data, or addressed similar topics. The report should contain information about the data, exploration of the data set, and an appropriate analysis. Graphs are welcome, but do not overdo them.
- The report should be 10 - 20 pages in length, including graphs.
- Your report should discuss the same topics as your presentation, but go more in-depth with them. I want to know the details about what you tried, what worked, what didn't, and most importantly what you were able to learn about your data.
- This paper should be structured similarly to the research papers you presented this semester; it should include the following sections:
  - Introduction: overview of the project, explain what you're going to write about
  - Materials: discuss data you used; include details about the data (attributes, number of records) as well as any data cleaning you did, or things you needed to do to get it into Weka
  - Methods: describe the algorithms you tried, explain each algorithm and why you selected it for your data. If you used any bagging/boosting, discuss that here as well.
  - Results: Evaluate the performance of each algorithm you tried, compare performances, discuss what you learned from your data.
  - Conclusion: Discuss the overall performance of your algorithms, talk about anything you might do differently in the future (Future Work) and your overall findings.