Project 1
Task: Data Mining Applications Report
Worth: 15%
Deadline: 20:00, Sep. 26, 2012
Project Description: Click here
Received Submissions: Click here
Selected Works for Presentation: Click here
The Task
In this project, you are required to read some papers in one special issue
of SIGKDD Explorations, and then write a report. There are four special
issues which you can select:
1. Successful Real-World Data Mining Applications
http://www.kdd.org/newsletter/explorations-june-2006-8-1
2. Web Content Data Mining
http://www.kdd.org/newsletter/explorations-december-2004-6-2
3. Educational Data Mining
http://www.kdd.org/newsletter/explorations-december-2011-13-2
4. Data Mining for Health Informatics
http://www.kdd.org/newsletter/explorations-june-2007-9-1
Follow your interest, choose one from these special issues listed above,
read some papers in your selected issue and write a report. The URLs of
these special issues are also listed above, from which you can find the
papers of each special issue directly.
To write this report, you need to consider the following three aspects
based on the papers you have read:
Motivation, i.e., why data mining is used in this application;
Techniques, i.e., how data mining is used in this application;
Results, i.e., how well data mining performs in this application.
NOTES:
1). Please use this MSWord template to write your report in Chinese with
an English abstract. The file for submission should be named with your ID
number, e.g., "MG1233001.docx".
2). Do NOT plagiarize; plagiarism will be seriously penalized. Be careful
when writing your report: whenever you use the words or work of others,
cite them clearly so that readers can tell which parts are actually
yours. Details on how plagiarism is identified can be found in
"Introduction to the Guidelines for Handling Plagiarism Complaints".
Submission
Name your report using your student ID, e.g., 'MG1233001.docx'.
The file format must be doc/docx; no other format is acceptable.
NO submission after the deadline is acceptable!
NO email submission will be accepted!
Upload your file to FTP: (please use FTP software to upload, do not use
Windows Explorer or IE)
ftp://lamda.nju.edu.cn/mg_DM_2012/assignment1/
username: mg_dm12
password: mg_dm12
Evaluation
Your language: concise, precise, and logical.
Your organization: good structure, clearly and properly separated
sections and paragraphs.
Citations: all work that is not your own must be correctly referenced.
Insights: after reading your report, readers should understand why and
how data mining is useful in this application.
If plagiarism is identified, the report will receive no score.
Presentation
About 5 submissions will be selected and presented (by the author) in the
class.
Project 2
Task: A Classification Task
Worth: 10%
Deadline: 20:00, Oct. 17, 2012
Project Description: Click here
Received Submissions: Click here
Selected Works for Presentation: Click here
The Task
In this project, you are required to do a classification task. In detail,
your job includes:
1. Read the description of the task and download the data set;
2. Implement an algorithm and output the result;
3. Write a report;
4. Submit your work;
5. (optional) If selected, prepare your presentation
Task Description and Dataset Download
Dataset: The dataset has 1100 instances in all. Each instance has 42
attributes.
Download:
training data[111 K]: each line is an instance, 700 instances in all.
testing data[64 K]: each line is an instance, 400 instances in all.
Task: Predict the label for each instance in the testing data.
NOTES:
I have preprocessed the original dataset, so please do not spend effort
looking for the original dataset online; it would be of no use.
How to Write the Report
Your report should include:
1. Your understanding and analysis of the problem;
2. The motivation of your algorithm and introduction of the background
of your algorithm;
3. Full technical details of your algorithm, especially including
pseudocode of your algorithm;
4. Description or analysis of the performance you obtained;
5. Conclusion and (optional) discussion.
NOTES:
1. Please use this MSWord template to write your report in Chinese with
an English abstract. The file for submission should be named with your
student ID, e.g., "MG1233001.docx".
2. Do NOT plagiarize; plagiarism will be seriously penalized. Be careful
when writing your report: whenever you use the words or work of others,
cite them clearly so that readers can tell which parts are actually
yours. Details on how plagiarism is identified can be found in
"Introduction to the Guidelines for Handling Plagiarism Complaints".
How to Submit
Your submission should include:
1. 'output.txt' file: containing the predicted labels of the testing
instances, one instance per line, in the same format as the 'train label'
file in the training dataset;
2. 'report.pdf' file: your report;
3. source file of your algorithm.
Please check your submission carefully:
1. The files must be named exactly as listed above.
2. Pack all your files into a single compressed file (zip, rar, or 7z
format). Name the compressed file using your student ID, e.g.,
'MG1233001.rar'. Please delete backup files (.bak) from your final
compressed file.
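The packaging checklist above can be sketched in Python. This is only an
illustration, not a required tool: the function name pack_submission and
the folder layout are assumptions, and zip is chosen because it is one of
the accepted formats and is supported by the standard library.

```python
import zipfile
from pathlib import Path


def pack_submission(student_id, folder, archive_dir="."):
    """Pack every file under `folder` into <student_id>.zip, skipping .bak files.

    Hypothetical helper: the required file names come from the submission
    list above; everything else here is an assumption.
    """
    folder = Path(folder)
    # The two fixed file names must be present exactly as specified.
    for name in ("output.txt", "report.pdf"):
        if not (folder / name).is_file():
            raise FileNotFoundError(f"missing required file: {name}")

    archive = Path(archive_dir) / f"{student_id}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if not path.is_file():
                continue
            if path.suffix.lower() == ".bak":
                continue  # backup files must not be submitted
            if path.resolve() == archive.resolve():
                continue  # never add the archive to itself
            zf.write(path, path.relative_to(folder).as_posix())
    return archive
```

For example, pack_submission("MG1233001", "my_project") would write
MG1233001.zip to the current directory, assuming a folder named my_project
that holds output.txt, report.pdf, and the source files.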
Upload your file to FTP: (please use FTP software to upload, do not use
Windows Explorer or IE)
ftp://lamda.nju.edu.cn/mg_DM_2012/assignment2/
username: mg_dm12
password: mg_dm12
Evaluation
We will evaluate your work in terms of:
Your prediction: according to your "output.txt", we will use accuracy to
evaluate your prediction.
Your report: novel idea, sound techniques, and beautiful writing gain you
high scores.
Your source code: fake or plagiarized source code receives low scores.
If plagiarism is identified, the report will receive no score.
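For reference, accuracy as used for the prediction score above is the
fraction of testing instances whose predicted label equals the true label.
A minimal sketch follows, assuming one label per line in each file; the
function names and the reference label file are hypothetical, and the
graders' actual script is not described here.

```python
def read_labels(path):
    """Read one label per line, ignoring blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def accuracy(predicted, actual):
    """Return the fraction of positions where the two label lists agree."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("label lists must not be empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Because the testing labels are not distributed, a practical way to use
this is to hold out part of the 700 training instances and measure
accuracy on that held-out part before producing output.txt.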
Presentation
After all submissions have been collected, about 5 assignments will be
selected and presented (by the author) in the class.
Project 3
Task: A Clustering Task
Worth: 10%
Deadline: 20:00, Nov. 7, 2012
Received Submissions: Click here
Project Description: Click here
Selected Works for Presentation: Click here
The Task
In this project, you are required to do a clustering task. In detail, your
job includes:
1. Read the description of the task and download the data set;
2. Implement an algorithm and output the result;
3. Write a report;
4. Submit your work;
5. (optional) If selected, prepare your presentation
Task Description and Dataset Download
Dataset: The dataset has 1080 instances in all. Each instance has 856
attributes.
Download:
clustering data[1806 K]: each line is an instance, 1080 instances in all.
Task: Cluster the data and give the id of the cluster to which each
instance belongs.
NOTES:
I have preprocessed the original dataset, so please do not spend effort
looking for the original dataset online; it would be of no use.
How to Write the Report
Your report should include:
1. Your understanding and analysis of the problem;
2. The motivation of your algorithm and introduction of the background
of your algorithm;
3. Full technical details of your algorithm, especially including
pseudocode of your algorithm;
4. Description or analysis of the performance you obtained;
5. Conclusion and (optional) discussion.
NOTES:
1. Please use this MSWord template to write your report in Chinese with
an English abstract. The file for submission should be named with your
student ID, e.g., "MG1233001.docx".
2. Do NOT plagiarize; plagiarism will be seriously penalized. Be careful
when writing your report: whenever you use the words or work of others,
cite them clearly so that readers can tell which parts are actually
yours. Details on how plagiarism is identified can be found in
"Introduction to the Guidelines for Handling Plagiarism Complaints".
How to Submit
Your submission should include:
1. 'output.txt' file: containing the cluster id of each instance, one
instance per line, as in the last task;
2. 'report.pdf' file: your report;
3. source file of your algorithm.
Please check your submission carefully:
1. The files must be named exactly as listed above.
2. Pack all your files into a single compressed file (zip, rar, or 7z
format). Name the compressed file using your student ID, e.g.,
'MG1233001.rar'. Please delete backup files (.bak) from your final
compressed file.
Upload your file to FTP: (please use FTP software to upload, do not use
Windows Explorer or IE)
ftp://lamda.nju.edu.cn/mg_DM_2012/assignment3/
username: mg_dm12
password: mg_dm12
Evaluation
We will evaluate your work in terms of:
Your prediction: according to your "output.txt", we will use the Rand
measure to evaluate your clustering results. For the Rand measure, see
http://en.wikipedia.org/wiki/Rand_measure.
Your report: novel idea, sound techniques, and beautiful writing gain you
high scores.
Your source code: fake or plagiarized source code receives low scores.
If plagiarism is identified, the report will receive no score.
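For reference, the Rand measure mentioned above compares two clusterings
by counting pairs of instances: a pair agrees if both clusterings put the
two instances in the same cluster, or both put them in different clusters.
The measure is the number of agreeing pairs divided by the total number of
pairs. A minimal sketch of this standard definition follows; the function
name rand_index is an assumption, and the graders' actual script and
reference labels are not described here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the Rand measure between two clusterings of the same instances.

    labels_a[i] and labels_b[i] are the cluster ids given to instance i.
    The cluster ids themselves need not match between the two clusterings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same instances")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two instances are needed to form a pair")

    agreeing = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreeing += 1  # the pair is treated the same way by both
    total_pairs = n * (n - 1) // 2
    return agreeing / total_pairs
```

Since only pair relationships are counted, renaming the cluster ids does
not change the score. With 1080 instances the loop visits 582,660 pairs,
which is small enough for this direct approach.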
Presentation
After all submissions have been collected, about 5 assignments will be
selected and presented (by the author) in the class.
Project 4
Task: Data Mining Practice on A Real-World Task
Worth: 15%
Deadline: 20:00, Dec. 19, 2012
Project Description: Click here
Received Submissions: Click here
Selected Works for Presentation: Click here
The Task
In this project, you are required to mine the PAKDD 2012 Data Mining
Competition dataset. In detail, your job includes:
1. Read the description of the task and download the data set;
2. Implement an algorithm and output the prediction;
3. Write a report;
4. Submit your work;
5. (optional) If selected, prepare your presentation
Task Description and Dataset Download
Background:
Maintaining customers and ensuring customers are satisfied with the
products offered have always been a major challenge for telcos. Companies