Dr. Gary D. Boetticher

 

CSCI 5833 -- Data Mining Tools and Techniques

STAT 5931 -- Research Topics in Statistics
Updated October 29, 2009

Office and Addresses

Delta 171 Phone 281.283.3805
email: boetticher@uhcl.edu
Secretary: Ms. Kim Edwards, Delta 161 281.283.3860

Face-to-Face Class Hours

Friday 1:00 - 3:50, Delta 242.

Office Hours

Thur. 1 - 4, Friday 12 - 1, or by appointment. If the suite door is locked, then call my extension (last 4 digits) using the phone in the hallway.

Teaching Assistant

Pradeep Gownivariananda, email: pradeepga@gmail.com
Office Hours: Wednesday 7 PM - 10 PM, Friday 9 AM - 1 PM

WebCT link

Course Description

Data Mining has emerged as one of the most exciting and dynamic fields in computer science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the data mining software market alone has grown to an expected size of more than $10 billion by the end of this year.

Although the theoretical underpinnings of data mining have existed for a while (e.g., pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. Data mining seeks to detect 'interesting' and significant nuggets of relationships/knowledge buried within data. It seeks to discover association rules, episode rules, sequential rules, etc., and it is concerned with efficient data structures and algorithms for data examination that scale well.

There have been several success stories in this relatively young area: the SKICAT system for automatic cataloguing of sky surveys (JPL), the Advanced Scout system for mining NBA data (IBM), the QuakeFinder system for geoscientific data mining (UCLA/JPL) and the PYTHIA system for mining information from performance evaluation of scientific software (Purdue). Case studies from various domains (financial, bioinformatics, etc.) will be presented.

The traditional graduate student load is 3 courses. Be prepared to commit 15 to 20 hours per week to this course!

Course Goals

 

By the end of the course, you will

  • Understand the data mining process.

  • Have a working knowledge of different data mining tools and techniques.  

  • Have an understanding of various Machine Learners (ML).

  • Have a working knowledge of some of the more significant current research in the area of data mining and ML.

  • Be aware of various data mining data repositories for the study of data mining.

  • Be able to effectively apply a number of data mining algorithms (e.g., neural networks, genetic algorithms) to solve data mining problems from various problem domains including Financial and Bioinformatics.

  • Be familiar with several successful applications of data mining.

Prerequisites

A course in artificial intelligence, machine learning, pattern recognition, algorithms, or statistics would be helpful, but is not required. Programming experience (or at least one course) in C, C++, C#, Delphi, Java, Pascal, or VB (using Visual Studio) is required. If you do not meet the prerequisites, then you need to drop this course!

Methodology

Lecture, seminar, case studies, and interactive problem solving.

Appraisal:

 Homework:  15% of the total
 Quizzes:  5% of the total
 Term Project:  25% of the total
 Participation:  5% of the total
 Midterm:  25% of the total
 Final:  25% of the total

Grading:

    93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-;

      77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
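
As a rough illustration of how the appraisal weights and the grading scale combine, here is a minimal Python sketch. It assumes each component is scored as a percentage from 0 to 100, that every cutoff (including 90 and 70) is an inclusive lower bound, and that no rounding is applied. The function names and the example student are hypothetical and are not part of the course materials.

```python
# Component weights from the Appraisal section (percent of the total).
WEIGHTS = {
    "homework": 15,
    "quizzes": 5,
    "term_project": 25,
    "participation": 5,
    "midterm": 25,
    "final": 25,
}

# Letter cutoffs from the Grading section, highest first.
# Assumption: every cutoff, including 90 and 70, is an inclusive lower bound.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"),
    (60, "D-"), (0, "F"),
]


def course_score(component_scores):
    """Weighted average of per-component percentages (each 0-100)."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Return the first letter whose cutoff the score meets."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical student, for illustration only.
example = {
    "homework": 90, "quizzes": 100, "term_project": 85,
    "participation": 100, "midterm": 80, "final": 88,
}
total = course_score(example)
print(total, letter_grade(total))
```

Because the midterm, final, and term project each carry 25%, a weak result on any one of them moves the total far more than the quizzes or participation can recover.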

My motto:

Seek the Truth.

Show altruistic love.

Appreciate beauty.

Required Textbook  

 

Kantardzic, Mehmed, Data Mining: Concepts, Models, Methods, and Algorithms, IEEE Press, John Wiley & Sons, 2003. ISBN 0-471-22852-4. Errata1 Errata2

 

Reference Books

1.    Berry and Linoff, Data Mining Techniques, Wiley, 2000.

2.   Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2001, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.

3.   Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.

4.   Han, Jiawei, and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

5. Witten, I.H., E. Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, California, 1999.

Other Reference Materials

Conferences, Journals, and Organizations

Data Resources

Just announced new studies

UCI Machine Learning Database Repository

 

Bioinformatic and biological databases:  

Santa Fe dataset  

 

Data Mining Software

 

Schedule

 

Aug 28 Course overview, What is Data Mining? Review Term Project

 

FOR THIS WEEK (IF NOT SOONER)

·   Read:  Orientation

·   Read:  Syllabus

·   Take:  Syllabus Quiz (may be taken multiple times; no time limit; you need a 100% on this quiz in order to continue)

·   Read:  Unit 1 Notes: What is Data Mining/The Data Mining Process

·   Take:  Quiz on Unit 1 Online Notes (9/4/09)

             (Quiz deadline is 9/4/09, Time corresponds to

             beginning of class time. Other quiz deadlines

             follow this pattern.)

·   Read:  Kantardzic: Chapter 1

·   Take:  Quiz on Chapter One of Kantardzic (Concepts) (9/4/09)

·   Read:  Term Project

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  Unit 2 Online Notes: Data Preprocessing

·   Read:  Principal Component Analysis Tutorial

·   Take:  Quiz on Unit 2 Online Notes (9/7/09)

·   Read:  Kantardzic: Chapter 2

·   Take:  Quiz on Chapter Two of Kantardzic (9/7/09)

·   Read:  Kantardzic: Chapter 3

·   Take:  Quiz on Chapter Three of Kantardzic (9/7/09)

 

 

Background reading (not required)

Chapman, Pete, Julian Clinton, Thomas Khabaza, Thomas Reinartz, and Rüdiger Wirth. The CRISP-DM process model. Technical report, CRISP-DM consortium, March 1999.

Elder, John, Top 10 Data Mining Mistakes, Elder Research, Charlottesville, Virginia.

Fayyad, Usama, Ramasamy Uthurusamy, Data Mining and Knowledge Discovery in Databases, Communications of the ACM, 39 11, November 1996.

Fayyad, Usama, Ramasamy Uthurusamy, Evolving Data Mining into Solutions for Insights, Communications of the ACM, 45 8, August 2002.

Fayyad, Usama, et al., Mining Scientific Data, Communications of the ACM, 39 11, November 1996.

Fayyad, Usama, et al., The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, 39 11, November 1996.

Communications of the ACM, Special Issue on Knowledge Discovery November 1999. 

Hall, M., Geoffrey Holmes, Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, IEEE Transactions on Knowledge and Data Engineering, 15 6, 2003, Pp. 1437 - 1447.

Hancock, Monte, Common Reasons Data Mining Projects Fail, KDD-2002, Edmonton, Canada, 2002.

SAS Institute, Data Mining and the Case for Sampling, SAS Institute Best Practices Paper, SAS Institute, 1998.

Smyth, Padhraic, Pregibon, Daryl, and Christos Faloutsos, Data-Driven Evolution of Data Mining Algorithms, Communications of the ACM, 45 8, August 2002.

 

Sep 04   Data Preprocessing

   

Assign Homework 1

Point value: 100 points

Due date:  September 25, 2009 1:00 PM

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  Kantardzic: Chapter 4

·   Take:  Quiz on Chapter Four of Kantardzic (9/11/09)

·   Read:  Kantardzic: Chapter 7

·   Take:  Quiz on Chapter Seven of Kantardzic (9/11/09)

·   Read:  Unit 3 Online Notes: Decision Trees

·   Take:  Quiz on Unit 3 Online Notes  (9/11/09)

 

Sep 11 Overview of Machine Learners & Decision Trees

    

Assign Homework 2

Point value: 100 points

Due date:  October 2, 2009 7:00 PM

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  Tim Menzies' Tar2 and Tar3 material, available at the following link:

http://www.cs.pdx.edu/~timm/dm/rx.html#insider%20the%20tar3%20treatment%20learner

·   Review There is a very good Decision Tree tutorial available at: http://dms.irb.hr/tutorial/tut_dtrees.php

 

Sep 18 Decision trees continued, Tar 2

  

FOR NEXT WEEK (IF NOT SOONER)

·  Review The following tutorial on Genetic Algorithms:

            http://www.obitko.com/tutorials/genetic-algorithms/

·   Read:  Kantardzic: Chapter 10

·   Take:  Quiz on Chapter Ten of Kantardzic (9/25/09)

·   Read:  Unit 4 Online Notes: Genetic Algorithms

·   Take:  Quiz on Unit 4 Online Notes  (9/25/09)

 

Background reading (not required)

Koza, John, Genetic Programming, Dept. of CS, Stanford  University, 1997, Pp. 1 – 26.

Koza, John, Future Work and Practical Applications of Genetic Programming, Handbook of Evolutionary Computation, June, 1996, Pp. 1 – 7.  

Koza, John, Riccardo Poli, A Genetic Programming Tutorial, Stanford University

Mitchell, Tom M., Machine Learning and Data Mining, Communications of the ACM, 42 11, November 1999.

Whitley, Darrell, A Genetic Algorithm Tutorial, TR CS-93-103, Dept. of CS, Colorado State University, Pp. 1 – 38.

 

Sep 25   Genetic Algorithms

 

HOMEWORK 1 IS DUE

 

Assign Homework 3

Point value: 100 points

Due date:  Sunday, October 11, 2009 7:00 PM via email

 

FOR NEXT WEEK (IF NOT SOONER)

·   Review One (or more) of the following online neural network tutorials:

NN Tutorial 1

NN Tutorial 2

NN Tutorial 3

·   Read:  Kantardzic: Chapter 9 (Neural Networks)

·   Take:  Quiz on Chapter Nine of Kantardzic  (10/2/09)

·   Read:  Unit 5 Online Notes: Neural Networks

·   Review Download, install, and try GDB_Net (a neural network software package).

·   Review Try out this Kohonen Self-Organizing Map applet. Click here for the zipped code of this applet.

·   Take:  Quiz on Unit 5 Online Notes  (10/2/09)

Background reading (not required)

Gerstner, Wulfram, Supervised Learning for Neural Networks: A Tutorial with Java exercises. The corresponding Java applets for this tutorial are available at:     http://diwww.epfl.ch/mantra/tutorial/english/

 

Oct 02  Neural Networks

 

HOMEWORK 2 IS DUE

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  Kantardzic: Chapter 5

·   Take:  Quiz on Chapter Five of Kantardzic (10/9/09)

·   Read:  Unit 6 Online Notes: Evaluating Results

·   Take:  Quiz on Unit 6 Online Notes  (10/9/09)

Background reading (not required)

Anand, Sarbjot, et al., The Role of Domain Knowledge in Data Mining, CIKM, Baltimore, Maryland, 1995.

Clark, Glymour, et al., Statistical Inference and Data Mining, Communications of the ACM, 39 11, November 1996.

Elder, John, et al., A Statistical Perspective on Knowledge Discovery in Databases

Friedman, Jerome H., Data Mining and Statistics: What's the Connection?, Department of Statistics, Stanford University

 

 

Oct 09 Evaluating Results, Review for midterm

 

HOMEWORK 3 is due on Sunday, October 11, at 7 PM via email.

   

FOR NEXT WEEK (IF NOT SOONER)  

·  Submit:  Midterm questions by Thursday, October 15th, 7 PM.

·   Study!

 

Oct 16 Midterm: Starts at 1 PM in Delta 237

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read:  Unit 7 Online Notes: Financial Data Mining

·   Read:  Achelis, Steven, Technical Analysis from A to Z

·   Read:  Mizuno, et al., Application of Neural Network To Technical Analysis of Stock Market Prediction, Studies in Informatics and Control (With Emphasis on Useful Applications of Advanced Technology), 7 2, June 1998.

·   Read:  Frick, et al., Genetic-Based Trading Rules - A New Tool to Beat the Market With -- First Empirical Results --, in Aktuarielle Ansätze für Finanz-Risiken, Proceedings of 6th International AFIR Colloquium, Nürnberg, 1.-3. October 1996, (Editor Pete Albrecht) Verlag Versicherungswirtschaft e.V. Karlsruhe, Volume I/II, pp. 997 - 1018 (with coauthors A. Frick, R. Herrmann, M. Kreidler and A. Narr).

·   Take:  Quiz on Mizuno, Frick, and Unit 7 Online Notes  (10/23/09)

 

Background reading (not required)

Blok, Hendrik, On the Nature of the Stock Market: Simulations and Experiments, Ph.D. Dissertation, University of British Columbia

Chenoweth, T., Obradovic, Z., Sauchi Lee, Embedding Technical Analysis into Neural Network Based Trading Systems, Applied Artificial Intelligence Journal.

Li, J., Edward Tsang, Improving Technical Analysis Predictions: An Application of Genetic Programming, The 12th International FLAIRS Conference, USA, 1999.

Mahfoud, Sam, and Ganesh Mani, Financial Forecasting using Genetic Algorithms, Applied Artificial Intelligence, 10:543-565, 1996.

Gayle, Sanford, The Marriage of Market Basket Analysis to Predictive Modeling, SAS Institute.

Ansari, Suhail, et al., Integrating E-Commerce and Data Mining: Architectures and Challenges, ICDM 2001.

Chang, Wei-Lun, et al., A Synthesized Learning Approach for Web-Based CRM, International Workshop on WebKDD, Boston, USA, 2000.

Theusing, Christine, Klaus-Peter Huber, Analyzing the Footsteps of Your Customers - A case study by ASKnet and SAS Institute GmbH, 2000.

Vucetic, Slobodan, et al., A Regression-Based Approach for Scaling-Up Personalized Recommender Systems in E-Commerce, Workshop on Web Mining for E-Commerce, at the Sixth ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Boston, MA, August 2000.

 

Oct 23 – Review of the midterm and Applications of Data Mining: Financial

 

Assign Homework 4

Point value: 100 points

Due date:  November 13, 2009 1:00 PM

  

 

* Last day to drop a class/withdraw for the semester is Oct 27th *

 

Oct 30 Financial Data Mining

 

Assign Homework 5

Point value: 100 points

Due dates:  November 20, 2009 1:00 PM

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read:  Kantardzic: Chapter 6

·   Take:  Quiz on Chapter Six of Kantardzic (11/06/09)

·   Read:  K-Means and Hierarchical Clustering Tutorial (Moore @ CMU)

·   Read:  Instance-Based Learning Tutorial (Moore @ CMU)

·   Read:  Support Vector Machine Tutorial (Moore @ CMU)

·   Read:  Unit 8 Online Notes: Clustering, Instance-Based Learning, Support Vector Machines, Ensemble Learning

·   Take:  Quiz on Unit 8 Online Notes  (11/06/09)

·   Review The following K-Means applet:

   http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html

·   Review The following SVM applet:

      http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

 

Background reading (not required)

Aha, D., Kibler, D., and Marc Albert, Instance-Based Learning Algorithms, Machine Learning, Kluwer Publishers, 6, 1991, Pp. 37-66.

 

Nov 06 – Clustering, Instance-Based Learning, Support Vector Machines, Ensemble Learning

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Review: Transcription/Translation Overview Animation

·   Review: Transcription Animation

·   Review: Translation Animation

·   Read:  Unit 9 Online Notes: Bioinformatics & Data Mining

·   Read:  Dynamic Programming Tutorial

·   Read:  Advanced Dynamic Programming Tutorial

·   Take:  Quiz on Unit 9 Online Notes  (11/13/09)

Background reading (not required)

Bertone, et al., SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics, 2001.

Bertone, P., M. Gerstein, Integrative Data Mining: The New Direction in Bioinformatics, IEEE Engineering in Medicine and Biology, 2000.

Ewen, Edward, et al., Data Warehousing in an Integrated Health System; Building the Business Case, DOLAP, 1998.

Initial sequencing and analysis of the human genome, Nature, February, 2001, Pp. 860-921.

Megalooikonomou, Vasileios, et al., Mining Lesion-Deficit Associations in a Brain Image Database, KDD-99, 1999.

Shamir, Ron, Lecture 1, Algorithms for Molecular Biology, Tel Aviv University, Fall, 2001.

Shamir, Ron, Lecture 2, Algorithms for Molecular Biology, Tel Aviv University, Fall, 2001.

Shamir, Ron, Bioinformatic Resources, Algorithms for Molecular Biology, Tel Aviv University, Fall, 2001.

Shamir, Ron, Hidden Markov Model, Algorithms for Molecular Biology, Tel Aviv University, Fall, 2001.

Tsur, Shalom, Data Mining in the Bioinformatics Domain, VLDB 2000.

 

Nov 13 – Bioinformatics

 

HOMEWORK 4 IS DUE

  

Nov 20 – Bioinformatics and project review

 

                  HOMEWORK 5 IS DUE on November 26th at 7 PM via email

 

Nov 27  Thanksgiving - No Class

 

TERM PROJECT DELIVERABLES DUE Tuesday, DECEMBER 1st, at 7 PM

 

Dec 04 Term Project presentations

 

FOR NEXT WEEK (IF NOT SOONER)

·   Submit:  Final questions by Thursday, December 10th, 7 PM.

·   Study!

 

Dec 11  Final Exam: Starts at 1 PM in Delta 237

                  Review of term project

 

 

Other Policies

Homework, Projects, Research Paper

  • Homework and projects are due exactly at the prescribed time (usually the beginning of class). As soon as a homework or project is collected, all others are considered 1 day late (even if it is only 3 minutes). If you might be running late, you might want to email the assignment. Also, when preparing your assignment, be mindful of possible backlogs at the printer, a jammed printer, a printer out of toner, etc.

  • Late homework/projects are accepted with a penalty of 10% deduction per 24-hour period after the due date. No late project will be accepted more than one week after the due date. The last homework/project cannot be late.

  • There will be no extra-credit homework or projects in this course.

  • All homework and projects must be typed, not handwritten.

  • A cover page is expected for all homework and projects.

  • VERY IMPORTANT! In certain classes students are encouraged to work in groups. For this class you are expected to work on all homework and projects individually. Students may not discuss, use, email, show, give, buy, sell, borrow, trade, steal, etc. in whole or part, any of the homework or projects in any manner not prescribed by the instructor. This condition applies even after you complete this course! Penalty for cheating will be extremely severe and may result in an F for this course.

  • Handing in an assignment for another student is considered cheating. Penalty for cheating will be extremely severe and may result in an F for this course. 

  • VERY IMPORTANT! Failing to report to the instructor any incident in which a student witnesses an alleged violation of the Academic Honesty Code is considered a violation of the academic honesty code. Please see me to discuss any incidents.

  • VERY IMPORTANT! Purchasing, or otherwise acquiring and submitting as one's own work any research paper or any other writing assignment prepared by others constitutes cheating. Penalty for cheating will be extremely severe and may result in an F for this course.

  • Standard academic honesty procedure will be followed. See the following link for additional information: http://b3308-adm.uhcl.edu/PolicyProcedures/Policy.html
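
The late-penalty rule in the list above (10% per 24-hour period, nothing accepted after one week) can be sketched in Python as follows. This is only an illustration under stated assumptions: the deduction is taken as 10% of the assignment's maximum points for every started 24-hour period, the score never drops below zero, and "one week" is read as 168 hours. The function name is hypothetical, and the rule that the last homework/project cannot be late is not modeled.

```python
import math


def late_adjusted_score(raw_score, hours_late, max_points=100):
    """Return the score after the late penalty, or None if not accepted.

    Assumptions (not spelled out in the syllabus): the deduction is 10% of
    max_points per started 24-hour period, and the result is floored at 0.
    """
    if hours_late <= 0:
        return raw_score          # on time: no penalty
    if hours_late > 7 * 24:
        return None               # more than one week late: not accepted
    # Any started 24-hour period counts in full (3 minutes late = 1 day late).
    periods = math.ceil(hours_late / 24)
    penalty = max_points * periods / 10
    return max(0.0, raw_score - penalty)


print(late_adjusted_score(95, 0))      # on time
print(late_adjusted_score(95, 0.05))   # 3 minutes late
print(late_adjusted_score(95, 25))     # just over one day late
print(late_adjusted_score(95, 200))    # more than a week late
```

Under these assumptions, being three minutes late costs the same as being a full day late, which is why emailing the assignment ahead of time is the safer option.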

Tests and Quizzes

  • There are no make-up tests except in verified medical emergencies and with immediate notification. Rescheduling a final exam in order to catch a plane flight is unacceptable. Make-up exams are harder than, and different from, the original exams.

  • There are no make-up quizzes. Allow plenty of additional time in the event that WebCT crashes.

  • You are responsible for all required readings assigned throughout the semester.

  • Students are to work on tests and quizzes individually. Students may not discuss, show, give, sell, borrow, trade, share, etc. their tests or quizzes. Penalty for cheating will be extremely severe. Standard academic honesty procedure will be followed.

  • VERY IMPORTANT! Providing answers for any assigned work or examination when not specifically authorized by the instructor to do so, or informing any person or persons of the contents of any examination prior to the time the examination is given, is considered cheating. Penalty for cheating will be extremely severe and may result in an F for this course.

  • VERY IMPORTANT! Failing to report to the instructor any incident in which a student witnesses an alleged violation of the Academic Honesty Code is considered a violation of the academic honesty code. Please see me to discuss any incidents.

Miscellaneous

  • Any person with a disability who requires a special accommodation should inform me and contact the Disability services office or call 281 283 2627 as soon as possible.

  • You are expected to come fully prepared to every class!

  • Incomplete grades or administrative withdrawals occur only in extremely rare situations.

  • The ringing, beeping, buzzing of cell phones, watches, and/or pagers during class time is extremely rude and disruptive to your fellow students and to the class flow. Please turn off all cell phones, watches, and pagers prior to the start of class.

  • There is no formal attendance policy. However, it is my experience that those students who do attend class on a regular basis do better on tests than those that don't.

  • I am willing to provide letters of recommendation/references only if you have attained an 'A' in one of my classes, or two 'A-' in two of my classes.

  • I highly recommend that you seek out your advisor and complete your Candidate Plan of Study (CPS) as soon as possible. I am normally not available for advising during the summer months.

  • Pay very careful attention to your email correspondence. It reflects on your communication skills. Below is a compilation of email errors I have received during the past year.

    dear sir.

    wen r u gonna grad the homework, bcoz i have a doubt about the third problem

    Some student

    Common problems:

       *   wen instead of when

       *   bcoz instead of because

       *   r instead of are

       *   u instead of you

       *   lowercase i instead of I

       *   starting a sentence with a lowercase letter

       *   doubt instead of question

  • I immediately discard anonymous emails.

2700 Bay Area Boulevard
Delta Building. Office 171
Houston, Texas 77058
Voice: 281-283-3805
Fax: 281-283-3869
boetticher@uhcl.edu


© 2002-2009 Boetticher: Data Mining Course, All Rights Reserved.
