Past Meeting - November 20, 2003
Learn about Data Mining Tools and Techniques
and a Respectable Open Source Implementation
Thursday, October 23, 2003
6:30 P.M. - 8:00 P.M.
Sun Microsystems
Building Sun SAN05
9540 Towne Centre Drive
San Diego, CA 92121
Summary of the meeting
Drs. Balac and Sipes gave an excellent overview of the problems addressed
by data mining techniques, then delved into some simple techniques,
and ultimately into Weka, an excellent open source general data mining
tool. We found out that data mining is only partially about digging
into a database and finding correlations. Much of the work is in preparing
the data so that it's as uniform and complete as possible. Only after
that's done can tools like Weka be used to scan and evaluate it.
The story doesn't end there, though. It takes a thorough understanding
of the kinds of computations a tool like Weka can make. Knowing what
Weka is doing, when to use particular Weka tools, and what the results
mean requires advanced statistics, advanced mathematics, and advanced
computer skills. Recommendation: while there is a lot one can find out
by using data mining techniques, don't try this without a thorough
grounding in all of its aspects -- preferably from one of Drs.
Balac's and Sipes' UCSD Extension classes!
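To make the data preparation point in the summary concrete, here is a minimal sketch of two common cleanup steps: filling in missing values (completeness) and rescaling a column to a common range (uniformity). It was not shown at the meeting and is not part of Weka. The function names and the sample data are assumptions made up for illustration.

```python
from statistics import mean

def fill_missing(column):
    """Replace None entries with the mean of the known values."""
    known = [v for v in column if v is not None]
    if not known:
        raise ValueError("column has no known values")
    avg = mean(known)
    return [avg if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up example column with one missing entry.
ages = [20, None, 40, 60]
prepared = min_max_scale(fill_missing(ages))
print(prepared)  # [0.0, 0.5, 0.5, 1.0]
```

Mean imputation and min-max scaling are only two of many possible choices. Which preparation steps are appropriate depends on the data and on the mining technique applied afterwards.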
Abstract
There is an abundance of data that is rapidly being generated. Intelligent
software tools are increasingly needed to process and filter the data,
detect new patterns and similarities, and learn the information lying
hidden in the data. Large databases of information create great opportunities
for the application of data mining methods. Conventional computer science
algorithms are useful, but not powerful enough to solve many of the
knowledge discovery and pattern extraction problems. Data mining approaches
(such as decision trees, regression trees, clustering, association rules,
and neural networks) are ideally suited for domains characterized by
the presence of large amounts of noisy data and the absence of general
theories or hypotheses about the data. The fundamental idea behind these
approaches is to learn automatically from the data, creating a theory,
hypothesis or a model, through a process of inference, model fitting,
or learning from examples.
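As an illustration of "learning from examples", the following minimal sketch learns a one-level decision tree (often called a decision stump) from labeled data. It is a hand-written toy, not the method presented in the talk and not how Weka implements its tree learners. The function names and the training data are assumptions invented for this example.

```python
def learn_stump(examples):
    """Learn a one-level decision tree from (value, label) pairs.

    Tries a threshold halfway between each pair of adjacent distinct
    values and keeps the rule that misclassifies the fewest training
    examples. Returns (threshold, label_below, label_above).
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({lab for _, lab in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    1 for v, lab in examples
                    if (below if v <= threshold else above) != lab
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    if best is None:
        raise ValueError("need at least two distinct values")
    return best[1:]

def predict(stump, value):
    """Apply a learned stump to a new value."""
    threshold, below, above = stump
    return below if value <= threshold else above

# Made-up training data: (hours of sunshine, did the plant flower?)
examples = [(1, "no"), (2, "no"), (3, "no"),
            (6, "yes"), (7, "yes"), (8, "yes")]
stump = learn_stump(examples)
print(stump)              # (4.5, 'no', 'yes')
print(predict(stump, 5))  # yes
```

The model here is the learned threshold and the two labels: nothing about the rule was written by hand, it was inferred from the examples. Full decision tree learners apply the same idea recursively and over many attributes.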
This talk introduces data mining and gives an overview of the basic
data mining tools and techniques, followed by a presentation of Weka,
a respectable open source data mining tool. We describe Weka and compare
it to several other tools. We conclude with what Weka, and data mining
in general, can accomplish, and why data mining has rightly become
a topic of so much interest.
Presenter Bios
Natasha Balac, Ph.D., received her Ph.D. in Computer Science from Vanderbilt
University, with an emphasis in Artificial Intelligence, Data Mining and
Robotics. She has developed a novel planning and learning system for
a mobile robot, using action models produced by the data mining technique
she introduced: the multivariate regression tree induction method. Currently,
Natasha works at the San Diego Supercomputer Center and teaches
Data Mining courses at the University of California San Diego Extension.
Tamara Sipes, Ph.D., is a Data Mining Specialist at Alodar Systems,
Inc., a consulting company offering solutions in Bioinformatics, Predictive
Modeling, and Enterprise Application Integration. Dr. Sipes uses her
data mining expertise to analyze data, select meaningful attributes,
discard outlying and redundant information, and build predictive models
that discover significant trends and relationships. Her work has led
to patent awards for clients in Biotechnology and other industries and
to published research in the areas of data mining and learning technologies.
Dr. Sipes was awarded her doctorate in Artificial Intelligence/Data
Mining at Vanderbilt University.