What is Data Mining?
Advances in technology have made large data sets available in many scientific disciplines. Examples include medical imaging, bioinformatics, electronic health records, and remote sensing. As a result, scientific communities are increasingly interested in using emerging data mining techniques to analyze these large data sets (see Grossman et al., 2001).
Data mining is a powerful methodology for the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. Like the field of statistics, data mining provides tools for extracting knowledge from data. However, unlike classical statistics, which is focused mostly on inference about a population, data mining, and in particular “predictive analytics”, is focused mostly on predicting new data. Data mining has become popular in many applications, especially in business. To name a few examples:
Capital One bank uses data mining to predict whether a loan applicant will default on the loan, given information about the applicant's demographics, credit history, type of loan, etc.
Netflix (the largest DVD-by-mail rental company) and Amazon.com use data mining to provide recommendations to their customers (“you might also be interested in ___”).
British law enforcement and intelligence agencies use data mining to look for data patterns that might point to developing crime trends or security threats.
Facebook uses data mining to predict how active a user will be after 3 months.
Children's Hospital in Boston uses data mining to sift through emergency room patient records to detect domestic abuse.
Pandora (an Internet music radio offering customized music) chooses the next song to play using data mining algorithms.
Data Mining for Research
While data mining is useful in practice, it is also a powerful tool for testing existing theories and developing new models (Shmueli & Koppius, 2011). Many empirical research fields are dominated by statistical methods for analyzing data, mostly because of how researchers are trained and because data mining is not widely known. Yet data mining offers a distinct and complementary technology for deriving knowledge from data. Moreover, predictive analytics helps bridge gaps between theoretical work and practice. For these reasons, data mining has enormous potential to advance research in engineering fields that have large datasets.
Data mining provides semi-automated algorithms that learn from historical data how to combine a set of input information (X variables) to accurately predict a response of interest (Y variable). Statistical models typically focus on quantifying the relationship in the sample and making inferences about the relationship in the population. In contrast, predictive analytics focuses on predictive power: how well the input information can be used to predict new individual observations. Think of the difference between quantifying the relationship between the amount of smoking and the probability of cancer in a population (statistical methods) and predicting the probability of cancer for one or more individuals, given their smoking habits (predictive analytics). The two goals are different and call for different modeling (Shmueli, 2010).
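The contrast between the two goals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (generated from a made-up linear relationship with noise), and the helper names `fit_line`, `rmse`, and `make_data` are hypothetical, not part of any workshop material. The fitted slope plays the role of the quantity a statistical analysis would interpret, while the error on held-out observations is one simple measure of predictive power.

```python
import math
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def make_data(n, seed=0):
    """Synthetic data (an assumption for illustration): y = 2 + 0.5*x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


xs, ys = make_data(200)

# Hold out the last 50 observations; they are never used for fitting.
train_x, train_y = xs[:150], ys[:150]
hold_x, hold_y = xs[150:], ys[150:]

intercept, slope = fit_line(train_x, train_y)

# Explanatory view: interpret the estimated relationship in the sample.
print(f"estimated slope: {slope:.3f}")

# Predictive view: how well are new, unseen observations predicted?
model_pred = [intercept + slope * x for x in hold_x]
baseline_pred = [sum(train_y) / len(train_y)] * len(hold_y)  # always predict the training mean
model_rmse = rmse(hold_y, model_pred)
baseline_rmse = rmse(hold_y, baseline_pred)
print(f"holdout RMSE, model:    {model_rmse:.3f}")
print(f"holdout RMSE, baseline: {baseline_rmse:.3f}")
```

Comparing the model against a naive baseline on the holdout set is a common way to judge whether the inputs carry predictive information at all; it says nothing by itself about statistical significance of the slope.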
Data mining algorithms tend to be more data-driven: they “learn” from data with far fewer assumptions than statistical methods. With large datasets, such tools provide an opportunity for discovering new relationships, they are semi-automated, and they can be deployed in real time on large amounts of data.
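As one possible illustration of a data-driven method, the sketch below uses k-nearest-neighbors regression, chosen here only as an example (the text above does not name a specific algorithm). It assumes no functional form for the relationship between X and Y; a prediction is simply the average response of the most similar training observations. The data are synthetic and the function name `knn_predict` is hypothetical.

```python
import math
import random


def knn_predict(train_x, train_y, x_new, k=5):
    """Predict y at x_new as the mean response of the k nearest training points."""
    # Sort training pairs by distance to x_new and keep the k closest.
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Synthetic data (an assumption for illustration): a nonlinear signal plus noise.
rng = random.Random(1)
train_x = [rng.uniform(0, 2 * math.pi) for _ in range(300)]
train_y = [math.sin(x) + rng.gauss(0, 0.1) for x in train_x]

# The method is never told that the true relationship is a sine curve.
pred = knn_predict(train_x, train_y, math.pi / 2, k=5)
print(f"prediction at x = pi/2: {pred:.3f} (true signal value is 1.0)")
```

The choice of k is itself something that would normally be tuned by checking predictive performance on held-out data rather than fixed in advance.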
During the “Data Mining for Engineering Research” workshop, participants will:
Understand the basics of data mining and its potential for engineering research
Comprehend the concept of predictive power (as opposed to statistical significance), and the difference between statistical modeling and predictive analytics
Learn and experiment with several popular data mining methods
Become familiar with a few data mining software tools
The workshop will cover the following topics:
Introduction to data mining and predictive analytics (approach, concepts, applications)
Interactive data visualization
Prediction and Classification
Dimension reduction and pattern detection
Evaluating predictive power
The workshop is intended for graduate-level students and faculty who have taken at least one statistics course.
References
Grossman R L, Kamath C, Kegelmeyer P, Kumar V, & Namburu R (Eds.) (2001) Data Mining for Scientific and Engineering Applications. Series: Massive Computing, Vol. 2, Springer. ISBN: 978-1-4020-0033-1.
Shmueli G, Patel N R, & Bruce P (2010) Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, John Wiley and Sons Inc. ISBN: 978-0-470-52682-8.
Shmueli G & Koppius O (2011) “Predictive Analytics in Information Systems Research”, MIS Quarterly, forthcoming.
Shmueli G (2010) “To Explain or To Predict?”, Statistical Science, vol. 25(3), pp. 289-310.