Large data sets have become available in many scientific disciplines due to advances in technology. Examples include medical imaging, bioinformatics, electronic health records, and remote sensing. As a result, various scientific communities are increasingly interested in exploring emerging data mining techniques for analyzing these large data sets (see Grossman et al., 2001).
Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data.
Data mining is a powerful semi-automatic data analytics methodology. Like the field of statistics, data mining provides tools for extracting knowledge from data. However, unlike classical statistics, which focuses on quantifying relationships in a sample and inferring about the population, data mining, and in particular “predictive analytics”, is focused mostly on predicting new data.
Data mining has become popular in many applications, especially in business. To name a few examples:
While data mining is useful in practice, it is also a powerful tool for testing existing theories and developing new models (Shmueli & Koppius, 2011). Many empirical research fields are dominated by statistical methods for analyzing data, mostly because of how researchers are trained and their limited familiarity with data mining. Yet data mining offers a unique and complementary technology for deriving knowledge from data. Moreover, predictive analytics help bridge gaps between theoretical work and practice. For these reasons, data mining has enormous potential to further research in engineering fields that have large datasets.
Data mining provides semi-automated algorithms that learn from historical data how to combine a set of input information (X variables) to accurately predict a response of interest (Y variable). Statistical models typically focus on quantifying the relationship in the sample and inferring about the relationship in the population. In contrast, predictive analytics focus on predictive power, that is, on how well the input information can predict new individual observations. Think of the difference between quantifying the relationship between the amount of smoking and the probability of cancer in a population (statistical methods), compared to predicting the probability of cancer for one or more individuals, given their smoking habits (predictive analytics). The two goals are different and call for different modeling (Shmueli, 2010).
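The distinction above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are simulated rather than real, the "true" relationship between smoking and disease is invented, and the helper names (simulate, fit_logistic, predict_proba) are hypothetical. The same fitted logistic model is read in two ways: its slope summarizes the relationship in the population, while its predictions for new individuals are judged on a holdout set that was not used for fitting.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, rng):
    """Simulated (not real) data: x = packs smoked per day, y = 1 if diseased."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 3.0)
        p = sigmoid(-2.0 + 1.2 * x)  # invented relationship, illustration only
        y = 1 if rng.random() < p else 0
        data.append((x, y))
    return data


def fit_logistic(data, lr=0.5, epochs=1000):
    """Fit intercept b0 and slope b1 by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def predict_proba(b0, b1, x):
    """Predicted probability of disease for one individual with input x."""
    return sigmoid(b0 + b1 * x)


rng = random.Random(0)
data = simulate(1000, rng)
train, holdout = data[:700], data[700:]  # holdout plays the role of new data

b0, b1 = fit_logistic(train)

# Explanatory reading: the slope describes the population-level relationship
# (change in log-odds of disease per additional pack per day).
print("estimated slope:", round(b1, 2))

# Predictive reading: how well does the model classify individuals it has
# never seen? Accuracy is measured on the holdout set only.
correct = sum((predict_proba(b0, b1, x) >= 0.5) == (y == 1) for x, y in holdout)
holdout_accuracy = correct / len(holdout)
print("holdout accuracy:", round(holdout_accuracy, 2))

# Prediction for one new individual who smokes 2 packs per day.
print("predicted probability:", round(predict_proba(b0, b1, 2.0), 2))
```

A statistical analysis would go on to attach a standard error and a significance test to the slope, whereas a predictive analysis would compare several candidate models on the holdout set. The sketch stops short of both and shows only where the two goals diverge.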
Data mining algorithms tend to be more data-driven; they “learn” from data with far fewer assumptions than statistical methods. With large datasets, such tools provide an opportunity for discovering new relationships; they are semi-automated; and they can be deployed in real time on large amounts of data.
During the “data mining for engineering research” workshop, participants will
The workshop is intended for graduate-level students and faculty who have taken at least one statistics course.
Grossman R L, Kamath C, Kegelmeyer P, Kumar V, & Namburu R (Eds.) (2001) Data Mining for Scientific and Engineering Applications. Series: Massive Computing, Vol. 2, Springer. ISBN: 978-1-4020-0033-1
Shmueli G, Patel N R, and Bruce P (2010) Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, John Wiley and Sons Inc., ISBN: 978-0-470-52682-8.
Shmueli G and Koppius O (2011) “Predictive Analytics in Information Systems Research”, MIS Quarterly, forthcoming.
Shmueli G (2010) “To Explain or To Predict?”, Statistical Science, vol 25(3) pp. 289-310.