Data Mining


Data mining is a broad, deep, and frequently ambiguous field.  Authorities don’t even agree on a definition for the term.  What I will do is tell you how I interpret the term, especially as it applies to the book that is now available.  But first, some personal history that sets the background for this book...

I’ve been blessed to work as a consultant in a wide variety of fields, enjoying rare diversity in my work.  Early in my career, I developed computer algorithms that examined high-altitude photographs in an attempt to discover useful things.  How many bushels of wheat can be expected from midwestern farm fields this year?  Are any of those fields showing signs of disease?  How much water is stored in mountain icepacks?  Is that anomaly a disguised missile silo?  Is it a nuclear test site?

Eventually I moved on to the medical field, and then finance: Does this photomicrograph of a tissue slice show signs of malignancy?  Do these recent price movements presage a market collapse?

All of these endeavors have something in common:  they all require that we find variables that are meaningful in the context of the application.  These variables might address specific tasks, such as finding effective predictors for a prediction model.  Or the variables might address more general tasks, such as unguided exploration, seeking unexpected relationships among variables, relationships that might lead to novel approaches to solving the problem.

That, then, is the motivation for this book.  I have taken some of my most-used techniques, those that I have found to be especially valuable in the study of relationships among variables, and documented them with basic theoretical foundations and well-commented C++ source code.  Naturally, this collection is far from complete.  Maybe Volume 2 will appear some day.  But this volume should keep you busy for a while.

Some readers may wonder why I have included a few techniques that are widely available in standard statistical packages, very old techniques such as maximum likelihood factor analysis and varimax rotation.  In these cases, I included them because they are useful, and yet reliable source code for these techniques is difficult to obtain.  There are times when it’s more convenient to have your own versions of old workhorses, integrated into your own personal or proprietary programs, than to be forced to coexist with canned packages that may not fetch data or present results in the way that you want.

Many readers will want to incorporate the routines in this book into their own data mining tools.  And that, in a nutshell, is the purpose of this book.  I hope that you incorporate these techniques into your own data mining toolbox and find them as useful as I have in my own work.

There is no sense in my listing here the main topics covered in this text; that’s what a Table of Contents is for.  But I would like to point out a few special topics not frequently covered in other sources:

●    Information theory is a foundation of some of the most important techniques for discovering relationships between variables, yet it is voodoo mathematics to many people.  For this reason I devote the entire first chapter to a systematic exploration of this topic.  I do apologize to those who purchased my “Assessing and Improving Prediction and Classification” book as well as this one, because this chapter is a nearly exact copy of a chapter in that book.  Nonetheless, this material is critical to understanding much of the later material in this book, and I felt that it would be unfair to all but force readers to purchase that earlier book in order to understand some of the most important topics in this one.

●    Uncertainty reduction is one of the most useful ways to employ information theory to understand how knowledge of one variable lets us gain measurable insight into the behavior of another variable.

●    Schreiber’s information transfer (often called transfer entropy) is a fairly recent development that lets us explore causality, the directional transfer of information from one time series to another.

●    Forward stepwise selection is a venerable technique for building up a set of predictor variables for a model.  But a generalization of this method, in which ranked sets of predictor candidates allow testing of large numbers of combinations of variables, is orders of magnitude more effective at finding meaningful and exploitable relationships between variables.

●    Simple modifications to relationship criteria let us detect profoundly nonlinear relationships using otherwise linear techniques.

●    Now that extremely fast computers are readily available, Monte-Carlo permutation tests are practical and broadly applicable methods for performing rigorous statistical relationship tests that until recently were intractable.

●    Combinatorially Symmetric Cross-Validation, a means of detecting overfitting in models, is a recently developed technique that, while computationally intensive, can provide valuable information not available as little as five years ago.

●    Automated selection of variables suited for predicting a given target has been routine for decades.  But in many applications we have a choice of possible targets, any of which will solve our problem.  Embedding target selection in the search algorithm adds a useful dimension to the development process.

●    Feature Weighting as Regularized Energy-Based Learning (FREL) is a recently developed method for ranking the predictive efficacy of a collection of candidate variables when we are in the situation of having too few cases to employ traditional algorithms.

●    Everyone is familiar with scatterplots as a means of visualizing the relationship between pairs of variables.  But they can be generalized in ways that highlight relationship anomalies far more clearly than scatterplots.  Examining discrepancies between joint and marginal distributions, as well as the contribution to mutual information, in regions of the variable space can show exactly where interesting interactions are happening.

●    Researchers, especially in the field of psychology, have been using factor analysis for decades to identify hidden dimensions in data.  But few developers are aware that a frequently ignored byproduct of maximum likelihood factor analysis can be enormously useful to data miners by revealing which variables are in redundant relationships with other variables, and which provide unique information.

●    Everyone is familiar with using correlation statistics to measure the degree of relationship between pairs of variables, and perhaps even to extend this to the task of clustering variables which have similar behavior.  But it is often the case that variables are strongly contaminated by noise, or perhaps by external factors that are not noise but that are of no interest to us.  Hence it can be useful to cluster variables within the confines of a particular subspace of interest, ignoring aspects of the relationships that lie outside this desired subspace.

●    It is sometimes the case that a collection of time-series variables is coherent; the variables are impacted as a group by one or more underlying drivers, and so they change in predictable ways as time passes.  Conversely, the variables may be mostly independent, each changing on its own as time passes, regardless of what the other variables are doing.  Detecting when our variables move from one of these states to the other allows us, among other things, to develop separate models, each optimized for the particular condition.
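
To give a rough feel for one of the topics above, here is a minimal sketch of a Monte-Carlo permutation test for a relationship between two variables.  It is an illustration only and is not taken from the book's source code.  The function names pearson and permutation_p_value are hypothetical, and ordinary Pearson correlation is assumed as the relationship criterion purely because it is short to write; any other criterion could be substituted.  The idea is to shuffle one variable many times, which destroys any real relationship, and count how often the shuffled data looks at least as related as the original data did.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Pearson correlation between two equal-length samples.
// Returns 0 if either sample has zero variance.
// (Hypothetical helper; an assumed stand-in for any relationship criterion.)
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    const double denom = std::sqrt(sxx * syy);
    return denom > 0.0 ? sxy / denom : 0.0;
}

// Two-sided Monte-Carlo permutation p-value for the correlation of x and y.
// y is taken by value so the caller's data is never disturbed.
// (Hypothetical function; n_perms and seed are illustrative parameters.)
double permutation_p_value(const std::vector<double>& x,
                           std::vector<double> y,
                           int n_perms,
                           unsigned seed) {
    const double observed = std::fabs(pearson(x, y));
    std::mt19937 rng(seed);
    int count = 1;  // the original, unpermuted arrangement counts as one case
    for (int rep = 0; rep < n_perms; ++rep) {
        std::shuffle(y.begin(), y.end(), rng);  // break any x-y pairing
        if (std::fabs(pearson(x, y)) >= observed) {
            ++count;
        }
    }
    // Fraction of arrangements at least as extreme as the observed one.
    return static_cast<double>(count) / (n_perms + 1);
}
```

Calling permutation_p_value(x, y, 999, 42u) returns an estimate of the probability that a relationship at least this strong would arise by chance alone.  Counting the original arrangement in both numerator and denominator keeps the estimate from ever being exactly zero.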


If you would like to download the Table of Contents, click here.

To download a zip file containing all of the source code referenced in the book, click here.  SEE NOTE BELOW!

To download the user's manual for the DATAMINE program which demonstrates the algorithms in the book, click here.

To download a zip file containing the DATAMINE program and its manual, click here.  SEE NOTE BELOW!


NOTE on downloading Zip files: As of this writing, GoDaddy has a bug in their website support which often prevents them from automatically detecting Zip files correctly.  They will tell you it is a file of unknown type with a long name.  You will need to download this file and then manually add a .ZIP extension, which will allow you to unzip the file.  I apologize for this, but there is nothing I can do about it.  It's a GoDaddy bug, and they were unable to give me a timeline for a fix.