Data Mining

Newsflash...  My latest book, "Extracting and Selecting Features for Data Mining: Algorithms in C++ and CUDA C" is in publication and should be available from Amazon and all major booksellers by June 2019.

Data mining is a broad, deep, and frequently ambiguous field.  Authorities don’t even agree on a definition for the term.  What I will do is tell you how I interpret the term, especially as it applies to the book that is now available.  But first, some personal history that sets the background for this book...

I’ve been blessed to work as a consultant in a wide variety of fields, enjoying rare diversity in my work.  Early in my career, I developed computer algorithms that examined high-altitude photographs in an attempt to discover useful things.  How many bushels of wheat can be expected from midwestern farm fields this year?  Are any of those fields showing signs of disease?  How much water is stored in mountain snowpacks?  Is that anomaly a disguised missile silo?  Is it a nuclear test site?

Eventually I moved on to the medical field, and then finance: Does this photomicrograph of a tissue slice show signs of malignancy?  Do these recent price movements presage a market collapse?

All of these endeavors have something in common:  they all require that we find variables that are meaningful in the context of the application.  These variables might address specific tasks, such as finding effective predictors for a prediction model.  Or they might address more general tasks, such as unguided exploration that seeks unexpected relationships among variables, relationships that might lead to novel approaches to solving the problem.

That, then, is the motivation for my two books that focus on data mining topics.  I'll begin with "Data Mining Algorithms in C++", published by the Apress division of Springer.  Later on this page I'll move on to my latest book, "Extracting and Selecting Features for Data Mining: Algorithms in C++ and CUDA C".  In the first book I have taken some of my most-used techniques, those that I have found to be especially valuable in the study of relationships among variables, and documented them with basic theoretical foundations and well-commented C++ source code.  Naturally, this collection is far from complete.  Maybe Volume 2 will appear some day.  But this volume should keep you busy for a while.

Some readers may wonder why I have included a few techniques that are widely available in standard statistical packages, very old techniques such as maximum likelihood factor analysis and varimax rotation.  In these cases, I included them because they are useful, and yet reliable source code for these techniques is difficult to obtain.  There are times when it’s more convenient to have your own versions of old workhorses, integrated into your own personal or proprietary programs, than to be forced to coexist with canned packages that may not fetch data or present results in the way that you want.

Many readers will want to incorporate the routines in this book into their own data mining tools.  And that, in a nutshell, is the purpose of this book.  I hope that you incorporate these techniques into your own data mining toolbox and find them as useful as I have in my own work.

There is no sense in my listing here the main topics covered in this text; that’s what a Table of Contents is for.  But I would like to point out a few special topics not frequently covered in other sources:

●    Information theory is a foundation of some of the most important techniques for discovering relationships between variables, yet it is voodoo mathematics to many people.  For this reason I devote the entire first chapter to a systematic exploration of this topic.  I do apologize to those who purchased my “Assessing and Improving Prediction and Classification” book as well as this one, because this chapter is a nearly exact copy of a chapter in that book.  Nonetheless, this material is critical to understanding much of the later material in this book, and I felt that it would be unfair to almost force readers to purchase that earlier book in order to understand some of the most important topics in this book.

●    Uncertainty reduction is one of the most useful ways to employ information theory to understand how knowledge of one variable lets us gain measurable insight into the behavior of another variable.

●    Schreiber’s information transfer is a fairly recent development which lets us explore causality, the directional transfer of information from one time series to another.

●    Forward stepwise selection is a venerable technique for building up a set of predictor variables for a model.  But a generalization of this method, in which ranked sets of predictor candidates allow testing of large numbers of variable combinations, is orders of magnitude more effective at finding meaningful and exploitable relationships between variables.

●    Simple modifications to relationship criteria let us detect profoundly nonlinear relationships using otherwise linear techniques.

●    Now that extremely fast computers are readily available, Monte-Carlo permutation tests are practical and broadly applicable methods for performing rigorous statistical relationship tests that until recently were intractable.

●    Combinatorially Symmetric Cross Validation is a recently developed technique for detecting overfitting in models which, while computationally intensive, can provide valuable information that was unavailable as little as five years ago.

●    Automated selection of variables suited for predicting a given target has been routine for decades.  But in many applications we have a choice of possible targets, any of which will solve our problem.  Embedding target selection in the search algorithm adds a useful dimension to the development process.

●    Feature Weighting as Regularized Energy-Based Learning (FREL) is a recently developed method for ranking the predictive efficacy of a collection of candidate variables when we are in the situation of having too few cases to employ traditional algorithms.

●    Everyone is familiar with scatterplots as a means of visualizing the relationship between pairs of variables.  But they can be generalized in ways that highlight relationship anomalies far more clearly than ordinary scatterplots can.  Examining discrepancies between joint and marginal distributions, as well as the contribution to mutual information, in regions of the variable space can show exactly where interesting interactions are happening.

●    Researchers, especially in the field of psychology, have been using factor analysis for decades to identify hidden dimensions in data.  But few developers are aware that a frequently ignored byproduct of maximum likelihood factor analysis can be enormously useful to data miners by revealing which variables are in redundant relationships with other variables, and which provide unique information.

●    Everyone is familiar with using correlation statistics to measure the degree of relationship between pairs of variables, and perhaps even to extend this to the task of clustering variables which have similar behavior.  But it is often the case that variables are strongly contaminated by noise, or perhaps by external factors that are not noise but that are of no interest to us.  Hence it can be useful to cluster variables within the confines of a particular subspace of interest, ignoring aspects of the relationships that lie outside this desired subspace.

●    It is sometimes the case that a collection of time-series variables is coherent; the variables are impacted as a group by one or more underlying drivers, and so they change in predictable ways as time passes.  At other times, this set of variables may be mostly independent, each changing on its own as time passes, regardless of what the other variables are doing.  Detecting when our variables move from one of these states to the other allows us, among other things, to develop separate models, each optimized for its particular condition.
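To make the uncertainty-reduction idea in the bullets above concrete, here is a minimal sketch of my own (not code from the book) for two discrete variables.  It computes entropies by simple counting and reports the fraction of one variable's entropy that knowledge of the other removes:

```cpp
#include <vector>
#include <map>
#include <cmath>
#include <cassert>

// Entropy of a discrete variable, in nats, estimated from observed frequencies.
double entropy(const std::vector<int>& x) {
    std::map<int, int> counts;
    for (int v : x) ++counts[v];
    double h = 0.0, n = static_cast<double>(x.size());
    for (const auto& kv : counts) {
        double p = kv.second / n;
        h -= p * std::log(p);
    }
    return h;
}

// Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y).
double mutual_info(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> joint(x.size());
    // Encode each (x,y) pair as one symbol; assumes category codes below 1000.
    for (size_t i = 0; i < x.size(); ++i) joint[i] = x[i] * 1000 + y[i];
    return entropy(x) + entropy(y) - entropy(joint);
}

// Uncertainty reduction: the fraction of H(X) explained by knowing Y.
double uncertainty_reduction(const std::vector<int>& x, const std::vector<int>& y) {
    double hx = entropy(x);
    return hx > 0.0 ? mutual_info(x, y) / hx : 0.0;
}
```

A perfectly predictive companion variable yields an uncertainty reduction of 1, while an unrelated one yields (approximately) 0.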


If you would like to download the Table of Contents, click here.

To download a zip file containing all of the source code referenced in the book, click here.  See the note on Zip files at the bottom of this page.

To download the user's manual for the DATAMINE program which demonstrates the algorithms in the book, click here.

To download a zip file containing the DATAMINE program and its manual, click here.  See the note on Zip files at the bottom of this page.


My latest book, "Extracting and Selecting Features for Data Mining: Algorithms in C++ and CUDA C", is now available.

The following topics are covered:

Hidden Markov models are chosen and optimized according to their multivariate correlation with a target.  The idea is that observed variables are used to deduce the current state of a hidden Markov model, and then this state information is used to estimate the value of an unobservable target variable.  This use of memory in a time series discourages whipsawing of decisions and enhances information usage.

Forward Selection Component Analysis uses forward and optional backward refinement of maximum-variance-capture components from a subset of a large group of variables.  This hybrid combination of principal components analysis with stepwise selection lets us whittle down enormous feature sets, retaining only those variables that are most important.

Local Feature Selection identifies predictors that are optimal in localized areas of the feature space but may not be globally optimal.  Such predictors can be effectively used by nonlinear models but are neglected by many other feature selection algorithms that require global predictive power.  Thus, this algorithm can detect vital features that are missed by other feature selection algorithms.

Stepwise selection of predictive features is enhanced in three important ways.  First, instead of keeping a single optimal subset of candidates at each step, this algorithm keeps a large collection of high-quality subsets and performs a more exhaustive search of combinations of predictors that have joint but not individual power.  Second, cross validation is used to select features, rather than using the traditional in-sample performance.  This provides an excellent means of complexity control, resulting in greatly improved out-of-sample performance.  Third, a Monte-Carlo permutation test is applied at each addition step, assessing the probability that a good-looking feature set may not be good at all, but rather just lucky in its attainment of a lofty performance criterion.
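The Monte-Carlo permutation test in the third enhancement can be sketched as follows.  This is my own illustration, not the book's code, and it uses absolute correlation as a stand-in for whatever performance criterion the selection step employs: shuffling the target destroys any real relationship, so the fraction of shuffles that match or beat the observed criterion estimates the probability that the observed performance is mere luck.

```cpp
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <cassert>

// Pearson correlation of two equal-length series.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

// Monte-Carlo permutation test: probability that a shuffled target attains a
// criterion at least as good as the observed one.
double permutation_pvalue(const std::vector<double>& feature,
                          const std::vector<double>& target,
                          int n_perm, unsigned seed = 12345) {
    double observed = std::fabs(pearson(feature, target));
    std::vector<double> shuffled = target;
    std::mt19937 rng(seed);
    int count = 1;  // the unpermuted arrangement counts as one trial
    for (int p = 0; p < n_perm; ++p) {
        std::shuffle(shuffled.begin(), shuffled.end(), rng);
        if (std::fabs(pearson(feature, shuffled)) >= observed) ++count;
    }
    return static_cast<double>(count) / (n_perm + 1);
}
```

A genuinely predictive feature yields a small p-value, so the addition step can reject candidates whose impressive criterion is plausibly just luck.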

Nominal-to-ordinal conversion lets us take a potentially valuable nominal variable (a category or class membership) that is unsuitable for input to a prediction model, and assign to each category a sensible numeric value that can be used as a model input.
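One common way to realize such a conversion is target-mean encoding, sketched below under my own assumptions (the book's actual assignment rule may differ): each category is assigned the mean target value observed among its cases, producing a numeric input that is monotonically related to the target.

```cpp
#include <vector>
#include <map>
#include <cmath>
#include <cassert>

// Map each nominal category code to the mean target value observed for that
// category, turning class membership into a usable numeric model input.
std::map<int, double> nominal_to_ordinal(const std::vector<int>& category,
                                         const std::vector<double>& target) {
    std::map<int, double> sum;
    std::map<int, int> cnt;
    for (size_t i = 0; i < category.size(); ++i) {
        sum[category[i]] += target[i];
        ++cnt[category[i]];
    }
    std::map<int, double> code;
    for (const auto& kv : sum) code[kv.first] = kv.second / cnt[kv.first];
    return code;
}
```

In practice such an encoding should be computed on training data only (ideally with cross validation) to avoid leaking the target into the feature.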

If you would like to download the Table of Contents, click here.

To download a zip file containing all of the source code referenced in the book, click here.  Because this file contains C++ and CUDA C source, your web browser may issue a virus warning (even though there is no executable in it!).  To satisfy yourself of its safety, just download the file without opening it and then use your anti-virus program to scan it.


To download a demonstration program and user's manual, click the "VarScreen" link and scroll down to the bottom of the page.



NOTE on downloading Zip files: As of this writing, GoDaddy has a bug in their website support which often prevents them from automatically detecting Zip files correctly.  They will tell you it is a file of unknown type with a long name.  You will need to download this file and then manually add a .ZIP extension, which will allow you to unzip the file.  I apologize for this, but there is nothing I can do about it.  It's a GoDaddy bug, and they were unable to give me a timeline for a fix.