VarScreen

News flash... My free VarScreen program, intended to help developers of predictive models screen predictor and target candidates, now has two new features in Version 1.9: enhanced stepwise selection and nominal-to-ordinal conversion.

VarScreen is a free program which contains a variety of software tools useful for the developer of predictive models. These tools screen and evaluate candidates for predictors and targets. Some notable features of the program include:

1) Most operations involve just two quick steps: read the data and select the test to be performed. Program-supplied defaults are often satisfactory, and adjusting them is easy.

2) The program is fully multi-threaded, enabling it to take maximum advantage of modern multiple-core processors. As of this writing, many off-the-shelf computers contain a CPU with ten or more cores, each of which is hyperthreaded to execute two instruction streams simultaneously. VarScreen keeps all of these threads as busy as possible, which tremendously speeds operation compared to single-threaded programs.

3) The most massively compute-intensive algorithms make use of any CUDA-enabled nVidia card in the user's computer. These widely available video cards (standard hardware on many computers) turn an ordinary desktop computer into a supercomputer, accelerating computations by several orders of magnitude. (The GTX Titan contains almost 3000 parallel processors.) Enormously complex algorithms that would require days of compute time on an ordinary computer with ordinary software can execute in several minutes using the VarScreen program on a computer with a modern nVidia display card.

4) Most tests print solo and unbiased p-values:

The Solo pval is the probability that a candidate that has a strictly random (no predictive power) relationship with the target could have, by sheer good luck, had a performance statistic at least as high as that obtained. If this quantity is not small, the developer should strongly suspect that the candidate is worthless for predicting the target. Of course, this logic is, in a sense, accepting a null hypothesis, which is well known to be a dangerous practice. However, if a reasonable number of cases are present and a reasonable number of Monte-Carlo replications have been done, this test is powerful enough that failure to achieve a small p-value can be interpreted as the candidate having little or no predictive power.

The problem with the Solo pval is that if more than one candidate is tested (the usual situation!), then there is a large probability that some truly worthless candidate will be lucky enough to attain a high level of the performance statistic, and hence a very small Solo pval. In fact, if all candidates are worthless, the Solo pvals will follow a uniform distribution, frequently taking on small values by random chance. This situation can be remedied by a more advanced test which accounts for this selection bias. The Unbiased pval for the best performer in the candidate set is the probability that this best performer could have attained its exalted level of performance by sheer luck if all candidates were truly worthless.
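The logic behind these two p-values can be sketched in a few lines. This is illustrative Python only, not VarScreen's actual code; I assume absolute correlation as the performance statistic, but any statistic works the same way:

```python
import random

def perm_pvals(predictors, target, n_reps=1000, seed=42):
    """Solo and Unbiased p-values via Monte-Carlo permutation.

    predictors: dict of name -> list of values; target: list of values.
    The performance statistic here is |Pearson correlation|, chosen
    only for illustration."""
    rng = random.Random(seed)

    def abs_corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return abs(sxy / (sxx * syy) ** 0.5)

    orig = {k: abs_corr(v, target) for k, v in predictors.items()}
    solo = {k: 1 for k in predictors}      # count the unpermuted trial
    best_orig = max(orig.values())
    best_beats = 1
    shuffled = list(target)
    for _ in range(n_reps):
        rng.shuffle(shuffled)              # destroy any true relationship
        stats = {k: abs_corr(v, shuffled) for k, v in predictors.items()}
        for k in predictors:
            if stats[k] >= orig[k]:        # worthless yet at least as good
                solo[k] += 1
        # Unbiased pval: how often does the best PERMUTED performer beat
        # the best original performer?  This accounts for selection bias.
        if max(stats.values()) >= best_orig:
            best_beats += 1
    denom = n_reps + 1
    return {k: v / denom for k, v in solo.items()}, best_beats / denom
```

Note that the unpermuted trial is counted in both numerator and denominator, so a reported p-value can never be exactly zero.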

At this time, VarScreen contains the following tests:

1) The Univariate Mutual Information test computes the mutual information between a specified target variable and each member of a specified set of predictor candidates. The predictors are then listed in descending order of mutual information. Along with each candidate, the Solo pval and Unbiased pval are printed if Monte-Carlo replications are requested. *** New in version 1.4 ***: When one uses mutual information to select promising predictors from among a set of competitors, one hopes that the selected predictors will continue to be superior out-of-sample. This new feature estimates the probability that this will be the case.
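For the curious, the core computation can be sketched like this (illustrative Python; VarScreen's own partitioning and estimator are more sophisticated, so treat the simple equal-count binning here as an assumption):

```python
from collections import Counter
from math import log2

def mutual_information(x, y, bins=5):
    """Mutual information after binning each series into `bins`
    equal-count buckets (a stand-in for VarScreen's partitioning)."""
    def to_bins(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        b = [0] * len(v)
        for rank, i in enumerate(order):
            b[i] = rank * bins // len(v)
        return b
    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def rank_candidates(candidates, target):
    """List predictor candidates in descending order of MI with the target."""
    scored = [(name, mutual_information(x, target))
              for name, x in candidates.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```

With 5 bins, a perfectly informative predictor tops out at log2(5) (about 2.32) bits.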

2) The Bivariate Mutual Information test computes the mutual information between each of one or more specified target variables and each possible pair of predictors taken from a specified set of predictor candidates. The predictor pairs and associated targets are then listed in the VARSCREEN.LOG file in descending order of mutual information. Along with each such set, the Solo pval and Unbiased pval are printed if Monte-Carlo replications are requested. This test is useful because sometimes a single variable acting alone has little or no predictive power, but in conjunction with another it becomes useful. Also, sometimes we have several equally useful candidates for the target variable, and we are not sure which will be most predictable. One example of this situation is when the application is predicting future movement of a financial market with the goal of taking a position and then hopefully closing the position with a profit. Should we employ a tight stop to discourage severe losses? Or should we use a loose stop to avoid being closed out by random noise? We might test multiple targets corresponding to various degrees of stop positioning, and then determine which of the competitors is most predictable.

3) (New in Version 1.1) The Bivariate Mutual Information test now has the option of selecting candidates based on uncertainty reduction instead of mutual information. If multiple target candidates are specified, this eliminates distortion caused by the targets having different entropies.
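Uncertainty reduction is simply mutual information normalized by the target's entropy, which puts targets with different entropies on a comparable footing. A minimal sketch, assuming the series have already been discretized:

```python
from collections import Counter
from math import log2

def entropy(v):
    """Shannon entropy (bits) of a discrete series."""
    n = len(v)
    return -sum((c / n) * log2(c / n) for c in Counter(v).values())

def uncertainty_reduction(x, y):
    """Fraction of the target's entropy explained by the predictor:
    UR = I(X;Y) / H(Y).  Inputs are assumed already discrete (binned)."""
    hy = entropy(y)
    mi = entropy(x) + hy - entropy(list(zip(x, y)))  # I(X;Y) via joint entropy
    return mi / hy
```

A perfect predictor gives 1.0 and an irrelevant one gives 0.0, regardless of how many categories the target has.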

4) (New in Version 1.2) A new test implements the algorithm of Peng, Long and Ding (2005) "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min Redundancy". This algorithm builds a subset of predictors, selected from a large list of candidates. This subset has the optimality property of maximizing its relevance in predicting the target while simultaneously minimizing its internal redundancy (and hence the size of the subset).
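The greedy selection loop at the heart of this algorithm can be sketched as follows (illustrative Python; the `mi` helper is a simple plug-in estimator for already-discrete data, which is my assumption, and Peng et al. also describe a quotient form of the criterion not shown here):

```python
from collections import Counter
from math import log2

def mi(x, y):
    """Plug-in mutual information for already-discrete series."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_select(candidates, target, k):
    """Greedy mRMR: at each step add the candidate maximizing
    relevance (MI with the target) minus redundancy (mean MI with
    the predictors already selected)."""
    relevance = {name: mi(x, target) for name, x in candidates.items()}
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mi(remaining[name], candidates[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

Note how a duplicate of an already-selected predictor is penalized to a score of zero, so a complementary (even if individually weaker) candidate wins the next slot.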

5) (New in Version 1.3) This test develops hidden Markov models that fit all possible small subsets of a list of predictor candidates, and then identifies the model whose state probability vector correlates most highly with a target variable. This method of predictor selection can be superior to other methods when the data is sequential and the classification has memory (each class decision is impacted by the prior decision).

6) (New in Version 1.5) Stationarity in the mean is vital to most prediction schemes. If a predictor or target significantly changes its mean in the midst of a data stream, it would be foolish to assume that a prediction model will perform well on both sides of this break. Or suppose we are following the performance of a manufacturing process or a market trading system. If a previously successful system suddenly deteriorates, we wish to determine whether this falloff in performance is within historical norms or perhaps signifies something more serious. VarScreen helps analyze such situations, including the case of examining multiple data streams simultaneously.
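As a rough illustration of break detection in the mean (not VarScreen's actual test, which also assesses significance), one can scan every split point of a data stream for the largest mean shift:

```python
def mean_break_scan(x, min_seg=10):
    """Find the split point maximizing the absolute difference in means
    between the two sides of the split.  A simple stand-in for a
    stationarity test; no significance computation is done here."""
    n = len(x)
    prefix = [0.0]
    for v in x:                      # prefix sums make each split O(1)
        prefix.append(prefix[-1] + v)
    best_split, best_diff = None, -1.0
    for s in range(min_seg, n - min_seg + 1):
        left = prefix[s] / s
        right = (prefix[n] - prefix[s]) / (n - s)
        d = abs(left - right)
        if d > best_diff:
            best_split, best_diff = s, d
    return best_split, best_diff
```

In practice one would compare `best_diff` against its distribution under permutation (as in the Monte-Carlo tests described above) to decide whether the apparent break is within historical norms.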

Version 1.5 also includes a minor bug fix for hidden Markov models, as well as a significant speedup of Monte-Carlo permutation tests in all operations.

7) (New in Version 1.6) The data-miner's nightmare is an application with high dimensionality (numerous predictor candidates) but relatively few cases. FREL (Feature Weighting as Regularized Energy-Based Learning) is a recent development that is useful in this situation. VarScreen also contains an extension of this algorithm based on ensemble learning, which provides increased robustness against variations in the training data.

8) (New in Version 1.7) Principal components analysis yields unwieldy results when applied to massive amounts of data (such as several thousand potential predictor variables). The components can be nearly impossible to interpret and expensive to produce for future production runs. Forward selection component analysis, with optional backward refinement, lets the developer whittle down the set of variables in such a way that a specified number of principal components, generated from the same number of optimally selected variables, captures the maximum amount of variance attributable to the complete set of candidate variables. This lets the developer identify the most important variables and compute principal components from a subset much smaller than the original set.
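The forward-selection idea can be sketched with ordinary least squares (illustrative Python with NumPy; backward refinement is omitted, and measuring captured variance by least-squares projection onto the selected raw variables is my simplification):

```python
import numpy as np

def forward_select_vars(X, k):
    """Greedy forward selection: at each step add the column whose
    inclusion maximizes the total variance of ALL columns explained
    by least-squares projection onto the selected columns."""
    Xc = X - X.mean(axis=0)                  # center every variable
    total = (Xc ** 2).sum()
    selected = []
    for _ in range(k):
        best_j, best_expl = None, -1.0
        for j in range(X.shape[1]):
            if j in selected:
                continue
            S = Xc[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(S, Xc, rcond=None)
            expl = ((S @ coef) ** 2).sum()   # variance captured
            if expl > best_expl:
                best_j, best_expl = j, expl
        selected.append(best_j)
    return selected, best_expl / total       # vars, fraction of variance
```

Note that a variable collinear with one already selected adds essentially nothing, so the algorithm naturally reaches for genuinely new directions of variation.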

9) (New in Version 1.8) Most feature selection algorithms favor predictors that have predictive power over most or all of the feature domain. But it is often the case that some predictors, although very powerful, have most or all of their predictive power concentrated in only part of the domain. Modern nonlinear models are usually able to take full advantage of such variables, but if our feature selection algorithm is unable to find them, they do us no good. The Local Feature Selection algorithm of Armanfard, Reilly, and Komeili does a fabulous job of finding features whose power is focused in relatively small areas of the feature domain. After studying and using this algorithm for some time now, I have to say that this is the most sophisticated and effective predictor selection algorithm that I've ever seen. Highly recommended.

10) (New in Version 1.9) Stepwise selection of predictive features is enhanced in three important ways. First, instead of keeping a single optimal subset of candidates at each step, this algorithm keeps a large collection of high-quality subsets and performs a more exhaustive search of combinations of predictors that have joint but not individual power. Second, cross validation is used to select features, rather than using the traditional in-sample performance. This provides an excellent means of complexity control, resulting in greatly improved out-of-sample performance. Third, a Monte-Carlo permutation test is applied at each addition step, assessing the probability that a good-looking feature set may be not good at all, but rather just lucky in its attainment of a lofty performance criterion.
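A stripped-down sketch of the first two enhancements (a beam of surviving subsets plus cross-validated scoring; the Monte-Carlo permutation step is omitted here, and the linear model is my stand-in for whatever criterion VarScreen actually uses):

```python
import numpy as np

def cv_score(X, y, cols, folds=5):
    """Cross-validated R^2 of a linear model on the given columns."""
    n = len(y)
    idx = np.arange(n)
    preds = np.empty(n)
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        A = np.column_stack([X[train][:, cols], np.ones(len(train))])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([X[test][:, cols], np.ones(len(test))])
        preds[test] = B @ coef
    return 1.0 - ((y - preds) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def beam_stepwise(X, y, max_feats, beam=10):
    """Stepwise selection keeping the `beam` best subsets at each size,
    so pairs with joint but not individual power can survive the search."""
    frontier = [((), -np.inf)]
    best = frontier[0]
    for _ in range(max_feats):
        seen, expanded = set(), []
        for subset, _ in frontier:
            for j in range(X.shape[1]):
                if j in subset:
                    continue
                cand = tuple(sorted(subset + (j,)))
                if cand in seen:
                    continue
                seen.add(cand)
                expanded.append((cand, cv_score(X, y, list(cand))))
        expanded.sort(key=lambda t: -t[1])
        frontier = expanded[:beam]
        if frontier[0][1] > best[1]:
            best = frontier[0]
    return best
```

Because the score is out-of-sample, adding a useless feature tends to lower it, which is the complexity control mentioned above.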

11) (New in Version 1.9) Nominal-to-ordinal conversion lets us take a potentially valuable nominal variable (a category or class membership) that is unsuitable for input to a prediction model, and assign to each category a sensible numeric value that can be used as a model input.
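One simple realization of such a conversion is target encoding: replace each category with the mean target value observed for that category. VarScreen's own mapping may well differ; this is only to fix the idea:

```python
from collections import defaultdict

def nominal_to_ordinal(categories, target):
    """Map each category to the mean target value seen for that
    category, and return (encoded series, mapping)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, target):
        sums[c] += t
        counts[c] += 1
    mapping = {c: sums[c] / counts[c] for c in sums}
    return [mapping[c] for c in categories], mapping
```

The resulting numeric series can be fed to any prediction model; for production use one would keep `mapping` around to encode future cases (and decide on a default for unseen categories).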

VarScreen is a work in progress. New screening algorithms will likely be added on a regular basis. Stay tuned.

Last but not least, please understand that VarScreen is an experimental program. It is provided free of charge to interested users for educational purposes only. In all likelihood this program contains errors and omissions. If you use this program for a purpose in which loss is possible, then you are fully responsible for any and all losses associated with use of this program. The developer of this program disclaims all responsibility for losses which the user may incur.

NOTE: Versions prior to 1.82 contain a potentially serious bug... the random number generator used for Monte-Carlo permutation tests has a thread conflict issue that, under some rare conditions, may compromise results, especially for the cyclic method. I believe it has been repaired in Version 1.82, uploaded 4/16/2019. It is highly unlikely that you have ever encountered this bug, but I want to be safe and fix it.

To download the manual only, click here. This is actually Chapter 7 of my latest book, "Extracting and Selecting Features for Data Mining". To help you find your way around the manual, you really should also download its Table of Contents; click here.

To download the program (64-bit Windows) and manual (zipped), click here. Note that the program and the CUDA runtime must be placed in the same directory. The 32-bit version has been discontinued.

These programs contain no embedded advertising or malware of any sort, nor do they report back to me any information about your computer. Because the downloaded Zip file contains an executable (.EXE), some anti-virus software may flag it as suspicious. I assure you, these programs are totally free of viruses et cetera. You can verify this yourself by downloading the file without opening it, and then having your anti-virus program scan the downloaded file.