VarScreen

VarScreen is a free program which contains a variety of software tools useful for the developer of predictive models. These tools screen and evaluate candidates for predictors and targets. Some notable features of the program include:

1) Most operations involve just two quick steps: read the data and select the test to be performed. Program-supplied defaults are often satisfactory, and adjusting them is easy.

2) The program is fully multi-threaded, enabling it to take maximum advantage of modern multiple-core processors. As of this writing, many off-the-shelf computers contain a CPU with ten or more cores, each of which is hyperthreaded to run two instruction streams simultaneously. VarScreen keeps all of these threads busy as much as possible, which tremendously speeds operation compared to single-threaded programs.

3) The most massively compute-intensive algorithms make use of any CUDA-enabled nVidia card in the user's computer. These widely available video cards (standard hardware on many computers) turn an ordinary desktop computer into a super-computer, accelerating computations by several orders of magnitude. (The GTX Titan contains almost 3000 parallel processors.) Enormously complex algorithms that would require days of compute time on an ordinary computer with ordinary software can execute in several minutes using the VarScreen program on a computer with a modern nVidia display card.

4) Most tests print solo and unbiased p-values:

The Solo pval is the probability that a candidate that has a strictly random (no predictive power) relationship with the target could have, by sheer good luck, had a performance statistic at least as high as that obtained. If this quantity is not small, the developer should strongly suspect that the candidate is worthless for predicting the target. Of course, this logic is, in a sense, accepting a null hypothesis, which is well known to be a dangerous practice. However, if a reasonable number of cases are present and a reasonable number of Monte-Carlo replications have been done, this test is powerful enough that failure to achieve a small p-value can be interpreted as the candidate having little or no predictive power.

The problem with the Solo pval is that if more than one candidate is tested (the usual situation!), then there is a large probability that some truly worthless candidate will be lucky enough to achieve a high level of the performance statistic, and hence achieve a very small Solo pval. In fact, if all candidates are worthless, the Solo pvals will follow a uniform distribution, frequently obtaining small values by random chance. This situation can be remedied by conducting a more advanced test which accounts for this selection bias. The Unbiased pval for the best performer in the candidate set is the probability that this best performer could have attained its exalted level of performance by sheer luck if all candidates were truly worthless.
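The two p-values described above can be illustrated with a small Monte-Carlo permutation test. The following is only a minimal sketch, not VarScreen's own code: the function name `permutation_pvals`, its parameters, and the use of absolute correlation as the performance statistic are all assumptions made for illustration. Each replication shuffles the target, which destroys any real relationship; the Solo pval compares each candidate against its own shuffled scores, while the Unbiased pval compares it against the best shuffled score among all candidates.

```python
import numpy as np

def permutation_pvals(X, y, n_reps=1000, seed=0):
    """Return (solo, unbiased) p-values for each column of X against target y.

    Hypothetical helper for illustration. The performance statistic is the
    absolute Pearson correlation, chosen only because it is simple.
    """
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each candidate

    def score(target):
        t = (target - target.mean()) / target.std()
        return np.abs(Xc.T @ t) / len(t)        # |correlation| per candidate

    original = score(y)
    # The unpermuted data counts as one replication, so counts start at 1.
    solo_count = np.ones(X.shape[1])
    unbiased_count = np.ones(X.shape[1])
    for _ in range(n_reps - 1):
        permuted = score(rng.permutation(y))
        # Solo: did this candidate's own shuffled score match or beat it?
        solo_count += permuted >= original
        # Unbiased: did the best shuffled score of any candidate match or beat it?
        unbiased_count += permuted.max() >= original
    return solo_count / n_reps, unbiased_count / n_reps
```

Because the best shuffled score is never smaller than any single candidate's shuffled score, the Unbiased pval computed this way is never smaller than the Solo pval for the same candidate.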

At this time, VarScreen contains the following tests:

1) The Univariate Mutual Information test computes the mutual information between a specified target variable and each member of a specified set of predictor candidates. The predictors are then listed in descending order of mutual information. Along with each candidate, the Solo pval and Unbiased pval are printed if Monte-Carlo replications are requested. *** New in version 1.4 ***: When one uses mutual information to select promising predictors from among a set of competitors, one hopes that the selected predictors will continue to be superior out-of-sample. This new feature estimates the probability that this will be the case.

2) The Bivariate Mutual Information test computes the mutual information between each of one or more specified target variables and each possible pair of predictors taken from a specified set of predictor candidates. The predictor pairs and associated targets are then listed in the VARSCREEN.LOG file in descending order of mutual information. Along with each such set, the Solo pval and Unbiased pval are printed if Monte-Carlo replications are requested. This test is useful because sometimes a single variable acting alone has little or no predictive power, but in conjunction with another it becomes useful. Also, sometimes we have several equally useful candidates for the target variable, and we are not sure which will be most predictable. One example of this situation is when the application is predicting future movement of a financial market with the goal of taking a position and then hopefully closing the position with a profit. Should we employ a tight stop to discourage severe losses? Or should we use a loose stop to avoid being closed out by random noise? We might test multiple targets corresponding to various degrees of stop positioning, and then determine which of the competitors is most predictable.

3) (New in Version 1.1) The Bivariate Mutual Information test now has the option of selecting candidates based on uncertainty reduction instead of mutual information. If multiple target candidates are specified, this eliminates distortion caused by the targets having different entropies.

4) (New in Version 1.2) A new test implements the algorithm of Peng, Long and Ding (2005) "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy". This algorithm builds a subset of predictors, selected from a large list of candidates. This subset has the optimality property of maximizing its relevance in predicting the target while simultaneously minimizing its internal redundancy (and hence the size of the subset).

5) (New in Version 1.3) A new test develops hidden Markov models that fit all possible small subsets of a list of predictor candidates, and then identifies the model whose state probability vector correlates most highly with a target variable. This method of predictor selection can be superior to other methods when the data is sequential and the classification has memory (each class decision is impacted by the prior decision).

6) (New in Version 1.5) Stationarity in the mean is vital to most prediction schemes. If a predictor or target significantly changes its mean in the midst of a data stream, it would be foolish to assume that a prediction model will perform well on both sides of this break. Or suppose we are following the performance of a manufacturing process or a market trading system. If a previously successful system suddenly deteriorates, we wish to determine whether this falloff in performance is within historical norms or perhaps signifies something more serious. VarScreen helps analyze such situations, including the case of examining multiple data streams simultaneously.

Version 1.5 also includes a minor bug fix for hidden Markov models, as well as a significant speedup of Monte-Carlo permutation tests in all operations.

7) (New in Version 1.6) The data-miner's nightmare is an application with high dimensionality (numerous predictor candidates) but relatively few cases. FREL (Feature Weighting as Regularized Energy-Based Learning) is a recent development that is useful in this situation. VarScreen also contains an extension of this algorithm based on ensemble learning, which provides increased robustness against variations in the training data.

8) (New in Version 1.7) Principal components analysis yields unwieldy results when applied to massive amounts of data (such as several thousand potential predictor variables). The components can be nearly impossible to interpret and expensive to produce for future production runs. Forward selection component analysis, with optional backward refinement, lets the developer whittle down the set of variables in such a way that a specified number of principal components, generated from the same number of optimally selected variables, captures the maximum amount of variance attributable to the complete set of candidate variables. This lets the developer identify the most important variables and compute principal components from a subset much smaller than the original set.

9) (New in Version 1.8) Most feature selection algorithms favor predictors that have predictive power over most or all of the feature domain. But it is often the case that some predictors, although very powerful, have most or all of their predictive power concentrated in only part of the domain. Modern nonlinear models are usually able to take full advantage of such variables, but if our feature selection algorithm is unable to find them, they do us no good. The Local Feature Selection algorithm of Armanfard, Reilly, and Komeili does a fabulous job of finding features whose power is focused in relatively small areas of the feature domain. After studying and using this algorithm for some time now, I have to say that this is the most sophisticated and effective predictor selection algorithm that I've ever seen. Highly recommended.

10) (New in Version 1.9) Stepwise selection of predictive features is enhanced in three important ways. First, instead of keeping a single optimal subset of candidates at each step, this algorithm keeps a large collection of high-quality subsets and performs a more exhaustive search of combinations of predictors that have joint but not individual power. Second, cross validation is used to select features, rather than using the traditional in-sample performance. This provides an excellent means of complexity control, resulting in greatly improved out-of-sample performance. Third, a Monte-Carlo permutation test is applied at each addition step, assessing the probability that a good-looking feature set may be not good at all, but rather just lucky in its attainment of a lofty performance criterion.

11) (New in Version 1.9) Nominal-to-ordinal conversion lets us take a potentially valuable nominal variable (a category or class membership) that is unsuitable for input to a prediction model, and assign to each category a sensible numeric value that can be used as a model input.

12) (New in Version 2.0) Print tables of autocorrelation and partial autocorrelation.

13) (New in Version 2.1) Improved test for break in mean of a time series.

14) (New in Version 2.2) Indicator selection for financial market prediction based on thresholded profit factor, with selection bias elimination.

15) (New in Version 2.3) Assorted useful plotting functions, and a bug fix for Version 2.2.

16) (New in Version 2.4, accelerated in Version 2.5) The "indicator selection by optimal profit factor" has just gotten a huge new improvement. The Monte-Carlo permutation test used in the prior version, like all other permutation tests in the VarScreen program, provides correct unbiased p-values for only the single best indicator; for all other competitors the computed p-value is an upper bound. This can result in some good indicators being passed over because of excessive over-estimation. A recent breakthrough in permutation test theory now eliminates this problem, allowing detection of more good indicators while still maintaining a user-specified familywise error rate.

17) (New in Version 3.1) The RANSAC algorithm trains a linear-quadratic prediction model by iteratively estimating the probability that each case is excessively dominated by noise. The final model is based on only non-noise cases, and performance of predictor candidates is used to score predictors. The stepwise permutation test with fixed familywise error computes p-values corrected for selection bias. A new variable is created which is the percent probability that each case is not dominated by noise. NOTE... In order to better illustrate the RANSAC algorithm, I created a stand-alone version that runs in a Windows console. To download the complete source code, click here. You should be able to compile the single-threaded version on other platforms with only minor modification. Be sure to read the README.TXT file!
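As a rough illustration of the quantities behind tests 1 through 3 in the list above, the sketch below computes mutual information and uncertainty reduction for discretized variables. It is not taken from VarScreen: the function names and the equal-count binning are hypothetical, and uncertainty reduction is assumed here to follow the common definition of mutual information divided by the entropy of the target.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a discrete variable."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x_bins, y_bins):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete variables."""
    joint = np.stack([x_bins, y_bins], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_joint = -np.sum(p * np.log(p))
    return entropy(x_bins) + entropy(y_bins) - h_joint

def uncertainty_reduction(x_bins, y_bins):
    """Assumed definition: fraction of the target's entropy removed by X."""
    return mutual_information(x_bins, y_bins) / entropy(y_bins)

def to_bins(values, n_bins=3):
    """Discretize a continuous variable into roughly equal-count bins."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)
```

Under these assumptions, dividing by the target's entropy puts targets with different entropies on a common 0 to 1 scale, which is the kind of distortion the uncertainty reduction option is described as removing. A pair of predictors could be handled the same way by first combining the two bin labels into a single joint label.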

VarScreen is a work in progress. New screening algorithms will likely be added on a regular basis. Stay tuned.

Last but not least, please understand that VarScreen is an experimental program. It is provided free of charge to interested users for educational purposes only. In all likelihood this program contains errors and omissions. If you use this program for a purpose in which loss is possible, then you are fully responsible for any and all losses associated with use of this program. The developer of this program disclaims all responsibility for losses which the user may incur.

ERRATA: Versions 2.4 and 2.5 had an error in the Stepwise option of Optimal Profit Factor indicator selection which caused modest anti-conservative behavior when computing p-values. This was repaired in Version 2.6.

Versions prior to 2.8 had a memory allocation error in both operations under the 'Create' menu (FSCA and Nominal-to-Ordinal). This error could, in some unlikely but possible situations, cause incorrect results or even a program crash. This has been repaired in Version 2.8.

Versions prior to 3.0 had an annoying inconsistency. Normally, VarScreen.log is placed in the same directory as the data file that was read. However, if the user's computer had no CUDA-capable video card, or if the video driver was significantly outdated, VarScreen.log would instead be placed in the directory that holds the executable file. This inconsistent behavior has been eliminated in Version 3.0.

To download the manual only, click here.

To download the program (64-bit Windows) and manual (zipped), click here. The 32-bit version has been discontinued. Beginning with Version 3.0, no CUDA runtime dll is required; all CUDA operations are statically linked into the program.

These programs contain no embedded advertising or malware of any sort, nor do they report back to me any information about your computer. Because the downloaded Zip file contains an executable (.EXE), some anti-virus software may flag it as suspicious. I assure you, these programs are totally free of viruses et cetera. You can verify this yourself by downloading the file without opening it, and then having your anti-virus program scan the downloaded file.