Inductive Solutions, Inc.
380 Rector Place, Suite 4A, New York, New York 10280

Telephone: +1 (212)945.0630   Fax: +1 (212)945.0367

 

RunPCA

How do you mine a set of data for the most significant factors?  How can you pre-condition a set of data to increase the speed and accuracy of regression analysis or neural network training?  How do you "reduce the dimensionality" of high dimensional problems? 

RunPCA is an information discovery ("data mining") tool based on Principal Component Analysis, a statistical method that transforms a set of data inputs into a new, smaller set of uncorrelated inputs ordered by information content (variance).

RunPCA Features

  • Computes means, variances, covariances, and correlations of large data sets
  • Computes and ranks principal components and their variances
  • Automatically transforms data sets
  • Can analyze data sets of up to 50,000 rows and 200 columns
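
To illustrate the kind of computation these features describe, here is a minimal Python sketch of principal component analysis. It is not RunPCA's code or API: the function names (covariance_matrix, jacobi_eigen, principal_components, transform), the choice of Jacobi rotations for the eigen-decomposition, and the small synthetic data set are all assumptions made for illustration.

```python
import math
import random


def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return means, cov


def jacobi_eigen(sym, sweeps=50, tol=1e-22):
    """Eigen-decompose a small symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors); eigenvectors[k] pairs with eigenvalues[k].
    """
    d = len(sym)
    a = [row[:] for row in sym]                      # working copy, driven to diagonal
    v = [[float(i == j) for j in range(d)] for i in range(d)]  # accumulated rotations
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < tol:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q]; kept within +/- 45 degrees.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):                     # column update: M <- M J
                    for k in range(d):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(d):                   # row update: A <- J^T A
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    values = [a[i][i] for i in range(d)]
    vectors = [[v[i][k] for i in range(d)] for k in range(d)]  # columns of v
    return values, vectors


def principal_components(cov):
    """Return (variances, directions) ranked from highest to lowest variance."""
    values, vectors = jacobi_eigen(cov)
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    return [values[k] for k in order], [vectors[k] for k in order]


def transform(rows, means, components):
    """Project mean-centered rows onto the principal directions."""
    return [[sum((x - m) * w for x, m, w in zip(row, means, comp))
             for comp in components] for row in rows]


# Synthetic data (an assumption): the third column is a combination of the
# first two, so the data really has only two independent dimensions.
random.seed(0)
data = []
for _ in range(500):
    x, y = random.gauss(0, 1), random.gauss(0, 0.5)
    data.append([x, y, 0.5 * x + 0.5 * y])

means, cov = covariance_matrix(data)
variances, components = principal_components(cov)
scores = transform(data, means, components)
print("principal variances:", [round(val, 6) for val in variances])
```

In this sketch the third principal variance comes out as zero up to rounding error, because the third column adds no independent information. The transformed columns in scores are uncorrelated, and their variances sum to the total variance of the original columns.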

Benefits

  • Easy-to-Learn and Easy-to-Use Excel Spreadsheet User Interface
  • Computation is very fast
  • The RunPCA C/C++ Library is available for further customization

License

The standard single user license is for Microsoft Windows.  Other licensing plans for other platforms are also available. Contact us about versions for other operating systems (such as Linux or Solaris), about site licenses, or about academic discounts.

For example, suppose we have a table of 10,000 rows and 3 columns (or "factors") and we want to discover relationships among the three columns.  The following table shows the variance of each column, its share of the total variance (in percent), and the accumulated share:

Variance    Fraction (%)    Accumulated (%)
0.381745    66.57           66.57
0.095436    16.64           83.21
0.096271    16.79           100

The first column carries the most information (the highest variance): almost two-thirds of the total, as the first row of the table shows.  The remaining information is split almost evenly between the other two factors, as the next two rows show.
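
The fraction and accumulated columns follow directly from the variances: each fraction is a variance divided by the total variance, and each accumulated value is the running sum of the fractions. A minimal Python sketch of that arithmetic, using the three column variances quoted above (the function name variance_shares is an assumption, not part of RunPCA):

```python
from itertools import accumulate


def variance_shares(variances):
    """Return (fractions, accumulated), both as percentages of the total variance."""
    total = sum(variances)
    fractions = [100 * v / total for v in variances]
    accumulated = [100 * c / total for c in accumulate(variances)]
    return fractions, accumulated


column_variances = [0.381745, 0.095436, 0.096271]
fractions, accumulated = variance_shares(column_variances)
for v, f, a in zip(column_variances, fractions, accumulated):
    print(f"{v:.6f}  {f:6.2f}  {a:6.2f}")
```

The accumulated value never decreases from one row to the next and reaches 100 on the last row.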

After processing by RunPCA, the three original columns are transformed to "principal factors."  Now the variance of the transformed data (consisting of the three principal factors) is distributed as follows:

Variance    Fraction (%)    Accumulated (%)
0.494189    86.18           86.18
0.079263    13.82           100
0           0               100

Now most of the information (highest variance) is contained in the first principal factor (86% of the information). The remaining information is contained entirely in the second principal factor; the third has zero variance and can be dropped without losing any information.  This reduces the dimension of the data from three to two, a reduction of one third.  If we have additional data of observed responses (or target outputs), we can therefore perform regression (or train a neural network) using only two columns of the transformed data, rather than the three columns of the original data. This can improve the speed and accuracy of the training or regression.
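
One way to decide how many principal factors to keep is to count the leading factors needed to reach a chosen share of the total variance. The following Python sketch applies that rule to the principal-factor variances quoted above; the function name components_needed and the 99.9% coverage threshold are assumptions for illustration, not RunPCA settings.

```python
def components_needed(variances, coverage=0.999):
    """Smallest number of leading factors whose variance share reaches `coverage`."""
    total = sum(variances)
    running = 0.0
    for count, v in enumerate(sorted(variances, reverse=True), start=1):
        running += v
        if running >= coverage * total:
            return count
    return len(variances)


principal_variances = [0.494189, 0.079263, 0.0]
kept = components_needed(principal_variances)
reduction = 1 - kept / len(principal_variances)
print(kept, f"{reduction:.0%}")  # prints: 2 33%
```

Applied to the original column variances (0.381745, 0.095436, 0.096271), the same rule keeps all three columns, which is why the transformation is needed before any column can be dropped.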

 
Purchase Student Version

 

(c) 2006 Inductive Solutions, Inc. All rights reserved.