## Dimensionality Reduction: Low Variance Filter

- 13/04/2015

**Published In**

- Big Data
- Analytics
- Business Intelligence

**Information in Data Columns with Low Variance**

Another way of measuring how much information a data column carries is to measure its variance. In the limit case where all cells in a column hold the same value, the variance is zero and the column is of no help in discriminating between different groups of data.
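For reference, the variance of a column with values $x_1, \dots, x_n$ can be written as below. This is the population form, dividing by $n$; whether a particular tool divides by $n$ or by $n - 1$ is an assumption here and is not stated in this article.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
$$

If every cell holds the same value $c$, then $\bar{x} = c$, every term $(x_i - \bar{x})^2$ is zero, and so $\sigma^2 = 0$, which is the limit case described above.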

The Low Variance Filter node calculates the variance of each column and removes those columns whose variance falls below a given threshold.
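The filtering rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the KNIME node's implementation: the function name `low_variance_filter`, the sample table, and the use of population variance are all assumptions made for the example.

```python
from statistics import pvariance


def low_variance_filter(columns, threshold):
    """Keep only the columns whose variance is at least `threshold`.

    `columns` maps a column name to a list of numbers.
    Population variance is used here; that choice is an assumption.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= threshold
    }


# Hypothetical table with values already in the range [0, 1].
table = {
    "constant": [0.5, 0.5, 0.5, 0.5],         # no spread at all
    "almost_flat": [0.50, 0.51, 0.50, 0.49],  # very small spread
    "spread": [0.0, 1.0, 0.0, 1.0],           # large spread
}

kept = low_variance_filter(table, threshold=0.01)
print(list(kept))
```

With this threshold only the `spread` column survives; the constant and nearly constant columns are dropped.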

**Necessary Pre-processing**

Variance can only be calculated for numerical columns, so this dimensionality reduction method applies only to numerical columns. The variance value also depends on the column's numerical range. Column ranges therefore need to be normalized before the variance is calculated, so that variance values are independent of each column's domain range.
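A small sketch shows why the range matters. Rescaling a column by a factor $a$ multiplies its variance by $a^2$, so the same measurements expressed in different units give very different variances. The column names and the helper `min_max_normalize` are hypothetical and exist only for this example.

```python
from statistics import pvariance


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# The same measurements in meters and in millimeters.
heights_m = [1.5, 1.75, 2.0, 1.75]
heights_mm = [v * 1000 for v in heights_m]

var_m = pvariance(heights_m)
var_mm = pvariance(heights_mm)  # larger by a factor of 1000 squared

# After normalization both versions of the column have the same variance.
norm_var_m = pvariance(min_max_normalize(heights_m))
norm_var_mm = pvariance(min_max_normalize(heights_mm))

print(var_m, var_mm)
print(norm_var_m, norm_var_mm)
```

Without normalization, a single threshold would treat the millimeter column as far more informative than the meter column, although both carry exactly the same information.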

**Low Variance Filter Sub-Workflow**

First, a Normalizer node normalizes all column ranges to [0, 1]. Next, a Low Variance Filter node calculates the variance of each column and filters out the columns with a variance lower than a set threshold. Finally, all remaining columns are de-normalized to return them to their original numerical range.
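The three steps can be approximated outside KNIME with a short Python sketch. It assumes min-max normalization and population variance, and the names `low_variance_subworkflow`, `sensor_a`, `sensor_b` and `sensor_c` are made up for the example. Because the variance is computed on a normalized copy, "de-normalizing" here simply means returning the original values of the columns that pass the filter.

```python
from statistics import pvariance


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def low_variance_subworkflow(columns, threshold):
    """Normalize each column, filter on variance, return originals."""
    kept = {}
    for name, values in columns.items():
        normalized = min_max_normalize(values)
        if pvariance(normalized) >= threshold:
            # Keep the column in its original numerical range.
            kept[name] = values
    return kept


# Hypothetical table with 20 rows and three numerical columns.
table = {
    "sensor_a": [100.0, 900.0] * 10,      # alternates across its full range
    "sensor_b": [10.0] * 20,              # constant
    "sensor_c": [0.0] * 19 + [1000.0],    # flat except for one value
}

result = low_variance_subworkflow(table, threshold=0.05)
print(list(result))
```

In this example only `sensor_a` passes the filter, and its values come back unchanged in their original range.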

**Threshold Value Optimization**

As for the previously published method (

**Evaluation**

Using this approach and using the small KDD data set from the

Higher threshold values, i.e. more tolerant methods, actually produce worse accuracy values, which suggests that dimensionality reduction matters not only for execution time but also for performance improvement.


#### Rosaria Silipo

Principal Data Scientist at KNIME

Opinions expressed by Gladwin Analytics members are their own.
