## Dimensionality Reduction: Removing Highly Correlated Columns

- 27/04/2015

**Published In**

- Big Data
- Analytics
- Business Intelligence

Input features are often correlated, i.e. they depend on one another and carry similar information. A data column whose values are highly correlated with those of another column adds little new information to the existing pool of input features. One of the two columns can usually be removed without much reducing the total amount of information available.

In order to remove one out of a pair of highly correlated data columns, we need to:

1. measure the correlation between columns in pairs using the Linear Correlation node,

2. find the pairs of columns with correlation higher than a given threshold (if any) and remove one of the two, using the Correlation Filter node.

The Linear Correlation node calculates the correlation coefficient for all pairs of numerical columns in the data set as Pearson's product-moment coefficient ρ, and for all pairs of nominal columns as Pearson's chi-square value. No correlation coefficient is defined between a numerical and a nominal data column.
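The two measures named above can be sketched in plain Python. This is only an illustration of the underlying statistics, not the node's implementation; the function names `pearson` and `chi_square` and the sample columns are assumptions made for this example.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one column is constant
    return cov / math.sqrt(var_x * var_y)


def chi_square(a, b):
    """Pearson's chi-square statistic for two equal-length nominal lists."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    observed = Counter(zip(a, b))  # contingency table as (value_a, value_b) -> count
    stat = 0.0
    for value_a, total_a in count_a.items():
        for value_b, total_b in count_b.items():
            expected = total_a * total_b / n  # count expected under independence
            stat += (observed[(value_a, value_b)] - expected) ** 2 / expected
    return stat


# Hypothetical sample columns: the same height in two units, and two
# nominal columns that always change together.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
color = ["red", "red", "blue", "blue"]
shade = ["dark", "dark", "light", "light"]

rho = pearson(height_cm, height_in)
chi2 = chi_square(color, shade)
print(rho, chi2)
```

In this toy data the two height columns are perfectly correlated, so one of them is redundant, and the two nominal columns are fully dependent on each other.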

The output table of the node is the correlation matrix, i.e. the matrix with the correlation coefficients for all pairs of data columns. The matrix cell for a pair made of a numerical and a nominal column holds a missing value.

The workflow also brings numerical columns to a uniform range with a Normalizer node before filtering highly correlated columns. Strictly speaking, Pearson's ρ is unchanged by a linear rescaling such as min-max normalization, so this step standardizes the data rather than altering the coefficients.
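One common way to obtain uniform ranges is min-max scaling. The sketch below assumes scaling to [0, 1]; the function name `min_max_normalize` and the sample columns are hypothetical and only illustrate the idea, not the Normalizer node's implementation.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a numeric list so it spans [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to stretch; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Hypothetical columns on very different scales.
income = [20000.0, 35000.0, 50000.0]
age = [20.0, 35.0, 50.0]

income_scaled = min_max_normalize(income)
age_scaled = min_max_normalize(age)
print(income_scaled, age_scaled)
```

After scaling, both columns span the same range.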

The node also produces a color-coded view of the correlation matrix (see figure below), where values range from intense blue (+1 = full correlation) through white (0 = no correlation) to red (-1 = full inverse correlation). The matrix diagonal shows the correlation of a data column with itself, which is always 1.0. Crosses indicate a missing correlation value, e.g. for the combination of a numerical and a categorical column.

The Correlation Filter node takes a correlation matrix as input, identifies pairs of columns with high correlation (i.e. greater than a given threshold), and removes one of the two columns for each identified pair. A slider sets the threshold value, and the Calculate button shows how many data columns would remain at that threshold. The lower the threshold, the more aggressive the column filter. The best threshold value can be found by optimizing the accuracy on a validation set with an optimization loop.
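The filtering step can be sketched as a simple greedy pass over the correlation matrix. This is an assumption-laden illustration, not the node's actual algorithm: the function name `correlation_filter`, the rule of keeping the first column of each pair, the use of the absolute value of the coefficient, and the sample matrix are all choices made for this example.

```python
def correlation_filter(corr, threshold):
    """Return the columns kept after dropping one column from every pair
    whose absolute correlation exceeds the threshold.

    corr is a nested dict: corr[col_a][col_b] is a coefficient, or None
    when no correlation is defined for that pair.
    """
    columns = list(corr)
    removed = set()
    for i, first in enumerate(columns):
        if first in removed:
            continue
        for second in columns[i + 1:]:
            if second in removed:
                continue
            value = corr[first][second]
            if value is None:
                continue  # missing coefficient, e.g. numerical vs nominal
            if abs(value) > threshold:
                removed.add(second)  # assumption: keep the earlier column
    return [c for c in columns if c not in removed]


# Hypothetical correlation matrix: a, b, c are numerical, d is nominal.
corr = {
    "a": {"a": 1.0, "b": 0.95, "c": 0.35, "d": None},
    "b": {"a": 0.95, "b": 1.0, "c": 0.40, "d": None},
    "c": {"a": 0.35, "b": 0.40, "c": 1.0, "d": None},
    "d": {"a": None, "b": None, "c": None, "d": 1.0},
}

# Mimic trying several thresholds and counting the remaining columns.
for threshold in (0.99, 0.9, 0.3):
    kept = correlation_filter(corr, threshold)
    print(threshold, len(kept), kept)
```

Lowering the threshold from 0.99 to 0.3 removes progressively more columns in this toy matrix, which is the "more aggressive" behavior described above.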

Below is a metanode filtering highly correlated data columns with a fixed threshold value.

#### Rosaria Silipo

Principal Data Scientist at KNIME

Opinions expressed by Gladwin Analytics members are their own.
