## Dimensionality Reduction: Removing Data Columns with too many Missing Values

- 30/03/2015

**Published In**

- Big Data
- Analytics
- Business Intelligence

This is the first blog post in a series of seven on the theme of dimensionality reduction.

I am currently working on a whitepaper listing and describing seven classical methods for dimensionality reduction. Extracts from that whitepaper will be posted on this blog regularly.

**Information in Data Columns with too many Missing Values**

The most straightforward way to reduce a dataset's dimensionality is through the count of missing values. While there are many strategies to interpolate missing values in a data column, a column missing too many of its values carries little information, and removing it altogether is often the better option.

**Counting Missing Values**

To count the number of missing values, we can use either a Statistics node or a GroupBy node.

The **Statistics** node produces three data tables at its output ports: a statistics table, a nominal histogram table, and an occurrences table. The nominal histogram table contains the number of missing values for each data column. The statistics table contains, among other statistical measures, the total number of rows in the input data table. Based on these two values, a Math Formula node can calculate the ratio of missing values for each column as:

Ratio of missing values = number of missing values / total number of rows
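Outside KNIME, the same ratio can be sketched in a few lines of plain Python (a hypothetical example; the column names and data are made up for illustration):

```python
# A toy table as a list of rows; None marks a missing value.
# (Hypothetical column names and data, for illustration only.)
rows = [
    {"age": 25,   "income": None,  "city": "Berlin"},
    {"age": None, "income": None,  "city": "Zurich"},
    {"age": 31,   "income": 52000, "city": None},
    {"age": 40,   "income": 61000, "city": "Berlin"},
]

def missing_ratios(rows):
    """Ratio of missing values = number of missing values / total number of rows."""
    total = len(rows)
    return {
        col: sum(row[col] is None for row in rows) / total
        for col in rows[0]
    }

print(missing_ratios(rows))  # {'age': 0.25, 'income': 0.5, 'city': 0.25}
```

In the workflow, this per-column ratio is exactly what the Math Formula node computes from the two values produced by the Statistics node.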

If we use the **GroupBy** node, in the “Pattern Based Aggregation” tab we have to set “.*” as the RegEx search pattern matching the names of the columns in which to count missing values. After transposing the results, the same Math Formula node calculates the ratio of missing values, as described in the formula above.

A **Rule-based Row Filter** node applies the rule for a too-high missing value ratio (that is, ratio > threshold value) and filters out all columns matching the rule.

**Threshold Optimization**

The threshold value selection is of course a crucial topic: a too aggressive threshold reduces dimensionality at the expense of classification performance, while a too soft threshold might not achieve much reduction at all.
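One way to navigate this trade-off is to try several candidate thresholds and keep the one that gives the best downstream accuracy. Here is a self-contained sketch of such a search (hypothetical, not the KNIME workflow itself: the toy data, column names, and the inline nearest-centroid classifier standing in for a real learner are all made up):

```python
import random

random.seed(0)

# Toy dataset: 4 numeric columns with increasing fractions of missing values.
# Informative columns track the class label; None marks a missing value.
def make_row(label):
    row = {"label": label}
    for i, miss_p in enumerate([0.0, 0.1, 0.5, 0.95]):
        row[f"c{i}"] = None if random.random() < miss_p else label + random.gauss(0, 1)
    return row

data = [make_row(random.choice([0, 3])) for _ in range(200)]
train, test = data[:150], data[150:]
all_cols = ["c0", "c1", "c2", "c3"]

def missing_ratio(rows, col):
    return sum(r[col] is None for r in rows) / len(rows)

def accuracy(train, test, cols):
    """Nearest-centroid classifier: a tiny stand-in for a real learner."""
    if not cols:
        return 0.0
    centroids = {}
    for lbl in sorted({r["label"] for r in train}):
        rows = [r for r in train if r["label"] == lbl]
        centroids[lbl] = {
            c: sum(r[c] for r in rows if r[c] is not None)
               / max(1, sum(r[c] is not None for r in rows))
            for c in cols
        }
    hits = 0
    for r in test:
        pred = min(
            centroids,
            key=lambda lbl: sum(
                (r[c] - centroids[lbl][c]) ** 2 for c in cols if r[c] is not None
            ),
        )
        hits += pred == r["label"]
    return hits / len(test)

# Optimization loop: keep columns whose missing ratio is at most each candidate
# threshold, score the classifier, and remember the best-performing threshold.
results = {}
for t in [i / 10 for i in range(1, 10)]:
    kept = [c for c in all_cols if missing_ratio(train, c) <= t]
    results[t] = accuracy(train, test, kept)
best_t = max(results, key=results.get)
print(f"best threshold: {best_t}, accuracy: {results[best_t]:.2f}")
```

The loop plays the role of the Parameter Optimization Loop here; in KNIME the same search is driven by the parameter optimization loop nodes, as described below.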

To find the best threshold value, we used an **optimization loop** and maximized the classification accuracy on the test set. However, in some situations one classification algorithm might perform better than others. To be on the safe side, we ran the optimization loop with three classification algorithms – a neural network (MLP), a decision tree, and Naïve Bayes – selecting the best accuracy among them as the value to be maximized in the Parameter Optimization Loop End node.

**Full Process and Final Remarks**

The full process of eliminating data columns with too many missing values is shown in the figures below. The procedure that actually counts the missing values and filters out the columns can be found in the “Column Selection by Missing Values” metanode.

Figure 1. Content of "Best Threshold on Missing Values" metanode: optimization loop to find best threshold on ratio of missing values with maximum accuracy

Figure 2. Content of "Column Selection by Missing Values" metanode: counting missing values, calculating missing value ratio, and filtering out data columns with ratio > threshold

Notice that this technique for dimensionality reduction applies to both nominal and numerical columns, but it can be computationally intensive. The bottleneck of the whole operation, in terms of execution time, is counting the missing values across all input data columns, whether with the Statistics node or the GroupBy node. This is due to the underlying data sorting algorithms.

**Evaluation**

Using this approach on the small data set from the KDD 2009 challenge, the best threshold value was found at 0.4, leading to a dimensionality reduction of 71% and a corresponding accuracy of 76% with the MLP. However, by checking the accuracy associated with each threshold value in the loop, we noticed that cutting out only the data columns with more than 90% missing values, i.e. a 67% reduction, already kept a 76% accuracy on the evaluation set. That is quite impressive, considering that we were just counting!


#### Rosaria Silipo

Principal Data Scientist at KNIME

Opinions expressed by Gladwin Analytics members are their own.
