## Dimensionality Reduction: Principal Component Analysis (PCA)

- 10/05/2015
- 179
- 0 Like

**Published In**

- Big Data
- Analytics
- Business Intelligence

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to move the original n coordinates of a data set into a new set of n coordinates called principal components. As a result of the transformation, the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible); each succeeding component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

The purpose of applying PCA to a data set is ultimately to reduce its dimensionality: to find a new, smaller set of m variables, with m < n, that retains most of the data information, i.e. the variation in the data. Since the principal components (PCs) resulting from PCA are sorted by decreasing variance, keeping the first m PCs should retain most of the data information while reducing the data set dimensionality.
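The whole procedure can be sketched in a few lines of NumPy. This is an illustrative toy example, not the KNIME implementation: it builds the covariance matrix of a small synthetic data set, decomposes it into eigenvalues and eigenvectors, sorts the eigenvectors by decreasing variance, and projects the data onto the resulting PCs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data set: 100 rows, n = 3 numerical columns, with column 2 correlated to column 0
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

# Center the data, then compute the covariance matrix of the columns
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; re-sort eigenvalues and eigenvectors descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the centered data onto the principal components
pcs = Xc @ eigenvectors

# Share of total variance accounted for by each PC, largest first
print(eigenvalues / eigenvalues.sum())
```

Because the covariance matrix is symmetric, the eigenvectors are orthogonal, and the resulting PC columns are uncorrelated with each other, exactly as described above.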

Notice that the PCA transformation is sensitive to the relative scaling of the original variables. Therefore, data column ranges need normalizing before applying PCA. Also notice that the new coordinates (PCs) are no longer real, system-produced variables: applying PCA makes your data set lose its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation for your project.
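A small sketch can show why the scaling matters. In this assumed setup, two columns carry essentially the same signal, but one is measured in units roughly 1000 times larger; without normalization, the first PC is captured almost entirely by the large-range column.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two strongly correlated columns, but column 1 is measured in ~1000x larger units
base = rng.normal(size=200)
X = np.column_stack([base + 0.1 * rng.normal(size=200),
                     1000.0 * (base + 0.1 * rng.normal(size=200))])

def first_pc(data):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    centered = data - data.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return vecs[:, np.argmax(vals)]

# Without normalization, the large-range column dominates: PC1 is nearly [0, 1]
print(np.abs(first_pc(X)))

# After min-max normalization into [0, 1], both columns contribute comparably
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.abs(first_pc(Xn)))
```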

KNIME has two nodes that implement the PCA transformation: PCA Compute and PCA Apply.

The **PCA Compute** node calculates the covariance matrix of the input data columns and its eigenvectors, identifying the directions of maximal variance in the data space. A high eigenvalue indicates a high variance of the data along the corresponding eigenvector, so eigenvectors can be sorted by decreasing eigenvalue, i.e. by variance. The PCA Compute node outputs the covariance matrix, the PCA model, and the PCA spectral decomposition of the original data columns along the eigenvectors. The PCA model, produced at the last output port, contains the eigenvalues and the eigenvector projections necessary to transform each data row from the original space into the new PC space.

The **PCA Apply** node transforms a data row from the original space into the new PC space, using the eigenvector projections in the PCA model. A point from the original data set is converted into the new set of PC coordinates by multiplying the original zero-mean data row by the eigenvector matrix generated by the spectral decomposition data table.
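The projection step itself is a single matrix multiplication. The sketch below uses made-up column means and an illustrative orthonormal eigenvector matrix (these values are assumptions, not output of any real PCA Compute run) to transform one new data row into PC coordinates:

```python
import numpy as np

# Illustrative values standing in for a fitted PCA model:
# per-column means of the training data, and the eigenvector matrix
# (columns are unit eigenvectors) from the spectral decomposition
column_means = np.array([5.0, 10.0, 0.5])
eigenvectors = np.array([[0.8, -0.6, 0.0],
                         [0.6,  0.8, 0.0],
                         [0.0,  0.0, 1.0]])

# One new data row in the original coordinates
row = np.array([5.8, 10.6, 0.7])

# Subtract the training means to get a zero-mean row,
# then multiply by the eigenvector matrix to obtain the PC coordinates
pc_coords = (row - column_means) @ eigenvectors
print(pc_coords)  # [1.0, 0.0, 0.2]
```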

By reducing the number of eigenvectors, we effectively reduce the dimensionality of the new data set. Usually, only a subset of all PCs is necessary to retain most of the information in the original data set. The more information loss we can tolerate, the greater the dimensionality reduction of the data space. The configuration settings of the PCA Apply node allow you to define the maximum tolerable information loss; the node then calculates the number of PCs needed and the consequent dimensionality reduction.
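One way such a threshold can be turned into a number of PCs is via the cumulative share of variance: pick the smallest m whose first m eigenvalues retain at least 1 minus the tolerated loss. The eigenvalues below are made-up illustrative numbers.

```python
import numpy as np

# Eigenvalues sorted in decreasing order (illustrative values)
eigenvalues = np.array([4.0, 2.5, 1.0, 0.3, 0.2])

max_information_loss = 0.10  # tolerate losing at most 10% of the variance

# Cumulative share of total variance retained by the first m PCs
retained = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest m whose retained share reaches 1 - tolerated loss
m = int(np.argmax(retained >= 1 - max_information_loss)) + 1
print(m, retained[m - 1])  # 3 PCs retain 93.75% of the variance
```

With these eigenvalues, the first two PCs retain only 81.25% of the variance, so a 10% loss budget forces m = 3.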

Notice that PCA is a numerical technique: the reduction process affects only the numerical columns and does not act on the nominal columns. Also notice that PCA skips missing values. On a data set with many missing values, PCA will be less effective.

The figure below shows a PCA sub-workflow. Here the training set is used to build the covariance matrix, after dealing with missing values and normalizing all data columns to fall into [0,1]. The first m eigenvectors of the PCA model are then applied to transform the data set and reduce its dimensionality from the original n coordinates to the m selected PCs, with m < n.
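The same sub-workflow can be sketched end to end in NumPy; this is a minimal stand-in for the KNIME nodes, assuming mean imputation for missing values and min-max normalization, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # 50 rows, n = 4 numerical columns
X[3, 1] = np.nan              # a missing value, as in real data

# 1. Deal with missing values (here: simple column-mean imputation)
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X)

# 2. Normalize all columns to fall into [0, 1] (min-max normalization)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 3. "PCA Compute": covariance matrix and its spectral decomposition
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. "PCA Apply": keep only the first m eigenvectors, m < n
m = 2
X_reduced = Xc @ eigvecs[:, :m]
print(X_reduced.shape)  # (50, 2): dimensionality reduced from 4 to 2
```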


#### Rosaria Silipo

Principal Data Scientist at KNIME

Opinions expressed by Gladwin Analytics members are their own.

