Principal Component Analysis in Python

In this week’s data science project, I apply principal component analysis to a hospital mortality dataset using Python. The dataset is extensive, containing over 50 variables covering demographic characteristics, vital signs, and laboratory measurements. This should be a perfect use case for dimensionality reduction with PCA.

What is Principal Component Analysis?

Like the K-Means clustering in my previous project, PCA is a type of unsupervised learning, which means that we do not need any pre-defined labels for our data and we are not trying to make predictions.

The goal of PCA is to reduce the number of variables in our dataset while retaining nearly all of the original information and variability. One of its main practical benefits is speeding up machine learning: training a model on fewer variables consumes fewer resources and reaches a result faster.

Eigenvectors and eigenvalues drive PCA at its core. An eigenvector defines a direction (a line) through the data, and the algorithm chooses the directions along which the projected data points vary the most. The corresponding eigenvalue represents the magnitude of that variability. There are only as many eigenvectors as there are variables in the data.
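
To make this concrete, here is a minimal sketch on a toy dataset (random numbers, not the hospital data) showing how the eigenvectors and eigenvalues come out of the covariance matrix:

import numpy as np

# Toy dataset: 100 observations of 3 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)               # centre each variable

# Eigen-decomposition of the covariance matrix
cov = np.cov(X, rowvar=False)        # 3 x 3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# One eigenvector (direction) per variable; its eigenvalue is the
# variance of the data projected onto that direction
print(np.sort(eigenvalues)[::-1])    # variance captured, largest first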

An important question to ask in PCA is how many principal components to retain. The 2 most popular methods are:

  • Plotting the cumulative variance explained as components are added. You choose a cutoff value for the variance and retain the number of components needed to reach that cutoff.
  • A scree plot showing the eigenvalue of each principal component, where the usual cutoff is an eigenvalue of 1.

There are 3 critical processing conditions that must be satisfied when preparing the data for PCA:

  • There must be no null or blank values in the dataset
  • All variables must be numeric
  • All variables must be standardised to have a mean of 0 and a standard deviation of 1

Data & Project Goals

The data for this project is from Kaggle. Remember, if you’d like to know where I look for my datasets, check out my post on finding data.

This data is from the MIMIC-III database on in-hospital mortality. The original purpose of the dataset is to predict whether a patient in intensive care lives or dies in hospital, based on the selected risk factors.

There are a total of 51 variables in the data. In this project, we will attempt to reduce these variables down to a more manageable number using principal component analysis.

These are the goals for this project:

  • Prepare the data for PCA, satisfying the 3 critical processing conditions mentioned in the previous section
  • Apply the PCA algorithm (using the sklearn library)
  • Determine the optimal number of principal components to retain

Data Pre-Processing

We begin, as always, with importing the required libraries and the data.

import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
# Import data
pca_raw_df = pd.read_csv("data/mortality.csv")

Next, we will carry out these 3 pre-processing steps:

  1. Remove any rows that contain blank / NA
  2. Select only numeric variables from the data
  3. Standardise the variables
# 1. Remove rows that contain blank values
pca_nona_df = pca_raw_df.dropna()

# 2. Select only numeric columns
# Columns whose maximum value is 1 are assumed to be binary (0/1) indicators
# rather than continuous measurements, so we keep the rest
max_all = pca_nona_df.max()
max_cols = list(max_all[max_all != 1].index)

pca_sub_df = pca_nona_df[max_cols]

# Drop the columns that did not fit into our initial logic
pca_sub2_df = pca_sub_df.drop(columns = ['group', 'ID', 'gendera'])

# 3. Standardise the variables
scaler = StandardScaler()
pca_std = scaler.fit_transform(pca_sub2_df)

Once all pre-processing steps are complete, 38 variables remain.
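
If you want to verify this yourself, a quick check of the shapes does the trick:

# Sanity check: 38 columns should remain after pre-processing
print(pca_sub2_df.shape)   # (rows, 38)
print(pca_std.shape)       # same shape, now a standardised NumPy array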

Applying Principal Component Analysis

Now that our data is ready, we can apply PCA. We will then use these 2 methods to determine the optimal number of components to retain:

  1. Cumulative variance
  2. Scree plot
# Fit PCA
pca = PCA()
fit_pca = pca.fit_transform(pca_std)
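
Before plotting anything, it is worth a quick look at what the fitted object gives us: explained_variance_ratio_ holds the proportion of variance captured by each component, and it is what both of the plots below are built from.

# Proportion of variance explained by each component
print(pca.explained_variance_ratio_[:5])     # first five components
print(pca.explained_variance_ratio_.sum())   # sums to (approximately) 1 across all 38 components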

Plot Cumulative Variance

# Plot the cumulative variance for each component
plt.figure(figsize = (15, 6))
components = np.arange(1, 39, step=1)
variance = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(components, variance, marker='o', linestyle='--', color='green')

plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Cumulative variance (proportion)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.90, color='r', linestyle='-')
plt.text(0.5, 0.85, '90% variance threshold', color = 'red', fontsize=16)
# np.where gives a 0-based index, so add 1 to get the number of components
plt.text(25, 0.85, "Components needed: "+str(np.where(variance >= 0.90)[0][0] + 1), color = "red", fontsize=16)

plt.show()

The above plot contains the cumulative variance as a proportion on the y axis and the number of components on the x axis.

Using a variance threshold of 90%, the above chart helps us determine how many components we should retain from our dataset in order for it to still make sense for us in any further modelling.

Note that we chose 90% here as the variance threshold but this is not a golden rule. The researcher or data scientist chooses this variance threshold.

So, we can see from this plot that we need 22 components to retain 90% of the variability (information) in our data. That’s not much of an improvement over the original 38 variables.
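
As a side note, scikit-learn can also pick this number for us: passing a float between 0 and 1 as n_components tells PCA to keep just enough components to explain that fraction of the variance. A short sketch using the standardised data from above (pca_90 and scores_90 are just illustrative names):

# Let PCA keep however many components are needed to explain 90% of the variance
pca_90 = PCA(n_components=0.90)
scores_90 = pca_90.fit_transform(pca_std)

print(pca_90.n_components_)   # number of components retained
print(scores_90.shape)        # reduced dataset: (rows, n_components_)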

Scree Plot

# Scree plot
plt.figure(figsize=(15, 6))
components = np.arange(1, 39, step=1)
eigenvalues = pca.explained_variance_

plt.plot(components, eigenvalues, marker='o', linestyle='--', color='green')
plt.ylim(0, max(eigenvalues))

plt.ylabel('Eigenvalue')
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step = 1))
plt.title('Scree plot')

plt.axhline(y=1, color = 'r', linestyle = '-')
plt.text(0, 0.75, 'Eigenvalue Cutoff', color = 'red', fontsize=14)
# eigenvalues are in descending order, so the index of the first value <= 1
# equals the number of components with an eigenvalue above 1
plt.text(15, 1.10, 'Components Needed: '+str(np.where(eigenvalues <= 1)[0][0]),
         color='red', fontsize=14)

plt.show()

The above scree plot contains the eigenvalues on the y axis and the number of components on the x axis.

As a general rule of thumb, we should keep the components that have an eigenvalue greater than 1. Based on the above scree plot, we should choose to keep 13 components. This is very different from the previous cumulative variance plot, where we chose to retain 22 components.

However, if we choose to retain 13 components, this only explains 65% of the variability. This is much lower than the variability captured when retaining 22 components (90%). At this point, the researcher or data scientist would need to decide whether a cumulative variance of 65% is satisfactory based on their domain knowledge and other research in the area.
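
Both of these numbers can be read straight off the fitted PCA object with a quick check along these lines (n_components_scree and variance_explained are just illustrative names):

# Components with eigenvalue > 1 (the scree plot cutoff)
n_components_scree = int(np.sum(pca.explained_variance_ > 1))

# Cumulative variance explained by that many components
variance_explained = np.cumsum(pca.explained_variance_ratio_)[n_components_scree - 1]

print(n_components_scree)             # 13 components
print(round(variance_explained, 2))   # around 0.65 (65% of the variability)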

Alternatively, there are other methods that can determine the optimal number of components. It is also possible that PCA just isn’t a suitable method for this dataset and other dimensionality reduction techniques could be explored.


You can find all the code for this project on GitHub.

I hope you found this project useful! Comment below or connect with me on Twitter if you have any questions.