This project is based on Season 3, Episode 2 of the Kaggle Playground Series. The title of this episode is: “Tabular Classification with a Stroke Prediction Dataset”. Our task is to predict the probability that a patient will have a stroke.
The target, stroke, is a binary variable and so classification methods are needed to predict the probability of stroke. In this project, I will be training an ensemble of machine learning models on the data.
The models I use (and their associated Python libraries) are:
Logistic regression (from sklearn.linear_model import LogisticRegression)
Decision tree (from sklearn.tree import DecisionTreeClassifier)
XGBoost (import xgboost as xgb)
LightGBM (import lightgbm as lgb)
AdaBoost (from sklearn.ensemble import AdaBoostClassifier)
K-nearest neighbours (from sklearn.neighbors import KNeighborsClassifier)
Random forest (from sklearn.ensemble import RandomForestClassifier)
Support vector machines (from sklearn.svm import SVC)
However, instead of building the ensemble by just averaging the results of the predictions from these models, I will instead use a method called Super Learner Ensembling. This method uses cross-validation and combines the results through a final meta learner. In this project, I use logistic regression as the meta-learner and the package ML-Ensemble (which simplifies the process considerably).
Check out a previous blog post to find out more about the different ensemble methods you can use.
Stroke Data Description
The dataset I will use in this project was synthetically generated from the original stroke dataset. This allows us to use real-world data while not compromising privacy in any way. However, we will combine some parts of the original stroke dataset to enrich our analysis.
This stroke dataset contains the following features:
id: unique identifier
gender: “male”, “female”, or “other” (categorical)
age: age (in years)
hypertension: whether or not the patient has hypertension (binary)
heart_disease: whether or not the patient has heart disease (binary)
ever_married: whether or not the patient was ever married (binary)
work_type: type of work (categorical)
Residence_type: “Rural” or “Urban” (categorical)
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: smoking status (categorical)
stroke: whether or not the patient had a stroke (the binary target)
In this phase of the project, I want to identify any potential ‘funnies’ in the data that may require cleaning and create any visualizations that may be useful for the model-building stage.
Using df.describe() prints out some quick summary statistics of the dataset. However, by using the exclude parameter, you can ignore columns of a certain type (don’t know how I only just discovered this).
A first look at this table shows two obvious problems:
Minimum age is 0.08 – unless the study included newborns, this is probably dirty data.
Minimum BMI is 10.30 – this is an unrealistic value for a person so this is probably also dirty data.
Dirty data like this can be removed or imputed as if they were nulls. In this project, they didn’t make much of a difference to the modeling so I left those data points in the data.
Next, I want to have a look at the distribution of the target variable, stroke. Quickly printing the counts reveals a very unbalanced dataset with only 632 cases out of over 15k data points!
A post on the discussion forum for this dataset suggests adding valid cases from the original dataset to this dataset for training. There are 249 cases in that dataset, of which 40 BMI values are null. Taking complete cases leaves 209 additional cases that can be included in our model here.
Visualizing the target variable (after adding the additional cases):
Next, I print out a few quick counts of the categorical variables in the dataset.
Gender           Count
Male             5857
Female           9446
Other            1

Residence Type   Count
Rural            7763
Urban            7749

Ever Married     Count
No               4941
Yes              10571

Work Type        Count
Govt job         1561
Never worked     42
Private          9879
Self-employed    1992
Children         2038
These counts reveal two issues:
Gender has an ‘other’ category. Since there is only 1 data point with this category, I will simplify things by replacing it with ‘Female’ – the dominant gender category. Thereafter, the gender variable can be converted to a 0/1 binary variable.
Residence type and ever married should be converted to 0/1 binary variables. One hot encoding is possible for these variables, but it would unnecessarily increase the dimensionality of the data and could hurt the performance of the model. A short pandas sketch of these two cleanup steps follows below.
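Here is a rough sketch of this cleanup in pandas, assuming the training data is in a DataFrame called df and that the column and category names match the tables above (both are assumptions on my part):
# Replace the single 'Other' gender with the dominant category
df["gender"] = df["gender"].replace("Other", "Female")
# Convert the two-level categoricals to 0/1 binary variables
df["gender"] = (df["gender"] == "Male").astype(int)
df["Residence_type"] = (df["Residence_type"] == "Urban").astype(int)
df["ever_married"] = (df["ever_married"] == "Yes").astype(int)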
Fitting the Model(s)
To predict whether or not a patient will have a stroke, I fit an ensemble of 8 classification models using the ‘Super Learner’ ensemble method.
To prepare the data for the model, the continuous variables are standardized and the categorical variables are one hot encoded. I also drop the ID column since it doesn’t help the model in any way.
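A minimal sketch of this preparation step, assuming the combined training data sits in a DataFrame df with the column names used above (the list of continuous columns is also an assumption):
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Separate the target and drop the ID column - it carries no signal
train_target = df["stroke"]
train_features = df.drop(["id", "stroke"], axis=1)

# One hot encode the remaining multi-level categorical variables
train_features = pd.get_dummies(train_features, drop_first=True)

# Standardise the continuous variables
train_features_scaled = train_features.copy()
cont_cols = ["age", "avg_glucose_level", "bmi"]  # assumed continuous columns
train_features_scaled[cont_cols] = StandardScaler().fit_transform(train_features[cont_cols])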
To fit the ensemble, we must first define the list of models that will be used in the ensemble.
# Get a list of models
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def get_models():
    models = list()
    models.append(LogisticRegression())
    models.append(DecisionTreeClassifier())
    models.append(xgb.XGBClassifier())
    models.append(lgb.LGBMClassifier())
    models.append(AdaBoostClassifier())
    models.append(KNeighborsClassifier())
    models.append(RandomForestClassifier())
    models.append(SVC(probability=True))
    return models
Then, using ML-Ensemble, we must add one layer for these models and then a second meta-layer for the predictions.
from mlens.ensemble import SuperLearner

def build_ensemble():
    # seed is a random seed defined earlier in the notebook
    ensemble = SuperLearner(random_state=seed, folds = 25, shuffle = True, verbose = 1)
    # Add models
    models = get_models()
    ensemble.add(models, proba = True)
    # Attach the final meta estimator
    ensemble.add(LogisticRegression(), meta = True, proba = True)
    return ensemble
An important point to note is that this Kaggle competition required the predicted probabilities and not the predicted classes. Since ML-Ensemble returns the predicted classes by default, we need to specify proba = True in BOTH layers (this was not well documented, so I spent way too much time figuring out how to get predicted probabilities here!).
Using this function, we can build the super learner:
# Build the super learner
ensemble = build_ensemble()
And finally, the ensemble is fitted to the training data (here, we must use df.values because the ML-Ensemble package does not accept a dataframe):
# Fit the super learner ensemble
ensemble.fit(train_features_scaled.values, train_target.values)
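To generate the submission, predictions are then made on the (identically preprocessed) test set. A sketch, assuming test_features_scaled is the prepared test data and that, with proba = True on both layers, predict returns one probability column per class:
# Predict on the test set - with proba=True in both layers this should return class probabilities
test_probs = ensemble.predict(test_features_scaled.values)
# Keep the probability of the positive class (stroke) for the submission file
stroke_probs = test_probs[:, 1]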
The final score I got for this submission was 0.89285. I could have improved this with some fine-tuning, but I stopped here for this project.
I hope you found this helpful! You can find the full code on Github.
If you have any questions or comments, drop them below! I’d love to hear from you!
This project is based on Season 3, Episode 1 of the Kaggle Playground Series. The title of this episode is: “Tabular Regression with the California Housing Dataset”. Our task is to predict the median housing value of a block group of housing.
In this project, my goal is to use the Julia programming language for the first time in a machine learning application. I will be using the Julia package Pluto as my notebook environment for the project. I like to think of Pluto notebooks as upgrades of Jupyter notebooks since they have many additional features that Jupyter lacks (such as reactivity).
The target variable, median housing value, is a continuous variable and so regression techniques are needed. To keep things simple as I learn to use Julia, I will be applying linear regression only.
Let’s get started.
California Housing Data Description
The dataset I will be using here was synthetically generated from the original California Housing dataset. This allows us to use real-world data while not compromising privacy in any way.
As previously mentioned, the target variable is MedHousVal – median housing value. The features of the dataset are:
MedInc – median income in block
HouseAge – median house age in block
AveRooms – average number of rooms per house
AveBedrms – average number of bedrooms per house
Population – population of block
AveOccup – average number of household members per house
Latitude – latitude of block
Longitude – longitude of block
Setting up a Julia Environment
Pluto notebooks also serve as their own environments. Each Pluto notebook automatically stores information on the packages used, their dependencies, and their corresponding versions. However, I wanted to store this information about my environment separately so that I can share my code with anyone that uses Julia, without them needing to specifically use Pluto too.
Julia has built-in package and environment management through a package called Pkg. Once you start a Julia session in your terminal you simply type the right square bracket ] and you will enter ‘package mode’ where you can activate an environment and add, remove, or update packages.
If you navigated to the directory of your project in your terminal, you can simply type activate . and the current directory becomes the environment.
To switch off Pluto’s automatic package management, we must activate the project environment from inside Pluto using this code:
begin
    import Pkg
    # Activate the environment at the current working directory
    Pkg.activate(Base.current_project())
    # Instantiate to prevent missing dependencies
    Pkg.instantiate()
    # Specify the packages we will be using in this project
    using StatsKit, MLJ, Gadfly, Random, MLBase
    import Plots
end
So, I will be using the following packages for this project:
StatsKit – collection of essential stats packages
MLJ – machine learning toolkit
Gadfly – data visualization (similar to ggplot2 in R)
Random – random number generator for setting the seed
MLBase – basic machine learning functionality
Plots – imported only so that I can plot a correlation heat map
Data Exploration
During the data exploration phase, I’m looking for any issues that may need to be addressed (like missing values or extreme outliers) and visualizing the data.
A first inspection of the data doesn’t reveal any obvious issues, and I don’t notice anything special that warrants further investigation.
I decided to create an additional variable called ‘BlockDensity’. This uses the total population and the average number of occupants per house to indicate how many houses are located in a block. The hypothesis was that dense blocks could indicate either smaller houses that cost less or tall apartment buildings that cost more.
# Number of houses in a block (ie. block density)
mod_df.BlockDensity = mod_df.Population ./ mod_df.AveOccup
The main visualization that I look at is the correlation heat map to get an idea of how correlated the target is with the features (and the features with each other).
begin
    cols = names(mod_df)
    corr_matrix = cor(Matrix(mod_df))
    # PLOT
    (n, m) = size(corr_matrix)
    Plots.heatmap(corr_matrix, xticks=(1:m, cols), xrot=90, yticks=(1:m, cols), yflip=true, c = :bluesreds)
    Plots.annotate!([(j, i, (round(corr_matrix[i,j], digits=1), 8, "Computer Modern", :white)) for i in 1:n for j in 1:m])
end
This quickly reveals a strong positive correlation between the target (median house value) and median income. This makes sense since people with higher incomes can generally afford more expensive houses.
Since I’ll be using linear regression for this project, I need to deal with outliers as they can skew the results. For this reason, I wrote a function that finds outliers based on a method by Tukey called the outer fences.
These formulas allow us to identify data points that are far out from the first and third quartiles.
# Finding outliers past the outer fences
# (Q1 - (3*IQR); Q3 + (3*IQR))
function outer_fences(df, X)
    print("\n $(X)")
    # Quantiles
    quantiles = quantile!(df[!, X], [0.25, 0.75])
    q1 = quantiles[1]
    q3 = quantiles[2]
    # Interquartile range
    IQR = q3 - q1
    # Fences
    fence_1 = q1 - (3*IQR)
    fence_2 = q3 + (3*IQR)
    print("\n Fence cutoffs: ($(round(fence_1, digits=2)), $(round(fence_2, digits=2)))")
    # Count the number of samples that are outside the fences
    outside_samples = df[!, X][(df[!, X] .< fence_1) .| (df[!, X] .> fence_2), :]
    count_outside_samples = size(outside_samples)[1]
    print("\n Number of outside samples: $(count_outside_samples)")
    return X, fence_1, fence_2
end
I probably spent more time on this outlier problem than I should have, but I was able to identify and remove around 1625 data points from the training data.
Fitting the Model
To fit a linear regression model I used the lm function from the GLM package. The only downside to this function is that you must specify every feature in the formula for the regression equation, which was a bit of a pain.
The R2 value for this model looked promising but the RMSE was a little too high. Printing the R2 value was straightforward:
# R Square value of the model (0.9194)
r2(linearRegressor)
However, getting the RMSE took a bit more effort. First, I define the RMSE function:
# RMSE function definition
function rmse(performance_df)
    rmse = sqrt(mean(performance_df.error .* performance_df.error))
    return rmse
end
Then a dataframe is created with the predicted and error values for the training and test sets which can then be entered into the RMSE formula. The RMSE for the test set is 0.3147 and is 0.3111 for the train set.
Overall, this was a good learning experience using Julia for the first time. Some aspects of the Julia language seemed unnecessarily complicated and required additional function definitions or steps. However, on the other hand, there were many other aspects of Julia that made it a pleasure to work with (such as the extremely simple environment management).
Check out the associated Github repo for the full code.
Voting ensembles combine diverse machine learning models using techniques like majority voting or average predictions. The individual models used in the ensemble could be regression or classification-based algorithms. Once the individual models have been trained, the ensemble can be constructed in a couple of different ways.
In regression, ensembles are created by averaging the (weighted or unweighted) predictions. In classification, ensembles are created through majority voting (also called hard voting) of predicted classes or through (weighted or unweighted) average predicted probabilities (also called soft voting).
Voting ensembles can be easily applied in python using these classes in the Sklearn library:
VotingClassifier
VotingRegressor
These classes can also be combined with grid search and cross-validation to tune the hyperparameters in the models such as choosing between hard and soft voting or choosing the best weights.
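As a quick sketch (synthetic data, arbitrary base models), a voting classifier tuned with grid search might look like this:
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Combine three diverse base models into a single voting ensemble
voting = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
])

# Tune the voting strategy and the model weights with grid search + cross-validation
param_grid = {"voting": ["hard", "soft"], "weights": [None, [2, 1, 1], [1, 1, 2]]}
search = GridSearchCV(voting, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)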
Bagging Ensembles
Bagging ensembles are created from several instances of a single algorithm that are each trained on different subsets of the training data.
There are a few methods that fall under the category of bagging ensembles, and they differ only in the way that the subsets are chosen:
When subsets are formed by sampling the observations with replacement, the method is called bagging (short for bootstrap aggregation).
When subsets are formed by sampling the observations without replacement, the method is called pasting (not short for anything).
When subsets are formed by sampling a set of features only, then the method is called random subspaces.
When subsets are formed by sampling both the features and observations, then the method is called random patches.
The final ensemble is built by aggregating the individual predictions made for each sample of observations. This aggregation is performed either by majority voting or by (unweighted or weighted) averaging.
When bagging is performed on decision trees, the result is a special form of ensembling known as a random forest, which adds further randomness by considering only a subset of the features at each split.
Bagging can be implemented in python with these Sklearn classes:
BaggingClassifier
BaggingRegressor
As well as RandomForestClassifier and RandomForestRegressor
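To make the four sampling variants concrete, here is a rough sketch of how they map onto BaggingClassifier parameters (the sampling fractions are arbitrary):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

# Bagging: sample observations with replacement
bagging = BaggingClassifier(base, n_estimators=100, bootstrap=True)
# Pasting: sample observations without replacement
pasting = BaggingClassifier(base, n_estimators=100, bootstrap=False, max_samples=0.8)
# Random subspaces: sample only the features
subspaces = BaggingClassifier(base, n_estimators=100, bootstrap=False, max_features=0.6)
# Random patches: sample both observations and features
patches = BaggingClassifier(base, n_estimators=100, bootstrap=True, max_samples=0.8, max_features=0.6)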
Boosting Ensembles
Boosting ensembles are formed by incrementally combining a set of weak learners into a single strong learner. The term ‘weak learner’ here represents a model that may only be slightly better than taking a random guess at the prediction. In the case of regression, this guess is made by simply using the mean value of the target variable.
In contrast to bagging, each weak learner is trained on the entire dataset, and no sampling takes place. However, in boosting, each observation in the training data is assigned a weight based on the ability of the algorithm to predict it accurately.
Since ensembles are constructed sequentially, boosting can be very slow and computationally expensive.
There are many implementations of boosting ensembles. Some have become very popular and are the default methods for applying boosting. These are three of the most common boosting implementations used today, as well as their associated Python libraries:
AdaBoost (AdaBoostClassifier and AdaBoostRegressor from Sklearn)
(Standard) Gradient boosting (GradientBoostingClassifier and GradientBoostingRegressor from Sklearn)
Histogram-based gradient boosting (HistGradientBoostingClassifier and HistGradientBoostingRegressor from Sklearn)
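As a small illustration (synthetic data), histogram-based gradient boosting can be fitted like this:
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# Each boosting iteration adds a new tree that corrects the errors of the current ensemble
hgb = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1)
print(cross_val_score(hgb, X, y, cv=5).mean())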
Stacking Ensembles
Stacking is short for stacked generalization, and it involves a different approach for aggregating the predictions made by the individual models in the ensemble. Instead of averaging or finding the majority vote of a prediction, stacking uses a machine learning model to select the best way to combine the predictions.
Stacking can be applied in Python using these Sklearn classes:
StackingClassifier
StackingRegressor
A special application of stacking is Super Learner Ensembling. This method applies k-fold cross-validation and builds an ensemble from the out-of-fold predictions from each model.
Super learner ensembles can be implemented in python using Sklearn. Alternatively, there is a library called ML-Ensemble (also called mlens) that builds on Sklearn to simplify the process.
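A sketch of the plain Sklearn route (synthetic data, arbitrary base models); the cv argument is what generates the out-of-fold predictions that the meta-learner is trained on:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=42)

# Base learners feed out-of-fold predictions into a logistic regression meta-learner
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)), ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.score(X, y))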
Implicit & Explicit Ensembles
Implicit and explicit ensembles are actually single models that are trained in such a way that they behave like an ensemble of multiple neural networks. This allows ensembles to be built without the additional cost involved in training multiple models.
Implicit ensembles share their model parameters, and the averaging of the ensemble models is done at test time with a single unthinned network. Some implicit ensemble models are:
DropConnect
Dropout
Stochastic depth
Swapout (hybrid of dropout and stochastic depth)
Dropout as a Bayesian approximation
Explicit ensembles do not share their model parameters and would produce an ensemble model through standard approaches like majority voting or averaging. Some explicit ensemble models are:
Snapshot ensembling
Random vector functional link networks
Consensus Clustering
Clustering is an unsupervised machine learning technique that is used to group similar observations together. Ensemble techniques are used in clustering to combine several weak clusterings into a strong one. However, the weak clusterings must be as different from each other as possible in order for the ensemble to be a significant improvement.
There are a few approaches that can be used to create clusters that differ:
Using different sampling data
Using different subsets of features
Using different clustering methods
Adding random noise to the models
This technique gains a consensus on the assigned cluster of an observation based on a combination of all other iterations of the clustering algorithm in the ensemble. Clustering ensembles are made up of (1) the generation mechanism and (2) the consensus function. The generation mechanism is simply the set of iterations of the clustering algorithm that will be combined to form the ensemble. The consensus function defines the way that the ensemble will reach a consensus on the clusters. The biggest challenge in clustering ensembles is to produce a combined ensemble (through the consensus function) that improves the result over any single clustering algorithm.
There are 2 primary consensus function approaches:
Objects co-occurrence: this uses a voting process to assign clusters
Median partition: this finds clusters through an optimization problem that finds the median partition with respect to the cluster ensemble. The median partition can be thought of as the partition that best summarises the results of all models in the ensemble.
Consensus clustering can be implemented in python using the Cluster Ensembles library. However, some of the more niche implementations of consensus clustering may need to be done from scratch in python as there doesn’t seem to be a lot of support out there for them yet.
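As an example of the from-scratch route, here is a small sketch of an objects co-occurrence consensus (synthetic data, arbitrary parameters): several diverse k-means runs are combined into a co-association matrix, which is then clustered to reach the consensus.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# Generation mechanism: k-means runs on different random subsets of the features
n_runs = 10
co_assoc = np.zeros((len(X), len(X)))
for _ in range(n_runs):
    cols = rng.choice(X.shape[1], size=3, replace=False)
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X[:, cols])
    co_assoc += labels[:, None] == labels[None, :]
co_assoc /= n_runs

# Consensus function (objects co-occurrence): cluster the co-association matrix
distance = 1 - co_assoc
np.fill_diagonal(distance, 0)
consensus = fcluster(linkage(squareform(distance), method="average"), t=3, criterion="maxclust")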
I hope this blog post gave you a good high-level overview of some of the common ensemble methods available today. Look out for future blog posts where I dive into some of these methods with detailed descriptions and practical applications in python.
If you have any questions, feel free to reach out on Twitter or leave a comment below!
In this post, we’ll be learning about decision trees, how they work, and the benefits of using them. We’ll also use this algorithm on a real-world dataset to predict diabetes.
So, what are decision trees? Decision trees are a machine learning method for classification or regression. They work by segmenting the dataset through if-else control statements applied to the features.
There are a few algorithms that can be used to implement decision trees, and you may have heard of some of them. The most popular algorithms are ID3, C4.5, and CART. However, the Scikit-learn Python library only supports the CART algorithm, which stands for Classification and Regression Trees. This article is entirely based on the CART algorithm.
Benefits of Decision Trees
Decision trees are known as ‘white box’ models which means that you can easily find and interpret their decisions. This is in contrast to ‘black box’ neural networks where it is extremely difficult to figure out exactly how final predictions were made. Fortunately, decision tree models are easy to explain in simple terms, along with why and how the predictions were made by the model. Since decision trees are just if-else control statements at heart, you can even apply their rules and make predictions by hand.
Decision trees can be easily visualised in a tree-like plot that makes it even easier to understand and interpret the model. Have a look at this simplified decision tree below based on the data we’ll be analysing later on in this article. We can actually take a single data point and trace the path it would take to reach the final prediction for it.
Decision Tree Simple Model Visualisation
The Scikit-learn Python library together with the CART algorithm supports binary, categorical, and numerical target variables. However, for the feature variables, only binary and numerical features are supported at this time. This means that each node in the decision tree can only have up to 2 branches leading out of the node, and so each feature split must evaluate to either true or false.
The good news is that decision trees require very little data preparation and so you don’t need to worry about centering or normalising the numerical features first. Having said that, there are still a couple of best practices to follow when fitting a decision tree to your data and we’ll chat about them a bit more towards the end of this article.
It’s also good to keep in mind that decision trees are pretty sensitive to even small changes in the data and tend to learn the data like a parrot. This means that it is easy for the model to overfit to the data and can even be biased if the target variable classes are unbalanced. For this reason, decision trees must be closely controlled and optimised to prevent these problems (also more on this later).
How Does it Work?
Without getting all technical, let’s go over how the decision tree CART algorithm works. The main goal is to divide the data into distinct regions and to then make predictions based on those regions.
Starting at the top of the tree, the first question the algorithm must answer is “which feature should be chosen at the root?” To answer this question, the algorithm needs a way of evaluating each feature and choosing the ‘best’ feature to start with. Thereafter, the algorithm needs to keep asking a similar question at each node: “which feature should be used to split this node?” It does this by calculating and optimising a metric against each of the available features.
There are a couple of metrics that can be used depending on the problem at hand. For example, if we’re dealing with a regression problem then we can seek to find the feature with the lowest RSS (residual sum of squares). However, if we have a classification problem then we can choose the feature with the lowest entropy (or highest information gain) or the lowest gini impurity.
If you’re wondering whether to choose entropy or gini impurity for classification problems, don’t waste too much time on it as there isn’t a hell of a lot of difference between the resulting decision trees. So toss a (balanced) coin and pick one.
Sklearn uses the gini impurity by default so we’ll briefly go over this metric. The gini impurity is a metric that measures the purity of a node. That is, it measures how similar the observations are to each other. If all observations belong to the same class then the gini impurity would be 0 and the node would be considered ‘pure’.
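As a quick sketch (not Sklearn’s internal code), the gini impurity of a node can be computed directly from its class counts:
import numpy as np

def gini_impurity(class_counts):
    # 1 minus the sum of squared class proportions; 0 means a perfectly pure node
    p = np.array(class_counts) / np.sum(class_counts)
    return 1 - np.sum(p ** 2)

print(gini_impurity([50, 0]))   # 0.0 - pure node
print(gini_impurity([25, 25]))  # 0.5 - maximally mixed for two classes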
The decision tree algorithm is in constant pursuit of pure nodes and it will continue to split the data into deeper and deeper trees until it finally reaches pure leaf nodes (or it runs out of data or features). In addition, the data is split in a greedy fashion using recursive binary splitting. It’s called greedy because the algorithm will make a split based on what is optimal for the step it is currently dealing with and will not choose a split that can result in a more optimal tree further down the line. This also means there are no backsies – the algorithm won’t backtrack and change its mind for a previous split.
We mentioned ‘leaf’ nodes. These are nodes that do not get any further splits and any observation that takes a route to a leaf node will get the resulting predicted class. At each leaf node, the class that has more than 50% of its samples belonging to it will serve as the prediction for that node. For classes with a tie, the non-event class is automatically chosen.
Predicting Diabetes with Decision Trees in Python
The data in this project contains biographical and medical information that is used to predict whether or not a patient has diabetes. You can find the data on Kaggle.
These are the goals for this project:
Explore the data – determine if it requires any cleaning and if there are any correlations in the data
Apply the decision tree classification algorithm (using sklearn)
Visualise the decision tree
Evaluate the accuracy of the model
Optimise the model to improve accuracy
Setup
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import graphviz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette(sns.dark_palette("seagreen", reverse=True))
Explore
These are the feature variables in our data:
Number of pregnancies
Glucose
Blood pressure
Skin thickness
Insulin
BMI
Diabetes Pedigree Function
Age
The most confusing feature here is the diabetes pedigree function. To understand it, I read the recommended research paper for this dataset and made these notes:
The criterion for a diabetes diagnosis is that the ‘2 hour post-load plasma glucose was at least 200 mg/dl’.
This dataset specifically contains women over the age of 21.
The Diabetes Pedigree Function provides ‘a synthesis of the diabetes mellitus history in relatives and the genetic relationship of these relatives to the subject’. In other words, this score is higher if there is a family history of diabetes and it’s lower if not.
To start our data exploration, let’s have a look at some summary statistics of our data:
These are a few points we can note about the data:
All our features are numerical
We have a total sample size of 768
There are no missing values to deal with right now
A few features have a minimum value of 0 which is suspicious for (living) humans:
Min glucose = 0
Min blood pressure = 0
Min skin thickness = 0
Min insulin = 0
Min BMI = 0
To clean this up, we will convert these zeros to nulls and remove them from our dataset. This takes our dataset down from 768 to 392 (that was painful!).
Next, let’s have a look at a correlation matrix between all the variables in our data.
The outcome is positively correlated with all features which is a good sign for modelling
The outcome is most strongly correlated with Glucose (which makes sense since this is about Diabetes) and then Age
There is a strong correlation between Age and Pregnancies – older women = more pregnancies?
Insulin and Glucose are correlated – higher insulin = higher glucose?
SkinThickness and BMI are correlated – higher BMI = higher skin thickness?
An additional note to make is that decision trees tend to be sensitive to unbalanced classes. So, we’re also going to take note of how many observations fall into each outcome class from the original dataset:
Not Diabetes    Diabetes
500             268
From this table, we can see that the outcome classes are unbalanced – there are twice as many non-events as there are events. This could make it difficult for the model to correctly predict when someone has diabetes. We may need to balance out the classes during the optimisation phase to see if it improves the accuracy of the model.
Model
This is the modelling process we’ll follow to fit a decision tree model to the data:
Separate the features and target into 2 separate dataframes
Split the data into training and testing sets (80/20) – using train_test_split from sklearn
Apply the decision tree classifier – using DecisionTreeClassifier from sklearn
Predict the target for the test set
Evaluate model accuracy – using metrics from sklearn
Visualise the decision trees – using graphviz
Optimise the decision tree – using various parameters such as max_depth, min_samples_split, etc.
# Separate features and target
target = df_reduced["Outcome"]
features = df_reduced.drop("Outcome", axis = 1)
# Split into train and test sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.2, random_state = 42)
# Fit the decision tree model
decisiontree = tree.DecisionTreeClassifier(random_state = 42)
decisiontree = decisiontree.fit(features_train, target_train)
# Predict the target for the test set
target_pred = decisiontree.predict(features_test)
This fit gives an accuracy score of 72.15%. Not too bad, but I’m sure we can improve it.
Visualising this tree, we can see it is a bit of a mess. Some nodes have just 1 sample in them. Since we don’t have much data, we will need to control the minimum number of samples in each node as well as the max depth, because an unconstrained tree like this can overfit and perform poorly on unseen data.
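For reference, the fitted tree can be exported and rendered roughly like this (a sketch; the class_names ordering assumes 0 = Not Diabetes and 1 = Diabetes):
from sklearn.tree import export_graphviz
import graphviz

# Export the fitted tree to DOT format and render it as an image
dot_data = export_graphviz(
    decisiontree,
    out_file=None,
    feature_names=features.columns,
    class_names=["Not Diabetes", "Diabetes"],
    filled=True,
    rounded=True,
)
graphviz.Source(dot_data).render("diabetes_tree", format="png")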
Optimise
We’ll try a couple of strategies to optimise this model:
max_depth = 3, min_samples_leaf = 2 — this produced an accuracy of 74.68%
max_depth = 4, min_samples_leaf = 2 — this produced an accuracy of 73.42%
max_depth = 5, min_samples_leaf = 2 — this produced an accuracy of 75.95%
Increasing the depth any further results in declining performance. A max depth of 5 gave the highest accuracy.
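A small sketch of how this comparison could be run, reusing the train/test split from earlier (exact scores depend on the split):
# Compare a few tree depths with min_samples_leaf fixed at 2
for depth in [3, 4, 5, 6]:
    model = tree.DecisionTreeClassifier(max_depth=depth, min_samples_leaf=2, random_state=42)
    model.fit(features_train, target_train)
    accuracy = metrics.accuracy_score(target_test, model.predict(features_test))
    print(f"max_depth = {depth}: accuracy = {accuracy:.2%}")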
Next, we will balance the dataset by randomly selecting only 130 rows from the 262 where the outcome = 0 (ie. class = Not Diabetic). Together with max_depth = 5 and min_samples_leaf = 2, balancing the dataset produces an accuracy of 78.85%.
Looks like we’ve found a winner 🙂
Here is a visualisation of this optimised tree:
I hope you found this post helpful. If you have any questions don’t hesitate to drop a comment or reach out to me on Twitter!
In this first project of 2022, I will be fitting linear and polynomial regression models to a randomly generated dataset with a nonlinear relationship between the target and the predictor.
The purpose of this project is to observe some of the red flags for a model that is severely underfitting to the data and how these red flags change when fitting a more appropriate model.
The red flags that I’ll be considering are:
MSE and R-squared – these are common performance metrics used in linear models.
Residual plot – this plot will show us if some of the assumptions of linear regression have been violated.
Learning curves – this plot will show us how well the model fits to the data and usually gives a good indication of over/under fitting.
Generating the data
I generate the data using a polynomial equation of degree 2:
# Generate some nonlinear data
import numpy as np

m = 500
X = 9 * np.random.rand(m, 1) - 1
Y = 0.5 * X**2 - 3 * X + 7 + np.random.randn(m, 1)
In this visualisation of the data, we can clearly see the nonlinear relationship that we have generated from the above equation:
Linear Regression
Being one of the oldest and simplest models, linear regression is pretty well known and easy to understand. In this project, I am using linear regression to demonstrate what underfitting looks like and as a comparison to polynomial regression.
To fit linear regression, the response variable must be continuous. There are also a couple of assumptions that must be satisfied in order to use linear regression with some degree of accuracy, otherwise the results can be misleading or downright incorrect.
Linearity: the relationship between the response variable and the predictors must be linear.
Homoscedasticity: the variance of the residuals must be constant. So, as the value of the predictor increases, the variability in the response stays the same.
Independence: the residuals must be independent of one another. This means that each observation must not be determined by or correlated with another. Usually only an issue in time series data.
Normality: the residuals are normally distributed, with a mean of zero.
These assumptions are often overlooked when linear regression is applied, but they are very important. In this project, the assumptions of linearity and normality are both purposely violated to support the decision to use polynomial regression.
Below is the linear regression model fitted to the generated dataset:
This model gives a Mean Square Error (MSE) score of 10.0 and an R-Squared (R2) score of 0.27. Both these scores give our first red flags. Ideally, MSE should be as low as possible (around 1 is very good) and R-squared should be as close to 1 as possible. So far, our model is not doing a good job of describing the data.
Let’s plot the residuals versus the fitted values – this gives a good indication of whether the assumptions of linearity and normality are violated. Ideally, there should be a random scatter of points around 0 with no discernible pattern or shape in the data. The below plot shows a definite nonlinear pattern in the data:
Next are the learning curves. To generate this plot, multiple models are fitted to varying dataset sizes and the Root Mean Square Error (RMSE) is recorded. In this plot, we get to compare the training performance against predictions made on a validation set.
As before, we want the RMSE to be as low as possible. But now, we also want the training and validation errors to be as close as possible as the training set size increases. If the validation error is higher than the training error then we know that the model did not generalise well to new data and this is a red flag for over/under fitting.
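Here is a sketch of one way such learning curves can be generated (not necessarily the exact code used for the plots in this post):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, Y):
    # Fit the model on progressively larger training sets and track train/validation RMSE
    X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=42)
    train_errors, val_errors = [], []
    for m in range(2, len(X_train)):
        model.fit(X_train[:m], Y_train[:m])
        train_errors.append(mean_squared_error(Y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(Y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), label="Training RMSE")
    plt.plot(np.sqrt(val_errors), label="Validation RMSE")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

plot_learning_curves(LinearRegression(), X, Y)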
In these learning curves for linear regression, not only is the RMSE high, but the validation error remains higher than the training error throughout the process.
Great, so now that we know that the above model is a total mess and does a really bad job at making predictions, let’s make a better choice.
Polynomial Regression
When the assumptions of linear regression are not satisfied and the model isn’t capturing the relationship between the response and the predictors then we need a different approach.
Polynomial regression lets us model a non-linear relationship between the response and the predictors. It is a natural extension of linear regression and works by including polynomial forms of the predictors at the degree of our choosing.
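A sketch of how a degree-2 polynomial regression could be fitted with Scikit-learn (the exact code for this project may differ slightly):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expand X with a squared term, then fit ordinary least squares on the expanded features
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_reg.fit(X, Y)

Y_pred = poly_reg.predict(X)
print("MSE:", mean_squared_error(Y, Y_pred))
print("R2:", r2_score(Y, Y_pred))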
Here, we get a wonderful MSE score of 1.0 and an even better R-squared score of 0.9. Plotting the new polynomial regression line to our data we see that it fits like a glove:
Plotting residuals versus fitted values again we now see that there is a nice random scatter of points around 0 and I can’t make out any patterns, so that’s good!
Lastly, the learning curves. Both the validation and training errors hover at around 1.00 and although the lines do not meet in this plot, if I added more data, they would eventually converge.
I hope this project gave you some insight into the red flags commonly seen for linear models that underfit. Reach out to me on Twitter if you have any feedback or questions.
You can find the full code for this project on GitHub.
You don’t need a fancy PC to get started with data science and machine learning. In fact, you can run all your code in cloud-based notebooks without even worrying about setting up an environment locally.
Even if you’re new to data science, you’ve probably heard of Jupyter Notebooks. They have become the number 1 way to do data science. Jupyter Notebooks make it so much easier to run your code, write commentary, and see your output all in one place. Almost all cloud platforms use some kind of Jupyter-like environment.
In this blog post I am going to share 6 ways you can do data science in the cloud. Each of these platforms allows you to do this completely for free, and they each work really well.
In my opinion, the 2 biggest upsides of using cloud platforms for data science are:
Speed of set up – You can get set up in just a few minutes and have almost everything you need to do machine learning available to you. You don’t need to go through the hassle of setting up an environment locally before you start writing code and analysing data.
Collaboration – Being able to share your work and collaborate on projects is a big upside of any kind of cloud platform. However, collaboration is not available for all the platforms listed here. Even when it is offered, the degree of collaboration differs from platform to platform.
These 6 platforms will be covered in this post:
Datacamp Workspaces
Kaggle Notebooks
Google Colab
Deepnote
Datalore by JetBrains
Gradient Notebooks
There are a few things you should note about how I have judged these platforms. I have actually tried each of these platforms myself and am giving my own opinions on what I think of them. My primary use of cloud platforms is to work on personal projects and not for company or enterprise use.
These are the criteria that I am using to compare these platforms:
Price – they should be free or at least offer a decent free plan (not just a trial)
Speed of set-up and low admin – I shouldn’t have to ‘babysit’ my projects in case I go over ‘allowed hours’. I’d like to be able to log on, work on a project and log off without worrying about whether I shut the server down.
Aesthetic and intuitive – the app should look good and it should be intuitive and easy to use
Collaboration – I’d like to be able to share my work with friends and be able to collaborate on them live. I’d also like to have the option to share securely if I want to, without my project being available publicly.
Great, now that I’ve cleared all that up, let’s get into it.
DataCamp Workspaces
DataCamp recently launched their Workspaces feature that allows you to run code in a Jupyter-like environment. There is currently no collaboration feature but it is something they are working on. Right now, you can just share your notebooks with other people and they are able to view them and add comments.
DataCamp’s philosophy with the initial launch of Workspaces is that they want it to be as easy to do data science as it is to learn it. They have already created an incredible interactive environment for learning data science and Workspaces seem like a natural next step for anyone who now wants to easily apply their skills and start putting together a portfolio.
The Workspaces are completely free and you can choose between R or Python. So far, their notebook editor looks good and it is intuitive to use. There is also no admin involved in running or maintaining the workspace which makes it hassle-free.
Kaggle Notebooks
To me, Kaggle is like the OG of cloud-based data science platforms, having started way back in 2010. Initially, they started out as a machine learning competition platform. Since then they have expanded and now offer ways of sharing datasets, notebooks and have a huge community in their forums where you can ask questions and get help.
Kaggle notebooks are free to use, with the option to choose between R and Python and they integrate well with other services. You can even connect your Kaggle notebook to Google Cloud Services to beef up the hardware if you need it, although this will come at an additional cost, of course. Kaggle does provide access to a GPU and for personal projects, it is usually more than enough.
Collaboration on Kaggle notebooks is limited. Similar to DataCamp, you can share your notebooks with other people (such as your teammates in a competition) but you effectively work on different versions of the notebook so there is no live collaboration feature.
Google Colab
Google Colab is another one of Google’s products that operates as a natural extension of Google Drive, just like Google Docs or Sheets. If you already use these other products then the Colab UI will feel very familiar.
Naturally, sharing on Colab notebooks is built-in. However, it does not seem to be capable of live collaboration (ie. with 2 or more people editing a notebook together in real-time). I find this to be disappointing since Google basically wrote the book on real-time collaboration with Docs and Sheets.
Colab is also not a particularly pretty app, especially when compared with some of the other platforms I’ll be covering on this list. However, since almost everyone interested in data science most likely has at least 1 Google account, setup is by far the fastest.
It is free to use Colab but the resource is not guaranteed and there are several usage limits that change depending on demand. Your usage limits could even be different to mine if your code uses more resources.
Deepnote
Deepnote is by far the best-looking, most full-featured platform I have come across. The UI looks great, the editing experience with their notebook is amazing, and they have live collaboration! All of this and it is still free to use (up to a point), plus the platform is still in beta so you know there is much more to come.
I particularly like their publishing feature – you can publish your notebook as an article or as an interactive app or dashboard. I just love the presentation of the articles and dashboards. Your profile on Deepnote also acts as a portfolio and it is a great viewing experience for anyone looking through your work.
You can get up to 750 hours on their standard machines and each notebook comes with a nifty little feature that automatically shuts the machine down after 15 minutes of inactivity. This keeps the admin pretty low on this platform.
Datalore
Datalore is a platform built by the JetBrains team. The UI looks good and it also comes with live collaboration. There are also a few more options in terms of programming languages with this platform – you can choose between Python, R, Kotlin and Scala.
There is a bit of admin involved with this platform. On the free plan, you get 120 hours/month of basic machine time. However, once you open a notebook, so long as it is an open tab in your browser it will consume resources and eat into your available hours. Because of this, it is important to either close the tab or manually shut down the machine.
Also, if you’ve shared one of your notebooks with someone and they forget to close their browser window down at the end of the day, it’ll eat into your available quota. So that’s something to keep in mind.
Gradient Notebooks
Gradient Notebooks are built by Paperspace. The biggest selling point of Gradient is that they offer free GPUs. For the free plan, instead of restricting the number of hours on the platform, the only limits they impose are that you can only ever have 1 notebook running at a time and that all notebooks must be public.
Getting set up on the platform and launching a notebook can take a good couple of minutes since all that additional hardware that they offer needs to be provisioned. There is also limited functionality for publishing notebooks and building a portfolio of projects using their platform. They are currently working on a public profile feature, so at least it’s in the roadmap.
Gradient is the platform to go to if you need more resources and more compute for your project and you don’t want to pay a cloud provider like Google Cloud or AWS to get it.
Closing Thoughts
Choosing which cloud platform to use depends on your own goals and needs.
If you’re looking for a place to build a portfolio and get involved in a large community while you’re still learning and improving your data science skills then I’d recommend Kaggle or DataCamp.
If you’re looking for a platform that’s got all the bells and whistles, allows you to build a professional-looking portfolio, and offers real-time collaboration then I’d recommend Deepnote.
If you’ve started branching out into deep learning, NLP, and computer vision then I’d recommend giving Gradient a try.
See this resource for a comprehensive list of almost all available data science notebook platforms.
I hope you found this blog post helpful! Let me know in the comments below what other platforms you’d recommend. I’d love to hear your thoughts!
Stochastic gradient descent (SGD) is an optimisation technique, not a machine learning model. It is a method that allows us to efficiently train a machine learning model on large amounts of data. The word ‘descent’ gives the purpose of SGD away – to minimise a cost (or loss) function.
For a better understanding of the underlying principle of GD, let’s consider an example.
A Simplified Example of Gradient Descent
Gradient descent seeks to find the global minimum of a function. To do that, it calculates the gradient at an initial random point and moves to another point in the direction of descending gradient until it reaches a point where the gradient is zero.
As an example, see figure of a parabola below.
The initial random point that starts the algorithm off is at (x = -2, y = 4). The gradient at this point is -3, so the algorithm steps to the right (in the direction opposite to the gradient, i.e. downhill) and calculates the gradient at the next step. Eventually, after a number of steps, the algorithm reaches a point where the gradient is 0 and stops. This is referred to as the global optimum or global minimum.
The learning rate hyperparameter determines the size of each step – too small and the model takes too long, too big and the model may not converge to the global minimum.
Some datasets have an irregular shape and the cost function could contain both local minima and global minima. It is good to keep this in mind when training a model, adjusting the learning strategy according to the current problem.
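To make the mechanics concrete, here is a toy sketch of gradient descent on the simple parabola f(x) = x² (not necessarily the exact curve in the figure above):
# Gradient descent on f(x) = x**2, whose gradient is 2x
x = -2.0
learning_rate = 0.1
for step in range(100):
    gradient = 2 * x
    x -= learning_rate * gradient  # step in the direction opposite to the gradient
print(x)  # converges towards the global minimum at x = 0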
Batch vs Stochastic Gradient Descent
There are 3 types of gradient descent implementations: batch, mini-batch, and stochastic. Which one you choose depends on the amount of data you have and the type of model you are fitting.
Batch Gradient Descent: the entire training dataset is used at every step. Thus this algorithm is very slow for large datasets but scales well for large numbers of features. There is also a possibility that this algorithm may get stuck in local minima.
Stochastic Gradient Descent: a single, random observation in the training data is selected at each step. This algorithm is very fast, only needing to perform calculations on a single point at a time. However, it is erratic and may select points from all over the place, never settling at a truly accurate solution. Instead, it approaches the minimum on average. That being said, this algorithm is much more likely to escape local minima and find the global minimum. Some of the erratic nature of the algorithm can be tamed by using a learning schedule that slowly reduces the learning rate so that it can settle on a more accurate solution.
Mini-Batch Gradient Descent: the algorithm uses small subsets (or batches) of the training data at each step. This algorithm is faster than Batch GD but still suffers from the same drawback of potentially getting stuck in local minima.
How to Implement Stochastic Gradient Descent in Python
This project is focused around applying SGD to a classification problem. Therefore, the steps and performance measures chosen here are best suited to modelling a binary response variable.
Each aspect of the project is broken down below.
Preprocessing the data for classification
These are important steps for data preparation/preprocessing:
Clean up any funnies in the data – for example, in this project the values -1 or 9 are used to represent null or unknown values, and so these should be cleaned up and replaced with nulls
Split the data into training and test sets (typically in an 80/20 ratio). To do this, the train_test_split function is used from sklearn.model_selection (making sure that shuffle = True as this is a requirement of SGD)
Either remove or impute missing values
Feature engineering – after some initial data exploration, new features can be created or existing features can be modified
Standardise continuous variables using the StandardScaler function from sklearn.preprocessing
Dummy code categorical variables using the OneHotEncoder function from sklearn.preprocessing, making sure to use the parameter drop = 'first' so that if there are 4 categories in a variable, for example, only 3 categories are retained and the 4th category is inferred by all the others being 0.
Selecting the model (and loss function)
This project is a classification problem and so the SGDClassifier is used from sklearn.linear_model. Since SGD is just an optimisation algorithm, we must still pick the machine learning model that will be fitted to the data.
These are the available options:
Support Vector Machine (loss = "hinge")
Smoothed hinge loss (loss = "modified_huber")
Logistic regression (loss = "log") – this loss function is chosen for this project
Measuring performance
There are a couple of different performance measures available to evaluate a classifier. Although it’s not always necessary to use so many performance measures, since this project is for learning purposes, I will be using all of them.
Each of the performance measures below is calculated after performing cross-validation. This method involves splitting the training set into K folds, making predictions on each fold, and then measuring its performance.
Cross validation accuracy (cross_val_score) – this is a ratio of correct predictions and is often not a good metric when our data is skewed (ie. some classes appear more often than others)
Confusion matrix (confusion_matrix) – calculated from the predictions made during cross validation, this is a table containing 2 rows for actuals and 2 columns for predictions. This matrix is used to calculate a few ratios
Precision (precision_score) – for all cases that the model predicts to be positive, the precision tells us what percentage of them are correct
Recall (recall_score) – for all cases that are actually positive, recall tells us the percentage that are correctly predicted
Precision and recall work as opposites. The preferred level of these 2 metrics depends on the project. Two plots are used to depict the trade off between them (predictions are made using the cross_val_predict function with the parameter method = "decision_function" and then precision, recall and thresholds are calculated with the precision_recall_curve function):
Precision/recall vs thresholds plot: in this plot we can see the trade-off relationship between precision and recall
Precision vs recall plot: we can use this plot to select a precision and associated recall value that occurs just before the curve dips sharply
ROC curve (roc_curve) – a 45 degree line is plotted alongside this curve to compare the accuracy of our model against that of a purely random classifier. The goal is for our model’s curve to pull as far away from this line as possible (towards the top-left corner).
AUC score (roc_auc_score) – this score summarises the ROC curve. A score of 1 indicates a perfect model whereas a score of 0.5 represents a model indistinguishable from chance.
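Pulled together, a rough sketch of computing these measures (X_train and y_train are placeholder names for the prepared training features and binary target):
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# 'log' gives logistic regression (newer sklearn versions call this loss 'log_loss')
sgd_clf = SGDClassifier(loss="log", random_state=42)

# Cross validation accuracy
print(cross_val_score(sgd_clf, X_train, y_train, cv=10).mean())

# Out-of-fold class predictions for the confusion matrix, precision and recall
y_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=10)
print(confusion_matrix(y_train, y_pred))
print(precision_score(y_train, y_pred), recall_score(y_train, y_pred))

# Decision scores for the ROC curve and AUC
scores = cross_val_predict(sgd_clf, X_train, y_train, cv=10, method="decision_function")
print(roc_auc_score(y_train, scores))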
Analysing Road Safety Data
Project Overview
The UK Department for Transport has released data on reported road accidents to the public going back to 1979. The information provided only relates to personal injury accidents that occur on public roads and that are reported to the police. This project uses the 2020 accident dataset.
The goal of this project is to:
Explore the road safety accident dataset, selecting features and adjusting them as needed for modelling
Apply the Stochastic Gradient Descent optimisation technique with a log loss function
Specifically, this will be treated as a classification problem with a binary target (response) variable called accident severity. In this project, we will be trying to determine if there are any patterns in the data that can help to predict whether an accident will be severe or not.
You can see all the code used in this project on Github. This blog post aims to summarise the process and findings.
Python libraries used in this project:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
Data Description
The road safety accident data contains over 30 different features. After going through these columns I selected just 12 (including an index column) to be included in this project.
To start off with, the features have different types that must each be dealt with – from dates, times, and continuous values to various numbers of categories.
Some features also have values that are not valid. For example, the speed limit feature contains a -1 value when data is missing and a 99 value when data is unknown. Both these values are replaced with null and subsequently removed from the dataset (I will not be dealing with imputation in this project).
Accident severity is the target variable for this project. Currently it has 3 categories: fatal, serious and slight. Since the fatal category is so rare and also because I’d like to treat this as a binary classification problem, I have combined the fatal and serious classes.
Even with the combination of 2 categories, this is quite a skewed variable with only 21% of cases appearing in the positive class so some performance issues could be expected.
Next, the date and time variables are converted to their appropriate types and I decided to convert them into categorical variables. For the dates, they are converted into quarters so that they fall into 4 categories: Q1 – Q4. For the times, they are grouped into 3 equal-sized categories: morning (4am – 11am), afternoon (12pm – 7pm) and evening (8pm – midnight and 1am – 3am).
The variable for the day of the week is left unchanged and will be converted into a dummy variable later on for modelling.
The next few variables are each treated similarly. They have multiple categories, some of which only have a handful of cases in them. Because of this, I convert them each into binary variables with 0 representing what I thought was an ‘ideal’ scenario such as dry roads, daylight, urban area and low speed limit roads.
The last 2 charts are for the continuous features in the dataset. Neither of them is altered in any way. First is the number of vehicles involved in the accident, with the majority of accidents involving only 1 or 2 vehicles. The second chart is a geographical scatter plot made up of the longitude and latitude values of the accidents. Neither of these charts reveals anything worth looking further into.
Data Preprocessing
Other than combining the categories of some of the features and constructing some new features, we must also ensure that any categorical features with more than 2 classes are dummy coded and any continuous variables are standardised.
# Onehotencoder for the day of week, time & month category features
onehot_df = train_features[["day_of_week", "time_category", "month_cat"]]
onehot = OneHotEncoder(drop = 'first', sparse = False)
onehot_df = pd.DataFrame(onehot.fit_transform(onehot_df), columns = onehot.get_feature_names(["day_of_week", "time_category", "month_cat"]))
# Add onehotencoder features back to train set
train_features.reset_index(drop=True, inplace=True)
train_features = pd.concat([train_features, onehot_df], axis = 1)
# Remove redundant columns
train_features = train_features.drop(["date", "time", "day_of_week", "time_category", "month_cat"], axis = 1)
# Standardise the continuous features
scaler = StandardScaler()
train_features_scaled = train_features.copy()
features_to_scale = train_features_scaled[["longitude", "latitude", "number_of_vehicles"]]
features_to_scale = scaler.fit_transform(features_to_scale.values)
train_features_scaled[["longitude", "latitude", "number_of_vehicles"]] = features_to_scale
Training & Evaluating the Model
To train the model, an SGD classifier is used with a log loss function (i.e. logistic regression fitted with stochastic gradient descent).
sgd_clf = SGDClassifier(random_state = 42, loss = 'log')
sgd_clf.fit(train_features_scaled, train_target)
Evaluating the model starts with calculating cross validation accuracy scores.
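The scoring code isn't shown here; a minimal sketch using the cross_val_score() helper imported earlier (the 10-fold choice mirrors the later cross_val_predict() call and is an assumption):
# 10-fold cross-validated accuracy
cross_val_score(sgd_clf, train_features_scaled, train_target, cv = 10, scoring = "accuracy")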
Not too bad. However, remember that this score is not very reliable when the data is skewed, i.e. when there are more cases in some classes than in others. This is true here: only 21% of the cases fall in the positive class (fatal / serious accident severity) compared to the negative class (slight accident severity).
Next, we put together a confusion matrix to determine how well our classifier predicts the target.
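A minimal sketch of how the confusion matrix and the precision/recall scores discussed below could be computed with the helpers imported earlier (the cv value is an assumption):
# Cross-validated class predictions
train_target_pred = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10)
# Confusion matrix, precision and recall
confusion_matrix(train_target, train_target_pred)
precision_score(train_target, train_target_pred)
recall_score(train_target, train_target_pred)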
Now we’re getting a clearer picture of the performance of the model. The precision score tells us that for all the accidents that the model predicts to be severe, it is correct 48% of the time. The recall score tells us that for all accidents that are actually severe, the model predicts this correctly 0.88% of the time.
Let’s look at the precision and recall metrics in the form of 2 plots.
The first is the precision and recall trade-off versus various threshold scores.
# Visualise the precision/recall tradeoff
# Use the cross_val_predict() function to get the scores of all instances in the training set
# Return decision scores not predictions
train_target_decision_scores = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10, method = "decision_function")
# Compute the precision and recall for all possible thresholds
precisions, recalls, thresholds = precision_recall_curve(train_target, train_target_decision_scores)
# Plot precision and recall as functions of the threshold value
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precisions[:-1], "b--", label = "Precision")
plt.plot(thresholds, recalls[:-1], "g--", label = "Recall")
plt.grid(True)
plt.legend(loc = "upper right")
plt.xlabel("Thresholds", fontsize=16)
plt.ylabel("Precision / Recall", fontsize=16)
plt.title("Precision / Recall Tradeoff", fontsize=16)
plt.show()
Ideally, in this plot we want to select a reasonably high precision value for which recall is also reasonably high. However, this plot does not paint a nice picture for the performance of this model. If we select a precision over 50%, the recall is basically 0. Having a low recall means that the classifier will fail to recognise positive cases a lot of the time.
The values of precision and recall that you are happy with depend entirely on the problem at hand. There is no golden rule to follow here, although I do believe a recall of almost 0 is not good in any scenario.
The second plot directly compares precision against recall. Another useful evaluation tool is the ROC curve, which plots the true positive rate (recall) against the false positive rate. Ideally, we would like the curve to bend toward the top-left corner as much as possible. The 45-degree dotted line represents a purely random classifier, much like flipping a fair coin. In this plot, however, the curve is quite flat, indicating a poor model.
The AUC score can be calculated from the ROC curve; it simply summarises the curve in a single metric. A score of 1 indicates a perfect classifier, while a score of 0.5 indicates a purely random one.
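The ROC plotting code isn't shown above; a minimal sketch reusing the decision scores computed earlier with cross_val_predict():
# Compute the false positive rate and true positive rate for all thresholds
fpr, tpr, roc_thresholds = roc_curve(train_target, train_target_decision_scores)
# Plot the ROC curve against a purely random classifier
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label = "SGD classifier")
plt.plot([0, 1], [0, 1], "k--", label = "Random classifier")
plt.xlabel("False Positive Rate", fontsize=16)
plt.ylabel("True Positive Rate (Recall)", fontsize=16)
plt.title("ROC Curve", fontsize=16)
plt.legend(loc = "lower right")
plt.show()
# AUC: 1 = perfect, 0.5 = random
roc_auc_score(train_target, train_target_decision_scores)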
Although the AUC gives us a quick summary of the performance of the model, it does not tell us much about where the model is going wrong. It can also paint an overly optimistic picture of performance when the positive class is rare (as is the case with our accident severity target variable). The precision and recall scores, along with their plots, give more information on the performance of the model.
I hope you found this project helpful in your own machine learning endeavours. As always, I am open to feedback and suggestions. Leave a comment below with your thoughts.
In this week’s data science project, I apply principal component analysis (PCA) to a hospital mortality dataset using Python. The dataset is extensive, containing over 50 variables on demographic characteristics, vital signs, and other lab measurements. This should be a perfect use case for dimensionality reduction with PCA.
What is Principal Component Analysis?
Like my previous project on K-Means clustering, PCA is a type of unsupervised learning which means that we do not need any pre-defined labels for our data and we are not trying to make predictions.
The goal of PCA is to reduce the number of variables in our dataset while retaining nearly all of the original information and variability. A major benefit is speeding up machine learning: using fewer variables gives a nice boost to model training, consuming fewer resources and reaching a result faster.
Eigenvectors and eigenvalues drive PCA at its core. An eigenvector defines a direction (a line) through the data, and the algorithm chooses the direction along which the projected data points vary the most. The corresponding eigenvalue represents the magnitude of that variability. There are only as many eigenvectors as there are variables in the data.
An important question to ask in PCA is how many principal components to retain. The 2 most popular methods are:
Plotting the cumulative variance explained by the principal components. You choose a cutoff value for the variance and select the number of components at which that cutoff is reached.
A scree plot showing the eigenvalue of each principal component, where the usual cutoff is an eigenvalue of 1.
There are 3 critical processing conditions that must be satisfied when preparing data for PCA:
There must be no null or blank values in the dataset
All variables must be numeric
Standardise the variables to have a mean of 0 and a standard deviation of 1
Data & Project Goals
The data for this project is from Kaggle. Remember, if you’d like to know where I look for my datasets, check out my post on finding data.
This data is from the MIMIC-III database on in-hospital mortality. The original purpose of the dataset is to predict whether a patient lives or dies in hospital when in intensive care based on the risk factors selected.
There are a total of 51 variables in the data. In this project, we will attempt to reduce these variables down to a more manageable number using principal component analysis.
These are the goals for this project:
Prepare the data for PCA, satisfying the 3 critical processing conditions mentioned in the previous section
Apply the PCA algorithm (using the sklearn library)
Determine the optimal number of principal components to retain
Data Pre-Processing
We begin, as always, with importing the required libraries and the data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
# Import data
pca_raw_df = pd.read_csv("data/mortality.csv")
Next, we will carry out these 3 pre-processing steps:
Remove any rows that contain blank / NA
Select only numeric variables from the data
Standardise the variables
# 1. Remove rows that contain blank values
pca_nona_df = pca_raw_df.dropna()
# 2. Select only numeric columns
# Columns whose maximum value is 1 are assumed to be binary flags, so they are dropped
max_all = pca_nona_df.max()
max_cols = list(max_all[max_all != 1].index)
pca_sub_df = pca_nona_df[max_cols]
# Drop the identifier and categorical columns not caught by the rule above
pca_sub2_df = pca_sub_df.drop(columns = ['group', 'ID', 'gendera'])
# 3. Standardise the variables
scaler = StandardScaler()
pca_std = scaler.fit_transform(pca_sub2_df)
Once all pre-processing steps are complete, 38 variables remain.
Applying Principal Component Analysis
Now that our data is ready, we can apply PCA and then we will use these 2 methods to determine the optimal number of components to retain:
Cumulative variance
Scree plot
# Fit PCA
pca = PCA()
fit_pca = pca.fit_transform(pca_std)
Plot Cumulative Variance
# Plot the cumulative variance for each component
plt.figure(figsize = (15, 6))
components = np.arange(1, 39, step=1)
variance = np.cumsum(pca.explained_variance_ratio_)
plt.ylim(0.0,1.1)
plt.plot(components, variance, marker='o', linestyle='--', color='green')
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')
plt.axhline(y=0.90, color='r', linestyle='-')
plt.text(0.5, 0.85, '90% variance threshold', color = 'red', fontsize=16)
plt.text(25, 0.85, "Components needed: "+str(np.where(np.cumsum(pca.explained_variance_ratio_)>=0.9)[0][0]), color = "red", fontsize=16)
plt.show()
The above plot contains the cumulative variance as a proportion on the y axis and the number of components on the x axis.
Using a variance threshold of 90%, the above chart helps us determine how many components we should retain from our dataset in order for it to still make sense for us in any further modelling.
Note that the 90% variance threshold is not a golden rule; it is a choice made by the researcher or data scientist.
So, we can see from this plot that we need 22 components to retain 90% of the variability (information) in our data. That’s not much of an improvement over the total number of variables, 38.
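The scree plot code isn't shown above; a minimal sketch that reuses the fitted pca object and the components array (the eigenvalues are available as pca.explained_variance_):
# Scree plot: eigenvalue of each principal component
plt.figure(figsize = (15, 6))
plt.plot(components, pca.explained_variance_, marker='o', linestyle='--', color='green')
plt.axhline(y=1, color='r', linestyle='-')
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Eigenvalue')
plt.title('Scree plot of the eigenvalues per component')
plt.show()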
The above scree plot contains the eigenvalues on the y axis and the number of components on the x axis.
As a general rule of thumb, we should select the number of components that have an eigenvalue greater than 1. Based on the above scree plot we should choose to keep 13 components. This is very different to the previous cumulative variance plot where we chose to retain 22 components.
However, if we choose to retain 13 components, this only explains 65% of the variability. This is much lower than the variability captured when retaining 22 components (90%). At this point, the researcher or data scientist would need to decide whether a cumulative variance of 65% is satisfactory based on their domain knowledge and other research in the area.
Alternatively, there are other methods that can determine the optimal number of components. It is also possible that PCA just isn’t a suitable method for this dataset and other dimensionality reduction techniques could be explored.
I used these resources for this project, check them out if you’d like to dive deeper into PCA:
In this data science project, I tackle the problem of data segmentation or clustering, specifically applied to customer data.
There are a couple of different algorithms to choose from when clustering your data depending on your requirements and inputs. However, K-Means clustering is one of the most popular methods in machine learning and is the one I have chosen for this project using Python and Jupyter Notebook.
What is K-Means Clustering?
K-Means clustering is a type of unsupervised learning which means that we do not have any pre-defined labels in our data and so we are not trying to make predictions. Our goal is to find groups in our data such that individuals (data points) are relatively similar to each other within each group. This is useful to us because it allows us to make better business decisions about how to interact with our customers in marketing, product development, customer service, etc.
Great, so we already know that K-Means clustering is going to group together similar data points. How does it do that?
The algorithm starts with an initial estimate of the centroid of each cluster. These initial values could be randomly chosen data points, or you could set them yourself. Next, the algorithm cycles through the following two steps until the centroid positions converge (i.e. they stop changing):
Each data point is assigned to the cluster whose centroid is closest to it (measured with the squared Euclidean distance)
Once every data point has been assigned to a cluster, the mean of the points within each cluster is calculated and becomes that cluster’s new centroid. Repeat. (A minimal sketch of this loop is shown below.)
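To make these two steps concrete, here is a minimal NumPy sketch of the assign/update loop (purely illustrative; the project itself uses sklearn's KMeans, shown later):
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=19):
    rng = np.random.default_rng(seed)
    # Step 0: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid (squared Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids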
The most important piece of information that you need to specify in K-Means clustering is the number of clusters you would like the algorithm to form – this is denoted by K. There are a couple of methods you can use to determine the best value for K and we will be exploring two such methods in this project:
The Elbow Method (using inertia)
Silhouette Score
These methods run K-Means clustering on the data for each value of K in a specified range and compute the relevant metric. The results are then plotted and, depending on the method, the optimal value of K is selected.
Typically, K-Means clustering is carried out on 2-dimensional numeric data as it is easier to visualise and validate the result. However, it is still possible to apply K-Means clustering to higher-dimensional data, or you can reduce the dimensionality of the data first using a method like Principal Component Analysis (I’ll be doing a project on PCA soon!). Note that it is not advisable to carry out K-Means clustering on categorical data as the results may not be meaningful.
Data & Project Goals
The data for this project can be found over at Kaggle. If you’d like to know all the places I look for data, check out my post on finding data.
This is customer data containing 5 variables:
CustomerID
Gender
Age
Annual income (k$)
Spending score (1-100)
The spending score variable is pretty ambiguous and there is no indication of how it was created so it was removed from the data after the initial exploratory phase.
These are my goals for this project:
Explore the data using visualisations (using the matplotlib and seaborn libraries)
Apply the K-Means clustering algorithm (using the sklearn library)
Determine the optimal number of clusters using the elbow method and silhouette score
Split the dataset by gender and observe if the results are different
Analysing the Data
First up, we import the required libraries and set some custom parameters for the plots.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
The data is imported and I drop the CustomerID column as it doesn’t offer us any useful information.
# Import data
segment_df = pd.read_csv("Mall_Customers.csv")
# Drop CustomerID column and save as new dataframe
sub_df = segment_df.drop(columns = ["CustomerID"])
Let’s have a look at the data. Here I look for any funnies or anything strange that I may need to investigate further.
# View data
segment_df
We can see that there are only 200 rows so it doesn’t give us much data to play around with, especially when splitting by gender. Luckily we will not be doing any supervised machine learning because then we’d need a lot more data. This should be fine to just test our skills out a bit.
Data Visualisations
Since there are only a few variables, there isn’t much to explore. However, we should have a look at the distribution of gender in the data and then look at some scatter plots of the numeric variables (splitting by gender).
First, we create a summary table of the counts and percentages of gender in the data.
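The summary-table code isn't shown here; a minimal sketch that also creates the gender_percent, female and male values used in the bar-chart code below (the exact construction is an assumption):
# Counts and percentages of each gender
counts = segment_df["Gender"].value_counts()
gender_percent = pd.DataFrame({"Count": counts,
                               "Percent": (counts / counts.sum() * 100).round(1)})
female = gender_percent.loc["Female", "Percent"]
male = gender_percent.loc["Male", "Percent"]
gender_percent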
Plotting this in a bar chart we can see that there are more females than males in the data.
plt.figure(figsize = (6, 6))
sns.countplot(data = segment_df, x = "Gender")
plt.annotate(str(female)+"%", xy=(1, gender_percent["Count"].iloc[0]-10), color = 'white', size = 15, ha = 'center')
plt.annotate(str(male)+"%", xy=(0,gender_percent["Count"].iloc[1]-10), color = 'white', size = 15, ha = 'center')
plt.ylabel("Count")
plt.title("Distribution of Gender")
plt.show()
Now we visualise the correlation and distribution of the numeric variables using a pair plot. The data points are also split by gender by applying different colours to the plot.
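The pair-plot code isn't shown; a short seaborn sketch using the sub_df dataframe created earlier:
# Pair plot of the numeric variables, coloured by gender
sns.pairplot(sub_df, hue = "Gender", palette = "Dark2")
plt.show()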
There is nothing particularly interesting that stands out in this plot. However, spending score definitely shows some strange patterns in the scatter plot against income. As mentioned before, spending score will now be removed as we move on to applying K-Means clustering to our data.
All variables must be standardised before carrying out K-Means clustering, as the variables may not be measured on the same scale and leaving the data as-is could skew the results. While all the data here is on a relatively similar scale (0-100), I still standardise the data as good practice.
# Scale the features (only Age and Annual Income are selected)
scaler = StandardScaler()
features = segment_df[["Age", "Annual Income (k$)"]]
std_features = scaler.fit_transform(features)
std_features_df = pd.DataFrame({"std_age": std_features[:, 0], "std_income": std_features[:, 1]})
# Append the gender column to the dataframe
std_df = pd.concat([std_features_df, segment_df["Gender"]], axis = 1)
# Split the data into male and female dataframes
male_std_df = std_df[std_df["Gender"] == "Male"][["std_age", "std_income"]]
female_std_df = std_df[std_df["Gender"] == "Female"][["std_age", "std_income"]]
Now our data is ready for K-Means. All we need to do next is determine the optimal number of clusters. We use the elbow method and the silhouette score. In both cases, it is applied to the full dataset and to the female and male datasets with the results plotted on a single line chart.
Elbow Method
# Calculate inertia
# Full dataset
inertia_all = []
for n in range(1, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    km.fit(std_features_df)
    inertia_all.append(km.inertia_)
# Female dataset
inertia_f = []
for n in range(1, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    km.fit(female_std_df)
    inertia_f.append(km.inertia_)
# Male dataset
inertia_m = []
for n in range(1, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    km.fit(male_std_df)
    inertia_m.append(km.inertia_)
Plot the number of clusters vs inertia for each dataset.
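The plotting code isn't shown here; a minimal sketch of how the three inertia curves could be drawn on a single chart (the silhouette-score plot further down follows the same pattern, just over range(2, 11)):
# Plot inertia vs number of clusters for the full, female and male datasets
clusters = list(range(1, 11))
plt.figure(figsize = (9, 6))
plt.plot(clusters, inertia_all, marker = 'o', label = 'All')
plt.plot(clusters, inertia_f, marker = 'o', label = 'Female')
plt.plot(clusters, inertia_m, marker = 'o', label = 'Male')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.legend()
plt.show()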
What we are looking for is the ‘elbow’ in the chart. At what point does the line look like it is first bending? To me it looks like it is happening at K = 3 for all 3 datasets.
Silhouette Score
# Calculate silhouette score
# Full dataset
ss_all = []
for n in range(2, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    predict = km.fit_predict(std_features_df)
    score = silhouette_score(X = std_features_df, labels = predict)
    ss_all.append(score)
# Male dataset
ss_m = []
for n in range(2, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    predict = km.fit_predict(male_std_df)
    score = silhouette_score(X = male_std_df, labels = predict)
    ss_m.append(score)
# Female dataset
ss_f = []
for n in range(2, 11):
    km = KMeans(n_clusters = n, init = 'k-means++', n_init = 10, max_iter = 300,
                tol = 0.0001, random_state = 19)
    predict = km.fit_predict(female_std_df)
    score = silhouette_score(X = female_std_df, labels = predict)
    ss_f.append(score)
Plot the number of clusters vs the silhouette score for each dataset.
In this chart we are looking for the value of K for which the silhouette score is the highest. In this plot, the score is highest for K = 3. I found it much easier to determine the optimal number of clusters from this plot compared to the elbow method.
Now that we have selected the optimal number of clusters to run our final cluster analysis on, we proceed with the final model and plot. In this last plot I do not split by gender as the results are almost the same for all datasets.
# Apply k-means clustering to the full dataset
km = KMeans(n_clusters = 3, init = 'k-means++', n_init = 10, max_iter = 300,
            tol = 0.0001, random_state = 19)
km.fit(std_features_df)
all_labels = km.labels_
all_centroids = km.cluster_centers_
A scatter plot is shown with each datapoint assigned to a cluster along with each cluster’s centroid.
plt.figure(figsize = (9,6))
sns.scatterplot(x = std_features_df.iloc[:, 0], y = std_features_df.iloc[:, -1], hue = all_labels, palette = "Dark2", legend = False)
sns.scatterplot(x = all_centroids[:, 0], y = all_centroids[:, 1], color = "black", marker = "x", s = 200)
plt.ylabel("Income")
plt.xlabel("Age")
plt.title("Combined - Data Points Allocated to Clusters with Centroids")
plt.show()
These are the resources I used for this project, check them out if you want to dive deeper into K-Means clustering:
Probably the most valuable resource to me during my Masters research project was my notebook. I recorded everything about my project (meeting notes, analyses, ideas, theories, rough drafts) and even summaries of research papers that were interesting or relevant to my research.
Since starting some data science projects outside of research and academia, I have continued to keep notebooks and they have not failed me (unless I’m just an obsessive notebook keeper 🤷♀️).
In this blog post, I hope to encourage you to also keep a notebook and give you some essential guidelines to follow when putting together a notebook for your project.
What is a lab notebook?
A traditional lab notebook captures the thoughts, ideas, experiments, failures, successes and everything else that takes place during the course of lab work and a research project.
This would work very similarly for a data scientist. During a project, a lab notebook would capture all aspects of the thought process and iterations of the project. You could record all your ideas and theories to solve the problem and outline the process you would like to take to test those ideas. If the idea succeeds or fails then you note it down and include a brief reason why you think it worked or failed.
A traditional lab notebook would typically be a physical notebook but any digital document-editing software would achieve the same results (such as Google Docs or Word).
However, for a data scientist my recommended approach would be to use a markdown-based document – such as RStudio’s RMarkdown Notebooks or Jupyter Notebooks. These documents allow you to write code, view the output and add your own commentary in the same document without needing to go through the hassle of copy-paste-insert to get code and output into the document.
Why is it important to keep a lab notebook?
It is common practice for a data scientist to prepare documentation at the end of a project detailing its purpose, a quick summary of the approach, and some information on deployment and on reproducing the analysis.
However, this documentation would not detail all the ideas, theories and failed experiments throughout the project (yes, failures are just as important as the overall success of the project).
So, here are a couple of reasons to keep a lab notebook:
Clarity: keeping a lab notebook will force you to think clearly about your project as you progress through all the stages of your analysis. By recording your thoughts, asking questions and interpreting the results of your analyses, you will become clearer on your approach.
Context: future-you or collaborators will be able to understand the context behind your results: what you’ve tried, what worked, what didn’t and what the next steps might be for improvements or future work.
Centralised: keeping every detail about your project in one place will make it easier to find something later on.
Essential elements of a lab notebook
Hadley Wickham gives some great tips on what he calls an ‘analysis notebook’ in his book R for Data Science, which I have adapted, adding a couple of extra elements that have been particularly helpful to me.
Try to follow these guidelines when creating your lab notebook:
If you followed my other blog post on how to structure a data science project then you would have created a file called notebook.Rmd for this exact purpose within your project folder.
Your notebook will start with a YAML header. As a minimum, it’s a good idea to include a few key options. See the snippet below for an example from one of my projects:
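The original snippet isn't reproduced here, but a minimal illustrative YAML header for an RMarkdown notebook could look something like this (the title, author and output options are placeholders, not the original project's values):
---
title: "Project Title"
author: "Your Name"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
---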
Next, include some information on what you are hoping to achieve with your project and any goals that you may have (you can also fill in your goals later if you are doing some exploration first):
# Project Goals
This project aims to ...
This will be achieved through the following x goals:
1. First goal
2. Second goal
3. Third goal
Next you’ll want to include a setup chunk containing all the packages you load in for your analysis. There are 2 other packages that are recommended for better reproducibility:
The here package – simplifies file path references so that you can avoid the headache of fixing references to files before you run an analysis. You can learn more about this package here
The renv package – this package keeps track of all packages used in your project, including the versions. You can find the documentation here
Keep your notebook organised with headers and sub-headers using the # H1, ## H2, and ### H3 header tags for easy navigation. This will show up nicely in the table of contents in the HTML output of your notebook.
As you progress through your analysis, record everything and don’t delete code chunks that didn’t work out – just write a quick note about what went wrong and continue on.
If you need to do any data cleaning or if you are modifying the data in any way then clearly state why you are doing it.
At the end of your analysis, write a few closing remarks on what your findings were and give some recommendations for how the project can be improved so that if someone were to pick it up in the future they would know where to start.
I strongly encourage you to keep a notebook the next time you start a new project.
Not only will you gain more clarity in your thinking but you will also make the process of creating documentation at the end of your project much easier!