K-Means Clustering in Python: Customer Data Segmentation

In this data science project, I tackle the problem of data segmentation or clustering, specifically applied to customer data.

There are a couple of different algorithms to choose from when clustering your data depending on your requirements and inputs. However, K-Means clustering is one of the most popular methods in machine learning and is the one I have chosen for this project using Python and Jupyter Notebook.

What is K-Means Clustering?

K-Means clustering is a type of unsupervised learning, which means that we do not have any pre-defined labels in our data and so we are not trying to predict a known outcome. Our goal is to find groups in our data such that individuals (data points) within each group are relatively similar to each other. This is useful because it allows us to make better business decisions about how to interact with our customers in marketing, product development, customer service, etc.

Great, so we already know that K-Means clustering is going to group together similar data points. How does it do that?

The algorithm starts with an initial estimate of where the centroid of each cluster could be. These initial values could be randomly chosen data points or you could set them yourself. Next, the algorithm cycles through these two steps until the centroid positions converge (i.e. they no longer change):

  1. Each data point is assigned to the cluster whose centroid is closest to it (closeness is measured using the squared Euclidean distance)
  2. Once every data point belongs to a cluster, the mean of all data points within each cluster is calculated and this becomes that cluster’s new centroid. Repeat.
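
To make the two steps concrete, here is a minimal NumPy sketch of the loop. This is my own illustration and not what sklearn runs internally: the function name kmeans_sketch, its parameters and the six toy points are all made up, and the starting centroids are simply random data points.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain NumPy K-Means loop; illustrative only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: squared Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Six made-up points forming two obvious groups
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The sklearn implementation used later in this project adds refinements on top of this basic loop, such as the k-means++ initialisation and several restarts (n_init).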

The most important piece of information that you need to specify in K-Means clustering is the number of clusters you would like the algorithm to form – this is denoted by K. There are a couple of methods you can use to determine the best value for K and we will be exploring two such methods in this project:

  • The Elbow Method (using inertia)
  • Silhouette Score

Both methods work the same way: K-Means clustering is run on the data for each value of K in a chosen range and a metric is recorded for each run (inertia for the elbow method, the silhouette score for the other). The metric is then plotted against K and, depending on the method, the optimal value for K is read off the chart.

K-Means clustering is often demonstrated on 2-dimensional numeric data as it is easier to visualise and validate the result. However, it is still possible to apply K-Means clustering to higher dimensional data if you wish, or you can reduce the dimensionality of the data first using a method like Principal Component Analysis (I’ll be doing a project on PCA soon!). Note that it is not advisable to carry out K-Means clustering on categorical data: the algorithm relies on means and Euclidean distances, which are not meaningful for categories, so the resulting clusters can be misleading.
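
As a rough sketch of the "reduce first, then cluster" idea, the snippet below standardises some synthetic 6-feature data, projects it onto two principal components with sklearn's PCA and clusters the result. The data comes from make_blobs and is only a stand-in; it is not part of this project, and the choice of 6 features and 3 clusters is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 points with 6 numeric features (not the customer data)
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=19)

# Standardise, then project onto the first two principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Cluster in the reduced 2-D space, which can be drawn as a scatter plot
km = KMeans(n_clusters=3, n_init=10, random_state=19)
labels = km.fit_predict(X_2d)

print(X_2d.shape)
# Share of the original variance kept by the two components
print(pca.explained_variance_ratio_.sum())
```

It is worth checking the explained variance before trusting clusters found in the reduced space: if the two components keep only a small share of the variance, the 2-D picture may hide structure.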

Data & Project Goals

The data for this project can be found over at Kaggle. If you’d like to know all the places I look for data, check out my post on finding data.

This is customer data containing 5 variables:

  • CustomerID
  • Gender
  • Age
  • Annual income (k$)
  • Spending score (1-100)

The spending score variable is pretty ambiguous and there is no indication of how it was created so it was removed from the data after the initial exploratory phase.

These are my goals for this project:

  • Explore the data using visualisations (using the matplotlib and seaborn libraries)
  • Apply the K-Means clustering algorithm (using the sklearn library)
  • Determine the optimal number of clusters using the elbow method and silhouette score
  • Split the dataset by gender and observe if the results are different

Analysing the Data

First up, we import the required libraries and set some custom parameters for the plots.

import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt
import seaborn as sns

custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)

The data is imported and I drop the CustomerID column as it doesn’t offer us any useful information.

# Import data
segment_df = pd.read_csv("Mall_Customers.csv")

# Drop CustomerID column and save as new dataframe
sub_df = segment_df.drop(columns = ["CustomerID"])

Let’s have a look at the data. Here I look for any funnies or anything strange that I may need to investigate further.

# View data
print(sub_df)

We can see that there are only 200 rows so it doesn’t give us much data to play around with, especially when splitting by gender. Luckily we will not be doing any supervised machine learning because then we’d need a lot more data. This should be fine to just test our skills out a bit.

Data Visualisations

Since there are only a few variables, there isn’t much to explore. However, we should have a look at the distribution of gender in the data and then look at some scatter plots of the numeric variables (splitting by gender).

First, we create a summary table of the counts and percentages of gender in the data.

# Calculate percentage of each gender
gender_percent = pd.DataFrame()
gender_percent["Count"] = segment_df.groupby("Gender")["CustomerID"].count()
gender_percent["Percent"] = gender_percent["Count"] / gender_percent["Count"].sum() * 100
gender_percent["Percent"] = gender_percent["Percent"].astype(int)

female = gender_percent["Percent"].iloc[0]
male = gender_percent["Percent"].iloc[1]
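
As an aside, the same kind of summary table can be built with value_counts. This is just an alternative sketch on a tiny made-up Gender column (the demo frame is hypothetical), not the code used in the rest of the project. Note that it rounds the percentages instead of truncating them with astype(int).

```python
import pandas as pd

# Tiny made-up sample standing in for the Gender column
demo = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

# Counts per category, and the same counts as a share of all rows
counts = demo["Gender"].value_counts()
percent = demo["Gender"].value_counts(normalize=True).mul(100).round(1)

# The two Series share the same index, so they line up in one table
summary = pd.DataFrame({"Count": counts, "Percent": percent})
print(summary)
```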

Plotting this in a bar chart we can see that there are more females than males in the data.

plt.figure(figsize = (6, 6))
sns.countplot(data = segment_df, x = "Gender")
plt.annotate(str(female)+"%", xy=(1, gender_percent["Count"].iloc[0]-10), color = 'white', size = 15, ha = 'center')
plt.annotate(str(male)+"%", xy=(0,gender_percent["Count"].iloc[1]-10), color = 'white', size = 15, ha = 'center')
plt.title("Distribution of Gender")

[Figure: gender bar chart]

Now we visualise the correlation and distribution of the numeric variables using a pair plot. The data points are also split by gender by applying different colours to the plot.

There is nothing particularly interesting that stands out in this plot. However, spending score definitely shows some strange patterns in the scatter plot against income. As mentioned before, spending score will now be removed as we move on to applying K-Means clustering to our data.

# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(sub_df, hue = "Gender", height = 3)
plt.suptitle("Correlation & Distribution Pairplot", y=1.05)

[Figure: correlation pairplot]

Applying K-Means Clustering

All variables should be standardised before carrying out K-Means clustering because variables may be measured on different scales, and leaving the data as-is would let the variable with the largest range dominate the distance calculation and skew the results. While the variables here are on relatively similar scales (roughly 0-100), I still standardise the data as good practice.

# Scale the features (only age and annual income are selected)
scaler = StandardScaler()
features = segment_df[["Age", "Annual Income (k$)"]]
std_features = scaler.fit_transform(features)
std_features_df = pd.DataFrame({"std_age": std_features[:, 0], "std_income": std_features[:, 1]})

# Append the gender column to the dataframe
std_df = pd.concat([std_features_df, segment_df["Gender"]], axis = 1)

# Split the data into male and female dataframes
male_std_df = std_df[std_df["Gender"] == "Male"][["std_age", "std_income"]]
female_std_df = std_df[std_df["Gender"] == "Female"][["std_age", "std_income"]]
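
If you are curious what StandardScaler actually does to each column, the sketch below reproduces it by hand on four made-up rows of age and income (the numbers are invented purely for illustration). Each value has its column mean subtracted and is then divided by the column's standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages and incomes (k$), only to illustrate the transformation
features = np.array([[19.0, 15.0], [35.0, 54.0], [52.0, 88.0], [64.0, 137.0]])

scaled = StandardScaler().fit_transform(features)

# Same result by hand: subtract the column mean, divide by the column std
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```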

Now our data is ready for K-Means. All we need to do next is determine the optimal number of clusters. We use the elbow method and the silhouette score. Each method is applied to the full dataset and to the female and male datasets, with the results plotted on a single line chart.

Elbow Method

# Calculate inertia

# Full dataset
inertia_all = []
for n in range(1 , 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    # Fit the model and record its inertia
    km.fit(std_features_df)
    inertia_all.append(km.inertia_)

# Female dataset
inertia_f = []
for n in range(1 , 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    km.fit(female_std_df)
    inertia_f.append(km.inertia_)

# Male dataset
inertia_m = []
for n in range(1 , 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    km.fit(male_std_df)
    inertia_m.append(km.inertia_)

Plot the number of clusters vs inertia for each dataset.

plt.figure(figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia_m, 'o', label = "Male")
plt.plot(np.arange(1 , 11) , inertia_m, '-' , alpha = 0.5, color = 'g')
plt.plot(np.arange(1 , 11) , inertia_f, 'o', label = "Female")
plt.plot(np.arange(1 , 11) , inertia_f, '-' , alpha = 0.5, color = 'orange')
plt.plot(np.arange(1 , 11) , inertia_all, 'o', label = "Combined")
plt.plot(np.arange(1 , 11) , inertia_all, '-' , alpha = 0.5, color = 'purple')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.legend()

[Figure: inertia line plot]

What we are looking for is the ‘elbow’ in the chart. At what point does the line look like it is first bending? To me it looks like it is happening at K = 3 for all 3 datasets.
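
It helps to know what is on the y-axis of the elbow chart. Inertia is the sum of squared distances from each data point to the centroid of the cluster it was assigned to, so lower means tighter clusters. The sketch below recomputes it by hand and compares it with sklearn's inertia_ attribute; the data is synthetic (make_blobs) and only stands in for the customer data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the customer data)
X, _ = make_blobs(n_samples=150, centers=3, random_state=19)
km = KMeans(n_clusters=3, n_init=10, random_state=19).fit(X)

# Look up the centroid of each point's own cluster,
# then sum the squared distances to it
own_centroid = km.cluster_centers_[km.labels_]
manual_inertia = ((X - own_centroid) ** 2).sum()

print(manual_inertia, km.inertia_)
print(np.isclose(manual_inertia, km.inertia_))
```

Because adding clusters can only bring points closer to a centroid, inertia tends to keep dropping as K grows, which is why we look for the bend in the line and not its minimum.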

Silhouette Score

# Calculate silhouette score

# Full dataset
ss_all = []
for n in range(2, 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    predict = km.fit_predict(std_features_df)
    score = silhouette_score(X = std_features_df, labels = predict)
    # Record the score for this value of K
    ss_all.append(score)

# Male dataset
ss_m = []
for n in range(2, 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    predict = km.fit_predict(male_std_df)
    score = silhouette_score(X = male_std_df, labels = predict)
    ss_m.append(score)

# Female dataset
ss_f = []
for n in range(2, 11):
    km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
    predict = km.fit_predict(female_std_df)
    score = silhouette_score(X = female_std_df, labels = predict)
    ss_f.append(score)

Plot the number of clusters vs the silhouette score for each dataset.

plt.figure(figsize = (15 ,6))
plt.plot(np.arange(2 , 11) , ss_m , 'o', label = "Male")
plt.plot(np.arange(2 , 11) , ss_m , '-' , alpha = 0.5, color = 'g')
plt.plot(np.arange(2 , 11) , ss_f , 'o', label = "Female")
plt.plot(np.arange(2 , 11) , ss_f , '-' , alpha = 0.5, color = 'orange')
plt.plot(np.arange(2 , 11) , ss_all , 'o', label = "Combined")
plt.plot(np.arange(2 , 11) , ss_all , '-' , alpha = 0.5, color = 'purple')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.legend()

[Figure: silhouette line chart]

In this chart we are looking for the value of K for which the silhouette score is the highest. In this plot, the score is highest for K = 3. I found it much easier to determine the optimal number of clusters from this plot compared to the elbow method.
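
Since the rule here is simply "take the K with the highest score", the choice can also be made in code and not just by eye. Below is a small sketch of that idea on synthetic data that I generated with three clearly separated groups; the data, centres and variable names are all assumptions for illustration and have nothing to do with the customer dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three clearly separated groups
X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=1.0, random_state=19)

k_values = list(range(2, 11))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=19).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Pick the K whose silhouette score is the highest
best_k = k_values[int(np.argmax(scores))]
print(best_k)
```

I would still look at the plot as well, since two values of K with nearly identical scores deserve a closer look before settling on one.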

Now that we have selected the optimal number of clusters to run our final cluster analysis on, we proceed with the final model and plot. In this last plot I do not split by gender as the results are almost the same for all datasets.

# Apply k-means clustering to the full dataset
km = (KMeans(n_clusters = 3, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 19))
# The model must be fitted before labels_ and cluster_centers_ exist
km.fit(std_features_df)
all_labels = km.labels_
all_centroids = km.cluster_centers_

A scatter plot is shown with each datapoint assigned to a cluster along with each cluster’s centroid.

plt.figure(figsize = (9,6))
sns.scatterplot(x = std_features_df.iloc[:, 0], y = std_features_df.iloc[:, -1], hue = all_labels, palette = "Dark2", legend = False)
sns.scatterplot(x = all_centroids[:, 0], y = all_centroids[:, 1], color = "black", marker = "x", s = 200)
plt.title("Combined - Data Points Allocated to Clusters with Centroids")

[Figure: k means scatter plot]
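
The centroids above are in standardised units, which are hard to read as ages and incomes. One possible follow-up, sketched below, is to convert them back with the scaler's inverse_transform and summarise each cluster in the original units. The demo_df frame is randomly generated stand-in data with the same two column names, so the numbers it prints say nothing about the real customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the Age and Annual Income (k$) columns
rng = np.random.default_rng(19)
cols = ["Age", "Annual Income (k$)"]
demo_df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 138, size=200),
})

scaler = StandardScaler()
std_features = scaler.fit_transform(demo_df[cols].to_numpy(dtype=float))

km = KMeans(n_clusters=3, n_init=10, random_state=19).fit(std_features)

# Undo the standardisation so the centroids read as years and k$
centroids_original = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=cols
)

# Attach the labels and summarise each cluster in the original units
demo_df["Cluster"] = km.labels_
summary = demo_df.groupby("Cluster")[cols].mean()
sizes = demo_df["Cluster"].value_counts().sort_index()

print(centroids_original.round(1))
print(summary.round(1))
print(sizes)
```

The converted centroids should land close to the per-cluster means in the summary table, which makes the clusters easier to describe in plain terms such as "younger, lower income".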

These are the resources I used for this project, check them out if you want to dive deeper into K-Means clustering:

You can find all code for this project on GitHub.

I hope you found this project useful! Comment below or connect with me on Twitter if you have any questions.