Stochastic Gradient Descent in Python

What is Stochastic Gradient Descent?

Stochastic gradient descent is an optimisation technique, not a machine learning model. It is a method that allows us to efficiently train a machine learning model on large amounts of data. The word ‘descent’ gives the purpose of SGD away – to minimise a cost (or loss) function.

For a better understanding of the underlying principle of gradient descent (GD), let’s consider an example.

A Simplified Example of Gradient Descent

Gradient descent seeks to find the global minimum of a function. To do that, it calculates the gradient at an initial random point and moves to another point in the direction of descending gradient until it reaches a point where the gradient is zero.

As an example, see the figure of a parabola below.

The initial random point that starts the algorithm off is at (x = -2, y = 4). The gradient at this point is -3, so the algorithm steps to the right, in the direction opposite to the gradient, and calculates the gradient at the next point. Eventually, after a number of steps, the algorithm reaches a point where the gradient is 0 and stops. This point is referred to as the global optimum or global minimum.

The learning rate hyperparameter determines the size of each step – too small and training takes a long time, too large and the algorithm may overshoot and fail to converge to the global minimum.
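
To make this concrete, here is a minimal sketch of plain gradient descent on the parabola y = x², starting from x = -2. The function, starting point and learning rate are assumptions chosen to mirror the example above, not code from the project.

# Minimal gradient descent sketch on y = x^2 (assumed to mirror the figure above)
def gradient_descent(start_x, learning_rate = 0.1, n_steps = 50):
    x = start_x
    for _ in range(n_steps):
        gradient = 2 * x                     # derivative of y = x^2 at the current point
        x = x - learning_rate * gradient     # step in the direction of descending gradient
    return x

print(gradient_descent(start_x = -2))        # approaches the global minimum at x = 0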

For some models and datasets, the cost function has an irregular shape and can contain local minima in addition to the global minimum. It is good to keep this in mind when training a model and to adjust the learning strategy according to the problem at hand.

Batch vs Stochastic Gradient Descent

There are 3 types of Gradient Descent implementations: batch, mini-batch and stochastic. Which one you choose depends on the amount of data you have and the type of model you are fitting.

  • Batch Gradient Descent: the entire training dataset is used at every step. Thus this algorithm is very slow for large datasets but scales well for large numbers of features. There is also a possibility that this algorithm may get stuck in local minima.
  • Stochastic Gradient Descent: a single, random observation in the training data is selected at each step. This algorithm is very fast, only needing to perform calculations on a single point at a time. However, it is erratic and may select points from all over the place, never settling at a truly accurate solution. Instead, it approaches the minimum on average. That being said, this algorithm is much more likely to escape local minima and find the global minimum. Some of the erratic nature of the algorithm can be addressed by using a learning schedule that slowly reduces the learning rate so that it can settle on a more accurate solution (see the sketch after this list).
  • Mini-Batch Gradient Descent: the algorithm uses small subsets (or batches) of the training data at each step. This algorithm is faster than Batch GD but still suffers from the same drawback of potentially getting stuck in local minima.
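
To make the learning-schedule idea concrete, here is a rough sketch of stochastic gradient descent for simple linear regression with a learning rate that decays over time. The toy data, the schedule t0 / (t + t1) and its constants are illustrative assumptions, not the settings used later in this project.

import numpy as np

# Rough sketch: SGD for simple linear regression with a decaying learning rate
np.random.seed(42)
X = np.random.rand(100)                # single feature
y = 4 + 3 * X + np.random.randn(100) * 0.5

w, b = 0.0, 0.0                        # model: y_hat = w * x + b
t0, t1 = 5, 50                         # assumed learning schedule constants

def learning_rate(t):
    return t0 / (t + t1)               # the learning rate shrinks as training progresses

for epoch in range(50):
    for i in range(len(X)):
        j = np.random.randint(len(X))              # pick a single random observation
        error = (w * X[j] + b) - y[j]              # prediction error on that one point
        eta = learning_rate(epoch * len(X) + i)    # current learning rate
        w -= eta * error * X[j]                    # gradient step for the weight
        b -= eta * error                           # gradient step for the intercept

print(w, b)   # should end up close to the true values of 3 and 4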

How to Implement Stochastic Gradient Descent in Python

This project focuses on applying SGD to a classification problem. Therefore, the steps and performance measures chosen here are best suited to modelling a binary response variable.

Each aspect of the project is broken down below.

Preprocessing the data for classification

These are important steps for data preparation/preprocessing (a condensed sketch of steps 2, 5 and 6 follows the list):

  1. Clean up any funnies in the data – for example, in this project the values -1 or 9 are used to represent missing or unknown values, so these should be replaced with null
  2. Split the data into training and test sets (typically in an 80/20 ratio). To do this, the train_test_split function is used from sklearn.model_selection (making sure that shuffle = True as this is a requirement of SGD)
  3. Either remove or impute missing values
  4. Feature engineering – after some initial data exploration, new features can be created or existing features can be modified
  5. Standardise continuous variables using the StandardScaler function from sklearn.preprocessing
  6. Dummy code categorical variables using the OneHotEncoder function from sklearn.preprocessing, making sure to use the parameter drop = 'first' so that if a variable has 4 categories, for example, only 3 dummy columns are created and the 4th category is implied by all the others being 0.
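
Here is a rough, condensed sketch of steps 2, 5 and 6 on a toy dataset. The column names and values are placeholders rather than the actual road safety features, and sparse = False matches the scikit-learn version used later in this post (newer releases call it sparse_output).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy data standing in for the real dataset (column names and values are placeholders)
df = pd.DataFrame({
    "speed": [30, 60, 70, 30, 40, 50, 60, 70],
    "area": ["urban", "rural", "rural", "urban", "urban", "rural", "urban", "rural"],
    "target": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Step 2: split into training and test sets (shuffle = True is also the default)
train_features, test_features, train_target, test_target = train_test_split(
    df.drop("target", axis = 1), df["target"], test_size = 0.2, shuffle = True, random_state = 42)

# Step 5: standardise the continuous variable
scaler = StandardScaler()
train_features[["speed"]] = scaler.fit_transform(train_features[["speed"]])

# Step 6: dummy code the categorical variable, dropping the first category
onehot = OneHotEncoder(drop = 'first', sparse = False)
area_dummies = onehot.fit_transform(train_features[["area"]])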

Selecting the model (and loss function)

This project is a classification problem, so the SGDClassifier from sklearn.linear_model is used. Since SGD is just an optimisation algorithm, we must still pick the machine learning model that will be fitted to the data; this choice is made through the loss function.

These are the available options:

  • Support Vector Machine (loss = "hinge")
  • Smoothed hinge loss (loss = "modified_huber")
  • Logistic regression (loss = "log") – this loss function is chosen for this project
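
As a quick illustration, each of these options corresponds to a value of the loss parameter when constructing the classifier. This is only a minimal sketch; note that newer scikit-learn releases rename "log" to "log_loss".

from sklearn.linear_model import SGDClassifier

# The loss parameter determines which model SGD optimises
svm_clf = SGDClassifier(loss = "hinge")               # linear Support Vector Machine
huber_clf = SGDClassifier(loss = "modified_huber")    # smoothed hinge loss
log_clf = SGDClassifier(loss = "log")                 # logistic regression (used in this project)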

Measuring performance

There are a number of different performance measures available to evaluate a classifier. Although it’s not always necessary to use so many performance measures, since this project is for learning purposes, I will be using all of them.

Each of the performance measures below is calculated after performing cross-validation. This method involves splitting the training set into K folds, making predictions on each fold and then measuring its performance.

  • Cross validation accuracy (cross_val_score) – this is the ratio of correct predictions and is often not a good metric when our data is skewed (i.e. some classes appear more often than others)
  • Confusion matrix (confusion_matrix) – calculated from the predictions made during cross validation, this is a table containing 2 rows for actual classes and 2 columns for predicted classes. This matrix is used to calculate a few ratios (a small sketch after this list shows how)
  • Precision (precision_score) – for all cases that the model predicts to be positive, the precision tells us what percentage of them are correct
  • Recall (recall_score) – for all cases that are actually positive, recall tells us the percentage that are correctly predicted
  • Precision and recall trade off against each other. The preferred level of these 2 metrics depends on the project. Two plots are used to depict the trade-off between them (predictions are made using the cross_val_predict function with the parameter method = "decision_function" and then precision, recall and thresholds are calculated with the precision_recall_curve function):
    • Precision/recall vs thresholds plot: in this plot we can see the trade-off relationship between precision and recall
    • Precision vs recall plot: we can use this plot to select a precision and associated recall value that occurs just before the curve dips sharply
  • ROC curve (roc_curve) – a 45 degree line is plotted alongside this curve to compare the accuracy of our model against that of a purely random classifier. The goal is for our model’s curve to angle as far away from this line as possible.
  • AUC score (roc_auc_score) – this score summarises the ROC curve. A score of 1 indicates a perfect model whereas a score of 0.5 represents a model indistinguishable from chance.
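
To make the link between the confusion matrix, precision and recall concrete, here is a tiny sketch on made-up labels and predictions; the numbers are purely illustrative.

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP); Recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))      # 3 / (3 + 3) = 0.5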

Analysing Road Safety Data

Project Overview

The UK Department for Transport has released data on reported road accidents to the public dating back to 1979. The information provided only relates to personal injury accidents that occur on public roads and that are reported to the police. This project uses the 2020 accident dataset.

The goal of this project is to:

  • Explore the road safety accident dataset, selecting features and adjusting them as needed for modelling
  • Apply the Stochastic Gradient Descent optimisation technique with a log loss function

Specifically, this will be treated as a classification problem with a binary target (response) variable called accident severity. In this project, we will be trying to determine if there are any patterns in the data that can help to predict whether an accident will be severe or not.

You can see all the code used in this project on GitHub. This blog post aims to summarise the process and findings.

Python libraries used in this project:

# Import libraries
import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")

Data Description

The road safety accident data contains over 30 different features. After going through these columns I selected just 12 (including an index column) to be included in this project.

  • accident_index
  • longitude
  • latitude
  • accident_severity: 1 = fatal, 2 = serious, 3 = slight
  • number_of_vehicles
  • date
  • day_of_week: 1-7 = Sun-Sat
  • time
  • speed_limit: 20, 30, 40, 50, 60, 70, -1 = data missing, 99 = unknown
  • light_conditions: 1 = daylight, 4 = darkness – lights lit, 5 = darkness – lights unlit, 6 = darkness – no lighting, 7 = darkness – lighting unknown, -1 = data missing
  • road_surface_conditions: 1 = dry, 2 = wet or damp, 3 = snow, 4 = frost or ice, 5 = flood over 3cm, deep, 6 = oil or diesel, 7 = mud, -1 = data missing, 9 = unknown
  • urban_or_rural_area: 1 = urban, 2 = rural, 3 = unallocated, -1 = data missing

Data Exploration

To start off with, the features have different types that must each be dealt with – from dates, times, and continuous values to various numbers of categories.

Some features also have values that are not valid. For example, the speed limit feature contains a -1 value when data is missing and a 99 value when data is unknown. Both these values are replaced with null and subsequently removed from the dataset (I will not be dealing with imputation in this project).
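
As a rough sketch of this clean-up for the speed limit feature (the DataFrame here is a toy stand-in; in the project the full 2020 accident file is loaded first):

import numpy as np
import pandas as pd

# Toy stand-in for the accident data
accidents = pd.DataFrame({"speed_limit": [30, 60, -1, 99, 70]})

# Replace the 'data missing' (-1) and 'unknown' (99) codes with null, then drop those rows
accidents["speed_limit"] = accidents["speed_limit"].replace({-1: np.nan, 99: np.nan})
accidents = accidents.dropna(subset = ["speed_limit"])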

Accident severity is the target variable for this project. Currently it has 3 categories: fatal, serious and slight. Since the fatal category is so rare and also because I’d like to treat this as a binary classification problem, I have combined the fatal and serious classes.
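
A small sketch of this recoding, using the severity codes from the data description (1 = fatal, 2 = serious, 3 = slight); the values below are illustrative.

import pandas as pd

# Toy severity values: 1 = fatal, 2 = serious, 3 = slight
severity = pd.Series([3, 2, 3, 1, 3])

# Fatal and serious become the positive class (1); slight becomes the negative class (0)
binary_severity = (severity < 3).astype(int)
print(binary_severity.tolist())   # [0, 1, 0, 1, 0]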

Even with the combination of these 2 categories, this is quite a skewed variable, with only 21% of cases appearing in the positive class, so some performance issues could be expected.

Next, the date and time variables are converted to their appropriate types, and I decided to turn them both into categorical variables. The dates are converted into quarters so that they fall into 4 categories: Q1 – Q4. The times are grouped into 3 equal-sized categories: morning (4am – 11am), afternoon (12pm – 7pm) and evening (8pm – 3am).

Quarters of the year bar chart
Time of day bar chart
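
Here is a condensed sketch of these conversions on a few toy values. The date format and bin edges are assumptions that mirror the grouping described above; the column names month_cat and time_category match those used in the preprocessing code later on.

import pandas as pd

# Toy date and time values standing in for the real columns
df = pd.DataFrame({
    "date": ["15/01/2020", "03/07/2020", "21/11/2020"],
    "time": ["08:30", "18:05", "23:45"],
})

# Convert dates to quarters (Q1 - Q4)
df["month_cat"] = "Q" + pd.to_datetime(df["date"], format = "%d/%m/%Y").dt.quarter.astype(str)

# Group times into morning (4am - 11am), afternoon (12pm - 7pm) and evening (8pm - 3am)
hour = pd.to_datetime(df["time"], format = "%H:%M").dt.hour
df["time_category"] = pd.cut(hour, bins = [-1, 3, 11, 19, 23],
                             labels = ["evening", "morning", "afternoon", "evening"],
                             ordered = False)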

The variable for the day of the week is left unchanged and will be converted into a dummy variable later on for modelling.

The next few variables are each treated similarly. They have multiple categories, some of which only contain a handful of cases. Because of this, I convert each of them into a binary variable, with 0 representing what I thought was an ‘ideal’ scenario, such as dry roads, daylight, an urban area and a low speed limit.
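
As a small sketch for one of these features, using the road surface codes from the data description (1 = dry is treated as the ‘ideal’ scenario; the values below are illustrative):

import pandas as pd

# Toy road surface codes: 1 = dry, 2 = wet or damp, 4 = frost or ice
surface = pd.Series([1, 2, 1, 4, 1])

# 0 represents the 'ideal' dry-road scenario; every other condition is grouped into 1
surface_binary = (surface != 1).astype(int)
print(surface_binary.tolist())   # [0, 1, 0, 1, 0]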

The last 2 charts are for the continuous features in the dataset. Neither of them is altered in any way. The first is the number of vehicles involved in the accident, with the majority of accidents involving only 1 or 2 vehicles. The second chart is a geographical scatter plot made up of the longitude and latitude values of the accidents. Neither of these charts reveals anything noteworthy or worth looking further into.

Histogram of number of vehicles
Geographical scatter plot

Data Preprocessing

Other than combining the categories of some of the features and constructing some new features, we must also ensure that any categorical features with more than 2 classes are dummy coded and any continuous variables are standardised.

# Onehotencoder for the day of week, time & month category features
onehot_df = train_features[["day_of_week", "time_category", "month_cat"]]
onehot = OneHotEncoder(drop = 'first', sparse = False)
onehot_df = pd.DataFrame(onehot.fit_transform(onehot_df),
                         columns = onehot.get_feature_names(["day_of_week", "time_category", "month_cat"]))

# Add onehotencoder features back to train set
train_features.reset_index(drop=True, inplace=True)
train_features = pd.concat([train_features, onehot_df], axis = 1)

# Remove redundant columns
train_features = train_features.drop(["date", "time", "day_of_week", "time_category", "month_cat"], axis = 1)

# Standardise the continuous features
scaler = StandardScaler()
train_features_scaled = train_features.copy()
features_to_scale = train_features_scaled[["longitude", "latitude", "number_of_vehicles"]]
features_to_scale = scaler.fit_transform(features_to_scale.values)
train_features_scaled[["longitude", "latitude", "number_of_vehicles"]] = features_to_scale

Training & Evaluating the Model

To train the model, the SGD classifier is used with a log loss function:

sgd_clf = SGDClassifier(random_state = 42, loss = 'log')
sgd_clf.fit(train_features_scaled, train_target)

Evaluating the model starts with calculating cross validation accuracy scores.

# Cross validation accuracy scores
cross_val_score(sgd_clf, train_features_scaled, train_target, cv = 10, scoring = "accuracy")

I chose to perform 10 folds of cross validation and get these scores:

[0.7807249 , 0.78197473, 0.78100264, 0.78169699, 0.78194444, 0.78236111, 0.78194444, 0.78222222, 0.78180556, 0.78236111]

Not too bad. However, remember that this score is not very reliable when the data is skewed, i.e. when there are more cases in some classes compared to others. This is true in this data, as we noticed that only 21% of the cases fall in the positive class (fatal / serious accident severity) compared to the negative class (slight accident severity).

Next, we put together a confusion matrix to determine how well our classifier predicts the target.

# Confusion matrix
train_target_pred = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10)
conf_matrix = confusion_matrix(train_target, train_target_pred)

This gives the following table:

                       Predicted – Negative    Predicted – Positive
Actual – Negative      56 155                  147
Actual – Positive      15 564                  138
Confusion Matrix

That doesn’t look so good – over 15k cases were predicted to be negative but were actually positive!

Let’s have a look at the precision and recall, calculated from this confusion matrix:

# Precision & Recall
print("Precision: ", round(precision_score(train_target, train_target_pred)*100,2))
print("Recall: ", round(recall_score(train_target, train_target_pred)*100,2))

Precision: 48.42

Recall: 0.88

Now we’re getting a clearer picture of the performance of the model. The precision score tells us that for all the accidents that the model predicts to be severe, it is correct 48% of the time. The recall score tells us that for all accidents that are actually severe, the model predicts this correctly 0.88% of the time.

Let’s look at the precision and recall metrics in the form of 2 plots.

The first is the precision and recall trade-off versus various threshold scores.

# Visualise the precision/recall tradeoff

# Use the cross_val_predict() function to get the scores of all instances in the training set
# Return decision scores not predictions
train_target_decision_scores = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10, method = "decision_function")

# Compute the precision and recall for all possible thresholds
precisions, recalls, thresholds = precision_recall_curve(train_target, train_target_decision_scores)

# Plot precision and recall as functions of the threshold value
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precisions[:-1], "b--", label = "Precision")
plt.plot(thresholds, recalls[:-1], "g--", label = "Recall")
plt.grid(True)

plt.legend(loc = "upper right")
plt.xlabel("Thresholds", fontsize=16)
plt.ylabel("Precision / Recall", fontsize=16)
plt.title("Precision / Recall Tradeoff", fontsize=16)
plt.show()

Ideally, in this plot we want to select a reasonably high precision value for which recall is also reasonably high. However, this plot does not paint a nice picture for the performance of this model. If we select a precision over 50%, the recall is basically 0. Having a low recall means that the classifier will fail to recognise positive cases a lot of the time.

The values of precision and recall that you are happy with depend entirely on the problem at hand. There is no golden rule to follow here. That said, I do believe a recall of almost 0 is not good in any scenario.

The second plot directly compares precision against recall:

# Plot precision against recall

plt.figure(figsize=(8, 6))
plt.plot(recalls, precisions, "b-", linewidth=2)
plt.axis([0, 1, 0, 1])
plt.grid(True)

plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)
plt.title("Precision vs Recall", fontsize=16)
plt.show()

This plot leads to the same conclusion as the previous plot but does so from a different perspective.

Next up is the ROC curve and the associated AUC score.

# The ROC Curve
# Calculate the metrics
fpr, tpr, thresholds = roc_curve(train_target, train_target_decision_scores)

# Plot FPR against TPR
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth = 2)
plt.plot([0,1], [0,1], "k--") # 45 degree line
plt.grid(True)

plt.xlabel('FPR (1 - Specificity)', fontsize=16)
plt.ylabel('TPR (Sensitivity, Recall)', fontsize=16)
plt.title('ROC Curve', fontsize=16)
plt.show()

In the ROC curve, we would ideally like the curve to angle itself toward the top left corner as much as possible. The 45-degree dotted line represents a purely random classifier, much like flipping a fair coin. In this plot, however, the curve is quite flat, indicating a poor model.

The AUC score can be calculated from the ROC curve. It simply summarises the curve in a single metric. A score of 1 (100%) indicates a perfect fit, while a score of 0.5 (50%) indicates a purely random model.

# AUC Score
print("AUC Score: ", round(roc_auc_score(train_target, train_target_decision_scores)*100, 2))

AUC Score: 60.66

Although the score gives us a quick summary of the performance of the model, it does not tell us much about where the model is going wrong. It also tends to paint an overly optimistic picture of performance when the positive class is rare (as is the case with our accident severity target variable). The precision and recall scores, along with their plots, give more information on the performance of the model.

I hope you found this project helpful in your own machine learning endeavours. As always, I am open to feedback and suggestions. Leave a comment below with your thoughts.