Evaluating Classification Models: Understanding the Confusion Matrix and ROC Curves

One of the most important aspects of machine learning classification models is evaluating how well they predict the target. For this, it’s essential to have a solid understanding of the confusion matrix and ROC curves.

The confusion matrix breaks down a model’s predictions by showing true positives, true negatives, false positives, and false negatives. On the other hand, an ROC curve offers a visual representation of a model’s discrimination skills, showcasing how the model can tell classes apart at different decision thresholds.

Let’s break down what these two evaluation techniques mean and how you can implement them in your machine learning projects.

Confusion Matrix

A confusion matrix measures the performance and accuracy of machine learning classification models. It gives us a breakdown of the predictions made by a model compared to the actual outcomes. In other words, how confused is your model?

The matrix is mainly used for binary classification. Still, it can be extended to multi-class problems, where there would be as many rows and columns as classes in the target variable.
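
To make the multi-class case concrete, here is a minimal sketch that builds a 3-by-3 confusion matrix with sklearn. The animal labels and predictions are invented for illustration. Rows are the actual classes and columns are the predicted classes, in the order passed to the labels argument.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem (illustrative only)
y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

# Fix the class order so the rows and columns are easy to read
labels = ["bird", "cat", "dog"]
cm_multi = confusion_matrix(y_true, y_pred, labels=labels)

# One row and one column per class; the diagonal holds the correct predictions
print(cm_multi)
```

Everything off the diagonal is a misclassification, so you can see at a glance which pairs of classes the model mixes up.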

Confusion matrix (image by author)

We’ll use the example of cancer prediction to help with the explanations. A positive case is one where cancer is present, and a negative case is one without cancer.

The confusion matrix is made up of 4 components:

  • True positives (TP): cases that are correctly predicted as positive. The model correctly predicts cancer patients as having cancer.
  • True negatives (TN): cases that are correctly predicted as negative. The model correctly predicts that non-cancer patients do not have cancer.
  • False positives (FP): also called a Type I Error, these are cases where the model wrongly classifies a negative case as positive. Here, the model predicts that a non-cancer patient has cancer – not ideal, but not too severe either.
  • False negatives (FN): also called a Type II Error, these are cases where the model wrongly classifies a positive case as negative. Here, the model predicts that a cancer patient does not have cancer – very dangerous to the patient.

False negatives, or type II errors, are the more dangerous kind of error in many classification tasks, and you will want to minimize them as much as possible. As in the example above, the last thing you want to do is tell someone they don’t have cancer when they actually do.
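
The four components can be counted by hand from a list of actual and predicted labels. The sketch below uses ten made-up patients (1 = cancer, 0 = no cancer); the data is an assumption for illustration, not taken from a real model.

```python
# Hypothetical labels: 1 = cancer present, 0 = no cancer
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Count each cell of the confusion matrix
TP = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
TN = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
FP = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)  # Type I error
FN = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)  # Type II error

print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")
```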

Summary Metrics

We can calculate a few summary ratios from the confusion matrix, each with a different meaning and interpretation.

Accuracy score

Accuracy is the percentage of cases that the model predicted correctly. This is a very high-level summary; we need more information to evaluate the classifier properly.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Considerations when using the accuracy score:

  • High accuracy isn’t enough for classifiers with more than 2 classes because you don’t know if all classes are predicted equally well or if the model neglects 1 or more classes.
  • For imbalanced classes, high accuracy could result from most predictions going into the more common class rather than as a reflection of the model’s predictive power.
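
A quick way to see the imbalance problem is a "model" that always predicts the majority class. The numbers below are invented: 5 cancer patients out of 100. The accuracy looks good even though not a single cancer case is detected.

```python
# Hypothetical screening data: 5 patients with cancer, 95 without
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the majority (negative) class
y_pred = [0] * 100

correct = sum(1 for actual, pred in zip(y_true, y_pred) if actual == pred)
accuracy = correct / len(y_true)

# Number of actual cancer cases that the model caught
detected = sum(1 for actual, pred in zip(y_true, y_pred) if actual == 1 and pred == 1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Cancer cases detected: {detected} of {sum(y_true)}")
```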

Precision (or positive predictive value)

Precision is the ratio of correct positive predictions out of all positive predictions (both correct and incorrect). If we have high precision, then we minimize false positives (or type I errors). However, this metric can be misleading in imbalanced datasets where one class dominates.

Precision = \frac{TP}{TP + FP}

Recall (sensitivity or true positive rate)

Recall is the ratio of correct positive predictions out of all positive cases. From our cancer example, recall measures the model’s ability to correctly detect cancer patients out of all those with cancer. High recall means that false negatives (type II errors) are minimized.

Recall = \frac{TP}{TP + FN}

Specificity (true negative rate)

Specificity is the ratio of correct negative predictions out of all cases that are actually negative. From our cancer example, specificity measures the model’s ability to correctly predict non-cancer patients out of all those who don’t have cancer.

Specificity = \frac{TN}{TN + FP}
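
Since specificity is simply recall computed for the negative class, one way to get it from sklearn is to call recall_score with pos_label=0. The sketch below checks this against the formula above, using made-up labels where 0 is the negative class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = cancer present, 0 = no cancer
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Specificity from the confusion matrix counts
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
specificity_manual = TN / (TN + FP)

# Specificity as the recall of the negative class
specificity_sklearn = recall_score(y_true, y_pred, pos_label=0)

print(f"Manual: {specificity_manual:.3f}, sklearn: {specificity_sklearn:.3f}")
```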

F1 score

The F1 score is the harmonic mean of precision and recall, providing a balanced measure. Using a harmonic mean helps us find a sweet spot between precision and recall, ensuring we don’t favor one metric over another. It is useful in situations where both false positives and false negatives have significant consequences.

F_{1} = 2 * \frac{precision * recall}{precision + recall} = \frac{TP}{TP + \frac{FN + FP}{2}}

The F1 score can be a preferred metric when there is an uneven class distribution, and it tends to favor classifiers with similar precision and recall. However, different applications may favor one measure over the other (such as with the cancer example, where it is preferable to minimize type II errors), and this is where the F1 score is unsuitable.
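
To see why the harmonic mean favors classifiers with similar precision and recall, consider a very cautious classifier with perfect precision but poor recall. The counts below are hypothetical. The sketch also checks that the two forms of the F1 formula above give the same number.

```python
# Hypothetical counts for a very cautious classifier
TP, FP, FN = 10, 0, 90

precision = TP / (TP + FP)  # every positive prediction is correct
recall = TP / (TP + FN)     # but most actual positives are missed

# A plain average hides the poor recall; the harmonic mean does not
arithmetic_mean = (precision + recall) / 2
f1_from_ratios = 2 * precision * recall / (precision + recall)
f1_from_counts = TP / (TP + (FN + FP) / 2)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 (from precision and recall): {f1_from_ratios:.2f}")
print(f"F1 (from counts): {f1_from_counts:.2f}")
```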

Balanced accuracy (BA)

The balanced accuracy score is used to combat the downsides of using the accuracy score with imbalanced data. This metric is calculated as the average of sensitivity and specificity.

Balanced\ Accuracy = \frac{1}{2} \bigg(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\bigg)
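
Here is a small worked example of how balanced accuracy differs from plain accuracy on imbalanced data. The counts are invented: 5 actual positives and 95 actual negatives.

```python
# Hypothetical counts on an imbalanced test set (5 positives, 95 negatives)
TP, FN = 3, 2
TN, FP = 90, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = 0.5 * (sensitivity + specificity)

print(f"Accuracy: {accuracy:.2f}")
print(f"Balanced accuracy: {balanced_accuracy:.2f}")
```

Accuracy is dominated by the large negative class, while balanced accuracy gives both classes equal weight, so the weaker performance on the positives pulls it down.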

Creating Summary Metrics in Python with Sklearn

Here is an example Python code snippet for creating a confusion matrix and the summary metrics:

from sklearn.metrics import confusion_matrix

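# Creating a confusion matrix
# y_true holds the actual labels; get the y_pred predicted values from your fitted model
cm = confusion_matrix(y_true, y_pred)

# Displaying the confusion matrix
print("Confusion Matrix:")
print(cm)

# Extract the four counts from the confusion matrix
# (for binary 0/1 labels, ravel() returns them in the order TN, FP, FN, TP)
TN, FP, FN, TP = cm.ravel()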

# Metrics derived from the confusion matrix
accuracy = (TP + TN) / (TP + TN + FP + FN)
b_accuracy = 0.5 * (TP / (TP + FN) + TN / (TN + FP))
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

# Displaying metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Balanced Accuracy: {b_accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")

With sklearn, you also don’t have to calculate these metrics from scratch. Here’s how you can use a few built-in functions:

import sklearn.metrics as metrics

# Accuracy
metrics.accuracy_score(y_true, y_pred)

# Balanced accuracy
metrics.balanced_accuracy_score(y_true, y_pred)

# Precision
metrics.precision_score(y_true, y_pred)

# Recall
metrics.recall_score(y_true, y_pred)

# F1 score
metrics.f1_score(y_true, y_pred)
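
If you want to convince yourself that the built-in functions match the formulas, a quick check along these lines can help. The labels are made up, and the dictionary names manual and builtin are just for this sketch.

```python
import sklearn.metrics as metrics

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

TN, FP, FN, TP = metrics.confusion_matrix(y_true, y_pred).ravel()
precision = TP / (TP + FP)
recall = TP / (TP + FN)

# Metrics computed from the confusion matrix counts
manual = {
    "accuracy": (TP + TN) / (TP + TN + FP + FN),
    "balanced_accuracy": 0.5 * (TP / (TP + FN) + TN / (TN + FP)),
    "precision": precision,
    "recall": recall,
    "f1": 2 * (precision * recall) / (precision + recall),
}

# The same metrics from sklearn's built-in functions
builtin = {
    "accuracy": metrics.accuracy_score(y_true, y_pred),
    "balanced_accuracy": metrics.balanced_accuracy_score(y_true, y_pred),
    "precision": metrics.precision_score(y_true, y_pred),
    "recall": metrics.recall_score(y_true, y_pred),
    "f1": metrics.f1_score(y_true, y_pred),
}

for name in manual:
    print(f"{name}: manual={manual[name]:.3f}, sklearn={builtin[name]:.3f}")
```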

Visualizing the Confusion Matrix

We can use a visualization like a heatmap to make it easier to understand and pick out red flags in the confusion matrix. Here’s how you can create a heatmap visualization of a confusion matrix in Python using the seaborn and matplotlib libraries:

import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])
plt.title("Confusion Matrix Heatmap")
plt.show()

In this example:

  • cm is the confusion matrix we obtained using sklearn’s confusion_matrix function earlier.
  • seaborn is used to create the heatmap.
  • annot=True adds the actual values to each cell.
  • fmt="d" formats the cell values as integers.
  • cmap="Blues" sets the color palette.

ROC Curves

Receiver Operating Characteristic (ROC) curves are graphical representations of how well a model can tell classes apart at different decision thresholds.

An ROC curve plots the true positive rate (TPR, also called sensitivity or recall) against the false positive rate (FPR, equal to 1 – specificity) at various classification thresholds. This gives a good overview of a model’s performance across thresholds and helps in understanding the trade-off between TPR and FPR.

We can also calculate the Area Under the ROC Curve (AUC) for a single measure of the model’s overall performance. An AUC of 0.5 corresponds to random guessing and an AUC of 1 to a perfect classifier, so a higher AUC means better model performance.
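
One common way to read the AUC is as the probability that the model gives a randomly chosen positive case a higher score than a randomly chosen negative case. The sketch below computes that fraction directly in plain Python; the labels and scores are invented for illustration.

```python
# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

pos_scores = [p for y, p in zip(y_true, y_prob) if y == 1]
neg_scores = [p for y, p in zip(y_true, y_prob) if y == 0]

# Count positive/negative pairs that are ranked correctly; ties count as half
wins = 0.0
for pos in pos_scores:
    for neg in neg_scores:
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5

auc_pairwise = wins / (len(pos_scores) * len(neg_scores))
print(f"AUC (pairwise ranking): {auc_pairwise:.3f}")
```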

ROC Curve (source: Wikipedia)

The diagonal line represents random guessing; any curve above it indicates better-than-random performance. The closer the curve is to the top-left corner, the higher the model’s performance.

Sensitivity and specificity are typically inversely related, and adjusting the classification threshold (i.e., at what probability value is a prediction considered positive) affects both metrics. You can choose an operating point on the ROC curve based on the needs of your project, balancing sensitivity and specificity.

As the classification threshold decreases, sensitivity tends to increase (captures more positives), but specificity may decrease (captures more false positives). On the other hand, as the threshold increases, sensitivity tends to decrease (misses more positives), but specificity may increase (reduces false positives).
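
The sketch below shows this trade-off on six made-up predictions by computing sensitivity and specificity at three thresholds. The helper function sensitivity_specificity is hypothetical and written just for this example.

```python
# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]


def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are predicted positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Evaluate a low, a middle, and a high threshold
results = {t: sensitivity_specificity(y_true, y_prob, t) for t in (0.3, 0.5, 0.85)}

for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, the lowest threshold catches every positive at the cost of more false positives, and the highest threshold does the opposite.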

For imbalanced datasets, this balancing act in the ROC curve is particularly useful in helping to assess model performance beyond accuracy.

ROC Curve Construction

First, start by training a binary classification model on your dataset. This could be logistic regression, support vector machines, random forests, or any other binary classifier. You can find a detailed project on my website where we fitted a logistic regression to road safety data with a focus on the stochastic gradient descent optimization algorithm – check it out to learn more!

If you want to build your own machine learning model but don’t know where to find data, we have a post on that too!

Obtain predictions and predicted probabilities from your model for the test set. Most classifiers provide a probability score indicating the confidence of the prediction. A binary classifier typically outputs probabilities between 0 and 1.
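
As a minimal sketch of this step, the snippet below fits a logistic regression on a synthetic dataset and extracts the predicted probabilities for the positive class. The dataset from make_classification and all parameter values are assumptions standing in for your own data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary dataset standing in for real data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class (label 1)
y_prob = model.predict_proba(X_test)[:, 1]

# predict returns hard class labels (for logistic regression, a 0.5 probability cut-off)
y_pred = model.predict(X_test)

print(y_prob[:5])
print(y_pred[:5])
```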

You would need to choose a threshold to convert these probabilities into class predictions (e.g., 0 or 1). Varying the threshold allows us to generate different points on the ROC curve. For each threshold, we need to calculate the true positive rate and the false positive rate. Then, we can plot each FPR and TPR pair on the ROC chart and obtain the AUC summary score.
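
To show what happens under the hood, here is a rough from-scratch sketch of the steps above in plain Python: one (FPR, TPR) point per threshold, followed by the AUC using the trapezoidal rule. The labels and scores are invented, and this is only an illustration, not how sklearn implements it.

```python
# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Start at (0, 0): a threshold above every score predicts no positives
roc_points = [(0.0, 0.0)]

# Use each distinct score as a threshold, from highest to lowest
for threshold in sorted(set(y_prob), reverse=True):
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    roc_points.append((fp / n_neg, tp / n_pos))

# Area under the curve with the trapezoidal rule
auc_manual = 0.0
for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
    auc_manual += (x1 - x0) * (y0 + y1) / 2

for fpr_value, tpr_value in roc_points:
    print(f"FPR={fpr_value:.2f}, TPR={tpr_value:.2f}")
print(f"AUC: {auc_manual:.3f}")
```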

Luckily, sklearn does all the heavy lifting for us, so we don’t need to compute these rates ourselves for each threshold. Here’s an example Python snippet where we compute the ROC and AUC metrics using sklearn and then plot the ROC curve using matplotlib.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Compute the metrics for the ROC curve
# y_test holds the actual labels; y_prob holds the predicted probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Compute Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
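
The arrays returned by roc_curve can also be used to pick an operating point. The sketch below shows one possible rule, not the only one: take the highest threshold that still reaches a required sensitivity. The data and the 0.95 requirement are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities of the positive class
y_test = np.array([0, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# Keep every threshold so that no candidate operating point is dropped
fpr, tpr, thresholds = roc_curve(y_test, y_prob, drop_intermediate=False)

# Hypothetical requirement: detect at least 95% of the positive cases
min_sensitivity = 0.95

# Thresholds come back in decreasing order, so the first index that meets
# the requirement corresponds to the highest threshold that does so
idx = int(np.argmax(tpr >= min_sensitivity))

chosen_threshold = thresholds[idx]
chosen_sensitivity = tpr[idx]
chosen_specificity = 1 - fpr[idx]

print(f"Threshold: {chosen_threshold:.2f}")
print(f"Sensitivity: {chosen_sensitivity:.2f}, Specificity: {chosen_specificity:.2f}")
```

In a case like the cancer example, a rule of this kind accepts a lower specificity in exchange for fewer false negatives.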


Understanding the performance of a classification model involves two key tools: confusion matrices and ROC curves. We can use the confusion matrix to determine how confused our model is between the positives and negatives, and we can get a bunch of summary metrics from this table to quantify the confusion. Then we ask ourselves how much confusion is okay based on our problem at hand.

Next, we use the ROC curve to plot the trade-off between sensitivity and specificity at different thresholds, and we can get an excellent overall summary metric called the AUC, which quantifies the overall performance of the model. We can then pick a threshold with a sensitivity and specificity that are acceptable for our case (usually based on domain expertise or familiarity with the consequences of misclassification).

Usually, we use the confusion matrix and ROC curve together when evaluating a classification model because each method has a few drawbacks that the other fills in. For example, the confusion matrix is pretty sensitive to imbalanced datasets, while the ROC curve doesn’t provide a detailed enough assessment of model performance.

An alternative evaluation technique is the precision-recall curve, which has a number of benefits that overcome many of the challenges faced with the ROC curve. Stay tuned for future posts on evaluating classification models!

If you’re interested in learning about model evaluation in regression models, you can find a detailed blog post on my website discussing some red flags to look out for.

I hope you found this helpful! If you have any questions or comments, drop them below. I’d love to hear from you!

You can also connect with me on LinkedIn, where I post more data science content.