In this first project of 2022, I will be fitting linear and polynomial regression models to a randomly generated dataset in which there is a nonlinear relationship between the target and the predictor.
The purpose of this project is to observe some of the red flags for a model that is severely underfitting the data and how these red flags change when fitting a more appropriate model.
The red flags that I’ll be considering are:
MSE and R-squared – these are common performance metrics used in linear models.
Residual plot – this plot will show us if some of the assumptions of linear regression have been violated.
Learning curves – this plot will show us how well the model fits to the data and usually gives a good indication of over/under fitting.
Generating the data
I generate the data using a polynomial equation of degree 2:
# Generate some nonlinear data
import numpy as np
m = 500
X = 9 * np.random.rand(m, 1) - 1
Y = 0.5 * X**2 - 3 * X + 7 + np.random.randn(m, 1)
In this visualisation of the data, we can clearly see the nonlinear relationship that we have generated from the above equation:
Linear Regression
Being one of the oldest and simplest models, linear regression is pretty well known and easy to understand. In this project, I am using linear regression to demonstrate what underfitting looks like and as a comparison to polynomial regression.
To fit linear regression, the response variable must be continuous. There are also a couple of assumptions that must be satisfied in order to use linear regression with some degree of accuracy, otherwise the results can be misleading or downright incorrect.
Linearity: the relationship between the response variable and the predictors must be linear.
Homoscedasticity: the variance of the residuals must be constant. So, as the value of the predictor increases, the variability in the response stays the same.
Independence: the residuals must be independent of one another. This means that each observation must not be determined by or correlated with another. Usually only an issue in time series data.
Normality: the residuals are normally distributed, with their distribution centred on zero rather than skewed to one side.
These assumptions are usually overlooked when linear regression is applied but they are very important. In this project, the assumptions of linearity and normality are both purposely violated to support the decision to use polynomial regression.
Below is the linear regression model fitted to the generated dataset:
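For reference, here is a minimal sketch of this fit and its metrics, assuming scikit-learn and the X and Y arrays generated above:
# Fit a plain linear regression and evaluate it on the training data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lin_reg = LinearRegression()
lin_reg.fit(X, Y)
Y_pred = lin_reg.predict(X)
print("MSE:", mean_squared_error(Y, Y_pred))
print("R-squared:", r2_score(Y, Y_pred))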
This model gives a Mean Square Error (MSE) score of 10.0 and an R-Squared (R2) score of 0.27. Both these scores give our first red flags. Ideally, MSE should be as low as possible (its scale depends on the target variable, so it is most useful for comparing models fitted to the same data) and R-squared should be as close to 1 as possible. So far, our model is not doing a good job of describing the data.
Let’s plot the residuals versus the fitted values – this gives a good indication of whether the assumptions of linearity and normality are violated. Ideally, there should be a random scatter of points around 0 with no discernible pattern or shape in the data. The below plot shows a definite nonlinear pattern in the data:
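A minimal sketch of how this residual plot can be produced, reusing Y_pred from the fit above:
import matplotlib.pyplot as plt
# Residuals = actual values minus fitted values
residuals = Y - Y_pred
plt.scatter(Y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted Values")
plt.show()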
Next are the learning curves. To generate this plot, multiple models are fitted to datasets of increasing size and the Root Mean Square Error (RMSE) is recorded. In this plot, we get to compare the training performance against predictions made on a validation set.
As before, we want the RMSE to be as low as possible. But now, we also want the training and validation errors to be as close as possible as the training set size increases. If the validation error is higher than the training error then we know that the model did not generalise well to new data and this is a red flag for over/under fitting.
In these learning curves for linear regression, not only is the RMSE high, but the validation error remains higher than the training error throughout the process.
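One way to produce learning curves like these is with scikit-learn's learning_curve helper; the sketch below records RMSE at increasing training-set sizes and is an illustration rather than the exact code behind the plots above.
from sklearn.model_selection import learning_curve
# Record negative MSE at increasing training-set sizes, then convert to RMSE
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, Y.ravel(),
    train_sizes=np.linspace(0.05, 1.0, 20),
    cv=5, scoring="neg_mean_squared_error")
train_rmse = np.sqrt(-train_scores.mean(axis=1))
valid_rmse = np.sqrt(-valid_scores.mean(axis=1))
plt.plot(train_sizes, train_rmse, "r-+", label="Training")
plt.plot(train_sizes, valid_rmse, "b-", label="Validation")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.legend()
plt.show()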
Great, so now that we know that the above model is a total mess and does a really bad job at making predictions, let’s make a better choice.
Polynomial Regression
When the assumptions of linear regression are not satisfied and the model isn’t capturing the relationship between the response and the predictors then we need a different approach.
Polynomial regression lets us model a non-linear relationship between the response and the predictors. It is a natural extension of linear regression and works by including polynomial forms of the predictors at the degree of our choosing.
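A minimal sketch of such a fit, assuming scikit-learn's PolynomialFeatures is used to add the squared term before fitting an ordinary linear regression:
from sklearn.preprocessing import PolynomialFeatures
# Add X^2 as an extra feature (degree 2, no bias column), then fit a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, Y)
Y_poly_pred = poly_reg.predict(X_poly)
print("MSE:", mean_squared_error(Y, Y_poly_pred))
print("R-squared:", r2_score(Y, Y_poly_pred))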
Here, we get a wonderful MSE score of 1.0 and an even better R-squared score of 0.9. Plotting the new polynomial regression line to our data we see that it fits like a glove:
Plotting residuals versus fitted values again we now see that there is a nice random scatter of points around 0 and I can’t make out any patterns, so that’s good!
Lastly, the learning curves. Both the validation and training errors hover at around 1.00 and although the lines do not meet in this plot, if I added more data, they would eventually converge.
I hope this project gave you some insight into the red flags commonly seen for linear models that underfit. Reach out to me on Twitter if you have any feedback or questions.
You can find the full code for this project on GitHub.
You don’t need a fancy PC to get started with data science and machine learning. In fact, you can run all your code in cloud-based notebooks without even worrying about setting up an environment locally.
Even if you’re new to data science, you’ve probably heard of Jupyter Notebooks. They have become the number one way to do data science. Jupyter Notebooks make it so much easier to run your code, write commentary and see your output all in one place. Almost all cloud platforms use some kind of Jupyter-like environment.
In this blog post I am going to share 6 ways you can do data science in the cloud. Each of these platforms allows you to do this completely for free and they each work really well.
In my opinion, the 2 biggest upsides of using cloud platforms for data science are:
Speed of set up – You can get set up in just a few minutes and have almost everything you need to do machine learning available to you. You don’t need to go through the hassle of setting up an environment locally before you start writing code and analysing data.
Collaboration – Being able to share your work and collaborate on projects is a big upside of any kind of cloud platform. However, collaboration is not available for all the platforms listed here. Even when it is offered, the degree of collaboration differs from platform to platform.
6 platforms will be covered in this post:
DataCamp Workspaces
Kaggle Notebooks
Google Colab
Deepnote
Datalore by JetBrains
Gradient Notebooks
There are a few things you should note about how I have judged these platforms. I have actually tried each of these platforms myself and am giving my own opinions on what I think of them. My primary use of cloud platforms is to work on personal projects and not for company or enterprise use.
These are the criteria that I am using to compare these platforms:
Price – they should be free or at least offer a decent free plan (not just a trial)
Speed of set-up and low admin – I shouldn’t have to ‘babysit’ my projects in case I go over ‘allowed hours’. I’d like to be able to log on, work on a project and log off without worrying about whether I shut the server down.
Aesthetic and intuitive – the app should look good and it should be intuitive and easy to use
Collaboration – I’d like to be able to share my work with friends and be able to collaborate on them live. I’d also like to have the option to share securely if I want to, without my project being available publicly.
Great, now that I’ve cleared all that up, let’s get into it.
DataCamp Workspaces
DataCamp recently launched their Workspaces feature that allows you to run code in a Jupyter-like environment. There is currently no collaboration feature but it is something they are working on. Right now, you can just share your notebooks with other people and they are able to view them and add comments.
DataCamp’s philosophy with the initial launch of Workspaces is that they want it to be as easy to do data science as it is to learn it. They have already created an incredible interactive environment for learning data science and Workspaces seem like a natural next step for anyone who now wants to easily apply their skills and start putting together a portfolio.
The Workspaces are completely free and you can choose between R or Python. So far, their notebook editor looks good and it is intuitive to use. There is also no admin involved in running or maintaining the workspace which makes it hassle-free.
Kaggle Notebooks
To me, Kaggle is like the OG of cloud-based data science platforms, having started way back in 2010. Initially, they started out as a machine learning competition platform. Since then they have expanded and now offer ways of sharing datasets, notebooks and have a huge community in their forums where you can ask questions and get help.
Kaggle notebooks are free to use, with the option to choose between R and Python and they integrate well with other services. You can even connect your Kaggle notebook to Google Cloud Services to beef up the hardware if you need it, although this will come at an additional cost, of course. Kaggle does provide access to a GPU and for personal projects, it is usually more than enough.
Collaboration on Kaggle notebooks is limited. Similar to DataCamp, you can share your notebooks with other people (such as your teammates in a competition) but you effectively work on different versions of the notebook so there is no live collaboration feature.
Google Colab
Google Colab is another one of Google’s products that operate as a natural extension of Google Drive just like Google Docs or Sheets. If you already use these other products then the Colab UI will feel very familiar.
Naturally, sharing on Colab notebooks is built-in. However, it does not seem to be capable of live collaboration (ie. with 2 or more people editing a notebook together in real-time). I find this to be disappointing since Google basically wrote the book on real-time collaboration with Docs and Sheets.
Colab is also not a particularly pretty app, especially when compared with some of the other platforms I’ll be covering on this list. However, since almost everyone interested in data science most likely has at least one Google account, setup is by far the fastest.
It is free to use Colab but the resource is not guaranteed and there are several usage limits that change depending on demand. Your usage limits could even be different to mine if your code uses more resources.
Deepnote
Deepnote is by far the best-looking, most full-featured platform I have come across. The UI looks great, the editing experience in their notebooks is amazing, and they have live collaboration! All of this and it is still free to use (up to a point), plus the platform is still in beta so you know there is much more to come.
I particularly like their publishing feature – you can publish your notebook as an article or as an interactive app or dashboard. I just love the presentation of the articles and dashboards. Your profile on Deepnote also acts as a portfolio and it is a great viewing experience for anyone looking through your work.
You can get up to 750 hours on their standard machines and each notebook comes with a nifty little feature that automatically shuts the machine down after 15 minutes of inactivity. This keeps the admin pretty low on this platform.
Datalore
Datalore is a platform built by the JetBrains team. The UI looks good and it also comes with live collaboration. There are also a few more options in terms of programming languages with this platform – you can choose between Python, R, Kotlin and Scala.
There is a bit of admin involved with this platform. On the free plan, you get 120 hours/month of basic machine time. However, once you open a notebook, so long as it is an open tab in your browser it will consume resources and eat into your available hours. Because of this, it is important to either close the tab or manually shut down the machine.
Also, if you’ve shared one of your notebooks with someone and they forget to close their browser window down at the end of the day, it’ll eat into your available quota. So that’s something to keep in mind.
Gradient Notebooks
Gradient Notebooks are built by Paperspace. The biggest selling point of Gradient is that they offer free GPUs. On the free plan, instead of restricting the number of hours on the platform, the only limits they impose are that you can only ever have one notebook running at a time and all notebooks must be public.
Getting set up on the platform and launching a notebook can take a good couple of minutes since the additional hardware they offer needs to be provisioned. There is also limited functionality for publishing notebooks and building a portfolio of projects on their platform. They are currently working on a public profile feature, so at least it’s on the roadmap.
Gradient is the platform to go to if you need more resources and more compute for your project and you don’t want to pay a cloud provider like Google Cloud or AWS to get it.
Closing Thoughts
Choosing which cloud platform to use depends on your own goals and needs.
If you’re looking for a place to build a portfolio and get involved in a large community while you’re still learning and improving your data science skills then I’d recommend Kaggle or DataCamp.
If you’re looking for a platform that’s got all the bells and whistles, allows you to build a professional-looking portfolio and offers real-time collaboration then I’d recommend Deepnote.
If you’ve started branching out into deep learning, NLP, and computer vision then I’d recommend giving Gradient a try.
See this resource for a comprehensive list of almost all available data science notebook platforms.
I hope you found this blog post helpful! Let me know in the comments below what other platforms you’d recommend. I’d love to hear your thoughts!
Stochastic gradient descent is an optimisation technique, not a machine learning model. It is a method that allows us to efficiently train a machine learning model on large amounts of data. The word ‘descent’ gives the purpose of SGD away – to minimise a cost (or loss) function.
For a better understanding of the underlying principle of GD, let’s consider an example.
A Simplified Example of Gradient Descent
Gradient descent seeks to find the global minimum of a function. To do that, it calculates the gradient at an initial random point and moves to another point in the direction of descending gradient until it reaches a point where the gradient is zero.
As an example, see the figure of a parabola below.
The initial random point that starts the algorithm off is at (x = -2, y = 4). The gradient at this point is negative, so the algorithm steps to the right, in the direction of descending gradient, and recalculates the gradient at the next point. Eventually, after a number of steps, the algorithm reaches a point where the gradient is 0 and stops. This is referred to as the global optimum or global minimum.
The learning rate hyperparameter determines the size of each step – too small and the model takes too long, too big and the model may not converge to the global minimum.
For some datasets, the cost function has an irregular shape and can contain local minima in addition to the global minimum. It is good to keep this in mind when training a model and to adjust the learning strategy according to the problem at hand.
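To make the idea concrete, here is a toy gradient descent loop for the simple function f(x) = x², written from scratch purely as an illustration:
# Minimise f(x) = x^2 with plain gradient descent
def gradient(x):
    return 2 * x                # derivative of x^2
x = -2.0                        # initial point
learning_rate = 0.1
for step in range(100):
    x = x - learning_rate * gradient(x)   # move in the direction of descending gradient
print(x)                        # ends up very close to 0, the global minimum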
Batch vs Stochastic Gradient Descent
There are 3 types of Gradient Descent implementations: batch, mini-batch and stochastic. Which one you choose depends on the amount of data you have and the type of model you are fitting.
Batch Gradient Descent: the entire training dataset is used at every step. Thus this algorithm is very slow for large datasets but scales well for large numbers of features. There is also a possibility that this algorithm may get stuck in local minima.
Stochastic Gradient Descent: a single, random observation in the training data is selected at each step. This algorithm is very fast, only needing to perform calculations on a single point at a time. However, it is erratic and may select points from all over the place, never settling at a truly accurate solution. Instead, it approaches the minimum on average. That being said, this erratic behaviour makes the algorithm much more likely to escape local minima and find the global minimum. Some of the erratic nature of the algorithm can be tamed by using a learning schedule that slowly reduces the learning rate so that it can settle on a more accurate solution.
Mini-Batch Gradient Descent: the algorithm uses small subsets (or batches) of the training data at each step. This algorithm is faster than Batch GD but still suffers from the same drawback of potentially getting stuck in local minima.
How to Implement Stochastic Gradient Descent in Python
This project is focused on applying SGD to a classification problem. Therefore, the steps and performance measures chosen here are best suited to modelling a binary response variable.
Each aspect of the project is broken down below.
Preprocessing the data for classification
These are important steps for data preparation/preprocessing:
Clean up any funnies in the data – for example, in this project the values -1 and 9 are used to represent null or unknown values, so these should be replaced with null (see the code sketch after this list)
Split the data into training and test sets (typically in an 80/20 ratio). To do this, the train_test_split function is used from sklearn.model_selection (making sure that shuffle = True as this is a requirement of SGD)
Either remove or impute missing values
Feature engineering – after some initial data exploration, new features can be created or existing features can be modified
Standardise continuous variables using the StandardScaler function from sklearn.preprocessing
Dummy code categorical variables using the OneHotEncoder function from sklearn.preprocessing, making sure to use the parameter drop = 'first' so that if there are 4 categories in a variable, for example, only 3 categories are retained and the 4th category is inferred by all others being 0.
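As a rough sketch of the cleaning and splitting steps (the file path and column names here are placeholders, not the ones used in the project):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("data.csv")    # placeholder path
# Replace the codes used for null/unknown values with proper nulls, then drop them
# ("some_feature" is a placeholder column name)
df["some_feature"] = df["some_feature"].replace({-1: np.nan, 9: np.nan})
df = df.dropna()
# 80/20 train/test split with shuffling (shuffling is a requirement of SGD)
train_set, test_set = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)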
Selecting the model (and loss function)
This project is a classification problem and so the SGDClassifier is used from sklearn.linear_model. Since SGD is just an optimisation algorithm, we must still pick the machine learning model that will be fitted to the data.
These are the available options:
Support Vector Machine (loss = "hinge")
Smoothed hinge loss (loss = "modified_huber")
Logistic regression (loss = "log") – this loss function is chosen for this project
Measuring performance
There are several performance measures available to choose from when evaluating a classifier. Although it’s not always necessary to use so many, since this project is for learning purposes, I will be using all of them.
Each of the performance measures below is calculated after performing cross-validation. This method involves splitting the training set into K folds, making predictions on each fold and then measuring its performance.
Cross validation accuracy (cross_val_score) – this is a ratio of correct predictions and is often not a good metric when our data is skewed (ie. some classes appear more often than others)
Confusion matrix (confusion_matrix) – calculated from the predictions made during cross validation, this is a table containing 2 rows for actuals and 2 columns for predictions. This matrix is used to calculate a few ratios
Precision (precision_score) – for all cases that the model predicts to be positive, the precision tells us what percentage of them are correct
Recall (recall_score) – for all cases that are actually positive, recall tells us the percentage that are correctly predicted
Precision and recall work as opposites. The preferred level of these 2 metrics depends on the project. Two plots are used to depict the trade off between them (predictions are made using the cross_val_predict function with the parameter method = "decision_function" and then precision, recall and thresholds are calculated with the precision_recall_curve function):
Precision/recall vs thresholds plot: in this plot we can see the trade-off relationship between precision and recall
Precision vs recall plot: we can use this plot to select a precision and associated recall value that occurs just before the curve dips sharply
ROC curve (roc_curve) – a 45-degree line is plotted alongside this curve to compare the accuracy of our model against that of a purely random classifier. The goal is for our model’s curve to bow as far away from this line as possible (toward the top-left corner).
AUC score (roc_auc_score) – this score summarises the ROC curve. A score of 1 indicates a perfect model whereas a score of 0.5 represents a model indistinguishable from chance.
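Below is a rough sketch of how these measures can be computed with scikit-learn; clf, X_train and y_train are placeholders for the classifier and training data (the project’s actual code appears further down):
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
# Cross-validated accuracy
accuracy = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")
# Cross-validated class predictions for the confusion matrix, precision and recall
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=10)
print(confusion_matrix(y_train, y_train_pred))
print("Precision:", precision_score(y_train, y_train_pred))
print("Recall:", recall_score(y_train, y_train_pred))
# Decision scores are needed for the precision/recall curve, ROC curve and AUC
y_scores = cross_val_predict(clf, X_train, y_train, cv=10, method="decision_function")
print("AUC:", roc_auc_score(y_train, y_scores))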
Analysing Road Safety Data
Project Overview
The UK Department for Transport has released data on reported road accidents to the public dating back to 1979. The information provided only relates to personal injury accidents that occur on public roads and that are reported to the police. This project uses the 2020 accident dataset.
The goal of this project is to:
Explore the road safety accident dataset, selecting features and adjusting them as needed for modelling
Apply the Stochastic Gradient Descent optimisation technique with a log loss function
Specifically, this will be treated as a classification problem with a binary target (response) variable called accident severity. In this project, we will be trying to determine if there are any patterns in the data that can help to predict whether an accident will be severe or not.
You can see all the code used in this project on Github. This blog post aims to summarise the process and findings.
Python libraries used in this project:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
Data Description
The road safety accident data contains over 30 different features. After going through these columns I selected just 12 (including an index column) to be included in this project.
To start off with, the features have different types that must each be dealt with – from dates, times, and continuous values to various numbers of categories.
Some features also have values that are not valid. For example, the speed limit feature contains a -1 value when data is missing and a 99 value when data is unknown. Both these values are replaced with null and subsequently removed from the dataset (I will not be dealing with imputation in this project).
Accident severity is the target variable for this project. Currently it has 3 categories: fatal, serious and slight. Since the fatal category is so rare and also because I’d like to treat this as a binary classification problem, I have combined the fatal and serious classes.
Even with the combination of 2 categories, this is quite a skewed variable with only 21% of cases appearing in the positive class so some performance issues could be expected.
Next, the date and time variables are converted to their appropriate types and I decided to convert them into categorical variables. For the dates, they are converted into quarters so that they fall into 4 categories: Q1 – Q4. For the times, they are grouped into 3 equal-sized categories: morning (4am – 11am), afternoon (12pm – 7pm) and evening (8pm – midnight and 1am – 3am).
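A rough sketch of this conversion with pandas (the column names and string formats here are assumptions rather than the project’s exact code):
import numpy as np
import pandas as pd
# Quarter of the year: Q1 - Q4
df["month_cat"] = "Q" + pd.to_datetime(df["date"]).dt.quarter.astype(str)
# Time of day: morning (4am - 11am), afternoon (12pm - 7pm), everything else evening
hours = pd.to_datetime(df["time"], format="%H:%M").dt.hour
df["time_category"] = np.where(hours.between(4, 11), "morning",
                      np.where(hours.between(12, 19), "afternoon", "evening"))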
The variable for the day of the week is left unchanged and will be converted into a dummy variable later on for modelling.
The next few variables are each treated similarly. They have multiple categories, some of which only have a handful of cases in them. Because of this, I convert them each into binary variables with 0 representing what I thought was an ‘ideal’ scenario such as dry roads, daylight, urban area and low speed limit roads.
The last 2 charts are for the continuous features in the dataset. Neither of them is altered in any way. First is the number of vehicles involved in the accident, with the majority of accidents involving only 1 or 2 vehicles. The second chart is a geographical scatter plot made up of the longitude and latitude values of the accidents. Neither of these charts reveals anything noteworthy or worth looking further into.
Data Preprocessing
Other than combining the categories of some of the features and constructing some new features, we must also ensure that any categorical features with more than 2 classes are dummy coded and any continuous variables are standardised.
# Onehotencoder for the day of week, time & month category features
onehot_df = train_features[["day_of_week", "time_category", "month_cat"]]
onehot = OneHotEncoder(drop = 'first', sparse = False)
# Note: newer scikit-learn versions use sparse_output instead of sparse, and get_feature_names_out instead of get_feature_names
onehot_df = pd.DataFrame(onehot.fit_transform(onehot_df), columns = onehot.get_feature_names(["day_of_week", "time_category", "month_cat"]))
# Add onehotencoder features back to train set
train_features.reset_index(drop=True, inplace=True)
train_features = pd.concat([train_features, onehot_df], axis = 1)
# Remove redundant columns
train_features = train_features.drop(["date", "time", "day_of_week", "time_category", "month_cat"], axis = 1)
# Standardise the continuous features
scaler = StandardScaler()
train_features_scaled = train_features.copy()
features_to_scale = train_features_scaled[["longitude", "latitude", "number_of_vehicles"]]
features_to_scale = scaler.fit_transform(features_to_scale.values)
train_features_scaled[["longitude", "latitude", "number_of_vehicles"]] = features_to_scale
Training & Evaluating the Model
To train the model, the SGD classifier is used with a log loss function:
sgd_clf = SGDClassifier(random_state = 42, loss = 'log')
sgd_clf.fit(train_features_scaled, train_target)
Evaluating the model starts with calculating cross validation accuracy scores.
Not too bad. However, remember that this score is not very reliable when the data is skewed – ie. there are more cases in some classes compared to others. This is true in this data as we noticed that only 21% of the cases fall in the positive class (fatal / serious accident severity) compared to the negative class (slight accident severity).
Next, we put together a confusion matrix to determine how well our classifier predicts the target.
Now we’re getting a clearer picture of the performance of the model. The precision score tells us that for all the accidents that the model predicts to be severe, it is correct 48% of the time. The recall score tells us that for all accidents that are actually severe, the model predicts this correctly 0.88% of the time.
Let’s look at the precision and recall metrics in the form of 2 plots.
The first is the precision and recall trade-off versus various threshold scores.
# Visualise the precision/recall tradeoff
# Use the cross_val_predict() function to get the scores of all instances in the training set
# Return decision scores not predictions
train_target_decision_scores = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10, method = "decision_function")
# Compute the precision and recall for all possible thresholds
precisions, recalls, thresholds = precision_recall_curve(train_target, train_target_decision_scores)
# Plot precision and recall as functions of the threshold value
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precisions[:-1], "b--", label = "Precision")
plt.plot(thresholds, recalls[:-1], "g--", label = "Recall")
plt.grid(True)
plt.legend(loc = "upper right")
plt.xlabel("Thresholds", fontsize=16)
plt.ylabel("Precision / Recall", fontsize=16)
plt.title("Precision / Recall Tradeoff", fontsize=16)
plt.show()
Ideally, in this plot we want to select a reasonably high precision value for which recall is also reasonably high. However, this plot does not paint a nice picture for the performance of this model. If we select a precision over 50%, the recall is basically 0. Having a low recall means that the classifier will fail to recognise positive cases a lot of the time.
The values of precision and recall that you are happy with depend entirely on the problem at hand. There is no golden rule to follow here. That said, I do believe a recall of almost 0 is not good in any scenario.
The second plot directly compares precision against recall.
In the ROC curve, we would ideally like the curve to angle itself toward the top left corner as much as possible. The 45-degree dotted line represents a purely random classifier, much like flipping a fair coin. In this plot, however, the curve is quite flat, indicating a poor model.
The AUC score can be calculated from the ROC curve. It simply summarises the curve in a single metric. A score of 1 indicates a perfect fit, while a score of 0.5 indicates a purely random model.
Although the score gives us a quick summary of the performance of the model, it does not tell us much about where the model is going wrong. It will also tend to give a better picture of performance when the positive class is rare (as in the case of our accident severity target variable). The precision and recall scores along with their plots gives more information on the performance of the model.
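For completeness, a sketch of how the ROC curve and AUC score can be produced from the same cross-validated decision scores computed earlier:
# Compute the ROC curve from the decision scores
fpr, tpr, thresholds = roc_curve(train_target, train_target_decision_scores)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label="SGD classifier")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")   # 45-degree reference line
plt.xlabel("False Positive Rate", fontsize=16)
plt.ylabel("True Positive Rate", fontsize=16)
plt.legend(loc="lower right")
plt.show()
# Summarise the curve in a single number
print("AUC:", roc_auc_score(train_target, train_target_decision_scores))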
I hope you found this project helpful in your own machine learning endeavours. As always, I am open to feedback and suggestions. Leave a comment below with your thoughts.
In this week’s data science project, I apply principal component analysis (PCA) to a hospital mortality dataset using Python. The dataset is extensive, containing over 50 variables on demographic characteristics, vital signs, and other measurements taken in the lab. This should be a perfect use case for dimensionality reduction with PCA.
What is Principal Component Analysis?
Like my previous project on K-Means clustering, PCA is a type of unsupervised learning which means that we do not need any pre-defined labels for our data and we are not trying to make predictions.
The goal of PCA is to reduce the number of variables in our dataset while retaining nearly all of the original information and variability. One of its main practical benefits is speeding up machine learning: using fewer variables gives a nice boost to model training, consuming fewer resources and reaching a result faster.
Eigenvectors and eigenvalues drive PCA at its core. An eigenvector is a line through the data, and the algorithm chooses the line along which the projected data points vary the most. The corresponding eigenvalue represents the magnitude of that variability. There are only as many eigenvectors as there are variables in the data.
An important question to ask in PCA is how many principal components to retain. The 2 most popular methods are:
Plotting the cumulative variance explained by each principal component. You would choose a cutoff value for the variance and select the number of components that occur at that cutoff.
A scree plot showing the eigenvalues of each principal component where the cutoff here is an eigenvalue of 1.
There are 3 critical processing conditions that must be satisfied when preparing data for PCA:
There must be no null or blank values in the dataset
All variables must be numeric
Standardise the variables to have a mean of 0 and a standard deviation of 1
Data & Project Goals
The data for this project is from Kaggle. Remember, if you’d like to know where I look for my datasets, check out my post on finding data.
This data is from the MIMIC-III database on in-hospital mortality. The original purpose of the dataset is to predict whether a patient lives or dies in hospital when in intensive care based on the risk factors selected.
There are a total of 51 variables in the data. In this project, we will attempt to reduce these variables down to a more manageable number using principal component analysis.
These are the goals for this project:
Prepare the data for PCA, satisfying the 3 critical processing conditions mentioned in the previous section
Apply the PCA algorithm (using the sklearn library)
Determine the optimal number of principal components to retain
Data Pre-Processing
We begin, as always, with importing the required libraries and the data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
# Import data
pca_raw_df = pd.read_csv("data/mortality.csv")
Next, we will carry out these 3 pre-processing steps:
Remove any rows that contain blank / NA
Select only numeric variables from the data
Standardise the variables
# 1. Remove rows that contain blank values
pca_nona_df = pca_raw_df.dropna()
# 2. Select only numeric columns
# We calculate the max of each column & assume that a column whose max is 1 is a binary flag rather than continuous
max_all = pca_nona_df.max()
max_cols = list(max_all[max_all != 1].index)
pca_sub_df = pca_nona_df[max_cols]
# Drop the columns that did not fit into our initial logic
pca_sub2_df = pca_sub_df.drop(columns = ['group', 'ID', 'gendera'])
# 3. Standardise the variables
scaler = StandardScaler()
pca_std = scaler.fit_transform(pca_sub2_df)
Once all pre-processing steps are complete, 38 variables remain.
Applying Principal Component Analysis
Now that our data is ready, we can apply PCA and then we will use these 2 methods to determine the optimal number of components to retain:
Cumulative variance
Scree plot
# Fit PCA
pca = PCA()
fit_pca = pca.fit_transform(pca_std)
Plot Cumulative Variance
# Plot the cumulative variance for each component
plt.figure(figsize = (15, 6))
components = np.arange(1, 39, step=1)
variance = np.cumsum(pca.explained_variance_ratio_)
plt.ylim(0.0,1.1)
plt.plot(components, variance, marker='o', linestyle='--', color='green')
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')
plt.axhline(y=0.90, color='r', linestyle='-')
plt.text(0.5, 0.85, '90% variance threshold', color = 'red', fontsize=16)
plt.text(25, 0.85, "Components needed: "+str(np.where(np.cumsum(pca.explained_variance_ratio_)>=0.9)[0][0]), color = "red", fontsize=16)
plt.show()
The above plot contains the cumulative variance as a proportion on the y axis and the number of components on the x axis.
Using a variance threshold of 90%, the above chart helps us determine how many components we should retain from our dataset in order for it to still make sense for us in any further modelling.
Note that we chose 90% here as the variance threshold but this is not a golden rule. The researcher or data scientist chooses this variance threshold.
So, we can see from this plot that we need 22 components to retain 90% of the variability (information) in our data. That’s not much of an improvement over the total number of variables, 38.
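For the scree plot discussed next, the eigenvalues can be taken from pca.explained_variance_; a minimal sketch of the plot:
# Scree plot: eigenvalue of each principal component
eigenvalues = pca.explained_variance_
plt.figure(figsize = (15, 6))
plt.plot(components, eigenvalues, marker='o', linestyle='--', color='green')
plt.axhline(y=1, color='r', linestyle='-')   # cutoff: keep components with an eigenvalue above 1
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Eigenvalue')
plt.title('Scree plot of eigenvalues per component')
plt.show()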
The above scree plot contains the eigenvalues on the y axis and the number of components on the x axis.
As a general rule of thumb, we should select the number of components that have an eigenvalue greater than 1. Based on the above scree plot we should choose to keep 13 components. This is very different to the previous cumulative variance plot where we chose to retain 22 components.
However, if we choose to retain 13 components, this only explains 65% of the variability. This is much lower than the variability captured when retaining 22 components (90%). At this point, the researcher or data scientist would need to decide whether a cumulative variance of 65% is satisfactory based on their domain knowledge and other research in the area.
Alternatively, there are other methods that can determine the optimal number of components. It is also possible that PCA just isn’t a suitable method for this dataset and other dimensionality reduction techniques could be explored.
I used these resources for this project, check them out if you’d like to dive deeper into PCA:
In this data science project, I tackle the problem of data segmentation or clustering, specifically applied to customer data.
There are a couple of different algorithms to choose from when clustering your data depending on your requirements and inputs. However, K-Means clustering is one of the most popular methods in machine learning and is the one I have chosen for this project using Python and Jupyter Notebook.
What is K-Means Clustering?
K-Means clustering is a type of unsupervised learning which means that we do not have any pre-defined labels in our data and so we are not trying to make predictions. Our goal is to find groups in our data such that individuals (data points) are relatively similar to each other within each group. This is useful to us because it allows us to make better business decisions about how to interact with our customers in marketing, product development, customer service, etc.
Great, so we already know that K-Means clustering is going to group together similar data points. How does it do that?
The algorithm starts with some initial estimate of where each of the centroids could be for each of the clusters. These initial values could be a randomly chosen data point in the data or you could set it yourself. Next, the algorithm will cycle through these two steps until the centroid positions reach convergence (ie. they don’t change):
Each data point is assigned to the cluster whose centroid is closest to it (measured using the squared Euclidean distance)
Once every data point has been assigned to a cluster, the mean of all data points within each cluster is calculated and this becomes the new centroid. Repeat. (A rough sketch of these two steps is shown below.)
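As an illustration of those two steps, here is a from-scratch sketch (not the scikit-learn implementation used later in this project):
import numpy as np
def kmeans_step(X, centroids):
    # Step 1: assign each point to its nearest centroid (squared Euclidean distance)
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it
    # (no handling of empty clusters in this sketch)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids
Repeating kmeans_step until the centroids stop moving gives the converged clustering.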
The most important piece of information that you need to specify in K-Means clustering is the number of clusters you would like the algorithm to form – this is denoted by K. There are a couple of methods you can use to determine the best value for K and we will be exploring two such methods in this project:
The Elbow Method (using inertia)
Silhouette Score
The way these methods work is that they run K-Means clustering on the data for each value of K in a specific range and record the relevant metric. This is then plotted and, depending on the method, the optimal value for K is selected.
Typically, K-Means clustering is carried out on 2-dimensional numeric data as it is easier to visualise and validate the result. However, it is still possible to apply K-Means clustering to higher dimensional data if you wish, or you can reduce the dimensionality of the data first using a method like Principal Component Analysis (I’ll be doing a project on PCA soon!). Note that it is not advisable to carry out K-Means clustering on categorical data, as the distance calculations may not be meaningful.
Data & Project Goals
The data for this project can be found over at Kaggle. If you’d like to know all the places I look for data, check out my post on finding data.
This is customer data containing 5 variables:
CustomerID
Gender
Age
Annual income (k$)
Spending score (1-100)
The spending score variable is pretty ambiguous and there is no indication of how it was created so it was removed from the data after the initial exploratory phase.
These are my goals for this project:
Explore the data using visualisations (using the matplotlib and seaborn libraries)
Apply the K-Means clustering algorithm (using the sklearn library)
Determine the optimal number of clusters using the elbow method and silhouette score
Split the dataset by gender and observe if the results are different
Analysing the Data
First up, we import the required libraries and set some custom parameters for the plots.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")
The data is imported and I drop the CustomerID column as it doesn’t offer us any useful information.
# Import data
segment_df = pd.read_csv("Mall_Customers.csv")
# Drop CustomerID column and save as new dataframe
sub_df = segment_df.drop(columns = ["CustomerID"])
Let’s have a look at the data. Here I look for any funnies or anything strange that I may need to investigate further.
# View data
segment_df
We can see that there are only 200 rows so it doesn’t give us much data to play around with, especially when splitting by gender. Luckily we will not be doing any supervised machine learning because then we’d need a lot more data. This should be fine to just test our skills out a bit.
Data Visualisations
Since there are only a few variables, there isn’t much to explore. However, we should have a look at the distribution of gender in the data and then look at some scatter plots of the numeric variables (splitting by gender).
First, we create a summary table of the counts and percentages of gender in the data.
Plotting this in a bar chart we can see that there are more females than males in the data.
plt.figure(figsize = (6, 6))
sns.countplot(data = segment_df, x = "Gender")
# 'female', 'male' and 'gender_percent' come from the gender summary table created above
plt.annotate(str(female)+"%", xy=(1, gender_percent["Count"].iloc[0]-10), color = 'white', size = 15, ha = 'center')
plt.annotate(str(male)+"%", xy=(0, gender_percent["Count"].iloc[1]-10), color = 'white', size = 15, ha = 'center')
plt.ylabel("Count")
plt.title("Distribution of Gender")
plt.show()
Now we visualise the correlation and distribution of the numeric variables using a pair plot. The data points are also split by gender by applying different colours to the plot.
There is nothing particularly interesting that stands out in this plot. However, spending score definitely shows some strange patterns in the scatter plot against income. As mentioned before, spending score will now be removed as we move on to applying K-Means clustering to our data.
All variables must be standardised before carrying out K-Means clustering as not all variables could be measured on the same scale and leaving the data as-is could skew the results. While all the data here is in a relatively similar scale (0-100), I still standardise the data as good practice.
# Scale the features (only age and annual income are selected)
scaler = StandardScaler()
features = segment_df[["Age", "Annual Income (k$)"]]
std_features = scaler.fit_transform(features)
std_features_df = pd.DataFrame({"std_age": std_features[:, 0], "std_income": std_features[:, 1]})
# Append the gender column to the dataframe
std_df = pd.concat([std_features_df, segment_df["Gender"]], axis = 1)
# Split the data into male and female dataframes
male_std_df = std_df[std_df["Gender"] == "Male"][["std_age", "std_income"]]
female_std_df = std_df[std_df["Gender"] == "Female"][["std_age", "std_income"]]
Now our data is ready for K-Means. All we need to do next is determine the optimal number of clusters. We use the elbow method and the silhouette score. In both cases, it is applied to the full dataset and to the female and male datasets with the results plotted on a single line chart.
Elbow Method
# Calculate inertia
# Full dataset
inertia_all = []
for n in range(1 , 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
km.fit(std_features_df)
inertia_all.append(km.inertia_)
# Female dataset
inertia_f = []
for n in range(1 , 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
km.fit(female_std_df)
inertia_f.append(km.inertia_)
# Male dataset
inertia_m = []
for n in range(1 , 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
km.fit(male_std_df)
inertia_m.append(km.inertia_)
Plot the number of clusters vs inertia for each dataset.
What we are looking for is the ‘elbow’ in the chart. At what point does the line look like it is first bending? To me it looks like it is happening at K = 3 for all 3 datasets.
Silhouette Score
# Calculate silhouette score
# Full dataset
ss_all = []
for n in range(2, 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
predict = km.fit_predict(std_features_df)
score = silhouette_score(X = std_features_df, labels = predict)
ss_all.append(score)
# Male dataset
ss_m = []
for n in range(2, 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
predict = km.fit_predict(male_std_df)
score = silhouette_score(X = male_std_df, labels = predict)
ss_m.append(score)
# Female dataset
ss_f = []
for n in range(2, 11):
km = (KMeans(n_clusters = n, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
predict = km.fit_predict(female_std_df)
score = silhouette_score(X = female_std_df, labels = predict)
ss_f.append(score)
Plot the number of clusters vs the silhouette score for each dataset.
In this chart we are looking for the value of K for which the silhouette score is the highest. In this plot, the score is highest for K = 3. I found it much easier to determine the optimal number of clusters from this plot compared to the elbow method.
Now that we have selected the optimal number of clusters, we can run the final cluster analysis and plot the result. In this last plot I do not split by gender as the results are almost the same for all datasets.
# Apply k-means clustering to the full dataset
km = (KMeans(n_clusters = 3, init='k-means++', n_init = 10, max_iter=300,
tol=0.0001, random_state= 19))
km.fit(std_features_df)
all_labels = km.labels_
all_centroids = km.cluster_centers_
A scatter plot is shown with each datapoint assigned to a cluster along with each cluster’s centroid.
plt.figure(figsize = (9,6))
sns.scatterplot(x = std_features_df.iloc[:, 0], y = std_features_df.iloc[:, -1], hue = all_labels, palette = "Dark2", legend = False)
sns.scatterplot(x = all_centroids[:, 0], y = all_centroids[:, 1], color = "black", marker = "x", s = 200)
plt.ylabel("Income")
plt.xlabel("Age")
plt.title("Combined - Data Points Allocated to Clusters with Centroids")
plt.show()
These are the resources I used for this project, check them out if you want to dive deeper into K-Means clustering:
Probably the most valuable resource to me during my Masters research project was my notebook. I recorded everything about my project (meeting notes, analyses, ideas, theories, rough drafts) and even summaries of research papers that were interesting or relevant to my research.
Since starting some data science projects outside of research and academia, I have continued to keep notebooks and they have not failed me (unless I’m just an obsessive notebook keeper 🤷♀️).
In this blog post, I hope to encourage you to also keep a notebook and give you some essential guidelines to follow when putting together a notebook for your project.
What is a lab notebook?
A traditional lab notebook captures the thoughts, ideas, experiments, failures, successes and everything else that takes place during the course of lab work and a research project.
This would work very similarly for a data scientist. During a project, a lab notebook would capture all aspects of the thought process and iterations of the project. You could record all your ideas and theories to solve the problem and outline the process you would like to take to test those ideas. If the idea succeeds or fails then you note it down and include a brief reason why you think it worked or failed.
A traditional lab notebook would typically be a physical notebook but any digital document-editing software would achieve the same results (such as Google Docs or Word).
However, for a data scientist my recommended approach would be to use a markdown-based document – such as RStudio’s RMarkdown Notebooks or Jupyter Notebooks. These documents allow you to write code, view the output and add your own commentary in the same document without needing to go through the hassle of copy-paste-insert to get code and output into the document.
Why is it important to keep a lab notebook?
It is common practice for a data scientist to prepare documentation at the end of the project detailing the purpose of the project, with a quick summary of the approach and then some information for deployment and steps around reproducing the analysis.
However, this documentation would not detail all the ideas, theories and failed experiments throughout the project (yes, failures are just as important as the overall success of the project).
So, here are a few reasons to keep a lab notebook:
Clarity: keeping a lab notebook will force you to think clearly about your project as you progress through all the stages of your analysis. By recording your thoughts, asking questions and interpreting the results of your analyses, you will become clearer on your approach.
Context: future-you or collaborators will be able to understand the context behind your results: what you’ve tried, what worked, what didn’t and what the next steps might be for improvements or future work.
Centralised: keeping every detail about your project in one place will make it easier to find something later on.
Essential elements of a lab notebook
Hadley Wickham gives some great tips on what he calls an ‘analysis notebook’ in his book R for Data Science which I have adapted and added a couple of extra elements that have been particularly helpful to me.
Try to follow these guidelines when creating your lab notebook:
If you followed my other blog post on how to structure a data science project then you would have created a file called notebook.Rmd for this exact purpose within your project folder.
Your notebook will start with a YAML header. As a minimum, it’s a good idea to include a few key options, such as the title, author, date and an HTML output format with a table of contents.
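The snippet below is an illustrative minimal header (the title, author and date are placeholders to replace with your own):
---
title: "My project title"
author: "Your name"
date: "2022-01-01"
output:
  html_document:
    toc: true
    toc_float: true
---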
Next, include some information on what you are hoping to achieve with your project and any goals that you may have (you can also fill in your goals later if you are doing some exploration first):
# Project Goals
This project aims to ...
This will be achieved through the following x goals:
1. First goal
2. Second goal
3. Third goal
Next you’ll want to include a setup chunk containing all the packages you load in for your analysis. There are 2 other packages that are recommended for better reproducibility:
The here package – simplifies file path references so that you can avoid the headache of fixing references to files before you run an analysis. You can learn more about this package here
The renv package – this package keeps track of all packages used in your project, including the versions. You can find the documentation here
Keep your notebook organised with headers and sub-headers using the # H1, ## H2, and ### H3 header tags for easy navigation. This will show up nicely in the table of contents in the HTML output of your notebook.
As you progress through your analysis, record everything and don’t delete code chunks that didn’t work out – just write a quick note about what went wrong and continue on.
If you need to do any data cleaning or if you are modifying the data in any way then clearly state why you are doing it.
At the end of your analysis, write a few closing remarks on what your findings were and give some recommendations for how the project can be improved so that if someone were to pick it up in the future they would know where to start.
I strongly encourage you to keep a notebook the next time you start a new project.
Not only will you gain more clarity in your thinking but you will also make the process of creating documentation at the end of your project much easier!
It’s easy to run headfirst into writing code and analysing data and forget about project layout, so make it part of your routine to get started with a simple structure that you either maintain throughout the project or evolve as you go.
Future you (and others) will thank you for creating a well-structured, organised project. Also remember that ‘well-structured’ doesn’t have to mean complicated!
Project Guidelines
Start simple, evolve as needed – don’t overcomplicate things when you’re just getting started because this may even lead to procrastinating and putting off the project because the project setup is so overwhelming.
The nature of your project will determine the structure – a simple exploratory project won’t even need all the elements described below but a growing machine learning model may need a more complex structure.
Keep reproducibility in mind – create your project as if you’re collaborating with a team (even if it’s just for you) because future-you is also on your team.
Stay flexible – adapt your structure to changes as they come up; don’t be afraid to change your entire project structure if you need to.
RStudio Projects & Filepaths
RStudio projects make dealing with filepaths a breeze! Never set your working directory directly using setwd(). This could cause you to run into a few problems:
If you try to transfer your work to a new computer (especially if it is a different operating system) then you’d have to change your directory each time.
Your project will no longer be reproducible without someone having to change the working directory.
If you wanted to move your project to a new folder, you’d have to change the working directory.
Keep it simple!
An RStudio project is defined by the .Rproj file created for it. The directory where the .Rproj file lives is treated as ‘home’, and all other files are referenced relative to it.
Besides filepaths, RStudio projects make it easy to switch from one project to another and have more than one project open at once. Projects are clearly separated from one another without worrying about overlapping environments and conflicts.
Folder structure
Here is an overview of the steps for creating a new project:
In RStudio: create a new project using File > New Project and create a new directory with the name of your project.
Next you’ll create three subfolders inside your new project directory called R, data, and output.
Back in our root project directory, create your main script file called analysis.R, your README.md file, and your notebook.Rmd file.
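The resulting layout looks something like this (the project name is a placeholder):
my-project/
├── my-project.Rproj
├── analysis.R
├── notebook.Rmd
├── README.md
├── R/
├── data/
└── output/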
The R subfolder should contain all your function scripts. These scripts don’t actually do any analyses themselves but rather act as a place to store your functions for easy reference; they will later be called by the analysis.R file.
This folder is optional, depending on your project. For example, you may do most of your analyses within your notebook.Rmd as in the case of projects consisting only of a few exploratory data analyses or you may just use your analysis.R file for all your analyses.
You will keep all your data files in the data folder. These can be Excel files, CSVs, etc. Treat these files strictly as read-only. It is very tempting to correct a small typo or error in your data, especially if your data is in an Excel file, but resist this! Do all your editing within R and keep these steps well documented.
The output folder is reserved for storing your model output, log files, or plots. You can create further subfolders for each of these to keep things organised if you generate a lot of output.
One thing to note here is that these files should all be generated by your scripts so that you can easily run your script again and re-generate all relevant output files. So don’t be scared to delete old files if you no longer need them.
The Root Files
Your analysis.R file would usually start with something like
# Import packages
library(cool_package)
library(other_cool_package)
# Run scripts
source("R/import_data.R")
source("R/clean.R")
… followed by all your analyses below it.
The notebook.Rmd file is your holy grail as you progress through your project and test out different hypotheses or explore new ideas. Document everything and keep it all together in this notebook, like a diary of sorts.
I have preferred a paper ‘laboratory’ notebook in the past but after coming across the idea of using a .Rmd file for this purpose from Hadley Wickham in his R for Data Science book, I wish I had discovered it sooner! However, I do still keep a paper notebook to jot down ideas or questions that may come up as I go through the project or spontaneously as I go about my day, but I always make sure that I document these ideas at some point in my notebook.Rmd file.
Lastly, there is the README.md. This file gives a quick overview of the purpose of the project, what you are trying to achieve, the data you are using (and its source), as well as a brief summary of your findings. Think of this as your project abstract, so that someone can quickly get an overview of your project.
That’s it!
This project structure shouldn’t take you more than a few minutes to set up and will save you so much time in the long run. Having a well-organised project that is easy to share and come back to in the future is not to be underestimated!
Starting a new data science project (especially for the first time) can feel like a monumental task. You’ve probably read a bunch of data science articles or gone through an online course and you’re keen to apply your new-found knowledge to some actual data (that does not involve petal width).
But where do you start? How do you go through this process so that you come out the other side feeling excited and motivated to continue your data science journey rather than feeling like some kind of imposter?
The Project Flow
The solution to this is a project flow that guides you from getting started through to each step along the way. The most difficult part is just starting.
Trust me, I know the feeling all too well.
If you don’t have a solid process for starting your data science project off on the right foot then you will most likely just procrastinate yourself out of actually doing anything!
I am going to share the process I follow whenever I start a data science project.
[Figure: the project flow diagram, with the blue and green arrows referenced below]
There are two starting positions in this project flow. Either you start with data or you start with a goal in mind.
Starting with a goal: your best option is to follow the blue arrows and move on to first finding the data and then tidying it.
Starting with data: you can follow either the green or the blue arrows – set your goals next, or start tidying the data right away and set your goals later.
The process is this flexible because you may not know what is possible with your data until you start tidying it up a bit and doing some basic exploration. This may spark some ideas for how you want to move forward, and then you can set more informed goals about how to proceed with the data.
Other times, the data could be neatly formatted and well documented, and so just by going through the documentation you will get an idea of the route that you want to take right away. In this case, you can set some goals early on and then tidy up the data to get it ready for your analyses.
Also, you should not view this process as a concrete 1-2-3-done! sort of deal. It is intended to be looped and iterated until you have a good idea of what you want to achieve and how you will go about performing your analyses.
You will feel the inevitable urge to skip setting goals for your project. You will probably be super pumped to jump right into the deep end and write some awesome code, but I urge you to resist this temptation. If you do not set clear goals at the start of the project, you will not know what success looks like and you will get caught in the clutches of analysis paralysis.
Here are some quick actionable tips to take away if you don’t have the time to check out the other posts right now:
Finding data: the dataset you choose is going to depend on what you’re interested in and what motivates you. However, if you are just starting out or you are at a complete loss about what kinds of data you want to analyse then I highly recommend that you start with Kaggle. Browse through their datasets until you find something that interests you.
Setting goals: be smart about setting goals. Make sure you are specific and realistic about what you want to achieve, and ensure you can measure what success looks like at the end. Lastly, give yourself a deadline so that you create a bit of positive friction that will motivate you to take action rather than procrastinate.
If you’re starting your first (or next) data science project then chances are you’ll need to find one or more interesting datasets to analyse in your project. Well, look no further because I have a list of the best 5 places to look for great datasets to use in your project.
The truth is you can find data almost anywhere – either pre-formatted and cleaned up for you or in its raw, dirty form. You can download plain flat files with nicely organised columns (but not necessarily clean) or you can scrape together your own dataset from websites or from multiple different sources.
I’ll be dealing with the former here because if you’re anything like me, you’re less interested in collecting the data and more interested in analysing the data.
I encourage you to choose datasets based on what you’re interested in and what motivates you. If you pick a dataset because it seems like a hot topic and it doesn’t actually excite you then you are more likely to never finish the project.
I have organised the datasets into 5 categories below based on their overall theme. This is by no means an exhaustive list of datasets and there are of course many more places out there to find data but I hope that this list will give you a good place to start.
Fight Club
Enter the first category: Fight Club. There are datasets available on competition websites that have been created specifically for use in predictive modelling. They are sometimes cleaned beforehand, with only a few tweaks needed depending on your use case.
You don’t actually have to make a submission to the competition if that’s not your vibe, you can just download the data for use in your own custom projects.
The Data Journalists
News sites that are focussed on data journalism are good places to look for data because they often release their (already cleaned) data publicly.
If you enjoy their content you can also download the data they use in their articles and try to replicate the results yourself or come up with different problems or questions that you want to explore.
FiveThirtyEight – data used in their articles can often be found on Github. Alternatively, if you’re using the R programming language, there is a package called fivethirtyeight (developed by Kim, Ismay, and Chunn 2019) that contains all datasets used by the site, and the package’s documentation lists every dataset it contains. A quick example of loading one of these datasets follows this list.
BuzzFeed – data for this site can also be found on Github
ProPublica – they host a data store on their website with both free and paid data sets
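As a quick sketch of the fivethirtyeight package in action (assuming the bad_drivers dataset, one of the datasets bundled with the package):
# Install once, then load the package and pull in a bundled dataset
# install.packages("fivethirtyeight")
library(fivethirtyeight)

data("bad_drivers")
head(bad_drivers)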
The Big Fish
Cloud hosting platforms/providers often host big data sets. Big data is a whole other beast unto itself that may require a very different approach.
I would not recommend starting here if you’re still getting used to querying and cleaning data and haven’t dealt with big data before.
They are typically geared toward getting you to use their platform and so you would need to sign up for their services in order to get access to their data.
The Hunter-Gatherers
These hunter-gatherer sites collect data from multiple different sources and store them in their own repositories. They could be centred around a specific industry, or they could be a mixture of different topics, each belonging to different sectors and industries.
This is my favourite category to search for data in. Remember what I said about being less interested in collecting the data and more interested in analysing the data? Well, then I guess it’s a no-brainer that I’d be hanging out over here.
However, be warned that some of these sites can have data that is quite dirty, especially if they are user-contributed with minimal documentation. But you’re a data scientist, right? Dirty data is like your home-away-from-home 😏
The Stream
If you want to keep your data updated and refreshed at close to real-time, then open the flood gates and let the stream in.
Some sites make their data available in near real-time via an API, but be aware that there are limits on the number of API calls you can make or the number of data records returned within a certain interval. Beyond this, you generally have to pay for more calls or a higher data cap.
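As a rough sketch of what this looks like in R with the httr package (the endpoint URL and query parameters here are entirely hypothetical), you typically page through the results and pause between calls to stay within the limit:
library(httr)

base_url <- "https://api.example.com/v1/records"  # hypothetical endpoint
pages <- vector("list", 5)

for (page in seq_along(pages)) {
  resp <- GET(base_url, query = list(page = page, per_page = 100))
  stop_for_status(resp)                  # fail early if the call was rejected
  pages[[page]] <- content(resp, as = "parsed")
  Sys.sleep(1)                           # simple pause to respect the rate limit
}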
Phew! There are a lot of options for datasets! The last thing you want to happen is to get stuck in procrasti-searching – procrastinating by endlessly searching these sites for datasets (did you see what I did there?).
The most important thing is that you just pick a dataset and get started on your project. You will have plenty of time to come back to this list and these sites along your data science journey. One project does not make a data scientist so keep practicing.
If you have a favourite source that you like to get your data from then leave it in the comments below – I would love to expand my list of data sources. You can never have too much data, can you?
Have you ever been really motivated to get stuck into analysing some data so you go ahead and download a really interesting dataset and (after a bit of googling) you import the data and go to town on it – you create summary tables, visualisations, fit a model, then go back to visualisations, then back to fitting some other model, and so on and so on. And then you wonder “what’s the point of all this? I don’t even know what I’m doing here”.
Well, I’ve been there (more than once) and the number 1 thing that I have found to combat this never-ending feeling of analysis-to-nowhere is to set actionable, achievable goals before I start any analysis.
You need to get clear on what you’re going to do, why you’re doing it, how you’re going to do it, what success looks like at the end and when it needs to be done by.
Keep reading on below because I will show you how to do this using the SMART goals framework within the context of a data science project.
This is the SMART goals framework:
Specific: be clear on your what’s and how’s
Measurable: identify the metrics that define success
Achievable: make it challenging but still be reachable
Relevant: it should be important or interesting to you
Time-bound: give it a deadline
Specific
Your goal should be clear and easy to understand. You (and your team) should be aligned with what it is you’re going to do and what’s required to achieve it.
Try to answer the following questions when making your goal specific:
What do I want to accomplish with this project? Is it just to learn something new, or will I explore and visualise my data, or fit a model and make predictions (or all of the above)? Be specific about each of the things you’re going to be doing along the way.
How will I accomplish it? What programming language will I use and what packages and functions will I be using within my chosen language?
If this involves a team then who will need to be involved in the project with me?
Measurable
A measurable goal identifies a metric with a specific target to show that you’ve succeeded. Consider these points during this process:
If you’re just learning something new then try to choose a metric that will indicate when you have successfully learnt the topic
For projects that involve exploration and visualisation, you’ll need to decide at what point you’ll stop exploring.
If you choose to fit a model and make predictions then you’ll need to choose a metric that indicates good model fit or accuracy
In some cases, the statement of the goal as a whole is measurable. In other cases, you’ll need to get a little creative about what you will use to measure your success.
Achievable
The goal should be achievable based on your current skills, knowledge and resources. This, however, doesn’t mean it should be easy. The goal should still challenge you, but not so much that you are setting yourself up for failure. The goal should help you grow and improve, and if you do fail, at least you can “fail fast and fail forward” (Will Smith).
Ask yourself these questions:
Do I have the skills and knowledge necessary to achieve the goal? If not, then determine what you need to learn first and come back to this project at a later stage.
Do I have the resources necessary to achieve the goal?
Does the goal challenge me to improve and grow, even if I fail?
Relevant
In this step, you must evaluate the importance of the goal. Think about whether it aligns with your own interests as well as whether it will actually allow you to grow and move forward.
Answering the following questions should help you through this process.
Is it worth the time and effort? Do these goals excite and motivate me to go forward with the project?
Is this the right time to tackle the goal? In other words, do I need to wait until I have acquired more skills and knowledge first, or are there other projects and goals that make more sense right now?
Time-Bound
Your goal should have a deadline. This creates a sense of urgency in achieving the goal. A time-bound goal creates a positive tension that keeps you moving and builds the momentum you need to achieve the goal.
Depending on the size of your project, you may need to break the goal down into smaller objectives and tasks so that you continue to build momentum as you progress through the project.
If your goal uses multiple different methods, packages, and functions then it is a good idea to split these up into multiple goals that share some similar characteristics such as being achievable and relevant but might have different deadlines and metrics that define their success.
Each project is unique and there is no right or wrong goal.