Joleen, Author at Statistically Relevant

Green plant with a zoom meeting in the background

Joleen January 20, 2022

Analytics & Business

Data Science is a Team Sport: Why Deepnote is The Platform For Small Teams

Approaching data science like a team sport is still a relatively new concept in the field today. In the past, data scientists have typically worked alone as ‘unicorns’ and have been expected to wear multiple hats. However, more and more companies are building teams that contain a mix of roles from BI experts, data analysts, data scientists and data engineers, all the way up to Chief Data Officer.

Effective collaboration and communication are critical to the success of data science projects, not just within the data science team but across other teams too. This means building cross-functional teams that work together toward a common goal.

In a world that is also becoming increasingly remote, companies have been trying to find more effective ways of collaborating on projects across almost all business units. Deepnote is working to solve this problem for data scientists with their cloud-based notebook platform. Deepnote was built with data teams in mind, taking the headache out of many of the unique challenges that these teams usually face.

These are the three biggest challenges that small teams face when working collaboratively on data science projects:

Working together in real-time — being able to see the code and output in real-time, where ideas can be shared and feedback can be given in the moment.
Building reproducible data science projects that can be shared with new and future team members quickly and with little effort.
Going from data science to data storytelling – giving presentations on the results of data science projects in the form of a story to leadership and executive teams in the company.

In this post, we’re going to expand on these issues and demonstrate how Deepnote has addressed them.

Working Together in Real-Time

When working in a remote environment, it’s easy to lose sight of the importance of collaboration. You can’t see what a team member is doing or what they’re working on unless you get on a Zoom call and share your screen or take a series of screenshots and send them over Slack or email. This adds friction and is definitely not the most efficient way to collaborate. It slows down your work, and projects end up taking longer or get scrapped altogether.

Being able to write code alongside teammates allows you to each get involved in the project, sharing ideas and feedback in the moment. You can bounce ideas off of each other and explore different ways of approaching a problem. By sharing knowledge, team members can learn faster and find better solutions to business problems.

This is particularly useful when you’re stuck on a challenging problem or get lost down a rabbit hole that takes you in the wrong direction. Being able to interact with one another and demonstrate, in real-time, what you’re doing is an element of collaboration that is missing from a lot of the current processes and tools that data teams use.

Deepnote reduces the friction, allowing teams to easily work together in real-time. You just share your notebook with team members and their avatar will show up on the screen. Clicking on their avatar allows you to quickly jump to their position in the notebook to avoid excessive scrolling (very useful for large notebooks).

Deepnote also gives you the ability to take your communications offline by leaving comments on specific code cells. This allows teams to have discussions and share knowledge and ideas even if real-time collaboration is not possible (e.g., when working in different time zones).

Building Reproducible Projects

Reproducibility is a key prerequisite for reliable science. Since the package versions, environment, and computational requirements are not bundled with Jupyter notebooks, reproducing a colleague’s analysis can be frustrating.

On the other hand, if you’ve ever tried onboarding a new team member, you’ll know how time consuming it can be to get their environments set up with the correct software and packages so that they can work on current projects and also be able to refer back to past ones.

This is why reproducibility matters — it will save you time in the long run and you can produce projects that are more accurate, more efficient, and with fewer bugs.

There are 3 main areas of a project that need to be reproducible: code, environment, and compute. With Deepnote, you can build projects that are reproducible in all 3 of these areas and you don’t even have to really think about it.

Every project on Deepnote comes with pre-installed packages that are commonly used in data science projects and everything just works. However, you do still get a lot of flexibility to set up your own custom environments if you need it.

Bonus tip: you’re not even restricted to just using Python if that’s not your thing — you can choose a different kernel to run on Deepnote, such as R or Julia. You can also directly insert SQL blocks for querying your data so that everything stays in one place.

Deepnote also offers a choice on the type of machine you want to use for your project. However, for the majority of data science projects, their Standard machines offer more than enough compute. The benefit here is that once you select a machine for a project, you never have to worry about it again. There is no maintenance or admin required and you can easily start or return to a project in seconds.

Lastly, there’s the code. Deepnote comes with automatic version histories and snapshots of files. This makes it easy to keep track of the changes that collaborators made to the project and roll back to previous versions.

A good philosophy to follow is to design notebooks that can be read, run and reused.

From Data Science to Data Storytelling

Data storytelling is one of the most overlooked skills in data science. While having the technical skills needed to properly analyze data is incredibly important, it doesn’t mean anything unless you’re also able to communicate the insights and results. Without this piece of the puzzle, no one will understand or much less, use your model.

Effective data storytelling allows the insights and results of data science projects to inspire people to take action. This moves the company toward its strategic goals and highlights the valuable contribution of data teams.

One way to communicate the results of a project is with a static presentation. You take a bunch of screenshots of your results, create a few charts and stick it all in a PowerPoint presentation and hope for the best. While this may be one of the most common ways to prepare presentations, it is not a good way to tell a compelling data story.

The most effective form of data storytelling is a combination of good visuals and interactivity. Luckily, Deepnote allows you to do both! Using their Publishing Editor, you can transform your analyses into either an article with a more readable layout or a dashboard with interactive blocks that can be moved around and resized. In other words, you can turn your notebooks into apps!

Closing Remarks

Improving our ability to collaborate within and across data teams is a crucial factor in the success of data science projects. While data teams do face some unique challenges, there are ways to overcome them.

With Deepnote, you can collaborate in real-time, create reproducible projects, and create interactive web apps from notebooks — all in the same platform.

Click here to start building your data science projects on Deepnote!

This is a sponsored blog post, but all opinions are my own.

A multi colored computer with the words Data Science Without a Fancy PC

Joleen December 16, 2021

Projects & Workflows

Data Science Without a Fancy PC: Use These Cloud Platforms Instead

You don’t need a fancy PC to get started with data science and machine learning. In fact, you can run all your code in cloud-based notebooks without even worrying about setting up an environment locally.

Even if you’re new to data science, you’ve probably heard of Jupyter Notebooks. It has become the number 1 way to do data science. Jupyter Notebooks make it so much easier to run your code, write commentary and see your output all in 1 place. Almost all cloud platforms use some kind of Jupyter-like environment.

In this blog post I am going to share 5 ways you can do data science in the cloud. Each of these platforms allow you to do this completely for free and they each work really well.

In my opinion, the 2 biggest upsides of using cloud platforms for data science are:

Speed of set up – You can get set up in just a few minutes and have almost everything you need to do machine learning available to you. You don’t need to go through the hassle of setting up an environment locally before you start writing code and analysing data.
Collaboration – Being able to share your work and collaborate on projects is a big upside of any kind of cloud platform. However, collaboration is not available for all the platforms listed here. Even when it is offered, the degree of collaboration differs from platform to platform.

5 platforms will be covered in this post:

Datacamp Workspaces
Kaggle Notebooks
Google Colab
Deepnote
Datalore by JetBrains
Gradient Notebooks

There are a few things you should note about how I have judged these platforms. I have actually tried each of these platforms myself and am giving my own opinions on what I think of them. My primary use of cloud platforms is to work on personal projects and not for company or enterprise use.

These are the criteria that I am using to compare these platforms:

Price – they should be free or at least offer a decent free plan (not just a trial)
Speed of set-up and low admin – I shouldn’t have to ‘babysit’ my projects in case I go over ‘allowed hours’. I’d like to be able to log on, work on a project and log off without worrying about whether I shut the server down.
Aesthetic and intuitive – the app should look good and it should be intuitive and easy to use
Collaboration – I’d like to be able to share my work with friends and be able to collaborate on them live. I’d also like to have the option to share securely if I want to, without my project being available publicly.

Great, now that I’ve cleared all that up, lets get into it.

DataCamp Workspaces

DataCamp recently launched their Workspaces feature that allows you to run code in a Jupyter-like environment. There is currently no collaboration feature but it is something they are working on. Right now, you can just share your notebooks with other people and they are able to view them and add comments.

DataCamp’s philosophy with the initial launch of Workspaces is that they want it to be as easy to do data science as it is to learn it. They have already created an incredible interactive environment for learning data science and Workspaces seem like a natural next step for anyone who now wants to easily apply their skills and start putting together a portfolio.

The Workspaces are completely free and you can choose between R or Python. So far, their notebook editor looks good and it is intuitive to use. There is also no admin involved in running or maintaining the workspace which makes it hassle-free.

Kaggle Notebooks

To me, Kaggle is like the OG of cloud-based data science platforms, having started way back in 2010. Initially, they started out as a machine learning competition platform. Since then they have expanded and now offer ways of sharing datasets, notebooks and have a huge community in their forums where you can ask questions and get help.

Kaggle notebooks are free to use, with the option to choose between R and Python and they integrate well with other services. You can even connect your Kaggle notebook to Google Cloud Services to beef up the hardware if you need it, although this will come at an additional cost, of course. Kaggle does provide access to a GPU and for personal projects, it is usually more than enough.

Collaboration on Kaggle notebooks is limited. Similar to DataCamp, you can share your notebooks with other people (such as your teammates in a competition) but you effectively work on different versions of the notebook so there is no live collaboration feature.

Google Colab

Google Colab is another one of Google’s products that operate as a natural extension of Google Drive just like Google Docs or Sheets. If you already use these other products then the Colab UI will feel very familiar.

Naturally, sharing on Colab notebooks is built-in. However, it does not seem to be capable of live collaboration (ie. with 2 or more people editing a notebook together in real-time). I find this to be disappointing since Google basically wrote the book on real-time collaboration with Docs and Sheets.

Colab is also not a particularly pretty app, especially when compared with some of the other platforms I’ll be covering on this list. However, since almost everyone interested in data science most likely has at least 1 google account, setup is by far the fastest.

It is free to use Colab but the resource is not guaranteed and there are several usage limits that change depending on demand. Your usage limits could even be different to mine if your code uses more resources.

Deepnote

Deepnote is by far the most good looking, full-featured platform I have come across. The UI looks great, the editing experience with their notebook is amazing, and they have live collaboration! All of this and it is still free to use (up to a point) plus the platform is still in Beta so you know there is much more to come.

I particularly like their publishing feature – you can publish your notebook as an article or as an interactive app or dashboard. I just love the presentation of the articles and dashboards. Your profile on Deepnote also acts as a portfolio and it is a great viewing experience for anyone looking through your work.

You can get up to 750 hours on their standard machines and each notebook comes with a nifty little feature that automatically shuts the machine down after 15 minutes of inactivity. This keeps the admin pretty low on this platform.

Datalore

Datalore is a platform built by the JetBrains team. The UI looks good and it also comes with live collaboration. There are also a few more options in terms of programming languages with this platform – you can choose between Python, R, Kotlin and Scala.

There is a bit of admin involved with this platform. On the free plan, you get 120 hours/month of basic machine time. However, once you open a notebook, so long as it is an open tab in your browser it will consume resources and eat into your available hours. Because of this, it is important to either close the tab or manually shut down the machine.

Also, if you’ve shared one of your notebooks with someone and they forget to close their browser window down at the end of the day, it’ll eat into your available quota. So that’s something to keep in mind.

Gradient Notebooks

Gradient Notebooks are built by Paperspace. The biggest selling point of Gradient is that they offer free GPU’s. For the free plan, Instead of restricting the number of hours on the platform, the only limits they impose is that you can only ever have 1 notebook running at a time and all notebooks must be public.

Getting set up on the platform and launching a notebook can take a good couple of minutes since all that additional hardware that they offer needs to be provisioned. There is also limited functionality for publishing notebooks and building a portfolio of projects using their platform. They are currently working on a public profile feature, so at least it’s in the roadmap.

Gradient is the platform to go to if you need more resources and more compute for your project and you don’t want to pay a cloud provider like Google Cloud or AWS to get it.

Closing Thoughts

Choosing which cloud platform to use depends on your own goals and needs.

If you’re looking for a place to build a portfolio and get involved in a large community while you’re still learning and improving your data science skills then I’d recommend Kaggle or DataCamp.

If you’re looking for a platform that’s got all the bells and whistles, allows you to build a professionally-looking portfolio and offers real-time collaboration then I’d recommend Deepnote.

If you’ve started branching out into deep learning, NLP, and computer vision then I’d recommend giving Gradient a try.

See this resource for a comprehensive list of almost all available data science notebook platforms.

I hope you found this blog post helpful! Let me know in the comments below what other platforms you’d recommend. I’d love to hear your thoughts!

Man standing on a concrete surface containing multiple arrows with the words decision intelligence: diving into one of the top trends of 2022

Joleen December 10, 2021

Industry News & Trends

Decision Intelligence: Diving Into One of The Top Trends of 2022

Gartner released their top strategic technology trends for 2022 and Decision Intelligence was one of them. They described it as “a practical approach to improve organisational decision making”. It was the first time I had ever come across the term but it definitely got my attention. I wanted to find out more about it and how it could be used in practice.

To be honest, I found that to be a lot easier said than done. Decision Intelligence is still a relatively new field and there are some amazing pioneers out there such as Lorien Pratt, who wrote a book on the topic, and Cassie Kozyrkov, head of decision intelligence at Google.

However, I struggled to find actionable resources on the topic.

When learning something new, it is helpful to first understand its philosophical and theoretical components. However, you may get to a point, as I did, where you begin to ask yourself “great, this thing is important, how do I use it?” It is here that I found resources to be incomplete.

In this blog post, I will give a (brief) overview of decision intelligence and why it is important. Then I will do my best to explain how decision intelligence can be applied.

What is Decision Intelligence?

The basic idea behind decision intelligence is that “decisions are based on our understanding of how actions lead to outcomes“. It answers the question: “if I make this decision and take this action today, what will be the outcome tomorrow?”

Beyond that, making a decision is about forecasting through time in the hopes that the desired outcome can be achieved. Decision intelligence allows you to understand the ripple effect of a decision across other areas, both internally and externally.

Decision intelligence is not an alternative to current solutions or another tool in the data toolbox. Instead, it is a unifier: bringing together complex systems, business intelligence, analytics, machine learning, and AI into a framework where the decision is the main driver.

Instead of trying to find the tools and methods that store, retrieve, analyse, and support data, decision intelligence focuses on the decision first and the data takes on a supporting role.

Why is Decision Intelligence Important?

It is common to find a large gap between the decision-makers in an organisation and the data scientists and analysts. Decision-makers feel that they don’t have enough of the right information available to them to make big decisions that could have a significant impact on the business. On the other hand, many data scientists and analysts feel like their models, reports, and analyses aren’t being used to their full potential in the business.

Decision intelligence helps to bridge this gap, requiring a new way of thinking for both decision-makers and analysts. By focusing on decisions first and allowing them to be the driver of everything else, we can do a much better job of working together. Ultimately, models, reports, and analyses are not of any value (no matter how complex or cutting edge they may be) unless they lead to some action.

In organisations with so much at stake like the Millennium Project and the United Nations, there is a high risk of making decisions that can have unintended consequences. In these cases, the systems are so complex and overwhelming that decision-makers tend to suffer from ‘analysis-paralysis’, and rather than making data-driven decisions, they resort to just using their intuition.

With the rise of big data, we’ve got a new problem. How can we make the connection between data and the problems we care about?

How can data be used to solve really difficult problems like conflict and poverty? For a company, how can data be used to make more money and gain a competitive advantage? For medical providers in Sub-Saharan Africa, how can data help them decide where they should get medicine and which doctors they should train?

How to Apply Decision Intelligence

Through my research, I have found 2 main ways to use decision intelligence in practice. The exact methodology of DI is different for every organisation but usually involves a cause-and-effect diagram and a visual decision model.

The process starts with a decision in mind and by applying a blueprint for how to approach that decision, we can then determine what action needs to be taken to reach an outcome.

Causal Decision Diagram

The core application of decision intelligence is to understand how we can get from cause to effect. Lorien Pratt refers to the diagrams that allow us to examine cause and effect as a Causal Decision Diagram (CDD).

She had found a pattern in decision making through her work and created a template that contains the following elements:

Decision Levers (choices) – the things you can change; decisions you need to make
Externals – things that cannot be affected directly but are related to and affect the outcome
Intermediates – factors involved in the decision
Outcomes – the thing you are ultimately trying to acheive

Interactive Decision Models

The final output of a decision model is a visual representation of the causal decision diagram. This is the part of the decision model that incorporates data to objectively determine the effect of a decision.

The process of reaching a final decision model requires a combination of business intelligence, predictive analytics, machine learning, and various other analytical techniques.

Lorien Pratt gives a nice example of this visual model by considering the impact of the decision to invest in a new training program in a company. Levers can be adjusted and, together with some predictive models, the total investment benefit can be determined.

Decision intelligence is the missing link between data and decision-making. For all the amazing data science and artificial intelligence advances being made today, none of it is valuable unless it can be used to take action.

I hope this blog post was able to give you a sense of what decision intelligence is and how it can be used in practice.

Stochastic gradient descent in python with road sign

Joleen December 2, 2021

Projects & Workflows

Stochastic Gradient Descent in Python

What is Stochastic Gradient Descent?

Stochastic gradient descent is an optimisation technique, and not a machine learning model. It is a method that allow us to efficiently train a machine learning model on large amounts of data. The word ‘descent’ gives the purpose of SGD away – to minimise a cost (or loss) function.

For a better understanding of the underlying principle of GD, let’s consider an example.

A Simplified Example of Gradient Descent

Gradient descent seeks to find the global minimum of a function. To do that, it calculates the gradient at an initial random point and moves to another point in the direction of descending gradient until it reaches a point where the gradient is zero.

As an example, see figure of a parabola below.

The initial random point that starts the algorithm off is at (x = -2, y = 4). The gradient at this point is -3 so the algorithm steps to the right, toward this negative gradient, and calculates the gradient at the next step. Eventually, after a number of steps, the algorithm will reach a point where the gradient is 0 and stops. This is referred to as the global optimum or global minimum.

The learning rate hyperparameter determines the size of each step – too small and the model takes too long, too big and the model may not converge to the global minimum.

Some datasets have an irregular shape and the cost function could contain both local minima and global minima. It is good to keep this in mind when training a model, adjusting the learning strategy according to the current problem.

Batch vs Stochastic Gradient Descent

There are 3 types of Gradient Descent implimentations: batch, mini-batch or stochastic. Which one you choose depends on the amount of data you have and the type of model you are fitting.

Batch Gradient Descent: the entire training dataset is used at every step. Thus this algorithm is very slow for large datasets but scales well for large numbers of features. There is also a possibility that this algorithm may get stuck in local minima.
Stochastic Gradient Descent: a single, random observation in the training data is selected at each step. This algorithm is very fast, only needing to perform calculations on a single point at a time. However, it is erratic and may select points from all over the place, never stopping at a truly accurate solution. Instead, it approaches the minimum on average. That being said, this algorithm is much more likely to find the global maximum. Some of the erratic nature of the algorithm can be solved by using a learning schedule that slowly reduces the learning rate so that it can settle on a more accurate solution.
Mini-Batch Gradient Descent: the algorithm uses small subsets (or batches) of the training data at each step. This algorithm is faster than Batch GD but still suffers from the same drawback of potentially getting stuck in local minima.

How to Implement Stochastic Gradient Descent in Python

This project is focused around applying SGD to a classification problem. Therefore, the steps and performance measures chosen here are best suited to modelling a binary response variable.

Each aspect of the project is broken down below.

Preprocessing the data for classification

These are important steps for data preparation/preprocessing:

Clean up any funnies in the data – for example, in this project the value -1 or 9 are used to represent null or unknown values and so these should be cleaned up and replaced with null
Split the data into training and test sets (typically in an 80/20 ratio). To do this, the train_test_split function is used from sklearn.model_selection (making sure that shuffle = True as this is a requirement of SGD)
Either remove or impute missing values
Feature engineering – after some initial data exploration, new features can be created or existing features can be modified
Standardise continuous variables using the StandardScaler function from sklearn.preprocessing
Dummy code categorical variables using the OneHotEncoder function from sklearn.preprocessing making sure to use the parameter drop = first so that if there are 4 categories in a variable, for example, only 3 categories are retained and the 4th category is inferred by all others being 0.

Selecting the model (and loss function)

This project is a classification problem and so the SGDClassifier is used from sklearn.linear_model. Since SGD is just an optimisation algorithm, we must still pick the machine learning model that will be fitted to the data.

These are the available options:

Support Vector Machine (loss = "hinge")
Smoothed hinge loss (loss = "modified_huber")
Logistic regression (loss = "log") – this loss function is chosen for this project

Measuring performance

The are a couple of different performance measures available to choose from to evaluate a classifier. Although it’s not always necessary to use so many performance measures, since this project is for learning purposes, I will be using all of them.

Each of the below performance measures are calculated after performing cross-validation. This method involves splitting the trainin set into K folds, making predictions on each fold and then measuring it’s performance.

Cross validation accuracy (cross_val_score) – this is a ratio of correct predictions and is often not a good metric when our data is skewed (ie. some classes appear more often than others)
Confusion matrix (confusion_matrix) – calculated from the predictions made during cross validation, this is a table containing 2 rows for actuals and 2 columns for predictions. This matrix is used to calculate a few ratios
Precision (precision_score) – for all cases that the model predicts to be positive, the precision tells us what percentage of them are correct
Recall (recall_score) – for all cases that are actually positive, recall tells us the percentage that are correctly predicted
Precision and recall work as opposites. The preferred level of these 2 metrics depends on the project. Two plots are used to depict the trade off between them (predictions are made using the cross_val_predict function with the parameter method = "decision_function" and then precision, recall and thresholds are calculated with the precision_recall_curve function):
- Precision/recall vs thresholds plot: in this plot we can see the trade-off relationship between precision and recall
- Precision vs recall plot: we can use this plot to select a precision and associated recall value that occurs just before the curve dips sharply
ROC curve (roc_curve) – a 45 degree line is plot alongside this curve to compare the accuracy of our model against that of a purely random classifier. The goal is for our model to angle as far away from this line as possible.
AUC score (roc_auc_curve) – this score summarises the ROC curve. A score of 1 indicates a perfect model whereas a score of 0.5 represents a model indistinguishable from chance.

Analysing Road Safety Data

Project Overview

The UK Department of Transport has released data on reported road accidents to the public from 1979. The information provided only relates to personal injury accidents that occur on public roads and that are reported to the police. This project uses the 2020 accident dataset.

The goal of this project is to:

Explore the road safety accident dataset, selecting features and adjusting them as needed for modelling
Apply the Stochastic Gradient Descent optimisation technique with a log loss function

Specifically, this will be treated as a classification problem with a binary target (response) variable called accident severity. In this project, we will be trying to determine if there are any patterns in the data that can help to predict whether an accident will be severe or not.

You can see all the code used in this project on Github. This blog post aims to summarise the process and findings.

Python libraries used in this project:

# Import libraries
import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")

Data Description

The road safety accident data contains over 30 different features. After going through these columns I selected just 12 (including an index column) to be included in this project.

accident_index
longitude
latitude
accident_severity: 1 = fatal, 2 = serious, 3 = slight
number_of_vehicles
date
day_of_week: 1-7 = Sun-Sat
time
speed_limit: 20, 30, 40, 50, 60, 70, -1 = data missing, 99 = unknown
light_conditions: 1 = daylight, 4 = darkness – lights lit, 5 = darkness – lights unlit, 6 = darkness – no lighting, 7 = darkness – lighting unknown, -1 = data missing
road_surface_conditions: 1 = dry, 2 = wet or damp, 3 = snow, 4 = frost or ice, 5 = flood over 3cm, deep, 6 = oil or diesel, 7 = mud, -1 = data missing, 9 = unknown
urban_or_rural_area: 1 = urban, 2 = rural, 3 = unallocated, -1 = data missing

Data Exploration

To start off with, the features have different types that must each be dealt with – from dates, times, and continuous values to various numbers of categories.

Some features also have values that are not valid. For example, the speed limit feature contains a -1 value when data is missing and a 99 value when data is unknown. Both these values are replaced with null and subsequently removed from the dataset (I will not be dealing with imputation in this project).

Accident severity is the target variable for this project. Currently it has 3 categories: fatal, serious and slight. Since the fatal category is so rare and also because I’d like to treat this as a binary classification problem, I have combined the fatal and serious classes.

Even with the combination of 2 categories, this is quite a skewed variable with only 21% of cases appearing in the positive class so some performance issues could be expected.

Next, the date and time variables are converted to their appropriate types and I decided to convert them into categorical variables. For the dates, they are converted into quarters so that they fall into 4 categories: Q1 – Q4. For the times, they are grouped into 3 equal-sized categories: morning (4am – 11am), afternoon (12pm – 7pm) and evening (8pm – midnight and 1am – 3am).

The variable for the day of the week is left unchanged and will be converted into a dummy variable later on for modelling.

The next few variables are each treated similarly. They have multiple categories, some of which only have a handful of cases in them. Because of this, I convert them each into binary variables with 0 representing what I thought was an ‘ideal’ scenario such as dry roads, daylight, urban area and low speed limit roads.

The last 2 charts are for the continuous features in the dataset. Neither of them are altered in any way. First is number of vehicles involved in the accident with the majority of accidents involving only 1 or 2 vehicles. The second chart is a geographical scatter plot made up of the longitude and latitude values of the accidents. Neither of these charts reveal anything revelatory or worth looking further into.

Data Preprocessing

Other than combining the categories of some of the features and constructing some new features, we must also ensure that any categorical features with more than 2 classes are dummy coded and any continuous variables are standardised.

# Onehotencoder for the day of week, time & month category features
onehot_df = train_features[["day_of_week", "time_category", "month_cat"]]
onehot = OneHotEncoder(drop = 'first', sparse = False)
onehot_df = pd.DataFrame(onehot.fit_transform(onehot_df), columns = onehot.get_feature_names(["day_of_week", "time_category", "month_cat"]))

# Add onehotencoder features back to train set
train_features.reset_index(drop=True, inplace=True)
train_features = pd.concat([train_features, onehot_df], axis = 1)

# Remove redundant columns
train_features = train_features.drop(["date", "time", "day_of_week", "time_category", "month_cat"], axis = 1)

# Standardise the continuous features
scaler = StandardScaler()
train_features_scaled = train_features.copy()
features_to_scale = train_features_scaled[["longitude", "latitude", "number_of_vehicles"]]
features_to_scale = scaler.fit_transform(features_to_scale.values)
train_features_scaled[["longitude", "latitude", "number_of_vehicles"]] = features_to_scale

Training & Evaluating the Model

To train the model the SGD classifier is used with a log loss function

sgd_clf = SGDClassifier(random_state = 42, loss = 'log')
sgd_clf.fit(train_features_scaled, train_target)

Evaluating the model starts with calculating cross validation accuracy scores.

# Cross validation accuracy scores
cross_val_score(sgd_clf, train_features_scaled, train_target, cv = 10, scoring = "accuracy")

I chose to perform 10 folds of cross validation and get these scores:

[0.7807249 , 0.78197473, 0.78100264, 0.78169699, 0.78194444, 0.78236111, 0.78194444, 0.78222222, 0.78180556, 0.78236111]

Not too bad. However, remember that this score is not very reliable when the data is skewed – ie. there are more cases in some classes compared to others. This is true in this data as we noticed that only 21% of the cases fall in the positive class (fatal / serious accident severity) compared to the negative class (slight accident severity).

Next, we put together a confusion matrix to determine how well our classifier predicts the target.

# Confusion matrix
train_target_pred = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10)
conf_matrix = confusion_matrix(train_target, train_target_pred)

This gives the following table:

	Predicted – Negative	Predicted – Positive
Actual – Positive	56 155	147
Actual – Negative	15 564	138

Confusion Matrix

That doesn’t look so good – over 15k cases were predicted to be negative but were actually positive!

Let’s have a look at the precision and recall, calculated from this confusion matrix

# Precision & Recall
print("Precision: ", round(precision_score(train_target, train_target_pred)*100,2))
print("Recall: ", round(recall_score(train_target, train_target_pred)*100,2))

Precision: 48.42

Recall: 0.88

Now we’re getting a clearer picture of the performance of the model. The precision score tells us that for all the accidents that the model predicts to be severe, it is correct 48% of the time. The recall score tells us that for all accidents that are actually severe, the model predicts this correctly 0.88% of the time.

Let’s look at the precision and recall metrics in the form of 2 plots.

The first is the precision and recall trade-off versus various threshold scores.

# Visualise the precision/recall tradeoff

# Use the cross_val_predict() function to get the scores of all instances in the training set
# Return decision scores not predictions
train_target_decision_scores = cross_val_predict(sgd_clf, train_features_scaled, train_target, cv = 10, method = "decision_function")

# Compute the precision and recall for all possible thresholds
precisions, recalls, thresholds = precision_recall_curve(train_target, train_target_decision_scores)

# Plot precision and recall as functions of the threshold value
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precisions[:-1], "b--", label = "Precision")
plt.plot(thresholds, recalls[:-1], "g--", label = "Recall")
plt.grid(True)

plt.legend(loc = "upper right")
plt.xlabel("Thresholds", fontsize=16)
plt.ylabel("Precision / Recall", fontsize=16)
plt.title("Precision / Recall Tradeoff", fontsize=16)
plt.show()

Ideally, in this plot we want to select a reasonably high precision value for which recall is also reasonably high. However, this plot does not paint a nice picture for the performance of this model. If we select a precision over 50%, the recall is basically 0. Having a low recall means that the classifier will fail to recognise positive cases a lot of the time.

The value of precision and recall that you are happy with depends entirely on the problem at hand. There is no golden rule to follow here. Although, I do believe almost 0 recall is not good in any scenario.

The second plot directly compares precision against recall

# Plot precision against recall

plt.figure(figsize=(8, 6))
plt.plot(recalls, precisions, "b-", linewidth=2)
plt.axis([0, 1, 0, 1])
plt.grid(True)

plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)
plt.title("Precision vs Recall", fontsize=16)
plt.show()

This plot leads to the same conclusion as the previous plot but does so from a different perspective.

Next up is the ROC curve and the associated AUC score.

# The ROC Curve
# Calculate the metrics
fpr, tpr, thresholds = roc_curve(train_target, train_target_decision_scores)

# Plot FPR against TPR
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth = 2)
plt.plot([0,1], [0,1], "k--") # 45 degree line
plt.grid(True)

plt.xlabel('FPR (1 - Specificity)', fontsize=16)
plt.ylabel('TPR (Sensitivity, Recall)', fontsize=16)
plt.title('ROC Curve', fontsize=16)
plt.show()

In the ROC curve, we would ideally like the curve to angle itself toward the top left corner as much as possible. The 45-degree dotted line represents a purely random classifier, much like flipping a fair coin. In this plot, however, the curve is quite flat, indicating a poor model.

The AUC score can be calculated from the ROC curve. It simply summarises the curve in a single metric. A score of 1 indicates a perfect fit, while a score of 50 indicates a purely random model.

# AUC Score
print("AUC Score: ", round(roc_auc_score(train_target, train_target_decision_scores)*100, 2))

AUC Score: 60.66

Although the score gives us a quick summary of the performance of the model, it does not tell us much about where the model is going wrong. It will also tend to give a better picture of performance when the positive class is rare (as in the case of our accident severity target variable). The precision and recall scores along with their plots gives more information on the performance of the model.

I hope you found this project helpful in your own machine learning endeavours. As always, I am open to feedback and suggestions. Leave a comment below with your thoughts.

books and graduation cap with the words is a masters degree really worth it for a career in data science

Joleen November 16, 2021

Other

Is a Masters Degree Really Worth It For a Career In Data Science?

Since data science is such a new profession, there is a lot of confusion and uncertainty around job requirements and qualifications. A common question that gets asked is: what kind of qualifications are necessary to get a job in data science and is a Masters degree or PhD really worth it?

I wanted to share my own experiences with getting a Masters degree in statistics and whether I felt that it has been a necessary requirement for my career in data science.

Why I decided to get a Masters degree

I always knew I would pursue some kind of tertiary education. I wasn’t pressured into it by my family, but I considered it to be an unspoken prerequisite to success. I believed that the only way I could have any kind of success in my life was if I got a degree. Which degree I would get was an entirely different question and I never really knew the answer. I changed my mind A LOT throughout high school and in university. I could never just make a decision and stick to it.

The truth is, I didn’t know where my life was going and I had no idea how to get there. I once dreamt of being an egyptologist, discovering lost sarcophagi and dancing through unseen traps on the floor – or was that just the latest Lara Croft movie? I was an impressionable kid, okay, don’t judge.

Things changed for me a few years out of school when I discovered statistics while pursuing a degree in accounting. Things just started clicking into place, which was a nice change from failing accounting. So I changed my university program and made a commitment to getting a Masters degree in statistics. In no way was this easy and it definitely did not ‘come naturally’ to me. I worked harder than I ever had for anything in my life but it was one of the most rewarding experiences and I’m glad I did it.

To be clear, this goal was arbitrary. There was no expectation or requirement placed on me to get a Masters degree. I believed that it was a high enough qualification that would open doors for me in the future, giving me the best possible chance at a successful career and, by extension, a successful life. I had no idea if this assumption was correct and I didn’t really have any anecdotal evidence for why a Masters degree would be beneficial.

I did not even know that ‘data science’ was a thing when I was in university and I didn’t pick my degree based on a job title. You don’t even need a Masters degree in data science in order to get a job in data science. Many professional data scientists have degrees in other areas of STEM and some are even completely self-taught. At the end of the day, you need to choose your own path.

Upsides to my Masters Degree

So what did the masters degree actually get me? I would say, in large part, it gave me the ability to be independent.

When I was in school, and even in my undergrad at university, my learning was linear and one-sided: study chapter 2 for the test, do these mock exams, solve these problems, etc. I didn’t really have to figure anything out for myself and if I happened to study and remember exactly the right things that appeared in the exam, well, then I was successful. That is nothing like real life.

The final requirement for my Masters was a dissertation that had to be original work with an application that had never been done before. The entire process required to produce this dissertation was intense and one of the most difficult things I had to do at the time. This is the part of my tertiary education that taught me some valuable skills that have transferred into my career and adult life.

The process of preparing my Masters dissertation taught me some very useful skills:

I learnt how to do independent research and figure things out myself
I developed resilience to achieve the goals I set for myself and to not give up when things got tough
I learnt how to write about my research and the results
I learnt how to select and fit the right statistical model for my data
I learnt how to prepare and give presentations about my work

However, this only really forms a part of what is required in a data science or analytics-related job. Some data science job descriptions are so different to each other, it’s a wonder they have the same job title.

Each company has unique requirements for a data scientist and the job could lean heavily on either programming, database management and data engineering, business intelligence, machine learning engineering, or even just be highly research-based. Some companies just list all of the above in the job description hoping to catch the Loch Ness monster.

If the variety in the skills required for data science jobs overwhelms you then I’d recommend that you identify 1 or 2 companies that you’d like to work for and do some research into what kinds of skills they expect their data scientists to have and double down on that. This will help to narrow your efforts down to just a handful of skills that you can demonstrate your proficiency in. Companies are much more interested in whether a candidate can actually carry out the duties of the job that is unique to them rather than just being able to do a little bit of everything.

For me, I got my first job in business intelligence and analytics before I even graduated with a Masters degree through a referral by my professor. I went from working on my Masters full-time for a year to then working on it part-time for the next few years while I built up some industry experience. However, the fact that I was working on a Masters gave me a competitive edge against the other candidates who were also applying for the job.

Downsides

Choosing to get a Masters degree and even choosing which university to go to can have many downsides. These are a few of them:

Cost is probably one of the biggest factors to consider. Most postgraduate degrees around the world are notoriously expensive. However, I was very fortunate in this regard as I was able to get multiple scholarships that contributed towards the majority of my tuition. I also did tutoring for undergraduates that helped pay for my living expenses.
Skill lag. Most university programs tend to lag behing industry and you may not be learning cutting edge algorithms and the latest developments. While I extensively studied a number of statistical models, we did not even touch on some of the more advanced machine learning algorithms and their applications.
I was restricted to only the professors at my university and the respective areas they specialised in. This is a factor that was completely outside of my control. I could not pick and choose who would teach me and while the professors may be highly regarded researchers, their teaching methods did not always click with me.

Self-Study

I have had to do much more of my own self-study to learn data science-specific skills. However, there are hundreds of resources out there on just about any topic in data science, machine learning, business intelligence and artificial intelligence. This can be both a blessing and curse.

You can learn as much, if not more, than what is taught in university by using the resources available online and it wont cost you nearly as much. However, the downside is that you are spoiled for choice which usually means that you make no decision at all. It is much easier to bail out of learning anything when you have so many options in front of you since you don’t know where to turn, where to start, which resource is best, and who to go to for support. This is where university programs can be really beneficial – they offer structure, curriculum, and support all bundled up nicely.

However, it is possible to create a structure and a curriculum to self-study the topics and areas of data science that interest you the most. In this way you have all the control and flexibility to choose exactly what you want to learn and the direction you want your studying to take. I understand that this path is not for everyone and you need to find out what works for you.

Self-study is very difficult and takes a lot of intrinsic motivation. There are no deadlines, nobody breathing down your neck to finish a chapter or complete a project. You have to be driven and dedicated to your own study and growth.

If you’re interested in this self-study route, then I recommend you give this blog post a read – I developed my own 6 month data science curriculum with some of the best books and courses in machine learning. There are also a couple of people who have developed and shared their own curriculums as well as step-by-step processes for how to go from absolute beginner to mastery in data science. Check them out and draw inspiration from their journeys.

Here are a few to get you started:

To sum up, I believe that getting a Masters degree allowed me to kick-start my career in data science and analytics and I don’t think I would be where I am now without it. That being said, the education I received during my Masters was definitely not sufficient for keeping up with the growth in the field and a lot more self-study is needed so that my skills remain relevant.

Office items on a desk with the words 4 ways I stay productive while working from home

Joleen November 11, 2021

Productivity & Self-Study

4 Ways I Stay Productive While Working From Home

I have a definite love-hate relationship with productivity. To me, being productive means that I am actively working towards my goals and dreams. However, productivity is not something I can objectively obtain every day. Instead, it ebbs and flows based on so many factors such as diet, sleep, sickness, weather, connection, conflict, my own internal narrative, and the list goes on.

It is almost impossible to control exactly how productive you will be in a given day, week, month. Because of this, I use a few hacks that give me the best possible chance to have a successful day. I’m not perfect – I don’t cross-off all my to-do’s every day, I procrastinate (all the time) and I find it incredibly difficult to reach ‘flow’ state.

The secret is to not try to reach perfection every day but instead to take just 1 step forward, and then another, and then another.

Setting up my Environment

There are 3 aspects to my environment: my physical space, my digital space, and my personal body.

Physical

I try to make sure my physical space is tidy, clean and inspiring.

There are a few plants on my desk that I love looking at, and they quite literally bring the space to life. Studies have actually proven that house plants improve concentration and productivity by 15 percent! I don’t know about you but I’ll take that 15 percent, thank you.

I have a good chair, a desk that allows my arms to rest at a 90 degree angle and a monitor at eye-level so I tick all the ergonomic boxes there.

Digital

Before I log off for the day, I try to make sure that all distracting apps and browser tabs are closed. There is nothing that ruins my morning more that sitting down in front of my computer, turning on my monitor and the first thing I see is Twitter, Reddit, and 20 other open tabs. The first thing I’d want to do is some time-wasting scrolling and then… ugh, can I just go back to bed?

Thats why I close all open tabs and save any interesting articles or links to my Notion at the end of the day. If I really need it, I can always go back to it again. Scrolling through my browser history is a big time kill.

Personal

This is for making sure I personally feel good before I sit down to work. Remember that this will be different for everyone and this is what works for me.

It is non-negotiable for me to get dressed every morning. I don’t wear anything formal, but it needs to be different to what I was wearing when I woke up.

My mental and physical health have been sore topics in the past. I have often found it difficult to imagine and accept that maintaining a regular routine, eating well and working out can actually make any significant difference to my life.

When working from home I have found it too easy to let these things slip. I went through a phase where I would only wake up at the very last minute before it was absolutely necessary for me to be sitting at my desk, I didn’t workout at all and I ate way too much take out. Most of this came from a place of plain laziness and, needless to say, it did not do anything to help my mental or physical health. I frequently felt unmotivated and sluggish during the day.

I have realised over time how important it is to prioritise these things over all else. Now, I try to workout 4 times a week and eat a reasonably healthy diet. I have to regularly remind myself that working out, taking breaks, and resting when I need it are ALL productive activities.

Time Blocking in Google Calendar

I was first introduced to the idea of time blocking by Cal Newport in his book Deep Work. Cal recommends using pen and paper for this but I find that my perfectionism (e.g. needing all time blocks to be an equal size and symmetrically spaced) and the need for everything to look ‘neat’ holds me back so I use Google Calendar instead.

At the beginning of the week (preferably on a Sunday), I go into my Google Calendar and review my commitments for the week, making sure everything is actually in my calendar. Then I fill in my meal times and any other activities like exercise and leisure time. I also leave some blocks blank and leave space between meetings so that there is still some flexibility in my day.

This allows me to realistically see what blocks of time I have available for deep work and I will fill those in. I try to organise my day so that I can set aside at least 3 hours for deep work. This isn’t possible every day but a few 3-hour-chunks per week is achievable for me. I do this because I have come to realise that content-switching and multi-tasking is almost impossible for me. I cannot get any amount of quality work done in short 30 minute slots here and there.

The combination of time blocking and deep work really helps me to avoid procrastinating because I always know what I should be working on, when I should be working on it and how much time I have set aside for it.

A good thing to remember is that no plan is set in stone. If you don’t feel like doing something that you set in your calendar, move it! If you wake up with a raging headache and need rest then listen to your body. Your future productivity will thank you for taking the rest!

Each morning, before I start my work day I review my calendar and write out the tasks I will complete according to the amount of time I have set aside for the day. I try to break the tasks down into just 1 action. Instead of having a task for ‘write blog post’, I’ll have several tasks like ‘write outline’, ‘brain dump in each heading’, ‘brainstorm headlines’, ‘collect 5 research links/resources’, ‘write first draft’, ‘edit first draft’, ‘design blog banner graphic’, ‘publish blog post’.

Each task is achievable and there is no ambiquity around exactly what I’ll be doing. I also ask myself which tasks are essential for me to feel that I have won the day. If things don’t go as planned and something else comes up at the last minute and if I at least do these essential tasks then I will feel like the day is still won.

The Pomodoro Method

This is something I use every day on the Forest app. Once I am ready to start working on a task, I will set a timer on the Forest app. If I am feeling a lot of resistance to the task then I will set a short 25 minute timer to just get me started with it and usually I don’t mind setting another 25 minutes and another and so it goes. Sometimes I just do a single 25 minute pomodoro and that’s it. If I still struggle to focus on the task, I move on to something else.

If I don’t have much resistance then I will usually default to a 50 minute timer with a 10 minute break. While I’m doing some tasks, my mind often wonders to other things like “I want to go for a walk, it’s such a beautiful day” or “I should make coffee, I’d feel more productive with a cup of coffee” or “I’m suddenly feeling hungry, I should get a snack”.

The pomodoro technique is so powerful for my wandering mind because I will look at the timer and think “NO, only 20 minutes to go. I just have to stick it out till the end and then I can go for the walk/coffee/snack but I just need to do this for 20 more minutes”.

During deep work, I still use the pomodoro method but I the breaks are shorter and usually only involves a quick stretch and maybe a trip to the kitchen for more ~~coffee~~ water. I try not to do anything that will significantly distract me or turn my attention in some other direction because that defeats the purpose of the deep work session.

I also sometimes set timers for non-productive tasks. For example, if I want to go on social media or check emails I will set a short timer for it. This makes sure that I don’t fall into an endless pit of scrolling through social media or replying to emails for hours because there is a time cap and then it’s back to work.

Co-working Communities

During the pandemic, like so many others, I wanted to feel part of a community where I could get back the connection I felt I had lost during this time. It took some time to get used to it and gain the confidence to put myself out there in a new community and make connections with people. But I am happy I spent the time on it because I have found some amazing communities of people with similar interests to me.

A big aspect of this is that these communities can then form part of your accountability system. Talking about my goals and putting them out there takes them out of my head and forces me to take action to achieve them and I know that I will be supported and held accountable to them too!

Lastly, there is the virtual co-working. You could do this on a call (with or without video) using a platform like zoom or you could even join a live stream on Twitch under the ‘Co-Working’ tag and work alongside the streamer. Twitch co-working streams has started to get very popular in the last year or so.

After trying out various forms of co-working I find that video calls are the most effective. There is an element of pressure to work because the other people on the call can see you and you don’t want to look like you’re procrastinating. Besides, they’re also working so you’re not alone.

I hope you found this post helpful! Let me know in the comments below what the number 1 way is that you stay productive while working from home.

Stack of books with the words 4 strategies I use to self-study technical textbooks

Joleen November 8, 2021

Productivity & Self-Study

4 Strategies I Use to Self-Study Technical Textbooks

Reading a textbook for self-study is very different to school or university. Back when I was in school, I read the textbook for the sole purpose of doing well on a test and it didn’t really matter if I couldn’t apply the concepts I was learning about. However, I have come to realise that this is the most critical part of self-study.

Over the years, I have developed a few helpful and effective strategies for reading technical textbooks and I want to share them with you. Also, if you didn’t know – I have created a 6 month data science curriculum and I will be applying these strategies as I read through a few of the technical textbooks in my curriculum.

Annotate & mark it up

If you are able to annotate or mark up your text book then you definitely should. Annotating a text book almost forces you to read more actively. You approach the text with the purpose of identifying what’s important and what’s unclear and it prevents you from just mindlessly reading words that mean nothing.

I prefer to just use a pencil when annotating a book, but that’s my personal preference. If you like using highlighters then go ahead! I annotate like this:

Underline or place an asterix next to important concepts & paragraphs – most of the time this is subjective and is just what I think is important as I read
Place a question mark next to things I don’t fully understand with a little note on what question I had in my mind so I know to revisit it or spend more time on it later
If something is very complex I will usually scribble some notes or try to break down the concept on a small piece of paper that I insert into the book at the position I am reading. This helps to identify what I DO understand and when I revisit the section to make further notes, I can focus my attention on the parts I DON’T understand. The worst thing you can do is say “I don’t understand section 5 at all” because that doesn’t help you. What, specifically, about section 5 didn’t you understand? Often, all you’re missing is a small piece of the puzzle that makes everything click into place.
Lastly, if I come across anything that I know I want to come back to for a project or an implementation, I’ll add a little sticky flag on the edge of the page so that it sticks out to me when the book is closed.

Don’t take notes on everything

This is probably the hardest thing to remember. Throughout my time at school and university I was always a compulsive note-taker. I took extremely comprehensive notes on absolutely everything to the point that my notes became more of a hindrance than a help. I learnt the hard way how to take better notes.

When self-studying a textbook, this is even more important to remember because you are probably not going to take a test on the material. Instead, you want to approach the text with the goal of becoming a practitioner (more on implementation later).

On the topic of digital vs handwritten notes, I prefer taking digital notes. I keep all my notes in an app called Notion and it allows me to capture almost anything such as links, embeds, formulas, pictures, code snippets, etc.

These are some of the benefits I have experienced when taking digital notes:

I can type faster than I can write
I can jump around topics – my notes do not need to be sequential or perfect the first time I write them
I can fill in or add to my notes, and easily delete irrelevant or badly written notes
It does not come with the side effect of hand cramps after an hour of writing 😬

As I write my notes, I try to keep these points in mind:

Take notes on the higher level material – don’t get stuck taking notes on every detail you read. That’s what the book is there for, don’t rewrite it.
Ask questions in your notes (and find answers for them in other texts, online articles, video’s, etc. and link to them in your notes)
Link concepts together in your notes from other parts of the book or from other resources (similar to the question/answer method above). I believe that true understanding and learning comes when you’re able to link together the things you are learning
For complex concepts that I struggle to understand, I would get out a blank sheet of paper (I go analog for this – pen and paper are the best for complex and challenging material) and pretend I’m explaining the topic to someone, starting with what I do know and then filling in the gaps in my knowledge with the material in the textbook (see the Feynman Technique).

Implement fast

I am a big believer in project-based learning, especially for technical topics. There is an excellent quote by Stephen R. Covey in his book The 7 Habits of Highly Effective People:

To learn and not to do is really not to learn; to know and not to do is really not to know
Stephen R. Covey

When going through a textbook I am constantly asking myself how I can apply what I am learning. Before I have even completed a chapter, I will start looking for data, python libraries, and example code that I can use in a project that applies the algorithm, model, system or high-level concept I am learning.

The faster I can get to the project-building phase, the better. Once I start implementing, I usually end up with a list of questions that did not even occur to me while I was reading the textbook and I quickly realise which of the concepts I didn’t actually understand and I’ll go in search of answers. Sometimes the textbook doesn’t explain something in a way that I understand and reading it somewhere else, explained in a different way, can suddenly make things click.

Iterate & expand

Think of your study as a iterative process. You may not cover every detail or completely understand everything you read the first time you read it. However, you should understand just enough to get to apply the concepts in practical projects, as mentioned in the point above. Once you start working with the material in a practical way you’ll be able to identify the gaps in your knowledge and then you can refer back to the book to expand on those areas and fill in some of the gaps.

During each iteration you should also return to your notes. Add new ideas and remove points that don’t make sense or aren’t relevant to keep notes on.

Remember not to get too attached to your textbook. There are hundreds of resources out there for learning something and expanding your knowledge requires different perspectives or different ways of teaching something. Explore.

I hope you found these strategies helpful. Leave a comment below or connect with me on Twitter and let me know your thoughts on self-study and reading technical textbooks, I’d love to hear from you!

Laptop in the background with the words my data science self-study curriculum: a 6 month learning journey

Joleen October 31, 2021

Productivity & Self-Study

My Data Science Self-Study Curriculum

Welcome to the next 6 months of my data science learning journey!

After being heavily inspired by Daniel Bourke’s AI masters curriculum, I have created my own learning curriculum that I will be following for the next 6 months.

Will this be a challenge for me? Yes.

Will the content be difficult and hard to understand? Probably.

Will I hit roadblocks and hurdles along the way? Almost definitely.

Is it achievable? Absolutely!

Is it for everyone? Maybe. But you get to decide what’s possible for you. If your dream is to learn data science then accept the challenges that await you and get started!

Please remember that everything here is my own opinion and my own way of doing things. You may not agree with it and that’s totally ok! You can create a learning curriculum that aligns to your own goals and beliefs.

Why am I doing this?

Learning data science and machine learning is no small feat. The field is advancing rapidly and it takes continuous study just to keep up. My goal with this learning curriculum is to build up my foundational knowledge and become proficient in machine learning and deep learning. I also want to be able to take on more challenging projects or pick up a new concept faster in the future.

This is not intended to be a beginners guide or something that you should do if you are just getting into data science or programming for the first time. I have a masters degree in statistics so I am already quite comfortable with the math and stats that form the foundations of machine learning. I have also been coding in python for a few years so I’ve already got a good understanding of the basics of programming in python.

However, I have quite a few gaps in my knowledge when it comes to applying machine learning algorithms to real data in python and I also have not properly explored deep learning so that will all be completely new to me.

I am not trying to get through the content in these courses and books as fast as possible just for the sake of saying I did the course. So I will probably be skipping some sections if they don’t feel relevant or if the content is a duplication of something I already learnt in another course.

It is still very important to me that I understand the content and am able to apply it myself. Therefore, my curriculum has a very strong focus on projects. Most of what I learn will be applied in a practical (and realistic) way with new, real-world data sets (not the data given in the course/book).

I will also probably not be trying to get any special certificates as part of the courses that I complete. I don’t particularly care for the testing methods of these courses which are very far removed from real life. Thats why I’ll be completing projects on my own that will give me an indication of my competency and level of understanding as I progress through the content.

How much time will I spending on this?

I roughly estimated how long I think each course will take (including all the projects) and would also like to spend about 4 hours per day, 6 days a week studying. From this I felt that 6 months would be a challenging yet achievable amount of time to complete this curriculum.

Why did I choose these courses/topics?

Once again, I got most of these resources from Daniel Bourke’s website. He has a comprehensive list of some of the best resources to learn data science, machine learning and deep learning. I encourage you to check it out if you’re trying to figure out what kinds of resources are out there for learning this stuff.

Also, the courses I chose can all be taken for free, except for the Udemy course which I got at a ridiculously discounted price. The only caveat is that the Cousera courses do not provide a certificate if taken in their free ‘audit’ mode and there may be some limitations. The only other thing I paid for was the Hands on Machine Learning Kindle book and the other 2 books were given to me as a gift.

For my own curriculum I created 2 sections which I then allocated the material to:

Foundations & machine learning – covering the general approach to machine learning problems and the various algorithms used in machine learning (such as k-means clustering, regression, decision trees, support vector machines, etc.)
Deep learning – covering topics related to neural networks, NLP, computer vision, hyperparameter tuning, optimisation, best practices, etc.

How am I tracking my progress?

I created my own Notion page with a kanban board and a calendar. I estimated the amount of time I need to get through each resource and gave each project a deadline. This will be my guide as I progress through the curriculum.

My aim here is to get rid of this question that I often ask myself when learning new things: “ok, what should I do right now and then what’s next?”. Using this curriculum and guide in Notion means that I should always know what I need to be working on at any point and what will be coming up next. A little structure makes me feel better 😊

If things change during this time, then I will simply adjust my timelines in Notion. It’s important to still be flexible to life events, disruptions, other interests and the possibility that some concepts will just take longer for me to grasp than others and I won’t know what I might struggle with until I start.

Each project that I complete will be added to my GitHub and made into a blog post. Project-based blog posts will go up at least every other week. I’ll also be writing other posts on some of the key concepts I’m learning so that I can reinforce the knowledge by distilling the information down into (hopefully) easy-to-read blog posts.

I will also be posting my experiences, interesting facts, new projects and challenges I face to Twitter everyday. There will also be a weekly check-in tweet where I will give an update on my progress in the curriculum and what I’ll be covering during the week.

The Curriculum

Alright, that’s enough chit-chat. Here are links to all the resources I’m using in my curriculum:

Foundations & Machine Learning:

Deep Learning:

Practical Deep Learning for Coders (Fast AI Online Course)
Deep Learning Specialisation (set of 5 courses on Cousera)
Full stack deep learning
Hugging Face course (NLP)

Additional Resources:

Machine Learning by Tom M. Mitchell (Book)
Deep Learning by Ian Goodfellow et al (Book)

Here is a little snippet of my study plan in Notion with timelines:

I hope that by sharing my curriculum you will be inspired to also make a commitment to learning and building in data science!

Follow me on Twitter and join me on my journey. Send me a DM or comment below and let me know your thoughts, I’d love to hear from you!

Joleen October 10, 2021

Analytics & Business

Are Business Intelligence Tools Still Worth It?

Business Intelligence forms the back-bone of decision making in an ever-growing number of organisations in the world today. More than ever, organisations need to be able to make fast decisions to changes in the market.

Unfortunately, many businesses do not have a system that allows them to obtain the insights they need to do this. Instead, they rely on spreadsheet software like Microsoft Excel to capture, manage and analyse all their business data.

Let’s face it, spreadsheets are one of the most accessible, cost effective and flexible solutions to use when your business is in it’s early days and you don’t really have much data to work with.

However, every business reaches a tipping point as they grow when there’s just too many moving parts to get a proper handle on things and decision making end up suffering for it. The spreadsheet you used to use to track sales, marketing, customers, and inventory is now starting to gain a few holes. Preparing and maintaining the spreadsheet is probably taking up more of your time than you’d care to admit, you can’t identify trends and connections between all the various areas you’re tracking and the data just isn’t as accurate as you’d like.

Long story short, you may have reached a point where you need to level up your reporting and analytics capability and a business intelligence system will get you there.

What is Business Intelligence?

BI is a catch-all term for a range of technology and software that enables organisations to manage and analyse data. Broadly speaking, BI is about combining data from multiple sources and using it to make better decisions.

Over the years, BI tools have evolved into being more of a self-service solution. This makes it possible for anyone in the organisation to analyse data and create their own reports and dashboards. Gone are the days when you had complex models that were built and understood only by the specialist who created them. The BI tools of today are more accessible, intuitive and easier to understand.

There are many BI tools and software out there, some of which are free and some come with a hefty price tag. However, business intelligence is about more than just the software. BI involves data storage, data governance, reporting, insights, and also promotes a shift in the culture of the organisation towards data-driven decision making.

Benefits of Business Intelligence

Faster & more accruate reporting and analysis

BI tools are designed to handle large amounts of data from multiple sources. All of this happens behind the scenes so that you only need to focus on extracting insights. The time and energy saved on this front will allow you to prioritise the things that really matter in your business.

Examples of the kinds of reporting and analytics that BI platforms are capable of can be found throughout the organisation. From product development and greater operational efficiency to improving customer experience and employee satisfaction.

With this kind of insight available to business leaders, it is easier to innovate and find new opportunities.

Single source of truth

BI platforms combine data from various internal and external sources into a single warehouse that is more accessible.

This provides a single source of truth for the data where data governance and maintenance is much easier. Not to mention, you avoid inaccurate spreadsheets or databases that are siloed within a single department. With consistent, higher-quality data available, employees are more likely to use and trust the reports and dashboards that are generated.

Establishing a data-driven culture

Building a business intelligence system leads the organisation down the path of developing a data-driven culture. Decisions are based on data and experimentation rather than gut and guess-work. A data-driven culture is essential if the organisation wants to ensure long-term adoption and benefit of current and future systems.

This is a journey, but it is a worthwhile one. Organisations with a data-driven culture are committed to developing their analytical capability, finding insights and deriving value from their data. This approach leads to greater collaboration within and between departments and a mindset of curiosity and discovery.

Business Intelligence Trends in 2021

A recent survey with over 2,200 participants shed some light on the data, BI and analytics trends seen in 2021. These are the top 5:

Master data and data quality management
Data discovery and visualisation
Establishing a data-driven culture
Data governance
Self-service analytics

Essentially, these trends make up the foundation of a good business intelligence system. Organisations must be able to obtain, govern and use their data towards making better decisions. Ultimately, this will help the organisation to move towards their strategic objectives.

BI Is Definitely Worth It

I hope this post helped you to decide whether you want to build a BI system in your organisation. In my opinion, BI is definitely a worthwhile investment and it is easier than ever to get started. Power BI is one of the leading BI platforms on the market at the moment. It is also free to download and start using on your PC immediately.

If you do decide to build a business intelligence system in your organisation then I recommend that you develop an analytics strategy that will also serve as your roadmap. Start small, experiment and make improvements as you go.

For more information check out these resources:

What are your views on BI? Let me know in the comments below!

Keyboard with the words principal component analysis in python

Joleen October 3, 2021

Projects & Workflows

Principal Component Analysis in Python

In this week’s data science project, I apply principal component analysis to a dataset based around hospital mortality using python. The dataset is extensive, containing over 50 variables on demographic characteristics, vital signs, and other measurements taken in the lab. This should be a perfect use-case for dimensionality reduction with PCA.

What is Principal Component Analysis?

Like my previous project on K-Means clustering, PCA is a type of unsupervised learning which means that we do not need any pre-defined labels for our data and we are not trying to make predictions.

The goal in PCA is to reduce the number of variables in our dataset. Therefore, the main purpose of PCA is to speed up machine learning. Using fewer variables while still retaining nearly all the original information and variability gives a nice boost to model training, consuming fewer resources and reaching a result faster.

Eigenvectors and eigenvalues drive PCA at it’s core. An eigenvector is a line that passes through the data and the algorithm chooses the line that maximises the variability between that line and the data points. The corresponding eigenvalue represents the magnitude of that variability. There are only as many eigenvectors as there are variables in the data.

An important question to ask in PCA is how many principal components to retain. The 2 most popular methods are:

Plotting the cumulative variance explained by each principal component. You would choose a cutoff value for the variance and select the number of components that occur at that cutoff.
A scree plot showing the eigenvalues of each principal component where the cutoff here is an eigenvalue of 1.

There are 3 critical processing conditions that must be satisfied when preparing PCA:

There must be no null or blank values in the dataset
All variables must be numeric
Standardise the variables to have a mean of 0 and a standard deviation of 1

Data & Project Goals

The data for this project is from Kaggle. Remember, if you’d like to know where I look for my datasets, check out my post on finding data.

This data is from the MIMIC-III database on in-hospital mortality. The original purpose of the dataset is to predict whether a patient lives or dies in hospital when in intensive care based on the risk factors selected.

There are a total of 51 variables in the data. In this project, we will attempt to reduce these variables down to a more manageable number using principal component analysis.

These are the goals for this project:

Prepare the data for PCA, satisfying the 3 critical processing conditions mentioned in the previous section
Apply the PCA algorithm (using the sklearn library)
Determine the optimal number of principal components to retain

Data Pre-Processing

We begin, as always, with importing the required libraries and the data.

import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)
sns.set_palette("Dark2")

# Import data
pca_raw_df = pd.read_csv("data/mortality.csv")

Next, we will carry out these 3 pre-processing steps:

Remove any rows that contain blank / NA
Select only numeric variables from the data
Standardise the variables

# 1. Remove rows that contain blank values
pca_nona_df = pca_raw_df.dropna()

# 2. Select only numeric columns
# We calculate the max & assume that a max of 1 is probably not numeric
max_all = pca_nona_df.max()
max_cols = list(max_all[max_all != 1].index)

pca_sub_df = pca_nona_df[max_cols]

# Drop the columns that did not fit into our initial logic
pca_sub2_df = pca_sub_df.drop(columns = ['group', 'ID', 'gendera'])

# 3. Standardise the variables
scaler = StandardScaler()
pca_std = scaler.fit_transform(pca_sub2_df)

Once all pre-processing steps are complete, 38 variables remain.

Applying Principal Component Analysis

Now that our data is ready, we can apply PCA and then we will use these 2 methods to determine the optimal number of components to retain:

Cumulative variance
Scree plot

# Fit PCA
pca = PCA()
fit_pca = pca.fit_transform(pca_std)

Plot Cumulative Variance

# Plot the cumulative variance for each component
plt.figure(figsize = (15, 6))
components = np.arange(1, 39, step=1)
variance = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(components, variance, marker='o', linestyle='--', color='green')

plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step=1))
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.90, color='r', linestyle='-')
plt.text(0.5, 0.85, '90% variance threshold', color = 'red', fontsize=16)
plt.text(25, 0.85, "Components needed: "+str(np.where(np.cumsum(pca.explained_variance_ratio_)>=0.9)[0][0]), color = "red", fontsize=16)

plt.show()

The above plot contains the cumulative variance as a proportion on the y axis and the number of components on the x axis.

Using a variance threshold of 90%, the above chart helps us determine how many components we should retain from our dataset in order for it to still make sense for us in any further modelling.

Note that we chose 90% here as the variance threshold but this is not a golden rule. The researcher or data scientist chooses this variance threshold.

So, we can see from this plot that we need 22 components to retain 90% of the variability (information) in our data. That’s not much of an improvement over the total number of variables, 38.

Scree Plot

# Scree plot
plt.figure(figsize=(15, 6))
components = np.arange(1, 39, step=1)
eigenvalues = pca.explained_variance_

plt.plot(components, eigenvalues, marker = 'o', 
				 linestyle = '--', color = 'green')
plt.ylim(0, max(eigenvalues))

plt.ylabel('Eigenvalue')
plt.xlabel('Number of Components')
plt.xticks(np.arange(1, 39, step = 1))
plt.title('Scree plot')

plt.axhline(y=1, color = 'r', linestyle = '-')
plt.text(0, 0.75, 'Eigenvalue Cutoff', color = 'red', fontsize=14)
plt.text(15, 1.10, 'Components Needed: '+str(np.where(eigenvalues<=1)[0][0]), 
				 color = 'red', fontsize=14)

plt.show()

The above scree plot contains the eigenvalues on the y axis and the number of components on the x axis.

As a general rule of thumb, we should select the number of components that have an eigenvalue greater than 1. Based on the above scree plot we should choose to keep 13 components. This is very different to the previous cumulative variance plot where we chose to retain 22 components.

However, if we choose to retain 13 components, this only explains 65% of the variability. This is much lower than the variability captured when retaining 22 components (90%). At this point, the researcher or data scientist would need to decide whether a cumulative variance of 65% is satisfactory based on their domain knowledge and other research in the area.

Alternatively, there are other methods that can determine the optimal number of components. It is also possible that PCA just isn’t a suitable method for this dataset and other dimensionality reduction techniques could be explored.

I used these resources for this project, check them out if you’d like to dive deeper into PCA:

Analytics Vidhya: Demystifying the working of Principal Component Analysis
George Dallas: Principal Component Analysis 4 Dummies (this article does a fantastic job of making eigenvectors and eigenvalues in PCA much easier to understand)
Towards Data Science: PCA using Python (scikit-learn)

You can find all code for this project on Github.

I hope you found this project useful! Comment below or connect with me on Twitter if you have any questions.

1 2 3 4

About Joleen

Posts by Joleen Bothma:

Working Together in Real-Time

Building Reproducible Projects

From Data Science to Data Storytelling

Closing Remarks

DataCamp Workspaces

Kaggle Notebooks

Google Colab

Deepnote

Datalore

Gradient Notebooks

Closing Thoughts

What is Decision Intelligence?

Why is Decision Intelligence Important?

How to Apply Decision Intelligence

Causal Decision Diagram

Interactive Decision Models

What is Stochastic Gradient Descent?

A Simplified Example of Gradient Descent

Batch vs Stochastic Gradient Descent

How to Implement Stochastic Gradient Descent in Python

Preprocessing the data for classification

Selecting the model (and loss function)

Measuring performance

Analysing Road Safety Data

Project Overview

Data Description

Data Exploration

Data Preprocessing

Training & Evaluating the Model

Why I decided to get a Masters degree

Upsides to my Masters Degree

Downsides

Self-Study

Setting up my Environment

Physical

Digital

Personal

Time Blocking in Google Calendar

The Pomodoro Method

Co-working Communities

Annotate & mark it up

Don’t take notes on everything

Implement fast

Iterate & expand

Why am I doing this?

How much time will I spending on this?

Why did I choose these courses/topics?

How am I tracking my progress?

The Curriculum

What is Business Intelligence?

Benefits of Business Intelligence

Faster & more accruate reporting and analysis

Single source of truth

Establishing a data-driven culture

Business Intelligence Trends in 2021

BI Is Definitely Worth It

What is Principal Component Analysis?

Data & Project Goals

Data Pre-Processing

Applying Principal Component Analysis

Plot Cumulative Variance

Scree Plot