How To Structure Your Data Science Project For Success
It’s easy to run headfirst into writing code and analysing data and forget about project layout, so make it part of your routine to get started with a simple structure that you either maintain throughout the project or evolve as you go.
Future you (and others) will thank you for creating a well-structured, organised project. Also remember that ‘well-structured’ doesn’t have to mean complicated!
- Start simple, evolve as needed - don’t overcomplicate things when you’re just getting started; an overwhelming project setup can lead you to procrastinate and put off the project altogether.
- The nature of your project will determine the structure - a simple exploratory project won’t even need all the elements described below but a growing machine learning model may need a more complex structure.
- Keep reproducibility in mind - create your project as if you’re collaborating with a team (even if it’s just for you) because future-you is also on your team.
- Stay flexible - adapt your structure to changes as they come up; don’t be afraid to change your entire project structure if you need to.
RStudio Projects & Filepaths
RStudio projects make dealing with filepaths a breeze! Never set your working directory directly using setwd(). This could cause you to run into a few problems:
- If you try to transfer your work to a new computer (especially if it is a different operating system) then you’d have to change your directory each time.
- Your project will no longer be reproducible without someone having to change the working directory.
- If you wanted to move your project to a new folder, you’d have to change the working directory.
Keep it simple!
An RStudio project is defined by the .Rproj file that is created for it. This file specifies that the directory where it lives is ‘home’, and all other files are referenced relative to it.
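As a rough illustration of what relative paths buy you, here is a minimal sketch in base R. The helper project_path() and the file scores.csv are hypothetical names made up for this example, and a temporary folder stands in for a real project so the snippet runs anywhere.

```r
# Avoid: setwd("C:/Users/me/Documents/myproject")  # only works on one machine

# Hypothetical helper: builds a path relative to the project root.
# `root` defaults to the working directory, which an RStudio project
# sets to the folder that holds the .Rproj file.
project_path <- function(..., root = getwd()) {
  file.path(root, ...)
}

# Throwaway project root so the example is self-contained
demo_root <- file.path(tempdir(), "myproject")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Write and read a small file using only root-relative pieces
write.csv(
  data.frame(id = 1:3, score = c(10, 20, 30)),
  project_path("data", "scores.csv", root = demo_root),
  row.names = FALSE
)
scores <- read.csv(project_path("data", "scores.csv", root = demo_root))
```

Inside an actual RStudio project you would not need the root argument at all: file.path("data", "scores.csv") would resolve against the project directory on any machine.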
Besides filepaths, RStudio projects make it easy to switch from one project to another and have more than one project open at once. Projects are clearly separated from one another without worrying about overlapping environments and conflicts.
Here is an overview of the steps for creating a new project:
- In RStudio: create a new project using File > New Project and create a new directory with the name of your project.
- Next, create three subfolders inside your new project directory called R, data, and output.
- Back in your root project directory, create your main script file called analysis.R, your README.md file, and your notebook.Rmd file.
myproject/
├── R/
├── data/
├── output/
├── myproject.Rproj
├── analysis.R
├── README.md
└── notebook.Rmd
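If you would rather script the folder setup than click through it, something like the following base R sketch could work. The function name create_project_skeleton() is an assumption for illustration, and the example builds the skeleton in a temporary directory; the .Rproj file itself is still created by RStudio via File > New Project.

```r
# Hypothetical helper: creates the subfolders and empty root files
# described above inside `path`. Existing files are left untouched.
create_project_skeleton <- function(path) {
  for (folder in c("R", "data", "output")) {
    dir.create(file.path(path, folder), recursive = TRUE, showWarnings = FALSE)
  }
  for (f in file.path(path, c("analysis.R", "README.md", "notebook.Rmd"))) {
    if (!file.exists(f)) file.create(f)
  }
  invisible(path)
}

# Demo in a temporary location so nothing on disk is overwritten
project_dir <- file.path(tempdir(), "myproject_skeleton")
create_project_skeleton(project_dir)
list.files(project_dir)
```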
The R subfolder should contain all your function scripts. These scripts don’t actually do any analyses themselves but rather act as a place to store your functions for easy reference, and will later be called by the analysis.R file.
This folder is optional, depending on your project. For example, you may do most of your analyses within your notebook.Rmd, as in the case of projects consisting only of a few exploratory data analyses, or you may just use your analysis.R file for all your analyses.
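To make the idea of a function script concrete, here is a sketch of what a file such as R/clean.R might contain. The function drop_incomplete_rows() and the toy data frame are invented for this example; the point is only that the script defines functions and leaves the actual analysis to whoever sources it.

```r
# Possible contents of a function script such as R/clean.R:
# it only defines a function and runs no analysis of its own.

# Hypothetical helper: keep only the rows without any missing values
drop_incomplete_rows <- function(df) {
  df[stats::complete.cases(df), , drop = FALSE]
}

# Elsewhere (for example after sourcing the script) the function is used:
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
toy_complete <- drop_incomplete_rows(toy)
```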
You will keep all your data files in the data folder. These can be Excel files, CSVs, etc. Treat these files strictly as read-only. It is very tempting to correct a small typo or error in your data, especially if your data is in an Excel file, but resist this! Do all your editing within R and keep these steps well-documented.
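Here is a small sketch of what "editing within R" can look like. The data frame, the misspelled value, and the file name in the comment are all made up for illustration; in a real project the raw data would be read from the data folder rather than typed in.

```r
# Stand-in for raw data; in a real project this might be
# raw <- read.csv(file.path("data", "counts.csv"))  # hypothetical file name
raw <- data.frame(
  site  = c("North", "Nroth", "South"),
  count = c(12, 7, 9),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw data stays exactly as it was imported
clean <- raw

# Documented correction: "Nroth" is a typo for "North"
clean$site[clean$site == "Nroth"] <- "North"
```

The file on disk never changes, and the comment next to the fix records what was corrected and why.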
The output folder is reserved for storing your model output, log files, or plots. You can create further subfolders for each of these to keep things organised if you generate a lot of output.
One thing to note here is that these files should all be generated by your scripts so that you can easily run your script again and re-generate all relevant output files. So don’t be scared to delete old files if you no longer need them.
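As a sketch of script-generated output, the snippet below writes a summary table and a plot from R's built-in mtcars dataset. The file names are assumptions, and a temporary folder stands in for the output folder so the example can be run anywhere.

```r
# Stand-in for the project's output folder
output_dir <- file.path(tempdir(), "myproject", "output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Summary table: mean miles per gallon by number of cylinders
summary_table <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
write.csv(summary_table, file.path(output_dir, "mpg_by_cyl.csv"), row.names = FALSE)

# Plot saved to a file rather than only shown on screen
pdf(file.path(output_dir, "mpg_by_cyl.pdf"))
barplot(summary_table$mpg, names.arg = summary_table$cyl,
        xlab = "Cylinders", ylab = "Mean mpg")
dev.off()
```

Because both files come straight from the script, deleting them costs nothing: running the script again recreates them.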
The Root Files
Your analysis.R file would usually start with something like:
# Import packages
library(cool_package)
library(other_cool_package)

# Run scripts
source("R/import_data.R")
source("R/clean.R")
… followed by all your analyses below it.
Your notebook.Rmd file is your holy grail as you progress through your project and test out different hypotheses or explore new ideas. Document everything and keep it all together in this notebook, like a diary of sorts.
I have preferred a paper ‘laboratory’ notebook in the past, but after coming across the idea of using an .Rmd file for this purpose from Hadley Wickham in his R for Data Science book, I wish I had discovered it sooner! However, I do still keep a paper notebook to jot down ideas or questions that come up as I go through the project or spontaneously as I go about my day, but I always make sure that I document these ideas at some point in my notebook.Rmd file.
Lastly, there is the README.md. This file is for a quick overview of the purpose of the project, what you are trying to achieve, the data you are using (and its source), as well as a brief summary of your findings. Think of this as your project abstract so that someone can quickly get an overview of your project.
This project structure shouldn’t take you more than a few minutes to set up and will save you far more time in the long run. Having a well-organised project that is easy to share and come back to in the future is not to be underestimated!