Finding Data For Your Data Science Project: The 5 Best Places To Look
If you’re starting your first (or next) data science project, chances are you’ll need to find one or more interesting datasets to analyse. Well, look no further, because I have a list of the 5 best places to look for great datasets to use in your project.
The truth is you can find data almost anywhere – either pre-formatted and cleaned up for you or in its raw, dirty form. You can download flat files with nicely organised (though not necessarily clean) columns, or you can scrape together your own dataset from websites and other sources.
I’ll be dealing with ready-made, downloadable datasets here because, if you’re anything like me, you’re less interested in collecting the data and more interested in analysing it.
I encourage you to choose datasets based on what you’re interested in and what motivates you. If you pick a dataset because it seems like a hot topic but it doesn’t actually excite you, you are much less likely to finish the project.
I have organised the datasets into 5 categories below based on their overall theme. This is by no means an exhaustive list, and there are of course many more places out there to find data, but I hope it will give you a good place to start.
Fight Club
Enter the first category: fight club. Competition websites offer datasets that have been created specifically for predictive modelling. They are sometimes cleaned beforehand, with only a few tweaks needed depending on your use case.
You don’t actually have to make a submission to the competition if that’s not your vibe. You can just download the data for use in your own custom projects.
- Kaggle – competition platform
- DataKind – volunteer projects
- DrivenData – competition platform
Paparazzi
News sites that are focussed on data journalism are good places to look for data because they often release their (already cleaned) data publicly.
If you enjoy their content you can also download the data they use in their articles and try to replicate the results yourself or come up with different problems or questions that you want to explore.
- FiveThirtyEight – data used in their articles can often be found on GitHub. Alternatively, if you’re using the R programming language, there is a package called fivethirtyeight (developed by Kim, Ismay, and Chunn 2019) that contains all datasets used by the site. Also, check out this website for a complete list of all datasets contained in the package.
- BuzzFeed – data for this site can also be found on GitHub
- ProPublica – they host a data store on their website with both free and paid data sets
The Big Fish
Cloud hosting platforms/providers often host big data sets. Big data is a whole other beast unto itself that may require a very different approach.
I would not recommend starting here if you’re still getting used to querying and cleaning data and haven’t dealt with big data before.
These platforms are typically geared toward getting you to use their services, so you would need to sign up in order to get access to their data.
The Hunter-Gatherers
These hunter-gatherer sites collect data from multiple different sources and store them in their own repositories. They could be centred around a specific industry or they could be a mixture of different topics, each belonging to different sectors and industries.
This is my favourite category to search for data in. Remember what I said about being less interested in collecting the data and more interested in analysing the data? Well, then I guess it’s a no-brainer that I’d be hanging out over here.
However, be warned that some of these sites can have data that is quite dirty, especially if they are user-contributed with minimal documentation. But you’re a data scientist, right? Dirty data is like your home-away-from-home 😏
- Google custom dataset search
- Data.world
- GitHub repo – Awesome public datasets – a massive list of links to sites that host public datasets
- The World Bank
- r/datasets – subreddit
- UCI Machine Learning Repository
- Quandl – economic and financial data
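Since some of these sources can be quite dirty, it can help to size up a file before committing to it. Here is a minimal Python sketch of a first-pass check that counts missing values per column and duplicate rows in a CSV file. The function name `quality_report`, the set of `MISSING_MARKERS`, and the sample data are my own assumptions for illustration and do not come from any of the sites above.

```python
import csv
import io

# Assumption: these strings count as "missing". Adjust for your dataset.
MISSING_MARKERS = {"", "NA", "N/A"}


def quality_report(csv_text: str) -> dict:
    """Count rows, missing values per column, and duplicate rows in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Skip completely blank lines; tuples are hashable, so they can go in a set.
    rows = [tuple(row) for row in reader if row]

    missing = {name: 0 for name in header}
    for row in rows:
        for index, name in enumerate(header):
            # Treat a row that is too short as having missing trailing values.
            value = row[index] if index < len(row) else ""
            if value.strip() in MISSING_MARKERS:
                missing[name] += 1

    # Every repeat of an identical row counts as one duplicate.
    duplicates = len(rows) - len(set(rows))
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


# Made-up sample data, purely for demonstration.
sample = """name,age,city
Ann,34,Leeds
Bob,,York
Ann,34,Leeds
Cat,NA,
"""

report = quality_report(sample)
print(report)
```

To try it on a downloaded file, read the file's contents into a string and pass that to `quality_report`. A report like this won't clean anything for you, but it gives a quick sense of how much work a dataset might need.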
Flood Gates
If you want to keep your data refreshed in close to real time, then open the flood gates and let the stream in.
Some sites make their data available in near real time via an API, but be aware that there are usually limits on the number of API calls you can make, or on the number of records returned, within a certain interval. Beyond those limits, they generally make you pay for more calls or a greater data cap.
- GitHub
- Quantopian (stock trading)
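To give a feel for working within the call limits mentioned above, here is a minimal Python sketch that collects records page by page, stops once a call budget is used up, and pauses between calls. Everything in it is an assumption for illustration: `fetch_all`, `fetch_page`, and the fake pages are hypothetical names and data, not any real site's API. In practice you would replace `fake_fetch_page` with a function that calls the API you are using, and take the budget and pause from that provider's documentation.

```python
import time
from typing import Callable, List


def fetch_all(
    fetch_page: Callable[[int], List[dict]],
    max_calls: int,
    pause_seconds: float = 1.0,
) -> List[dict]:
    """Collect records page by page without exceeding a budget of API calls."""
    records: List[dict] = []
    for page in range(1, max_calls + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page means there is nothing left to fetch
            break
        records.extend(batch)
        time.sleep(pause_seconds)  # spread the calls out over time
    return records


# Stand-in for a real API: three pages of two records each, then nothing.
FAKE_PAGES = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 3}, {"id": 4}],
    3: [{"id": 5}, {"id": 6}],
}


def fake_fetch_page(page: int) -> List[dict]:
    """Hypothetical page fetcher; a real one would make an HTTP request."""
    return FAKE_PAGES.get(page, [])


# With a budget of two calls, only the first two pages are collected.
limited = fetch_all(fake_fetch_page, max_calls=2, pause_seconds=0)
print(len(limited))
```

Passing the page fetcher in as an argument keeps the budgeting logic separate from any particular site, so the same loop can be reused and tested without touching the network.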
Pick a dataset and start your project
Phew! There are a lot of options for datasets! The last thing you want to happen is to get stuck in procrasti-searching – procrastinating by endlessly searching these sites for datasets (did you see what I did there?).
The most important thing is that you just pick a dataset and get started on your project. You will have plenty of time to come back to this list and these sites along your data science journey. One project does not make a data scientist, so keep practising.
If you have a favourite source that you like to get your data from then leave it in the comments below – I would love to expand my list of data sources. You can never have too much data, can you?