Notes for contributors

It’s easy to find datasets online. What’s more difficult is finding quality datasets that are suitable for specific training and development needs. On Real World Data Science we aim to solve that problem.

Our Datasets section will provide a curated list of recommended datasets along with detailed notes and guidance on what each dataset contains, how it is structured, and how best to make use of it. In particular, we want to highlight messy rather than pristine datasets – ones that capture the imperfections and oddities found in real-world data – so that users can practise not only data analysis and modelling but also data cleaning and preparation.


If you have a dataset to recommend, your submission must cover the following areas:

  • Dataset name
  • Link to source
  • What data science tasks/methods can this dataset be used to demonstrate?
  • Have you used this dataset for your own teaching/learning? (see Advice and recommendations below)
  • Why was the dataset originally created?
  • When was it created?
  • Who created it?
  • Licences/restrictions?
  • Size of dataset
  • Data types/description
  • Real/synthetic data?
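
If it helps to check a draft before submitting, the areas above can be sketched as a simple record. This is only an illustrative sketch, not an official submission format: the class name, field names and helper function are all hypothetical, chosen here to mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetSubmission:
    # Hypothetical field names, one per area in the list above.
    name: str = ""
    source_link: str = ""
    tasks_and_methods: str = ""
    used_for_teaching_or_learning: str = ""
    original_purpose: str = ""
    created_when: str = ""
    created_by: str = ""
    licences_and_restrictions: str = ""
    size: str = ""
    data_types_and_description: str = ""
    real_or_synthetic: str = ""


def missing_areas(submission: DatasetSubmission) -> list[str]:
    """Return the names of areas that are still empty or whitespace-only."""
    return [
        f.name
        for f in fields(submission)
        if not getattr(submission, f.name).strip()
    ]


# Placeholder values for illustration only.
draft = DatasetSubmission(
    name="Example dataset",
    source_link="https://example.org/data",
)

# Lists every area that has not been filled in yet.
print(missing_areas(draft))
```

An empty list from `missing_areas` would suggest that every area has at least some text in it; it says nothing about whether the answers are accurate or sufficiently detailed.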

Advice and recommendations

Help others to make good use of your recommended dataset. If you have used the dataset for your own teaching or learning, please consider creating an exercise for platform users to complete. If you encountered the dataset as part of a training course, competition or exercise created by a third party, be sure to credit them.