The Dataset is a bi-weekly blog and newsletter about the datasets behind artificial intelligence, read by many experts.


Subscribe here.

This site is maintained by Arun Rao. I work on a Machine Learning personalization (P13N) team at Amazon Music. Before that, I was a co-founder of Starbutter AI (an NLP/conversational agent startup) and a quantitative bond trader at PIMCO (which manages $2 trillion in assets). I tweet @raohackr. You can email me at [hermesfeet] at gmail dot com.

I cover some of the most interesting datasets in the world and what top academic and commercial researchers are doing with them. This is my personal project for collecting and analyzing publicly available datasets, and it is completely unaffiliated with any past or present employer.


Google: The Unreasonable Effectiveness of Data – 2009

Google: The Unreasonable Effectiveness of Data – Revisited 2017

OpenAI: GPT-3: Language Models are Few-Shot Learners

MIT: The Data Nutrition Project

OpenAI/Jack Clark: Import AI on new Machine Learning advances

UC Irvine: Machine Learning Repository of Datasets

Primer on AI and Machine Learning (Part 1) — Beginners Level (Non-Technical)

Machine Learning (including Deep Learning and Reinforcement Learning) for Engineers — A Technical Primer (Part 2)

Data Elixir: News and Resources for Data Science Practitioners

What People Say

Without big data, you are blind and deaf and in the middle of a freeway.

Geoffrey Moore

Data is the new oil.

Clive Humby

Today’s datasets are not very large. It’s yesterday’s datasets that were ridiculously small.

Pedro Domingos
