Data is a (bad) representation of reality. So data science will save the world.

Regardless of your worldview, I think there is near-universal agreement that humanity is facing a bristling array of existential threats. A few obvious ones: disease outbreaks, war, economic collapse, and environmental damage.

In fact, it seems to me that the possibility of none of these threats materializing in the next 100 years is damn slim… unless we collectively get a lot smarter about detection, decision, and action.

But I think we may be able to save ourselves, because we have lots of data and (to a lesser degree) data scientists.


Most data is an abstract representation of the world, whether it is the decoded genome of a disease, untold reams of simulation output representing possible future states of the world, or measurements of honeybee populations. Unfortunately, data is a very thin representation of reality (which is to say: most dimensions of reality are not captured in most datasets), and the data itself is usually siloed, dirty, and riddled with errors.

Data Scientists

This is where data scientists come in. They clean, correct, and unify data, and then reconstitute it to create as complete a representation of reality as they possibly can. Then they use it to detect disease outbreaks (and maybe in the future create novel vaccines on the fly), to run simulations that help us make decisions to avoid disastrous wartime and economic scenarios, and to aid (natural) scientists in understanding and taking action to heal our environment.

Data and data scientists alone will not suffice. We face monumentally complex problems that will require expertise and hard work from much of humanity to solve. But by interfacing between data (a representation of reality that computers can manipulate) and the real world, data scientists are uniquely positioned to contribute substantially to most, if not all, of humanity's biggest challenges.

My optimism is only slightly tempered by the fact that there are far too few data scientists in the world today. So… have you considered a career in data science? It’s lucrative, it’s sexy, and it’ll save the world.

