Learning data science with John Oliver

[Image: John Oliver]

You know this guy, right? In case you don’t, he is John Oliver, an English comedian with a perspective on the modern world that can only be matched by his distinctive voice!

I saw the video below some time ago. In it, John Oliver presents, in his usual style, what is wrong with how science is used and presented. I won’t discuss the large number of pet peeves I have with what I see in mainstream media or shared on Facebook regarding science, or the lack of it. It would be out of context, too long and, to be intellectually honest, incredibly boring, especially after John’s tremendous piece.

Instead, I want to invite you to watch the video in case you haven’t, and I’ll tell you why I believe his views are also important in the context of data science.

And what does this have to do with data science, apart from the word “science”? As I see it, quite a lot. In this age of big data and massive computational analytics, how statistics and the scientific method should be used is deeply misunderstood. Here are some examples…

“There are so many studies that they seem to contradict one another”

While this doesn’t affect data science products, it consistently affects randomised trials, aka A/B tests, which I personally view as part of data science. In layman’s terms: A is our control group, meaning our current state, and B is the test group that is exposed to our alternative state. If our trial shows that the results of B are better than the results of A, we roll B out to the whole population. This sounds fair, right? Assuming the trial was correctly conducted, then yes, it is fair.
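To make “results of B are better than results of A” a bit more concrete, here is a minimal sketch of one common way to compare two conversion rates, a pooled two-proportion z-test. The function name and the visitor and conversion counts are made up for illustration, and this is only one of several reasonable ways to evaluate such a trial.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference in conversion rate (B minus A).

    Returns the z statistic and the two-sided p-value, using the
    normal approximation with a pooled conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (everyone or no one converted).
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical trial: 2000 users per group, 200 vs 260 conversions.
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only says the difference is unlikely to be noise in this trial, on this population, at this point in time.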

What if you repeat the A/B test after B is live? Should you expect the same results as in the first test? No! The population is no longer naive to B. They are aware of B, they know it exists, and therefore their response is conditioned by that knowledge. It can even be the case that A is now better than B!

This is equally true when performing data analysis. If our products are in a continuous state of change, why do we expect that what we observed in the past never changes? That is the fundamental principle of science! If we have empirical evidence that our existing knowledge is incorrect, we reject it in favour of the new one.

“Some [studies] may be biased to get eye-catching positive results”

Are you a business user? Are you in a position of authority? If you said yes to one or both of these questions and want unbiased results, don’t pressure your analysts and scientists for results. Don’t ask them to fish for the answer you like. Don’t ask for follow-up after follow-up, going deeper and deeper into more and more segmented groups. As someone with responsibility, you are doing them, yourself and your company a disservice.

Why?…

“Playing with your data until you find something that’s statistically significant but probably meaningless”

And this can happen in analysis, machine learning, reporting, trials, research, you name it! Everything that analytics and data science teams do can suffer from this, and it is bad for everyone!

If you torture data long enough, it will confess!

This is one of the most well-known quotes in statistics. Data analysts and scientists can make data confess. We opt not to. We will do a bad job if we do.

This is the time we really need to ask ourselves what it is that we want from analytics in general and data science specifically. Do we want to be effective or to be right? Do we want to find the model that generalises effectively for 1 million users, or do we want to prove the model is wrong for 100 because we “don’t believe those results”?

This is what “playing with data” is all about, and most people don’t even know they are doing it. By asking analysts to dig deeper into ever smaller segments they are, in practice, either working to disprove a perfectly good generalised insight, model or data product, or lowering its statistical power.
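To put a rough number on how quickly “digging deeper” goes wrong, here is a small sketch. It assumes every segment is tested independently at the usual 5% significance level, and it simulates segments in which A and B are truly identical. All function names, segment counts and conversion rates are hypothetical.

```python
import math
import random


def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(num_segments, users_per_arm, true_rate,
                          alpha=0.05, seed=0):
    """Count segments that look 'significant' although A and B are identical."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_segments):
        # Both arms convert at the same true rate: any "effect" is noise.
        conv_a = sum(rng.random() < true_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(users_per_arm))
        if p_value(conv_a, users_per_arm, conv_b, users_per_arm) < alpha:
            hits += 1
    return hits


# With 20 independent segments, a spurious "finding" is more likely than not.
print(f"Chance of a false positive in 20 tests: {family_wise_error_rate(20):.2f}")

# Simulated fishing trip: 200 segments with no real difference at all.
print("Spurious 'significant' segments:", count_false_positives(200, 500, 0.10))
```

Keep asking for one more segment and something will eventually “confess”, even when there is nothing there. Corrections such as dividing the significance level by the number of tests (Bonferroni) exist, but they also make real effects harder to detect in small segments.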

“Find the study that goes best with you and you follow that…”

I don’t really need to say how wrong this is, I hope. I didn’t even pick on sample size, which is used abundantly in the video as an example of problems. I prefer to finish with this one thought: if we use whatever data skills we have, be it statistics, general analytics or data science specifically, only to find the answer that suits us best, we are losing a tremendous opportunity to innovate and, worse, we are killing the credibility of analytics.

As John Oliver says at some point, “Science is, by its nature, imperfect. But it is hugely important.” I raise my glass to that. It is the power to make the best informed decision instead of following the most powerful opinion. It is the opportunity to learn and improve our knowledge instead of maintaining the status quo.

The way to keep analytics and data science relevant and trustworthy is to keep them independent and, more importantly, to understand that the work of data engineers, analysts and scientists in this day and age follows the scientific method, and that it is perfectly OK to not be an expert as long as the experts are left to do their work.
