The problem with data science MOOCs


Massive Open Online Courses (MOOCs) are a big thing now, and to be honest I’m a great fan of them. They allow knowledge to be shared, often knowledge that would otherwise only spread within a specific context: a university, a country, an industry, a company. That knowledge is now made available to the world, often for free.

And that is a good thing.

It shouldn’t be a surprise that data, one of the hot areas of the moment and fundamentally a technological one, is a good fit for MOOCs. However, there is a problem. So, have you been taking, or thinking about taking, massive open online courses on subjects like data analysis, data science or machine learning? Grab the popcorn and I’ll rant a bit.

Someone who works with data, especially with machine learning and/or data science, often faces questions like: “Have you used deep learning?” or “What is the best machine learning algorithm?”

It’s quite alright when these questions come from laymen. Maybe they have some interest and follow some information source that touches on the subject. However, I don’t expect someone with at least some exposure to the field to ask these questions because they are fundamentally… well… nonsense!

What is the best machine learning algorithm?

Let’s forget the deep learning question and focus on this one instead. There is no such thing as a best machine learning algorithm! There are a bunch of tasks, and each task has a bunch of algorithms that can perform it. Each algorithm comes with its own assumptions, pros and cons.

I can add other questions like “What are the top algorithms?” and ones with specifics that add nothing to the question, like “What is the best algorithm for classification in social sciences?” or “What algorithm should I use on a dataset with 1 million rows?”

The plot thickens when these questions come from people who have had some exposure to data science and machine learning. Note that the last two questions assume some knowledge of the subject, at the very least knowledge about applying machine learning algorithms.

What does this have to do with MOOCs?

When it comes to MOOCs, the formula for teaching machine learning algorithms is to present the task, present some algorithms that apply to the task and then fit a model using one of those algorithms on a specific dataset.

People are trained to think about algorithms and datasets. There’s nothing fundamentally wrong with it. The problem is that it ends there. What is missing is training people to think about the problem.

To be honest, the awesome people behind the MOOCs are well aware of this and do their best to prevent it. Some discuss it in the course intros, others mention applications of the tasks in real-life examples. Some of the MOOCs’ final projects present a real-life problem. But I have never seen a MOOC where people are trained to think about the problem with the same level of depth that they are trained to use algorithms on datasets.

So… what’s the problem with that?

There are several problems but I can boil them down to this: it’s the wrong goal. It is the difference between these two tasks:

Classify the users as churners or non-churners.

…and…

Reduce churn.

In the first one, the goal is to classify the users. You are likely to have a prebuilt dataset with users labeled as churners and non-churners. Your whole goal is to fit the model. You’ll consider using that algorithm you read about; after all, it’s just 1000 rows!

In the second one you have no dataset. Probably no labels. You have to think it through, speak with people in the product team and try to understand what it is that they want to affect: is it D0? D7? How do they want to receive the outcome? Is it the likelihood of churn? Is it a boolean? Should you put it in a database? Send it to an API? Are there computational constraints? Product constraints? Are you supposed to act or inform?
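To make the “likelihood or boolean” question a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function name `format_outcome`, the field names and the 0.5 threshold are my own assumptions for illustration, not something a real product team asked for.

```python
def format_outcome(user_id, churn_probability, as_boolean=False, threshold=0.5):
    """Package one model score in the shape the consumer asked for.

    All names and the default threshold are illustrative assumptions.
    """
    if as_boolean:
        # A yes/no flag hides the uncertainty and forces someone to
        # pick a threshold, which is a product decision.
        return {"user_id": user_id, "will_churn": churn_probability >= threshold}
    # A likelihood leaves the cut-off to whoever consumes the outcome.
    return {"user_id": user_id, "churn_probability": round(churn_probability, 2)}


as_likelihood = format_outcome("u1", 0.734)
as_flag = format_outcome("u1", 0.734, as_boolean=True)
```

The same score ends up as two different deliverables, and choosing between them is a conversation with the product team, not a modelling step.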

Only now will you think about the dataset you need to create, what task it is and how you’ll communicate the outcome, and only after all of these steps will you fit the model. And then validate it… and then see how you can put it live… and then evaluate it…
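As a small sketch of what “the dataset you need to create” can mean, the Python below derives churn labels from a raw activity log. It is only an illustration under loud assumptions: the log is made up, `label_churn` is a hypothetical helper, and I am assuming that a user counts as retained at day N if they have at least one activity N or more days after signup. Your product team may well define it differently.

```python
from datetime import date

# Made-up activity log: (user_id, signup_date, activity_date).
EVENTS = [
    ("u1", date(2024, 1, 1), date(2024, 1, 1)),
    ("u1", date(2024, 1, 1), date(2024, 1, 9)),
    ("u2", date(2024, 1, 1), date(2024, 1, 1)),
    ("u2", date(2024, 1, 1), date(2024, 1, 3)),
    ("u3", date(2024, 1, 2), date(2024, 1, 2)),
]


def label_churn(events, horizon_days):
    """Return {user_id: True if churned, False if retained}.

    Assumed definition: a user is retained if at least one activity
    happens `horizon_days` or more days after their signup date.
    """
    retained = {}
    for user_id, signup, active in events:
        days_since_signup = (active - signup).days
        came_back = days_since_signup >= horizon_days
        retained[user_id] = retained.get(user_id, False) or came_back
    return {user_id: not kept for user_id, kept in retained.items()}


labels_d7 = label_churn(EVENTS, horizon_days=7)
labels_d1 = label_churn(EVENTS, horizon_days=1)
```

With this toy log, `u2` is a churner under the day 7 horizon but not under the day 1 horizon. The labels are a consequence of the definition you agreed on, which is exactly the part a prebuilt dataset hides from you.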

No MOOC that I know of teaches this kind of thought process. It is mostly about algorithms and datasets. This makes people obsessed with hyped algorithms and interesting datasets when they should be obsessed with deploying solutions that add value. The thought process is the difference between data science and machine learning being the path or the end goal.

Still… MOOCs are awesome!

Don’t let this rant feel like an attack on MOOCs. It isn’t!

The people behind many great MOOCs out there have gone to a lot of work to make this information available to as many people as possible, and without them a lot of the sharing of machine learning and data science knowledge would not be possible.

The criticism is not that they are bad but that they are incomplete when it comes to preparing people for real-life problems.
