I’m starting this category “All things data” to write posts about data for the sake of data. Some of them are simple but play a big role, like this one. Others are complex. There will be a twist, though. The examples I’ll show will be in the context of games analytics. There will be no Iris dataset to fit a logistic regression. If you are smiling you know what I’m talking about. If you are not, don’t worry, we’ll get to that.
If you do data analysis, the topics in posts like this one are the foundation of your work, and if you are like me, they are second nature and you don’t even think about them. I felt it was important to write these things in my own words, with examples of the kind of datasets I work with. If it helps someone better understand data, great. In the future maybe it will help someone fit a model properly.
Today the topic is simply data: the simplest, most down-to-earth definition of what data are.
What are data?
Some people (like two…) have asked me why I say “data are” instead of “data is”. So let’s start with that. “Data” is the plural of “datum”. One data point is a datum. Many data points are data.
Data are collections of facts that we have knowledge of. This is an important concept since we never have perfect information, regardless of how much data we have. Many people think we deal with perfect information and present absolute certainties. That could not be further from the truth. Every statistician, data analyst and data scientist works in a world of uncertainty.
Types of data
Data can be numerical or categorical. If you think of a dataset, measures are numerical data and dimensions are categorical data.
Numerical data can be discrete or continuous. The easiest way to describe the difference is saying that integers are discrete and real numbers are continuous. A player level is a discrete data point while retention rate is a continuous data point.
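Here is a minimal sketch of the difference in Python. The numbers are made up for illustration and the variable names are my own, not from any real game.

```python
from statistics import mean

# Discrete: player levels are whole numbers, there is no level 3.5.
player_levels = [1, 3, 3, 7, 12]

# Continuous: a retention rate can take any value between 0 and 1.
installs = 200          # hypothetical number of new players
returned_day_1 = 70     # hypothetical number who came back the next day
retention_rate = returned_day_1 / installs

# Summarising discrete data usually gives a continuous value.
average_level = mean(player_levels)

print(retention_rate, average_level)
```

Note that the average level is not itself a valid level. The data are discrete, but the summary we compute from them is continuous.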
Categorical data can be nominal or ordinal. Ordinal data differ from nominal data because their categories have a determined order. Think of ordinal data as ranked categorical data. Let’s say the players in your game have a rank for their achievements: rookie, novice and veteran. That would be ordinal data. The country where they are from is nominal.
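One way to sketch this in Python is with an `IntEnum` for the rank and plain strings for the country. The player records below are invented, and using `IntEnum` is just my choice for the example, not the only way to model ordinal data.

```python
from collections import Counter
from enum import IntEnum


class Rank(IntEnum):
    # Ordinal: the numbers only encode the order rookie < novice < veteran.
    ROOKIE = 1
    NOVICE = 2
    VETERAN = 3


# Hypothetical player records, made up for illustration.
players = [
    {"id": "p1", "rank": Rank.VETERAN, "country": "PT"},
    {"id": "p2", "rank": Rank.ROOKIE, "country": "BR"},
    {"id": "p3", "rank": Rank.NOVICE, "country": "PT"},
]

# Sorting by rank is meaningful because the order is part of the data.
ids_by_rank = [p["id"] for p in sorted(players, key=lambda p: p["rank"])]

# For nominal data, counting and equality are the operations that make sense.
country_counts = Counter(p["country"] for p in players)

print(ids_by_rank, country_counts)
```

You could sort the countries alphabetically too, but that order says nothing about the players. The order of the ranks does.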
Idiosyncrasies of data types
My favourite is dates and times. Time is continuous but we store it discretely, e.g. UNIX timestamp. Later on we display it as an ordinal variable, like a date or a month. And sometimes we even use it as a nominal variable, e.g. day of the week.
Note that this is mostly related to our interpretation, not the data itself.
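To make the dates and times example concrete, here is a small sketch that takes one UNIX timestamp and reads it three ways. The timestamp is an arbitrary value I picked, and I am assuming UTC to keep the output the same on every machine.

```python
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Stored discretely: whole seconds since 1970-01-01 UTC (arbitrary example).
session_start = 1700000000

moment = datetime.fromtimestamp(session_start, tz=timezone.utc)

# Displayed as ordinal values: dates and months have a natural order.
as_date = moment.date().isoformat()
as_month = moment.strftime("%Y-%m")

# Used as a nominal value: a label for grouping sessions.
day_of_week = DAY_NAMES[moment.weekday()]

print(as_date, as_month, day_of_week)
```

The underlying value never changes. Only the way we choose to read it does.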
How we classify data matters most to our understanding of them, that is, the information we get from the data and how we perceive it. For instance, categorical data are, in fact, discrete data. Many systems use integer indices to manage categorical data in memory more efficiently. This means that many models are working with integers and not our human-readable categories.
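A rough sketch of that integer encoding, written by hand so the mechanics are visible. Real libraries do this for you. The country codes are invented sample values, and the alphabetical ordering of the categories is an arbitrary choice on my part.

```python
# Hypothetical nominal column: the country of each player.
countries = ["PT", "BR", "PT", "US", "BR", "PT"]

# Store each distinct category once...
categories = sorted(set(countries))
code_of = {category: index for index, category in enumerate(categories)}

# ...and keep only a small integer per row.
codes = [code_of[country] for country in countries]

# The labels can always be recovered from the integers.
decoded = [categories[code] for code in codes]

print(categories, codes)
```

The integers here are labels, not quantities. “US” having code 2 does not make it twice “PT”, which is why it still matters that we know the column is nominal.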
Why does this matter?
Understanding data on its own allows us to make better decisions when presenting, analysing and modelling it.
A good visualisation depends on what you want to show. For instance, if you want to show the relationship between two continuous variables, you should choose a scatterplot. If you want to see how a continuous variable differs across categories, you should choose a box plot. For counts of a categorical variable, choose a bar chart; for the distribution of a continuous variable, a histogram or a density plot.
Adding more variables changes your options depending on the data types of those variables. The same applies to models, statistical tests and many other data science tasks.
These are very simple concepts that have a huge impact on your decisions about how to understand data.
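The idea that the data types drive the choice of chart can be sketched as a tiny lookup. The function name `suggest_chart` and the type labels are hypothetical, and I use “bar chart” as the name of the plot for category counts. It covers only the one and two variable cases mentioned above, so treat it as an illustration and not a rule book.

```python
def suggest_chart(first_type, second_type=None):
    """Suggest a chart from variable types: 'continuous' or 'categorical'."""
    valid = {"continuous", "categorical"}
    given = [first_type] if second_type is None else [first_type, second_type]
    if any(t not in valid for t in given):
        raise ValueError(f"unknown variable type in {given}")

    # One variable: show its distribution or its counts.
    if second_type is None:
        if first_type == "continuous":
            return "density plot"
        return "bar chart"

    # Two variables: the combination of types decides.
    kinds = {first_type, second_type}
    if kinds == {"continuous"}:
        return "scatterplot"
    if kinds == {"continuous", "categorical"}:
        return "box plot"
    return None  # two categorical variables are not covered in this sketch


print(suggest_chart("continuous", "continuous"))
print(suggest_chart("categorical", "continuous"))
```

With a third variable the lookup would grow quickly, which is the point: the options multiply with the data types involved.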