Databases and tables for game analytics

databases_and_tables.jpg

It has been some time so let’s recap how we got here. First I gave an overview of what a game analytics stack can be. Then I moved to the planning stage pointing the steps from zero to data science. In the last couple of posts in this category I wrote about basic events. First how to think and define them, later on the structure of the data created from those events. The last couple of posts were about user state, what it is and how we can use it.

I think it is abundantly clear that there is method in the madness! Today I’ll write about the databases and tables needed for basic reporting. Not only the definition of the fields but also different structures and technical considerations. 

What are reporting tables?

Let’s define reporting as self serving dashboards available to business users. The metrics and analytics in these dashboards are automatically updated on a given time frame and they are equal for every single game.

We already defined the questions and Key Performance Indicators (KPIs) we want to answer and use in previous posts, now it is all a matter of “simply” defining the tables’ structures and there are a number of things to consider that depend on how data informs the business.

What time frames do you need?

By time frame I refer to the granularity of the unit of time. Different business users have different needs. Reporting to the CEO or CFO probably begs for a monthly, more aggregated and therefor less granular view of the business KPIs. On the other hand product owners probably want a more granular view, some weekly, others daily and many, all of the above.

There are rare occasions where hourly and even real time or near real time analytics are necessary. These not only offer new challenges on their own but it is likely that the KPIs you’ll use are slightly different. For now let us focus on daily, weekly and monthly since those will be very similar if not equal in structure and address the hourly tables on a later date.

Considering this, the only point that is relevant is that you’ll need different processes to create different tables with the same variables in different time frames. It should be relatively easy to get these going but don’t forget to consider it!

Should you separate cohort tables from metrics tables?

If you recall from the second post on basic events, I divided datasets (in this case, tables) between metrics and cohort. The fundamental difference is the date. On metrics tables we are concerned about the date where something happened. On cohort tables we are concerned with the groups of players that started playing in the same day, the acquisition day.

We can have one table with both cohort and metrics. This table will have two dates as dimensions, the activity date and the acquisition date. Such table allows both performance metrics dashboards and cohort analysis dashboards but there are a couple of things you’ll need to consider.

Size matters!

The first is size of the table in number of rows. A table that serves both “masters” and has country and platform can be quite big. If each row is one combination of each dimension, then the number of rows of 1 year worth of cohorts across 1 year worth of retained days for 195 countries and 4 platform stores (I’m assuming Apple, Google, Amazon and Windows) will result in a table with approximately 100 million rows… per game per year!

Another solution (one that we use at Miniclip) is to have metrics tables without acquisition date and cohort tables with a selected number of retained days. We currently use retention days 1, 2, 7, 15, 30, 60, 90, 180 and 365. That means our metrics tables will have less than 285 thousand rows a year which is manageable on MySQL boxes. Our cohorts tables will have around 2 million rows a year.

I cannot give you a definitive solution for this. It depends on how you manage your data warehouses and database systems. But keep in mind that these have to be lightning fast for business users to use and abuse.

Last but not least…

How to handle calculated aggregations?

Calculated aggregations are things like Average Revenue Per User (ARPU). Should these be part of the tables? Meaning should they be pre calculated and inserted? In my humble opinion, no. I’ll go even further and say never!

We have a bunch of aggregations already, that’s the whole point of having these tables. Daily Active Users (DAU) is a count, Revenue is a sum, etc. But everything that is not counts and sums shouldn’t be in the table. On one hand because we are likely to have to recalculate it again. On the other hand because we don’t want to change tables every time we decide to put a new metric up.

Conclusions

I hope this post gave you enough food for thought in terms of defining the basic reporting tables. There are some things to it but overall it’s just a matter of making technical choices to support our games.

To wrap up this part, we need visualisations!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s