As you probably know, a few months ago we introduced a powerful addition to our analytics stack – Custom Reports. Equipped with a new engine, the new module has made our analytics faster and more agile than ever.
It wouldn’t have been possible without the powerful database management system offered by ClickHouse (kudos, guys!). In this blog post we want to show you how this system contributed to the advances made with our product.
What we wanted to improve
While creating Custom Reports, we had a few important things in mind. We wanted to:
- make it as easy to use as possible – our users should be able to take advantage of the reports without database know-how or technical expertise
- provide our clients with more variety – give them multiple ways to visualize their data, and of course
- make our reports generate faster, without the initial preprocessing step.
We’ve rethought many elements to make these improvements, but the biggest challenge was with the database back end. To make our reports generate faster, we had to abandon the row-oriented MySQL once and for all and switch to a more flexible solution.
Why we chose ClickHouse
The obvious choice was to transition to the column-oriented ClickHouse.
Why column-oriented? Because this design is far more efficient at processing queries that use only a small (but arbitrary) subset of columns, which is typical for analytics workloads.
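To see why that matters, here is a toy illustration (deliberately simplified, and not how ClickHouse is implemented internally): the same table stored row-wise and column-wise, and what a single-column aggregate has to touch in each layout. All field names are made up for the example.

```python
rows = [  # row-oriented: each record is stored together
    {"visitor_id": 1, "country": "PL", "page_views": 5, "duration_s": 120},
    {"visitor_id": 2, "country": "DE", "page_views": 3, "duration_s": 45},
    {"visitor_id": 3, "country": "PL", "page_views": 8, "duration_s": 300},
]

columns = {  # column-oriented: each column is stored together
    "visitor_id": [1, 2, 3],
    "country": ["PL", "DE", "PL"],
    "page_views": [5, 3, 8],
    "duration_s": [120, 45, 300],
}

# Row store: every record (all four columns) must be read to sum one column.
total_row_store = sum(r["page_views"] for r in rows)

# Column store: only the one relevant column is read;
# the other three columns are never touched.
total_col_store = sum(columns["page_views"])

assert total_row_store == total_col_store == 16
```

With hundreds of columns and billions of rows, skipping the irrelevant columns is what makes this kind of query fast.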
Here you can see a visual comparison prepared by Yandex:
Quite a difference, right?
And why ClickHouse specifically?
Because it’s open-source software that was built to solve the very problem we were facing.
If you want to delve deeper into the technicalities of ClickHouse or compare it to other solutions, follow this link.
How we prepared for the task
However, to make the best use of a column-oriented database, we had to make major changes to our data model. In essence, this meant converting the normalized Piwik PRO database schema into wide tables that require no joins at runtime (a strategy called denormalization).
This approach makes filtering large amounts of data far more efficient (and actually feasible for real-time queries). It's especially useful for user-defined segments, which rely on arbitrary columns and as such can't be easily indexed.
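The idea can be sketched in a few lines. The schema below is hypothetical and heavily simplified, not our actual one: a normalized visit/visitor pair needs a join at query time, while a denormalized wide row lets any user-defined segment run as a plain scan over a single table.

```python
# Normalized: visitor attributes live in a separate table.
visitors = {101: {"country": "PL", "device": "mobile"}}
visits = [{"visit_id": 1, "visitor_id": 101, "page_views": 7}]

# A runtime join is needed before filtering on visitor attributes.
joined = [{**v, **visitors[v["visitor_id"]]} for v in visits]

# Denormalized: visitor attributes are copied into each visit row up front,
# so a segment over arbitrary columns is a simple single-table scan.
wide_visits = [
    {"visit_id": 1, "visitor_id": 101, "page_views": 7,
     "country": "PL", "device": "mobile"},
]

segment = [r for r in wide_visits
           if r["country"] == "PL" and r["page_views"] > 5]

assert segment == wide_visits
assert joined[0]["country"] == "PL"
```

The trade-off is storage: attributes are duplicated across rows, which is exactly where columnar compression helps.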
As another performance optimization, we’ve decided to keep the distinction between sessions and events. This way all the reports that don’t require the additional event-level dimensions can be calculated even faster, and the excellent data compression offered by ClickHouse makes it a worthy trade-off.
To make all of this happen, we've implemented an ETL process that, every few minutes, retrieves a new batch of data from MySQL, processes it asynchronously in smaller chunks, and loads the results into ClickHouse.
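A minimal sketch of such a batch loop is below. Everything here is a hypothetical stand-in, not our actual pipeline: the "databases" are in-memory lists, the chunking is synchronous for readability (in practice the chunks are processed asynchronously), and the watermark is a simple last-loaded id.

```python
CHUNK_SIZE = 2

mysql_rows = [  # pretend source: rows that appeared since the last run
    {"id": i, "url": f"/page/{i}", "views": i * 10} for i in range(1, 6)
]
clickhouse_rows = []  # pretend destination table

def transform(chunk):
    # e.g. denormalize and convert legacy fields to the new schema
    return [{**row, "url": row["url"].rstrip("/")} for row in chunk]

def run_batch(source, destination, last_loaded_id):
    # extract: only rows newer than the watermark
    batch = [r for r in source if r["id"] > last_loaded_id]
    # transform + load in smaller chunks
    for start in range(0, len(batch), CHUNK_SIZE):
        chunk = batch[start:start + CHUNK_SIZE]
        destination.extend(transform(chunk))
    # return the new watermark for the next run
    return max((r["id"] for r in destination), default=last_loaded_id)

watermark = run_batch(mysql_rows, clickhouse_rows, last_loaded_id=0)
assert watermark == 5 and len(clickhouse_rows) == 5
```

Loading in batches rather than row by row also plays to ClickHouse's strengths, since it is optimized for large bulk inserts.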
We've also taken the opportunity to convert some of the stored data to modern formats and to simplify its structure wherever we could, so that interacting with it (e.g. through our API) is as easy as possible.
Was it worth the struggle?
Overall, it was definitely worth it.
Thanks to these improvements:
- reports are now generated in real time
- the resource-expensive preprocessing previously required, especially when working with custom data segments, is no longer needed
If you're interested in delving deeper into the subject, feel free to reach out to us anytime. Our team will be happy to answer all your burning questions!