Daniel Zurawski's Stack Decisions

Daniel Zurawski

Technical Lead at SuperAwesome·Oct 28, 2020


One reason your real-time reporting built on top of MySQL might not be performing well is that you are most likely interested in aggregates (e.g. GROUP BY with SUM, AVG, or TopN). In data warehousing, a distinction is drawn between row-oriented and column-oriented databases. The key point is that a column-oriented DBMS reads only the columns needed to answer a question, instead of scanning every full row of the table. On top of that, aggregates can usually be pre-computed at insertion time instead of at query time.
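
To make the pre-aggregation idea concrete, here is a minimal Python sketch. It is a toy model, not how MySQL or any particular OLAP store is implemented, and the class name `RollupTable`, the one-minute bucket, and the `country`/`revenue` fields are all made up for illustration. Events are folded into one cell per (minute, country) as they arrive, so a later GROUP BY only has to walk the small rolled-up table.

```python
from collections import defaultdict
from datetime import datetime


class RollupTable:
    """Toy pre-aggregating store (hypothetical): one cell per (minute, country)."""

    def __init__(self):
        # (minute bucket, country) -> [event_count, revenue_sum]
        self._cells = defaultdict(lambda: [0, 0.0])

    def insert(self, timestamp: datetime, country: str, revenue: float) -> None:
        # Truncate to the minute so events in the same minute share one cell.
        bucket = timestamp.replace(second=0, microsecond=0)
        cell = self._cells[(bucket, country)]
        cell[0] += 1        # COUNT is maintained at insertion time
        cell[1] += revenue  # SUM is maintained at insertion time

    def revenue_by_country(self) -> dict:
        # Query time: combine the pre-aggregated cells, not the raw events.
        totals = defaultdict(float)
        for (_, country), (_, revenue_sum) in self._cells.items():
            totals[country] += revenue_sum
        return dict(totals)

    def stored_rows(self) -> int:
        return len(self._cells)


table = RollupTable()
table.insert(datetime(2020, 10, 28, 12, 0, 5), "UK", 1.5)
table.insert(datetime(2020, 10, 28, 12, 0, 40), "UK", 2.5)
table.insert(datetime(2020, 10, 28, 12, 1, 10), "PL", 4.0)
```

Three events end up as two stored cells here, because the two UK events fall in the same minute. The trade-off is that detail finer than the chosen bucket is no longer available at query time.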

An excellent modern OLAP tool is Apache Druid, which is an example of a distributed column-oriented data store. I used it successfully for many years to index events from Kafka at a staggering rate and to query millions of events in less than a second. There are of course many more technologies out there for answering OLAP business intelligence questions, but personally, I think you won't get very far with a traditional RDBMS or a Lucene-based search engine like Elasticsearch when building a business intelligence database for vast amounts of data.
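
As a rough illustration of what such an aggregate query could look like, the Python sketch below builds an HTTP request for Druid's SQL endpoint (`/druid/v2/sql`). The datasource name `events`, the `country` and `revenue` columns, and the `http://localhost:8888` address are assumptions made up for this example; only `__time` is Druid's built-in timestamp column. The request is constructed but not sent, since sending it needs a running cluster.

```python
import json
import urllib.request


def build_druid_sql_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Druid's SQL API."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and columns: top 10 countries by revenue in the last hour.
TOP_COUNTRIES_SQL = """
SELECT country, SUM(revenue) AS total_revenue
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 10
"""

request = build_druid_sql_request("http://localhost:8888", TOP_COUNTRIES_SQL)

# Against a live cluster, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

The query only touches three columns (`__time`, `country`, `revenue`), which is the access pattern a column-oriented store is designed for.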

"Apache Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation."

If you don't want to invest resources into deploying and hosting it yourself, there are companies that can host it for you, but I will leave that up to you to research.

Here is an excellent article by my former colleagues explaining how they implemented real-time analytics on top of Druid: https://medium.com/superawesome-engineering/how-we-use-apache-druids-real-time-analytics-to-power-kidtech-at-superawesome-8da6a0fb28b1. I also recommend reading through this Hacker News thread, which discusses time-series databases in depth: https://news.ycombinator.com/item?id=18403507.

How we use Apache Druid’s real-time analytics to power kidtech at SuperAwesome | by Natasha Mulla | SuperAwesome Engineering | Sep, 2020 | Medium (medium.com)
