Apache Kylin vs Druid

Overview

Druid: Stacks 376, Followers 867, Votes 32

Apache Kylin: Stacks 61, Followers 236, Votes 24, GitHub Stars 3.8K, Forks 1.5K

Apache Kylin vs Druid: What are the differences?

Introduction

This comparison presents the key differences between Apache Kylin and Druid, two popular open-source projects for big data analytics.

Data Processing Approach: Apache Kylin is an online analytical processing (OLAP) engine that precomputes aggregates into cubes (MOLAP), so most queries are answered from pre-calculated results rather than by scanning raw data. Druid, by contrast, is a distributed, column-oriented, real-time analytics data store designed to ingest and query high volumes of event data. It indexes data into columnar segments at ingestion time and aggregates them at query time.
Query Capabilities: Apache Kylin exposes an ANSI SQL interface and supports OLAP queries with features such as group-by, distinct count, and top-N. It offers dimensional modeling and lets users explore multi-dimensional data sets efficiently. Druid, by contrast, focuses on ad-hoc querying and targets sub-second response times for real-time data exploration. It excels at filtering, aggregating, and slicing and dicing data along the time dimension, but its SQL support is more limited and joins are not supported well.
Data Ingestion and Storage: Apache Kylin primarily relies on the Apache Hadoop ecosystem: it reads source tables from Hive, builds cubes with distributed jobs, and has traditionally stored the resulting cubes in HBase. In contrast, Druid has its own ingestion engine that supports both batch and streaming sources, including Apache Kafka. Druid stores data as columnar segment files that are persisted to deep storage and memory-mapped on the nodes that serve queries, rather than being held purely in memory.
Scalability and Performance: Apache Kylin handles large data volumes by moving cost to build time: cubes are built with distributed jobs, and queries then read pre-aggregated results. The trade-off is that cube builds consume extra compute and storage, and the cube grows with the number of dimension combinations it covers. Druid is designed to scale horizontally by adding nodes and is used for fast aggregate queries on petabyte-sized data sets. It can deliver near real-time analytics at large scale.
Data Model Flexibility: Apache Kylin supports the star and snowflake schemas commonly used in OLAP systems. Users define cubes over these schemas to optimize query performance for specific use cases. In contrast, Druid follows a denormalized, flat-table data model in which each row carries a timestamp, dimensions, and metrics. This suits ad-hoc querying and multidimensional analysis on event data, but related tables usually have to be joined before ingestion.
Ecosystem Integration: Apache Kylin integrates with the Apache Hadoop ecosystem and related tools such as Hive, HBase, and Spark, and connects to BI tools such as Tableau, Power BI, and Superset. Druid integrates with data sources including Kafka, Hadoop, and cloud storage such as Amazon S3, depends on ZooKeeper for cluster coordination, and works with visualization tools such as Apache Superset and Tableau.
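
The difference between answering from a pre-calculated cube and aggregating a flat table at query time can be illustrated with a small sketch. This is a toy model in plain Python, not code from either project: the rows, the dimension names, and the functions `build_cube`, `query_cube`, and `query_scan` are all hypothetical, and real systems add encoding, indexing, and distribution on top of this idea.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (day, country, device, revenue)
ROWS = [
    ("2024-01-01", "US", "mobile", 10.0),
    ("2024-01-01", "US", "desktop", 5.0),
    ("2024-01-01", "DE", "mobile", 7.0),
    ("2024-01-02", "US", "mobile", 3.0),
]
DIMENSIONS = ("day", "country", "device")


def build_cube(rows):
    """Pre-aggregate SUM(revenue) for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cuboid)
    return cube


def query_cube(cube, group_by):
    """Cube-style: answer a GROUP BY by looking up the matching cuboid."""
    return cube[tuple(d for d in DIMENSIONS if d in group_by)]


def query_scan(rows, group_by, start_day, end_day):
    """Flat-table style: filter on the time column, then aggregate at query time."""
    idx = [i for i, d in enumerate(DIMENSIONS) if d in group_by]
    result = defaultdict(float)
    for row in rows:
        if start_day <= row[0] <= end_day:
            result[tuple(row[i] for i in idx)] += row[-1]
    return dict(result)


if __name__ == "__main__":
    cube = build_cube(ROWS)
    print(len(cube), "cuboids for", len(DIMENSIONS), "dimensions")
    print(query_cube(cube, {"country"}))
    print(query_scan(ROWS, {"country"}, "2024-01-01", "2024-01-01"))
```

With three dimensions the sketch builds 2^3 = 8 cuboids, which shows why cube size and build cost grow quickly as dimensions are added, while the scan-style query pays its cost on every request instead.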

In summary, Apache Kylin is an OLAP engine that focuses on SQL queries over pre-calculated cubes and dimensional modeling, while Druid is a real-time analytics data store that excels at ad-hoc querying and real-time data exploration. Kylin leverages the Hadoop ecosystem for cube building and storage, while Druid has its own ingestion engine and columnar segment storage. Both projects scale to very large data sets but differ in data model flexibility and ecosystem integrations.
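
Both systems accept SQL over HTTP, which gives a quick way to compare them side by side. The sketch below only builds the requests and does not send them. It assumes Druid's SQL endpoint at `/druid/v2/sql` and Kylin's query endpoint at `/kylin/api/query` with HTTP Basic authentication; the hosts, ports, credentials, project name, and table names are placeholders for a local test setup and should be checked against the documentation for the version in use.

```python
import base64
import json
import urllib.request

# Assumed local endpoints; adjust host and port for a real deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"
KYLIN_QUERY_URL = "http://localhost:7070/kylin/api/query"


def druid_sql_request(sql, url=DRUID_SQL_URL):
    """Build a POST request carrying a Druid SQL query as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def kylin_sql_request(sql, project, user, password, url=KYLIN_QUERY_URL):
    """Build a POST request carrying a Kylin SQL query with Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder datasource, table, project, and credentials.
    druid_req = druid_sql_request(
        "SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel"
    )
    kylin_req = kylin_sql_request(
        "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        project="learn_kylin",
        user="ADMIN",
        password="KYLIN",
    )
    print(druid_req.get_method(), druid_req.full_url)
    print(kylin_req.get_method(), kylin_req.full_url)
    # To execute against a running cluster: urllib.request.urlopen(druid_req)
```

Keeping request construction separate from sending makes it easy to inspect the payloads before pointing them at a real cluster.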


Detailed Comparison

Druid	Apache Kylin
Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.	Apache Kylin™ is an open source Distributed Analytics Engine, originally contributed by eBay Inc., designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark for extremely large datasets.
-	Extremely Fast OLAP Engine at Scale; ANSI SQL Interface on Hadoop; Interactive Query Capability; MOLAP Cube; Seamless Integration with BI Tools
Statistics
GitHub Stars -	GitHub Stars 3.8K
GitHub Forks -	GitHub Forks 1.5K
Stacks 376	Stacks 61
Followers 867	Followers 236
Votes 32	Votes 24
Pros & Cons
Pros: Real Time Aggregations (15); Batch and Real-Time Ingestion (6); OLAP (5); OLAP + OLTP (3); Combining stream and historical analytics (2)	Pros: Star schema and snowflake schema support (7); Seamless BI integration (5); OLAP on Hadoop (4); Easy install (3); Sub-second latency on extremely large datasets (3)
Cons: Limited SQL support (3); Joins are not supported well (2); Complexity (1)	-
Integrations
Zookeeper	Hadoop, Apache Spark, Tableau, PowerBI, Superset

What are some alternatives to Druid, Apache Kylin?

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto

Distributed SQL Query Engine for Big Data

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Splunk

Splunk provides a platform for Operational Intelligence. Customers use it to search, monitor, analyze, and visualize machine data.

Apache Impala

Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

Vertica

Vertica provides a unified analytics platform designed to be independent of the underlying infrastructure.

Azure Synapse

It is an analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. It brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.

Apache Kudu

A new addition to the open source Apache Hadoop ecosystem, Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
