Druid vs Presto vs Apache Spark

Apache Spark vs Druid vs Presto: What are the differences?

Apache Spark, Druid, and Presto are all popular tools used in big data processing. Each of these tools has its own set of strengths and weaknesses that make them suitable for different use cases.

  1. Data Processing Paradigm: Apache Spark is a general-purpose distributed processing framework that provides high-level APIs in programming languages like Java, Scala, and Python. It is especially useful for iterative algorithms and interactive data mining. Druid, on the other hand, is a high-performance, real-time analytics database specifically designed for fast ad-hoc queries on large datasets. Presto is a distributed SQL query engine optimized for low-latency interactive queries on various data sources.

  2. Data Storage: Apache Spark does not have its own storage capabilities but can read and process data from various data sources like HDFS, Cassandra, and S3. Druid is designed to store and query large volumes of data in a column-oriented, distributed architecture, optimized for fast query performance. Presto supports querying data where it resides, allowing users to query data from multiple sources like HDFS, MySQL, and Cassandra without the need to move or replicate the data.

  3. Query Speed and Performance: Apache Spark is known for its in-memory computation capabilities, which can significantly improve processing speed for iterative algorithms and machine learning tasks. Druid is optimized for sub-second query performance, making it ideal for real-time analytics and dashboard applications. Presto is designed for low-latency interactive queries on massive datasets, making it suitable for scenarios where query speed is crucial.

  4. Scalability and Fault Tolerance: Apache Spark provides fault tolerance through its resilient distributed dataset (RDD) abstraction: each RDD records the lineage of transformations that produced it, so partitions lost to a node failure can be recomputed rather than restored from replicas. Druid is highly scalable and can handle petabytes of data spread across a cluster of machines, providing horizontal scalability for growing data volumes. Presto also scales horizontally by adding worker nodes to the cluster, but its fault tolerance is more limited: in its standard execution model, a query that loses a worker mid-run generally fails and has to be resubmitted.

  5. Use Cases: Apache Spark is widely used for batch processing, real-time streaming analytics, machine learning, and graph processing. Druid is commonly used for real-time analytics, time series data analysis, and event data analysis. Presto is popular for ad-hoc queries, interactive analytics, and business intelligence applications where fast query response times are critical.
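
The lineage-based recovery described in point 4 can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `from_source`, `lose_partition`, and the `computations` counter are hypothetical names invented for illustration. The sketch only shows the idea that a derived dataset remembers how each partition was produced, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute, num_partitions):
        self._compute = compute          # lineage: partition index -> list of records
        self.num_partitions = num_partitions
        self._cache = {}                 # materialized partitions held "in memory"
        self.computations = 0            # how many partitions were (re)computed

    @classmethod
    def from_source(cls, partitions):
        # The source list stands in for durable storage such as HDFS or S3.
        source = [list(p) for p in partitions]
        return cls(lambda i: list(source[i]), len(source))

    def map(self, fn):
        # The child keeps a reference to its parent instead of copying data.
        parent = self
        return ToyRDD(lambda i: [fn(x) for x in parent.partition(i)],
                      self.num_partitions)

    def partition(self, i):
        # Compute a partition from lineage only if it is not already cached.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]])
squares = base.map(lambda x: x * x)

before = squares.collect()      # materializes all three partitions
squares.lose_partition(1)       # "node failure": partition 1 is gone
after = squares.collect()       # only partition 1 is recomputed from lineage
```

In this toy run the second `collect()` returns the same records as the first while recomputing a single partition, which is the property that lets a lineage-based system avoid replicating every intermediate result.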

In summary, Apache Spark, Druid, and Presto each target a different part of big data processing: Apache Spark focuses on general-purpose distributed processing, Druid on real-time analytics, and Presto on interactive querying.

Pros of Druid
  • Real Time Aggregations (15)
  • Batch and Real-Time Ingestion (6)
  • OLAP (5)
  • OLAP + OLTP (3)
  • Combining stream and historical analytics (2)
  • OLTP (1)

Pros of Presto
  • Works directly on files in s3 (no ETL) (18)
  • Open-source (13)
  • Join multiple databases (12)
  • Scalable (10)
  • Gets ready in minutes (7)
  • MPP (6)

Pros of Apache Spark
  • Open-source (61)
  • Fast and Flexible (48)
  • One platform for every big data problem (8)
  • Great for distributed SQL like applications (8)
  • Easy to install and to use (6)
  • Works well for most Datascience usecases (3)
  • Interactive Query (2)
  • Machine learning library, Streaming in real time (2)
  • In memory Computation (2)

Cons of Druid
  • Limited sql support (3)
  • Joins are not supported well (2)
  • Complexity (1)

Cons of Presto
  (none listed)

Cons of Apache Spark
  • Speed (4)

What is Druid?

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.
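
As a rough illustration of the filters and the exact versus approximate calculations mentioned above, the sketch below builds a Druid native timeseries query as a Python dictionary. The datasource `events` and the columns `country`, `clicks`, and `user_id` are hypothetical, and the helper name `build_timeseries_query` is invented for this example. The query is only constructed and serialized here, not sent to a cluster.

```python
import json


def build_timeseries_query(data_source, interval, country):
    """Build a Druid native timeseries query (names of columns are assumptions)."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        # Filter: keep only rows whose `country` dimension matches.
        "filter": {"type": "selector", "dimension": "country", "value": country},
        "aggregations": [
            # Exact calculations: row count and a sum over a numeric column.
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
            # Approximate calculation: estimated number of distinct users.
            {"type": "cardinality", "name": "approx_users", "fields": ["user_id"]},
        ],
    }


query = build_timeseries_query(
    "events",                                        # hypothetical datasource
    "2024-01-01T00:00:00Z/2024-01-02T00:00:00Z",     # ISO-8601 interval
    "US",
)
payload = json.dumps(query, indent=2)
print(payload)
```

In a typical deployment a payload like this would be POSTed as JSON to a Broker's native query endpoint; the host, port, and authentication depend on the cluster and are left out of the sketch.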

What is Presto?

Distributed SQL Query Engine for Big Data
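
Presto addresses tables as `catalog.schema.table`, which is what allows a single SQL statement to join data that lives in different systems. The sketch below only assembles such a statement as a string. The catalog names `hive` and `mysql`, the schemas, tables, and columns are all assumptions for illustration; actual catalog names depend on how a given cluster is configured.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(orders_table, customers_table):
    """Join order facts from one catalog with customer rows from another."""
    return (
        "SELECT c.region, count(*) AS orders, sum(o.total) AS revenue\n"
        f"FROM {orders_table} AS o\n"
        f"JOIN {customers_table} AS c ON o.customer_id = c.id\n"
        "GROUP BY c.region\n"
        "ORDER BY revenue DESC"
    )


# Hypothetical layout: orders stored as files behind a Hive catalog,
# customers stored in a MySQL database exposed through a MySQL catalog.
sql = build_federated_query(
    qualified("hive", "sales", "orders"),
    qualified("mysql", "crm", "customers"),
)
print(sql)
```

The resulting statement could be submitted through any Presto client; the point of the sketch is only that neither table has to be copied into a shared store before the join.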

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

What are some alternatives to Druid, Presto, and Apache Spark?
HBase
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Prometheus
Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).