
Alternatives to Apache Kudu

Cassandra, HBase, Apache Spark, Apache Impala, and Hadoop are the most popular alternatives and competitors to Apache Kudu.

What is Apache Kudu and what are its top alternatives?

Apache Kudu is an open-source data storage engine that combines fast analytics with fast data ingestion. It is designed for analytical workloads such as SQL analytics and machine learning. Key features of Apache Kudu include columnar storage, real-time updates, support for diverse workloads, and seamless integration with Apache Spark and Apache Impala. Its limitations include high memory usage for certain workloads and a lack of support for complex transactions.
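
To illustrate why columnar storage suits analytical scans, here is a toy Python sketch. It is purely conceptual — not how Kudu actually lays out data on disk — but it shows why an aggregate over one column is cheaper when that column lives in its own contiguous array:

```python
# Toy comparison of row-oriented vs column-oriented layouts.
# Conceptual sketch only -- not Apache Kudu's actual on-disk format.

rows = [
    {"id": 1, "city": "NYC", "amount": 10.0},
    {"id": 2, "city": "SF", "amount": 20.0},
    {"id": 3, "city": "NYC", "amount": 30.0},
]

# Row layout: scanning "amount" walks every record, dragging along
# fields the query never asked for.
total_from_rows = sum(r["amount"] for r in rows)

# Columnar layout: each column is a contiguous array, so the same
# aggregate reads only the values it needs (and compresses better).
columns = {
    "id": [1, 2, 3],
    "city": ["NYC", "SF", "NYC"],
    "amount": [10.0, 20.0, 30.0],
}
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 60.0
```

Both layouts give the same answer; the difference is how many bytes a column scan has to touch, which is where columnar engines get their analytics speed.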

  1. Cloudera Impala: Cloudera Impala is an open-source, massively parallel processing SQL query engine for large-scale data stored in Apache Hadoop clusters. Key features include fast query performance, integration with various BI tools, and support for complex queries. Pros include fast querying speed, while cons include limited support for real-time data ingestion compared to Apache Kudu.

  2. Apache HBase: Apache HBase is an open-source, distributed, scalable, Big Data store that runs on top of the Hadoop Distributed File System (HDFS). Key features include random, real-time read/write access to Big Data, linear and modular scalability, and automatic sharding. Pros include fast read and write access, while cons include performance limitations for analytical workloads compared to Apache Kudu.

  3. Druid: Apache Druid is a high-performance, column-oriented, distributed data store for real-time analytics on large datasets. Key features include low-latency queries, scalable infrastructure, and support for time-series data. Pros include real-time data ingestion, while cons include limited support for transactional workloads compared to Apache Kudu.

  4. ClickHouse: ClickHouse is an open-source, column-oriented database management system that allows for real-time analytics, interactive queries, and scalable data storage. Key features include high performance, horizontal scalability, and native support for various data formats. Pros include efficient query execution, while cons include limited support for complex transactions compared to Apache Kudu.

  5. Cassandra: Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers. Key features include linear scalability, continuous availability, and flexible data storage. Pros include high availability and fault tolerance, while cons include limited support for real-time analytics compared to Apache Kudu.

  6. InfluxDB: InfluxDB is an open-source, time-series database designed for fast, high-availability storage and retrieval of time-series data. Key features include high performance, efficient storage, and support for data visualization tools. Pros include native support for time-series data, while cons include limited support for complex queries compared to Apache Kudu.

  7. Vertica: Vertica is a commercial, column-oriented, relational database management system designed for Big Data analytics. Key features include high performance, scalability, and advanced analytics capabilities. Pros include support for advanced analytics functions, while cons include licensing costs compared to Apache Kudu.

  8. Greenplum: Greenplum is an open-source, massively parallel processing data platform based on PostgreSQL. Key features include scalable architecture, support for complex SQL queries, and advanced analytics capabilities. Pros include support for complex, ad-hoc queries, while cons include steep learning curve compared to Apache Kudu.

  9. TiDB: TiDB is an open-source, distributed SQL database that combines the horizontal scalability of NoSQL with the ACID compliance of traditional RDBMS. Key features include distributed transactions, SQL support, and horizontal scalability. Pros include scalability and horizontal sharding, while cons include performance limitations compared to Apache Kudu.

  10. MemSQL: MemSQL is a distributed, in-memory, SQL-compatible database that provides high performance and scalability for real-time analytics workloads. Key features include in-memory processing, high availability, and support for distributed SQL queries. Pros include fast query performance, while cons include limited support for complex data types compared to Apache Kudu.

Top Alternatives to Apache Kudu

  • Cassandra

    Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added to and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...

  • HBase

    Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable ("Bigtable: A Distributed Storage System for Structured Data" by Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop. ...

  • Apache Spark

    Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ...

  • Apache Impala

    Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. ...

  • Hadoop

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...

  • Druid

    Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. ...

  • Apache Ignite

    It is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale ...

  • ClickHouse

    It allows analysis of data that is updated in real time. It offers instant results in most cases: the data is processed faster than it takes to create a query. ...

Apache Kudu alternatives & related posts


Cassandra

A partitioned row store. Rows are organized into tables with a required primary key.
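
The "partitioned row store" idea can be sketched in a few lines of Python. This is a deliberate simplification — real Cassandra hashes keys onto a token ring with replication — and the node names here are hypothetical:

```python
import hashlib

# Toy partitioner: route each row to a node by hashing its primary key.
# Real Cassandra assigns token ranges on a ring and replicates data;
# this sketch only shows the key-to-node routing idea.

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def partition(primary_key: str) -> str:
    digest = hashlib.sha256(primary_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always lands on the same node.
assert partition("user:42") == partition("user:42")
placement = {k: partition(k) for k in ["user:1", "user:2", "user:3"]}
print(placement)
```

Because routing is a pure function of the key, a client can compute the owning node itself — the property that token-aware Cassandra drivers exploit to skip a coordinator hop.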
PROS OF CASSANDRA
  • 119
    Distributed
  • 98
    High performance
  • 81
    High availability
  • 74
    Easy scalability
  • 53
    Replication
  • 26
    Reliable
  • 26
    Multi datacenter deployments
  • 10
    Schema optional
  • 9
    OLTP
  • 8
    Open source
  • 2
    Workload separation (via MDC)
  • 1
    Fast
CONS OF CASSANDRA
  • 3
    Reliability of replication
  • 1
    Size
  • 1
    Updates

related Cassandra posts

Thierry Schellenbach shared insights on Golang, Python, and Cassandra:

After years of optimizing our existing feed technology, we decided to make a larger leap with 2.0 of Stream. While the first iteration of Stream was powered by Python and Cassandra, for Stream 2.0 of our infrastructure we switched to Go.

The main reason why we switched from Python to Go is performance. Certain features of Stream such as aggregation, ranking and serialization were very difficult to speed up using Python.

We’ve been using Go since March 2017 and it’s been a great experience so far. Go has greatly increased the productivity of our development team. Not only has it improved the speed at which we develop, it’s also 30x faster for many components of Stream. Initially we struggled a bit with package management for Go. However, using Dep together with the VG package contributed to creating a great workflow.

Go as a language is heavily focused on performance. The built-in PPROF tool is amazing for finding performance issues. Uber’s Go-Torch library is great for visualizing data from PPROF and will be bundled in PPROF in Go 1.10.

The performance of Go greatly influenced our architecture in a positive way. With Python we often found ourselves delegating logic to the database layer purely for performance reasons. The high performance of Go gave us more flexibility in terms of architecture. This led to a huge simplification of our infrastructure and a dramatic improvement of latency. For instance, we saw a 10 to 1 reduction in web-server count thanks to the lower memory and CPU usage for the same number of requests.

#DataStores #Databases

Thierry Schellenbach shared insights on Redis, Cassandra, and RocksDB:

1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance, started out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write-heavy workloads very efficiently.

Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

RocksDB is a highly performant embeddable database library developed and maintained by Facebook's data engineering team. RocksDB started as a fork of Google's LevelDB that introduced several performance improvements for SSDs. Nowadays RocksDB is a project of its own and is under active development. It is written in C++ and it's fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it's much simpler than Cassandra.

This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.
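
The log-structured-merge design that RocksDB inherits from LevelDB can be caricatured in a few lines of Python. This is a toy sketch of the idea — not RocksDB's actual API:

```python
# Toy LSM-tree sketch: writes land in an in-memory memtable; when it
# fills up it is flushed as an immutable sorted table; reads check the
# newest data first. Illustrative only -- not RocksDB's real interface.

class ToyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # flushed tables, newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush an immutable, sorted snapshot and start fresh.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            if key in table:
                return table[key]
        return None

db = ToyLSM()
db.put("feed:a", "v1")
db.put("feed:b", "v2")  # hits the limit -> flush to an SSTable
db.put("feed:a", "v3")  # newer value shadows the flushed one
assert db.get("feed:a") == "v3"
assert db.get("feed:b") == "v2"
```

Flushes are sequential writes of sorted data, which is part of why LSM stores like RocksDB perform well on SSDs.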

#InMemoryDatabases #DataStores #Databases


HBase

The Hadoop database, a distributed, scalable, big data store
PROS OF HBASE
  • 9
    Performance
  • 5
    OLTP
  • 1
    Fast Point Queries
CONS OF HBASE
    (none listed)

    related HBase posts

    I am researching different querying solutions to handle ~1 trillion records of data (in the realm of a petabyte). The data is mostly textual. I have identified a few options: Milvus, HBase, RocksDB, and Elasticsearch. I was wondering if there is a good way to compare the performance of these options (or if anyone has already done something like this). I want to be able to compare the speed of ingesting and querying textual data from these tools. Does anyone have information on this or know where I can find some? Thanks in advance!


    Hi, I'm building a machine learning pipeline to store image bytes and image vectors in the backend.

    So, when users query for the random access image data (key), we return the image bytes and perform machine learning model operations on it.

    I'm currently considering going with Amazon S3 (in the future, maybe add Redis caching layer) as the backend system to store the information (s3 buckets with sharded prefixes).

    The latency of S3 is 100-200 ms (get/put), and it has high throughput of 3,500 puts/sec and 5,500 gets/sec for a given bucket/prefix. If I need to reduce latency in the future, I can add a Redis cache.

    Also, S3 costs are far lower than HBase's (on Amazon EC2 instances with a 3x replication factor).

    I have not personally used HBase before, so can someone help me decide if I'm making the right choice here? I'm not aware of HBase latencies, and I have learned that the MOB feature on HBase has to be turned on if we store image bytes in one of the column families, as the average image is 240 KB.
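
The "sharded prefixes" technique mentioned in the post can be sketched in Python. The key scheme below is hypothetical; it only illustrates how hashing spreads request load across prefixes (S3 applies its request-rate limits per prefix):

```python
import hashlib

# Toy sketch: derive a hash-based shard prefix for each object key so
# load spreads evenly over many prefixes. Bucket layout is hypothetical.

NUM_SHARDS = 16

def sharded_key(image_id: str) -> str:
    digest = hashlib.sha256(image_id.encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{shard:02x}/images/{image_id}"

# The mapping is deterministic, so readers can recompute the full key
# from the image id alone -- no lookup table needed.
assert sharded_key("img-0001") == sharded_key("img-0001")
```

With 16 shards, hot traffic spreads over 16 prefixes, multiplying the effective per-bucket throughput ceiling accordingly.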


    Apache Spark

    Fast and general engine for large-scale data processing
    PROS OF APACHE SPARK
    • 61
      Open-source
    • 48
      Fast and Flexible
    • 8
      One platform for every big data problem
    • 8
      Great for distributed SQL like applications
    • 6
      Easy to install and to use
    • 3
      Works well for most Datascience usecases
    • 2
      Interactive Query
    • 2
      Machine learning libraries, streaming in real time
    • 2
      In memory Computation
    CONS OF APACHE SPARK
    • 4
      Speed

    related Apache Spark posts

    Conor Myhrvold
    Tech Brand Mgr, Office of CTO at Uber · 44 upvotes · 10M views

    How Uber developed the open source, end-to-end distributed tracing system Jaeger, now a CNCF project:

    Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

    Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:

    https://eng.uber.com/distributed-tracing/

    (GitHub Pages : https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)

    Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark

    Eric Colson
    Chief Algorithms Officer at Stitch Fix · 21 upvotes · 6.1M views

    The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad-hoc queries and dashboards.

    Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

    At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

    For more info:

    #DataScience #DataStack #Data


    Apache Impala

    Real-time Query for Hadoop
    PROS OF APACHE IMPALA
    • 11
      Super fast
    • 1
      Massively Parallel Processing
    • 1
      Load Balancing
    • 1
      Replication
    • 1
      Scalability
    • 1
      Distributed
    • 1
      High Performance
    • 1
      Open Source
    CONS OF APACHE IMPALA
      (none listed)

      related Apache Impala posts

      I have been working on a Java application to demonstrate the latency of select/insert/update operations on Kudu storage using the Apache Kudu API (Java-based client). I have a few queries about using the Apache Kudu API:

      1. Is there a JDBC wrapper over the Apache Kudu API for getting connections to Kudu masters, with a connection pool mechanism and all DB operations?

      2. Does the Apache Kudu API support ORDER BY, GROUP BY, and aggregate functions? If yes, how can these be implemented using the Kudu APIs?

      3. Can we add Kudu predicates to a Kudu update operation? If yes, how?

      4. Does the Apache Kudu API support batch insertion (executing the Kudu insert for multiple rows in one go instead of row by row, e.g. KuduSession.apply(List))?

      5. Does the Apache Kudu API support joins on tables?

      6. Which tool is preferred (Apache Impala or the Kudu API) for read and update/insert DB operations?


      Hadoop

      Open-source software for reliable, scalable, distributed computing
      PROS OF HADOOP
      • 39
        Great ecosystem
      • 11
        One stack to rule them all
      • 4
        Great load balancer
      • 1
        Amazon aws
      • 1
        Java syntax
      CONS OF HADOOP
        (none listed)

        related Hadoop posts

        Shared insights on Kafka and Hadoop:

        The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

        For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

        Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.

        Conor Myhrvold
        Tech Brand Mgr, Office of CTO at Uber · 7 upvotes · 3M views

        Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

        Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

        https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

        (Direct GitHub repo: https://github.com/uber/marmaray)


        Druid

        Fast column-oriented distributed data store
        PROS OF DRUID
        • 15
          Real Time Aggregations
        • 6
          Batch and Real-Time Ingestion
        • 5
          OLAP
        • 3
          OLAP + OLTP
        • 2
          Combining stream and historical analytics
        • 1
          OLTP
        CONS OF DRUID
        • 3
          Limited sql support
        • 2
          Joins are not supported well
        • 1
          Complexity

        related Druid posts

        Shared insights on Druid and MongoDB:

        My background is in data analytics in the telecom domain. I have to build a database for analyzing large volumes of CDR data. So far, the data has been maintained on a file server, and the application queries data from the files. This consumes a lot of resources and queries are slow, so I have been asked to come up with a new approach. I plan to rewrite the app, so which database should be used? I am confused between MongoDB and Druid.

        Please advise me on picking between these two, and why.


        My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.

        The next process is making a heavy computation in a parallel fashion (per partition), and storing 3 intermediate versions as parquet files: two used for statistics, and the third will be filtered and create the final files.

        I make a report based on the two files in Jupyter notebook and convert it to HTML.

        • Everything is done with vanilla Python and Pandas.
        • Sometimes I may get a different format of data.
        • The cloud service is Microsoft Azure.

        What I'm considering is the following:

        Get the data with Kafka or with native Python, do the first processing, and store the data in Druid; the second processing will be done with Apache Spark, getting data from Apache Druid.

        The intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.


        Apache Ignite

        An open-source distributed database, caching and processing platform
        PROS OF APACHE IGNITE
        • 4
          Multiple client language support
        • 4
          Written in Java, runs on the JVM
        • 4
          Free
        • 4
          High Availability
        • 3
          Load balancing
        • 3
          Cluster-wide SQL query support
        • 3
          REST interface
        • 2
          Easy to use
        • 2
          Distributed compute
        • 2
          Better Documentation
        • 1
          Distributed Locking
        CONS OF APACHE IGNITE
          (none listed)

          related Apache Ignite posts


           ClickHouse

          A column-oriented database management system
          PROS OF CLICKHOUSE
          • 19
            Fast, very very fast
          • 11
            Good compression ratio
          • 6
            Horizontally scalable
          • 5
            Great CLI
          • 5
            Utilizes all CPU resources
          • 5
            RESTful
          • 4
            Buggy
          • 4
            Open-source
          • 4
            Great number of SQL functions
          • 3
             Server crashes, it's normal :(
          • 3
            Has no transactions
          • 2
            Flexible connection options
          • 2
            Highly available
          • 2
            ODBC
          • 2
            Flexible compression options
          • 1
            In IDEA data import via HTTP interface not working
          CONS OF CLICKHOUSE
          • 5
            Slow insert operations

           related ClickHouse posts