Alternatives to HBase logo

Alternatives to HBase

Cassandra, Google Cloud Bigtable, MongoDB, Hadoop, and Druid are the most popular alternatives and competitors to HBase.
459
494
+ 1
15

What is HBase and what are its top alternatives?

Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google' Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.
HBase is a tool in the Databases category of a tech stack.
HBase is an open source tool with 5.2K GitHub stars and 3.3K GitHub forks. Here’s a link to HBase's open source repository on GitHub

Top Alternatives to HBase

  • Cassandra
    Cassandra

    Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL. ...

  • Google Cloud Bigtable
    Google Cloud Bigtable

    Google Cloud Bigtable offers you a fast, fully managed, massively scalable NoSQL database service that's ideal for web, mobile, and Internet of Things applications requiring terabytes to petabytes of data. Unlike comparable market offerings, Cloud Bigtable doesn't require you to sacrifice speed, scale, or cost efficiency when your applications grow. Cloud Bigtable has been battle-tested at Google for more than 10 years—it's the database driving major applications such as Google Analytics and Gmail. ...

  • MongoDB
    MongoDB

    MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding. ...

  • Hadoop
    Hadoop

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...

  • Druid
    Druid

    Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. ...

  • Couchbase
    Couchbase

    Developed as an alternative to traditionally inflexible SQL databases, the Couchbase NoSQL database is built on an open source foundation and architected to help developers solve real-world problems and meet high scalability demands. ...

  • Apache Hive
    Apache Hive

    Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. ...

  • RocksDB
    RocksDB

    RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads. RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation. ...

HBase alternatives & related posts

Cassandra logo

Cassandra

3.6K
3.5K
507
A partitioned row store. Rows are organized into tables with a required primary key.
3.6K
3.5K
+ 1
507
PROS OF CASSANDRA
  • 119
    Distributed
  • 98
    High performance
  • 81
    High availability
  • 74
    Easy scalability
  • 53
    Replication
  • 26
    Reliable
  • 26
    Multi datacenter deployments
  • 10
    Schema optional
  • 9
    OLTP
  • 8
    Open source
  • 2
    Workload separation (via MDC)
  • 1
    Fast
CONS OF CASSANDRA
  • 3
    Reliability of replication
  • 1
    Size
  • 1
    Updates

related Cassandra posts

Thierry Schellenbach
Shared insights
on
GolangGolangPythonPythonCassandraCassandra
at

After years of optimizing our existing feed technology, we decided to make a larger leap with 2.0 of Stream. While the first iteration of Stream was powered by Python and Cassandra, for Stream 2.0 of our infrastructure we switched to Go.

The main reason why we switched from Python to Go is performance. Certain features of Stream such as aggregation, ranking and serialization were very difficult to speed up using Python.

We’ve been using Go since March 2017 and it’s been a great experience so far. Go has greatly increased the productivity of our development team. Not only has it improved the speed at which we develop, it’s also 30x faster for many components of Stream. Initially we struggled a bit with package management for Go. However, using Dep together with the VG package contributed to creating a great workflow.

Go as a language is heavily focused on performance. The built-in PPROF tool is amazing for finding performance issues. Uber’s Go-Torch library is great for visualizing data from PPROF and will be bundled in PPROF in Go 1.10.

The performance of Go greatly influenced our architecture in a positive way. With Python we often found ourselves delegating logic to the database layer purely for performance reasons. The high performance of Go gave us more flexibility in terms of architecture. This led to a huge simplification of our infrastructure and a dramatic improvement of latency. For instance, we saw a 10 to 1 reduction in web-server count thanks to the lower memory and CPU usage for the same number of requests.

#DataStores #Databases

See more
Thierry Schellenbach
Shared insights
on
RedisRedisCassandraCassandraRocksDBRocksDB
at

1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance started, out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write heavy workloads very efficiently.

Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSD. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much more simple than Cassandra.

This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

#InMemoryDatabases #DataStores #Databases

See more
Google Cloud Bigtable logo

Google Cloud Bigtable

138
362
25
The same database that powers Google Search, Gmail and Analytics
138
362
+ 1
25
PROS OF GOOGLE CLOUD BIGTABLE
  • 11
    High performance
  • 9
    Fully managed
  • 5
    High scalability
CONS OF GOOGLE CLOUD BIGTABLE
    Be the first to leave a con

    related Google Cloud Bigtable posts

    Context: I wanted to create an end to end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I never touched Terraform meaningfully until working on this project, and it's one of the best explorations in my development career. The documentation and syntax is incredibly human-readable and friendly. I'm used to building infrastructure through the google apis via Python , but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted by first impression. I'm glad past Sung did not make this decision either.

    Solution: Leveraging Google Cloud Build Google Cloud Run Google Cloud Bigtable Google BigQuery Google Cloud Storage Google Compute Engine along with some other fun tools, I can deploy over 40 GCP resources using Terraform!

    Check Out My Architecture: CLICK ME

    Check out the GitHub repo attached

    See more
    Rory Gwozdz
    CTO at Harvested Financial · | 2 upvotes · 125.1K views

    I'm trying to build a way to read financial data really, really fast, for low cost. We are write/update-light (in this arena) and read-heavy. Google BigQuery being serverless can keep costs beyond low, but query speeds are always a few seconds because, I think, of the lack of indexing and potential to take advantage of the structure of the common queries. I have tried various partitions on BigQuery to speed things up too with some success but nothing extraordinary. I have never used Google Cloud Bigtable but get how it works conceptually. I believe it would make date-range based queries markedly faster. Question is, are there ways to take advantage of date-ranges in BigQuery, or does it makes sense to just shift to BigTable for mega-fast reads? I'd love to get sub-50ms.

    See more
    MongoDB logo

    MongoDB

    92.5K
    79.8K
    4.1K
    The database for giant ideas
    92.5K
    79.8K
    + 1
    4.1K
    PROS OF MONGODB
    • 827
      Document-oriented storage
    • 593
      No sql
    • 553
      Ease of use
    • 464
      Fast
    • 410
      High performance
    • 257
      Free
    • 218
      Open source
    • 180
      Flexible
    • 145
      Replication & high availability
    • 112
      Easy to maintain
    • 42
      Querying
    • 39
      Easy scalability
    • 38
      Auto-sharding
    • 37
      High availability
    • 31
      Map/reduce
    • 27
      Document database
    • 25
      Easy setup
    • 25
      Full index support
    • 16
      Reliable
    • 15
      Fast in-place updates
    • 14
      Agile programming, flexible, fast
    • 12
      No database migrations
    • 8
      Easy integration with Node.Js
    • 8
      Enterprise
    • 6
      Enterprise Support
    • 5
      Great NoSQL DB
    • 4
      Support for many languages through different drivers
    • 3
      Schemaless
    • 3
      Aggregation Framework
    • 3
      Drivers support is good
    • 2
      Fast
    • 2
      Managed service
    • 2
      Easy to Scale
    • 2
      Awesome
    • 2
      Consistent
    • 1
      Good GUI
    • 1
      Acid Compliant
    CONS OF MONGODB
    • 6
      Very slowly for connected models that require joins
    • 3
      Not acid compliant
    • 2
      Proprietary query language

    related MongoDB posts

    Shared insights
    on
    Node.jsNode.jsGraphQLGraphQLMongoDBMongoDB

    I just finished the very first version of my new hobby project: #MovieGeeks. It is a minimalist online movie catalog for you to save the movies you want to see and for rating the movies you already saw. This is just the beginning as I am planning to add more features on the lines of sharing and discovery

    For the #BackEnd I decided to use Node.js , GraphQL and MongoDB:

    1. Node.js has a huge community so it will always be a safe choice in terms of libraries and finding solutions to problems you may have

    2. GraphQL because I needed to improve my skills with it and because I was never comfortable with the usual REST approach. I believe GraphQL is a better option as it feels more natural to write apis, it improves the development velocity, by definition it fixes the over-fetching and under-fetching problem that is so common on REST apis, and on top of that, the community is getting bigger and bigger.

    3. MongoDB was my choice for the database as I already have a lot of experience working on it and because, despite of some bad reputation it has acquired in the last months, I still believe it is a powerful database for at least a very long list of use cases such as the one I needed for my website

    See more
    Vaibhav Taunk
    Team Lead at Technovert · | 31 upvotes · 4M views

    I am starting to become a full-stack developer, by choosing and learning .NET Core for API Development, Angular CLI / React for UI Development, MongoDB for database, as it a NoSQL DB and Flutter / React Native for Mobile App Development. Using Postman, Markdown and Visual Studio Code for development.

    See more
    Hadoop logo

    Hadoop

    2.5K
    2.3K
    56
    Open-source software for reliable, scalable, distributed computing
    2.5K
    2.3K
    + 1
    56
    PROS OF HADOOP
    • 39
      Great ecosystem
    • 11
      One stack to rule them all
    • 4
      Great load balancer
    • 1
      Amazon aws
    • 1
      Java syntax
    CONS OF HADOOP
      Be the first to leave a con

      related Hadoop posts

      Shared insights
      on
      KafkaKafkaHadoopHadoop
      at

      The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.

      For databases, a custom Hadoop streamer pulled database data and wrote it to S3.

      Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.

      See more
      Conor Myhrvold
      Tech Brand Mgr, Office of CTO at Uber · | 7 upvotes · 3M views

      Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :

      Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

      https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

      (Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )

      See more
      Druid logo

      Druid

      380
      867
      32
      Fast column-oriented distributed data store
      380
      867
      + 1
      32
      PROS OF DRUID
      • 15
        Real Time Aggregations
      • 6
        Batch and Real-Time Ingestion
      • 5
        OLAP
      • 3
        OLAP + OLTP
      • 2
        Combining stream and historical analytics
      • 1
        OLTP
      CONS OF DRUID
      • 3
        Limited sql support
      • 2
        Joins are not supported well
      • 1
        Complexity

      related Druid posts

      Shared insights
      on
      DruidDruidMongoDBMongoDB

      My background is in Data analytics in the telecom domain. Have to build the database for analyzing large volumes of CDR data so far the data are maintained in a file server and the application queries data from the files. It's consuming a lot of resources queries are taking time so now I am asked to come up with the approach. I planned to rewrite the app, so which database needs to be used. I am confused between MongoDB and Druid.

      So please do advise me on picking from these two and why?

      See more

      My process is like this: I would get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files because the following process cannot handle loading all data to memory.

      The next process is making a heavy computation in a parallel fashion (per partition), and storing 3 intermediate versions as parquet files: two used for statistics, and the third will be filtered and create the final files.

      I make a report based on the two files in Jupyter notebook and convert it to HTML.

      • Everything is done with vanilla python and Pandas.
      • sometimes I may get a different format of data
      • cloud service is Microsoft Azure.

      What I'm considering is the following:

      Get the data with Kafka or with native python, do the first processing, and store data in Druid, the second processing will be done with Apache Spark getting data from apache druid.

      the intermediate states can be stored in druid too. and visualization would be with apache superset.

      See more
      Couchbase logo

      Couchbase

      475
      603
      110
      Document-Oriented NoSQL Database
      475
      603
      + 1
      110
      PROS OF COUCHBASE
      • 18
        High performance
      • 18
        Flexible data model, easy scalability, extremely fast
      • 9
        Mobile app support
      • 7
        You can query it with Ansi-92 SQL
      • 6
        All nodes can be read/write
      • 5
        Equal nodes in cluster, allowing fast, flexible changes
      • 5
        Both a key-value store and document (JSON) db
      • 5
        Open source, community and enterprise editions
      • 4
        Automatic configuration of sharding
      • 4
        Local cache capability
      • 3
        Easy setup
      • 3
        Linearly scalable, useful to large number of tps
      • 3
        Easy cluster administration
      • 3
        Cross data center replication
      • 3
        SDKs in popular programming languages
      • 3
        Elasticsearch connector
      • 3
        Web based management, query and monitoring panel
      • 2
        Map reduce views
      • 2
        DBaaS available
      • 2
        NoSQL
      • 1
        Buckets, Scopes, Collections & Documents
      • 1
        FTS + SQL together
      CONS OF COUCHBASE
      • 3
        Terrible query language

      related Couchbase posts

      Gabriel Pa

      We implemented our first large scale EPR application from naologic.com using CouchDB .

      Very fast, replication works great, doesn't consume much RAM, queries are blazing fast but we found a problem: the queries were very hard to write, it took a long time to figure out the API, we had to go and write our own @nodejs library to make it work properly.

      It lost most of its support. Since then, we migrated to Couchbase and the learning curve was steep but all worth it. Memcached indexing out of the box, full text search works great.

      See more
      Ilias Mentzelos
      Software Engineer at Plum Fintech · | 9 upvotes · 235.8K views
      Shared insights
      on
      MongoDBMongoDBCouchbaseCouchbase

      Hey, we want to build a referral campaign mechanism that will probably contain millions of records within the next few years. We want fast read access based on IDs or some indexes, and isolation is crucial as some listeners will try to update the same document at the same time. What's your suggestion between Couchbase and MongoDB? Thanks!

      See more
      Apache Hive logo

      Apache Hive

      478
      473
      0
      Data Warehouse Software for Reading, Writing, and Managing Large Datasets
      478
      473
      + 1
      0
      PROS OF APACHE HIVE
        Be the first to leave a pro
        CONS OF APACHE HIVE
          Be the first to leave a con

          related Apache Hive posts

          Ashish Singh
          Tech Lead, Big Data Platform at Pinterest · | 38 upvotes · 3M views

          To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

          Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

          We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

          Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

          Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

          #BigData #AWS #DataScience #DataEngineering

          See more
          Jan Vlnas
          Developer Advocate at Superface · | 5 upvotes · 433K views

          From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some of OpenRefine features as part of your data pipeline (there are pointers in FAQ), but OpenRefine in general is intended for a single-user local operation.

          I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.

          Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.

          See more
          RocksDB logo

          RocksDB

          139
          290
          11
          Embeddable persistent key-value store for fast storage, developed and maintained by Facebook Database Engineering Team
          139
          290
          + 1
          11
          PROS OF ROCKSDB
          • 5
            Very fast
          • 3
            Made by Facebook
          • 2
            Consistent performance
          • 1
            Ability to add logic to the database layer where needed
          CONS OF ROCKSDB
            Be the first to leave a con

            related RocksDB posts

            Thierry Schellenbach
            Shared insights
            on
            RedisRedisCassandraCassandraRocksDBRocksDB
            at

            1.0 of Stream leveraged Cassandra for storing the feed. Cassandra is a common choice for building feeds. Instagram, for instance started, out with Redis but eventually switched to Cassandra to handle their rapid usage growth. Cassandra can handle write heavy workloads very efficiently.

            Cassandra is a great tool that allows you to scale write capacity simply by adding more nodes, though it is also very complex. This complexity made it hard to diagnose performance fluctuations. Even though we had years of experience with running Cassandra, it still felt like a bit of a black box. When building Stream 2.0 we decided to go for a different approach and build Keevo. Keevo is our in-house key-value store built upon RocksDB, gRPC and Raft.

            RocksDB is a highly performant embeddable database library developed and maintained by Facebook’s data engineering team. RocksDB started as a fork of Google’s LevelDB that introduced several performance improvements for SSD. Nowadays RocksDB is a project on its own and is under active development. It is written in C++ and it’s fast. Have a look at how this benchmark handles 7 million QPS. In terms of technology it’s much more simple than Cassandra.

            This translates into reduced maintenance overhead, improved performance and, most importantly, more consistent performance. It’s interesting to note that LinkedIn also uses RocksDB for their feed.

            #InMemoryDatabases #DataStores #Databases

            See more

            I am researching different querying solutions to handle ~1 trillion records of data (in the realm of a petabyte). The data is mostly textual. I have identified a few options: Milvus, HBase, RocksDB, and Elasticsearch. I was wondering if there is a good way to compare the performance of these options (or if anyone has already done something like this). I want to be able to compare the speed of ingesting and querying textual data from these tools. Does anyone have information on this or know where I can find some? Thanks in advance!

            See more