Apache Hive vs Apache Spark: What are the differences?
What is Apache Hive? Data Warehouse Software for Reading, Writing, and Managing Large Datasets. Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
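The idea that structure can be projected onto data already in storage is often called schema-on-read. The following is a minimal plain-Python sketch of that idea, not Hive itself: the tab-delimited sample data, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical names chosen for illustration.

```python
import csv
import io

# Raw tab-delimited text standing in for files that already sit in storage.
# (Hypothetical sample data; nothing here talks to Hive or Hadoop.)
RAW_DATA = "1\talice\t34.5\n2\tbob\t12.0\n3\tcarol\t78.25\n"

# Hypothetical schema: column names and types, applied only at read time.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Yield one dict per row, casting each field to the type the schema names."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW_DATA, SCHEMA))

# Roughly the shape of "SELECT name FROM t WHERE amount > 20" over the typed rows.
big_spenders = [row["name"] for row in rows if row["amount"] > 20]
print(big_spenders)
```

The stored text is never rewritten; only the reader decides how to interpret it, so a different schema could be applied to the same bytes later.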
What is Apache Spark? Fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Apache Hive and Apache Spark both belong to the "Big Data Tools" category of the tech stack.
Some of the features offered by Apache Hive are:
- Built on top of Apache Hadoop
- Tools to enable easy access to data via SQL
- Support for extract/transform/load (ETL), reporting, and data analysis
On the other hand, Apache Spark provides the following key features:
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Write applications quickly in Java, Scala or Python
- Combine SQL, streaming, and complex analytics
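The in-memory speed claim in the list above is usually explained by reusing a dataset across several passes instead of re-reading it from disk each time. The sketch below illustrates only that general idea in plain Python. It is not Spark code and makes no attempt to reproduce the quoted speedups; the file name and the helpers `load_numbers`, `sum_reloading`, and `sum_cached` are hypothetical.

```python
import os
import tempfile


def load_numbers(path):
    """Read one integer per line from disk."""
    with open(path) as handle:
        return [int(line) for line in handle]


def sum_reloading(path, passes):
    """Re-read the file on every pass, like chaining jobs through disk."""
    reads = 0
    total = 0
    for _ in range(passes):
        data = load_numbers(path)
        reads += 1
        total += sum(data)
    return total, reads


def sum_cached(path, passes):
    """Read the file once, then reuse the in-memory list on every pass."""
    data = load_numbers(path)
    total = 0
    for _ in range(passes):
        total += sum(data)
    return total, 1


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input file holding the integers 1..100, one per line.
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as handle:
        handle.write("\n".join(str(n) for n in range(1, 101)))
    reload_result = sum_reloading(path, passes=5)
    cached_result = sum_cached(path, passes=5)

# Each result is (total, number_of_disk_reads).
print(reload_result, cached_result)
```

Both functions compute the same total; they differ only in how many times the data is read from disk, which is the cost that matters most for iterative workloads.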
Apache Hive and Apache Spark are both open source tools. With 22.5K GitHub stars and 19.4K forks, Apache Spark appears to be more popular than Apache Hive, which has 2.62K stars and 2.58K forks.
Uber Technologies, Slack, and Shopify are some of the popular companies that use Apache Spark, whereas Apache Hive is used by Repro, Algorithmia, and Eyereturn Marketing. Apache Spark also has broader adoption: it is mentioned in 266 company stacks and 112 developer stacks, compared to Apache Hive, which is listed in 27 company stacks and 12 developer stacks.