Impala vs Apache Spark: What are the differences?
Impala: Real-time Query for Hadoop. Impala is a modern, open source, MPP (massively parallel processing) SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data stored in HDFS or Apache HBase in real time, using SELECT, JOIN, and aggregate functions.
Apache Spark: Fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning.
Impala and Apache Spark both belong to the "Big Data Tools" category of the tech stack.
Some of the features offered by Impala are:
- Do BI-style Queries on Hadoop
- Unify Your Infrastructure
- Implement Quickly
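To make "BI-style queries" concrete, here is a minimal sketch of the kind of SELECT, JOIN, and aggregate query such an engine is meant to serve. It is only an illustration: an in-memory SQLite database from Python's standard library stands in for Impala so the example runs anywhere, and the `customers` and `orders` tables, their columns, and the `revenue_by_region` helper are all hypothetical names invented for this sketch.

```python
import sqlite3

# SQLite (in-memory) is only a local stand-in for an SQL engine here;
# it is NOT Impala. Table and column names are hypothetical.


def revenue_by_region(conn):
    """Return (region, total revenue) rows, largest revenue first."""
    query = """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()


# Build a tiny sample dataset to query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount INTEGER
    );
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
    INSERT INTO orders VALUES (1, 1, 100), (2, 2, 250), (3, 3, 50), (4, 1, 25);
""")

rows = revenue_by_region(conn)
print(rows)
```

The query itself uses only standard SELECT, JOIN, GROUP BY, and SUM, so a similar statement could presumably be sent to Impala through whichever client your cluster provides; only the connection setup would differ.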
On the other hand, Apache Spark provides the following key features:
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Write applications quickly in Java, Scala or Python
- Combine SQL, streaming, and complex analytics
"Super fast" is the top reason cited by over 7 developers for liking Impala, while over 45 developers mention "Open-source" as the leading reason for choosing Apache Spark.
Impala and Apache Spark are both open source tools. Judging by GitHub activity, Apache Spark (22.3K stars, 19.3K forks) appears to be more widely adopted than Impala (2.17K stars, 825 forks).
Uber Technologies, 9GAG, and Hootsuite are some of the popular companies that use Apache Spark, whereas Impala is used by Expedia.com, 37 Signals, and Stripe. Apache Spark has broader approval, being mentioned in 263 company stacks and 111 developer stacks, compared to Impala, which is listed in 15 company stacks and 5 developer stacks.