Amazon EMR vs Kafka

Overview

Amazon EMR
Stacks: 543
Followers: 682
Votes: 54

Kafka
Stacks: 24.2K
Followers: 22.3K
Votes: 607
GitHub Stars: 31.2K
Forks: 14.8K

Amazon EMR vs Kafka: What are the differences?

Introduction

This section compares Amazon EMR (Elastic MapReduce) and Kafka and highlights the key differences between them.

Scalability: Amazon EMR is a managed big data processing service that lets users analyze large datasets with popular frameworks such as Apache Spark and Hadoop. Its clusters can be scaled up or down to match the processing workload. Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant data streaming. It scales horizontally: brokers are added to the cluster and topic partitions are spread across them.
Data Processing Paradigm: Amazon EMR is used primarily for batch processing of large, static datasets with frameworks such as Apache Hadoop, Apache Spark, and Apache Hive, although some of these frameworks (Spark, for example) can also run streaming workloads. Kafka is oriented towards real-time data streaming. It offers a publish-subscribe model in which producers write records to topics and consumers read them from those topics in near real time.
Data Storage: Amazon EMR works with several storage options, including Amazon S3, HDFS (Hadoop Distributed File System), and Amazon EBS (Elastic Block Store), and jobs read their input from and write their output to these systems. Kafka stores records itself, in an append-only log kept on the brokers' disks. Records are retained for a configurable amount of time or until a configurable size threshold is reached.
Data Retention and Replayability: In Amazon EMR, input and output data typically live in durable storage such as Amazon S3 or HDFS, so they can be kept long term and reprocessed whenever needed. Kafka retains records according to the configured time or size limit. Consumers can replay any records that are still retained, but once a record falls outside the retention limit it is deleted and can no longer be replayed.
Data Integration: Amazon EMR integrates with other AWS services such as Amazon Redshift, Amazon DynamoDB, and Amazon Kinesis, which makes it straightforward to move data between those services and run complex analytics on it. Kafka acts as a data pipeline or messaging system that connects different applications and systems, serving as a central hub for real-time data streams.
Data Processing Latency: Amazon EMR is designed for batch processing, which typically has higher latency than real-time processing. Processing time depends on the size of the data and the complexity of the computation. Kafka is optimized for low-latency, real-time streaming. A cluster can stream millions of messages per second with low latency, which makes it suitable for use cases that require near real-time processing.
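The storage, retention, and replay behavior described for Kafka above can be illustrated with a toy model. The sketch below is a minimal in-memory log in plain Python, not Kafka's API: the `ToyLog` class, its method names, and the 60-second retention window are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """A toy, in-memory stand-in for an append-only log (not Kafka's API)."""

    retention_seconds: float
    base_offset: int = 0  # offset of the oldest record still retained
    records: list = field(default_factory=list)  # (timestamp, value) pairs

    def append(self, timestamp: float, value: str) -> int:
        """Append a record and return the offset assigned to it."""
        self.records.append((timestamp, value))
        return self.base_offset + len(self.records) - 1

    def enforce_retention(self, now: float) -> None:
        """Drop records older than the retention window, oldest first."""
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read_from(self, offset: int) -> list:
        """Replay values from `offset`, clamped to the oldest retained record."""
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in self.records[start:]]


log = ToyLog(retention_seconds=60)
log.append(0, "a")
log.append(30, "b")
log.append(90, "c")

replay_all = log.read_from(0)    # every record is still retained
log.enforce_retention(now=100)   # "a" and "b" are now older than 60 seconds
replay_after = log.read_from(0)  # only the surviving record can be replayed
```

Reading from an offset never removes anything, which is what makes replay possible. Only the retention step deletes records, and after it runs a reader asking for offset 0 is moved forward to the oldest record that survived.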

In summary, Amazon EMR is built for scalable batch processing and analysis of large datasets, while Kafka is built for high-throughput, fault-tolerant, real-time event streaming. The two differ in processing paradigm, data storage, data retention, data integration, and processing latency.
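To make the batch versus streaming distinction concrete, here is a small plain-Python sketch. It uses neither EMR nor Kafka; the function names and the sample page-view data are hypothetical. The batch function needs the whole dataset before it can answer, while the streaming function emits an updated result as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(dataset: list[str]) -> Counter:
    """Batch style: the complete dataset exists before processing starts."""
    return Counter(dataset)


def streaming_count(events: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Streaming style: update and emit a running count per incoming event."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


page_views = ["home", "cart", "home", "checkout", "home"]

# One result, available only after all the data has been read.
batch_result = batch_count(page_views)

# One result per event, available as soon as that event is seen.
stream_results = list(streaming_count(iter(page_views)))
```

Both approaches end with the same totals. The difference is when results become available, which is the latency trade-off described above.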

Advice on Amazon EMR, Kafka

viradiya

Apr 12, 2020

Needs advice on AngularJS, ASP.NET Core, and MSSQL

We are going to develop a microservices-based application. It consists of AngularJS, ASP.NET Core, and MSSQL.

We have three microservices: Emailservice, Filemanagementservice, and Filevalidationservice.

I am a beginner in microservices. I have read about RabbitMQ, but I have come to know that Redis and Kafka are also on the market. So, I want to know which is best.

Kirill

GO/C developer at Duckling Sales

Feb 16, 2021

Decided

Maybe not an obvious comparison with Kafka, since Kafka is pretty different from RabbitMQ. But for a small service, Rabbit as a pub/sub platform is super easy to use and pretty powerful. Kafka as an alternative was the original choice, but it's really a kind of overkill for a small-to-medium service, especially if you are not planning to use k8s, since a pure Docker deployment can be a pain because of the networking setup. Google PubSub was another alternative. It's actually pretty cheap, but I never tested it, since Rabbit was a really good fit for mailing/notification services.

Ishfaq

Feb 28, 2020

Needs advice

At the end of each backend (CRUD) API call from the UI, our backend application sends some external messages to a third-party application. These messages take too much extra time (building the message, processing it, sending it to the third party, and logging success or failure), and the UI application has no interest in them.

So currently we send these third-party messages from a new child thread at the end of each REST API call, so that the UI application doesn't wait for the extra third-party API calls.

I want to integrate Apache Kafka for these extra third-party API calls so that I can queue them, retry failed calls, and handle logging, etc. (Currently the third-party messages are sent from multiple threads at the same time, which uses too much processing and resources.)

Question 1: Is this a use case for a message broker?

Question 2: If it is, which is better, Kafka or RabbitMQ?

Detailed Comparison

Description
Amazon EMR: It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
Kafka: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Features: Amazon EMR
Elastic: Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. Deploy multiple clusters or resize a running cluster.
Low Cost: Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
Flexible Data Stores: With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
Hadoop Tools: EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.

Features: Kafka
Written at LinkedIn in Scala
Used by LinkedIn to offload processing of all page and other views
Defaults to using persistence, uses OS disk cache for hot data (has higher throughput than any of the above having persistence enabled)
Supports both online and offline processing

Statistics
Metric	Amazon EMR	Kafka
GitHub Stars	-	31.2K
GitHub Forks	-	14.8K
Stacks	543	24.2K
Followers	682	22.3K
Votes	54	607

Pros & Cons
Amazon EMR pros: On demand processing power (15); Don't need to maintain Hadoop Cluster yourself (12); Hadoop Tools (7); Elastic (6); Backed by Amazon (4)
Kafka pros: High-throughput (126); Distributed (119); Scalable (92); High-Performance (86); Durable (66)
Kafka cons: Non-Java clients are second-class citizens (32); Needs Zookeeper (29); Operational difficulties (9); Terrible Packaging (5)

What are some alternatives to Amazon EMR, Kafka?

RabbitMQ

RabbitMQ gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.

Celery

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

Amazon SQS

Transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available. With SQS, you can offload the administrative burden of operating and scaling a highly available messaging cluster, while paying a low price for only what you use.

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python.

NSQ

NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee.

Amazon Redshift

It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

ActiveMQ

Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License.

ZeroMQ

The 0MQ lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products. 0MQ sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering (subscriptions), seamless access to multiple transport protocols and more.

Qubole

Qubole is a cloud based service that makes big data easy for analysts and data engineers.

Apache NiFi

An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
