Amazon Redshift vs Kafka

Overview

Amazon Redshift: Stacks 1.5K, Followers 1.4K, Votes 108

Kafka: Stacks 24.2K, Followers 22.3K, Votes 607, GitHub Stars 31.2K, Forks 14.8K

Amazon Redshift vs Kafka: What are the differences?

Introduction

Amazon Redshift and Kafka are both popular technologies used for data processing and data management. However, they have significant differences in terms of their purpose, functionality, and architecture.

Data Storage and Processing: Amazon Redshift is a fully managed cloud data warehouse designed to analyze large volumes of structured data. It uses columnar storage, which improves compression and query performance. Kafka is a distributed streaming platform optimized for high-throughput, fault-tolerant, low-latency data streaming. It is mainly used for real-time data integration and data pipelines.
Data Ingestion: Redshift primarily ingests data in batches, for example through bulk loads (the COPY command), data imports, and integrations with other AWS services. Kafka provides a distributed publish-subscribe messaging system in which multiple producers write to the cluster concurrently, allowing real-time streaming ingestion from many sources.
Data Processing Paradigm: Redshift follows a traditional, SQL-based, batch-oriented approach to data processing and analytics, and is optimized for running complex analytical queries on massive volumes of data. Kafka follows a stream processing paradigm: data is processed in real time or near real time as it flows through the cluster, typically by consumers or a stream-processing library such as Kafka Streams. This enables transformations and analytics on live data streams.
Data Persistence: In Amazon Redshift, data is stored persistently and survives cluster restarts. It is distributed across the nodes of a cluster and replicated for fault tolerance. In Kafka, data is stored durably in a distributed commit log, but records are kept only for a configurable retention period (seven days by default). A topic keeps data indefinitely only if it is explicitly configured to do so.
Data Durability and Fault Tolerance: Redshift ensures durability and fault tolerance by replicating data across nodes within a cluster, and it handles backups automatically so that it can recover from system failures. Kafka ensures durability and fault tolerance by replicating each partition across multiple brokers, which keeps data available when a broker fails.
Data Latency: Amazon Redshift is optimized for OLAP (Online Analytical Processing) workloads, where the focus is on running complex analytical queries over large volumes of data rather than on per-record latency. It provides high-performance query execution and supports highly concurrent workloads. Kafka is optimized for low-latency delivery, supporting fast ingestion and near-real-time processing of high-volume data streams.
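The persistence difference above can be illustrated with a small sketch. This is a toy, in-memory model of a single-partition append-only log with time-based retention, not Kafka's actual API: the names `ToyLog`, `Record`, `enforce_retention`, and `read_from` are hypothetical, and timestamps are passed in by the caller to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    offset: int       # position in the log; never reused
    timestamp: float  # seconds, supplied by the caller in this sketch
    value: str


@dataclass
class ToyLog:
    """Hypothetical single-partition commit log with time-based retention.

    retention_seconds=None models a log that keeps records indefinitely.
    """
    retention_seconds: Optional[float] = None
    records: List[Record] = field(default_factory=list)
    next_offset: int = 0

    def append(self, value: str, now: float) -> int:
        """Append a record at the end of the log and return its offset."""
        offset = self.next_offset
        self.records.append(Record(offset, now, value))
        self.next_offset += 1
        return offset

    def enforce_retention(self, now: float) -> None:
        """Drop records older than the retention period, if one is set."""
        if self.retention_seconds is None:
            return
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r.timestamp >= cutoff]

    def read_from(self, offset: int) -> List[Record]:
        """Return all retained records at or after the given offset."""
        return [r for r in self.records if r.offset >= offset]


log = ToyLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=90)
log.enforce_retention(now=100)  # cutoff is 40, so "a" and "b" expire
remaining = [(r.offset, r.value) for r in log.read_from(0)]
```

After retention runs, only the record "c" remains, and it keeps its original offset of 2. Each consumer in this model would track its own read offset, which is why several consumers can read the same log independently. A warehouse table, by contrast, keeps its rows until they are explicitly deleted.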

In summary, Amazon Redshift is a data warehousing solution designed for storing and analyzing large volumes of structured data, while Kafka is a distributed streaming platform optimized for real-time data streaming, processing, and integration. Redshift follows a traditional batch-oriented SQL-based approach for data processing, whereas Kafka follows a real-time stream processing paradigm. Redshift provides persistent storage and is optimized for complex analytical queries, while Kafka enables low-latency, fault-tolerant data streaming and processing.
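To make the batch versus stream distinction concrete, here is a minimal sketch in plain Python. It is an illustration only and uses neither Redshift nor Kafka: `batch_totals` and `streaming_totals` are hypothetical helper names, and the event data is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (key, amount)


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch style: read the full data set, then return one final answer."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Stream style: emit an updated answer after every incoming event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
        yield dict(totals)  # snapshot of the running state


EVENTS = [("a", 1), ("b", 2), ("a", 3)]

final = batch_totals(EVENTS)
updates = list(streaming_totals(EVENTS))
```

Both approaches converge on the same totals. The batch version produces them once, after all data has been loaded, while the streaming version has a usable (partial) answer after each event arrives.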


Advice on Amazon Redshift, Kafka

viradiya

Apr 12, 2020

Needs advice on AngularJS, ASP.NET Core, MSSQL

We are going to develop a microservices-based application. It consists of AngularJS, ASP.NET Core, and MSSQL.

We have three microservices: Emailservice, Filemanagementservice, and Filevalidationservice.

I am a beginner in microservices. I have read about RabbitMQ, but I have since learned that Redis and Kafka are also options, so I want to know which is best.

933k views

datocrats-org

Jul 29, 2020

Needs advice on Amazon EC2, Tableau, PowerBI

We need to perform ETL from several databases into a data warehouse or data lake. We want to

keep raw and transformed data available to users to draft their own queries efficiently
give users the ability to give custom permissions and SSO
move between open-source on-premises development and cloud-based production environments

We want to use only inexpensive Amazon EC2 instances, on a medium-sized data set (16GB to 32GB) feeding into Tableau Server or PowerBI for reporting and data analysis.

319k views

Ishfaq

Feb 28, 2020

Needs advice

At the end of each backend (CRUD) API call from the UI, our backend application sends external messages to a third-party application. These messages take too much extra time (building the message, processing it, sending it to the third party, and logging success or failure), and the UI application has no interest in them.

So we currently send these third-party messages from a new child thread at the end of each REST API call, so that the UI application doesn't wait for the extra third-party API calls.

I want to integrate Apache Kafka for these third-party API calls so that I can queue them and retry failed calls (currently the messages are sent from multiple threads at the same time, which uses too much processing and resources), add logging, and so on.

Question 1: Is this a use case of a message broker?

Question 2: If it is, which is better, Kafka or RabbitMQ?

804k views

Detailed Comparison

Description
Amazon Redshift: It is optimized for data sets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Kafka: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Features
Amazon Redshift:
Optimized for Data Warehousing: It uses columnar storage, data compression, and zone maps to reduce the amount of IO needed to perform queries. Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources.
Scalable: With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change.
No Up-Front Costs: You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing.
Fault Tolerant: Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3.
SQL: Amazon Redshift is a SQL data warehouse and uses industry standard ODBC and JDBC connections and Postgres drivers.
Isolation: Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster.
Encryption: With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest.
Kafka:
Written at LinkedIn in Scala
Used by LinkedIn to offload processing of all page and other views
Defaults to using persistence, uses OS disk cache for hot data (has higher throughput than any of the above having persistence enabled)
Supports both online and offline processing

Statistics
	Amazon Redshift	Kafka
GitHub Stars	-	31.2K
GitHub Forks	-	14.8K
Stacks	1.5K	24.2K
Followers	1.4K	22.3K
Votes	108	607

Pros & Cons
Amazon Redshift pros: 41 Data Warehousing, 27 Scalable, 17 SQL, 14 Backed by Amazon, 5 Encryption
Kafka pros: 126 High-throughput, 119 Distributed, 92 Scalable, 86 High-Performance, 66 Durable
Kafka cons: 32 Non-Java clients are second-class citizens, 29 Needs Zookeeper, 9 Operational difficulties, 5 Terrible Packaging

Integrations
Amazon Redshift: SQLite, MySQL, Oracle PL/SQL
Kafka: No integrations available
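
The comparison above mentions that Redshift uses zone maps to reduce IO. The idea can be sketched in a few lines of Python: keep the minimum and maximum value of each storage block, and skip any block whose range cannot satisfy the query predicate. This is a toy illustration, not Redshift's implementation: the names `build_zone_map` and `scan_greater_than` are hypothetical, and the block size of 4 values is an assumption chosen to keep the example small (real blocks are far larger).

```python
from typing import List, Tuple

BLOCK_SIZE = 4  # assumed tiny block size, for illustration only


def build_zone_map(column: List[int], block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Record the (min, max) of every block of a single column."""
    zone_map = []
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        zone_map.append((min(block), max(block)))
    return zone_map


def scan_greater_than(
    column: List[int],
    zone_map: List[Tuple[int, int]],
    threshold: int,
    block_size: int = BLOCK_SIZE,
) -> Tuple[List[int], int]:
    """Return values > threshold and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for i, (_low, high) in enumerate(zone_map):
        if high <= threshold:
            continue  # no value in this block can match, so skip the read
        blocks_read += 1
        start = i * block_size
        matches.extend(v for v in column[start:start + block_size] if v > threshold)
    return matches, blocks_read


column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
zone_map = build_zone_map(column)
matches, blocks_read = scan_greater_than(column, zone_map, threshold=8)
```

Here only one of the three blocks is read, because the other two have a maximum of 8 or less. Block skipping of this kind pays off most when the column is sorted or clustered, since unsorted data tends to give every block a wide min/max range.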

What are some alternatives to Amazon Redshift, Kafka?

RabbitMQ

RabbitMQ gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.

Celery

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

Amazon SQS

Transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available. With SQS, you can offload the administrative burden of operating and scaling a highly available messaging cluster, while paying a low price for only what you use.

Google BigQuery

Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure. Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python.

NSQ

NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee.

ActiveMQ

Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License.

ZeroMQ

The 0MQ lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products. 0MQ sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering (subscriptions), seamless access to multiple transport protocols and more.

Qubole

Qubole is a cloud based service that makes big data easy for analysts and data engineers.

Apache NiFi

An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

Amazon EMR

It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
