What is Kafka and what are its top alternatives?
Top Alternatives to Kafka
- ActiveMQ
Apache ActiveMQ is fast, supports many Cross Language Clients and Protocols, comes with easy to use Enterprise Integration Patterns and many advanced features while fully supporting JMS 1.1 and J2EE 1.4. Apache ActiveMQ is released under the Apache 2.0 License. ...
- RabbitMQ
RabbitMQ gives your applications a common platform to send and receive messages, and your messages a safe place to live until received. ...
- Amazon Kinesis
Amazon Kinesis can collect and process hundreds of gigabytes of data per second from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data. ...
- Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ...
- Akka
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. ...
- Apache Storm
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. ...
- Apache Flink
Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala. ...
- Redis
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. ...
Kafka alternatives & related posts
- Easy to use18
- Open source14
- Efficient13
- JMS compliant10
- High Availability6
- Scalable5
- Distributed Network of brokers3
- Persistence3
- Support XA (distributed transactions)3
- Docker delievery1
- Highly configurable1
- RabbitMQ0
- ONLY Vertically Scalable1
- Support1
- Low resilience to exceptions and interruptions1
- Difficult to scale1
related ActiveMQ posts
I want to choose Message Queue with the following features - Highly Available, Distributed, Scalable, Monitoring. I have RabbitMQ, ActiveMQ, Kafka and Apache RocketMQ in mind. But I am confused which one to choose.
I use ActiveMQ because RabbitMQ have stopped giving the support for AMQP 1.0 or above version and the earlier version of AMQP doesn't give the functionality to support OAuth.
If OAuth is not required and we can go with AMQP 0.9 then i still recommend rabbitMq.
- It's fast and it works with good metrics/monitoring234
- Ease of configuration80
- I like the admin interface59
- Easy to set-up and start with50
- Durable21
- Standard protocols18
- Intuitive work through python18
- Written primarily in Erlang10
- Simply superb8
- Completeness of messaging patterns6
- Scales to 1 million messages per second3
- Reliable3
- Distributed2
- Supports AMQP2
- Better than most traditional queue based message broker2
- Supports MQTT2
- Clusterable1
- Clear documentation with different scripting language1
- Great ui1
- Inubit Integration1
- Better routing system1
- High performance1
- Runs on Open Telecom Platform1
- Delayed messages1
- Reliability1
- Open-source1
- Too complicated cluster/HA config and management9
- Needs Erlang runtime. Need ops good with Erlang runtime6
- Configuration must be done first, not by your code5
- Slow4
related RabbitMQ posts
As Sentry runs throughout the day, there are about 50 different offline tasks that we execute—anything from “process this event, pretty please” to “send all of these cool people some emails.” There are some that we execute once a day and some that execute thousands per second.
Managing this variety requires a reliably high-throughput message-passing technology. We use Celery's RabbitMQ implementation, and we stumbled upon a great feature called Federation that allows us to partition our task queue across any number of RabbitMQ servers and gives us the confidence that, if any single server gets backlogged, others will pitch in and distribute some of the backlogged tasks to their consumers.
#MessageQueue
Hi, I am building an enhanced web-conferencing app that will have a voice/video call, live chats, live notifications, live discussions, screen sharing, etc features. Ref: Zoom.
I need advise finalizing the tech stack for this app. I am considering below tech stack:
- Frontend: React
- Backend: Node.js
- Database: MongoDB
- IAAS: #AWS
- Containers & Orchestration: Docker / Kubernetes
- DevOps: GitLab, Terraform
- Brokers: Redis / RabbitMQ
I need advice at the platform level as to what could be considered to support concurrent video streaming seamlessly.
Also, please suggest what could be a better tech stack for my app?
#SAAS #VideoConferencing #WebAndVideoConferencing #zoom #stack
Amazon Kinesis
related Amazon Kinesis posts
We are in the process of building a modern content platform to deliver our content through various channels. We decided to go with Microservices architecture as we wanted scale. Microservice architecture style is an approach to developing an application as a suite of small independently deployable services built around specific business capabilities. You can gain modularity, extensive parallelism and cost-effective scaling by deploying services across many distributed servers. Microservices modularity facilitates independent updates/deployments, and helps to avoid single point of failure, which can help prevent large-scale outages. We also decided to use Event Driven Architecture pattern which is a popular distributed asynchronous architecture pattern used to produce highly scalable applications. The event-driven architecture is made up of highly decoupled, single-purpose event processing components that asynchronously receive and process events.
To build our #Backend capabilities we decided to use the following: 1. #Microservices - Java with Spring Boot , Node.js with ExpressJS and Python with Flask 2. #Eventsourcingframework - Amazon Kinesis , Amazon Kinesis Firehose , Amazon SNS , Amazon SQS, AWS Lambda 3. #Data - Amazon RDS , Amazon DynamoDB , Amazon S3 , MongoDB Atlas
To build #Webapps we decided to use Angular 2 with RxJS
#Devops - GitHub , Travis CI , Terraform , Docker , Serverless
As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.
We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.
- Open-source60
- Fast and Flexible48
- One platform for every big data problem8
- Great for distributed SQL like applications8
- Easy to install and to use6
- Works well for most Datascience usecases3
- Interactive Query2
- In memory Computation2
- Machine learning libratimery, Streaming in real2
- Speed4
related Apache Spark posts
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )
- Great concurrency model32
- Fast17
- Actor Library12
- Open source10
- Resilient7
- Message driven5
- Scalable5
- Mixing futures with Akka tell is difficult3
- Closing of futures2
- No type safety2
- Very difficult to refactor1
- Typed actors still not stable1
related Akka posts
To solve the problem of scheduling and executing arbitrary tasks in its distributed infrastructure, PagerDuty created an open-source tool called Scheduler. Scheduler is written in Scala and uses Cassandra for task persistence. It also adds Apache Kafka to handle task queuing and partitioning, with Akka to structure the library’s concurrency.
The service’s logic schedules a task by passing it to the Scheduler’s Scala API, which serializes the task metadata and enqueues it into Kafka. Scheduler then consumes the tasks, and posts them to Cassandra to prevent data loss.
I decided to use Akka instead of Kafka streams because I have personal relationships at @Lightbend.
- Flexible10
- Easy setup6
- Clojure3
- Event Processing3
- Real Time2
related Apache Storm posts
Lumosity is home to the world's largest cognitive training database, a responsibility we take seriously. For most of the company's history, our analysis of user behavior and training data has been powered by an event stream--first a simple Node.js pub/sub app, then a heavyweight Ruby app with stronger durability. Both supported decent throughput and latency, but they lacked some major features supported by existing open-source alternatives: replaying existing messages (also lacking in most message queue-based solutions), scaling out many different readers for the same stream, the ability to leverage existing solutions for reading and writing, and possibly most importantly: the ability to hire someone externally who already had expertise.
We ultimately migrated to Kafka in early- to mid-2016, citing both industry trends in companies we'd talked to with similar durability and throughput needs, the extremely strong documentation and community. We pored over Kyle Kingsbury's Jepsen post (https://aphyr.com/posts/293-jepsen-Kafka), as well as Jay Kreps' follow-up (http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen), talked at length with Confluent folks and community members, and still wound up running parallel systems for quite a long time, but ultimately, we've been very, very happy. Understanding the internals and proper levers takes some commitment, but it's taken very little maintenance once configured. Since then, the Confluent Platform community has grown and grown; we've gone from doing most development using custom Scala consumers and producers to being 60/40 Kafka Streams/Connects.
We originally looked into Storm / Heron , and we'd moved on from Redis pub/sub. Heron looks great, but we already had a programming model across services that was more akin to consuming a message consumers than required a topology of bolts, etc. Heron also had just come out while we were starting to migrate things, and the community momentum and direction of Kafka felt more substantial than the older Storm. If we were to start the process over again today, we might check out Pulsar , although the ecosystem is much younger.
To find out more, read our 2017 engineering blog post about the migration!
- Unified batch and stream processing16
- Easy to use streaming apis8
- Out-of-the box connector to kinesis,s3,hdfs8
- Open Source4
- Low latency2
related Apache Flink posts
I need to build the Alert & Notification framework with the use of a scheduled program. We will analyze the events from the database table and filter events that are falling under a day timespan and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move on Apache Flink, I am trying to understand how Apache Flink could be fit better for us.
I have to build a data processing application with an Apache Beam stack and Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process and EMR clusters that keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulative data streams to another Kafka topic. Any advice on how to make the process more stable?
- Performance884
- Super fast541
- Ease of use512
- In-memory cache443
- Advanced key-value cache323
- Open source193
- Easy to deploy182
- Stable164
- Free155
- Fast121
- High-Performance42
- High Availability40
- Data Structures34
- Very Scalable32
- Replication24
- Pub/Sub22
- Great community22
- "NoSQL" key-value data store19
- Hashes15
- Sets13
- Sorted Sets11
- Lists10
- BSD licensed9
- NoSQL9
- Integrates super easy with Sidekiq for Rails background8
- Async replication8
- Bitmaps8
- Open Source7
- Keys with a limited time-to-live7
- Lua scripting6
- Strings6
- Awesomeness for Free5
- Hyperloglogs5
- Written in ANSI C4
- LRU eviction of keys4
- Networked4
- Outstanding performance4
- Runs server side LUA4
- Transactions4
- Feature Rich4
- Performance & ease of use3
- Data structure server3
- Object [key/value] size each 500 MB2
- Simple2
- Scalable2
- Temporarily kept on disk2
- Dont save data if no subscribers are found2
- Automatic failover2
- Easy to use2
- Existing Laravel Integration2
- Channels concept2
- Cannot query objects directly15
- No secondary indexes for non-numeric data types3
- No WAL1
related Redis posts
We use MongoDB as our primary #datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and #ETL.
As we pull #microservices from our #monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in process.
I'm working as one of the engineering leads in RunaHR. As our platform is a Saas, we thought It'd be good to have an API (We chose Ruby and Rails for this) and a SPA (built with React and Redux ) connected. We started the SPA with Create React App since It's pretty easy to start.
We use Jest as the testing framework and react-testing-library to test React components. In Rails we make tests using RSpec.
Our main database is PostgreSQL, but we also use MongoDB to store some type of data. We started to use Redis for cache and other time sensitive operations.
We have a couple of extra projects: One is an Employee app built with React Native and the other is an internal back office dashboard built with Next.js for the client and Python in the backend side.
Since we have different frontend apps we have found useful to have Bit to document visual components and utils in JavaScript.