Kafka

Application and Data / Data Stores / Message Queue

Surabhi Bhawsar

Technical Architect at Pepcus·Aug 27, 2019

Needs advice

Apache Flink

and

Kafka

I need to build the Alert & Notification framework with the use of a scheduled program. We will analyze the events from the database table and filter events that are falling under a day timespan and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move on Apache Flink, I am trying to understand how Apache Flink could be fit better for us.

7 upvotes·737.6K views

Replies (1)

rmetzger_

Aug 30, 2019

Recommends

Apache Flink

I recommend Apache Flink because it is the pro tool for everybody who has a serious stream processing use case. Flink is used by huge companies, such as Uber, Alibaba or Netflix. AWS is offering Flink as a hosted service. The reason for these companies to decide for Flink are manyfold: Flink offers great performance, support for very large state, exactly-once processing semantics, different APIs (with SQL growing a lot lately), ... Flink supports a many different deployment models, including Kubernetes, Hadoop YARN or custom deployments.

The drawbacks of Apache Flink are medium steep learning curve, and plenty of options (APIs, deployment models, state backends, ...)

These are my personal views, and I have a bias towards Flink, because I've worked a lot on it:

Flink and Kafka (the message bus) work together very well, and that's also the most popular combination (I'm guessing). There's also Kafka Streams, a stream processing library using Kafka (the message bus) as a data transport layer. Some considerations of Kafka Streams vs Flink:

KStreams has a hard dependency on Kafka, Flink is independent of the message bus, and can easily read and write to many systems (KStreams requires Kafka connect for that)
Since KStreams is doing data exchange via kafka topics, there's a lot of load on the Kafka cluster (size it appropriately). Monitoring becomes difficult as processing and data storage are in the same cluster. Do you really want your production data being discarded because your processing is eating up all your IO?
Flink is the older project, it has been battle tested for many years across a lot of different scenarios. There's more libraries, such as a CEP (Complex Event Processing) library and more and more machine learning integrations.

4 upvotes·1 comment·11K views

Nicholas Nezis

July 1st 2020 at 4:14AM

How does Flink compare with Apache Heron?

Ashish Singh

Tech Lead, Big Data Platform at Pinterest·Nov 27, 2019

Shared insights

To provide employees with the critical need of interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges like, supporting deeply nested and huge thrift schemas, slow/ bad worker detection and remediation, auto-scaling cluster, graceful cluster shutdown and impersonation support for ldap authenticator.

Our infrastructure is built on top of Amazon EC2 and we leverage Amazon S3 for storing our data. This separates compute and storage layers, and allows multiple compute clusters to share the S3 data.

We have hundreds of petabytes of data and tens of thousands of Apache Hive tables. Our Presto clusters are comprised of a fleet of 450 r4.8xl EC2 instances. Presto clusters together have over 100 TBs of memory and 14K vcpu cores. Within Pinterest, we have close to more than 1,000 monthly active users (out of total 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.

Each query submitted to Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we will have query submitted events without corresponding query finished events. These events enable us to capture the effect of cluster crashes over time.

Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. Kubernetes platform provides us with the capability to add and remove workers from a Presto cluster very quickly. The best-case latency on bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Some other advantages of deploying on Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance types, OS, etc.

#BigData #AWS #DataScience #DataEngineering

Presto at Pinterest - Pinterest Engineering Blog - Medium (medium.com)

38 upvotes·1 comment·3.7M views

Kaibo Hao

January 28th 2020 at 12:46AM

ECS on AWS will reduce your cost on EC2 and Kubernetes. Athena may be another tool for reducing your cost by replacing the Presto. It takes advantage of the S3 as the storage and provided the serverless management for your infrastructure.

Joshua Dean Küpper

CEO at Scrayos UG (haftungsbeschränkt)·Jul 14, 2017

Shared insights

Scrayos UG (haftungsbeschränkt)

(

JustChunks

)

We make extensive use of Redis for our caches and use it as a way to save "semi-permanent" stuff like user-submit settings (that get refreshed on each login) or cooldowns that expire very fast. Additionally we also utilize the Pub-Sub capabilities that Redis has to offer.

We decided against using a dedicated Message-Broker/Streaming Platform like RabbitMQ or Kafka, as we already had a packet-based, custom protocol for communication between servers and services, and we only needed some "tiny" Pub-Sub magic to fill in the gaps. An entire additional service just for this oddjob would've been a total overkill.

1 upvote·204.8K views

Kaveri Garg

Nov 9, 2020

Needs advice

Google Cloud Functions

Google Cloud Pub/Sub

and

Kafka

My Stack

I want to read data from Kafka. The file is in CSV, or whichever format is coming from SAP and read from another 3rd party application. So I need to create a Message Bus for the same. Please suggest. I can use a microservice as well.

2 upvotes·82.4K views

Needs advice

and

I have to build a data processing application with an Apache Beam stack and Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process and EMR clusters that keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulative data streams to another Kafka topic. Any advice on how to make the process more stable?

6 upvotes·2.9M views

Replies (1)

Alonso Isidoro Roman

Ingeniero de software, bigdata architect ·Dec 8, 2021

Recommends

Kafka Streams

freelancer-aironman

So, you are using Apache Beam and Apache Flink to read from an input kafka topic, apply some transformations to the input and then write to another output kafka topic? it looks like that this is a solution for kafka-streams framework, isn't?. if the process is not very stable, it is probably because you don't have the right amount of memory for these processes, or you don't have enough dedicated cores for it.

Investigate using the Confluent platform's control-center tool, look at logs, examine process exceptions, focus on caused by.

Unless you have a great need to use Apache Flink's supposedly better real-time data streaming capabilities, stick with kafka-streams to do that task. Then look into doing the same with Beam and Flink, but when you have it, you can measure if you really have a big performance improvement when reading and writing to kafka topics. I honestly doubt it.

4 upvotes·5.1K views

Muhammad Faizan Fareed

Needs advice

and

I'm building a website where users can participate, like and dislike any given challenge.

Problem : If 10k or 1 million users join the given challenge at a time it can cause a race condition in my database MySQL and in also Redis.

What I want : Aggregating joined participated users, likes and dislikes.

Solution : I'm thinking about using Kafka as a Queue message broker then users event one by one saving into Redis, database and aggregate them.

One problem is also here saving and doing aggregate takes time now; how can I show users they have successfully joined the challenge?

One solution is that when a user joins the challenge I send a request to the Kafka queue then update the current user UI and show a success message (not updating the other users' joined messages to current user because I am not using Websockets)

Other App example Take the same example of https://stackshare.io posts. On posts users can like, dislike and comments.

Estimated users : 1 million Stack : Django, Mysql, Redis and Kafka

Questions

How I can manage these kinds of things?
How do big tech companies handle this?
Where am I right or wrong?
Are there other tools that can help me in this situation?
I am using locks in Redis when total like, dislike and joined users increment or decrement. Should I be doing this? Is it the same for transactions in MySQL?

I need the best approach to handle this situation that can also be scalable.

Thanks in advance for reading my post and giving me suggestions on this. ☺️

3 upvotes·120.6K views

Replies (1)

Oirere Jr.

Software Engineeer at Serftre Inc·Mar 16, 2020

Recommends

Kafka

Consider using SQL support on Apache Pinot, which is an online analytic processing datastore which can write complex SQL queries and also join different tables in Pinot with those in other datastores. Pinot enables you to build dashboards for quick analysis and reporting on aggregated data.

4 upvotes·21.4K views

Needs advice

and

We are going to develop a microservices-based application. It consists of AngularJS, ASP.NET Core, and MSSQL.

We have 3 types of microservices. Emailservice, Filemanagementservice, Filevalidationservice

I am a beginner in microservices. But I have read about RabbitMQ, but come to know that there are Redis and Kafka also in the market. So, I want to know which is best.

9 upvotes·930.9K views

Replies (4)

Maheedhar Aluri

Lead Architect ·Apr 15, 2020

Recommends

Kafka

Kafka is an Enterprise Messaging Framework whereas Redis is an Enterprise Cache Broker, in-memory database and high performance database.Both are having their own advantages, but they are different in usage and implementation. Now if you are creating microservices check the user consumption volumes, its generating logs, scalability, systems to be integrated and so on. I feel for your scenario initially you can go with KAFKA bu as the throughput, consumption and other factors are scaling then gradually you can add Redis accordingly.

8 upvotes·855.2K views

Rob Kraft

Jul 16, 2020

Recommends

Angular

I first recommend that you choose Angular over AngularJS if you are starting something new. AngularJs is no longer getting enhancements, but perhaps you meant Angular. Regarding microservices, I recommend considering microservices when you have different development teams for each service that may want to use different programming languages and backend data stores. If it is all the same team, same code language, and same data store I would not use microservices. I might use a message queue, in which case RabbitMQ is a good one. But you may also be able to simply write your own in which you write a record in a table in MSSQL and one of your services reads the record from the table and processes it. The most challenging part of doing it yourself is writing a service that does a good job of reading the queue without reading the same message multiple times or missing a message; and that is where RabbitMQ can help.

4 upvotes·847.2K views

View all (4)

Nilesh Akhade

Technical Architect at Self Employed·Jul 8, 2020

Needs advice

and

We have a Kafka topic having events of type A and type B. We need to perform an inner join on both type of events using some common field (primary-key). The joined events to be inserted in Elasticsearch.

In usual cases, type A and type B events (with same key) observed to be close upto 15 minutes. But in some cases they may be far from each other, lets say 6 hours. Sometimes event of either of the types never come.

In all cases, we should be able to find joined events instantly after they are joined and not-joined events within 15 minutes.

5 upvotes·575.1K views

Replies (2)

lvhuyen

Jul 9, 2020

Recommends

Elasticsearch

The first solution that came to me is to use upsert to update ElasticSearch:

Use the primary-key as ES document id
Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.

Cons: The load on ES will be higher, due to upsert.

To use Flink:

Create a KeyedDataStream by the primary-key
In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
When the Timer fires, read the 1st record from the State and send out as the output record.
Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State

Pro: if you have already having Flink ingesting this stream. Otherwise, I would just go with the 1st solution.

Averell Huyen Levan – Medium (medium.com)

5 upvotes·2 comments·475.9K views

Nilesh Akhade

July 10th 2020 at 4:04PM

In flink approach, we cant query the data while its being processed (in flink memory). Consequently we have to wait for 6 hours for event to be available. Although this can be worked around by maintaining copy of data being processed for 15mins.

Thank you so much for detailed solution.

What are your views on preferring Apache Flink over Kafka Streams and Apache Spark for this use case?

Ashwani Agarwal

July 16th 2020 at 9:41AM

What do you think about having MongoDB for 1st case i/o ES? My point of view is that it's easier to get started with MongoDB.

Akshaya Rawat

Senior Specialist Platform at Publicis Sapient·Sep 4, 2020

Recommends

Apache Spark

Please refer "Structured Streaming" feature of Spark. Refer "Stream - Stream Join" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short you need to specify "Define watermark delays on both inputs" and "Define a constraint on time across the two inputs"

3 upvotes·409.7K views

Needs advice

and

I want to choose Message Queue with the following features - Highly Available, Distributed, Scalable, Monitoring. I have RabbitMQ, ActiveMQ, Kafka and Apache RocketMQ in mind. But I am confused which one to choose.

4 upvotes·407.6K views

Replies (16)

DJ Ambrisco

Principle Software Engineer at DispatchHealth·Mar 21, 2019

Recommends

Kafka

MachineShop

Kafka was only introduced to our platform in August 2018 as a means to manage our data pipeline and to replace other messaging systems used to decouple various components in our system. Kafka provides the scale and storage we need to manage data for however many devices we might service. Additionally, Kafka has helped us lay the framework for improved and highly detailed statistics gathering and analysis.

5 upvotes·5.1K views

mediafinger

Feb 13, 2019

Recommends

RabbitMQ

The question for which Message Queue to use mentioned "availability, distributed, scalability, and monitoring". I don't think that this excludes many options already. I does not sound like you would take advantage of Kafka's strengths (replayability, based on an even sourcing architecture). You could pick one of the AMQP options.

I would recommend the RabbitMQ message broker, which not only implements the AMQP standard 0.9.1 (it can support 1.x or other protocols as well) but has also several very useful extensions built in. It ticks the boxes you mentioned and on top you will get a very flexible system, that allows you to build the architecture, pick the options and trade-offs that suite your case best.

For more information about RabbitMQ, please have a look at the linked markdown I assembled. The second half explains many configuration options. It also contains links to managed hosting and to libraries (though it is missing Python's - which should be Puka, I assume).

rabbitmq_presentation/README.markdown at master · mediafinger/rabbitmq_presentation · GitHub (github.com)

4 upvotes·158.8K views

View all (16)

Needs advice

and

Hello Stackshare. I'm currently doing some research on real-time reporting and analytics architectures. We have a use case where 1million+ records of users, 4million+ activities, and messages that we want to report against. The start was to present it directly from MySQL, which didn't go well and puts a heavy load on the database. Anybody can suggest something where we feed the data and can report in realtime? Read some articles about ElasticSearch and Kafka https://medium.com/@D11Engg/building-scalable-real-time-analytics-alerting-and-anomaly-detection-architecture-at-dream11-e20edec91d33 EDIT: also considering Neo4j

8 upvotes·39.2K views

Replies (4)

Daniel Zurawski

Technical Lead at SuperAwesome·Oct 28, 2020

Recommends

(

)

One of the reasons why your real-time reporting built on top of MySQL might not be performing so well is due to the fact that you are most likely interested in aggregates (e.g. group by & SUM, AVG, TopN). In data warehousing, there is a term known as column-oriented vs row-oriented databases - the key here is that in column-oriented DBMSs, you more precisely access the data you need to answer a question, avoiding having to scan the entire table to calculate an answer. Most of the time pre-aggregates can be calculated on insertion instead of at query time.

An excellent OLAP modern tool that I successfully used for many years to index events from Kafka at a staggering rate and query millions of events in less than a second is Apache Druid and it's an example of a distributed column-oriented data store. There are of course many more technologies out there for answering OLAP business intelligence questions, but personally, I think you won't go very far with a traditional RDBMS or a Lucene based search engine like ElasticSearch for building a Business Intelligence database for vast amounts of data.

"Apache Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation."

If you don't want to invest resources into deploying and hosting it yourself, there are other companies out there that can host it for you, but I will leave that up to you to research.

Here is an excellent article by my former work colleagues explaining how they implemented real-time analytics on top of Druid: https://medium.com/superawesome-engineering/how-we-use-apache-druids-real-time-analytics-to-power-kidtech-at-superawesome-8da6a0fb28b1. Also, I recommend reading through this HackerNews thread that talks in-depth about time-series databases: https://news.ycombinator.com/item?id=18403507.

How we use Apache Druid’s real-time analytics to power kidtech at SuperAwesome | by Natasha Mulla | SuperAwesome Engineering | Sep, 2020 | Medium (medium.com)

10 upvotes·13.2K views

snid chakravarty

Data Engineer at Westpac·Nov 4, 2020

Recommends

Druid

KSQL

With the nature of application that you're building, you might even consider setting up some KSQL streams. I have just recently finished a poc on establishing a streaming analytics pipeline with KSQL dB (Both standalone and confluent supported) setting up kafka streams. Also they have a headless deployment mode in production which keeps your KSQL script pretty secured.

3 upvotes·10.3K views

View all (4)