Here are some stack decisions, common use cases and reviews by companies and developers who chose Kafka in their tech stack.
We send and receive messages continuously with the help of Kafka. However, Kafka appears to be offered by both Apache and Vert.x, as Kafka and Vert.x Kafka respectively. Which one performs better, and why? I would also like to know the purpose of each.
I would like to build a mobile app that can scale to around 1M users over 1 year. We are currently testing with 100 users without any real load issues. We use the MERN stack with React Native Expo, and Google Cloud Services for GCB. We also use Google Cloud Run. We use a microservices architecture that we manage ourselves but thought of using Kafka. However, I need advice on optimising the app in terms of:
- load balancing,
- caching,
- database optimisation,
- autoscaling,
- load testing, and
- continuous optimisation frameworks
Any help would be appreciated! Thanks:)
We have been using an older version of OpenFaaS, but the new version requires payment for features we used in the older one. We are looking for alternatives to OpenFaaS that have Kafka integrations and scale-to-zero capabilities.
I looked at Apache OpenWhisk, but we run on RKE2, and my initial OpenWhisk install appears to be too out of date to support RKE2 and is missing images from docker.io. So I am now looking at Knative. What are your thoughts? We need to process about 10k function invocations a minute, with execution times varying from milliseconds to minutes. So we are looking for horizontal scaling that can be driven by metrics other than CPU and RAM utilization, for example scaling out if the wait is over 5.
The issue with the older OpenFaaS was that scaling on RKE2 did not work well. For example, I could get it to scale from 5 to 20 pods, but only 12 of them would ever receive data, while my backlog had hundreds of thousands of files waiting. So even though it scaled up, work was only being distributed to specific pods. If I killed the pods that had no work, they came back up with no work; if I killed one with work, another pod would scale up and another pod would start to get work. And on occasion, within hours, it would reset to the original deployment allotment of pods and never scale up again until I went into Kubernetes and told it to add more pods.
So I am hoping to find a solution that requires less triage to keep scaling working, since at some points in time we are at higher volume and at others there may be no volume.
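To make the "scale on backlog rather than CPU or RAM" idea concrete, here is a minimal sketch of how a pod count could be derived from queue depth. The function name, parameters, and numbers are all hypothetical and not part of OpenFaaS, Knative, or any other tool mentioned here. The optional partition cap assumes the work arrives through a Kafka consumer group, where each partition is read by at most one consumer in the group, so pods beyond the partition count would sit idle. Whether that explains the 12-of-20 symptom above is only a guess.

```python
import math


def desired_replicas(backlog, target_per_pod, min_pods=1, max_pods=20, partitions=None):
    """Return a pod count sized to the backlog instead of CPU or RAM usage.

    backlog        -- number of items currently waiting (hypothetical metric)
    target_per_pod -- how many waiting items one pod is expected to absorb
    partitions     -- optional Kafka partition count; consumers in one group
                      beyond this number would receive no data
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")

    # One pod per `target_per_pod` waiting items, rounded up.
    wanted = math.ceil(backlog / target_per_pod)

    if partitions is not None:
        # Extra consumers beyond the partition count stay idle.
        wanted = min(wanted, partitions)

    # Clamp to the allowed range; min_pods=0 permits scale to zero.
    return max(min_pods, min(wanted, max_pods))
```

With a backlog of 100,000 items, a target of 5,000 per pod, a cap of 20 pods, and 12 partitions, this returns 12. With an empty backlog and `min_pods=0`, it returns 0.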
Which is the most portable and performant Kafka library? I am evaluating confluent-kafka and kafka-python.
I want to collect the dependency data that Java applications produce when built with Maven in CI/CD tools. I want to know how to pick a collection technology, and what the pros and cons of Kafka and RabbitMQ are.
Thanks!
My process is like this: I get data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned parquet files, because the following process cannot handle loading all the data into memory.
The next process runs a heavy computation in parallel (per partition) and stores 3 intermediate versions as parquet files: two are used for statistics, and the third is filtered to create the final files.
I make a report based on the two statistics files in a Jupyter notebook and convert it to HTML.
- Everything is done with vanilla Python and pandas.
- Sometimes I may get data in a different format.
- The cloud service is Microsoft Azure.
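For reference, the per-partition step described above can be sketched with the standard library alone. This is only an illustration under assumptions: `run_per_partition`, `load`, and `compute` are hypothetical names, and the real loader and computation are not shown in the post. In a pandas pipeline, `load` might be `pandas.read_parquet`, and for CPU-bound work `ProcessPoolExecutor` offers the same interface (its callables must be picklable).

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(path, load, compute):
    """Load a single partition and run the computation on it."""
    data = load(path)  # each worker loads only its own partition
    return path, compute(data)


def run_per_partition(paths, load, compute, max_workers=4):
    """Apply `compute` to every partition independently and in parallel.

    Returns a dict mapping each partition path to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_partition, p, load, compute) for p in paths]
        # result() re-raises any exception that occurred in a worker.
        return dict(f.result() for f in futures)
```

Because each partition is handled independently, the same shape maps fairly directly onto a Spark job later if the pipeline moves in that direction.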
What I'm considering is the following:
Get the data with Kafka or with native Python, do the first processing, and store the data in Druid. The second processing would be done with Apache Spark, reading data from Apache Druid.
The intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.
Kafka's Features
- Written at LinkedIn in Scala
- Used by LinkedIn to offload processing of all page and other views
- Defaults to using persistence, uses OS disk cache for hot data (has higher throughput than any of the above having persistence enabled)
- Supports both online and offline processing