Apache Spark vs ZeroMQ: What are the differences?

Apache Spark is a powerful open-source data processing engine, while ZeroMQ is a high-performance messaging library.

**1. Programming Model**: Apache Spark provides a high-level API for parallel data processing with built-in libraries for various tasks, while ZeroMQ focuses on message queuing and distribution with low-latency communication.
**2. Use Case**: Apache Spark is primarily used for big data analytics, machine learning, and data processing workflows, whereas ZeroMQ is ideal for building distributed applications and microservices with a focus on messaging patterns.
**3. Fault Tolerance**: Apache Spark provides built-in fault tolerance mechanisms like lineage tracking and RDDs for handling data losses, while ZeroMQ relies on the application layer for handling message delivery and recovery in case of failures.
**4. Scalability**: Apache Spark offers scalability through distributed computing with support for large clusters for processing massive datasets, whereas ZeroMQ scales well for distributed messaging tasks but does not inherently support large-scale data processing.
**5. Flexibility**: Apache Spark is more opinionated in terms of its execution model and data handling approaches, while ZeroMQ offers more flexibility in designing custom messaging patterns and workflows tailored to specific use cases.
**6. Performance**: Apache Spark is optimized for in-memory processing and can deliver high throughput for data-intensive tasks, whereas ZeroMQ provides low-latency messaging capabilities but may not be as performant for data processing operations.

In summary, Apache Spark and ZeroMQ differ in their programming models, use cases, fault-tolerance mechanisms, scalability, flexibility, and performance characteristics.
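To make the programming-model contrast in point 1 concrete, here is a minimal, hedged sketch in Python. The file path, port, field names, and message contents are illustrative assumptions, not canonical examples from either project.

```python
# Apache Spark: declarative, high-level transformations over a distributed dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()
df = spark.read.json("events.json")       # hypothetical input file
df.groupBy("user_id").count().show()      # a parallel aggregation in one line

# ZeroMQ: explicit sockets and messages; you compose the pattern yourself.
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)               # one end of a push/pull pipeline
push.bind("tcp://*:5555")
push.send_json({"user_id": 1, "event": "click"})
```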

Advice on Apache Spark and ZeroMQ
Meili Triantafyllidi, Software Engineer at Digital Science
Needs advice on Amazon SQS, RabbitMQ, and ZeroMQ

Hi, we have a ZMQ setup in a push/pull pattern, and we are starting to see more traffic as well as cases where the service is unavailable or stuck. We want to:

* Not lose messages during service outages
* Safely restart the service without losing messages (ZeroMQ seems to require manually closing the receiver's socket before a restart)

Do you have experience with this setup with ZeroMQ? Would you suggest RabbitMQ or Amazon SQS (we are on AWS) instead? Something else?

Thank you for your time

Replies (2)
Shishir Pandey
Recommends RabbitMQ

ZeroMQ is fast, but you need to build reliability yourself; a number of patterns for this are described in the ZeroMQ guide. I have used RabbitMQ before, which gives you a lot of functionality out of the box. You could probably use the work-queues example from its tutorial, and it can also persist messages in the queue.

I haven't used Amazon SQS before. Another tool you could use is Kafka.
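For reference, a minimal sketch of the work-queue-with-persistence setup mentioned above, using the Python pika client; the queue name and localhost broker are assumptions for illustration.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# A durable queue survives a broker restart; persistent messages survive with it.
ch.queue_declare(queue="task_queue", durable=True)
ch.basic_publish(
    exchange="",
    routing_key="task_queue",
    body=b"hello",
    properties=pika.BasicProperties(delivery_mode=2),  # mark the message persistent
)

# Consumer side: manual acks, so an unacked message is redelivered if the
# worker dies mid-processing.
def handle(channel, method, properties, body):
    print("received", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_qos(prefetch_count=1)  # fair dispatch: one unacked message per worker
ch.basic_consume(queue="task_queue", on_message_callback=handle)
ch.start_consuming()
```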

Kevin Deyne, Principal Software Engineer at Accurate Background
Recommends RabbitMQ

Both would do the trick, but there are some nuances. We work with both.

From the sound of it, your main focus is "not losing messages". In that case, I would go with RabbitMQ with a high availability policy (ha-mode=all) and a main/retry/error queue pattern.

Push messages to an exchange, which sends them to the main queue. If an error occurs, push the errored-out message to the retry exchange, which forwards it to the retry queue. Give the retry queue an x-message-ttl and set the main exchange as its dead-letter exchange. If a message has been retried several times, push it to the error exchange, where it can remain until someone has time to look at it.
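A hedged sketch of that topology in Python with pika; the exchange/queue names and the 30-second retry TTL are illustrative assumptions, not the commenter's exact configuration.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

for name in ("main", "retry", "error"):
    ch.exchange_declare(exchange=name, exchange_type="direct", durable=True)

ch.queue_declare(queue="main", durable=True)
ch.queue_bind(queue="main", exchange="main", routing_key="work")

# Messages parked here expire after the TTL and are dead-lettered back to the
# main exchange, which re-enqueues them on the main queue: an automatic retry.
ch.queue_declare(
    queue="retry",
    durable=True,
    arguments={"x-message-ttl": 30000, "x-dead-letter-exchange": "main"},
)
ch.queue_bind(queue="retry", exchange="retry", routing_key="work")

# Terminal parking lot for messages that have exhausted their retries.
ch.queue_declare(queue="error", durable=True)
ch.queue_bind(queue="error", exchange="error", routing_key="work")
```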

This is a very useful and resilient pattern that allows you to never lose messages. With the high-availability policy, you make sure that if one of your RabbitMQ nodes dies, another can take over, with messages already mirrored to it.

This is not really possible with SQS, because SQS is much more focused on throughput and scaling. Combined with SNS it can do interesting things like message deduplication. That said, one thing core to its design is that messages have a maximum retention time: a message that has sat in an SQS queue long enough is assumed to serve no further purpose, so it gets removed rather than tying up listener resources indefinitely. You can also set up a DLQ here, but DLQs similarly do not hold onto messages forever. Since you seem to depend on messages surviving at all costs, I would suggest that the scaling/throughput benefit of SQS does not outweigh its different approach to message retention.
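For comparison, a hedged boto3 sketch of the SQS side; the queue names and maxReceiveCount are assumptions, while 1209600 seconds (14 days) is SQS's documented retention maximum.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

dlq = sqs.create_queue(QueueName="work-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Messages that fail 5 receives move to the DLQ; note that both queues still
# cap retention at 14 days, so nothing is held forever.
sqs.create_queue(
    QueueName="work",
    Attributes={
        "MessageRetentionPeriod": "1209600",
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```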

Nilesh Akhade, Technical Architect (self-employed)

We have a Kafka topic with events of type A and type B. We need to perform an inner join on both types of events using a common field (the primary key). The joined events are to be inserted into Elasticsearch.

Usually, type A and type B events with the same key are observed within about 15 minutes of each other. But in some cases they may be much further apart, say 6 hours. And sometimes an event of one of the types never arrives.

In all cases, we should be able to find joined events immediately after they are joined, and unjoined events within 15 minutes.

Replies (2)
Recommends Elasticsearch

The first solution that comes to mind is to use upsert to update Elasticsearch:

  1. Use the primary key as the ES document id
  2. Upsert the records to ES as soon as you receive them. Because you are using upsert, the 2nd record with the same primary key will not overwrite the 1st one but will be merged with it.

Cons: The load on ES will be higher, due to upsert.
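A minimal sketch of the upsert approach with the elasticsearch-py client (8.x API); the index name and field names are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def ingest(event):
    # doc_as_upsert merges this event's fields into the existing document with
    # the same id, or creates the document if it does not exist yet.
    es.update(
        index="joined-events",
        id=event["primary_key"],
        doc=event,
        doc_as_upsert=True,
    )

ingest({"primary_key": "42", "type_a_payload": "..."})
ingest({"primary_key": "42", "type_b_payload": "..."})  # merged into the same doc
```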

An alternative is to use Flink:

  1. Create a KeyedDataStream by the primary key
  2. In the ProcessFunction, save the first record in a State; at the same time, create a Timer for 15 minutes in the future
  3. When the 2nd record comes, read the 1st record from the State, merge the two, send out the result, and clear the State (and the Timer, if it has not fired yet)
  4. When the Timer fires, read the 1st record from the State and send it out as the output record
  5. Have a 2nd Timer of 6 hours (or more) to clean up the State, if you are not using windowing

Pro: this works well if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
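A hedged PyFlink sketch of the timer-based join in steps 1-4 above (the 6-hour cleanup timer is omitted for brevity); the event format and key extraction are illustrative assumptions.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

FIFTEEN_MIN_MS = 15 * 60 * 1000

class JoinAB(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        self.pending = runtime_context.get_state(
            ValueStateDescriptor("pending", Types.STRING()))

    def process_element(self, value, ctx):
        first = self.pending.value()
        if first is None:
            # First event for this key: buffer it and arm the 15-minute timer.
            self.pending.update(value)
            ctx.timer_service().register_processing_time_timer(
                ctx.timer_service().current_processing_time() + FIFTEEN_MIN_MS)
        else:
            # Second event: emit the joined record and clear the buffered state.
            yield first + "|" + value
            self.pending.clear()

    def on_timer(self, timestamp, ctx):
        # Timer fired before a partner arrived: emit the lone event as-is.
        first = self.pending.value()
        if first is not None:
            yield first
            self.pending.clear()

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection(["k1:A", "k1:B", "k2:A"], type_info=Types.STRING())
events.key_by(lambda e: e.split(":")[0]) \
      .process(JoinAB(), output_type=Types.STRING()) \
      .print()
env.execute("join-sketch")
```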

Akshaya Rawat, Senior Specialist Platform at Publicis Sapient
Recommends Apache Spark

Please refer to the "Structured Streaming" feature of Spark, specifically the "Stream-Stream Joins" section at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins. In short, you need to define watermark delays on both inputs and define a constraint on event time across the two inputs.
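A hedged PySpark sketch of such a stream-stream join (it requires the spark-sql-kafka package); the topic names, the shared-key extraction, and the 6-hour watermark/constraint are assumptions based on the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("ab-join-sketch").getOrCreate()

def read_topic(topic, key_col, ts_col):
    # Kafka message keys are assumed to carry the shared primary key.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic).load()
            .selectExpr(f"CAST(key AS STRING) AS {key_col}",
                        f"timestamp AS {ts_col}"))

a = read_topic("events-a", "pk_a", "ts_a").withWatermark("ts_a", "6 hours")
b = read_topic("events-b", "pk_b", "ts_b").withWatermark("ts_b", "6 hours")

# Watermarks plus an event-time constraint let Spark bound the join state.
joined = a.join(
    b,
    expr("pk_a = pk_b AND "
         "ts_b BETWEEN ts_a - INTERVAL 6 HOURS AND ts_a + INTERVAL 6 HOURS"),
    "inner",
)

joined.writeStream.format("console").start().awaitTermination()
```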

Pros of Apache Spark

  • Open-source (61)
  • Fast and flexible (48)
  • One platform for every big data problem (8)
  • Great for distributed SQL-like applications (8)
  • Easy to install and use (6)
  • Works well for most data science use cases (3)
  • Interactive query (2)
  • Machine learning library, streaming in real time (2)
  • In-memory computation (2)

Pros of ZeroMQ

  • Fast (23)
  • Lightweight (20)
  • Transport agnostic (11)
  • No broker required (7)
  • Low-level APIs are in C (4)
  • Low latency (4)
  • Open source (1)
  • Publish-subscribe (1)


Cons of Apache Spark

  • Speed (4)

Cons of ZeroMQ

  • No message durability (5)
  • Not a very reliable system, message-delivery-wise (3)
  • M x N problem with M producers and N consumers (1)


What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

What is ZeroMQ?

The 0MQ lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products. 0MQ sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering (subscriptions), seamless access to multiple transport protocols and more.

What are some alternatives to Apache Spark and ZeroMQ?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
It provides the leading platform for Operational Intelligence. Customers use it to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added to and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.