Need advice about which tool to choose?Ask the StackShare community!

Dremio

117
347
+ 1
8
Apache Spark

2.9K
3.5K
+ 1
140
Add tool

Apache Spark vs Dremio: What are the differences?

Introduction

Apache Spark and Dremio are both popular tools used for data processing and analysis. While they share some similarities, there are key differences that set them apart from each other. Here are six important differences between Apache Spark and Dremio:

  1. Architecture: Apache Spark follows a distributed computing architecture, allowing it to process large-scale datasets across a cluster of machines. On the other hand, Dremio follows a distributed SQL architecture that focuses on accelerating query performance using data lake engines.

  2. Data Processing: Spark is a general-purpose data processing engine that supports various workloads, including batch processing, real-time streaming, and machine learning. Dremio, on the other hand, is specifically designed for SQL-based data processing tasks and offers high-speed query execution.

  3. Data Sources: Spark is known for its versatility when it comes to data sources. It supports a wide range of data formats and can seamlessly integrate with various data storage systems, such as Hadoop Distributed File System (HDFS), Apache Cassandra, and Amazon S3. Dremio, on the other hand, focuses on providing optimization and self-service data access to data stored in data lakes, including popular file formats such as Parquet, JSON, and CSV.

  4. SQL Optimization: While both Spark and Dremio support SQL queries, Dremio incorporates advanced query optimization techniques to improve query performance. It leverages query acceleration techniques like columnar in-memory caching, indexing, and reflection, which allows for faster query execution. Spark, on the other hand, doesn't provide built-in query acceleration and relies more on parallel processing capabilities.

  5. Governance and Security: Dremio places a strong emphasis on data governance and security. It provides fine-grained access control, auditing, and data lineage features to ensure data compliance and security. Spark, on the other hand, does not have built-in governance and security features but can integrate with external tools to meet these requirements.

  6. Data Catalog and Discovery: Dremio includes a built-in data catalog that provides a unified view of data from multiple sources within the data lake. It also offers data discovery capabilities, making it easier to explore and analyze data. In contrast, Spark does not provide a native data catalog and data discovery functionality, although it can be integrated with external tools like Apache Hive for similar capabilities.

In Summary, Apache Spark and Dremio differ in their architecture, data processing capabilities, support for different data sources, SQL optimization techniques, governance and security features, and data catalog and discovery functionalities.

Advice on Dremio and Apache Spark

We need to perform ETL from several databases into a data warehouse or data lake. We want to

  • keep raw and transformed data available to users to draft their own queries efficiently
  • give users the ability to give custom permissions and SSO
  • move between open-source on-premises development and cloud-based production environments

We want to use inexpensive Amazon EC2 instances only on medium-sized data set 16GB to 32GB feeding into Tableau Server or PowerBI for reporting and data analysis purposes.

See more
Replies (3)
John Nguyen
Recommends
on
AirflowAirflowAWS LambdaAWS Lambda

You could also use AWS Lambda and use Cloudwatch event schedule if you know when the function should be triggered. The benefit is that you could use any language and use the respective database client.

But if you orchestrate ETLs then it makes sense to use Apache Airflow. This requires Python knowledge.

See more
Recommends
on
AirflowAirflow

Though we have always built something custom, Apache airflow (https://airflow.apache.org/) stood out as a key contender/alternative when it comes to open sources. On the commercial offering, Amazon Redshift combined with Amazon Kinesis (for complex manipulations) is great for BI, though Redshift as such is expensive.

See more
Recommends

You may want to look into a Data Virtualization product called Conduit. It connects to disparate data sources in AWS, on prem, Azure, GCP, and exposes them as a single unified Spark SQL view to PowerBI (direct query) or Tableau. Allows auto query and caching policies to enhance query speeds and experience. Has a GPU query engine and optimized Spark for fallback. Can be deployed on your AWS VM or on prem, scales up and out. Sounds like the ideal solution to your needs.

See more
Nilesh Akhade
Technical Architect at Self Employed · | 5 upvotes · 527.7K views

We have a Kafka topic having events of type A and type B. We need to perform an inner join on both type of events using some common field (primary-key). The joined events to be inserted in Elasticsearch.

In usual cases, type A and type B events (with same key) observed to be close upto 15 minutes. But in some cases they may be far from each other, lets say 6 hours. Sometimes event of either of the types never come.

In all cases, we should be able to find joined events instantly after they are joined and not-joined events within 15 minutes.

See more
Replies (2)
Recommends
on
ElasticsearchElasticsearch

The first solution that came to me is to use upsert to update ElasticSearch:

  1. Use the primary-key as ES document id
  2. Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record of the same primary-key will not overwrite the 1st one, but will be merged with it.

Cons: The load on ES will be higher, due to upsert.

To use Flink:

  1. Create a KeyedDataStream by the primary-key
  2. In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
  3. When the 2nd record comes, read the 1st record from the State, merge those two, and send out the result, and clear the State and the Timer if it has not fired
  4. When the Timer fires, read the 1st record from the State and send out as the output record.
  5. Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State

Pro: if you have already having Flink ingesting this stream. Otherwise, I would just go with the 1st solution.

See more
Akshaya Rawat
Senior Specialist Platform at Publicis Sapient · | 3 upvotes · 370.3K views
Recommends
on
Apache SparkApache Spark

Please refer "Structured Streaming" feature of Spark. Refer "Stream - Stream Join" at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins . In short you need to specify "Define watermark delays on both inputs" and "Define a constraint on time across the two inputs"

See more
karunakaran karthikeyan
Needs advice
on
DremioDremio
and
TalendTalend

I am trying to build a data lake by pulling data from multiple data sources ( custom-built tools, excel files, CSV files, etc) and use the data lake to generate dashboards.

My question is which is the best tool to do the following:

  1. Create pipelines to ingest the data from multiple sources into the data lake
  2. Help me in aggregating and filtering data available in the data lake.
  3. Create new reports by combining different data elements from the data lake.

I need to use only open-source tools for this activity.

I appreciate your valuable inputs and suggestions. Thanks in Advance.

See more
Replies (1)
Rod Beecham
Partnering Lead at Zetaris · | 3 upvotes · 64.6K views
Recommends
on
DremioDremio

Hi Karunakaran. I obviously have an interest here, as I work for the company, but the problem you are describing is one that Zetaris can solve. Talend is a good ETL product, and Dremio is a good data virtualization product, but the problem you are describing best fits a tool that can combine the five styles of data integration (bulk/batch data movement, data replication/data synchronization, message-oriented movement of data, data virtualization, and stream data integration). I may be wrong, but Zetaris is, to the best of my knowledge, the only product in the world that can do this. Zetaris is not a dashboarding tool - you would need to combine us with Tableau or Qlik or PowerBI (or whatever) - but Zetaris can consolidate data from any source and any location (structured, unstructured, on-prem or in the cloud) in real time to allow clients a consolidated view of whatever they want whenever they want it. Please take a look at www.zetaris.com for more information. I don't want to do a "hard sell", here, so I'll say no more! Warmest regards, Rod Beecham.

See more
Get Advice from developers at your company using StackShare Enterprise. Sign up for StackShare Enterprise.
Learn More
Pros of Dremio
Pros of Apache Spark
  • 3
    Nice GUI to enable more people to work with Data
  • 2
    Connect NoSQL databases with RDBMS
  • 2
    Easier to Deploy
  • 1
    Free
  • 61
    Open-source
  • 48
    Fast and Flexible
  • 8
    One platform for every big data problem
  • 8
    Great for distributed SQL like applications
  • 6
    Easy to install and to use
  • 3
    Works well for most Datascience usecases
  • 2
    Interactive Query
  • 2
    Machine learning libratimery, Streaming in real
  • 2
    In memory Computation

Sign up to add or upvote prosMake informed product decisions

Cons of Dremio
Cons of Apache Spark
  • 1
    Works only on Iceberg structured data
  • 4
    Speed

Sign up to add or upvote consMake informed product decisions

- No public GitHub repository available -

What is Dremio?

Dremio—the data lake engine, operationalizes your data lake storage and speeds your analytics processes with a high-performance and high-efficiency query engine while also democratizing data access for data scientists and analysts.

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Need advice about which tool to choose?Ask the StackShare community!

What companies use Dremio?
What companies use Apache Spark?
See which teams inside your own company are using Dremio or Apache Spark.
Sign up for StackShare EnterpriseLearn More

Sign up to get full access to all the companiesMake informed product decisions

What tools integrate with Dremio?
What tools integrate with Apache Spark?

Sign up to get full access to all the tool integrationsMake informed product decisions

Blog Posts

Mar 24 2021 at 12:57PM

Pinterest

GitJenkinsKafka+7
3
2157
MySQLKafkaApache Spark+6
2
2015
Aug 28 2019 at 3:10AM

Segment

PythonJavaAmazon S3+16
7
2568
What are some alternatives to Dremio and Apache Spark?
Presto
Distributed SQL Query Engine for Big Data
Apache Drill
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.
Denodo
It is the leader in data virtualization providing data access, data governance and data delivery capabilities across the broadest range of enterprise, cloud, big data, and unstructured data sources without moving the data from their original repositories.
AtScale
Its Virtual Data Warehouse delivers performance, security and agility to exceed the demands of modern-day operational analytics.
Snowflake
Snowflake eliminates the administration and management demands of traditional data warehouses and big data platforms. Snowflake is a true data warehouse as a service running on Amazon Web Services (AWS)—no infrastructure to manage and no knobs to turn.
See all alternatives