Amazon Athena vs Azure Data Factory

Overview

Amazon Athena

Stacks: 521

Followers: 840

Votes: 49

Azure Data Factory

Stacks: 254

Followers: 484

Votes: 0

GitHub Stars: 516

Forks: 610

Amazon Athena vs Azure Data Factory: What are the differences?

Introduction

Amazon Athena and Azure Data Factory are two popular cloud data services, offered by Amazon Web Services (AWS) and Microsoft Azure respectively. They overlap in that both help teams process and analyze data, but they solve different problems: Athena is an interactive SQL query service, while Data Factory is a data integration and pipeline orchestration service. The key differences below determine which one suits a given use case.

Connectivity and Integration: Amazon Athena queries data in place in Amazon S3, so no separate ingestion or load step is required before querying. Azure Data Factory, by contrast, is built to connect to many data sources and services, both on-premises and in the cloud, so data can be pulled from different platforms and processed.
Data Transformation Capabilities: Azure Data Factory provides a broader set of transformation capabilities than Amazon Athena. It offers a visual interface for building transformation pipelines and built-in activities for data wrangling and transformation. Amazon Athena focuses on querying and analyzing data; its transformations are limited to what can be expressed in SQL.
Data Processing and Computing: Amazon Athena is serverless: queries run on a managed engine based on Presto, so users analyze data without provisioning or managing infrastructure, and AWS handles the compute required for query execution. The AWS Glue Data Catalog can supply table metadata, but it does not execute the queries. Azure Data Factory primarily orchestrates processing; it can dispatch work to external compute services such as Azure Data Lake Analytics or Azure HDInsight, which provide a distributed environment for complex data processing tasks.
Pricing Model: Amazon Athena charges by the amount of data scanned during query execution, billed per terabyte. Azure Data Factory charges by the number of data movement and transformation activities executed and by the runtime of those activities.
Managed Service Offering: Both are managed services. Amazon Athena is fully serverless: AWS manages all underlying infrastructure and resources. Azure Data Factory gives more control over where work runs, because users can choose between Azure-managed and self-hosted integration runtimes.
Third-Party Ecosystem: Azure Data Factory ships built-in connectors for popular data sources, databases, and data platforms, which makes it easier to ingest and process data from those services. Amazon Athena can work with external data catalogs and external tables, but has a more limited set of built-in connectors and integrations.
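
To make the per-terabyte pricing point concrete, here is a minimal sketch of how a scan-based charge could be estimated. The rate (`usd_per_tb`) and the per-query minimum (`min_bytes`) are illustrative assumptions, not quoted prices; check the current AWS pricing page for your region before relying on any number.

```python
TB = 1024 ** 4  # bytes per terabyte (binary convention, assumed here)
MB = 1024 ** 2  # bytes per megabyte


def estimate_scan_cost(bytes_scanned, usd_per_tb=5.0, min_bytes=10 * MB):
    """Estimate the cost of one query billed by data scanned.

    usd_per_tb and min_bytes are assumed, illustrative values.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # A per-query minimum means tiny queries are billed as if they
    # scanned at least min_bytes.
    billable = max(bytes_scanned, min_bytes)
    return billable / TB * usd_per_tb


if __name__ == "__main__":
    full_scan = estimate_scan_cost(200 * 1024 ** 3)  # scan 200 GB of text
    pruned_scan = estimate_scan_cost(20 * 1024 ** 3)  # scan 20 GB after pruning
    print(f"full scan:   ${full_scan:.4f}")
    print(f"pruned scan: ${pruned_scan:.4f}")
```

Because the charge scales with bytes scanned, anything that reduces the scan (columnar formats, compression, partitioning) lowers the cost of each query under this model.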

In summary, Amazon Athena and Azure Data Factory differ in terms of connectivity and integration options, data transformation capabilities, data processing and computing environments, pricing models, managed service offerings, and third-party ecosystem integrations. These differences influence the suitability and flexibility of the services for specific use cases and requirements.
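
As an illustration of the "query data in S3 with SQL, no servers" model, the sketch below builds the arguments for the `start_query_execution` call of the boto3 Athena client and submits them. The database name, table name, and S3 result location in the usage note are hypothetical, and the submit step assumes boto3 is installed and AWS credentials are configured.

```python
def build_athena_query_request(sql, database, output_location, workgroup="primary"):
    """Build keyword arguments for boto3's Athena start_query_execution."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
        "WorkGroup": workgroup,
    }


def submit_query(request):
    """Submit the request and return the query execution ID.

    Requires boto3 and valid AWS credentials; imported lazily so the
    request builder above can be used and tested without them.
    """
    import boto3

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]
```

A call might look like `submit_query(build_athena_query_request("SELECT COUNT(*) FROM events", "analytics_db", "s3://example-bucket/athena-results/"))`, where every name is a placeholder. The query runs asynchronously, so a real client would poll for completion before reading results.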


Advice on Amazon Athena, Azure Data Factory

Vamshi

Data Engineer at Tata Consultancy Services

May 29, 2020

Needs advice on: PySpark, Azure Data Factory, Databricks

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

269k views

Pavithra

Mar 12, 2020

Needs advice on: Amazon S3, Amazon Athena, Amazon Redshift

Hi all,

Currently, we need to ingest data from Amazon S3 into either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and is over 200 GB in size. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How can I optimize performance and query result time? Can anyone please help me out?

522k views
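
One direction often suggested for slow queries over large delimited text files in Athena is to expose the raw files as an external table and then rewrite them once into a columnar format such as Parquet. The sketch below only generates the two SQL statements; the table names, columns, and S3 paths are hypothetical, and whether this helps for the workload described above has not been verified.

```python
def psv_external_table_ddl(table, columns, location):
    """Return Athena DDL for an external table over pipe-separated files.

    columns is a list of (name, type) pairs; all names are illustrative.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


def parquet_ctas(source_table, target_table, external_location):
    """Return a CTAS statement that rewrites a table as Parquet."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{external_location}') AS\n"
        f"SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    print(psv_external_table_ddl(
        "raw_events",
        [("event_id", "string"), ("amount", "double")],
        "s3://example-bucket/raw/events/",
    ))
    print(parquet_ctas(
        "raw_events", "events_parquet", "s3://example-bucket/curated/events/"
    ))
```

Later queries would target the Parquet table, which lets the engine read only the columns a query needs instead of scanning every byte of the text files.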

Detailed Comparison

	Amazon Athena	Azure Data Factory
Description	Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.	It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud.
Features	-	Real-Time Integration; Parallel Processing; Data Chunker; Data Masking; Proactive Monitoring; Big Data Processing
Statistics
GitHub Stars	-	516
GitHub Forks	-	610
Stacks	521	254
Followers	840	484
Votes	49	0
Pros & Cons
Pros (votes)	Use SQL to analyze CSV files (16); Glue crawlers give an easy data catalogue (8); Cheap (7); Query all my data without running servers 24x7 (6); No database servers, yay (4)	No community feedback yet
Integrations	Amazon S3; Presto	Octotree; Java; .NET
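
To show what "integrating disparate data sources" can look like in Azure Data Factory, here is a minimal sketch that builds a pipeline definition containing a single Copy activity, as a plain Python dictionary that serializes to JSON. The layout follows the general shape of Data Factory pipeline JSON as I understand it; the pipeline name, dataset names, and the chosen source and sink types are hypothetical and should be checked against the official pipeline schema.

```python
import json


def copy_pipeline(name, source_dataset, sink_dataset, source_type, sink_type):
    """Build a pipeline definition with a single Copy activity.

    The datasets are referenced by name and are assumed to be defined
    separately in the same data factory.
    """
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": f"Copy_{source_dataset}_to_{sink_dataset}",
                    "type": "Copy",
                    "inputs": [
                        {"referenceName": source_dataset, "type": "DatasetReference"}
                    ],
                    "outputs": [
                        {"referenceName": sink_dataset, "type": "DatasetReference"}
                    ],
                    "typeProperties": {
                        "source": {"type": source_type},
                        "sink": {"type": sink_type},
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    pipeline = copy_pipeline(
        "CopyRawToCurated",      # hypothetical pipeline name
        "RawDelimitedFiles",     # hypothetical source dataset
        "CuratedParquetFiles",   # hypothetical sink dataset
        "DelimitedTextSource",
        "ParquetSink",
    )
    print(json.dumps(pipeline, indent=2))
```

The contrast with Athena is visible in the shape of the artifact: an Athena user writes a SQL statement, while a Data Factory user declares activities that move data between named datasets.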

What are some alternatives to Amazon Athena, Azure Data Factory?

Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Presto

Distributed SQL Query Engine for Big Data

Apache Flink

Apache Flink is an open source system for fast and versatile data analytics in clusters. Flink supports batch and streaming analytics, in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

lakeFS

It is an open-source data version control system for data lakes. It provides a “Git for data” platform enabling you to implement best practices from software engineering on your data lake, including branching and merging, CI/CD, and production-like dev/test environments.

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

Apache Kylin

Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets, originally contributed from eBay Inc.

Apache Camel

An open source Java framework that focuses on making integration easier and more accessible to developers.

Splunk

It provides the leading platform for Operational Intelligence. Customers use it to search, monitor, analyze and visualize machine data.

Apache Impala

Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

Vertica

It provides a best-in-class, unified analytics platform that will forever be independent from underlying infrastructure.
