Alternatives to AWS Data Pipeline

AWS Glue, Airflow, AWS Step Functions, Apache NiFi, and AWS Batch are the most popular alternatives and competitors to AWS Data Pipeline.

What is AWS Data Pipeline and what are its top alternatives?

AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
AWS Data Pipeline is a tool in the Data Transfer category of a tech stack.
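
As a rough illustration of that model, the boto3 sketch below registers a pipeline with an hourly schedule and a single EMR activity. The object IDs, EMR step, and names are illustrative placeholders; a real definition would also need IAM roles, staging locations, and cluster settings.

```python
import boto3

# Hedged sketch: register and activate a minimal hourly pipeline.
# All IDs, the EMR step, and the script path are placeholders.
client = boto3.client("datapipeline")

pipeline = client.create_pipeline(
    name="hourly-log-analysis", uniqueId="hourly-log-analysis-v1"
)
pipeline_id = pipeline["pipelineId"]

objects = [
    {"id": "Default", "name": "Default",
     "fields": [{"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "HourlySchedule"}]},
    {"id": "HourlySchedule", "name": "HourlySchedule",
     "fields": [{"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 hour"},
                {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"}]},
    {"id": "EmrAnalysis", "name": "EmrAnalysis",
     "fields": [{"key": "type", "stringValue": "EmrActivity"},
                {"key": "step", "stringValue": "command-runner.jar,spark-submit,s3://my-bucket/jobs/analyze_logs.py"},
                {"key": "runsOn", "refValue": "EmrCluster"}]},
    {"id": "EmrCluster", "name": "EmrCluster",
     "fields": [{"key": "type", "stringValue": "EmrCluster"}]},
]

client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
client.activate_pipeline(pipelineId=pipeline_id)
```
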

Top Alternatives to AWS Data Pipeline

  • AWS Glue

    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. ...

  • Airflow

    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. ...

  • AWS Step Functions

    AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly. ...

  • Apache NiFi

    An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. ...

  • AWS Batch

    It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. ...

  • Azure Data Factory

    It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud. ...

  • Embulk

    It is an open-source bulk data loader that helps data transfer between various databases, storage systems, file formats, and cloud services. ...

  • Google BigQuery Data Transfer Service

    BigQuery Data Transfer Service lets you focus your efforts on analyzing your data. You can set up a data transfer with a few clicks. Your analytics team can lay the foundation for a data warehouse without writing a single line of code. ...

AWS Data Pipeline alternatives & related posts

AWS Glue

Fully managed extract, transform, and load (ETL) service
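
To give a feel for the Glue programming model, here is a minimal PySpark job sketch that reads a table registered in the Glue Data Catalog and writes it back to S3 as Parquet; the database, table, and bucket names are placeholders.

```python
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Minimal Glue ETL job sketch: catalog table in, Parquet on S3 out.
# "logs_db", "raw_events", and the output bucket are placeholder names.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

events = glue_context.create_dynamic_frame.from_catalog(
    database="logs_db", table_name="raw_events"
)

glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},
    format="parquet",
)

job.commit()
```
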
PROS OF AWS GLUE
• Managed Hive Metastore (9)
CONS OF AWS GLUE
• No cons listed yet

related AWS Glue posts

Pardha Saradhi, Technical Lead at Incred Financial Solutions

Hi,

We are currently storing the data in Amazon S3 using Apache Parquet format. We are using Presto to query the data from S3 and catalog it using the AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports are present. Currently, Presto is becoming too costly for us, and we are looking for alternatives for it but want to use the remaining setup (S3, Metabase) as much as possible. Please suggest alternative approaches.


Trying to establish a data lake (or maybe a puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

1. Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
2. Storage -> Amazon S3 seems like the cheapest. We probably won't need much capacity; our current storage is a secure Box folder with ~4GB across several batches of test data, code, presentations, and planning docs.
3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
Airflow

A platform to programmatically author, schedule, and monitor data pipelines, by Airbnb
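
A minimal DAG shows the authoring model: tasks are plain Python callables and dependencies are declared explicitly. The schedule, dates, and task bodies below are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder task body; a real task would pull data from a source system.
    print("extracting")


def load():
    # Placeholder task body; a real task would write to a warehouse or lake.
    print("loading")


# Two tasks wired into a DAG: extract must finish before load, once per day.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```
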
PROS OF AIRFLOW
• Features (50)
• Task Dependency Management (14)
• Beautiful UI (12)
• Cluster of workers (12)
• Extensibility (10)
• Open source (6)
• Complex workflows (5)
• Python (5)
• Good API (3)
• Apache project (3)
• Custom operators (3)
• Dashboard (2)
CONS OF AIRFLOW
• Running it on a Kubernetes cluster is relatively complex (2)
• Open source - provides minimal or no support (2)
• Logical separation of DAGs is not straightforward (1)
• Observability is not great when DAGs exceed 250 (1)

related Airflow posts

Shared insights on Jenkins and Airflow

I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:

1. Trigger Matillion ETL loads
2. Trigger Attunity Replication tasks that have downstream ETL loads
3. Trigger GoldenGate Replication tasks
4. Shell scripts, wrappers, file watchers
5. Event-driven schedules

I have used Airflow in the past, and I know we need to create DAGs for each pipeline. I am not familiar with Jenkins, but I know it works with configuration without much underlying code. I want to evaluate both and would appreciate any advice.

Shared insights on AWS Step Functions and Airflow

I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.

I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.

I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?

AWS Step Functions

Build Distributed Applications Using Visual Workflows
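
As a sketch of how such a workflow is expressed, the snippet below defines a two-state machine in Amazon States Language and starts an execution with boto3; the Lambda and IAM role ARNs are placeholders.

```python
import json

import boto3

# Illustrative two-state workflow: PreProcess runs, then Aggregate ends it.
# The Lambda ARNs and the IAM role ARN are placeholders.
definition = {
    "StartAt": "PreProcess",
    "States": {
        "PreProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:pre-process",
            "Next": "Aggregate",
        },
        "Aggregate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")

state_machine = sfn.create_state_machine(
    name="example-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)

execution = sfn.start_execution(
    stateMachineArn=state_machine["stateMachineArn"],
    input=json.dumps({"source": "s3://my-bucket/input/"}),
)
print(execution["executionArn"])
```
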
PROS OF AWS STEP FUNCTIONS
• Integration with other services (6)
• Easily Accessible via AWS Console (4)
• Complex workflows (4)
• Pricing (4)
• Scalability (2)
• High Availability (2)
• Workflow Processing (2)
CONS OF AWS STEP FUNCTIONS
• No cons listed yet

related AWS Step Functions posts

Matheus Moreira, Backend Engineer at IntuitiveCare
Shared insights on AWS Step Functions and Airflow

We have some lambdas we need to orchestrate to get our workflow going. In the past, we already attempted to use Airflow as the orchestrator, but the need to coordinate the tasks in a database generates an overhead that we cannot afford. For our use case, there are hundreds of inputs per minute and we need to scale to support all the inputs and have an efficient way to analyze them later. The ideal product would be AWS Step Functions since it can manage our load demand gracefully, but it is too expensive and we cannot afford that. So, I would like to get alternatives for an orchestrator that does not need a complex backend, can manage hundreds of inputs per minute, and is not too expensive.

Apache NiFi

A reliable system to process and distribute data
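
NiFi flows are normally designed in its web UI rather than in code, but its REST API can be scripted. Below is a rough health-check sketch, assuming an unsecured local instance and the standard /nifi-api endpoints; secured clusters would need a token or client certificate.

```python
import requests

# Assumed local, unsecured NiFi instance; adjust host/port for your deployment.
NIFI_API = "http://localhost:8080/nifi-api"

# Overall flow status: active threads and queued flowfiles/bytes.
status = requests.get(f"{NIFI_API}/flow/status", timeout=10).json()
controller = status["controllerStatus"]
print("active threads:", controller["activeThreadCount"])
print("queued:", controller["queued"])
```
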
PROS OF APACHE NIFI
• Visual Data Flows using Directed Acyclic Graphs (DAGs) (16)
• Free (Open Source) (8)
• Simple to use (7)
• Reactive with back-pressure (5)
• Scalable horizontally as well as vertically (5)
• Fast prototyping (4)
• Bi-directional channels (3)
• Data provenance (2)
• Built-in graphical user interface (2)
• End-to-end security between all nodes (2)
• Can handle messages up to gigabytes in size (2)
• HBase support (1)
• Kudu support (1)
• Hive support (1)
• Slack integration (1)
• Support for custom Processors in Java (1)
• Lots of articles (1)
• Lots of documentation (1)
CONS OF APACHE NIFI
• HA support is not full-fledged (2)
• Memory-intensive (2)

related Apache NiFi posts

I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?

AWS Batch

Fully Managed Batch Processing at Any Scale
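
Jobs are submitted against a pre-registered job definition and queue. In the hedged sketch below, the queue and definition names, command, and resource overrides are placeholders.

```python
import boto3

# Assumes a job queue and a job definition already exist in AWS Batch;
# names, command, and vCPU/memory overrides below are placeholders.
batch = boto3.client("batch")

response = batch.submit_job(
    jobName="nightly-aggregation",
    jobQueue="default-queue",
    jobDefinition="aggregate-job:1",
    containerOverrides={
        "command": ["python", "aggregate.py", "--date", "2023-01-01"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
    },
)
print("submitted job:", response["jobId"])
```
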
PROS OF AWS BATCH
• Containerized (3)
• Scalable (3)
CONS OF AWS BATCH
• More overhead than Lambda (2)
• Image management (1)

related AWS Batch posts

Azure Data Factory

Hybrid data integration service that simplifies ETL at scale
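
Pipelines are usually authored in the Data Factory UI, but runs can be triggered programmatically. Below is a rough sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, pipeline name, and parameter are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholders: subscription ID, resource group, factory, pipeline, parameters.
credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, "<subscription-id>")

# Trigger a run of an existing pipeline and print its run ID.
run = adf.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-factory",
    pipeline_name="copy_sales_data",
    parameters={"windowStart": "2023-01-01"},
)
print("pipeline run id:", run.run_id)
```
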
PROS OF AZURE DATA FACTORY
• No pros listed yet
CONS OF AZURE DATA FACTORY
• No cons listed yet

related Azure Data Factory posts


We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow, or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.

Embulk

Bulk data loader that helps data transfer between various databases
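
Embulk runs are driven by a YAML config plus the embulk CLI rather than a library API. A rough sketch of generating a config and invoking it from Python follows; the file paths, CSV columns, and stdout output plugin are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path

# Minimal illustrative Embulk config: read local CSV files, print to stdout.
# Real loads would swap the "out" section for a database or cloud plugin.
config = """
in:
  type: file
  path_prefix: ./data/sales_
  parser:
    type: csv
    charset: UTF-8
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: amount, type: double}
      - {name: created_at, type: timestamp, format: '%Y-%m-%d'}
out:
  type: stdout
"""

Path("config.yml").write_text(config)

# Requires the embulk executable (Java) to be installed and on PATH.
subprocess.run(["embulk", "run", "config.yml"], check=True)
```
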
PROS OF EMBULK
• No pros listed yet
CONS OF EMBULK
• No cons listed yet

related Embulk posts

Google BigQuery Data Transfer Service

Automate data movement from SaaS applications to Google BigQuery on a scheduled, managed basis
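
As an illustration, the sketch below creates a scheduled-query transfer with the google-cloud-bigquery-datatransfer client; the project, dataset, table, and query are placeholders, and scheduled queries are only one of the supported data sources.

```python
from google.cloud import bigquery_datatransfer

# Placeholders: project, destination dataset, and the query being scheduled.
client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",
    display_name="daily_sales_rollup",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT DATE(created_at) AS day, SUM(amount) AS total "
                 "FROM `my-project.raw.sales` GROUP BY day",
        "write_disposition": "WRITE_TRUNCATE",
        "destination_table_name_template": "daily_sales",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("created transfer:", created.name)
```
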
PROS OF GOOGLE BIGQUERY DATA TRANSFER SERVICE
• No pros listed yet
CONS OF GOOGLE BIGQUERY DATA TRANSFER SERVICE
• No cons listed yet

related Google BigQuery Data Transfer Service posts