What is Airflow?
Who uses Airflow?
Here are some stack decisions, common use cases and reviews by companies and developers who chose Airflow in their tech stack.
I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:
- Trigger Matillion ETL loads
- Trigger Attunity Replication tasks that have downstream ETL loads
- Trigger GoldenGate Replication Tasks
- Shell scripts, wrappers, file watchers
- Event-driven schedules
I have used Airflow in the past, and I know we need to create a DAG for each pipeline (a minimal example is sketched below). I am not familiar with Jenkins, but I know it works from configuration without much underlying code. I want to evaluate both and would appreciate any advice.
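As a rough illustration of what "a DAG for each pipeline" looks like, here is a minimal sketch (Airflow 2.x). The script paths, task IDs, and schedule are hypothetical placeholders standing in for the Matillion/Attunity/shell steps listed above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_loads",               # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",        # run daily at 02:00
    catchup=False,
) as dag:
    # Shell scripts, wrappers, and CLI triggers map naturally onto BashOperator;
    # each external system is kicked off via its API or command-line client.
    replicate = BashOperator(
        task_id="attunity_replication",
        bash_command="/opt/scripts/run_replication.sh",   # placeholder path
    )
    etl_load = BashOperator(
        task_id="matillion_etl_load",
        bash_command="/opt/scripts/trigger_matillion.sh",  # placeholder path
    )

    # Cross-application dependency: the ETL load waits for replication to finish.
    replicate >> etl_load
```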
I need to implement a Node.js cron scheduler like Airflow. Is it possible to implement this without working in Python? Until now, all my jobs have been running on my server only, via internal scripts calling other job scripts. Is there an alternative or better way to implement this?
I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?
For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?
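For reference, a chain like Informatica -> SQL -> Jira maps directly onto an Airflow DAG. A hedged sketch, where each callable is a hypothetical stand-in for a call to that system's API or CLI:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_informatica_job():
    ...  # e.g. invoke the Informatica REST API or the pmcmd CLI

def run_sql_task():
    ...  # e.g. execute the dependent SQL via a database hook

def update_jira():
    ...  # e.g. call the Jira REST API

with DAG(
    dag_id="regression_test_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,              # triggered manually or by an event
) as dag:
    informatica = PythonOperator(task_id="informatica_etl", python_callable=run_informatica_job)
    sql_check = PythonOperator(task_id="sql_check", python_callable=run_sql_task)
    jira_task = PythonOperator(task_id="jira_update", python_callable=update_jira)

    # Each task runs only after its upstream dependency succeeds.
    informatica >> sql_check >> jira_task
```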
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.
I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
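One way to frame the "one task kicks off the 10k containers" option: AWS Batch array jobs support up to 10,000 child jobs, so a single submit-and-poll task can replace 10k individual orchestrator tasks while still reporting which children failed. A sketch using boto3, with placeholder queue and job-definition names:

```python
import time
import boto3

batch = boto3.client("batch")

resp = batch.submit_job(
    jobName="shard-processing",          # placeholder
    jobQueue="my-queue",                 # placeholder
    jobDefinition="my-job-def",          # placeholder
    arrayProperties={"size": 10000},     # one child per data shard;
    # each child reads AWS_BATCH_JOB_ARRAY_INDEX to pick its shard
)
job_id = resp["jobId"]

while True:
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    status = job["status"]
    if status in ("SUCCEEDED", "FAILED"):
        # The parent job's arrayProperties summarizes child statuses, so
        # failed indices can be identified and resubmitted selectively.
        print(status, job.get("arrayProperties", {}).get("statusSummary"))
        break
    time.sleep(60)
```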
We're looking to do a project for a company that has incoming data from 2 sources, namely MongoDB and MySQL. We need to combine the data from these 2 sources and load it into PostgreSQL in near real time, at roughly 600,000 records per day. Which tool would be better for this use case, Airflow or Kafka?
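For scale, 600,000 records per day is only about 7 records per second, which a frequently scheduled micro-batch task can usually absorb; Kafka mainly buys lower latency. A rough sketch of the MongoDB side of such a sync step (connection strings, collection, and table names are hypothetical; the MySQL side would be analogous):

```python
import json

import pymongo
import psycopg2

def sync_new_mongo_docs(since):
    # Pull documents created since the last run (placeholder host/collection).
    mongo = pymongo.MongoClient("mongodb://source-host:27017")
    docs = mongo.appdb.events.find({"created_at": {"$gt": since}})

    # Upsert into PostgreSQL; ON CONFLICT makes reruns idempotent.
    pg = psycopg2.connect("dbname=warehouse user=etl")  # placeholder DSN
    with pg, pg.cursor() as cur:
        for d in docs:
            cur.execute(
                "INSERT INTO events (id, payload, created_at) "
                "VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
                (str(d["_id"]), json.dumps(d, default=str), d["created_at"]),
            )
```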
We have some lambdas we need to orchestrate to get our workflow going. In the past, we attempted to use Airflow as the orchestrator, but the need to coordinate the tasks through a database generates an overhead that we cannot afford. For our use case, there are hundreds of inputs per minute, and we need to scale to support all the inputs and have an efficient way to analyze them later. The ideal product would be AWS Step Functions, since it can manage our load demand gracefully, but it is too expensive for us. So, I would like alternatives for an orchestrator that does not need a complex backend, can manage hundreds of inputs per minute, and is not too expensive.
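One low-cost pattern worth considering when every step is already a Lambda: a small coordinator Lambda that invokes each step synchronously and retries on failure, keeping all state in the payload so no database backend is needed. A sketch with boto3 (the step function names are hypothetical):

```python
import json

import boto3

lam = boto3.client("lambda")
STEPS = ["ingest-input", "transform-input", "store-result"]  # placeholder names

def handler(event, context):
    payload = event
    for step in STEPS:
        for attempt in range(3):                       # simple bounded retry
            resp = lam.invoke(
                FunctionName=step,
                InvocationType="RequestResponse",      # wait for the result
                Payload=json.dumps(payload).encode(),
            )
            if resp.get("FunctionError") is None:
                # Thread the output of one step into the next as input.
                payload = json.loads(resp["Payload"].read())
                break
        else:
            raise RuntimeError(f"step {step} failed after retries")
    return payload
```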
Key features of Airflow:
- Dynamic: Airflow pipelines are configuration as code (Python), so you can write code that instantiates pipelines dynamically (see the sketch after this list).
- Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
- Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
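To make the "Dynamic" and "Elegant" points concrete, here is a small sketch of configuration as code: one DAG generated per table from a plain Python list, with a Jinja-templated command ({{ ds }} is Airflow's built-in macro for the run date). The table names and script path are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

for table in ["orders", "customers", "invoices"]:   # placeholder config
    with DAG(
        dag_id=f"load_{table}",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        BashOperator(
            task_id="load",
            # Airflow renders {{ ds }} via Jinja at runtime.
            bash_command=f"/opt/etl/load.sh {table} {{{{ ds }}}}",
        )
    globals()[dag.dag_id] = dag  # register each generated DAG with Airflow
```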