I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but I have not dealt with the scaling factor of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors them from there.
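For what it's worth, the "one task that kicks off the 10k containers" option maps naturally onto a Batch array job: a single `submit_job` call with `arrayProperties` fans out up to 10,000 child containers (Batch's array-size cap), and each child finds its slice of the divvied data via the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. A minimal sketch, assuming the queue and job definition names are placeholders (pass in `boto3.client("batch")` as `batch`):

```python
def array_slice(n_items: int, n_children: int, index: int) -> range:
    """Which items the child with AWS_BATCH_JOB_ARRAY_INDEX == index should process."""
    per_child = -(-n_items // n_children)  # ceiling division
    start = index * per_child
    return range(start, min(start + per_child, n_items))

def submit_array_job(batch, size: int = 10_000) -> str:
    """One submit_job call fans out `size` parallel children (cap is 10,000)."""
    resp = batch.submit_job(
        jobName="divvied-processing",
        jobQueue="my-job-queue",        # placeholder, not a real resource
        jobDefinition="my-job-def",     # placeholder, not a real resource
        arrayProperties={"size": size},
    )
    # Parent job id; children appear as <jobId>:0 .. <jobId>:size-1.
    return resp["jobId"]
```

Usage would be `submit_array_job(boto3.client("batch"))`; taking the client as a parameter keeps the helpers testable without AWS credentials.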
I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
On one side SF can be compared to Airflow: same distributed flow control, but there is a difference: SF is more autonomous. You can specify retry rules and so on, but only before you start the execution; it is not possible to have the same level of manual control as in Airflow. There are also limits on execution size (a Standard workflow's event history is capped at 25,000 entries per execution). Not small, but still restrictive at 10k tasks. I think if you have a complex, not fully automated process, Airflow is still good for you. At your scale you just need to spawn and control Batch, and that fits an Airflow task perfectly. This is my personal opinion.
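To make the "retry rules only before you start execution" point concrete: in Step Functions the retry policy is baked into the state machine definition itself (Amazon States Language), for example on a task that uses the `arn:aws:states:::batch:submitJob.sync` integration. A sketch of such a definition as a Python dict; the job name and ARNs are placeholders:

```python
import json

# Hedged sketch of an ASL definition: submit a Batch job synchronously and
# retry failures declaratively. Once an execution starts, these rules are
# fixed; there is no per-task manual intervention like the Airflow UI offers.
definition = {
    "StartAt": "ProcessChunk",
    "States": {
        "ProcessChunk": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "process-chunk",
                "JobQueue": "my-job-queue",      # placeholder
                "JobDefinition": "my-job-def",   # placeholder
            },
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "End": True,
        }
    },
}

# The JSON form is what you would pass to create_state_machine.
asl_json = json.dumps(definition)
```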
"Airflow is nice since I can look at which tasks failed"
I think you don't need this: AWS Batch itself will give you a list of which jobs failed, and you can see log details for the failing runs in AWS CloudWatch, with no additional "orchestration" component.
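To illustrate, assuming the 10k containers run as one Batch array job: `list_jobs` can filter a parent's children by `jobStatus="FAILED"`, and `describe_jobs` returns each job's CloudWatch log stream name (in the `/aws/batch/job` log group). A sketch that takes the boto3 Batch client as a parameter so it can be stubbed:

```python
def failed_children(batch, array_job_id: str) -> list:
    """All FAILED child jobs of a Batch array job, following pagination."""
    failed, token = [], None
    while True:
        kwargs = {"arrayJobId": array_job_id, "jobStatus": "FAILED"}
        if token:
            kwargs["nextToken"] = token
        page = batch.list_jobs(**kwargs)
        failed.extend(page["jobSummaryList"])
        token = page.get("nextToken")
        if not token:
            return failed

def log_streams(batch, job_ids) -> dict:
    """Map job id -> CloudWatch log stream name (log group /aws/batch/job)."""
    out = {}
    ids = list(job_ids)
    for i in range(0, len(ids), 100):  # describe_jobs accepts at most 100 ids
        for j in batch.describe_jobs(jobs=ids[i:i + 100])["jobs"]:
            out[j["jobId"]] = j["container"].get("logStreamName")
    return out
```

With `batch = boto3.client("batch")`, `failed_children(batch, parent_id)` gives you the failure list, and feeding those ids to `log_streams` points you at the CloudWatch logs to debug before resubmitting.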