Currently, we are using AWS Data Pipeline to load data from RDS into Redshift, but we are facing a lot of issues: jobs run for long hours and fail frequently with "no space left on device" errors. Also, the EC2 instance needs to be modified whenever we run out of space. To overcome this, we are exploring AWS Glue. Is it advisable to migrate our ETL to AWS Glue? Any suggestions would be very helpful.
Thanks, Anitha KG
Glue jobs are basically a serverless version of Spark running on AWS, which means you don't have to size a cluster yourself. The cool part is that you can set up a crawler on top of your RDS database, use the metadata it collects to query RDS from a Glue job, and then load the data into Redshift after some transformation. Here is the reference: https://aws.amazon.com/blogs/database/how-to-extract-transform-and-load-data-for-analytic-processing-using-aws-glue-part-2/
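To make the crawler, transform, load flow a bit more concrete, here is a rough sketch of what such a Glue job script could look like. Every name in it (the catalog database and table, the Redshift connection, the column names) is hypothetical and has to be replaced with whatever your crawler and Glue connections are actually called. It also assumes the job is configured with a temporary S3 directory so that `TempDir` is passed in as a job argument.

```python
# Hypothetical names: replace with the ones from your own crawler and connections.
CATALOG_DATABASE = "rds_catalog"        # Data Catalog database filled by the crawler
CATALOG_TABLE = "mydb_public_orders"    # catalog table created for the RDS table
REDSHIFT_CONNECTION = "redshift-conn"   # Glue connection pointing at the cluster
REDSHIFT_DATABASE = "analytics"
REDSHIFT_TABLE = "public.orders"


def column_mappings():
    """(source column, source type, target column, target type) tuples for ApplyMapping."""
    return [
        ("id", "int", "order_id", "int"),
        ("created_at", "timestamp", "created_at", "timestamp"),
        ("amount", "double", "amount", "double"),
    ]


def run_job(argv):
    # Imported here because the awsglue library only exists inside the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME", "TempDir"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the RDS table through the metadata the crawler stored in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=CATALOG_DATABASE, table_name=CATALOG_TABLE
    )

    # Rename and cast columns before loading.
    mapped = ApplyMapping.apply(frame=source, mappings=column_mappings())

    # Glue stages the rows under TempDir in S3 and loads them into Redshift from there.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection=REDSHIFT_CONNECTION,
        connection_options={"dbtable": REDSHIFT_TABLE, "database": REDSHIFT_DATABASE},
        redshift_tmp_dir=args["TempDir"],
    )
    job.commit()
```

In an actual Glue script you would add `import sys` and call `run_job(sys.argv)` at the bottom. The call is left out here so the snippet can be read and imported outside of Glue.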
With Data Pipeline you will keep running into the limits of vertically scaling your EC2 machine. Another solution, if you want to stay on Data Pipeline, is to change how you process the data: for example, make chunked requests to RDS, save the results to S3, and only then load them into Redshift. This solution requires much more effort and orchestration skill.
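As an illustration of the chunked approach, here is a minimal sketch of extracting a table in fixed-size pieces so that no single query result or file has to fit on the instance at once. To keep it runnable anywhere, an in-memory SQLite database stands in for RDS and a local temp directory stands in for S3; the function names, table, and columns are made up for the example. Against a real RDS database you would use its own driver (whose placeholder syntax may be `%s` rather than `?`), upload each file to S3, and run a Redshift `COPY` afterwards.

```python
import csv
import os
import sqlite3
import tempfile


def export_in_chunks(conn, table, key, out_dir, chunk_size=1000):
    """Dump `table` to numbered CSV files of at most `chunk_size` rows each.

    Uses keyset pagination (WHERE key > last_seen ORDER BY key LIMIT n), so no
    single query returns the whole table. Assumes `key` is a unique, sortable
    column and that `table` and `key` are trusted identifiers.
    """
    paths = []
    last_key = None
    while True:
        if last_key is None:
            cur = conn.execute(
                f"SELECT * FROM {table} ORDER BY {key} LIMIT ?", (chunk_size,)
            )
        else:
            cur = conn.execute(
                f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
                (last_key, chunk_size),
            )
        rows = cur.fetchall()
        if not rows:
            break
        columns = [d[0] for d in cur.description]
        path = os.path.join(out_dir, f"{table}_{len(paths):05d}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)
            writer.writerows(rows)
        paths.append(path)
        # Remember where this chunk ended so the next query starts after it.
        last_key = rows[-1][columns.index(key)]
    return paths


def demo():
    # SQLite stand-in for the RDS source table (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 26)]
    )
    out_dir = tempfile.mkdtemp()  # stand-in for an S3 prefix
    return export_in_chunks(conn, "orders", "id", out_dir, chunk_size=10)
```

Calling `demo()` splits the 25 sample rows into three files of 10, 10, and 5 rows. The orchestration effort mentioned above comes from everything around this loop: retries, tracking which chunks were already uploaded, and triggering the Redshift load once all of them are in S3.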