Business Analyst at Amazon · Needs advice on AWS Data Pipeline and AWS Glue

We currently use AWS Data Pipeline to load data from RDS into Redshift, but we are running into a lot of problems: jobs run for many hours and fail frequently with "no space left on device" errors, and the EC2 instance has to be modified every time we run out of space. To get around this, we are exploring AWS Glue. Is it advisable to migrate our ETL to AWS Glue? Any suggestions would be very helpful.

Thanks, Anitha KG

1 upvote·3.3K views
Replies (1)
Data Engineer at Heycar · Recommends AWS Glue

Glue jobs are essentially a serverless version of Spark running on AWS, which means you don't have to size a cluster yourself. The nice part is that you can set up a crawler on top of your RDS database and use the metadata it collects to query RDS from a Glue job, then load the data into Redshift after any transformations. Here is the reference: https://aws.amazon.com/blogs/database/how-to-extract-transform-and-load-data-for-analytic-processing-using-aws-glue-part-2/
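
To make the crawler-then-job flow a bit more concrete, here is a rough sketch of what such a Glue script could look like. Treat it as an illustration under assumptions, not a drop-in job: the catalog database, table, connection name, Redshift target, S3 temp directory, and column list are all made-up placeholders, and `build_mappings` is just a small helper I added for the example. The `awsglue` modules only exist inside the Glue runtime, so they are imported inside `run_job`.

```python
# Hypothetical names: replace with your own catalog database, crawled table,
# Glue connection, Redshift target, and S3 temp directory.
CATALOG_DATABASE = "rds_catalog_db"
CATALOG_TABLE = "orders"
REDSHIFT_CONNECTION = "redshift-connection"
REDSHIFT_DATABASE = "analytics"
REDSHIFT_TABLE = "public.orders"
TEMP_DIR = "s3://example-bucket/glue-temp/"

# Assumed source columns: (source name, source type, target name, target type).
COLUMNS = [
    ("ID", "int", "ID", "int"),
    ("Amount", "double", "Amount", "double"),
]


def build_mappings(columns):
    """Build the mapping list for ApplyMapping, lower-casing target names."""
    return [(src, src_type, dst.lower(), dst_type)
            for src, src_type, dst, dst_type in columns]


def run_job(argv):
    # These imports resolve only inside the Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the RDS table using the metadata the crawler stored in the catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=CATALOG_DATABASE, table_name=CATALOG_TABLE)

    # Example transformation: rename columns before loading.
    mapped = ApplyMapping.apply(frame=source, mappings=build_mappings(COLUMNS))

    # Load into Redshift; Glue stages the data in TEMP_DIR on S3 first.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection=REDSHIFT_CONNECTION,
        connection_options={"dbtable": REDSHIFT_TABLE,
                            "database": REDSHIFT_DATABASE},
        redshift_tmp_dir=TEMP_DIR)
    job.commit()
```

In the actual Glue script you would call `run_job(sys.argv)` at the end (after `import sys`).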

With Data Pipeline you will keep running into the limits of vertically scaling your EC2 machine. If you want to stay on Data Pipeline, another option is to change how you process the data: for example, make chunked requests to RDS, save each result to S3, and only then load it into Redshift. This approach requires much more effort and orchestration skill.
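
For the chunked approach, the extraction step could look roughly like the sketch below. It is only an illustration: it uses `sqlite3` as a stand-in for the RDS connection and a local directory as a stand-in for S3, and the function name `export_in_chunks` and its parameters are my own. The point is the keyset pagination (`WHERE key > last_seen`), which keeps every query and every file bounded by `chunk_size` rather than by the table size.

```python
import csv
import sqlite3
from pathlib import Path


def export_in_chunks(conn, table, key, out_dir, chunk_size=1000):
    """Page through `table` ordered by `key` and write one CSV per chunk.

    `table` and `key` are interpolated into the SQL, so they must be
    trusted identifiers, never user input.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    last_key = None
    while True:
        if last_key is None:
            cur = conn.execute(
                f"SELECT * FROM {table} ORDER BY {key} LIMIT ?",
                (chunk_size,))
        else:
            cur = conn.execute(
                f"SELECT * FROM {table} WHERE {key} > ? "
                f"ORDER BY {key} LIMIT ?",
                (last_key, chunk_size))
        rows = cur.fetchall()
        if not rows:
            break
        header = [col[0] for col in cur.description]
        path = out_dir / f"{table}_{len(paths):05d}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(path)
        # Remember the last key seen so the next query starts after it.
        last_key = rows[-1][header.index(key)]
    return paths
```

Each file returned would then be uploaded to S3 (for example with boto3's `upload_file`) and loaded with a Redshift `COPY` from that S3 prefix. Scheduling, retries, and cleanup of partial loads are the orchestration work mentioned above.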

Anitha Gangadharam

Business Analyst at Amazon