Shared insights
Apache Spark

I use Apache Spark because it is THE framework for big data processing, everywhere from big tech to startups. It runs on pretty much any platform. It's open source, and there is a lot of community support and plenty of code samples to draw from.

The Python API is good for low- to medium-complexity transformations, but most people recommend starting with Scala or Java to use Spark's full capabilities.

It comes with quite a learning curve, since you have to make sense of how data shuffles between nodes, but it's worth it for running large-scale ETL.
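To make the shuffle idea a bit more concrete, here is a toy sketch in plain Python. It is not Spark code and not how Spark implements shuffles internally; it only assumes the general idea that records are routed to a partition based on a hash of their key, so all values for one key end up in the same place before they are aggregated. The names `partition_for`, `shuffle_by_key`, `sum_by_key`, and the sample `nodes` data are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Use a stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(map_partitions, num_partitions):
    """Route every (key, value) record to the partition chosen by its key."""
    reduce_partitions = [[] for _ in range(num_partitions)]
    for records in map_partitions:
        for key, value in records:
            target = partition_for(key, num_partitions)
            reduce_partitions[target].append((key, value))
    return reduce_partitions


def sum_by_key(map_partitions, num_partitions):
    """Aggregate per key after the shuffle; each partition is summed on its own."""
    totals = {}
    for records in shuffle_by_key(map_partitions, num_partitions):
        local = defaultdict(int)
        for key, value in records:
            local[key] += value
        totals.update(local)
    return totals


# Hypothetical input: three "nodes", each holding a slice of the data.
nodes = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 4)],
    [("de", 5), ("fr", 1)],
]

result = sum_by_key(nodes, num_partitions=2)
print(result)
```

The point of the sketch is that every record may have to move to a different partition before the aggregation can run, which is why shuffle-heavy steps such as grouping and joining tend to dominate the cost of a large job.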

Also, keep in mind that the streaming and batch frameworks are not unified, so you'll have to learn them both separately.

Sung Won Chung