
Context: I wanted to create an end-to-end IoT data pipeline simulation in Google Cloud IoT Core and other GCP services. I had never touched Terraform meaningfully until working on this project, and it's been one of the best explorations of my development career. The documentation and syntax are incredibly human-readable and friendly. I'm used to building infrastructure through the Google APIs via Python, but I'm so glad past Sung did not make that decision. I was tempted to use Google Cloud Deployment Manager, but the templates were a bit convoluted on first impression. I'm glad past Sung did not make that decision either.

Solution: Leveraging Google Cloud Build, Google Cloud Run, Google Cloud Bigtable, Google BigQuery, Google Cloud Storage, and Google Compute Engine, along with some other fun tools, I can deploy over 40 GCP resources using Terraform!
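To give a flavor of what that looks like, here's a minimal Terraform sketch of the kind of resources involved (the project id and resource names are hypothetical placeholders; the real repo defines many more resources):

```hcl
provider "google" {
  project = "my-iot-demo" # assumed project id
  region  = "us-central1"
}

# A Cloud Storage bucket for raw landing data
resource "google_storage_bucket" "raw_data" {
  name     = "my-iot-demo-raw-data"
  location = "US"
}

# A BigQuery dataset for downstream analytics
resource "google_bigquery_dataset" "iot_analytics" {
  dataset_id = "iot_analytics"
  location   = "US"
}
```

From there, `terraform plan` and `terraform apply` handle the diffing and deployment across all the declared resources.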

Check Out My Architecture: CLICK ME

Check out the GitHub repo attached

GitHub - sungchun12/iot-python-webapp: Live, real-time dashboard in a serverless docker web app, and deployed via terraform with a built-in CICD trigger-See Mock Website (github.com)
Franco Austin
November 22nd 2019 at 10:54AM

Really like the architectural artefact, what did you use to produce it?

Sung Won Chung
November 24th 2019 at 9:57PM

I used draw.io. Free and easy to use!

https://www.draw.io/

SANJEEV TOORA (UK)
January 29th 2020 at 8:55AM

Architecture looks great! Just out of curiosity, did you have unit and integration testing components for Terraform e.g. Terratest, Kitchen Terraform, Sentinel?

Sung Won Chung
January 29th 2020 at 3:17PM

I did NOT. This was to demo the mechanics, not something for robust production. Would love it if you made a pull request for some of those tests ;)


I use GitHub because it's the coolest kid on the block for open source. Searching for repos you need/want is easy.

Especially with the Apache Foundation moving their workloads to them, unlimited private repos, and a package registry on the way, they are becoming the one-stop shop for open source needs.

I'm curious to see how GitHub Sponsors (Patreon for developers) plays out, and what it'll do for open source. Hopefully, they design it in a way where it's not abused by big tech to "plant" developers that look like they're building open source when they're actually building proprietary tools.



I use AWS Lambda because it is the most mature of the major cloud platforms for serverless functions. The fact that you can add VPC configs at the start is huge from a security perspective. However, it does take a lot of work to configure the Amazon VPC to work with AWS Secrets Manager and Lambda. It's also nice because it works so well with Amazon API Gateway.

I typically use it to connect with databases to insert and extract information for downstream analytics.

I won't be surprised if one day the majority of workloads run on this service. Not having to manage and maintain infrastructure is truly a blessing.
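A minimal sketch of the kind of handler I mean, assuming an API Gateway proxy event (all names here are illustrative; a real function would also open a database connection with credentials fetched from Secrets Manager):

```python
import json


def handler(event, context):
    """Minimal Lambda handler for an API Gateway proxy event.

    In a real deployment this would insert/extract rows from a
    database; here it just parses the payload and echoes it back
    in the API Gateway response shape.
    """
    body = json.loads(event.get("body") or "{}")
    record_id = body.get("id", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received": record_id}),
    }
```

API Gateway takes care of routing the HTTP request into `event` and translating the returned dict back into an HTTP response.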

Mike Harvey
January 13th 2021 at 5:44PM

True, yet Lambda does have its drawbacks in terms of cold start latency and other arguable deficits. IMO, there is no perfect solution for serverless compute, and other offerings have their own value propositions which should be considered based on need. On this subject though, Secrets Manager and even Parameter Store are great solutions that centralize and eliminate infrastructure (like many AWS offerings). The integration hurdles are relatively the same for any serverless compute environment - you need to integrate with these as 3rd-party services to take advantage of them, and typically in the code.


I use Python because it is one of the most versatile and easy to read programming languages. The open source community is vibrant and there are so many tutorials and Medium blogs it can be overwhelming, and that's a good problem to have!

I primarily use it for automating backend infrastructure tasks, data exploration via Jupyter, and data engineering development. It's great to maintain most of my stack in one language for consistency.

HOWEVER, when it comes to scaling data engineering workloads, performance degrades significantly compared to other languages like Java and Scala. You'll notice that most of the big tech companies use Scala or Java for Spark because the Python API is still a second-class citizen in new releases.

ANOTHER HOWEVER, I'm excited for the future of parallelism in Python and how that may replace complex Spark workloads. It's still young, but growing: Ray
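Ray aside, the stdlib already gives a taste of this. A minimal sketch of fanning a per-row transformation out across a worker pool (illustrative stand-in, not the Ray API):

```python
from concurrent.futures import ThreadPoolExecutor


def transform(row: int) -> int:
    # Stand-in for a per-row transformation
    return row * row


def parallel_transform(rows, workers: int = 4):
    # Fan the rows out across a worker pool, preserving input order.
    # For CPU-bound work you'd reach for ProcessPoolExecutor (or Ray)
    # instead, since Python threads share the GIL.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, rows))
```

Ray's pitch is essentially this pattern scaled past one machine, with scheduling and fault tolerance handled for you.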


I used dbt over manually setting up Python wrappers around SQL scripts because it makes managing transformations within Google BigQuery much easier. This saves future Sung dozens of hours maintaining plumbing code to run a couple of SQL queries. Check out my tutorial in the link!

I haven't seen any other tool make it as easy to run dependent SQL DAGs directly in a data warehouse.
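For contrast, the hand-rolled plumbing dbt saves me from might look like this (a hypothetical sketch: resolving the run order of dependent SQL models before executing them, the ordering dbt derives automatically from `ref()` calls):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical DAG: each model lists the models it depends on
MODEL_DEPS = {
    "stg_events": [],
    "stg_devices": [],
    "fct_device_events": ["stg_events", "stg_devices"],
    "daily_rollup": ["fct_device_events"],
}


def run_order(deps):
    # Return models in an order where dependencies always run first
    return list(TopologicalSorter(deps).static_order())


def run_models(deps, execute):
    # `execute` would submit the model's SQL to BigQuery; injected
    # here so the plumbing stays testable without a warehouse
    for model in run_order(deps):
        execute(model)
```

And that's before adding retries, logging, incremental builds, or docs - all of which dbt ships out of the box.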

GitHub - sungchun12/dbt_bigquery_example: dbt(data build tool) tutorial on bigquery with extensive NOTES (github.com)

I use Apache Spark because it is THE framework for big data processing, from big tech to startups. It can be run on pretty much any platform, it's open source, and there's lots of community support and code samples to draw from.

The Python API is good for low- to mid-level transformations, but most recommend starting with Scala/Java to use Spark's full capabilities.

It comes with quite a learning curve to make sense of how data shuffles across different nodes, but it's worth it for running large-scale ETL.

Also, keep in mind the streaming and batch frameworks are not unified, so you'll have to learn them both separately.
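To make the shuffle point concrete, here's a toy single-process sketch of the map/shuffle/reduce pattern Spark distributes across nodes (illustrative only, not the Spark API):

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit (key, 1) pairs, like flatMap(...).map(lambda w: (w, 1))
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    # Shuffle: group values by key. On a real cluster this is the
    # expensive step, since pairs move between nodes over the network
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine each key's values, like reduceByKey(add)
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Once you internalize that every wide transformation triggers a shuffle like this, a lot of Spark's performance behavior starts making sense.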


I use Google BigQuery because it makes it super easy to query and store data for analytics workloads. If you're using GCP, you're likely using BigQuery. However, running data viz tools directly connected to BigQuery will run pretty slow. They recently announced BI Engine which will hopefully compete well against big players like Snowflake when it comes to concurrency.

What's nice too is that it has SQL-based ML tools, and it has great GIS support!
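The SQL-based ML bit means you can train and score a model without leaving the query editor. A hypothetical sketch (dataset, table, and column names are made up for illustration):

```sql
-- Train a logistic regression entirely in SQL with BigQuery ML
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT plan_type, monthly_events, churned
FROM `my_dataset.customers`;

-- Then score new rows with ML.PREDICT
SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.churn_model`,
  (SELECT plan_type, monthly_events FROM `my_dataset.new_customers`));
```

No model servers, no export/import dance - the predictions come back as an ordinary result set.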


I use Google Cloud Build because it's my first foray into the CICD world (loving it so far), and I wanted to work with something GCP-native to avoid giving permissions to other SaaS tools like CircleCI and Travis CI.

I really like it because it's free for the first 120 minutes, and it's one of the few CICD tools that enterprises are open to using since it's contained within GCP.

One of the unique things is that it has the Kaniko cache, which speeds up builds by caching intermediate layers of the docker image instead of rebuilding the full thing from scratch. Helpful when you're installing just a few additional dependencies.

Feel free to check out an example: Cloudbuild Example
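A sketch of what the Kaniko step looks like in a `cloudbuild.yaml` (the image name is a placeholder; `$PROJECT_ID` and `$SHORT_SHA` are Cloud Build's built-in substitutions):

```yaml
steps:
  # Kaniko builds the image and caches intermediate layers in the
  # registry, so unchanged layers are reused on the next build
  - name: 'gcr.io/kaniko-project/executor:latest'
    args:
      - '--destination=gcr.io/$PROJECT_ID/my-app:$SHORT_SHA'
      - '--cache=true'
      - '--cache-ttl=6h'
```

If only your last `pip install` layer changed, only that layer gets rebuilt.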


I use Amazon Athena because, similar to Google BigQuery, you can store and query data easily. Especially since you can define data schema in the Glue Data Catalog, there's a central way to define data models.

However, I would not recommend it for batch jobs. I typically use it to check intermediary datasets in data engineering workloads. It's good for getting a look and feel of the data along its ETL journey.
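A hypothetical sketch of that spot-check workflow (the database, table, bucket, and columns are placeholders): register the files already sitting in S3 as an external table, then query them in place.

```sql
-- Register staged files in the Glue Data Catalog; no data is moved
CREATE EXTERNAL TABLE IF NOT EXISTS etl_debug.stg_events (
  device_id string,
  event_ts  timestamp,
  payload   string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/staging/events/';

-- Spot-check the intermediate dataset along its ETL journey
SELECT device_id, count(*) AS events
FROM etl_debug.stg_events
GROUP BY device_id
LIMIT 10;
```

Because the schema lands in the Glue Data Catalog, other services (Glue jobs, Redshift Spectrum, EMR) see the same table definition.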
