Application and Data / Data Stores / Database Tools
CTO at My Job Glasses·

Hello, For security and strategic reasons, we are migrating our apps from AWS/Google to a cloud provider with more security certifications and fewer functionalities, named Outscale. So far we have been using Google BigQuery as our data warehouse with ELT workflows (using Stitch and dbt ) and we need to migrate our data ecosystem to this new cloud provider.

We are setting up a Kubernetes cluster in our new cloud provider for our apps. Regarding the data warehouse, it's not clear if there are advantages/inconvenients about setting it up on kubernetes (apart from having to create node groups and tolerations with more ram/cpu). Also, we are not sure what's the best Open source or on-premise tool to use. The main requirement is that data must remain in the secure cluster, and no external entity (especially US) can have access to it. We have a dev cluster/environment and a production cluster/environment on this cloud.

Regarding the actual DWH usage - Today we have ~1.5TB in BigQuery in production. We're going to run our initial rests with ~50-100GB of data for our test cluster - Most of our data comes from other databases, so in most cases, we already have replicated sources somewhere, and there are only a handful of collections whose source is directly in the DWH (such as snapshots, some external data we've fetched at some point, google analytics, etc) and needs appropriate level of replication - We are a team of 30-ish people, we do not have critical needs regarding analytics speed, and we do not need real time. We rebuild our DBT models 2-3 times a day and this usually proves enough

Apart from postgreSQL, I haven't really found open-source or on-premise alternatives for setting up a data warehouse, and running transformations with DBT. There is also the question of data ingestion, I've selected Airbyte and @meltano and I have troubles understanding if one of the 2 is better but Airbytes seems to have a bigger community.

What do you suggest regarding the data warehouse, and the ELT workflows ? - Kubernetes or not kubernetes ? - Postgresql or something else ? if postgre, what are the important configs you'd have in mind ? - Airbyte/DBT or something else.

6 upvotes·42.4K views
Replies (1)

With only the provided info to guide the answer, it should be something like: The security posture is not likely to improve. When a new user arrives at AWS/GCP, these things are true: https://aws.amazon.com/compliance/iso-27001-faqs https://cloud.google.com/security/compliance/iso-27001

The only thing that can lessen the effects are our configuration decisions post-signup; this will generally be true wherever you go. So, you're in as good of a position as you're likely to ever be in. In both cases: There is the idea of a "vulnerability mitigation"; meaning: perhaps a security control that is violated but the mitigating factor (whatever that happens to be) makes it a non-issue. It then becomes an exception. I used to handle things like this (POA&M) for the DoD in a past life; https://www.dhs.gov/sites/default/files/publications/4300A-Handbook-Attachment-H-POAM-Guide.pdf (search for "mitigation")

With services you have no control over, like elastic IPs, you request one and you get it: this mechanism will always be ISO-27001-compliant. With services where there is a "shared responsibility", the operator can screw this up and take the service (we'll say managed-node-group) and introduce a security flaw; perhaps that config goes out with an old Ubuntu build for the workers with no security controls in place - it happens. In this case, this is user-error on the part of the operator. They simply need to follow instructions and use the amazon-linux nodes, that are configured by AWS/Red Hat to be secure = back in compliance.

If the move from one cloud to another is just to check certification boxes then the issue seems to be business - not technology. If security is taken seriously (via design) your company can be secure at any cloud.

So, in effect, the issue won't really be addressed - it will only get moved to the new cloud. That stuff aside...

If you're managing high volumes of data, then Kafka is the answer to 9 out of 10 data problems. You can ingest as much data as your nodes can handle; it's all in memory so those are the types of nodes that should be selected. After that, Kafka can 1) accept the inbound data 2) process it with its ETL facility 3) pass that processed data to postgres for long-term storage It can be (roughly) that simple.

In fact, one of the guys I work with pulls data from (example) graph db, into kafka, transforms it and stores the resultant dataset in a relational-db. It can pull data from any data source, transform it and publish it to any subscriber (another data source). In this way it could become the sun in the middle of your data solar system. And runs brilliantly on Kubernetes: https://strimzi.io

Put in the simplest terms: source:any data-type -> ETL -> destination:any other data type.

I've never been a fan of storing data on Kubernetes. I know they say you can do it - it will never be as stable as something like RDS/Cloud_SQL - it can't be - because it's managed by humans with lax discipline. RDS, for example, doesn't get touched unless someone triggers a highly-polished upgrade program that has likely gone through years of development. Nothing you can do on Kubernetes will replace that time investment. Just use the Cloud's data storage mechanisms and you'll be able to sleep at night. Else, one false move by an operator and you're restoring from backups, with endless testing - and you'll never really know what needle was lost from the haystack.

2 upvotes·7.4K views
Shared insights

I used dbt over manually setting up python wrappers around SQL scripts because it makes managing transformations within Google BigQuery much easier. This saves future Sung dozens of hours maintaining plumbing code to run a couple SQL queries. Check out my tutorial in the link!

I haven't seen any other tool make it as easy to run dependent SQL DAGs directly in a data warehouse.

GitHub - sungchun12/dbt_bigquery_example: dbt(data build tool) tutorial on bigquery with extensive NOTES (github.com)
6 upvotes·50.4K views