Databricks - Reviews | Why developers use Databricks.

Jun 3, 2022

Needs advice

We are building a cloud-based analytical app. Most of the data for the UI is moved from SQL Server to Delta Lake, and then from Delta Lake to Azure Cosmos DB as JSON using Databricks, so that the API can send it to the front end. Sometimes we get larger documents when transforming table rows into JSON, and they exceed Cosmos DB's 2 MB document size limit. What is the best solution for replacing Cosmos DB?

4 upvotes·36.1K views

Replies (2)

Ivan Reche

CTO at BT Créditos·Jun 5, 2022

You could probably use CosmosDB to store metadata and then store your big documents in a Storage Account Blob Container. Then, you store the link for the documents in CosmosDB. It's a cheap way of solving this without leaving Azure.
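A minimal sketch of that pointer pattern, with assumptions stated up front: `metadata_store` and `blob_store` are hypothetical in-memory dictionaries standing in for the Cosmos DB container and the Blob container, and the function names are made up for illustration. A real implementation would swap the dictionary operations for the corresponding Azure SDK calls. The size check is also simplified, since it ignores the metadata fields that count toward the real item size.

```python
import json
import uuid

# The 2 MB document limit mentioned in the question.
MAX_INLINE_BYTES = 2 * 1024 * 1024

# Hypothetical in-memory stand-ins for the two Azure services.
metadata_store = {}  # plays the role of the Cosmos DB container
blob_store = {}      # plays the role of the Blob container


def save_document(doc_id, payload):
    """Store small payloads inline; store large ones as a blob plus a pointer."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        record = {"id": doc_id, "inline": payload}
    else:
        blob_name = f"{doc_id}/{uuid.uuid4()}.json"
        blob_store[blob_name] = body
        record = {"id": doc_id, "blob_name": blob_name, "size_bytes": len(body)}
    metadata_store[doc_id] = record
    return record


def load_document(doc_id):
    """What the API would call: follow the pointer if the payload is not inline."""
    record = metadata_store[doc_id]
    if "inline" in record:
        return record["inline"]
    return json.loads(blob_store[record["blob_name"]].decode("utf-8"))


def update_document(doc_id, changes):
    """Apply UI changes, write a new copy, then drop the superseded blob."""
    payload = load_document(doc_id)
    payload.update(changes)
    old_blob = metadata_store[doc_id].get("blob_name")
    record = save_document(doc_id, payload)
    if old_blob is not None and old_blob != record.get("blob_name"):
        blob_store.pop(old_blob, None)
    return record
```

Writing the updated payload under a new blob name before switching the pointer means a reader never sees a half-written document. The cost is one extra delete per update.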

4 upvotes·1 comment·6.3K views

Arjun R

June 6th 2022 at 9:39AM

Thanks for the input, Ivan Reche. If we store the big documents in a blob container, how can the Python APIs query them and send them to the UI? And if any updates happen in the UI, the API has to write those changes back to the big documents as a copy.

Chris Spanellis

CTO at Estimator360 Inc·Jun 13, 2022

Do you know what the max size of one of your documents might be? Mongo (which you can also use on Azure) allows for larger documents (I think the limit is 16 MB). With that said, I ran into this issue when I was first using Cosmos, and I wound up rethinking the way I was storing documents. I don't know if this is an option for your scenario, but what I ended up doing was breaking my documents up into smaller subdocuments. A rule I have come to follow is that if any property is an array (or at least can be an array with a length of N), make that array simply a list of IDs that point to other documents.
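The array-to-IDs idea can be sketched roughly as below. This is an illustration under assumptions rather than the approach actually used: `split_document` and `join_document` are hypothetical helper names, the `parent_id` and `type` fields are invented for the example, and every array item is assumed to be a dictionary.

```python
import uuid


def split_document(doc, array_fields):
    """Replace each listed array with a list of IDs and return the items separately."""
    parent = dict(doc)  # shallow copy so the caller's document is left untouched
    children = []
    for field in array_fields:
        ids = []
        for item in doc.get(field, []):
            child = dict(item)
            child.setdefault("id", str(uuid.uuid4()))  # keep an existing ID if present
            child["parent_id"] = doc["id"]
            child["type"] = field
            children.append(child)
            ids.append(child["id"])
        parent[field] = ids
    return parent, children


def join_document(parent, children, array_fields):
    """Rebuild the full document from the parent and its child documents."""
    by_id = {child["id"]: child for child in children}
    full = dict(parent)
    for field in array_fields:
        full[field] = [by_id[child_id] for child_id in parent[field]]
    return full
```

Each child is then stored as its own document, so no single document grows with the length of the array. The API reassembles the full view only when the UI needs it.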

2 upvotes·1 comment·5K views

Dan Trigwell

August 16th 2023 at 7:59AM

Aerospike might be one to check out. It can store 8 MB objects and provides much better performance and cost-effectiveness compared with Cosmos and Mongo.

Vamshi Krishna

Data Engineer at Tata Consultancy Services·May 29, 2020

Needs advice

AWS Data Pipeline, AWS Glue, and Azure Data Factory

I have to collect data from multiple sources and store it in a single cloud location, then clean and transform it using PySpark and push the end results to other applications such as reporting tools. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

5 upvotes·268.3K views