Databricks

We are building a cloud-based analytical app. Most of the data for the UI flows from SQL Server into Delta Lake, and then from Delta Lake into Azure Cosmos DB as JSON using Databricks, so that the API can serve it to the front end. Sometimes, while transforming table rows into JSON, we get documents that exceed Cosmos DB's 2 MB item size limit. What is the best solution for replacing Cosmos DB?
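
Roughly, the step where we hit the limit looks like the sketch below (the Delta path and column handling are illustrative, not our exact code):

```python
# Rough sketch only: serialize Delta rows to JSON and flag documents that
# would exceed Cosmos DB's 2 MB item size limit. Path and names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rows = spark.read.format("delta").load("/mnt/delta/ui_dataset")  # hypothetical path

docs = (rows
        .withColumn("json_doc", F.to_json(F.struct(*rows.columns)))
        .withColumn("doc_bytes", F.expr("octet_length(json_doc)")))

COSMOS_ITEM_LIMIT = 2 * 1024 * 1024  # 2 MB

oversized = docs.filter(F.col("doc_bytes") > COSMOS_ITEM_LIMIT)
print(f"{oversized.count()} documents exceed the Cosmos DB item size limit")
```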

4 upvotes·31.4K views
Replies (2)
CTO at BT Créditos

You could probably use Cosmos DB to store metadata and store your big documents in a Storage Account blob container. Then you store the link to each document in Cosmos DB. It's a cheap way of solving this without leaving Azure.
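
A minimal sketch of that pattern with the azure-storage-blob and azure-cosmos Python SDKs (account, database and container names are made up):

```python
# Sketch of the metadata-plus-blob pattern (illustrative names, not a drop-in
# implementation): the full JSON goes to Blob Storage, Cosmos DB keeps only
# metadata and a pointer to the blob.
import json
import uuid

from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
cosmos = CosmosClient("<cosmos-account-uri>", credential="<cosmos-key>")
container = cosmos.get_database_client("analytics").get_container_client("documents")

def save_document(doc: dict, partition_key: str) -> None:
    doc_id = str(uuid.uuid4())
    blob_name = f"{partition_key}/{doc_id}.json"
    payload = json.dumps(doc)

    # Store the full (possibly > 2 MB) document in a Blob container.
    blob_service.get_blob_client("big-documents", blob_name) \
                .upload_blob(payload, overwrite=True)

    # Store only metadata and the blob reference in Cosmos DB.
    container.upsert_item({
        "id": doc_id,
        "partitionKey": partition_key,
        "blobName": blob_name,
        "sizeBytes": len(payload),
    })
```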

3 upvotes·1 comment·6.2K views
Arjun R · June 6th 2022 at 9:39AM

Thanks for the input, Ivan Reche. If we store the big documents in a blob container, how will the Python APIs query them and send them to the UI? And if any updates happen in the UI, the API has to write those changes back to the big documents as a copy.
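
For illustration, one possible shape of that read/update path under the metadata-plus-blob layout suggested above (all names are hypothetical):

```python
# Hypothetical read/update path for the metadata-plus-blob layout (names are
# made up): the API resolves the blob through the Cosmos DB item, returns the
# JSON to the UI, and overwrites the blob when the UI sends changes back.
import json

from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
cosmos = CosmosClient("<cosmos-account-uri>", credential="<cosmos-key>")
container = cosmos.get_database_client("analytics").get_container_client("documents")

def load_document(doc_id: str, partition_key: str) -> dict:
    meta = container.read_item(item=doc_id, partition_key=partition_key)
    blob = blob_service.get_blob_client("big-documents", meta["blobName"])
    return json.loads(blob.download_blob().readall())

def update_document(doc_id: str, partition_key: str, updated: dict) -> None:
    meta = container.read_item(item=doc_id, partition_key=partition_key)
    blob = blob_service.get_blob_client("big-documents", meta["blobName"])
    blob.upload_blob(json.dumps(updated), overwrite=True)  # replaces the stored copy
```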

CTO at Estimator360 Inc

Do you know what the max size of one of your documents might be? Mongo (which you can also use on Azure) allows for larger documents (I think maybe 20MB). That said, I ran into this issue when I was first using Cosmos, and I wound up rethinking the way I was storing documents. I don't know if this is an option for your scenario, but what I ended up doing was breaking my documents up into smaller subdocuments. A rule of thumb I have come to follow: if any property is an array (or can become an array of length N), make that array simply a list of IDs that point to other documents.
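
A rough illustration of that splitting approach in plain Python (the lineItems field and the order/child shapes are hypothetical):

```python
# Rough illustration of the splitting approach (field names are hypothetical):
# a large array on a document becomes a list of IDs, and each element is
# stored as its own small item.
import uuid

def split_document(order: dict) -> tuple[dict, list[dict]]:
    line_items = order.pop("lineItems", [])  # the potentially huge array

    children = []
    for item in line_items:
        children.append({"id": str(uuid.uuid4()), "orderId": order["id"], **item})

    # The parent keeps only the IDs, so its size no longer grows with the array.
    parent = {**order, "lineItemIds": [child["id"] for child in children]}
    return parent, children

# parent and each child would then be written as separate Cosmos DB items.
```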

2 upvotes·1 comment·5K views
Dan Trigwell · August 16th 2023 at 7:59AM

Aerospike might be one to check out. It can store 8 MB objects and provides much better performance and cost-effectiveness compared with Cosmos DB and MongoDB.

Data Engineer at Tata Consultancy Services

I have to collect data from multiple sources and store it in a single cloud location, then perform cleaning and transformation using PySpark and push the end results to other applications such as reporting tools. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?
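
For context, the processing part I have in mind looks roughly like the PySpark sketch below; the open question is which services to put around it for orchestration and storage (paths, tables and credentials are placeholders):

```python
# Minimal PySpark sketch of the processing step (paths, tables and credentials
# are placeholders): land raw data from several sources in one lake location,
# clean and join it, then publish curated results for reporting tools.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Raw zone: files copied here by whatever orchestrator is chosen
# (Azure Data Factory, AWS Glue, etc.).
sales = spark.read.option("header", True).csv("s3://my-lake/raw/sales/")

# A second source pulled directly over JDBC.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://<host>;databaseName=crm")
             .option("dbtable", "dbo.customers")
             .option("user", "<user>")
             .option("password", "<password>")
             .load())

# Cleaning and transformation.
curated = (sales.dropDuplicates(["order_id"])
                .withColumn("amount", F.col("amount").cast("double"))
                .join(customers, "customer_id", "left"))

# Publish where reporting tools can pick the results up.
curated.write.mode("overwrite").format("delta").save("s3://my-lake/curated/sales_enriched/")
```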

4 upvotes·255.6K views