What is Azure HDInsight and what are its top alternatives?
Azure HDInsight is a fully managed cloud-based service from Microsoft that provides Apache Hadoop and Apache Spark clusters. It allows users to process big data workloads in a cost-effective and scalable manner. Key features include support for various big data frameworks, integration with other Azure services, enterprise-grade security, and easy scalability. However, limitations include high costs for large workloads and potential complexity in managing different big data frameworks simultaneously.
Amazon EMR: Amazon EMR is a cloud-based big data platform that utilizes various open-source tools such as Apache Spark, Hadoop, and Hive. Key features include easy setup, integration with other AWS services, and cost-effectiveness. Pros include seamless integration with AWS services, while cons include potential complexity for users not familiar with AWS.
Google Cloud Dataproc: Google Cloud Dataproc is a managed Apache Spark and Hadoop service that runs on Google Cloud Platform. Key features include easy cluster management, autoscaling, and integration with other Google Cloud services. Pros include seamless integration with Google Cloud Platform, while cons include potential higher costs compared to other alternatives.
Cloudera Distribution for Hadoop (CDH): CDH is a distribution of Apache Hadoop and related projects from Cloudera. Key features include comprehensive data management capabilities, enterprise-grade security, and support for various big data frameworks. Pros include extensive support and documentation, while cons include potential higher costs for enterprise deployments.
MapR: MapR is a converged data platform that integrates Hadoop, Spark, and other big data frameworks. Key features include high performance, enterprise-grade reliability, and global data consistency. Pros include faster performance compared to other alternatives, while cons include potential higher costs for large-scale deployments.
IBM BigInsights: IBM BigInsights is an enterprise-grade Hadoop distribution with additional analytics capabilities. Key features include advanced analytics tools, integration with IBM Watson services, and enterprise-grade security. Pros include seamless integration with IBM ecosystem, while cons include potential higher costs for smaller deployments.
Hortonworks Data Platform (HDP): HDP is an open-source distribution of Apache Hadoop from Hortonworks. Key features include comprehensive data management tools, enterprise-grade security, and support for various big data frameworks. Pros include open-source nature, while cons include potential complexity in managing different components.
Databricks: Databricks is a unified data analytics platform that leverages Apache Spark for big data processing. Key features include collaborative notebooks, automated cluster management, and integration with various data sources. Pros include ease of use for data scientists, while cons include potential higher costs for large-scale deployments.
Qubole: Qubole is a cloud-native data platform that simplifies big data processing using Apache Spark, Hadoop, and Presto. Key features include self-service analytics, auto-scaling, and cost optimization. Pros include ease of use for data analysts, while cons include potential limitations in customization compared to other alternatives.
Snowflake: Snowflake is a cloud data platform that offers a data warehouse-as-a-service solution for analytics. Key features include instant elasticity, built-in security, and support for structured and semi-structured data. Pros include easy scalability for varying workloads, while cons include potential limitations for unstructured data processing.
Apache Flink: Apache Flink is an open-source stream processing framework that can also be used for batch processing. Key features include low-latency processing, fault tolerance, and support for event time processing. Pros include high throughput and low latency, while cons include potential complexity in setting up and managing Flink clusters.
Top Alternatives to Azure HDInsight
- Amazon EMR
It is used in a variety of applications, including log analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. ...
- Azure Databricks
Accelerate big data analytics and artificial intelligence (AI) solutions with Azure Databricks, a fast, easy and collaborative Apache Spark–based analytics service. ...
- Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
- Azure Machine Learning
Azure Machine Learning is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. ...
- Azure Data Factory
It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud. ...
- Databricks
Databricks Unified Analytics Platform, from the original creators of Apache Spark™, unifies data science and engineering across the Machine Learning lifecycle from data preparation to experimentation and deployment of ML applications. ...
- JavaScript
JavaScript is most known as the scripting language for Web pages, but used in many non-browser environments as well such as node.js or Apache CouchDB. It is a prototype-based, multi-paradigm scripting language that is dynamic,and supports object-oriented, imperative, and functional programming styles. ...
- Git
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. ...
Azure HDInsight alternatives & related posts
Amazon EMR
- On demand processing power15
- Don't need to maintain Hadoop Cluster yourself12
- Hadoop Tools7
- Elastic6
- Backed by Amazon4
- Flexible3
- Economic - pay as you go, easy to use CLI and SDKs3
- Don't need a dedicated Ops group2
- Massive data handling1
- Great support1
related Amazon EMR posts
I have to build a data processing application with an Apache Beam stack and Apache Flink runner on an Amazon EMR cluster. I saw some instability with the process and EMR clusters that keep going down. Here, the Apache Beam application gets inputs from Kafka and sends the accumulative data streams to another Kafka topic. Any advice on how to make the process more stable?
I use AWS Glue because I thought it was worth all they hype Fall 2018. However, you had to use Python 2.7 with no pandas support, and cold starts lasted as long as 15 minutes. Also, setting up a dev environment for iterative development was near impossible at the time.
It was a terrible experience for me. I recommend using Amazon EMR instead. Even talking with a friend that works at Amazon, they use EMR instead of Glue for internal spark workloads. Just because a company makes something doesn't mean they use that something :/
related Azure Databricks posts
- Great ecosystem39
- One stack to rule them all11
- Great load balancer4
- Amazon aws1
- Java syntax1
related Hadoop posts
The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.
For databases, a custom Hadoop streamer pulled database data and wrote it to S3.
Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )
related Azure Machine Learning posts
related Azure Data Factory posts
Trying to establish a data lake(or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:
- Ingestion->Secure, role-based, self service portal for users to upload data (1a. bonus points if it can preform basic validations/masking)
- Storage->Amazon S3 seems like the cheapest. We probably won't need very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
- Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
- Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
- Processing-> We want to use SAS if at all possible. What will work with SAS code?
- Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
- I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
- An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and self-service gui would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.
- Best Performances on large datasets1
- True lakehouse architecture1
- Scalability1
- Databricks doesn't get access to your data1
- Usage Based Billing1
- Security1
- Data stays in your cloud account1
- Multicloud1
related Databricks posts
From my point of view, both OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with their libraries to use some of OpenRefine features as part of your data pipeline (there are pointers in FAQ), but OpenRefine in general is intended for a single-user local operation.
I can't recommend a particular alternative without better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, consider also Apache Spark.
Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.
I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?
JavaScript
- Can be used on frontend/backend1.7K
- It's everywhere1.5K
- Lots of great frameworks1.2K
- Fast896
- Light weight745
- Flexible425
- You can't get a device today that doesn't run js392
- Non-blocking i/o286
- Ubiquitousness236
- Expressive191
- Extended functionality to web pages55
- Relatively easy language49
- Executed on the client side46
- Relatively fast to the end user30
- Pure Javascript25
- Functional programming21
- Async15
- Full-stack13
- Setup is easy12
- Its everywhere12
- JavaScript is the New PHP11
- Because I love functions11
- Like it or not, JS is part of the web standard10
- Can be used in backend, frontend and DB9
- Expansive community9
- Future Language of The Web9
- Easy9
- No need to use PHP8
- For the good parts8
- Can be used both as frontend and backend as well8
- Everyone use it8
- Most Popular Language in the World8
- Easy to hire developers8
- Love-hate relationship7
- Powerful7
- Photoshop has 3 JS runtimes built in7
- Evolution of C7
- Popularized Class-Less Architecture & Lambdas7
- Agile, packages simple to use7
- Supports lambdas and closures7
- 1.6K Can be used on frontend/backend6
- It's fun6
- Hard not to use6
- Nice6
- Client side JS uses the visitors CPU to save Server Res6
- Versitile6
- It let's me use Babel & Typescript6
- Easy to make something6
- Its fun and fast6
- Can be used on frontend/backend/Mobile/create PRO Ui6
- Function expressions are useful for callbacks5
- What to add5
- Client processing5
- Everywhere5
- Scope manipulation5
- Stockholm Syndrome5
- Promise relationship5
- Clojurescript5
- Because it is so simple and lightweight4
- Only Programming language on browser4
- Hard to learn1
- Test1
- Test21
- Easy to understand1
- Not the best1
- Easy to learn1
- Subskill #41
- Hard 彤0
- A constant moving target, too much churn22
- Horribly inconsistent20
- Javascript is the New PHP15
- No ability to monitor memory utilitization9
- Shows Zero output in case of ANY error8
- Thinks strange results are better than errors7
- Can be ugly6
- No GitHub3
- Slow2
related JavaScript posts
Oof. I have truly hated JavaScript for a long time. Like, for over twenty years now. Like, since the Clinton administration. It's always been a nightmare to deal with all of the aspects of that silly language.
But wowza, things have changed. Tooling is just way, way better. I'm primarily web-oriented, and using React and Apollo together the past few years really opened my eyes to building rich apps. And I deeply apologize for using the phrase rich apps; I don't think I've ever said such Enterprisey words before.
But yeah, things are different now. I still love Rails, and still use it for a lot of apps I build. But it's that silly rich apps phrase that's the problem. Users have way more comprehensive expectations than they did even five years ago, and the JS community does a good job at building tools and tech that tackle the problems of making heavy, complicated UI and frontend work.
Obviously there's a lot of things happening here, so just saying "JavaScript isn't terrible" might encompass a huge amount of libraries and frameworks. But if you're like me, yeah, give things another shot- I'm somehow not hating on JavaScript anymore and... gulp... I kinda love it.
How Uber developed the open source, end-to-end distributed tracing Jaeger , now a CNCF project:
Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.
Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:
https://eng.uber.com/distributed-tracing/
(GitHub Pages : https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)
Bindings/Operator: Python Java Node.js Go C++ Kubernetes JavaScript OpenShift C# Apache Spark
- Distributed version control system1.4K
- Efficient branching and merging1.1K
- Fast959
- Open source845
- Better than svn726
- Great command-line application368
- Simple306
- Free291
- Easy to use232
- Does not require server222
- Distributed27
- Small & Fast22
- Feature based workflow18
- Staging Area15
- Most wide-spread VSC13
- Role-based codelines11
- Disposable Experimentation11
- Frictionless Context Switching7
- Data Assurance6
- Efficient5
- Just awesome4
- Github integration3
- Easy branching and merging3
- Compatible2
- Flexible2
- Possible to lose history and commits2
- Rebase supported natively; reflog; access to plumbing1
- Light1
- Team Integration1
- Fast, scalable, distributed revision control system1
- Easy1
- Flexible, easy, Safe, and fast1
- CLI is great, but the GUI tools are awesome1
- It's what you do1
- Phinx0
- Hard to learn16
- Inconsistent command line interface11
- Easy to lose uncommitted work9
- Worst documentation ever possibly made7
- Awful merge handling5
- Unexistent preventive security flows3
- Rebase hell3
- When --force is disabled, cannot rebase2
- Ironically even die-hard supporters screw up badly2
- Doesn't scale for big data1
related Git posts
Our whole DevOps stack consists of the following tools:
- GitHub (incl. GitHub Pages/Markdown for Documentation, GettingStarted and HowTo's) for collaborative review and code management tool
- Respectively Git as revision control system
- SourceTree as Git GUI
- Visual Studio Code as IDE
- CircleCI for continuous integration (automatize development process)
- Prettier / TSLint / ESLint as code linter
- SonarQube as quality gate
- Docker as container management (incl. Docker Compose for multi-container application management)
- VirtualBox for operating system simulation tests
- Kubernetes as cluster management for docker containers
- Heroku for deploying in test environments
- nginx as web server (preferably used as facade server in production environment)
- SSLMate (using OpenSSL) for certificate management
- Amazon EC2 (incl. Amazon S3) for deploying in stage (production-like) and production environments
- PostgreSQL as preferred database system
- Redis as preferred in-memory database/store (great for caching)
The main reason we have chosen Kubernetes over Docker Swarm is related to the following artifacts:
- Key features: Easy and flexible installation, Clear dashboard, Great scaling operations, Monitoring is an integral part, Great load balancing concepts, Monitors the condition and ensures compensation in the event of failure.
- Applications: An application can be deployed using a combination of pods, deployments, and services (or micro-services).
- Functionality: Kubernetes as a complex installation and setup process, but it not as limited as Docker Swarm.
- Monitoring: It supports multiple versions of logging and monitoring when the services are deployed within the cluster (Elasticsearch/Kibana (ELK), Heapster/Grafana, Sysdig cloud integration).
- Scalability: All-in-one framework for distributed systems.
- Other Benefits: Kubernetes is backed by the Cloud Native Computing Foundation (CNCF), huge community among container orchestration tools, it is an open source and modular tool that works with any OS.
Often enough I have to explain my way of going about setting up a CI/CD pipeline with multiple deployment platforms. Since I am a bit tired of yapping the same every single time, I've decided to write it up and share with the world this way, and send people to read it instead ;). I will explain it on "live-example" of how the Rome got built, basing that current methodology exists only of readme.md and wishes of good luck (as it usually is ;)).
It always starts with an app, whatever it may be and reading the readmes available while Vagrant and VirtualBox is installing and updating. Following that is the first hurdle to go over - convert all the instruction/scripts into Ansible playbook(s), and only stopping when doing a clear vagrant up
or vagrant reload
we will have a fully working environment. As our Vagrant environment is now functional, it's time to break it! This is the moment to look for how things can be done better (too rigid/too lose versioning? Sloppy environment setup?) and replace them with the right way to do stuff, one that won't bite us in the backside. This is the point, and the best opportunity, to upcycle the existing way of doing dev environment to produce a proper, production-grade product.
I should probably digress here for a moment and explain why. I firmly believe that the way you deploy production is the same way you should deploy develop, shy of few debugging-friendly setting. This way you avoid the discrepancy between how production work vs how development works, which almost always causes major pains in the back of the neck, and with use of proper tools should mean no more work for the developers. That's why we start with Vagrant as developer boxes should be as easy as vagrant up
, but the meat of our product lies in Ansible which will do meat of the work and can be applied to almost anything: AWS, bare metal, docker, LXC, in open net, behind vpn - you name it.
We must also give proper consideration to monitoring and logging hoovering at this point. My generic answer here is to grab Elasticsearch, Kibana, and Logstash. While for different use cases there may be better solutions, this one is well battle-tested, performs reasonably and is very easy to scale both vertically (within some limits) and horizontally. Logstash rules are easy to write and are well supported in maintenance through Ansible, which as I've mentioned earlier, are at the very core of things, and creating triggers/reports and alerts based on Elastic and Kibana is generally a breeze, including some quite complex aggregations.
If we are happy with the state of the Ansible it's time to move on and put all those roles and playbooks to work. Namely, we need something to manage our CI/CD pipelines. For me, the choice is obvious: TeamCity. It's modern, robust and unlike most of the light-weight alternatives, it's transparent. What I mean by that is that it doesn't tell you how to do things, doesn't limit your ways to deploy, or test, or package for that matter. Instead, it provides a developer-friendly and rich playground for your pipelines. You can do most the same with Jenkins, but it has a quite dated look and feel to it, while also missing some key functionality that must be brought in via plugins (like quality REST API which comes built-in with TeamCity). It also comes with all the common-handy plugins like Slack or Apache Maven integration.
The exact flow between CI and CD varies too greatly from one application to another to describe, so I will outline a few rules that guide me in it: 1. Make build steps as small as possible. This way when something breaks, we know exactly where, without needing to dig and root around. 2. All security credentials besides development environment must be sources from individual Vault instances. Keys to those containers should exist only on the CI/CD box and accessible by a few people (the less the better). This is pretty self-explanatory, as anything besides dev may contain sensitive data and, at times, be public-facing. Because of that appropriate security must be present. TeamCity shines in this department with excellent secrets-management. 3. Every part of the build chain shall consume and produce artifacts. If it creates nothing, it likely shouldn't be its own build. This way if any issue shows up with any environment or version, all developer has to do it is grab appropriate artifacts to reproduce the issue locally. 4. Deployment builds should be directly tied to specific Git branches/tags. This enables much easier tracking of what caused an issue, including automated identifying and tagging the author (nothing like automated regression testing!).
Speaking of deployments, I generally try to keep it simple but also with a close eye on the wallet. Because of that, I am more than happy with AWS or another cloud provider, but also constantly peeking at the loads and do we get the value of what we are paying for. Often enough the pattern of use is not constantly erratic, but rather has a firm baseline which could be migrated away from the cloud and into bare metal boxes. That is another part where this approach strongly triumphs over the common Docker and CircleCI setup, where you are very much tied in to use cloud providers and getting out is expensive. Here to embrace bare-metal hosting all you need is a help of some container-based self-hosting software, my personal preference is with Proxmox and LXC. Following that all you must write are ansible scripts to manage hardware of Proxmox, similar way as you do for Amazon EC2 (ansible supports both greatly) and you are good to go. One does not exclude another, quite the opposite, as they can live in great synergy and cut your costs dramatically (the heavier your base load, the bigger the savings) while providing production-grade resiliency.