What is Spring Batch and what are its top alternatives?
Spring Batch is a lightweight, comprehensive batch framework designed to facilitate the development of robust batch applications. It offers key features such as transaction management, job processing, job execution flow, and resource management. However, Spring Batch may have limitations in terms of complex job requirements and debugging capabilities.
- Apache Beam: Apache Beam is a unified programming model for both batch and streaming data processing. It supports multiple execution engines and offers rich features for scalable and fault-tolerant processing. Pros include flexibility in choosing execution engine, while cons may include a steeper learning curve compared to Spring Batch.
- Apache Storm: Apache Storm is a real-time computation system designed for processing large volumes of data with low latency. It provides fault-tolerance and scalability for continuous data processing. Pros include real-time processing capabilities, while cons may include a focus on streaming rather than batch processing.
- Apache Flink: Apache Flink is a powerful and scalable stream processing framework that also supports batch processing. It offers low-latency and high-throughput processing capabilities with efficient fault-tolerance mechanisms. Pros include unified batch and stream processing, while cons may include complexity for simple batch jobs.
- Spring Cloud Data Flow: Spring Cloud Data Flow is a cloud-native toolkit for building and deploying data microservices on modern runtime platforms. It provides a unified interface for composing and orchestrating data pipelines. Pros include cloud-native approach, while cons may include a potentially steep learning curve.
- Airflow: Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. It allows the creation of complex workflows with dependencies and triggers. Pros include rich DAG functionalities, while cons may include a more complex setup compared to Spring Batch.
- Celery: Celery is a distributed task queue system for message passing between processes. It supports both real-time and batch processing tasks with flexible scheduling and monitoring capabilities. Pros include distributed task execution, while cons may include a steeper learning curve for beginners.
- AWS Glue: AWS Glue is a fully managed extract, transform, and load (ETL) service for processing and transforming data at scale. It offers serverless data integration with built-in automation features. Pros include serverless processing, while cons may include potential vendor lock-in.
- Google Cloud Dataflow: Google Cloud Dataflow is a fully managed service for executing a wide range of data processing patterns such as ETL, batch computation, and real-time analysis. It offers scalability, monitoring, and integration with other Google Cloud services. Pros include seamless integration with Google Cloud ecosystem, while cons may include potential cost considerations.
- Luigi: Luigi is a Python-based dependency framework for defining and running complex pipelines of batch jobs. It provides tooling for building data workflows with support for task dependencies and scheduling. Pros include simplicity for defining dependencies, while cons may include a focus on Python-based workflows.
- Talend Open Studio: Talend Open Studio is an open-source data integration tool for building and deploying data pipelines. It offers a visual interface for designing workflows and supports batch and real-time processing. Pros include a user-friendly visual interface, while cons may include potential limitations in advanced data processing functionalities.
Top Alternatives to Spring Batch
- Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. ...
- Talend
It is an open source software integration platform helps you in effortlessly turning data into business insights. It uses native code generation that lets you run your data pipelines seamlessly across all cloud providers and get optimized performance on all platforms. ...
- Spring Boot
Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can "just run". We take an opinionated view of the Spring platform and third-party libraries so you can get started with minimum fuss. Most Spring Boot applications need very little Spring configuration. ...
- Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. ...
- Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. ...
- AWS Batch
It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. ...
- Node.js
Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices. ...
- Django
Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. ...
Spring Batch alternatives & related posts
- Great ecosystem39
- One stack to rule them all11
- Great load balancer4
- Amazon aws1
- Java syntax1
related Hadoop posts
The early data ingestion pipeline at Pinterest used Kafka as the central message transporter, with the app servers writing messages directly to Kafka, which then uploaded log files to S3.
For databases, a custom Hadoop streamer pulled database data and wrote it to S3.
Challenges cited for this infrastructure included high operational overhead, as well as potential data loss occurring when Kafka broker outages led to an overflow of in-memory message buffering.
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )
related Talend posts
Spring Boot
- Powerful and handy149
- Easy setup134
- Java128
- Spring90
- Fast85
- Extensible46
- Lots of "off the shelf" functionalities37
- Cloud Solid32
- Caches well26
- Productive24
- Many receipes around for obscure features24
- Modular23
- Integrations with most other Java frameworks23
- Spring ecosystem is great22
- Auto-configuration21
- Fast Performance With Microservices21
- Community18
- Easy setup, Community Support, Solid for ERP apps17
- One-stop shop15
- Easy to parallelize14
- Cross-platform14
- Easy setup, good for build erp systems, well documented13
- Powerful 3rd party libraries and frameworks13
- Easy setup, Git Integration12
- It's so easier to start a project on spring5
- Kotlin4
- Microservice and Reactive Programming1
- The ability to integrate with the open source ecosystem1
- Heavy weight23
- Annotation ceremony18
- Java13
- Many config files needed11
- Reactive5
- Excellent tools for cloud hosting, since 5.x4
- Java 😒😒1
related Spring Boot posts
We are in the process of building a modern content platform to deliver our content through various channels. We decided to go with Microservices architecture as we wanted scale. Microservice architecture style is an approach to developing an application as a suite of small independently deployable services built around specific business capabilities. You can gain modularity, extensive parallelism and cost-effective scaling by deploying services across many distributed servers. Microservices modularity facilitates independent updates/deployments, and helps to avoid single point of failure, which can help prevent large-scale outages. We also decided to use Event Driven Architecture pattern which is a popular distributed asynchronous architecture pattern used to produce highly scalable applications. The event-driven architecture is made up of highly decoupled, single-purpose event processing components that asynchronously receive and process events.
To build our #Backend capabilities we decided to use the following: 1. #Microservices - Java with Spring Boot , Node.js with ExpressJS and Python with Flask 2. #Eventsourcingframework - Amazon Kinesis , Amazon Kinesis Firehose , Amazon SNS , Amazon SQS, AWS Lambda 3. #Data - Amazon RDS , Amazon DynamoDB , Amazon S3 , MongoDB Atlas
To build #Webapps we decided to use Angular 2 with RxJS
#Devops - GitHub , Travis CI , Terraform , Docker , Serverless
Is learning Spring and Spring Boot for web apps back-end development is still relevant in 2021? Feel free to share your views with comparison to Django/Node.js/ ExpressJS or other frameworks.
Please share some good beginner resources to start learning about spring/spring boot framework to build the web apps.
- Open-source61
- Fast and Flexible48
- One platform for every big data problem8
- Great for distributed SQL like applications8
- Easy to install and to use6
- Works well for most Datascience usecases3
- Interactive Query2
- Machine learning libratimery, Streaming in real2
- In memory Computation2
- Speed4
related Apache Spark posts
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop :
Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse to any sink leveraging the use of Apache Spark . The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:
https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
(Direct GitHub repo: https://github.com/uber/marmaray Kafka Kafka Manager )
- High-throughput126
- Distributed119
- Scalable92
- High-Performance86
- Durable66
- Publish-Subscribe38
- Simple-to-use19
- Open source18
- Written in Scala and java. Runs on JVM12
- Message broker + Streaming system9
- KSQL4
- Avro schema integration4
- Robust4
- Suport Multiple clients3
- Extremely good parallelism constructs2
- Partioned, replayable log2
- Simple publisher / multi-subscriber model1
- Fun1
- Flexible1
- Non-Java clients are second-class citizens32
- Needs Zookeeper29
- Operational difficulties9
- Terrible Packaging5
related Kafka posts
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
As we've evolved or added additional infrastructure to our stack, we've biased towards managed services. Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.
We also use managed Amazon ElastiCache instances instead of spinning up Amazon EC2 instances to run Redis workloads, as well as shifting to Amazon Kinesis instead of Kafka.
- Containerized3
- Scalable3
- More overhead than lambda3
- Image management1
related AWS Batch posts
Node.js
- Npm1.4K
- Javascript1.3K
- Great libraries1.1K
- High-performance1K
- Open source805
- Great for apis486
- Asynchronous477
- Great community423
- Great for realtime apps390
- Great for command line utilities296
- Websockets84
- Node Modules83
- Uber Simple69
- Great modularity59
- Allows us to reuse code in the frontend58
- Easy to start42
- Great for Data Streaming35
- Realtime32
- Awesome28
- Non blocking IO25
- Can be used as a proxy18
- High performance, open source, scalable17
- Non-blocking and modular16
- Easy and Fun15
- Easy and powerful14
- Future of BackEnd13
- Same lang as AngularJS13
- Fullstack12
- Fast11
- Scalability10
- Cross platform10
- Simple9
- Mean Stack8
- Great for webapps7
- Easy concurrency7
- Typescript6
- Fast, simple code and async6
- React6
- Friendly6
- Control everything5
- Its amazingly fast and scalable5
- Easy to use and fast and goes well with JSONdb's5
- Scalable5
- Great speed5
- Fast development5
- It's fast4
- Easy to use4
- Isomorphic coolness4
- Great community3
- Not Python3
- Sooper easy for the Backend connectivity3
- TypeScript Support3
- Blazing fast3
- Performant and fast prototyping3
- Easy to learn3
- Easy3
- Scales, fast, simple, great community, npm, express3
- One language, end-to-end3
- Less boilerplate code3
- Npm i ape-updating2
- Event Driven2
- Lovely2
- Creat for apis1
- Node0
- Bound to a single CPU46
- New framework every day45
- Lots of terrible examples on the internet40
- Asynchronous programming is the worst33
- Callback24
- Javascript19
- Dependency based on GitHub11
- Dependency hell11
- Low computational power10
- Can block whole server easily7
- Callback functions may not fire on expected sequence7
- Very very Slow7
- Breaking updates4
- Unstable4
- No standard approach3
- Unneeded over complication3
- Can't read server session1
- Bad transitive dependency management1
related Node.js posts
When I joined NYT there was already broad dissatisfaction with the LAMP (Linux Apache HTTP Server MySQL PHP) Stack and the front end framework, in particular. So, I wasn't passing judgment on it. I mean, LAMP's fine, you can do good work in LAMP. It's a little dated at this point, but it's not ... I didn't want to rip it out for its own sake, but everyone else was like, "We don't like this, it's really inflexible." And I remember from being outside the company when that was called MIT FIVE when it had launched. And been observing it from the outside, and I was like, you guys took so long to do that and you did it so carefully, and yet you're not happy with your decisions. Why is that? That was more the impetus. If we're going to do this again, how are we going to do it in a way that we're gonna get a better result?
So we're moving quickly away from LAMP, I would say. So, right now, the new front end is React based and using Apollo. And we've been in a long, protracted, gradual rollout of the core experiences.
React is now talking to GraphQL as a primary API. There's a Node.js back end, to the front end, which is mainly for server-side rendering, as well.
Behind there, the main repository for the GraphQL server is a big table repository, that we call Bodega because it's a convenience store. And that reads off of a Kafka pipeline.
How Uber developed the open source, end-to-end distributed tracing Jaeger , now a CNCF project:
Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.
Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:
https://eng.uber.com/distributed-tracing/
(GitHub Pages : https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)
Bindings/Operator: Python Java Node.js Go C++ Kubernetes JavaScript OpenShift C# Apache Spark
- Rapid development670
- Open source487
- Great community424
- Easy to learn379
- Mvc276
- Beautiful code232
- Elegant223
- Free206
- Great packages203
- Great libraries194
- Comes with auth and crud admin panel79
- Restful79
- Powerful78
- Great documentation75
- Great for web71
- Python57
- Great orm43
- Great for api41
- All included32
- Fast29
- Web Apps25
- Easy setup23
- Clean23
- Used by top startups21
- Sexy19
- ORM19
- The Django community15
- Allows for very rapid development with great libraries14
- Convention over configuration14
- King of backend world11
- Full stack10
- Great MVC and templating engine10
- Fast prototyping8
- Mvt8
- Easy to develop end to end AI Models7
- Batteries included7
- Its elegant and practical7
- Have not found anything that it can't do6
- Very quick to get something up and running6
- Cross-Platform6
- Easy Structure , useful inbuilt library5
- Great peformance5
- Zero code burden to change databases5
- Python community5
- Map4
- Just the right level of abstraction4
- Easy to change database manager4
- Modular4
- Many libraries4
- Easy to use4
- Easy4
- Full-Text Search4
- Scaffold3
- Fastapi1
- Built in common security1
- Scalable1
- Great default admin panel1
- Node js1
- Gigante ta1
- Rails0
- Underpowered templating26
- Autoreload restarts whole server22
- Underpowered ORM22
- URL dispatcher ignores HTTP method15
- Internal subcomponents coupling10
- Not nodejs8
- Configuration hell8
- Admin7
- Not as clean and nice documentation like Laravel5
- Python4
- Not typed3
- Bloated admin panel included3
- Overwhelming folder structure2
- InEffective Multithreading2
- Not type safe1
related Django posts
Simple controls over complex technologies, as we put it, wouldn't be possible without neat UIs for our user areas including start page, dashboard, settings, and docs.
Initially, there was Django. Back in 2011, considering our Python-centric approach, that was the best choice. Later, we realized we needed to iterate on our website more quickly. And this led us to detaching Django from our front end. That was when we decided to build an SPA.
For building user interfaces, we're currently using React as it provided the fastest rendering back when we were building our toolkit. It’s worth mentioning Uploadcare is not a front-end-focused SPA: we aren’t running at high levels of complexity. If it were, we’d go with Ember.js.
However, there's a chance we will shift to the faster Preact, with its motto of using as little code as possible, and because it makes more use of browser APIs. One of our future tasks for our front end is to configure our Webpack bundler to split up the code for different site sections. For styles, we use PostCSS along with its plugins such as cssnano which minifies all the code.
All that allows us to provide a great user experience and quickly implement changes where they are needed with as little code as possible.
Hey, so I developed a basic application with Python. But to use it, you need a python interpreter. I want to add a GUI to make it more appealing. What should I choose to develop a GUI? I have very basic skills in front end development (CSS, JavaScript). I am fluent in python. I'm looking for a tool that is easy to use and doesn't require too much code knowledge. I have recently tried out Flask, but it is kinda complicated. Should I stick with it, move to Django, or is there another nice framework to use?