
Apache Spark

Fast and general engine for large-scale data processing

What is Apache Spark?

Spark is a fast, general-purpose processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to handle both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.
Apache Spark is a tool in the Big Data Tools category of a tech stack.
Apache Spark is an open source tool with 27.8K GitHub stars and 22.6K GitHub forks. Its repository is hosted on GitHub at https://github.com/apache/spark.
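
To give a feel for the programming model, here is a minimal sketch of a Spark batch job in Scala that counts words in a text file. The input path is a placeholder, and local[*] stands in for a real cluster manager such as YARN:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // SparkSession is the entry point for modern Spark applications
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]") // placeholder; spark-submit sets this on a real cluster
          .getOrCreate()
        import spark.implicits._

        // Placeholder input path; HDFS, S3, or a local file all work here
        val lines = spark.read.textFile("hdfs:///data/input.txt")

        // Split lines into words and count occurrences of each word
        val counts = lines
          .flatMap(_.split("\\s+"))
          .groupByKey(identity)
          .count()

        counts.show()
        spark.stop()
      }
    }

Packaged as a JAR, the same program runs unchanged on a YARN cluster via spark-submit.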

Who uses Apache Spark?

Companies
419 companies reportedly use Apache Spark in their tech stacks, including Uber, Shopify, and Slack.

Developers
1403 developers on StackShare have stated that they use Apache Spark.

Apache Spark Integrations

Jupyter, Azure Cosmos DB, Snowflake, Couchbase, and Apache Hive are some of the most popular of the 32 tools that integrate with Apache Spark.
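
As an illustration of one such integration, here is a hedged sketch of querying an existing Hive table through Spark SQL; the warehouse path and the sales table are assumptions made for the example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveIntegration")
      // Assumed warehouse location; match your Hive configuration
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .enableHiveSupport() // requires a Spark build packaged with Hive support
      .getOrCreate()

    // "sales" is a hypothetical Hive table used only for illustration
    val topRegions = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total DESC")
    topRegions.show()
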
Public Decisions about Apache Spark

Here are some stack decisions, common use cases and reviews by companies and developers who chose Apache Spark in their tech stack.

Shared insights on Java, Scala, and Apache Spark
I am new to both Apache Spark and Scala. I am primarily a Java developer with around 10 years of experience in Java.

I would like to work on machine learning or AI tech stacks. Please help me choose a tech stack and put together a clear roadmap. Any feedback is welcome.

Technologies other than Scala and Spark are also welcome, as long as the tools are relevant to machine learning or artificial intelligence.
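
For readers weighing a similar move, Spark ships with its own machine learning library, MLlib. The sketch below, with made-up toy data and column names, shows the basic Pipeline pattern of chaining feature engineering and a model:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MLlibSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data (two features and a label); real projects load a DataFrame from storage
    val training = Seq(
      (0.0, 1.1, 0.0),
      (2.0, 1.0, 1.0),
      (2.2, 1.4, 1.0),
      (0.1, 0.9, 0.0)
    ).toDF("f1", "f2", "label")

    // Assemble raw columns into the single vector column MLlib estimators expect
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // A Pipeline chains the stages into one unit that is fit and applied together
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("label", "prediction").show()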


Apache Spark's Features

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala, or Python
  • Combine SQL, streaming, and complex analytics (see the sketch after this list)
  • Spark runs on Hadoop, Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3
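
To make the "combine SQL, streaming, and complex analytics" point concrete, here is a small Structured Streaming sketch; the built-in rate source stands in for a real stream such as Kafka:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder().appName("StreamingSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // The built-in "rate" source emits (timestamp, value) rows for testing
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // The same DataFrame/SQL operations used for batch work apply to the stream:
    // count events in tumbling 10-second windows
    val counts = stream.groupBy(window($"timestamp", "10 seconds")).count()

    // Print each updated result table to the console as the stream progresses
    val query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()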

Apache Spark Alternatives & Comparisons

What are some alternatives to Apache Spark?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
It provides the leading platform for Operational Intelligence. Customers use it to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra automatically repartitions as machines are added to or removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It provides a unified model for implementing batch and streaming data processing jobs, and executes the resulting pipelines on any of multiple execution engines.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

Apache Spark's Followers
1991 developers follow Apache Spark to keep up with related blogs and decisions.