
Apache Spark

Fast and general engine for large-scale data processing

What is Apache Spark?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
Apache Spark is a tool in the Big Data Tools category of a tech stack.
Apache Spark is an open source tool with 38.9K GitHub stars and 28.1K GitHub forks.
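To make the "batch processing (similar to MapReduce)" idea concrete, here is a minimal word-count sketch in plain Python. It is not Spark's API; the function names and the sample lines are assumptions made up for illustration. It only shows the map, shuffle, and reduce phases that engines such as MapReduce and Spark distribute across a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; a real job would read lines from HDFS or similar storage.
lines = ["spark runs on yarn", "spark reads hdfs"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

On a single machine the three phases run one after another; a cluster engine runs the map and reduce phases on many workers and moves data between them during the shuffle.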

Who uses Apache Spark?

523 companies reportedly use Apache Spark in their tech stacks, including Uber, Shopify, and Slack.

2286 developers on StackShare have stated that they use Apache Spark.

Apache Spark Integrations

Jupyter, Snowflake, Azure Cosmos DB, Databricks, and Couchbase are some of the most popular of the 50 tools that integrate with Apache Spark.
Pros of Apache Spark
Fast and Flexible
One platform for every big data problem
Great for distributed SQL like applications
Easy to install and to use
Works well for most data science use cases
Interactive Query
Machine learning library, streaming in real time
In memory Computation
Decisions about Apache Spark

Here are some stack decisions, common use cases and reviews by companies and developers who chose Apache Spark in their tech stack.

Needs advice
Apache Spark

My process is like this: once a month I get data, either from Google BigQuery or as Parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned Parquet files, because the following process cannot handle loading all the data into memory.

The next process runs a heavy computation in parallel (per partition) and stores 3 intermediate versions as Parquet files: two are used for statistics, and the third is filtered to create the final files.

I make a report based on the two statistics files in a Jupyter notebook and convert it to HTML.

  • Everything is done with vanilla Python and Pandas.
  • Sometimes I may get the data in a different format.
  • The cloud service is Microsoft Azure.
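The monthly flow described above (partitioned input, a heavy computation per partition, three intermediate outputs) can be sketched in plain Python as follows. This is only an illustration under assumptions: the row layout, the `value` field, the two statistics, and the filter rule are all hypothetical, and a thread pool stands in for whatever parallel runner is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def process_partition(rows):
    """Hypothetical heavy step: return three intermediate outputs for one partition."""
    values = [row["value"] for row in rows]
    # First intermediate output: basic statistics.
    stats_a = {"count": len(values), "mean": mean(values) if values else 0.0}
    # Second intermediate output: value range.
    stats_b = {"min": min(values, default=None), "max": max(values, default=None)}
    # Third intermediate output: rows kept for the final files (assumed filter rule).
    kept = [row for row in rows if row["value"] >= 0]
    return stats_a, stats_b, kept


def run_per_partition(partitions, max_workers=4):
    """Apply process_partition to every partition; pool.map keeps input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = pool.map(process_partition, partitions.values())
        return dict(zip(partitions.keys(), outputs))


# Hypothetical data: partition key -> rows. Real input would be Parquet files.
partitions = {
    "2024-01": [{"value": 1.0}, {"value": -2.0}, {"value": 3.0}],
    "2024-02": [{"value": 4.0}],
}
results = run_per_partition(partitions)
```

Threads keep the sketch self-contained; CPU-bound pure-Python work would need processes or a cluster engine to run truly in parallel. In PySpark, a similar shape is commonly expressed by grouping on the partition column and applying a function per group, for example with `groupBy(...).applyInPandas(...)`.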

What I'm considering is the following:

Get the data with Kafka or with native Python, do the first processing, and store the data in Druid. The second processing would be done with Apache Spark, reading data from Apache Druid.

The intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.


I am working on an e-learning platform project, and I'm unsure which technology to choose to build a big data pipeline: AWS, Azure, or Apache Spark.

Can Spark do the job (data ingestion, data storage, data processing) and finally create dashboards?


I recently started a new position as a data scientist at an e-commerce company. The company was founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee, so I have to take care of both data analysis tasks and bringing new technologies to the company.

  1. They have used Elasticsearch (and Kibana) for reporting dashboards on daily purchases and user interactions on their e-commerce website.

  2. They also use the Oracle database system to keep records of their daily turnover and lists of their current products, clients, and sellers.

  3. They use a data warehouse with cockpit 10 to generate reports on different aspects of their business, including the data in item 2 of this list.

At the moment, I grab batches of data from their system to perform predictive analytics. In some cases, I use a static form of data such as monthly turnover, client values, and high-demand products, and run my predictive analysis using Python (VS Code). I also use Google Data Studio or Google Sheets to present my findings. In other cases, I try to do time-series analysis using offline batches of data extracted from Elasticsearch for user recommendations and user personalization.

I really want to use modern data science tools such as Apache Spark, Google BigQuery, AWS, Azure, or others where they really fit. I think these tools can improve my performance as a data scientist and can provide more continuous analytics of their business interactions. But honestly, I'm not sure where each tool is needed and what part of their system should be replaced by or combined with the current state of technology to improve productivity from the above perspectives.

Αλέξανδρος Παπαδόπουλος
Junior Researcher at Παπαδόπουλος Αλέξανδρος · 2 upvotes · 27K views
Needs advice
Apache Spark

I use Kafka with Lenses. I would like to integrate Apache Spark for data processing, but I could not find an appropriate connector. Should I use only MySQL for data processing?

Needs advice
Apache Spark

I am new to both Apache Spark and Scala. I am primarily a Java developer, with around 10 years of experience in Java.

I would like to work on machine learning or AI tech stacks. Please help me choose a tech stack and lay out a clear roadmap. Any feedback is welcome.

Technologies apart from Scala and Spark are also welcome. Please note that the tools should be relevant to Machine Learning or Artificial Intelligence.


Blog Posts

Mar 24 2021 at 12:57PM


MySQL, Kafka, Apache Spark +6
Aug 28 2019 at 3:10AM


Python, Java, Amazon S3 +16

Apache Spark's Features

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala or Python
  • Combine SQL, streaming, and complex analytics
  • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3

Apache Spark Alternatives & Comparisons

What are some alternatives to Apache Spark?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Splunk
It provides the leading platform for Operational Intelligence. Customers use it to search, monitor, analyze and visualize machine data.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
Apache Flume
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Apache Spark's Followers
3502 developers follow Apache Spark to keep up with related blogs and decisions.