Introduction to Apache Spark
What is Apache Spark?
Apache SparkTM is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You’ll find it used by organizations from any industry, including at FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing framework with 365,000 meetup members in 2017.
What is the history of Spark?
Apache Spark started in 2009 as a research project at UC Berkley’s AMPLab, a collaboration involving students, researchers, and faculty, focused on data-intensive application domains. The goal of Spark was to create a new framework, optimized for fast iterative processing like machine learning, and interactive data analysis, while retaining the scalability, and fault tolerance of Hadoop MapReduce. The first paper entitled, “Spark: Cluster Computing with Working Sets” was published in June 2010, and Spark was open sourced under a BSD license. In June, 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and established as an Apache Top-Level Project in February, 2014. Spark can run standalone, on Apache Mesos, or most frequently on Apache Hadoop.
Today, Spark has become one of the most active projects in the Hadoop ecosystem, with many organizations adopting Spark alongside Hadoop to process big data. In 2017, Spark had 365,000 meetup members, which represents a 5x growth over two years. It has received contribution by more than 1,000 developers from over 200 organizations since 2009.
How does Spark work?
Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. Developers can write massively parallelized operators, without having to worry about work distribution, and fault tolerance. However, a challenge to MapReduce is the sequential multi-step process it takes to run a job. With each step, MapReduce reads data from the cluster, performs operations, and writes the results back to HDFS. Because each step requires a disk read, and write, MapReduce jobs are slower due to the latency of disk I/O.
Spark was created to address the limitations to MapReduce, by doing processing in-memory, reducing the number of steps in a job, and by reusing data across multiple parallel operations. With Spark, only one-step is needed where data is read into memory, operations performed, and the results written back—resulting in a much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data re-use is accomplished through the creation of DataFrames, an abstraction over Resilient Distributed Dataset (RDD), which is a collection of objects that is cached in memory, and reused in multiple Spark operations. This dramatically lowers the latency making Spark multiple times faster than MapReduce, especially when doing machine learning, and interactive analytics.
Apache Hadoop compared to Spark
Outside of the differences in the design of Spark and Hadoop MapReduce, many organizations have found these big data frameworks to be complimentary, using them together to solve a broader business challenge.
Hadoop is an open source framework that has the Hadoop Distributed File System (HDFS) as storage, YARN as a way of managing computing resources used by different applications, and an implementation of the MapReduce programming model as an execution engine. In a typical Hadoop implementation, different execution engines are also deployed such as Spark, Tez, and Presto.
Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others. Spark on Hadoop leverages YARN to share a common cluster and dataset as other Hadoop engines, ensuring consistent levels of service, and response.
What are the benefits of Apache Spark?
There are many benefits of Apache Spark to make it one of the most active projects in the Hadoop ecosystem. These include:
Through in-memory caching, and optimized query execution, Spark can run fast analytic queries against data of any size.
Apache Spark natively supports Java, Scala, R, and Python, giving you a variety of languages for building your applications. These APIs make it easy for your developers, because they hide the complexity of distributed processing behind simple, high-level operators that dramatically lowers the amount of code required.
Apache Spark comes with the ability to run multiple workloads, including interactive queries, real-time analytics, machine learning, and graph processing. One application can combine multiple workloads seamlessly.
Apache Spark Workloads
The Spark framework includes Spark Core as the foundation for the platform, Spark SQL for interactive queries, Spark Streaming for real-time analytics, Spark MLlib for machine learning, and Spark GraphX for graph processing.
Spark Core is the foundation of the platform. It is responsible for memory management, fault recovery, scheduling, distributing & monitoring jobs, and interacting with storage systems. Spark Core is exposed through an application programming interface (APIs) built for Java, Scala, Python and R. These APIs hide the complexity of distributed processing behind simple, high-level operators.
Spark includes MLlib, a library of algorithms to do machine learning on data at scale. Machine Learning models can be trained by data scientists with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java or Scala-based pipeline. Spark was designed for fast, interactive computation that runs in memory, enabling machine learning to run quickly. The algorithms include the ability to do classification, regression, clustering, collaborative filtering, and pattern mining.
Spark Streaming is a real-time solution that leverages Spark Core’s fast scheduling capability to do streaming analytics. It ingests data in mini-batches, and enables analytics on that data with the same application code written for batch analytics. This improves developer productivity, because they can use the same code for batch processing, and for real-time streaming applications. Spark Streaming supports data from Twitter, Kafka, Flume, HDFS, and ZeroMQ, and many others found from the Spark Packages ecosystem.
Spark SQL is a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce. It includes a cost-based optimizer, columnar storage, and code generation for fast queries, while scaling to thousands of nodes. Business analysts can use standard SQL or the Hive Query Language for querying data. Developers can use APIs, available in Scala, Java, Python, and R. It supports various data sources out-of-the-box including JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet. Other popular stores—Amazon Redshift, Amazon S3, Couchbase, Cassandra, MongoDB, Salesforce.com, Elasticsearch, and many others can be found from the Spark Packages ecosystem.
Spark GraphX is a distributed graph processing framework built on top of Spark. GraphX provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build, and transform a graph data structure at scale. It comes with a highly flexible API, and a selection of distributed Graph algorithms.
Who uses Apache Spark?
As of 2016, surveys show that more than 1,000 organizations are using Spark in production. Some of them are listed on the Powered By Spark page. Apache Spark has become one of the most popular big data distributed processing framework with 365,000 meetup members in 2017. Examples of various customers include:
Yelp’s advertising targeting team makes prediction models to determine the likelihood of a user interacting with an advertisement. By using Apache Spark on Amazon EMR to process large amounts of data to train machine learning models, Yelp increased revenue and advertising click-through rate.
Zillow owns and operates one of the largest online real-estate website. It uses machine-learning algorithms from Spark on Amazon EMR to process large data sets in near real time to calculate Zestimates—a home valuation tool that provides buyers and sellers with the estimated market value for a specific home.
CrowdStrike provides endpoint protection to stop breaches. They use Amazon EMR with Spark to process hundreds of terabytes of event data and roll it up into higher-level behavioral descriptions on the hosts. From that data, CrowdStrike can pull event data together and identify the presence of malicious activity.
Hearst Corporation, a large diversified media and information company, has customers viewing content on over 200 web properties. Using Apache Spark Streaming on Amazon EMR, Hearst’s editorial staff can keep a real-time pulse on which articles are performing well and which themes are trending.
bigfinite stores and analyzes vast amounts of pharmaceutical-manufacturing data using advanced analytical techniques running on AWS. Spark on Amazon EMR is used to run its proprietary algorithms that are developed in Python and Scala.
GumGum, an in-image and in-screen advertising platform, uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3. Spark’s performance enhancements saved GumGum time and money for these workflows.
Intent Media uses Spark and MLlib to train and deploy machine learning models at massive scale. Spark is used to help online travel companies optimize revenue on their websites and apps through sophisticated data science capabilities.
FINRA is a leader in the Financial Services industry who sought to move toward real-time data insights of billions of time-ordered market events by migrating from SQL batch processes on-prem, to Apache Spark in the cloud. By using Apache Spark on Amazon EMR, FINRA can now test on realistic data from market downturns, enhancing their ability to provide investor protection and promote market integrity
Apache Spark use cases
Spark is a general-purpose distributed processing system used for big data workloads. It has been deployed in every type of big data use case to detect patterns, and provide real-time insight. Example use cases include:
Spark is used in banking to predict customer churn, and recommend new financial products. In investment banking, Spark is used to analyze stock prices to predict future trends.
Spark is used to build comprehensive patient care, by making data available to front-line health workers for every patient interaction. Spark can also be used to predict/recommend patient treatment.
Spark is used to eliminate downtime of internet-connected equipment, by recommending when to do preventive maintenance.
Spark is used to attract, and keep customers through personalized services and offers.
Deploying Apache Spark in the cloud
Spark is an ideal workload in the cloud, because the cloud provides performance, scalability, reliability, availability, and massive economies of scale. ESG research found 43% of respondents considering cloud as their primary deployment for Spark. The top reasons customers perceived the cloud as an advantage for Spark are faster time to deployment, better availability, more frequent feature/functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization.
Build your Apache Spark cluster in the cloud on Amazon Web Services
Amazon EMR is the best place to deploy Apache Spark in the cloud, because it combines the integration and testing rigor of commercial Hadoop & Spark distributions with the scale, simplicity, and cost effectiveness of the cloud. It allows you to launch Spark clusters in minutes without needing to do node provisioning, cluster setup, Spark configuration, or cluster tuning. EMR enables you to provision one, hundreds, or thousands of compute instances in minutes. You can use Auto Scaling to have EMR automatically scale up your Spark clusters to process data of any size, and back down when your job is complete to avoid paying for unused capacity. You can lower your bill by committing to a set term, and saving up to 75% using Amazon EC2 Reserved Instances, or running your clusters on spare AWS compute capacity and saving up to 90% using EC2 Spot. Learn more.
Get started by spinning up your first Spark cluster on AWS
Sign up for an AWS account
Learn more about Spark on Amazon EMR