Why should you care about how Apache Spark works? To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. This article introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop. However, Spark's design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop, although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ. However, Spark's internals, especially how it handles failures, differ from many traditional systems. Spark's ability to leverage lazy evaluation with in-memory computations sets it apart. Spark's creators believe it to be the first high-level programming language for fast, distributed data processing.
How Apache Spark Works & DryadLINQ
DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark's. However, DryadLINQ doesn't use in-memory storage. For more information, see the DryadLINQ documentation.
How Spark Fits into the Big Data Ecosystem
Apache Spark is an open source framework that provides generalizable methods for processing data in parallel: the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager: the storage system houses the data processed with Spark, and the cluster manager orchestrates the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: the Standalone Cluster Manager, Apache Mesos, and Hadoop YARN. The Standalone Cluster Manager is included in Spark, but using it requires installing Spark on each node of the cluster.
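As a rough illustration of the deployment options above, the sketch below builds the master URL string that a Spark application is typically given for each option. The function `master_url`, its parameters, and the host name `master-host` are hypothetical names chosen for this example and are not part of Spark. The URL formats and default ports (7077 for the Standalone manager, 5050 for Mesos) follow the commonly documented conventions, so check them against the documentation for your Spark version.

```python
def master_url(manager, host=None, port=None, threads="*"):
    """Hypothetical helper: return the master URL string for a deployment option."""
    if manager == "local":
        # Local mode: a single JVM; "*" asks for one worker thread per core.
        return f"local[{threads}]"
    if manager == "standalone":
        # Standalone Cluster Manager bundled with Spark.
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration instead.
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager!r}")


local = master_url("local")
standalone = master_url("standalone", host="master-host")
mesos = master_url("mesos", host="master-host")
yarn = master_url("yarn")
```

A string like these is what would be passed as the master setting when an application is launched; the rest of the application code stays the same across deployment options.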
Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined coarse-grained transformations (functions that are applied to the entire dataset), such as map, join, and reduce, to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs.
In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX, which provide more specific data processing functionality. Some of these components have the same general performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations. Spark SQL, for example, has a different query optimizer than Spark Core.
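To make "lazily evaluated" and "coarse-grained transformations" more concrete, here is a minimal single-process sketch in plain Python. The class `LazyDataset` is a toy stand-in invented for this example; it is not Spark's RDD API and does nothing distributed. It only shows the idea that transformations are recorded and applied to the whole dataset, and that no work happens until an action such as `reduce` or `collect` asks for a result.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection of records
        self._steps = steps    # transformations recorded so far, not yet run

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _evaluate(self):
        # Chain the recorded steps over every record in the source.
        records = iter(self._source)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    def collect(self):
        # Action: run the recorded steps and return all results.
        return list(self._evaluate())

    def reduce(self, fn):
        # Action: run the recorded steps and combine the results.
        return fold(fn, self._evaluate())


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)            # still empty: nothing has run
total = pipeline.reduce(lambda a, b: a + b)  # 1 + 9 + 25
```

In this toy, every action re-runs the recorded steps from the source, because nothing is kept in memory between actions.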
Spark SQL: Spark Component
Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and also supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type, called DataFrames, and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets. Spark SQL is very important to Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL.
Spark Machine Learning Packages
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in the early stages and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib, with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames. Eventually, the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have performance considerations beyond those of Spark Core and Spark SQL.
Spark Streaming uses the scheduling of Spark Core for streaming analytics on mini-batches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches.
GraphX is a graph processing framework built on top of Spark with an API for graph computations. GraphX is one of the least mature components of Spark, so we don't cover it in much detail. In future versions of Spark, typed graph functionality will be introduced on top of the Dataset API.
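The idea of mini-batches and window sizes can be sketched in plain Python. The functions `mini_batches` and `windowed_counts`, the sample events, and the choice to batch by a timestamp carried in each event are all assumptions made for this illustration. This is not the Spark Streaming API, only a toy model of cutting a stream into fixed-length batches and aggregating over a sliding window of recent batches.

```python
from collections import Counter


def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    by_index = {}
    for timestamp, value in events:
        by_index.setdefault(int(timestamp // batch_interval), []).append(value)
    if not by_index:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [by_index.get(i, []) for i in range(min(by_index), max(by_index) + 1)]


def windowed_counts(batches, window_length):
    """Count values over a sliding window of the last `window_length` batches."""
    results = []
    for end in range(1, len(batches) + 1):
        window = batches[max(0, end - window_length):end]
        results.append(Counter(value for batch in window for value in batch))
    return results


# Hypothetical sample stream: (seconds since start, event type)
events = [(0.5, "a"), (1.2, "b"), (2.1, "a"), (2.9, "a"), (5.0, "b")]
batches = mini_batches(events, batch_interval=2)
counts = windowed_counts(batches, window_length=2)
```

A longer window means each result covers more batches, so each aggregation touches more data. That is one reason window size shows up as a tuning consideration.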