Apache Spark 101: An Introduction for Big Data Newcomers


Apache Spark is revolutionizing the world of Big Data. Many Data Engineers and Data Analysts are looking into, or already working with, Spark. Companies need to handle the Big Data issues that arise from massive server logs, event files, and other large datasets.

As the MapReduce programming model says farewell to the world of Hadoop, Apache Spark is taking over as the preferred cluster computing framework for Big Data ETL processing and analytics. More and more companies and Big Data teams are adopting it, as showcased at many Hadoop and Big Data conferences around the world.


So, What is Apache Spark?

  • A fast and general-purpose cluster computing system.
  • Designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, plus rich built-in libraries.
  • Extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
  • The main feature Spark offers for speed is the ability to run computations in memory.
  • One of the most active open source Big Data projects.

About Apache Spark

  • A Big Data analytics cluster-computing framework written in Scala.
  • Open source; originally developed at the AMPLab at UC Berkeley.
  • Works with data in memory (RAM).
  • Designed for running iterative algorithms and interactive analytics.
  • Highly compatible with Hadoop’s storage APIs.
  • Supports batch, streaming, and interactive operations.
  • Created by Matei Zaharia.

Apache Spark Ecosystem

(Diagram of the Apache Spark ecosystem. Credit: Databricks)


  • Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools. Spark SQL also lets users ETL their data from whatever format it is currently in (such as JSON, Parquet, or a database), transform it, and expose it for ad-hoc querying.


  • Spark Streaming can be used for processing real-time streaming data. It is based on a micro-batch style of computation and uses the DStream, which is basically a series of RDDs, to process real-time data.


  • MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.


  • GraphX is Spark’s API for graphs and graph-parallel computation. It includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
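Since running Spark Streaming requires a live Spark installation, the micro-batch idea above can be sketched with plain Python: a DStream is conceptually just a sequence of RDDs, modeled here as a sequence of lists. The batch contents are made up for illustration.

```python
# A DStream is conceptually a sequence of RDDs (micro-batches).
# Local sketch: each micro-batch is a plain list.
micro_batches = [["a", "b"], ["c"], ["d", "e", "f"]]

# Applying an operation per batch, like dstream.count() in Spark Streaming:
counts_per_batch = [len(batch) for batch in micro_batches]  # [2, 1, 3]
```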

Apache Spark Core Concepts

At a very high level, every Spark application consists of a driver program that launches various parallel operations on a cluster.

Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.


Initializing SparkContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf().setMaster("local").setAppName("MyApp")
val sc = new SparkContext(conf)

A cluster URL needs to be passed to setMaster(String clusterUrl), which tells Spark how to connect to a cluster. We have used "local" here, a special value that runs Spark on one thread on the local machine without connecting to a cluster.

Apache Spark’s core abstraction for working with data is the Resilient Distributed Dataset (RDD).


An RDD in Apache Spark is simply an immutable distributed collection of objects.

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

How to use RDDs

Users can create RDDs in these ways:

  • loading an external dataset, e.g. SparkContext.textFile()
  • distributing a collection of objects (e.g., a list or set) in their driver program
  • using parallelize, e.g. SparkContext.parallelize(List("Hello", "Hello World"))
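Running the real calls requires a SparkContext, but the two creation paths can be sketched with plain Python stand-ins (the file name and contents below are made up for illustration; the Spark equivalents are shown in comments):

```python
import os
import tempfile

# Stand-in for sc.parallelize(["Hello", "Hello World"]): a local collection.
from_collection = ["Hello", "Hello World"]

# Stand-in for sc.textFile(path): read a file into a collection of lines.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("line one\nline two\n")
with open(path) as f:
    from_file = [line.rstrip("\n") for line in f]
# from_file == ["line one", "line two"]
```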

RDDs offer two types of operations:

  • Transformations: construct a new RDD from a previous one, e.g. map(), filter(). Transformations return RDDs.
  • Actions: compute a result based on an RDD and return it to the driver program, or save it to an external storage system such as HDFS, e.g. first(), count(), sum(). Actions return some other data type.

Apache Spark’s RDD Examples – Book

Assume an RDD containing {1, 2, 3, 3}.

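The book's examples can be reproduced with a plain Python list standing in for the RDD. This is a local sketch, not the PySpark API itself; the Spark equivalents are shown in the comments.

```python
rdd = [1, 2, 3, 3]

# Transformations (each returns a new collection):
plus_one = [x + 1 for x in rdd]        # rdd.map(lambda x: x + 1)      -> [2, 3, 4, 4]
no_ones  = [x for x in rdd if x != 1]  # rdd.filter(lambda x: x != 1)  -> [2, 3, 3]
distinct = sorted(set(rdd))            # rdd.distinct()                -> [1, 2, 3]

# Actions (each returns a plain value to the driver program):
count     = len(rdd)                   # rdd.count()                      -> 4
total     = sum(rdd)                   # rdd.reduce(lambda a, b: a + b)   -> 9
first_two = rdd[:2]                    # rdd.take(2)                      -> [1, 2]
```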


Apache Spark – Transformations

Transformed RDDs are computed lazily, only when you use them in an action.


The filter() operation does not mutate the existing inputRDD; it returns a pointer to an entirely new RDD, so inputRDD can still be reused later in the program.

Spark keeps track of the set of dependencies between different RDDs; this is called the lineage graph.
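The lazy behavior can be sketched with Python generators, which, like RDD transformations, compute nothing when the pipeline is built and only run when a result is forced. The counter here is purely for illustration.

```python
evaluations = {"n": 0}

def double(x):
    # Side-effect counter so we can observe when work actually happens.
    evaluations["n"] += 1
    return x * 2

# Building the lazy pipeline (a "map" followed by a "filter") runs nothing yet:
pipeline = (y for y in (double(x) for x in [1, 2, 3, 4]) if y > 4)
# evaluations["n"] is still 0 at this point.

# Forcing a result (the "action") triggers the computation:
result = list(pipeline)  # [6, 8]; double() has now run 4 times
```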

Apache Spark – Actions

Actions are the operations that return a final value to the driver program or write data to an external storage system. For example: count(), take()

It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.” To avoid this inefficiency, users can persist intermediate results.

Spark computes RDDs only in a lazy fashion: that is, the first time they are used in an action.

Spark’s RDDs are recomputed by default each time you run an action on them.

If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().
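The effect of persisting can be sketched locally: a generator-based pipeline recomputes on every traversal, like an unpersisted RDD, while materializing it once (the persist() analogue) avoids recomputation. The counter is only for illustration.

```python
calls = {"n": 0}

def squares():
    # Lazy pipeline: every fresh traversal recomputes each element,
    # like re-running an action on an unpersisted RDD.
    for x in [1, 2, 3, 4]:
        calls["n"] += 1
        yield x * x

s1 = sum(squares())  # traverses all 4 elements -> 30
s2 = max(squares())  # traverses all 4 again    -> 16 (calls["n"] == 8)

# Analogue of RDD.persist(): materialize once, then reuse many times.
persisted = list(squares())  # calls["n"] == 12 after this
s3 = sum(persisted)          # no further recomputation -> 30
```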

A Few Examples in Apache Spark

Word Count Program
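The classic word count can be sketched with plain Python collections; the equivalent Spark chain (flatMap, map to pairs, reduceByKey) is shown in the comment, and the input lines are made up for illustration.

```python
from collections import Counter

lines = ["to be or not to be", "to do or not to do"]

# Spark sketch:
#   sc.textFile("in.txt").flatMap(lambda l: l.split(" "))
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
words = [w for line in lines for w in line.split(" ")]  # the flatMap step
word_counts = Counter(words)                            # the reduceByKey step
# word_counts["to"] == 4, word_counts["be"] == 2
```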

Finding "ERROR"s in a log file
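Grepping a log for errors is a single filter() plus count() in Spark; here is a plain-Python stand-in with made-up log lines.

```python
log_lines = [
    "INFO  starting job",
    "ERROR disk full",
    "INFO  retrying",
    "ERROR connection timed out",
]

# Spark sketch: sc.textFile("server.log").filter(lambda l: "ERROR" in l).count()
errors = [line for line in log_lines if "ERROR" in line]
error_count = len(errors)  # 2
```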

Improvements – Spark vs MapReduce


  • Speed: Spark can run batch jobs 10 to 100 times faster than the MapReduce engine, according to Cloudera. It does this mainly by reducing the number of disk reads and writes.
  • Ease of use: Implementing both batch and stream processing on top of Spark eliminates much of the complexity of MapReduce, simplifying deployment, maintenance, and application development.
  • Real-time processing: Spark manipulates data in real time using Spark Streaming.
  • Connectivity: Spark’s APIs are available in Scala, Java, and Python, allowing developers to write applications that run on top of Spark in these languages. Spark can interact with data in HDFS (the Hadoop Distributed File System), Hadoop HBase, the Cassandra data store, and several other storage layers.
  • Fault tolerance: In place of persisting or checkpointing intermediate results, Spark remembers the sequence of operations that led to a given dataset and reconstructs the dataset from that lineage when needed.

Spark vs MapReduce – Google Search Trends

If you are curious about the popularity of Spark, Google Trends gives an idea of how quickly searches for Apache Spark are growing.

Apache Spark advanced tutorial

This is the best video tutorial on the web about Apache Spark. It is very detailed and explains every part and component of Apache Spark. It is a very long video, but it is definitely worth it whether you are a novice or experienced Apache Spark developer.

Sameer Farooqui (Databricks) is the presenter.