Spark

Pranav Kumar
7 min read · Oct 27, 2022

  1. Spark was created by Matei Zaharia at UC Berkeley's AMPLab.
  2. Spark is an in-memory processing framework.
  3. It keeps data mostly in the cluster's RAM, which makes processing much faster.
  4. It can be up to 100 times faster than MapReduce programs.
  5. It is open source.
  6. It is maintained by the Apache Software Foundation.
  7. Its native language is Scala, but it also supports Java, R, Python & SQL.
  8. It was open-sourced in 2010 and later donated to the Apache Software Foundation.

#) HISTORY

  1. In 2009, the R&D team at UC Berkeley's AMPLab was working on a project named MESOS.
  2. In 2009 there was no YARN (it arrived with Hadoop v2 in 2012), so there was no general-purpose cluster manager. MESOS was created to fill that role.
  3. YARN is very similar to MESOS.
  4. The team at Berkeley started testing MESOS with MapReduce programs, which were quite slow at the time.
  5. So they wanted a new framework with which to test MESOS.
  6. And so they created SPARK, which processes programs in-memory.
  7. In 2013, the project was donated to Apache.
  8. It became famous around 2015 because of its speed.
  9. Many companies long ran version 1.6 because it is very stable; at the time of writing, the latest version is 3.3.0.
  10. After v1.6, the next release jumped straight to v2.0.
  11. In 2013, the creators of SPARK founded DATABRICKS, which commercially distributes Spark.
  12. If you want a managed, Spark-only platform, DATABRICKS is the best option.
  13. SPARK is an independent project that doesn't require HADOOP, but you'll mostly see it running on top of HADOOP.
  14. There are currently more than 10 SQL engines that can run on top of HADOOP.

#) PRESENT LANDSCAPE

  1. The biggest USP of SPARK is that it is a unified processing engine: what 50+ tools do on top of HADOOP, SPARK can do alone.
  2. Speed is a by-product of this design.
  3. SPARK handles scheduling, distributing and monitoring of jobs, and it can use both RAM and hard disk.
  4. Spark supports Scala, Java, Python and R.
  5. Spark 2.0 added an abstraction layer that makes supporting all these languages possible.

#) MODES OF SPARK

  1. LOCAL — running it on a local machine; used only for development and testing.
  2. MESOS — Spark was originally designed to run on top of MESOS.
  3. STANDALONE — if you want to run Spark without YARN installed, Spark ships its own cluster manager, called STANDALONE.
  4. YARN — most companies run YARN on their clusters.
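In practice, the mode is selected through the `--master` URL passed to `spark-submit`. A minimal sketch of the four modes above (the application file `app.py` and the host names are placeholders, not from this article):

```shell
# Hypothetical app.py; only the --master URL changes per mode.
spark-submit --master "local[*]" app.py                 # LOCAL: all cores of one machine
spark-submit --master spark://master-host:7077 app.py   # STANDALONE: Spark's own cluster manager
spark-submit --master mesos://mesos-host:5050 app.py    # MESOS
spark-submit --master yarn app.py                       # YARN: the common production setup
```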

#) DETAILS

  1. Spark has no storage layer because it is an execution engine; you have to get the data from somewhere.
  2. That means if you are running Spark on Hadoop, HDFS holds the data.
  3. If it is running on a local machine, your local file system holds the data.
  4. Spark can also run on KUBERNETES. Kubernetes is a container orchestrator, i.e., a data-center manager: in a data center you can run many DOCKER containers, and Kubernetes manages all of them.
  5. In the cloud you have multiple options. On AWS, for example, you can run programs using EMR (Elastic MapReduce), which is very fast: it can create a 100-200 node cluster in 5-10 minutes. These are disposable clusters, i.e., once the job is done you delete them, because you pay for as long as they run.

#) ZOOKEEPER

  1. Spark supports Zookeeper.
  2. Zookeeper manages clusters & keeps a list of them.
  3. It is a service coordinator.
  4. By default, there is an active and a passive NameNode.
  5. The passive NameNode is also known as the STANDBY NameNode.
  6. If the active NameNode crashes, the standby NameNode informs ZooKeeper and takes over as the ACTIVE NameNode.

#) READING DATA

  1. Spark can read data from almost any file system. Eg:- HDFS, AWS S3, LOCAL.
  2. It can read from any NoSQL database & RDBMS.
  3. If you have a MySQL database or MongoDB, Spark can directly access the data, process it, and store it back into the table.
  4. It can work with FLUME and KAFKA for high availability of data.

#) SPARK ECOSYSTEM

  1. SPARK SQL — in Spark SQL we can create tables and DataFrames. Spark SQL is integrated with HIVE by default: if I have a Hadoop cluster with SPARK running on top of it, then from Spark I can retrieve data from HIVE. Hive itself is very slow, so you can read the Hive tables from Spark with a query instead. These days Hive is often used just for storage & most of the processing is done by Spark.
  2. GraphX — used for graph processing. It matters because data on social-media sites is stored in graph form, and representing such relationships as a graph is easy.
  3. SPARK STREAMING — used for processing real-time streaming data. Even if your processing machines are down, your data will still keep arriving, because FLUME & KAFKA collect and buffer it.

#) SPARK PROGRAM

  1. DRIVER — when we run a Spark program there is something known as the Driver. It is the master of the program, i.e., it runs on the master machine in the cluster, and it declares the transformations and actions on RDDs. Once an action is performed on an RDD, the program is submitted to the cluster through the SparkContext held by the driver.
  2. EXECUTOR — the slave of the program. Whenever a Spark program runs, there are a Driver & Executors. In LOCAL mode a single JVM is created that holds both the Driver and the Executor. Executors run computations and store data on the worker nodes.

On a cluster, YARN containers are allocated. LOCAL MODE is not very efficient because everything runs inside a single JVM, as if you had only one container. If you run Python code, Spark translates its logic and executes it inside the JVM.

LOCAL MODE is only for testing and development.

Let’s say, I’m running a cluster and have 4 datanodes and I write a spark program. Then,


It’s not easy to write a Spark program, as you have to specify the RAM, processor cores & number of executors.

#) RDD

  1. It stands for Resilient Distributed Datasets.
  2. They represent your data in RAM.
  3. They are processed in parallel and are fault-tolerant in nature.
  4. RDDs are distributed and immutable.
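The properties above can be sketched in plain Python (this is a conceptual analogy, not actual Spark): the data is split into partitions, each partition is processed independently and in parallel, and an operation returns a new collection while the original stays untouched.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual RDD sketch: partitioned, parallel, immutable.
data = list(range(10))
num_partitions = 4

# Split the data into roughly equal partitions.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def process_partition(part):
    # The same function runs on every partition in parallel.
    return [x * x for x in part]

with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    squared_partitions = list(pool.map(process_partition, partitions))

# Combine per-partition results; the source data is unchanged (immutability).
squared = sorted(x for part in squared_partitions for x in part)
print(squared)   # squares of 0..9
print(data)      # original list is intact
```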

#) DAG

  1. It stands for Directed Acyclic Graph.
  2. It consists of vertices and edges.
  3. Vertices represent RDDs and edges represent operations applied to RDDs.
  4. This graph is unidirectional i.e. it has only one flow.
  5. It converts the logical execution plan into a physical execution plan.
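The idea can be sketched with a tiny lazy class in plain Python (a conceptual model, not Spark's implementation): transformations only append an edge to the recorded lineage, and nothing executes until an action walks the graph.

```python
# Conceptual DAG / lineage sketch: transformations are recorded,
# execution happens only when an action is called.
class LazyRDD:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage            # recorded chain of operations

    def map(self, fn):                     # transformation: extend the DAG
        return LazyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):                # transformation: extend the DAG
        return LazyRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):                     # action: execute the recorded plan
        result = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = LazyRDD(range(6)).map(lambda x: x * 10).filter(lambda x: x > 20)
# Nothing has been computed yet; only the plan exists.
print(rdd.collect())   # [30, 40, 50]
```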

#) SPARK DATAFRAMES

  1. Spark DataFrames are distributed collections of data organized into named columns, similar to a table in SQL.
  2. They can be created from arrays of data, databases, existing RDDs, Hive tables, etc.
  3. They support many data formats and sources, such as CSV, Avro, Elasticsearch, etc.
  4. They can process data ranging from kilobytes to petabytes, on a single node or on large clusters.

#) SPARK DATASET

  1. A Spark Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema.
  2. Datasets combine the powers of both RDDs and DataFrames.

#) CHECKPOINT

  1. Apache Spark provides an API for adding and managing checkpoints.
  2. Checkpointing is the process of making streaming applications resilient to failures.
  3. It allows saving data and metadata into a checkpoint directory.
  4. In case of failure, Spark can recover this data & resume from wherever it stopped.
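The recovery idea can be illustrated in plain Python (a conceptual sketch, not Spark's checkpointing API): progress is periodically written to a checkpoint directory, and a restart resumes from the saved offset instead of reprocessing everything.

```python
import json
import os
import tempfile

# Conceptual checkpointing sketch: persist progress, resume after a crash.
checkpoint_dir = tempfile.mkdtemp()
checkpoint_file = os.path.join(checkpoint_dir, "state.json")

def save_checkpoint(offset, total):
    with open(checkpoint_file, "w") as f:
        json.dump({"offset": offset, "total": total}, f)

def load_checkpoint():
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            return json.load(f)
    return {"offset": 0, "total": 0}      # fresh start: no checkpoint yet

stream = list(range(1, 11))               # pretend this is incoming data

# First run: process half the stream, checkpoint, then "crash".
state = load_checkpoint()
for i in range(state["offset"], 5):
    state["total"] += stream[i]
save_checkpoint(5, state["total"])

# Restart: recovery resumes from the saved offset, not from zero.
state = load_checkpoint()
for i in range(state["offset"], len(stream)):
    state["total"] += stream[i]

print(state["total"])   # 55, same as processing the stream exactly once
```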

#) TRANSFORMATIONS

  1. Transformations are functions applied to an RDD that create another RDD.
  2. They are lazy: no computation occurs until an action takes place.
  3. Eg:- map() and filter()
  4. map() — the map function applies a function to every element of the RDD, producing a new RDD.
  5. filter() — the filter function creates a new RDD by selecting those elements of the current RDD that pass the function argument.
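What map() and filter() compute can be shown with plain Python equivalents (the Spark versions do the same element-wise work, but lazily and across partitions):

```python
# Plain-Python analogues of Spark's map() and filter().
lines = ["spark is fast", "hadoop is big"]

# map(): apply a function to every element, one output per input.
word_lists = [line.split() for line in lines]

# filter(): keep only the elements that satisfy a predicate.
spark_lines = [line for line in lines if "spark" in line]

print(word_lists)   # [['spark', 'is', 'fast'], ['hadoop', 'is', 'big']]
print(spark_lines)  # ['spark is fast']
```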

#) ACTIONS

  1. Actions bring data from an RDD back to the local machine.
  2. Eg:- reduce() and take()
  3. reduce() — an action that combines elements repeatedly until one value is left.
  4. take(n) — an action that brings the first n values of the RDD back to the local node.
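Again in plain-Python terms (Spark runs the same logic distributed across executors, then returns the result to the driver):

```python
from functools import reduce
from itertools import islice

# Plain-Python analogues of Spark's reduce() and take().
nums = [1, 2, 3, 4, 5]

# reduce(): combine elements pairwise until a single value remains.
total = reduce(lambda a, b: a + b, nums)

# take(n): return the first n elements to the local machine.
first_three = list(islice(nums, 3))

print(total)        # 15
print(first_three)  # [1, 2, 3]
```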
