What is Apache Spark, RDD in Java?

sandeep negi


2 min read

Motivation

This internal session is an introduction to Apache Spark, so it is beginner friendly.

What is Spark?
History of Spark?
Hadoop vs Spark
Spark ecosystem
Architecture

Spark

Distributed parallel execution framework
Data can be divided across several systems
Easy to perform quick operations
Supports various languages --> Java, Scala, Python, R

Applications --> data scientists and analysts use it to query, analyze and transform data.

History

Research project at UC Berkeley --> open sourced under a BSD license --> became an Apache project --> commercialized by Databricks.

Hadoop vs Spark

Hadoop (MapReduce) --> first map, and then reduce.

Quite a lot of writes
The intermediate data is taken and pushed back to the hard disk between steps

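Spark --> keeps intermediate data in memory, so reading and writing is faster; up to 100 times faster than Hadoop MapReduce is the commonly quoted figure for in-memory workloads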
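
To make the "first map, then reduce" flow concrete, here is a minimal plain-Java sketch of a word count split into the two phases. It does not use the Hadoop or Spark APIs; the class and method names are made up for illustration, and the in-memory list stands in for the intermediate data that Hadoop MapReduce would write to disk.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapThenReduceSketch {

    // "Map" phase: turn every line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts that belong to the same word.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark is fast", "hadoop is reliable");
        // Hadoop MapReduce writes the map output to disk before the reduce
        // phase reads it; in this sketch it is just a list held in memory.
        List<Map.Entry<String, Integer>> intermediate = mapPhase(lines);
        System.out.println(reducePhase(intermediate));
    }
}
```

The point of the comparison is the hand-off between the two phases: when that hand-off goes through the hard disk on every step, a long chain of steps pays for it every time.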

Apache Spark Features

Lazy Evaluation
Real-time computation
Spark and Hadoop integration
Machine Learning for iterative tasks

Spark Ecosystem

Spark Streaming
Spark SQL
MLlib
GraphX
SparkR

Spark Architecture

Master-slave architecture --> a driver (master) coordinates the executors (workers) that run the tasks

Apache Spark RDD

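RDD stands for resilient distributed dataset
This is a data structure, and it cannot be modified once it is created; a transformation returns a new RDD
In-memory computation, lazy evaluation, fault tolerance, immutability
Coarse-grained operations --> an operation is applied to the whole dataset, not to a single element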
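
The immutability point can be sketched with the Spark Java API. This is a minimal example, assuming `spark-core` is on the classpath and Spark runs in local mode; the class name and the sample numbers are made up for illustration. `map` is applied to the whole dataset and returns a new RDD, while the RDD it was called on keeps its original contents.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddImmutabilitySketch {

    // Returns [original elements, doubled elements].
    static List<List<Integer>> run(JavaSparkContext sc) {
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // map() does not change 'numbers'; it describes a new RDD.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        return Arrays.asList(numbers.collect(), doubled.collect());
    }

    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM; an assumption for the sketch.
        SparkConf conf = new SparkConf()
                .setAppName("rdd-immutability-sketch")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<List<Integer>> result = run(sc);
            System.out.println("original: " + result.get(0));
            System.out.println("doubled:  " + result.get(1));
        }
    }
}
```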

Apache Spark Terminologies

Job
Stage
Partitions

DAG --> execution plan, a directed acyclic graph of the operations
Stage --> a group of tasks that can run without shuffling data; a shuffle starts a new stage
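
These terms can be seen in a small Java sketch, assuming `spark-core` on the classpath and a local master; the class name, method name and sample words are made up for illustration. The RDD is created with an explicit number of partitions, `reduceByKey` forces a shuffle (so a stage boundary), and `toDebugString` prints the lineage Spark uses to build its execution plan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class StagesAndPartitionsSketch {

    static Map<String, Integer> countWords(JavaSparkContext sc, List<String> input) {
        // The second argument asks for 4 partitions.
        JavaRDD<String> words = sc.parallelize(input, 4);
        System.out.println("partitions: " + words.getNumPartitions());

        // reduceByKey needs all values of a key in one place, so data is shuffled.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        // Lineage of the RDD; the shuffle shows up as a separate step.
        System.out.println(counts.toDebugString());

        // collectAsMap() is an action, so this is where a job is submitted.
        return counts.collectAsMap();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("stages-and-partitions-sketch")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(countWords(sc, Arrays.asList("a", "b", "a", "c", "b", "a")));
        }
    }
}
```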

Operations

Transformations : describe what we want to achieve; they build a new RDD and do not run yet
Actions : trigger the actual run and return a result
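
A rough way to observe the difference, assuming `spark-core` on the classpath and a local master: the sketch below counts how often the function passed to `map` is called, using a Spark accumulator. The class name and the numbers are made up for illustration. Nothing is counted after the transformation is declared; the calls only happen once the `reduce` action runs.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class LazyEvaluationSketch {

    // Returns {calls seen before the action, result of the action, calls seen after it}.
    static long[] run(JavaSparkContext sc) {
        LongAccumulator calls = sc.sc().longAccumulator("map calls");

        // Transformation: only recorded in the plan, not executed yet.
        JavaRDD<Integer> squares = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                .map(n -> {
                    calls.add(1L);
                    return n * n;
                });
        long before = calls.value();

        // Action: Spark now runs the map function and sums the squares.
        long total = squares.reduce(Integer::sum);
        long after = calls.value();

        return new long[] {before, total, after};
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("lazy-evaluation-sketch")
                .setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long[] r = run(sc);
            System.out.println("calls before action: " + r[0]);
            System.out.println("sum of squares:      " + r[1]);
            System.out.println("calls after action:  " + r[2]);
        }
    }
}
```

Accumulator updates made inside a transformation can be applied more than once if a task is retried, so the counter here is a teaching aid for a local run rather than something to rely on in a real job.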

How to create a basic program

Create a basic Maven project
Add the Apache Spark dependency to the pom.xml
Now we can use the API
Create a Spark context instance, create the data, and run the operations on it
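
The steps above can be sketched as one small Java class. This is a minimal example, not the code from the session: it assumes the `org.apache.spark:spark-core_2.12` artifact (in a version that matches your JDK) has been added to the pom.xml, runs Spark in local mode, and uses a made-up class name and sample data.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class BasicSparkApp {

    // Keeps the even numbers, squares them, and adds the squares up.
    // Note: reduce() throws on an empty RDD, so the input needs at least one even number.
    static int sumOfEvenSquares(JavaSparkContext sc, List<Integer> data) {
        return sc.parallelize(data)        // create the data as an RDD
                .filter(n -> n % 2 == 0)   // transformation
                .map(n -> n * n)           // transformation
                .reduce(Integer::sum);     // action: triggers the job
    }

    public static void main(String[] args) {
        // Create the instance: a configuration plus the Java Spark context.
        SparkConf conf = new SparkConf()
                .setAppName("basic-spark-app")
                .setMaster("local[*]"); // run inside this JVM; assumption for the sketch

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
            // Even numbers are 2, 4, 6, so the result is 4 + 16 + 36 = 56.
            System.out.println("sum of even squares: " + sumOfEvenSquares(sc, data));
        }
    }
}
```

On a real cluster the master would normally be supplied from outside (for example by `spark-submit`) instead of being hard-coded with `setMaster`.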