What is Apache Spark, and what is an RDD in Java?
Motivation
This internal session is an introduction to Apache Spark, so it is beginner friendly. Topics covered:
- What is Spark?
- History of Spark
- Hadoop vs Spark
- Spark ecosystem
- Architecture
Spark
- Distributed parallel execution framework: data can be divided across multiple systems
- Easy to perform quick operations
- Supports various languages --> Java, R, Scala
Applications --> data scientists and analysts use it to query, analyze, and transform data.
History
- Research project --> open-sourced under a BSD license --> became an Apache project --> used by Databricks.
Hadoop vs Spark
Hadoop --> 1. first map --> and then reduce.
- Quite a lot of writes
- Intermediate data is taken and pushed back to the hard disk
Spark --> reading and writing are faster because data is kept in memory; often cited as up to 100 times faster
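To make the "first map, then reduce" flow concrete, here is a minimal single-JVM sketch of the pattern using plain Java collections. It is not Hadoop or Spark API; the class and method names (`MapThenReduceSketch`, `mapPhase`, `reducePhase`) are made up for illustration. In Hadoop MapReduce the intermediate result between the two phases is written to disk, whereas here it is simply a list held in memory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: a word count split into an explicit map phase and reduce phase.
class MapThenReduceSketch {

    // Map phase: break every line into individual words.
    static List<String> mapPhase(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Reduce phase: group identical words and count them (TreeMap keeps keys sorted).
    static Map<String, Long> reducePhase(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark is fast", "hadoop is reliable");
        // This intermediate list is the data Hadoop would push to disk between phases.
        List<String> intermediate = mapPhase(lines);
        System.out.println(reducePhase(intermediate));
        // prints {fast=1, hadoop=1, is=2, reliable=1, spark=1}
    }
}
```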
Apache Spark Features
- Lazy Evaluation
- Real-time computation
- Spark and Hadoop integration
- Machine Learning for iterative tasks
Spark Ecosystem
- Spark Streaming
- Spark SQL
- MLlib
- GraphX
- SparkR
Spark Architecture
- Master-slave architecture: a driver (master) coordinates executors running on worker nodes
Apache Spark RDD
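- Resilient Distributed Dataset
- This is a data structure; it cannot be modified in place, and each transformation produces a new RDD
- In-memory computation, lazy evaluation, fault tolerant, immutable
- Coarse-grained operations --> an operation is applied to the whole dataset, not to individual elements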
Apache Spark Terminology
- Job
- Stage
- Partitions
DAG --> execution plan, a directed acyclic graph
Stage --> boundary created by shuffling of data
Operations
- Transformations: describe what we want to achieve; nothing is executed yet
- Actions: trigger the actual execution and return a result
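The split between transformations and actions can be felt without a cluster. The sketch below is an analogy only: it uses `java.util.stream` from the JDK, not the Spark API, because Java streams follow the same idea (intermediate operations are lazy, a terminal operation triggers the work). The class name `LazyEvaluationSketch` and the `processed` list are hypothetical helpers that make the laziness visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream, not Spark.
class LazyEvaluationSketch {

    // Records which elements were actually processed.
    static final List<Integer> processed = new ArrayList<>();

    // "Transformation": describes the doubling but does not run it.
    static Stream<Integer> describeDoubling(List<Integer> input) {
        return input.stream().map(x -> {
            processed.add(x);
            return x * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> plan = describeDoubling(Arrays.asList(1, 2, 3));
        // Nothing has run yet, so the list is still empty.
        System.out.println("After transformation: processed = " + processed);

        // "Action": the terminal operation triggers the computation.
        List<Integer> result = plan.collect(Collectors.toList());
        System.out.println("After action: processed = " + processed + ", result = " + result);
    }
}
```

In Spark the same roles are played by calls such as `map` (transformation) and `collect` or `reduce` (action) on an RDD.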
How to create a basic program
- Create a basic Maven project
- Add the Apache Spark dependency to the pom.xml
- Now we can use the API
- Create an instance (for example a Spark context), create the data, and run the operations on it.
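The steps above could look roughly like the following minimal sketch. It assumes the Spark core artifact is already on the classpath (for example `org.apache.spark:spark-core_2.12`; the exact Scala suffix and version depend on your setup) and a Java version supported by your Spark release. The class name `BasicSparkApp`, the method `sumOfSquares`, and the `local[*]` master are choices made for this demo, not part of the original session.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class BasicSparkApp {

    static int sumOfSquares(List<Integer> numbers) {
        // Assumption: "local[*]" runs Spark inside this JVM on all available cores.
        SparkConf conf = new SparkConf().setAppName("BasicSparkApp").setMaster("local[*]");

        // Create the instance; try-with-resources stops the context when done.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Create the data: turn a local list into an RDD.
            JavaRDD<Integer> rdd = sc.parallelize(numbers);

            // Transformation: lazy, returns a new RDD and leaves the original unchanged.
            JavaRDD<Integer> squares = rdd.map(x -> x * x);

            // Action: triggers the job and brings the result back to the driver.
            // Note: reduce fails on an empty RDD.
            return squares.reduce((a, b) -> a + b);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4))); // prints 30
    }
}
```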