What is Apache Spark, and what is an RDD in Java?

Motivation

This internal session is an introduction to Apache Spark, so it is beginner friendly. It covers:

  1. What is Spark?
  2. History of Spark
  3. Hadoop vs Spark
  4. Spark ecosystem
  5. Architecture

Spark

  • Distributed, parallel execution framework
  • Data can be divided across multiple systems (nodes)
  • Makes it easy to run operations quickly on large data
  • Supports various languages, such as Java, R, and Scala

Applications --> used by data scientists and analysts to query, analyze, and transform data.

History
  • Research project --> open sourced under a BSD license --> became an Apache project --> adopted by Databricks.

Hadoop vs Spark

Hadoop (MapReduce) --> first map, and then reduce.

  • Quite a lot of writes
  • Intermediate data is taken and pushed back to the hard disk between steps
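
The "first map, then reduce" idea can be sketched in plain Java. This is only an illustration of the two phases using the standard library, not the Hadoop API; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapThenReduceSketch {

    // Map phase: turn every line into individual lowercase words.
    static List<String> mapPhase(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Reduce phase: group equal words together and count each group.
    static Map<String, Long> reducePhase(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark is fast", "hadoop is reliable");
        // In Hadoop MapReduce, this intermediate result would be written to disk
        // before the reduce phase reads it again.
        List<String> words = mapPhase(lines);
        System.out.println(reducePhase(words));
    }
}
```

The intermediate `words` list stands in for the data that Hadoop writes to disk between the two phases, which is where most of the extra writes come from.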

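Spark --> keeps intermediate data in memory, so reading and writing is faster; up to about 100 times faster than Hadoop MapReduce for some in-memory workloads.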

Apache Spark Features

  1. Lazy Evaluation
  2. Real-time computation
  3. Spark and Hadoop integration
  4. Machine Learning for iterative tasks

Spark Ecosystem

  • Spark Streaming
  • Spark SQL
  • MLlib
  • GraphX
  • SparkR

Spark Architecture

Master-slave architecture: a driver program (master) coordinates the executors running on the worker nodes (slaves).

Apache Spark RDD

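  • RDD stands for Resilient Distributed Dataset
  • This is a data structure, and it cannot be modified once created; a transformation produces a new RDD instead
  • Key properties: in-memory computation, lazy evaluation, fault tolerance, immutability
  • Coarse-grained operations --> an operation applies to the whole dataset rather than to individual elements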
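
A minimal sketch of the immutability point, assuming Spark's Java API (`spark-core`) is on the classpath and running in local mode. The class name, method name, and app name are hypothetical. Applying `map` returns a new RDD and leaves the original one unchanged.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddImmutabilitySketch {

    // Returns the contents of the original RDD after another RDD was derived from it.
    static List<Integer> originalAfterMap() {
        SparkConf conf = new SparkConf()
                .setAppName("rdd-immutability-sketch") // hypothetical app name
                .setMaster("local[*]");                // assumption: run locally, no cluster
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

            // Coarse-grained operation: the function is applied to every element.
            // The result is a new RDD; 'numbers' itself is not changed.
            JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

            System.out.println("doubled  = " + doubled.collect());
            return numbers.collect();
        }
    }

    public static void main(String[] args) {
        System.out.println("original = " + originalAfterMap());
    }
}
```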

Apache Spark Terminologies

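  1. Job: the work that Spark runs when an action is called
  2. Stage: a group of tasks within a job
  3. Partition: a chunk of the data that is processed by one task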

DAG --> the execution plan, a directed acyclic graph
Stage --> stages are separated by the shuffling of data
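
To make these terms more concrete, here is a small word-count sketch, assuming Spark's Java API in local mode; the class, method, and app names are hypothetical. It asks for two partitions, uses `reduceByKey` (which needs a shuffle, so the job is typically split into two stages), and prints the lineage of the result with `toDebugString()`.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class StagesAndPartitionsSketch {

    static Map<String, Integer> countWords() {
        SparkConf conf = new SparkConf()
                .setAppName("stages-and-partitions-sketch") // hypothetical app name
                .setMaster("local[*]");                     // assumption: local mode
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Explicitly split the data into 2 partitions.
            JavaRDD<String> words =
                    sc.parallelize(Arrays.asList("spark", "rdd", "spark", "dag"), 2);
            System.out.println("partitions = " + words.getNumPartitions());

            JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));

            // reduceByKey must bring equal keys together, which shuffles data.
            JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

            // Prints the lineage (the chain of RDDs) behind 'counts'.
            System.out.println(counts.toDebugString());

            // collectAsMap is the action that actually starts the job.
            return counts.collectAsMap();
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords());
    }
}
```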

Operations

  1. Transformations: describe what we want to achieve (for example map or filter); nothing is executed yet
  2. Actions: trigger the actual execution (for example collect or count)
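
The difference between the two can be observed with a small sketch, assuming Spark's Java API in local mode; the class, method, and accumulator names are hypothetical. A `LongAccumulator` counts how often the `map` function is called: it should still be 0 after the transformation is declared and only grow once an action runs.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class LazyEvaluationSketch {

    // Returns {calls before the action, calls after the action}.
    static long[] callsBeforeAndAfterAction() {
        SparkConf conf = new SparkConf()
                .setAppName("lazy-evaluation-sketch") // hypothetical app name
                .setMaster("local[*]");               // assumption: local mode
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            LongAccumulator calls = sc.sc().longAccumulator("mapCalls");

            // Transformation: only recorded in the plan, not executed yet.
            JavaRDD<Integer> squared = sc.parallelize(Arrays.asList(1, 2, 3))
                    .map(n -> {
                        calls.add(1L);
                        return n * n;
                    });
            long before = calls.value();

            // Action: this is the point where Spark actually runs the map function.
            squared.collect();
            long after = calls.value();

            return new long[] {before, after};
        }
    }

    public static void main(String[] args) {
        long[] result = callsBeforeAndAfterAction();
        System.out.println("before action: " + result[0] + ", after action: " + result[1]);
    }
}
```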

How to create a basic program

  1. Create a basic Maven project
  2. Add the Apache Spark dependency to the pom.xml
  3. Now we can use the Spark API
  4. Create a Spark context instance, create the data, and run the operations.
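
The steps above can be sketched as follows. The dependency coordinates are an assumption, not something fixed by this session: group `org.apache.spark`, artifact `spark-core_2.12`, and a 3.x version such as `3.5.1`. The class, method, and app names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Assumes the pom.xml declares org.apache.spark:spark-core_2.12 (version is an assumption).
public class BasicSparkApp {

    static int sumOfEvenNumbers(List<Integer> input) {
        // Step 4a: create the instance (configuration plus Spark context).
        SparkConf conf = new SparkConf()
                .setAppName("basic-spark-app") // hypothetical app name
                .setMaster("local[*]");        // assumption: run locally on all cores
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Step 4b: create the data as an RDD.
            JavaRDD<Integer> data = sc.parallelize(input);

            // Step 4c: run the operations. filter is a transformation, reduce is an action.
            // Note: reduce fails on an empty RDD, so the input needs at least one even number.
            return data.filter(n -> n % 2 == 0)
                       .reduce((a, b) -> a + b);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenNumbers(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```

When this is run directly from an IDE on Java 17 or newer, rather than through `spark-submit`, extra `--add-opens` JVM options may be needed, depending on the Spark version.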