Spark - Scala

Context

spark = SparkSession, the Spark SQL entry point (DataFrames / Datasets)
sc = SparkContext, the RDD entry point
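
Both are pre-created in spark-shell. A minimal sketch of how they are typically used (the file path is a placeholder):

val df = spark.read.json("file:///tmp/people.json")   // SparkSession -> DataFrame
val rdd = sc.parallelize(Seq(1, 2, 3))                 // SparkContext -> RDD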

Basic blocks

  • RDD = resilient distributed dataset = low-level collection of distributed data elements

  • DS = dataset = distributed collection of typed objects, optimized by the Spark SQL engine

  • DF = dataframe = dataset of rows with named columns (a schema); in Scala it is Dataset[Row]
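
A minimal spark-shell sketch of the three side by side (the Person case class is made up for the example):

case class Person(name: String, age: Int)
val people = Seq(Person("Ana", 30), Person("Bob", 25))

import spark.implicits._                  // provides the toDS / toDF encoders
val rddPeople = sc.parallelize(people)    // RDD[Person]
val ds = people.toDS()                    // Dataset[Person], typed
val df = people.toDF()                    // DataFrame = Dataset[Row], named columns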

RDD

Creation

  • from in-memory data

val data = Array(1, 3, 4, 5)
val rddData = sc.parallelize(data)
  • from file

val dataFS = sc.textFile("file:///...")
val dataHDFS = sc.textFile("hdfs:///..")
  • from SQL: run the query through spark and convert the resulting DataFrame with .rdd (see the sketch below)
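
A minimal sketch, assuming a table or view named people has already been registered (the name is hypothetical):

val rowRdd = spark.sql("SELECT name, age FROM people").rdd   // DataFrame -> RDD[Row]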

Operations

Operations on Spark are lazy, and chaining many of them is fine (the more the merrier), because it gives the optimizer more room to tune the DAG before anything runs.
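
A small sketch of that laziness, reusing the rddData built above: the transformations only describe the DAG, and nothing is computed until the action at the end.

val pipeline = rddData.map(_ * 2).filter(_ > 4)   // no execution yet
pipeline.collect()                                // the action triggers the whole DAG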

Transformations

Create a new RDD by applying an operation to an existing one, without triggering an execution.

  • map(f: (A) ⇒ B) = apply the function to each item

rdd.map(_ + 2)
  • flatMap(f: (A) ⇒ TraversableOnce[B]) = apply a function that returns a collection for each item, then flatten the results into a single RDD

rdd.flatMap(x => Seq(x, x + 3))
  • filter(f: (A) ⇒ Boolean) = keep only the items for which the function returns true

rdd.filter(_ % 2 == 0) // keep the even numbers
Actions

Return a value to the driver after the final computation on the data, triggering an execution.
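
A few common actions, sketched on the same rddData:

rddData.collect()        // bring all elements back to the driver
rddData.count()          // number of elements
rddData.take(2)          // first 2 elements
rddData.reduce(_ + _)    // aggregate all elements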

DF
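
A minimal DataFrame sketch, assuming a hypothetical JSON file with name and age fields (in spark-shell the $ column syntax comes from the auto-imported spark.implicits._):

val df = spark.read.json("file:///tmp/people.json")   // path is a placeholder
df.printSchema()
df.select("name").show()
df.filter($"age" > 21).show()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()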

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html