Spark - Scala

Context

spark = SparkSession (the Spark SQL entry point)
sc = SparkContext
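In the spark-shell both are pre-created; in a standalone app the session is built explicitly and the SparkContext obtained from it (a minimal sketch, the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("notes").getOrCreate() // Spark SQL entry point
val sc = spark.sparkContext                                       // underlying SparkContext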

Basic blocks

  • RDD = resilient distributed dataset = low-level distributed collection of data elements

  • DS = dataset = distributed collection of strongly typed objects (the typed Spark SQL API)

  • DF = dataframe = dataset organized into named columns (schema), i.e. Dataset[Row]
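
How the three relate (a minimal sketch; the Person case class and its values are just an illustration):

import spark.implicits._                // needed for toDF / as[...]

case class Person(name: String, age: Int)

val peopleRdd = sc.parallelize(Seq(Person("Ana", 30), Person("Bob", 25)))
val df = peopleRdd.toDF()               // RDD -> DataFrame with named columns
val ds = df.as[Person]                  // DataFrame -> typed Dataset[Person]
val rows = df.rdd                       // back to an RDD[Row]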

RDD

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

Creation

  • declare data

val data = Array(1,3,4,5)
val rddData = sc.parallelize(data)
  • from file

val dataFS = sc.textFile("file:///...")
val dataHDFS = sc.textFile("hdfs:///..")
  • from a DataFrame/Dataset (e.g. the result of a SQL query), via .rdd
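
A sketch, assuming a table named people has already been registered as a view:

val rddFromSql = spark.sql("SELECT name FROM people").rdd   // RDD[Row]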

Operations

Operations in Spark are lazy: transformations only build up the execution DAG, and nothing runs until an action is called. Chaining many of them is fine (the more the merrier), since Spark can then optimize the whole DAG at once.
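
For example, nothing is computed here until count() is called (using rddData from above):

val pipeline = rddData.map(_ + 2).filter(_ % 2 == 0)  // lazy: only builds the DAG
pipeline.count()                                      // action: triggers the actual job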

Transformations

Transformations create a new RDD by applying an operation to an existing one. They don't trigger execution.

  • map(f: (A) ⇒ B) = apply the function to each item

rdd.map(_ + 2)                     // or: rdd.map(x => x + 2)
  • flatMap(f: (A) ⇒ TraversableOnce[B]) = apply a function that returns a sequence for each item, then flatten the results into a single RDD

rdd.flatMap(x => Seq(x, x + 3))    // e.g. lines.flatMap(_.split(" "))
  • filter(f: (A) ⇒ Boolean) = keep only the items for which the predicate returns true

rdd.filter(_ % 2 == 0)             // keep the even numbers
Actions

Actions return a value to the driver after running the computation on the data. They trigger execution.
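
A few common actions, shown on the rddData defined above:

rddData.collect()                  // Array(1, 3, 4, 5) - bring everything to the driver
rddData.count()                    // 4
rddData.first()                    // 1
rddData.take(2)                    // Array(1, 3)
rddData.reduce(_ + _)              // 13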

DF

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html
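
A minimal sketch of the DataFrame API; the people.json file and its name/age columns are the ones used in the guide linked above:

import spark.implicits._

val df = spark.read.json("examples/src/main/resources/people.json")
df.printSchema()                      // inferred schema with named columns
df.select("name").show()
df.filter($"age" > 21).show()
df.createOrReplaceTempView("people")  // make it queryable with SQL
spark.sql("SELECT * FROM people").show()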
