Spark - Scala
Context
Building blocks
RDD = resilient distributed dataset = low-level, fault-tolerant collection of data elements partitioned across the cluster
DS = dataset = distributed collection of data elements with a strongly-typed API, optimized by the Spark SQL engine
DF = dataframe = dataset organized into named columns (i.e. with a schema)
RDD
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
Creation
declare data in the driver and parallelize it (sc.parallelize)
from a file (sc.textFile)
from a SQL query or existing DataFrame, via .rdd (all three sketched below)
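A minimal sketch of the three creation routes (assumes a local SparkSession; the file path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-creation-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// 1. Declare data in the driver and distribute it
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Read a text file, one element per line (hypothetical path)
val fromFile = sc.textFile("data/input.txt")

// 3. Drop from the Dataset/DataFrame API down to an RDD
val fromDataset = spark.range(10).rdd
```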
Operations
Operations in Spark are lazy, and the more you chain, the better: the optimizer can then tune the whole execution DAG at once.
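To make the laziness concrete, a small sketch (assuming a SparkContext `sc`): the chained transformations below only build the lineage graph; nothing runs until the action at the end.

```scala
// Transformations return instantly: they only record the lineage/DAG
val pipeline = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// The action triggers execution of the whole optimized chain at once
val howMany = pipeline.count()
```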
Transformations
Create a new RDD by applying an operation to an existing one. Doesn't trigger execution.
map(f: (A) ⇒ B) = apply the function to each item
flatMap(f: (A) ⇒ TraversableOnce[B]) = apply the function to each item (producing a collection per item, e.g. a Seq or tuple) and flatten the results
filter(f: (A) ⇒ Boolean) = keep only the items for which the predicate returns true (examples below)
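A sketch of the three transformations on a toy RDD (assuming a SparkContext `sc`); the comments show the resulting contents.

```scala
val lines = sc.parallelize(Seq("hello world", "hello spark"))

// map: exactly one output element per input element
val lengths = lines.map(line => line.length)        // 11, 11

// flatMap: each element produces a collection, then everything is flattened
val words = lines.flatMap(line => line.split(" "))  // "hello", "world", "hello", "spark"

// filter: keep only elements matching the predicate
val hellos = words.filter(word => word == "hello")  // "hello", "hello"
```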
Actions
Return a value to the driver after running the computation on the data. Triggers an execution.
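A few common actions sketched on a toy RDD (assuming a SparkContext `sc`; the output path is hypothetical):

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

nums.count()                     // 5: number of elements
nums.collect()                   // Array(1, 2, 3, 4, 5): ships everything to the driver
nums.take(2)                     // Array(1, 2): only the first two elements
nums.reduce(_ + _)               // 15: aggregate with an associative function
nums.saveAsTextFile("out/nums")  // writes partitions to storage (hypothetical path)
```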
DF
https://spark.apache.org/docs/2.2.0/sql-programming-guide.html
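A minimal DataFrame sketch in the spirit of that guide (assuming a SparkSession `spark`; the JSON path is hypothetical):

```scala
import spark.implicits._

// Schema (named columns) is inferred from the JSON
val df = spark.read.json("data/people.json")

df.printSchema()
df.select("name").show()
df.filter($"age" > 21).show()

// Named columns also make the data queryable with plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```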