Introduction
Big data is defined by the 3 V's:
volume (data size): TB/PB of data
velocity (data/event speed): GB/sec, ~1 million events/sec
variety (data formats):
structured: tabular (SQL, CSV, Excel)
semi-structured: JSON, XML, binary formats
unstructured: text, pdf, images, videos, binary blobs
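As a quick illustration, a minimal Python sketch of the three variety classes; the file names (orders.csv, event.json, photo.jpg) are hypothetical placeholders:

```python
import csv
import json

# Structured: tabular rows with a fixed schema (placeholder CSV file)
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["order_id"], row["amount"])

# Semi-structured: self-describing but flexible schema (placeholder JSON file)
with open("event.json") as f:
    event = json.load(f)
print(event.get("user", {}).get("id"))  # fields may or may not be present

# Unstructured: opaque bytes with no schema at all (e.g. an image)
with open("photo.jpg", "rb") as f:
    blob = f.read()
print(len(blob), "bytes")
```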
Storing data
lakehouse
data lake ++ (a data lake with warehouse-like guarantees)
data is transactional (ACID), typically via optimistic concurrency control
uses an open table format such as Apache Iceberg
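A hedged PySpark sketch of what the transactional part looks like in practice, assuming a Spark session configured with the Iceberg runtime and a catalog named demo (both are deployment-specific assumptions):

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with the Iceberg runtime JAR and a catalog
# named "demo" configured; both are deployment-specific assumptions.
spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Iceberg tables give the data lake ACID transactions: the CREATE and
# INSERT below each commit atomically or not at all.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.sales.orders VALUES (1, 9.99, current_timestamp())")

# Readers always see a consistent snapshot of the table.
spark.sql("SELECT count(*) FROM demo.sales.orders").show()
```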
data swamp
raw/all data
data is stored to enable reprocessing, or first-time processing, in the future
data lake
all data
used in analytics & reporting
data is stored in the hope that it can be used by future analytics
cleansed & standardised data
data governance (data catalogue, lineage, security, metadata)
smaller than the data swamp
data warehouse
enterprise-wide data
sub-collection of day-to-day/historical data, used in reporting
smaller than the data lake
data mart
department-wide data
reporting
smaller than the data warehouse
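The shrinking-size relationship between the tiers can be sketched as successive filtering steps; the table, columns, and departments below are all hypothetical:

```python
import pandas as pd

# Data lake: everything, raw (hypothetical example rows)
lake = pd.DataFrame({
    "dept":   ["sales", "sales", "hr", "hr"],
    "amount": [10.0, 20.0, 5.0, 7.5],
    "valid":  [True, False, True, True],
})

# Data warehouse: cleansed, standardised, enterprise-wide subset
warehouse = lake[lake["valid"]].drop(columns="valid")

# Data mart: department-wide slice of the warehouse, used for reporting
sales_mart = warehouse[warehouse["dept"] == "sales"]

# Each tier is smaller than the one above: prints "4 3 1"
print(len(lake), len(warehouse), len(sales_mart))
```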
Processing data
streaming
row/event/micro-batch based processing
mostly for preparing data or for short-term decisions/analytics (e.g. a decision engine)
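A minimal streaming sketch, processing one event at a time with only small running state; the event source and the decision rule are hypothetical stand-ins:

```python
import time

def event_stream():
    """Stand-in for a real broker (e.g. Kafka): yields events as they arrive."""
    for amount in [12.0, 8.5, 950.0, 3.0]:
        yield {"amount": amount, "ts": time.time()}

# Streaming: act on each event (or micro-batch) immediately,
# keeping only small running state, to support fast decisions.
running_total = 0.0
for event in event_stream():
    running_total += event["amount"]
    if event["amount"] > 500:          # hypothetical decision rule
        print("flag for review:", event)
print("running total:", running_total)
```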
batch
processes large chunks of data at once
mostly for long-term analytics (finding connections between data)
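For contrast, a minimal batch sketch that loads the whole chunk up front, which makes cross-record connections (here, per-user totals) easy to compute; the data is hypothetical:

```python
from collections import defaultdict

# Batch: the whole chunk is available up front, so cross-record
# computations (joins, aggregations, correlations) are cheap to express.
events = [
    {"user": "a", "amount": 12.0},
    {"user": "b", "amount": 8.5},
    {"user": "a", "amount": 950.0},
]

totals = defaultdict(float)
for e in events:
    totals[e["user"]] += e["amount"]

# A connection only visible across the full dataset: per-user spend.
for user, total in sorted(totals.items()):
    print(user, total)
```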
lambda
stream + batch in parallel
kappa
stream only
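The two architectures can be contrasted in a few illustrative lines; both "views" below are toy stand-ins for real batch and speed layers:

```python
def stream_view(events):
    """Speed layer: incremental, always-fresh running result."""
    total = 0.0
    for e in events:
        total += e["amount"]
    return {"total": total}

def batch_view(events):
    """Batch layer: periodic full recompute over all stored events."""
    return {"total": sum(e["amount"] for e in events)}

events = [{"amount": 12.0}, {"amount": 8.5}, {"amount": 950.0}]

# Lambda: run both layers in parallel and serve a merged view
# (here they agree; in practice the batch layer corrects the stream).
lambda_result = {**batch_view(events), **stream_view(events)}

# Kappa: keep only the streaming path; "reprocessing" means replaying
# the event log through the same stream code.
kappa_result = stream_view(events)

print(lambda_result, kappa_result)
```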