Introduction

Big data is defined by the 3 V's

volume (data size) = TB/PB of data
velocity (data/event speed) = GB/sec, 1 million events/sec
variety (data formats)
- structured: tabular (SQL, CSV, Excel)
- semi-structured: JSON, XML, binary serialisation formats (e.g. Avro, Protobuf)
- unstructured: text, pdf, images, videos, binary blobs
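A minimal sketch of how the three variety classes differ when read in code, using only the Python standard library. The sample strings and the capitalised-word heuristic are made up for illustration; real unstructured data would need far more than a string split.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape.
csv_text = "id,name\n1,Ada\n2,Linus\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, but fields can be nested or optional.
json_text = '{"id": 3, "name": "Grace", "tags": ["navy", "cobol"]}'
record = json.loads(json_text)

# Unstructured: no schema; any structure has to be extracted by the reader.
# Here a naive heuristic (illustrative only) picks out capitalised words.
free_text = "Grace met Ada and Linus at the conference."
mentions = [w.strip(".") for w in free_text.split() if w.strip(".").istitle()]

print(rows[0]["name"], record["tags"], mentions)
```

The structured and semi-structured inputs can be queried by field name straight after parsing, while the unstructured text only yields fields after an extraction step.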

lakehouse
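- data lake ++ (a data lake with warehouse-style guarantees)
- data is transactionally consistent (ACID, typically via optimistic concurrency control)
- uses an open table format, e.g. Apache Iceberg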
data swamp
- raw/all data
- stored to enable reprocessing, or processing in the future
data lake
- all data
- used in analytics & reporting
- data is stored in the expectation that analytics will process it in the future
- cleansed & standardised data
- data governance (data catalogue, lineage, security, metadata)
- smaller than the data swamp
data warehouse
- enterprise-wide data
- subset of day-to-day and historical data, used in reporting
- smaller than the data lake
data mart
- department-wide data
- reporting
- smaller than the data warehouse
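The narrowing from raw data down to a data mart can be sketched as a chain of filters, each tier a smaller and more curated subset of the one before. This is only an illustration: the event fields, the `cleanse` helper, and the choice of "year 2024" and "sales" as selection rules are all hypothetical.

```python
# Hypothetical raw events, kept as-is so they can be reprocessed later.
raw = [
    {"dept": "sales", "amount": "100", "year": 2023},
    {"dept": "Sales ", "amount": "250", "year": 2024},
    {"dept": "hr", "amount": "n/a", "year": 2024},  # malformed amount
    {"dept": "hr", "amount": "80", "year": 2024},
]


def cleanse(events):
    """Data lake step: drop malformed rows and standardise fields."""
    cleaned = []
    for event in events:
        try:
            amount = float(event["amount"])
        except ValueError:
            continue  # unparseable amount, leave it in the raw tier only
        cleaned.append(
            {
                "dept": event["dept"].strip().lower(),
                "amount": amount,
                "year": event["year"],
            }
        )
    return cleaned


lake = cleanse(raw)
# Warehouse: enterprise-wide subset chosen for reporting (assumed rule: 2024 only).
warehouse = [e for e in lake if e["year"] == 2024]
# Mart: one department's slice of the warehouse.
sales_mart = [e for e in warehouse if e["dept"] == "sales"]

sizes = [len(raw), len(lake), len(warehouse), len(sales_mart)]
print(sizes)
```

Each tier is derived from the previous one, which is why the raw tier has to be retained: a change to `cleanse` can only be applied retroactively if the original events still exist.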

streaming
- row-, event-, or micro-batch-based processing
- mostly for preparing data, or short-term decisions/analytics (decision engine)
batch
- large chunks of data for processing
- mostly for long-term analytics (finding connections between data)
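lambda architecture
- stream + batch in parallel, with results merged when queried
kappa architecture
- stream only; history is reprocessed by replaying the stream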
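The four terms above can be compared on one toy workload, counting events by type. This is a sketch under simplifying assumptions: the event list, `batch_count`, and `StreamCounter` are hypothetical names, and an in-memory list stands in for both the batch store and the event log.

```python
from collections import Counter

events = ["login", "click", "click", "login", "click"]


def batch_count(chunk):
    """Batch: process a large chunk of data in a single pass."""
    return Counter(chunk)


class StreamCounter:
    """Stream: keep running state and update it one event at a time."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event] += 1
        return self.counts[event]  # an up-to-date answer after every event


# Lambda: a batch view over history plus a speed view over recent events,
# merged at read time.
historical, recent = events[:3], events[3:]
batch_view = batch_count(historical)
speed = StreamCounter()
for event in recent:
    speed.on_event(event)
lambda_view = batch_view + speed.counts

# Kappa: a single streaming path; history is handled by replaying the log.
kappa = StreamCounter()
for event in events:
    kappa.on_event(event)
kappa_view = kappa.counts

print(lambda_view == kappa_view)
```

Both architectures arrive at the same counts here. The trade-off is that lambda maintains two code paths (`batch_count` and `StreamCounter`) that must stay consistent, while kappa maintains one but depends on the full event log being replayable.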
