Introduction

Big data is defined by the 3 V's

  • volume (data size) = TB/PB of data

  • velocity (data/event speed) = GB/sec, 1 million events/sec

  • variety (data formats)

    • structured: tabular (SQL, CSV, Excel)

    • semi-structured: JSON, XML, binary formats

    • unstructured: text, pdf, images, videos, binary blobs
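A minimal sketch of two points from the list above: how velocity turns into volume, and how the three varieties differ when read. The 1 KB average event size and the sample records are assumptions made for illustration, not figures from these notes.

```python
import csv
import io
import json
import re

# Velocity feeding volume. The event size is an assumed value.
EVENTS_PER_SEC = 1_000_000
ASSUMED_EVENT_BYTES = 1_000  # hypothetical average event size (1 KB)

bytes_per_sec = EVENTS_PER_SEC * ASSUMED_EVENT_BYTES  # 1 GB/sec in decimal units
tb_per_day = bytes_per_sec * 86_400 / 1e12            # 86,400 seconds per day

# Structured: fixed columns, every row has the same shape.
csv_text = "user_id,amount\n1,9.99\n2,25.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, fields may be nested or missing.
json_text = '{"user_id": 1, "amount": 9.99, "tags": ["promo"]}'
event = json.loads(json_text)

# Unstructured: no schema, so structure has to be extracted (here with a regex).
free_text = "User 1 paid 9.99 for the order."
match = re.search(r"User (\d+) paid ([\d.]+)", free_text)
extracted = {"user_id": int(match.group(1)), "amount": float(match.group(2))}
```

Under the assumed event size, 1 million events/sec is about 1 GB/sec, which accumulates to tens of TB per day; this is why high velocity usually implies high volume.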

Storing data

  • lakehouse

    • data lake ++

    • data is transactionally consistent (writes use optimistic concurrency)

    • uses a table format such as Apache Iceberg

  • data swamp

    • raw/all data

    • data is stored to enable reprocessing, or processing in the future

  • data lake

    • all data

    • used in analytics & reporting

    • data is stored in the hope that analytics can process it in the future

    • cleansed & standardised data

    • data governance (data catalogue, lineage, security, metadata)

    • smaller than the data swamp

  • data warehouse

    • enterprise-wide data

    • subset of day-to-day/historical data, used in reporting

    • smaller than the data lake

  • data mart

    • department-wide data

    • reporting

    • smaller than the data warehouse
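The transactional behaviour noted for the lakehouse can be pictured as an optimistic commit: a writer reads the current table version, prepares its change, and commits only if nobody else committed in the meantime. The sketch below is a toy in-memory model under that assumption. `ToyTable`, `CommitConflict` and `append_with_retry` are hypothetical names, and this is not the API of Iceberg or any other table format.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """In-memory stand-in for a versioned table (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.snapshots = {0: ()}  # version -> immutable tuple of data file names

    def read(self):
        """Return the current version and its list of data files."""
        return self.version, self.snapshots[self.version]

    def commit(self, base_version, new_files):
        """Publish a new snapshot only if base_version is still current."""
        with self._lock:  # the lock guards only the final check-and-swap
            if base_version != self.version:
                raise CommitConflict(
                    f"expected version {base_version}, found {self.version}"
                )
            files = self.snapshots[self.version] + tuple(new_files)
            self.version += 1
            self.snapshots[self.version] = files
            return self.version


def append_with_retry(table, new_files, max_attempts=3):
    """Re-read the latest version and retry when a commit conflicts."""
    for _ in range(max_attempts):
        base_version, _files = table.read()
        try:
            return table.commit(base_version, new_files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
stale_version, _ = table.read()
table.commit(stale_version, ["a.parquet"])       # first writer wins
try:
    table.commit(stale_version, ["b.parquet"])   # second writer used a stale version
    conflict_seen = False
except CommitConflict:
    conflict_seen = True
final_version = append_with_retry(table, ["b.parquet"])
```

Readers always see a complete snapshot, and a losing writer retries instead of blocking others, which is the "optimistic" part.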

Processing data

  • streaming

    • row/event/micro-batch based processing

    • mostly for preparing data, or short-term decisions/analytics (decision engine)

  • batch

    • large chunks of data for processing

    • mostly for long-term analytics (finding connections between data)

  • lambda

    • stream + batch in parallel

  • kappa

    • stream only
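The four processing styles above can be sketched on a tiny event log. This is a simplified illustration, assuming a made-up list of `(timestamp, user, amount)` events and a per-user total as the analytic; `batch_totals`, `StreamTotals`, `lambda_view` and `kappa_view` are hypothetical names.

```python
from collections import Counter

# Hypothetical event log: (timestamp, user, amount)
EVENTS = [(1, "ann", 5), (2, "bob", 3), (3, "ann", 2), (4, "bob", 7), (5, "ann", 1)]


def batch_totals(events):
    """Batch: process a large chunk of data in one pass."""
    totals = Counter()
    for _ts, user, amount in events:
        totals[user] += amount
    return totals


class StreamTotals:
    """Streaming: update state one event at a time."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, event):
        _ts, user, amount = event
        self.totals[user] += amount


def lambda_view(events, batch_cutoff):
    """Lambda: a batch view up to the cutoff plus a stream view after it."""
    batch_view = batch_totals([e for e in events if e[0] <= batch_cutoff])
    speed_layer = StreamTotals()
    for e in events:
        if e[0] > batch_cutoff:
            speed_layer.on_event(e)
    # Merge the two views at query time.
    return batch_view + speed_layer.totals


def kappa_view(events):
    """Kappa: a single stream processor, replaying the full log when needed."""
    stream = StreamTotals()
    for e in events:
        stream.on_event(e)
    return stream.totals
```

Both architectures give the same totals here. The trade-off is that lambda maintains two code paths (batch and stream), while kappa keeps one and relies on replaying the log for reprocessing.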
