Objectives:
- How to build robust, scalable, and self-maintaining pipelines.
- Best practices, like built-in data governance, for ensuring clean and reliable data flows.
- Incremental loading techniques to refresh data quickly and cost-effectively (see the sketch after this list).
- How to build a Data Lake with dlt.
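As a preview of the incremental-loading objective, here is a minimal sketch using dlt's `dlt.sources.incremental` cursor. The endpoint, the `updated_after` query parameter, and the `results` response key are hypothetical stand-ins for a real API.

```python
import dlt
from dlt.sources.helpers import requests  # dlt's requests wrapper with built-in retries

@dlt.resource(primary_key="id", write_disposition="merge")
def tickets(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # updated_at.last_value is the highest cursor seen in previous runs,
    # so each run fetches only new or changed records.
    response = requests.get(
        "https://api.example.com/tickets",                 # hypothetical endpoint
        params={"updated_after": updated_at.last_value},   # hypothetical parameter
    )
    response.raise_for_status()
    yield response.json()["results"]                       # assumed response shape

pipeline = dlt.pipeline(pipeline_name="tickets", destination="duckdb", dataset_name="support")
print(pipeline.run(tickets()))
```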
Where do files come from?
- CSV, Parquet, databases, Google Drive, etc. DEs consolidate this data and make it consistent, normalized, and clean.
- well-structured vs. weakly structured or unstructured data
- The DE's job is to create a data pipeline:
- Collect from → data stores, data streams, applications
- Ingest: depending on your source (e.g. JSON, APIs), you may need to do some work to clean, normalize, and flatten the data.
- Flattening data = transforming nested, hierarchical, or multi-dimensional data structures into a simpler, tabular format where each row represents a single record with all its attributes (see the sketch after this list).
- Store: Lake, warehouse, lakehouse
- Lakes - store massive amounts of data cheaply (usually as files, e.g. Parquet), but read/write is slow
- Warehouses - BigQuery, Snowflake, Redshift…
- Lakehouse - data stored in files + metadata layers to improve search
- Compute: batch processing (chunk by chunk), stream processing (record by record)
- Consume: DS, BI, self-service analytics…
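To make the flattening step concrete, here is a minimal plain-Python sketch with a made-up record: nested dicts become path-style columns, and lists become linked child rows. The double-underscore naming mirrors the convention dlt itself uses when normalizing nested data.

```python
nested = {
    "id": 17,
    "customer": {"name": "Ada", "address": {"city": "London"}},
    "orders": [{"sku": "A-1", "qty": 2}, {"sku": "B-9", "qty": 1}],
}

def flatten(record: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into one flat row with path-style column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}__"))
        elif not isinstance(value, list):
            row[name] = value
    return row

parent_row = flatten(nested)
# -> {'id': 17, 'customer__name': 'Ada', 'customer__address__city': 'London'}

# Lists become child tables: one row per element, linked to the parent id.
child_rows = [{"parent_id": nested["id"], **order} for order in nested["orders"]]
# -> [{'parent_id': 17, 'sku': 'A-1', 'qty': 2}, {'parent_id': 17, 'sku': 'B-9', 'qty': 1}]
```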
- DE north stars
- optimize data storage to be cheap AND reliable (high performance)
- ensure good data quality - address duplicates, inconsistencies, and missing values
- governance - keep data secure, compliant, and well-managed; no sensitive data (e.g. personal data) may be exposed (see the masking sketch after this list)
- adapt the data architecture - pipelines must scale (e.g. what if you go from 10 to 1,000 data sources?)
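On the governance point, one way to keep personal data out of storage is to mask it at extraction time. This is a minimal sketch using dlt's `add_map`, which applies a transform to every record before loading; the `users` resource and its fields are made up.

```python
import dlt

@dlt.resource(table_name="users")
def users():
    # Stand-in source; a real resource would yield records from an API or DB.
    yield {"id": 1, "name": "Ada", "email": "ada@example.com"}

def mask_pii(record):
    # Redact sensitive fields so they never reach the destination.
    if "email" in record:
        record["email"] = "***redacted***"
    return record

pipeline = dlt.pipeline(pipeline_name="governed", destination="duckdb", dataset_name="clean")
pipeline.run(users().add_map(mask_pii))
```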
<aside>
💡
We will focus on the collection, ingestion, and storage parts of the pipeline.
</aside>
ETL - extract, transform (normalize), load
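The three ETL steps map directly onto a dlt pipeline run: you extract (here, a hard-coded list stands in for a real source), dlt normalizes (infers a schema, flattens the nested field), and loads to the destination. This is a minimal sketch using DuckDB as a local destination.

```python
import dlt

# Extract: any iterable of Python dicts works as a source.
data = [
    {"id": 1, "name": "Ada", "address": {"city": "London"}},
    {"id": 2, "name": "Grace", "address": {"city": "Arlington"}},
]

# dlt normalizes (infers a schema, flattens the nested "address") and loads.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",   # local destination, convenient for testing
    dataset_name="demo",
)
load_info = pipeline.run(data, table_name="customers")
print(load_info)
```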
Most data is stored in SQL databases, APIs, and files. We will cover APIs in this workshop.
APIs
- return weakly structured data that is generally complex (nested) in structure
- RESTful APIs: provide records of data from business applications, e.g. a list of customers from a CRM system (see the paginated sketch below)
- File-based APIs: return secure file paths to bulk data (JSON / Parquet) stored in buckets, e.g. downloading a monthly sales report
- Database APIs: connect to a database to get data, typically in JSON format
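A sketch of extracting records from a RESTful API, assuming a hypothetical page-numbered CRM endpoint that returns a JSON list per page (real APIs vary in their pagination scheme):

```python
import dlt
import requests

@dlt.resource(table_name="customers")
def customers():
    # Hypothetical page-numbered endpoint; real pagination schemes vary.
    url = "https://api.example-crm.com/v1/customers"
    page = 1
    while True:
        response = requests.get(url, params={"page": page, "per_page": 100})
        response.raise_for_status()
        records = response.json()  # assume each page returns a JSON list
        if not records:
            break                  # an empty page signals the last page
        yield records              # dlt accepts a list of dicts per yield
        page += 1

pipeline = dlt.pipeline(pipeline_name="crm", destination="duckdb", dataset_name="raw")
pipeline.run(customers())
```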