Intro to Docker
Docker is a platform for developing, shipping, and running applications inside lightweight, portable containers.
Containers are isolated environments that bundle an application with all its dependencies, ensuring it runs consistently across different systems.
- encapsulation: packages apps and dependencies together
- portability: ensures applications run properly across different environments (dev, staging, prod)
- scalability: simplifies deployment and scaling of distributed systems
- efficiency: less overhead than virtual machines (VMs), since containers share the host kernel
Why is this important for data engineering (DE)?
- reproducibility
- local experiments
- integration tests (CI/CD); not covered in this course
- running pipelines in the cloud
- Spark: Docker ensures all dependencies for a Spark data pipeline are present
- serverless (AWS Lambda, Google Cloud Functions)
- batch jobs
Structure
- Source → Data Pipeline (e.g. python script) → Destination (e.g. table in postgres)
- Postgres installed on the local PC will not interfere with Postgres running in a container, as long as the two are not bound to the same host port. The same applies to pgAdmin.
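
The Source → Data Pipeline → Destination structure above can be sketched as a small Python script. This is only an illustration under stated assumptions, not the course's actual pipeline: the CSV data, the `trips` table, and the function names are hypothetical, and the standard-library `sqlite3` module stands in for Postgres so the example runs without a database server. In a real setup the destination would be a Postgres instance, for example one running in a container.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this might be a CSV file or an API response.
SOURCE_CSV = """id,city,amount
1,Berlin,10.5
2,Paris,7.25
3,Berlin,3.0
"""


def extract(text):
    """Source: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Pipeline step: cast string fields to proper numeric types."""
    return [(int(r["id"]), r["city"], float(r["amount"])) for r in rows]


def load(conn, records):
    """Destination: write records into a table (sqlite3 stands in for Postgres)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, text=SOURCE_CSV):
    """Run source -> pipeline -> destination, then return (row count, total amount)."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM trips").fetchone()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
    print(run_pipeline(connection))
    connection.close()
```

Packaging a script like this in a container would pin its Python version and libraries, which is the reproducibility benefit listed earlier.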