TOOLS Learning Roadmap
This roadmap organizes 7 comprehensive guides into a structured learning path for production data engineering tools. Follow this sequence to master the complete stack for Le Wagon bootcamp and beyond.
📍 Quick Navigation
Start here based on your experience:
- 🟢 New to containers: Start with TIER 1
- 🟡 Know Docker, need orchestration: Start with TIER 2
- 🔴 Need production patterns & scaling: Continue to TIER 3
🎯 TIER 1: Containerization Fundamentals (8-10 hours)
Package and run applications consistently everywhere.
| # | Topic | Page | Time | Learn |
|---|---|---|---|---|
| 1 | Docker Fundamentals | Docker-Fundamentals | 3h | Images, containers, volumes, networking |
| 2 | Docker Compose | Docker-Compose | 4h | Multi-container orchestration locally |
| 3 | Git & GitHub | Git-GitHub | 2h | Version control, collaboration, PRs |
Subtotal: ~9 hours
After TIER 1 you can: Package applications in Docker, run multi-container stacks locally, collaborate on code.
🚀 TIER 2: Orchestration & Transformation (10-12 hours)
Automate pipelines and transform data with industry-standard tools.
| # | Topic | Page | Time | Learn |
|---|---|---|---|---|
| 4 | Apache Airflow | Apache-Airflow | 5h | DAGs, operators, scheduling, monitoring |
| 5 | dbt (Data Build Tool) | dbt-Data-Build-Tool | 5h | SQL transformations, testing, lineage |
Subtotal: ~10 hours (cumulative: ~19 hours)
After TIER 2 you can: Orchestrate complex pipelines, test data quality, scale transformations.
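To make this concrete, here is a minimal sketch of what a TIER 2 deliverable looks like: an Airflow DAG with three tasks chained via `>>`. The DAG id, schedule, and task bodies are illustrative placeholders, not code from the guides:

```python
# Minimal DAG sketch: three tasks, one-way dependencies via >>.
# dag_id, schedule, and task logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")


def transform():
    print("clean and reshape the data")


def load():
    print("write results to Postgres")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,      # do not backfill runs before today
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # One-way chain; Airflow refuses to load a DAG containing a cycle.
    t1 >> t2 >> t3
```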
🔥 TIER 3: Scaling & Cloud (8-10 hours)
Process massive datasets and use managed cloud platforms.
| # | Topic | Page | Time | Learn |
|---|---|---|---|---|
| 6 | PySpark Fundamentals | PySpark-Fundamentals | 4h | Distributed computing, RDDs, DataFrames, SQL |
| 7 | Cloud Data Warehouses | Cloud-Data-Warehouses | 4h | Snowflake, BigQuery, Redshift architecture & setup |
Subtotal: ~8 hours (cumulative: ~27 hours)
After TIER 3 you can: Process terabyte-scale data, use cloud platforms, build enterprise-grade pipelines.
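And a first taste of TIER 3: a minimal local PySpark session. The CSV path and column name are placeholder assumptions:

```python
# Minimal local PySpark sketch; file path and column name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .master("local[*]")          # use all local cores
    .appName("roadmap-demo")
    .getOrCreate()
)

# Spark processes the file in partitions instead of loading it into one
# in-memory frame, which is why it copes with GB-scale inputs.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)

daily = df.groupBy("event_date").agg(F.count("*").alias("events"))
daily.show()

spark.stop()
```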
📊 Learning Path by Goal
Goal: “I need bootcamp basics (MVP)”
Follow TIER 1 + TIER 2 (~19 hours)
Time to complete: 2-3 weeks
Bootcamp readiness: 95%
This is the most important path for Le Wagon.
Goal: “I want to be job-ready post-bootcamp”
Follow TIER 1 + TIER 2 + TIER 3 (~27 hours)
Time to complete: 4-5 weeks
Bootcamp readiness: 100%
Job-market readiness: 85%
🗓️ Le Wagon Timeline Alignment
Before Bootcamp (Oct 31, 2026)
| Milestone | Date | What to Know | Pages | Priority |
|---|---|---|---|---|
| TIER 1 Complete | By Oct 8 | Docker, Compose, Git basics | 1-3 | 🔴 CRITICAL |
| TIER 2 Complete | By Oct 25 | Airflow DAGs, dbt models | 4-5 | 🔴 CRITICAL |
| TIER 3 Start | By Oct 31 | PySpark, Cloud intro | 6-7 | 🟡 Nice-to-have |
During Bootcamp (Nov 2026 - March 2027)
Weeks 1-2 (Container & Orchestration):
You already know Docker + Airflow → focus on production patterns
Reference: Docker-Compose, Apache-Airflow
Weeks 3-4 (Data Transformation):
Learn dbt in depth, data quality testing
Reference: dbt-Data-Build-Tool
Weeks 5-6 (Big Data & Cloud):
Introduction to Spark & cloud warehouses
Reference: PySpark-Fundamentals, Cloud-Data-Warehouses
Weeks 7+ (Capstone):
Build end-to-end pipeline with all tools integrated
Reference: All pages (integrated architecture)
📝 Daily Study Workflow (45 min/day)
Weeks 1-2 (TIER 1, Pages 1-3)
25 min: Read page + understand concepts
15 min: Follow along with examples locally (Docker, Git)
5 min: Notes & summary
Weeks 3-4 (TIER 2, Pages 4-5)
20 min: Read advanced concepts
20 min: Write code (DAG / dbt model)
5 min: Run & verify against data
Weeks 5-6 (TIER 3, Pages 6-7)
15 min: Read architecture & concepts
20 min: Set up environment (Spark, cloud credentials)
10 min: Run example query or transformation
🎓 Before You Start: Prerequisites
You should:
- Have Docker installed (`docker --version`)
- Understand Python basics (functions, loops, dicts)
- Have git installed (`git --version`)
- Know basic command line navigation
- Have PostgreSQL or similar DB available locally
If not: Spend 2 hours on Prerequisites before starting TIER 1.
✅ Consolidation Checklist
After TIER 1
- Build a Docker image from a Dockerfile
- Run a container with port mapping and volumes
- Define a docker-compose.yml with 2+ services
- Create a GitHub repo and push code
- Merge a pull request on GitHub
After TIER 2
- Write an Airflow DAG with 3+ tasks
- Schedule an Airflow pipeline to run daily
- Write a dbt model with 3+ tests
- Generate dbt documentation (`dbt docs generate`)
- Build a complete ETL: Airflow → dbt → Postgres
After TIER 3
- Run PySpark locally on 1GB+ dataset
- Query a cloud warehouse (Snowflake/BigQuery)
- Load data into cloud warehouse via Airflow
- Build PySpark transformation in dbt
- Create a project spanning all 7 tools
🚨 Common Pitfalls (Don’t Fall Into These!)
| Pitfall | Why It Matters | Fix |
|---|---|---|
| Not using volumes in Docker | Data lost when container stops | Always mount volumes for persistence |
| Hardcoding credentials | Security risk | Use .env files, never commit secrets |
| Circular DAG dependencies | DAG fails to parse and never runs | Keep dependencies one-way with `>>` / `<<` |
| No dbt tests | Data quality fails in production | Write tests before deployment |
| Large commits | Impossible to review | Commit frequently, small changes |
| Using pandas for GB+ datasets | Memory crashes, slow | Use PySpark or cloud warehouses |
| Not monitoring cloud costs | Surprise bills | Enable cost alerts on Snowflake/BigQuery |
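The hardcoded-credentials pitfall above is cheap to avoid. A minimal Python sketch, assuming the `python-dotenv` package and a `.env` file that is listed in `.gitignore`:

```python
# Read secrets from the environment instead of the source code.
# Requires the python-dotenv package; keep .env out of Git via .gitignore.
import os

from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from a local .env file

DB_PASSWORD = os.environ["DB_PASSWORD"]  # KeyError if unset: fail fast, loudly
DB_URL = f"postgresql://app:{DB_PASSWORD}@localhost:5432/warehouse"
```

docker-compose can pass the same variables via its `env_file` option, so local and containerized runs share one configuration.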
📚 Full Page Directory
| # | Title | Description | Time | Difficulty | When |
|---|---|---|---|---|---|
| 1 | Docker-Fundamentals | Images, containers, volumes, networking | 3h | 🟢 Beginner | Pre-bootcamp |
| 2 | Docker-Compose | Multi-container apps, services, networks | 4h | 🟡 Intermediate | Pre-bootcamp |
| 3 | Git-GitHub | Version control, branches, pull requests | 2h | 🟢 Beginner | Pre-bootcamp |
| 4 | Apache-Airflow | DAGs, operators, scheduling, monitoring | 5h | 🟡 Intermediate | Pre-bootcamp |
| 5 | dbt-Data-Build-Tool | Models, tests, documentation, lineage | 5h | 🟡 Intermediate | Pre-bootcamp |
| 6 | PySpark-Fundamentals | RDDs, DataFrames, SQL, distributed computing | 4h | 🟡 Intermediate | Week 5+ (bootcamp) |
| 7 | Cloud-Data-Warehouses | Snowflake, BigQuery, Redshift, architecture | 4h | 🟡 Intermediate | Week 5+ (bootcamp) |
Total: ~27 hours
Recommended pace: 1.5 hours/day over 4-5 weeks
🔗 Integration with Your Knowledge Base
These TOOLS pages connect with your existing knowledge:
Python Fundamentals (1_PYTHON)
↓
SQL Basics (2_SQL)
↓
Docker (package code + DB)
↓
Docker Compose (local multi-container stack)
↓
Git (version control everything)
↓
Airflow (orchestrate workflows)
↓
dbt (transform in warehouse)
↓
PySpark (parallel processing)
↓
Cloud Warehouse (production platform)
🎯 Your First Project (After TIER 1)
Build a containerized pipeline:
- Create Dockerfile for Python ETL script
- Add docker-compose.yml with Postgres + app
- Push to GitHub with clear commits
- Run locally: `docker-compose up`
Time: 3-4 hours
Deliverable: GitHub repo with working stack
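The ETL script at the heart of this project can stay tiny. A sketch assuming the `requests` and `psycopg2` packages; the API URL, table schema, and `DATABASE_URL` are placeholders you would swap for your own:

```python
# Tiny ETL sketch for the containerized project.
# API_URL, the table schema, and DATABASE_URL are placeholders.
import os

import psycopg2
import requests

API_URL = "https://example.com/api/items"
DSN = os.environ["DATABASE_URL"]  # injected by docker-compose, not hardcoded


def run():
    rows = requests.get(API_URL, timeout=30).json()
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS items (id INT PRIMARY KEY, name TEXT)"
        )
        for row in rows:
            cur.execute(
                "INSERT INTO items (id, name) VALUES (%s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (row["id"], row["name"]),
            )


if __name__ == "__main__":
    run()
```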
🎉 Your Second Project (After TIER 2)
Build complete ETL pipeline:
- Extract: Airflow task pulls from API
- Load: Lands in Postgres staging
- Transform: dbt models clean and aggregate
- Test: dbt tests validate quality
- Orchestrate: Airflow schedules daily runs
- Version control: Everything in Git
Time: 8-10 hours
Deliverable: End-to-end pipeline, scheduled, tested
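One common way to wire the orchestration step is an Airflow DAG that shells out to dbt. A sketch, assuming dbt is installed in the Airflow environment and that the paths shown (all placeholders) point at your project:

```python
# Sketch: Airflow orchestrating dbt through shell commands.
# Paths and dag_id are placeholders; assumes dbt is on the worker's PATH.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="python /opt/pipeline/extract_load.py",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt_project && dbt test",
    )

    # Tests only run after the models build successfully.
    extract_load >> dbt_run >> dbt_test
```

Keeping `dbt test` as its own task means a data-quality failure shows up red in the Airflow UI instead of being buried in logs.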
🌟 Your Capstone (After TIER 3)
Production-ready data pipeline:
- Extract: Airflow pulls from multiple APIs
- Load: Lands in cloud warehouse (Snowflake/BigQuery)
- Transform: dbt + PySpark for complex computations
- Test: Comprehensive data quality checks
- Scale: Process 1GB+ dataset efficiently
- Monitor: Airflow tracking + cost monitoring
- Deploy: CI/CD via GitHub + dbt Cloud
Time: 20+ hours
Deliverable: Job-ready portfolio project
📞 Getting Help
When stuck:
- Check the page’s “Tips & Gotchas” section
- Run diagnostics:
  - Docker: `docker ps`, `docker logs`
  - Git: `git log`, `git status`
  - Airflow: `airflow dags list`, check task logs
  - dbt: `dbt debug`, `dbt run --select model_name`
  - Spark: check the Spark UI at `localhost:4040`
- Search the error message (tool errors are usually descriptive)
- Check official docs:
- Docker: docs.docker.com
- Airflow: airflow.apache.org
- dbt: docs.getdbt.com
- PySpark: spark.apache.org/docs/latest/api/python
- Cloud DW: cloud.google.com/bigquery/docs (or Snowflake/Redshift equivalent)
🏆 Final Goals
By the end of this roadmap, you’ll be able to:
✅ Package Python code in Docker
✅ Compose multi-service stacks locally
✅ Orchestrate complex pipelines with Airflow
✅ Transform data with dbt & testing
✅ Collaborate professionally with Git
✅ Scale to terabyte datasets with PySpark
✅ Deploy on cloud warehouses (Snowflake/BigQuery)
✅ Build production-ready ETL pipelines
✅ Follow Le Wagon bootcamp from Day 1 with confidence
✅ Land jobs with modern data engineering skills
🚀 Next Steps
- Install prerequisites:
  ```bash
  docker --version
  docker-compose --version
  git --version
  python --version
  ```
- Start TIER 1: Open Docker-Fundamentals
- Build as you learn: Every page has hands-on examples
- Push to GitHub: Version control your learning
- Track progress: Check off the Consolidation Checklist after each TIER
Ready to start? → Begin with Docker-Fundamentals ✨
Already know Docker? → Jump to Apache-Airflow
Want to scale? → See PySpark-Fundamentals and Cloud-Data-Warehouses
Questions? Review the “Related” section at the bottom of each page for cross-references.
This roadmap is designed for data engineers preparing for Le Wagon bootcamp and job market entry. Total time investment: ~27 hours to master production-grade modern data engineering tools.
Last updated: Jan 22, 2026