graph TD
    CAP["CAP Theorem"]
    PACELC["PACELC Theorem"]
    SCAL["Scalability"]
    FT["Fault Tolerance"]
    REPL["Replication & Sharding"]
    CAP --> PACELC
    CAP --> REPL
    SCAL --> REPL
    SCAL --> FT

    BATCH["Batch Processing"]
    STREAM["Stream Processing"]
    LAMBDA["Lambda Architecture"]
    KAPPA["Kappa Architecture"]
    EDR["Event Driven Architecture"]
    BATCH --> LAMBDA
    STREAM --> LAMBDA
    STREAM --> KAPPA
    STREAM --> EDR

    HDFS["HDFS"]
    OBJ["Object Storage"]
    PART["Data Partitioning"]
    LOCAL["Data Locality"]
    PART --> HDFS
    PART --> OBJ
    LOCAL --> HDFS

    MAPR["MapReduce"]
    SCHED["Task Scheduling"]
    PARA["Data vs Task Parallelism"]
    CLUSTER["Cluster Computing"]
    CLUSTER --> MAPR
    CLUSTER --> SCHED
    LOCAL --> MAPR
    PARA --> MAPR

    QUEUE["Message Queues vs Streaming"]
    KAFKA["Kafka Architecture"]
    DELIV["Delivery Semantics"]
    CONS["Consumer Groups"]
    QUEUE --> KAFKA
    KAFKA --> CONS
    KAFKA --> DELIV

    EVTUAL["Eventual Consistency"]
    STRONG["Strong Consistency"]
    IDEMP["Idempotency"]
    RETRY["Retries & Backoff"]
    DLQ["Dead Letter Queue"]
    CAP --> EVTUAL
    CAP --> STRONG
    RETRY --> DLQ
    IDEMP --> RETRY
    DELIV --> IDEMP

    LOAD["Load Balancing"]
    BACK["Backpressure"]
    CB["Circuit Breaker"]
    SD["Service Discovery"]
    LEADER["Leader Election"]
    SCAL --> LOAD
    BACK --> KAFKA
    FT --> CB
    CLUSTER --> SD
    CLUSTER --> LEADER

    LAT["Latency vs Throughput"]
    SKEW["Data Skew"]
    CACHE["Caching"]
    PART --> SKEW
    CACHE --> LAT
    LAT --> MAPR

    ETL["ETL vs ELT"]
    LAKE["Data Lake vs Warehouse"]
    LAKEHOUSE["Lakehouse Architecture"]
    SCHEMA["Schema Evolution"]
    OBJ --> LAKE
    LAKE --> LAKEHOUSE
    ETL --> BATCH
    SCHEMA --> LAKEHOUSE

Core Fundamentals
CAP Theorem (Consistency, Availability, Partition Tolerance)
PACELC Theorem (extension of CAP for latency vs consistency tradeoff)
Horizontal vs Vertical Scaling
Replication & Sharding
Fault Tolerance
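
As a rough illustration of the Replication & Sharding item above, the sketch below routes a key to a primary shard with a stable hash and places replicas on the next shards in ring order. The function names `shard_for` and `replicas_for` and the ring-order placement rule are assumptions made for this example, not a description of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    here to keep the mapping identical across machines and restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_shards: int, replication_factor: int) -> list:
    """Return the primary shard followed by its replica shards.

    Assumed placement rule: replicas live on the next shards in ring order.
    """
    if replication_factor > num_shards:
        raise ValueError("replication_factor cannot exceed num_shards")
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replication_factor)]
```

Simple modulo sharding like this remaps most keys when `num_shards` changes, which is one reason production systems often use consistent hashing or fixed virtual partitions instead.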
Data Processing Models
Batch Processing vs Stream Processing
Lambda Architecture
Kappa Architecture
Event-Driven Architecture
Distributed Storage Systems
Distributed File Systems (HDFS)
Object Storage (S3, GCS, ADLS)
Data Partitioning Strategies
Data Locality Principle
Distributed Computing Concepts
MapReduce Model
Distributed Task Scheduling
Data Parallelism vs Task Parallelism
Cluster Computing
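
The MapReduce Model listed above can be sketched on a single machine as three phases: map, shuffle (group by key), and reduce. The word-count job and the helper names below are illustrative assumptions. A real framework would run the map and reduce phases in parallel across a cluster and move data over the network during the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

Each call to `map_phase` depends only on its own document, which is what makes the map step an example of data parallelism.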
Messaging & Streaming Systems
Message Queues vs Streaming Systems
Kafka Architecture (brokers, partitions, offsets)
Exactly-once vs At-least-once vs At-most-once delivery
Consumer Groups & Offset Management
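
To make the delivery-semantics and offset-management items above more concrete, here is a minimal in-memory sketch of at-least-once consumption: the offset is committed only after a record has been processed, so a crash during processing leads to redelivery. The `Consumer` class and the list standing in for a partition are hypothetical; this is not the Kafka client API.

```python
class Consumer:
    """Toy consumer reading one partition, modelled as a Python list."""

    def __init__(self, log):
        self.log = log
        self.committed_offset = 0  # next offset to read after a restart

    def poll(self, handler):
        """Process records from the committed offset onward.

        Commit happens after the handler returns, so a handler failure
        leaves the offset unchanged and the record is delivered again on
        the next poll (at-least-once). Committing before calling the
        handler would give at-most-once instead.
        """
        while self.committed_offset < len(self.log):
            handler(self.log[self.committed_offset])
            self.committed_offset += 1
```

Because a record may be seen more than once under this scheme, the handler generally needs to be idempotent.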
Consistency & Reliability
Eventual Consistency
Strong Consistency
Idempotency in Distributed Systems
Retries and Backoff Strategies
Dead Letter Queues (DLQ)
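
The idempotency, retry, and dead letter queue items above tend to appear together, so one possible combination is sketched below. The function names, the delay parameters, and the assumption that each message is a dict carrying an `"id"` field are all made up for this example. The backoff is exponential with full jitter, and a plain list stands in for the DLQ.

```python
import random
import time


def process_with_retry(message, handler, dead_letters, attempts=3,
                       base_delay=0.05, max_delay=1.0, sleep=time.sleep):
    """Call handler(message), retrying with exponential backoff and jitter.

    After the final failed attempt the message is appended to
    dead_letters (the stand-in DLQ) and None is returned.
    """
    for attempt in range(attempts):
        try:
            return handler(message)
        except Exception:
            if attempt < attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, delay))  # full jitter
    dead_letters.append(message)
    return None


def make_idempotent(handler):
    """Wrap a handler so each message id is applied at most once.

    Assumes messages are dicts with an "id" key; results are cached
    in memory, which a real system would replace with durable storage.
    """
    results = {}

    def wrapper(message):
        key = message["id"]
        if key not in results:
            results[key] = handler(message)
        return results[key]

    return wrapper
```

Wrapping the handler with `make_idempotent` before passing it to `process_with_retry` means a retried or redelivered message does not apply its side effect twice once it has succeeded.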
System Reliability & Design
Load Balancing
Backpressure Handling
Circuit Breaker Pattern
Service Discovery
Leader Election
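
For the Circuit Breaker Pattern listed above, a minimal sketch might look like the following. The class name, the thresholds, and the injectable `clock` parameter are assumptions chosen for this example. The breaker opens after a number of consecutive failures, rejects calls while open, and lets one trial call through once the reset timeout has elapsed (the half-open state).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so fall through to a trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from piling more requests onto a dependency that is already struggling.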
Performance & Optimization
Partitioning & Bucketing
Caching in Distributed Systems
Data Skew Handling
Throughput vs Latency Tradeoffs
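
One common way to handle the data skew item above is key salting: a hot key is split into several salted sub-keys so its records spread across partitions, and the partial results are merged in a second stage. The sketch below applies this to a count aggregation. The `#` separator, the function names, and the salt count are assumptions for illustration, and the separator must not occur in the salt suffix.

```python
import random
from collections import Counter


def salted_key(key, num_salts, rng):
    """Append a random salt so one hot key maps to several sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def two_stage_count(keys, num_salts=4, seed=0):
    """Count keys in two stages to spread a hot key's load.

    Stage 1 aggregates per salted key (this is the part that would be
    distributed). Stage 2 strips the salt and merges the partial counts.
    """
    rng = random.Random(seed)
    partial = Counter(salted_key(k, num_salts, rng) for k in keys)
    final = Counter()
    for salted, count in partial.items():
        original = salted.rsplit("#", 1)[0]
        final[original] += count
    return partial, final
```

Salting trades an extra aggregation stage for a more even distribution of work, so it is usually applied only to keys known to be hot.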
Data Engineering Specific Concepts
ETL vs ELT in Distributed Systems
Data Lake vs Data Warehouse Architecture
Lakehouse Architecture (Delta Lake / Iceberg / Hudi)
Schema Evolution in Distributed Pipelines
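
For the schema evolution item above, the sketch below shows one backward-compatible reading rule: fields added in a newer schema carry defaults so old records still load, and fields the reader does not know are ignored. The `REQUIRED` sentinel, the `read_record` helper, and the example schema are hypothetical. Formats such as Avro and table formats such as Iceberg define their own, more detailed resolution rules.

```python
REQUIRED = object()  # sentinel marking a field that has no default

# Hypothetical reader schema: "country" was added later, with a default.
READER_SCHEMA_V2 = {
    "user_id": REQUIRED,
    "country": "unknown",
}


def read_record(raw, reader_schema):
    """Project a raw record onto the reader's schema.

    Missing optional fields are filled with their defaults, missing
    required fields raise an error, and unknown fields are dropped.
    """
    out = {}
    for field, default in reader_schema.items():
        if field in raw:
            out[field] = raw[field]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {field}")
        else:
            out[field] = default
    return out
```

Under this rule, adding a field with a default is a safe change, while adding a required field breaks readers of older data.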