Distributed systems are systems whose components, located on different networked computers, communicate and coordinate their actions by passing messages. They appear as a single system to end users, but under the hood they’re a network of autonomous machines.
Characteristics
| Characteristic | Description |
|---|---|
| Concurrency | Multiple components operate at the same time. |
| Lack of global clock | No shared time; each machine has its own clock (see the logical-clock sketch below the table). |
| Independent failures | Components may crash independently. |
| Message passing | Communication happens over the network, which is unreliable and slow. |
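Because there is no global clock, ordering events across machines relies on logical time rather than wall-clock time. Below is a minimal Go sketch of a Lamport clock; the type and method names are illustrative rather than taken from any library, and it is not goroutine-safe.

```go
package main

import "fmt"

// LamportClock is a logical clock: it orders events without a shared wall clock.
// Note: not goroutine-safe; a real implementation would guard it with a mutex or atomics.
type LamportClock struct {
	time uint64
}

// Tick advances the clock for a local event and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.time++
	return c.time
}

// Receive merges the timestamp carried by an incoming message, so the local
// clock always moves past anything the sender had already observed.
func (c *LamportClock) Receive(remote uint64) uint64 {
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	a, b := &LamportClock{}, &LamportClock{}
	sendTS := a.Tick()          // node A performs an event and sends a message
	recvTS := b.Receive(sendTS) // node B receives it and orders itself after A
	fmt.Println(sendTS, recvTS) // prints: 1 2
}
```

If one event happens-before another, its Lamport timestamp is smaller; the converse does not hold, which is what vector clocks are for.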
Why build distributed systems?
- Scalability: Handle large volumes of data/traffic.
- Fault Tolerance: No single point of failure.
- Geographical distribution: Serve global users with low latency.
- Resource sharing: Compute/storage resources can be shared across nodes.
Key Components
- Nodes: Independent machines (physical or virtual).
- Network: Communication layer (TCP, HTTP, gRPC, etc.).
- Middleware: Handles messaging, coordination, serialization (e.g., Kafka, gRPC, Thrift).
- Coordination services: for locks, leader election, and shared configuration (e.g., Zookeeper, etcd).
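As a toy illustration of nodes, the network layer, and message passing working together, here is a sketch of two "nodes" in a single process exchanging a message over HTTP using only Go's standard library; the /ping route and port 8080 are made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	// "Node B": exposes a trivial endpoint over the network layer (HTTP on TCP).
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Fprintf(w, "pong: %s", body)
	})
	go http.ListenAndServe(":8080", nil) // hypothetical port; errors ignored in this sketch

	time.Sleep(100 * time.Millisecond) // crude wait for the listener to start

	// "Node A": passes a message to node B and reads the reply.
	resp, err := http.Post("http://localhost:8080/ping", "text/plain", strings.NewReader("hello"))
	if err != nil {
		fmt.Println("network call failed:", err) // the network is allowed to fail
		return
	}
	defer resp.Body.Close()
	reply, _ := io.ReadAll(resp.Body)
	fmt.Println(string(reply)) // prints: pong: hello
}
```

In a real system the two sides run on different machines, and middleware such as gRPC or Kafka takes the place of the hand-rolled HTTP calls.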
Challenges
| Problem | Explanation |
|---|---|
| Network Failures | Packets get lost, duplicated, or delayed. |
| Partial Failures | One part can fail while others continue. |
| Latency | Varies and is unpredictable. |
| Data Consistency | Keeping replicas in sync is hard. |
| Clock Synchronization | Clocks can drift. NTP helps, but isn’t perfect. |
| Security | Communication must be encrypted and authenticated. |
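Because latency is unpredictable and nodes fail partway through requests, every remote call should carry an explicit deadline. A minimal sketch using Go's context package, where callRemote is a hypothetical stand-in for any network call:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callRemote stands in for any network call; here it simply takes too long,
// simulating a slow or partially failed replica.
func callRemote(ctx context.Context) (string, error) {
	select {
	case <-time.After(2 * time.Second):
		return "result", nil
	case <-ctx.Done():
		return "", ctx.Err() // give up once the deadline passes
	}
}

func main() {
	// Never wait forever: bound every remote call with a deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if _, err := callRemote(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("remote call timed out; treat the node as slow or failed")
	}
}
```

Timeouts pair naturally with the retry-with-jitter pattern under Design Principles below.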
Theoretical Foundations
1. CAP Theorem
You can only guarantee two of the following three properties at the same time:
- Consistency
- Availability
- Partition Tolerance
Most systems pick AP or CP, because partition tolerance is non-negotiable in real-world distributed systems.
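One concrete way CP-leaning systems trade availability for consistency is quorum reads and writes: with N replicas, choose a read quorum R and a write quorum W so that R + W > N, which guarantees every read overlaps the most recent successful write. A tiny sketch of that arithmetic (the replica counts are just examples):

```go
package main

import "fmt"

// overlaps reports whether a read quorum is guaranteed to intersect the latest
// write quorum: any R readers and W writers must share at least one replica,
// which holds exactly when R + W > N.
func overlaps(n, r, w int) bool {
	return r+w > n
}

func main() {
	fmt.Println(overlaps(3, 2, 2)) // true: classic N=3, R=2, W=2 (CP-leaning)
	fmt.Println(overlaps(3, 1, 1)) // false: fast and available, but reads may be stale (AP-leaning)
}
```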
2. FLP Impossibility
No deterministic consensus algorithm can guarantee termination in an asynchronous system if even a single process may crash.
Design Principles
| Principle | Examples |
|---|---|
| Idempotency | Retry-safe operations (e.g., PUT, DELETE). |
| Statelessness | Makes scaling and failure handling easier. |
| Eventual Consistency | Accepting temporary inconsistency in favor of availability. |
| Backpressure | Signal upstream producers to slow down so the system degrades gracefully under load. |
| Retry with Jitter | Avoid retry storms. |
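A minimal sketch of retry with exponential backoff and full jitter, assuming the operation being retried is idempotent; the attempt limit and base delay are arbitrary:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with exponential backoff plus full jitter.
// It assumes op is idempotent, so repeating it after a failure is safe.
func retryWithJitter(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, base * 2^attempt) so that
		// many clients retrying at once do not hit the service in lockstep.
		backoff := base << attempt
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return errors.New("all attempts failed")
}

func main() {
	calls := 0
	err := retryWithJitter(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure") // e.g. a timed-out network call
		}
		return nil
	}, 5)
	fmt.Println(err, "after", calls, "calls") // typically: <nil> after 3 calls
}
```

Randomizing the whole backoff window is what prevents retry storms: without jitter, clients that failed together retry together.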
Common Patterns
- Leader Election – For coordination (e.g., Raft, Paxos).
- Sharding – For horizontal scaling of data (see the sketch after this list).
- Replication – For availability and fault tolerance.
- Consensus Protocols – To agree on state (e.g., Raft, Paxos).
- Distributed Transactions – 2PC (blocking), Saga (non-blocking, eventual).
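Sharding in its simplest form routes each key to a node by hashing it. The sketch below uses plain hash-modulo; production systems usually prefer consistent hashing or range sharding so that adding or removing a node does not remap most keys. The node names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of the nodes by hashing it and taking the result
// modulo the node count. Simple, but resizing the cluster remaps most keys,
// which is why consistent hashing or range sharding is preferred in practice.
func shardFor(key string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"} // hypothetical shard owners
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		fmt.Println(key, "->", shardFor(key, nodes))
	}
}
```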
Real-world examples
| System Type | Example |
|---|---|
| Distributed DB | Cassandra, MongoDB, CockroachDB |
| Message Queue | Kafka, RabbitMQ |
| File Systems | HDFS, Ceph |
| Coordination | Zookeeper, etcd |
| Container Orchestration | Kubernetes |
Tools and Tech
| Category | Tools |
|---|---|
| Communication | gRPC, REST, Thrift |
| Coordination | etcd, Zookeeper |
| Orchestration | Kubernetes, Nomad |
| Monitoring | Prometheus, Grafana, OpenTelemetry |
| Tracing | Jaeger, Zipkin |
| Logging | ELK, Loki |
When to Use a Distributed System
Use when:
- One machine can’t handle your compute/storage needs.
- You need high availability and fault tolerance.
- You’re dealing with geo-distributed workloads.
Avoid if:
- A monolith works well for your current scale (added complexity ≠ better).
- You don’t have the ops maturity or infra budget.