Distributed systems are systems in which components located on different networked computers communicate and coordinate their actions by passing messages. They appear as a single system to end users, but under the hood, they’re a network of autonomous machines.

Characteristics

| Characteristic | Description |
| --- | --- |
| Concurrency | Multiple components operate at the same time. |
| Lack of global clock | No shared time; each machine has its own clock. |
| Independent failures | Components may crash independently. |
| Message passing | Communication happens via the network, which is unreliable and slow. |

Why build distributed systems?

  • Scalability: Handle large volumes of data/traffic.
  • Fault Tolerance: No single point of failure.
  • Geographical distribution: Serve global users with low latency.
  • Resource sharing: Compute/storage resources can be shared across nodes.

Key Components

  • Nodes: Independent machines (physical or virtual).
  • Network: Communication layer (TCP, HTTP, gRPC, etc.).
  • Middleware: Handles messaging, coordination, serialization (e.g., Kafka, gRPC, Thrift).
  • Coordination services: e.g., Zookeeper or etcd.
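
Everything above the middleware layer ultimately reduces to message passing over the network. As a minimal sketch, here are two "nodes" in a single Go process exchanging a message over TCP; the loopback address, port, and newline-delimited framing are illustrative assumptions, not a protocol anyone ships.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Node A: accept one connection and acknowledge what it receives.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		msg, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Fprintf(conn, "ack: %s", msg)
	}()

	// Node B: connect, send a message, and wait for the reply.
	conn, err := net.Dial("tcp", "127.0.0.1:9000")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Fprintln(conn, "hello from node B")
	reply, _ := bufio.NewReader(conn).ReadString('\n')
	fmt.Print(reply) // "ack: hello from node B"
}
```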

Challenges

| Problem | Explanation |
| --- | --- |
| Network Failures | Packets get lost, duplicated, or delayed. |
| Partial Failures | One part can fail while others continue. |
| Latency | Varies and is unpredictable. |
| Data Consistency | Keeping replicas in sync is hard. |
| Clock Synchronization | Clocks can drift. NTP helps, but isn’t perfect. |
| Security | Communication must be encrypted and authenticated. |
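
Latency and partial failures are usually tamed the same way: never wait on a remote call without a deadline. Below is a minimal Go sketch using context timeouts; `fetchFromReplica` is a hypothetical stand-in for any RPC or database call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchFromReplica is a hypothetical stand-in for any remote call.
// Here it simply takes longer than the caller is willing to wait.
func fetchFromReplica(ctx context.Context) (string, error) {
	select {
	case <-time.After(500 * time.Millisecond): // simulated slow network
		return "value", nil
	case <-ctx.Done():
		return "", ctx.Err() // caller gave up first
	}
}

func main() {
	// Bound the call: a slow replica must not stall the whole request.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	if _, err := fetchFromReplica(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out; fail over to another replica or serve a cached value")
	}
}
```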

Theoretical Foundations

1. CAP Theorem

You can only guarantee 2 out of these 3 properties at once:

  • Consistency: every read sees the most recent write.
  • Availability: every request receives a response.
  • Partition tolerance: the system keeps operating despite network partitions.

Since network partitions are unavoidable in practice, the real trade-off during a partition is between consistency and availability.

2. FLP Impossibility

No deterministic consensus algorithm can guarantee progress in an asynchronous system with even one faulty node.

Design Principles

| Principle | Examples |
| --- | --- |
| Idempotency | Retry-safe operations (e.g., PUT, DELETE). |
| Statelessness | Makes scaling and failure handling easier. |
| Eventual Consistency | Accepting temporary inconsistency in favor of availability. |
| Backpressure | Signal producers to slow down so the system degrades gracefully under load. |
| Retry with Jitter | Spread retries randomly to avoid retry storms. |
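
The first and last principles work together: retries are only safe when the operation is idempotent, and jitter keeps many clients from retrying in lockstep against a recovering service. Here is a minimal sketch of capped exponential backoff with full jitter; the failing operation is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with capped exponential backoff and full jitter:
// each delay is drawn uniformly from [0, min(maxDelay, base*2^attempt)].
// Only safe when op is idempotent (e.g., a PUT or DELETE).
func retryWithJitter(op func() error, attempts int, base, maxDelay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << i
		if backoff > maxDelay {
			backoff = maxDelay
		}
		time.Sleep(time.Duration(rand.Int63n(int64(backoff)))) // full jitter
	}
	return err
}

func main() {
	calls := 0
	err := retryWithJitter(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure") // placeholder for a real RPC
		}
		return nil
	}, 5, 100*time.Millisecond, 2*time.Second)
	fmt.Println("calls:", calls, "err:", err)
}
```

Full jitter randomizes the entire delay rather than adding a small offset, which spreads a burst of retrying clients much more evenly across time.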

Common Patterns

  • Leader Election – For coordination (e.g., Raft, Paxos).
  • Sharding – For horizontal scaling of data (see the consistent-hashing sketch after this list).
  • Replication – For availability and fault tolerance.
  • Consensus Protocols – To agree on state (e.g., Raft, Paxos).
  • Distributed Transactions – 2PC (blocking), Saga (non-blocking, eventual).
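
A common way to implement the sharding pattern is consistent hashing, which moves only a small fraction of keys when nodes join or leave. Below is a minimal Go sketch of a hash ring; real implementations (e.g., Cassandra's) add virtual nodes and replication, which are omitted here.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each node owns the arc of hash
// space up to its position. No virtual nodes or replication, for brevity.
type Ring struct {
	hashes []uint32          // sorted node positions on the ring
	nodes  map[uint32]string // hash -> node name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes ...string) *Ring {
	r := &Ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		h := hashOf(n)
		r.hashes = append(r.hashes, h)
		r.nodes[h] = n
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// NodeFor returns the first node clockwise from the key's hash.
func (r *Ring) NodeFor(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	ring := NewRing("node-a", "node-b", "node-c")
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		fmt.Println(key, "->", ring.NodeFor(key))
	}
}
```

With plain modulo hashing (hash(key) % N), removing one node would remap almost every key; the ring limits the reshuffle to the departing node's arc.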

Real-World Examples

| System Type | Example |
| --- | --- |
| Distributed DB | Cassandra, MongoDB, CockroachDB |
| Message Queue | Kafka, RabbitMQ |
| File Systems | HDFS, Ceph |
| Coordination | Zookeeper, etcd |
| Container Orchestration | Kubernetes |

Tools and Tech

| Category | Tools |
| --- | --- |
| Communication | gRPC, REST, Thrift |
| Coordination | etcd, Zookeeper |
| Orchestration | Kubernetes, Nomad |
| Monitoring | Prometheus, Grafana, OpenTelemetry |
| Tracing | Jaeger, Zipkin |
| Logging | ELK, Loki |

When to Use a Distributed System

Use when:

  • One machine can’t handle your compute/storage needs.
  • You need high availability and fault tolerance.
  • You’re dealing with geo-distributed workloads.

Avoid if:

  • A monolith works well for your current scale (added complexity ≠ better).
  • You don’t have the ops maturity or infra budget.