Failover

Failover is the switching of traffic or workloads from a failed system component to a healthy one, usually automatically, to keep availability high and downtime minimal.

Why It Matters

In distributed systems and microservices, failures are inevitable:

Hardware crashes
Network issues
Container crashes
Software bugs

Failover ensures your service stays available, even when something goes wrong.

Types of Failover

Type	Example
Active-Passive	One server handles traffic; a backup waits on standby (e.g., a primary/replica database)
Active-Active	Multiple servers handle traffic; if one fails, others take over (e.g., load-balanced web servers)
Manual Failover	Requires human intervention
Automatic Failover	Happens without human input (e.g., health check fails, switch to replica)
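
To make the difference between the first two rows concrete, here is a minimal sketch of how a router might pick a target in each mode. The function names, node names, and the `healthy` set are illustrative assumptions, not part of any particular tool.

```python
from typing import Optional, Sequence, Set


def route_active_passive(nodes: Sequence[str], healthy: Set[str]) -> Optional[str]:
    """Return the first healthy node in priority order (primary first)."""
    for node in nodes:
        if node in healthy:
            return node
    return None  # nothing is healthy: there is no one to fail over to


def route_active_active(
    nodes: Sequence[str], healthy: Set[str], request_id: int
) -> Optional[str]:
    """Spread requests across every healthy node; survivors absorb the load."""
    candidates = [n for n in nodes if n in healthy]
    if not candidates:
        return None
    return candidates[request_id % len(candidates)]
```

In the active-passive case the standby receives traffic only once the primary leaves the `healthy` set; in the active-active case a failed node simply drops out of the rotation.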

Example Scenarios

1. Database Failover

If db-primary fails, switch to db-replica
Tools: MySQL replication, Patroni (PostgreSQL), Amazon Aurora and RDS Multi-AZ automatic failover
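
When more than one replica exists, a common approach is to promote the healthy replica that has applied the most of the primary's log, since that loses the fewest writes. The sketch below assumes a hypothetical `Replica` record with a `replayed_position` field; real tools track this with their own mechanisms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool
    replayed_position: int  # hypothetical: last log position this replica applied


def choose_promotion_candidate(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the healthy replica that is furthest along in the replication log."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda r: r.replayed_position)


def writes_at_risk(primary_position: int, candidate: Replica) -> int:
    """Log entries the primary had written that the candidate never received."""
    return max(0, primary_position - candidate.replayed_position)
```

For example, if the primary stopped at position 1000 and the best replica has replayed up to 990, promoting it puts 10 log entries at risk.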

2. Load Balancer Failover

If a backend server dies, LB routes traffic to healthy instances
Tools: Nginx, HAProxy, AWS ALB/ELB, Istio
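
As a rough illustration of what the load balancer does internally, the sketch below rotates through backends and skips any that have failed several checks in a row. The class name, the `max_failures` threshold, and the backend names are assumptions made for the example; production load balancers have their own configuration for this.

```python
from typing import Dict, List, Optional


class FailoverBalancer:
    """Round-robin balancer that skips backends with repeated failures."""

    def __init__(self, backends: List[str], max_failures: int = 3) -> None:
        self.backends = list(backends)
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {b: 0 for b in self.backends}
        self._next = 0

    def report(self, backend: str, ok: bool) -> None:
        """Record a health-check result; one success resets the failure count."""
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def is_healthy(self, backend: str) -> bool:
        return self.failures[backend] < self.max_failures

    def pick(self) -> Optional[str]:
        """Return the next healthy backend, or None if all are down."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if self.is_healthy(backend):
                return backend
        return None
```

Requiring several consecutive failures before ejecting a backend trades a slightly slower failover for fewer false alarms on a single dropped check.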

3. DNS Failover

Use DNS health checks to switch traffic to backup IPs
Tools: Route 53, Cloudflare Load Balancer
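
DNS failover is only as fast as failure detection plus record caching. The back-of-the-envelope estimate below assumes resolvers honor the record TTL and ignores check timeouts; the function and parameter names are illustrative.

```python
def worst_case_dns_failover_seconds(
    check_interval_s: float, failure_threshold: int, record_ttl_s: float
) -> float:
    """Rough upper bound on how long clients may keep hitting a dead IP."""
    # The failure can happen just after a passing check, so detection takes up
    # to `failure_threshold` full intervals.
    detection = check_interval_s * failure_threshold
    # After the record changes, cached answers can live for one more TTL.
    return detection + record_ttl_s
```

With a 30-second check interval, a threshold of 3 failures, and a 60-second TTL, the estimate is 30 × 3 + 60 = 150 seconds. Dropping to 10-second checks and a 30-second TTL brings it down to 60 seconds, at the cost of more health-check and DNS query traffic.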

4. Kubernetes Failover

If a pod dies:

A new pod is spun up by the Deployment
Traffic reroutes automatically via the Service layer
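
The two steps above can be pictured as a control loop plus a routing filter. The following is a toy model in plain Python, not the Kubernetes API: the phase strings (`"Ready"`, `"Starting"`, `"Failed"`) and the pod names are simplifications invented for the example.

```python
from typing import Dict, List


def reconcile(pods: Dict[str, str], desired_replicas: int) -> Dict[str, str]:
    """One pass of a toy controller: drop failed pods, start replacements."""
    alive = {name: phase for name, phase in pods.items() if phase != "Failed"}
    counter = 0
    while len(alive) < desired_replicas:
        name = f"pod-new-{counter}"
        counter += 1
        if name not in alive:
            alive[name] = "Starting"  # not yet eligible for traffic
    return alive


def service_endpoints(pods: Dict[str, str]) -> List[str]:
    """Only pods that report ready receive traffic."""
    return sorted(name for name, phase in pods.items() if phase == "Ready")
```

Until the replacement pod becomes ready, the remaining ready pods carry all the traffic, which is why running more than one replica matters.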

Key Components

Component	Role
Health checks	Detect failures (e.g., `/healthz` endpoint)
Load balancer	Routes traffic to healthy nodes
Replication	Keeps backup instances in sync
State externalization	Keeps services stateless so any instance can take over cleanly
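
A health check can be as small as one HTTP endpoint that returns 200 when the service can do its job and 503 when it cannot. The sketch below uses only the Python standard library; `dependencies_ok` is a hypothetical placeholder for whatever checks a real service needs, and port 8080 is an arbitrary choice.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def dependencies_ok() -> bool:
    """Hypothetical placeholder: check database connections, queues, etc."""
    return True


def health_response(healthy: bool) -> Tuple[int, bytes]:
    """Map a health verdict to an HTTP status code and a JSON body."""
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "unavailable"}).encode()
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(dependencies_ok())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the endpoint; a load balancer or orchestrator would then poll `/healthz` and stop routing to the instance when it sees a non-200 response.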

Considerations

Challenge	How to Handle
Split-brain scenarios	Use leader election or quorum-based consensus (e.g., Raft, ZooKeeper)
Data loss on failover	Use synchronous or semi-synchronous replication for critical data, and confirm writes are durable before acknowledging them
Failover delay	Tune health checks and TTLs
Stateful apps	Use StatefulSets, external storage, or sticky sessions
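
The quorum rule behind the split-brain row is simple: a group of nodes may elect a leader only if it can reach a strict majority of the cluster. A minimal sketch, with an illustrative function name:

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority.

    `reachable_members` counts the node itself plus every peer it can reach.
    """
    return reachable_members > cluster_size // 2
```

If a 5-node cluster splits 3/2, only the 3-node side has quorum, so at most one leader can exist. A 4-node cluster that splits 2/2 has no quorum on either side, which is why odd cluster sizes are usually preferred.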

Mental Model

Think of failover like backup generators in a hospital. If the main power dies, the generators turn on automatically so that critical services don’t stop.

Checklist for Reliable Failover

Health checks are fast and accurate
Replicas are kept in sync, and replication lag is monitored
Traffic rerouting is automatic (DNS, LB, proxy)
Failback (returning to primary) is well-handled
Chaos testing validates failover works under stress

Gaurav’s Notes