Failover is the automatic switching of traffic or workloads from a failed system component to a healthy one, ensuring high availability and minimal downtime.
Why It Matters
In distributed systems and microservices, failures are inevitable:
- Hardware crashes
- Network issues
- Container crashes
- Software bugs
Failover ensures your service stays available, even when something goes wrong.
Types of Failover
| Type | Example |
|---|---|
| Active-Passive | One server handles traffic; a backup waits idle (e.g., a primary/replica database) |
| Active-Active | Multiple servers handle traffic; if one fails, others take over (e.g., load-balanced web servers) |
| Manual Failover | Requires human intervention |
| Automatic Failover | Happens without human input (e.g., health check fails, switch to replica) |
Example Scenarios
1. Database Failover
- If `db-primary` fails, switch to `db-replica`
- Tools: MySQL replication, Patroni (PostgreSQL), Amazon RDS/Aurora automatic failover
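
The routing half of this decision can be sketched in a few lines. This is a minimal illustration, not how any of the tools above work internally: `pick_database` and the `status` dictionary are hypothetical names, and a real failover also has to promote the replica so it accepts writes.

```python
from typing import Callable, List


def pick_database(endpoints: List[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint, in priority order (primary first)."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy database endpoint available")


# Hypothetical health state: the primary is down, the replica is up.
status = {"db-primary": False, "db-replica": True}
target = pick_database(["db-primary", "db-replica"], lambda name: status[name])
print(target)  # db-replica
```

Listing the primary first means traffic returns to it as soon as it reports healthy again, which may or may not be the failback behavior you want.
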
2. Load Balancer Failover
- If a backend server dies, LB routes traffic to healthy instances
- Tools: Nginx, HAProxy, AWS ALB/ELB, Istio
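
As a rough sketch of the idea, the toy balancer below rotates through backends and skips any that are marked unhealthy. The class name `RoundRobinBalancer` and the backend names are made up for illustration; real load balancers combine this with active health probes and connection draining.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class RoundRobinBalancer:
    def __init__(self, backends: List[str]) -> None:
        self.backends = backends
        self.healthy: Dict[str, bool] = {b: True for b in backends}
        self._ring: Iterator[str] = cycle(backends)

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the latest health status for a backend."""
        self.healthy[backend] = healthy

    def next_backend(self) -> str:
        # Look at each backend at most once per call, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark("web-2", False)  # simulate a dead backend
picks = [lb.next_backend() for _ in range(4)]
print(picks)  # ['web-1', 'web-3', 'web-1', 'web-3']
```
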
3. DNS Failover
- Use DNS health checks to switch traffic to backup IPs
- Tools: Route 53, Cloudflare Load Balancer
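
DNS failover is only as fast as failure detection plus cache expiry, because resolvers and clients may keep serving the old record until its TTL runs out. The helper below is a back-of-the-envelope estimate under simplifying assumptions: the function name and the numbers are illustrative, and it ignores probe timeouts and resolvers that do not honor TTLs.

```python
def worst_case_dns_failover_seconds(
    check_interval: float, failure_threshold: int, record_ttl: float
) -> float:
    """Rough estimate: time to detect the failure plus time for cached records to expire."""
    detection = check_interval * failure_threshold
    return detection + record_ttl


# Illustrative numbers only: 10 s between checks, 3 failures to trip, 60 s TTL.
print(worst_case_dns_failover_seconds(10, 3, 60))  # 90
```
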
4. Kubernetes Failover
If a pod dies:
- The Deployment (through its ReplicaSet) creates a replacement pod
- The Service sends traffic only to ready pods, so requests shift away from the dead pod automatically
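
The two behaviors above can be modeled with a toy reconcile loop. This is a simplified model in plain Python, not the Kubernetes API: `reconcile`, `service_endpoints`, and the pod names are hypothetical stand-ins for what the controllers do inside a cluster.

```python
from typing import Dict, List


def reconcile(desired_replicas: int, pods: Dict[str, bool], next_id: int) -> Dict[str, bool]:
    """Drop dead pods and create replacements until the desired count is met."""
    alive = {name: True for name, running in pods.items() if running}
    while len(alive) < desired_replicas:
        alive[f"pod-{next_id}"] = True
        next_id += 1
    return alive


def service_endpoints(pods: Dict[str, bool]) -> List[str]:
    """Only running pods receive traffic."""
    return sorted(name for name, running in pods.items() if running)


pods = {"pod-1": True, "pod-2": False, "pod-3": True}  # pod-2 has died
print(service_endpoints(pods))  # ['pod-1', 'pod-3']
pods = reconcile(3, pods, next_id=4)
print(service_endpoints(pods))  # ['pod-1', 'pod-3', 'pod-4']
```
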
Key Components
| Component | Role |
|---|---|
| Health checks | Detect failures (e.g., /healthz endpoint) |
| Load balancer | Routes traffic to healthy nodes |
| Replication | Keeps backup instances in sync |
| State externalization | Keeps services stateless so they can fail over cleanly |
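
A health check usually should not trigger failover on a single missed probe. The sketch below marks a target unhealthy only after several consecutive failures. `HealthTracker` and its `failure_threshold` parameter are illustrative names, and the probe here is a stub rather than a real HTTP call to `/healthz`.

```python
from typing import Callable


class HealthTracker:
    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3) -> None:
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True while the target is still considered healthy."""
        if self.probe():
            self.consecutive_failures = 0  # any success resets the counter
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.failure_threshold


# Stubbed probe results: one success followed by three failures.
results = iter([True, False, False, False])
tracker = HealthTracker(probe=lambda: next(results), failure_threshold=3)
states = [tracker.check() for _ in range(4)]
print(states)  # [True, True, True, False]
```

A lower threshold detects failures sooner but makes false positives from transient network blips more likely.
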
Considerations
| Challenge | How to Handle |
|---|---|
| Split-brain scenarios | Use leader election or quorum-based consensus (e.g., Raft, ZooKeeper) |
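| Data loss on failover | Use synchronous or semi-synchronous replication for critical data, and monitor replication lag |
| Failover delay | Tune health check intervals and failure thresholds, and keep DNS TTLs short |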
| Stateful apps | Use StatefulSets, external storage, or sticky sessions |
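
The quorum rule behind split-brain protection is simple arithmetic: a side may elect a leader only if it holds a strict majority of the cluster. The function below is a minimal sketch of that rule alone, with a hypothetical name; consensus systems such as Raft layer terms, logs, and timeouts on top of it.

```python
def has_quorum(votes_received: int, cluster_size: int) -> bool:
    """A strict majority is required, so two partitions can never both win."""
    return votes_received > cluster_size // 2


# A 5-node cluster split 3/2: only the majority side may elect a leader.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
# An even split of 4 nodes leaves neither side with quorum.
print(has_quorum(2, 4))  # False
```

This is why clusters are commonly sized with an odd number of nodes: an even split leaves no side able to act.
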
Mental Model
Think of failover like backup generators in a hospital. If the main power dies, the generators turn on automatically so that critical services don’t stop.
Checklist for Reliable Failover
- Health checks are fast and accurate
- Replicas stay in sync with the primary, and replication lag is monitored
- Traffic rerouting is automatic (DNS, LB, proxy)
- Failback (returning to the primary) is planned and tested
- Chaos testing validates failover works under stress