Failover is the switching of traffic or workloads from a failed system component to a healthy one, usually automatically, to maintain high availability and minimize downtime.

Why It Matters

In distributed systems and microservices, failures are inevitable:

  • Hardware crashes
  • Network issues
  • Container crashes
  • Software bugs

Failover ensures your service stays available, even when something goes wrong.

Types of Failover

Type | Example
Active-Passive | One server handles traffic; a backup waits idle (e.g., primary-replica database)
Active-Active | Multiple servers handle traffic; if one fails, the others take over (e.g., load-balanced web servers)
Manual Failover | Requires human intervention
Automatic Failover | Happens without human input (e.g., a health check fails and traffic switches to a replica)
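
The difference between active-passive and active-active selection can be sketched in a few lines of Python. This is only an illustration: the Node class and both function names are hypothetical, and health is a plain flag rather than the result of a real probe.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def pick_active_passive(primary, standby):
    """Active-passive: the standby is used only when the primary is unhealthy."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy node available")


def pick_active_active(nodes, request_id):
    """Active-active: spread requests across every node that is currently healthy."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node available")
    return healthy[request_id % len(healthy)]
```

In the active-passive case the standby receives no traffic until the primary is marked unhealthy; in the active-active case a failed node simply drops out of the rotation.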

Example Scenarios

1. Database Failover

  • If db-primary fails, promote db-replica and redirect clients to it
  • Tools: MySQL replication, Patroni (PostgreSQL), Amazon RDS Multi-AZ and Aurora automatic failover
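
From the client's point of view, the switch can look like trying endpoints in priority order. The sketch below assumes a hypothetical connect callable standing in for a real database driver; fake_connect simulates db-primary being down. It covers only the client-side fallback, not promotion of the replica, which the tools above are responsible for.

```python
def connect_with_failover(endpoints, connect):
    """Try endpoints in priority order; return (endpoint, connection) for the first that works."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except ConnectionError as exc:
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


def fake_connect(endpoint):
    # Hypothetical stand-in for a driver call; pretends db-primary is unreachable.
    if endpoint == "db-primary":
        raise ConnectionError("connection refused")
    return f"connection to {endpoint}"


endpoint, conn = connect_with_failover(["db-primary", "db-replica"], fake_connect)
print(endpoint)  # db-replica
```

A replica that has not been promoted is typically read-only, so in practice this pattern is safest for reads unless a failover manager has already promoted the replica.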

2. Load Balancer Failover

  • If a backend server dies, the load balancer routes traffic to the remaining healthy instances
  • Tools: Nginx, HAProxy, AWS Elastic Load Balancing (ALB, NLB), Istio
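
A minimal sketch of that behavior, assuming a simple round-robin policy: the balancer walks its backend list and skips anything marked down. The class and the backend names are illustrative and do not reflect how any of the listed tools is implemented.

```python
import itertools


class RoundRobinBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
print([lb.next_backend() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```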

3. DNS Failover

  • Use DNS health checks to switch traffic to backup IPs
  • Tools: Route 53, Cloudflare Load Balancer
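
The decision a DNS failover policy makes can be approximated as follows: answer with the primary addresses while any of them passes its health check, otherwise answer with the backups. The function name is hypothetical, the addresses come from the reserved documentation ranges, and is_healthy is an assumed callback that a real service would back with its own probes.

```python
def dns_answer(primary_ips, backup_ips, is_healthy):
    """Return the IPs to serve: healthy primaries if any exist, otherwise healthy backups."""
    healthy_primary = [ip for ip in primary_ips if is_healthy(ip)]
    if healthy_primary:
        return healthy_primary
    return [ip for ip in backup_ips if is_healthy(ip)]


down = {"192.0.2.10"}
print(dns_answer(["192.0.2.10"], ["198.51.100.20"], lambda ip: ip not in down))
# ['198.51.100.20']
```

Resolvers and clients cache answers for the record's TTL, so the switch only takes full effect once cached answers expire.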

4. Kubernetes Failover

If a pod dies:

  • The Deployment (through its ReplicaSet) starts a replacement pod
  • The Service sends traffic only to pods that are ready, so requests shift to healthy pods automatically
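
The two steps can be modeled with a toy reconcile loop. This is not the Kubernetes API: pods are plain dictionaries, and reconcile and service_endpoints are hypothetical names used only to show the idea of restoring the desired replica count and routing to ready pods.

```python
def reconcile(pods, desired_replicas, next_id):
    """Drop dead pods and add replacements until the desired count is reached."""
    alive = [p for p in pods if p["alive"]]
    while len(alive) < desired_replicas:
        # A replacement starts out not ready; it gets traffic only after it becomes ready.
        alive.append({"name": f"web-{next_id}", "alive": True, "ready": False})
        next_id += 1
    return alive


def service_endpoints(pods):
    """Only pods that are alive and ready receive traffic."""
    return [p["name"] for p in pods if p["alive"] and p["ready"]]


pods = [
    {"name": "web-1", "alive": True, "ready": True},
    {"name": "web-2", "alive": False, "ready": False},  # the pod that died
]
pods = reconcile(pods, desired_replicas=2, next_id=3)
print(service_endpoints(pods))  # ['web-1']
pods[1]["ready"] = True         # replacement passes its readiness check
print(service_endpoints(pods))  # ['web-1', 'web-3']
```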

Key Components

Component | Role
Health checks | Detect failures (e.g., a /healthz endpoint)
Load balancer | Routes traffic to healthy nodes
Replication | Keeps backup instances in sync
State externalization | Keeps state outside the service so instances can fail over cleanly
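
Health checks usually avoid reacting to a single failed probe. One common approach, similar in spirit to HAProxy's rise and fall settings, is to require several consecutive failures before marking a node down and several consecutive successes before marking it up again. The class below is a sketch of that idea; its name and default thresholds are assumptions.

```python
class HealthTracker:
    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self._fails = 0
        self._passes = 0

    def record(self, probe_ok):
        """Record one probe result and return the node's current health state."""
        if probe_ok:
            self._fails = 0
            self._passes += 1
            if not self.healthy and self._passes >= self.rise_threshold:
                self.healthy = True
        else:
            self._passes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy


tracker = HealthTracker()
probes = [False, False, True, False, False, False, True, True]
print([tracker.record(ok) for ok in probes])
# [True, True, True, True, True, False, False, True]
```

Two isolated failures do not trigger failover here; only the run of three consecutive failures does, and the node returns to service after two consecutive successes.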

Considerations

Challenge | How to Handle
Split-brain scenarios | Use leader election or quorum-based consensus (e.g., Raft, ZooKeeper)
Data loss on failover | Ensure proper replication and durability
Failover delay | Tune health checks and TTLs
Stateful apps | Use StatefulSets, external storage, or sticky sessions
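
Two of these rows reduce to simple arithmetic. The sketch below shows a strict-majority quorum check, which is what prevents both halves of a partitioned cluster from electing a leader, and a rough estimate of failover delay as detection time plus DNS TTL. Both function names are hypothetical, and the delay formula is a simplification that ignores probe timeouts and promotion time.

```python
def has_quorum(votes, cluster_size):
    """A partition may act as leader only with a strict majority of the cluster."""
    return votes > cluster_size // 2


def estimated_failover_delay(check_interval_s, fail_threshold, dns_ttl_s=0):
    """Rough upper bound: time to detect the failure plus time for cached DNS to expire."""
    return check_interval_s * fail_threshold + dns_ttl_s


# A 4-node cluster split 2/2: neither side has a majority, so neither can take over.
print(has_quorum(2, 4))  # False
# A 5-node cluster split 3/2: only the 3-node side may elect a leader.
print(has_quorum(3, 5))  # True
# 10 s checks, 3 failures required, 60 s TTL.
print(estimated_failover_delay(10, 3, 60))  # 90
```

Shorter intervals and TTLs reduce the delay but increase probe and DNS traffic, and overly aggressive thresholds raise the risk of failing over on a transient blip.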

Mental Model

Think of failover like backup generators in a hospital. If the main power dies, the generators turn on automatically, so critical services don’t stop.

Checklist for Reliable Failover

  • Health checks are fast and accurate
  • Replicas are kept in sync, and replication lag is monitored
  • Traffic rerouting is automatic (DNS, LB, proxy)
  • Failback (returning to the primary) is planned and tested
  • Chaos testing validates failover works under stress