✅ 1. Understand the Goals
Requirement | Description |
---|---|
⏱️ High availability | Keep the application running with minimal disruption |
💾 Data integrity | Avoid data loss (use synchronous replication where zero loss is required) |
🔄 Automatic switch | Detect failure and switch to standby quickly |
🔁 Failback | Optionally return to the original primary after recovery |
🧱 2. Choose the Right Failover Setup
🧭 Replication Strategy
Type | Primary Use | Notes |
---|---|---|
Synchronous | Strong consistency | May increase write latency |
Asynchronous | Better performance | Writes not yet replicated can be lost on failover |
Semi-sync | Middle ground | Commit is acknowledged once at least one replica has received the write |
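To make the latency trade-off in the table concrete, here is a toy model of when the client sees its commit acknowledged. It is only a sketch: the function name and the numbers are made up for illustration, and "sync" is assumed to mean waiting for every replica, which is the strictest possible setting rather than what any particular database does by default.

```python
from typing import Sequence


def commit_latency_ms(mode: str, local_ms: float, replica_ack_ms: Sequence[float]) -> float:
    """Toy model: milliseconds until the client sees its commit acknowledged."""
    if mode == "async":
        # The primary acknowledges as soon as its own write is durable.
        return local_ms
    if not replica_ack_ms:
        raise ValueError("sync and semi-sync need at least one replica")
    if mode == "semi-sync":
        # Wait only for the fastest replica to confirm it received the write.
        return local_ms + min(replica_ack_ms)
    if mode == "sync":
        # Assumption: wait for every replica, so the slowest one sets the pace.
        return local_ms + max(replica_ack_ms)
    raise ValueError(f"unknown mode: {mode}")
```

With a 2 ms local write and replicas that acknowledge in 5 ms and 20 ms, this model gives 2 ms for async, 7 ms for semi-sync, and 22 ms for sync.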
🛠️ 3. Implementing Failover – Key Components
🔍 A. Health Checks & Monitoring
- Regularly check if the primary is reachable.
- Tools: Keepalived, Patroni, Consul, pg_auto_failover, custom scripts
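For the "custom scripts" option, a health check might look roughly like the sketch below. It assumes a plain TCP connect is a good enough liveness signal (a real check would usually also run a query), and the `FailureDetector` class and its threshold are hypothetical names chosen for this example. Requiring several consecutive failures helps avoid failing over on a single dropped packet.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureDetector:
    """Report a node as down only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True if the node should be treated as down."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A monitoring loop would call `detector.record(is_reachable(host, 5432))` every few seconds and start failover only when it returns `True`.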
🧠 B. Failover Manager / Orchestrator
Automatically promotes a replica if the primary fails.
DB | Tool | Function |
---|---|---|
PostgreSQL | Patroni, pg_auto_failover | Automatic leader election |
MySQL | MHA, Orchestrator | Monitor & promote |
MongoDB | Built-in | Replica sets handle this |
Cloud DBs | Amazon RDS, Google Cloud SQL | Managed failover |
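The tools above each have their own election rules, but the core promotion decision can be sketched in a few lines. This is an illustrative model only, not how any of these tools is implemented: the `Replica` fields and the `max_lag_bytes` cutoff are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    lag_bytes: int  # how far behind the primary's last known write position


def pick_promotion_candidate(
    replicas: Iterable[Replica], max_lag_bytes: int
) -> Optional[Replica]:
    """Pick the healthy replica with the least lag, or None if none qualifies."""
    eligible = [r for r in replicas if r.healthy and r.lag_bytes <= max_lag_bytes]
    if not eligible:
        # Promoting a badly lagging replica would lose data; refuse instead.
        return None
    # Tie-break on name so the choice is deterministic.
    return min(eligible, key=lambda r: (r.lag_bytes, r.name))
```

Returning `None` when every replica is too far behind is a deliberate choice here: it leaves the decision to an operator rather than silently accepting data loss.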
📦 C. Virtual IP or Proxy Layer
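- Gives clients a single stable endpoint, so they don't need to know which DB node is active
- Use a proxy such as HAProxy, pgpool-II, or ProxySQL, or a virtual IP managed by a tool like Keepalived
- The proxy or virtual IP reroutes traffic to the active node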
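The routing idea behind such a proxy can be sketched as follows. `PrimaryRouter` and the `is_primary` callback are hypothetical; a real proxy does this with its own health checks and connection handling.

```python
from typing import Callable, Optional, Sequence


class PrimaryRouter:
    """Resolve which backend is currently the active (primary) node."""

    def __init__(self, backends: Sequence[str], is_primary: Callable[[str], bool]) -> None:
        self.backends = list(backends)
        self.is_primary = is_primary  # assumed role check supplied by the caller
        self._cached: Optional[str] = None

    def active_backend(self) -> Optional[str]:
        # Fast path: the node we used last time still reports itself as primary.
        if self._cached is not None and self.is_primary(self._cached):
            return self._cached
        # Slow path: re-scan all backends after a failover.
        self._cached = next((b for b in self.backends if self.is_primary(b)), None)
        return self._cached
```

Clients only ever ask the router for `active_backend()`, so a failover changes the answer without changing any client configuration.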
💽 D. Application Logic
- Use retry logic and connection pooling
- Use failover-aware drivers (e.g., JDBC with multiple hosts in the connection URL)
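Retry logic on the application side could look something like this sketch. The function name, defaults, and the choice of exponential backoff with jitter are assumptions for illustration; which exceptions count as retryable depends on the driver in use.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on connection-type errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many clients from reconnecting at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("attempts must be at least 1")
```

Only wrap operations that are safe to repeat. A write that may have committed just before the connection dropped should not be blindly retried.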
💡 4. Example: PostgreSQL + Patroni + etcd + HAProxy
[App] ⇄ [HAProxy]
⇄ Primary (Postgres Node 1)
⇄ Replica (Postgres Node 2)
[Patroni + etcd cluster]
⇨ Monitors nodes and triggers failover; HAProxy follows the new primary, typically via health checks against Patroni's REST API
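One way to see how traffic finds the primary in this setup: Patroni exposes a REST endpoint on each node that answers 200 only on the current leader. The sketch below asks each node in turn. It assumes the `/primary` endpoint (older Patroni versions used a different path) and the default REST port 8008; the node URLs in the usage note are placeholders.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def is_patroni_primary(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the Patroni node at base_url reports itself as primary."""
    try:
        with urllib.request.urlopen(f"{base_url}/primary", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers raise HTTPError; unreachable nodes raise URLError/OSError.
        return False


def find_primary(
    base_urls: Iterable[str],
    probe: Callable[[str], bool] = is_patroni_primary,
) -> Optional[str]:
    """Return the first node whose probe says it is primary, or None."""
    for url in base_urls:
        if probe(url):
            return url
    return None
```

For example, `find_primary(["http://node1:8008", "http://node2:8008"])` would return the URL of whichever node currently holds the leader role.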
🔁 5. Manual Failover (Fallback Plan)
If automation fails:
- Promote a replica manually:
  pg_ctl promote -D /var/lib/postgresql/data
- Update connection strings or proxy config.
- Restart apps if needed.
- Resync the old primary as a new replica (e.g., with pg_rewind or a fresh base backup) before it rejoins.
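The promote step can be wrapped in a small runbook helper, sketched below. It assumes `pg_ctl` and `psql` are on the PATH, that it runs as the PostgreSQL OS user, and that `psql` picks up its connection settings from the environment. The helper names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def promote_command(data_dir: str) -> List[str]:
    """Command that promotes the replica whose data directory is data_dir."""
    return ["pg_ctl", "promote", "-D", data_dir]


def recovery_check_command() -> List[str]:
    # -A (unaligned) and -t (tuples only) reduce the output to just "t" or "f".
    return ["psql", "-At", "-c", "SELECT pg_is_in_recovery()"]


def is_promoted(psql_output: str) -> bool:
    """A promoted node is no longer in recovery, so the query returns 'f'."""
    return psql_output.strip() == "f"


def promote(data_dir: str, dry_run: bool = True) -> bool:
    """Promote the replica and verify it left recovery; dry_run only prints."""
    commands = [promote_command(data_dir), recovery_check_command()]
    if dry_run:
        for cmd in commands:
            print(shlex.join(cmd))
        return False
    subprocess.run(commands[0], check=True)
    result = subprocess.run(commands[1], check=True, capture_output=True, text=True)
    return is_promoted(result.stdout)
```

Calling `promote("/var/lib/postgresql/data")` prints the two commands for review. Pass `dry_run=False` only once the old primary is confirmed down, so two nodes never accept writes at the same time.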
📋 6. Test Regularly
- Simulate failure (kill primary node, cut network).
- Measure failover time and check data consistency.
- Have runbooks for manual failover.
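Failover time can be measured with a simple polling loop during a drill, as in this sketch. The `probe` callable is an assumption: it stands for whatever check your application uses to confirm it can write again, and the function name and defaults are illustrative.

```python
import time
from typing import Callable, Optional


def measure_downtime(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Seconds from the call until `probe` first succeeds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None  # the service never came back within the allowed window
        sleep(interval)
```

Start the measurement right after killing the primary. The result is accurate only to within one `interval`, so use a short interval if you are comparing against a tight recovery-time target.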
🧠 Summary
Database failover is a critical part of high availability.
It requires replication, health checks, automatic promotion, and application resilience. The goal is to recover fast with minimal data loss.