In read-replica or sharded setups, the “migration problem” isn’t writing SQL — it’s orchestration + safety: where you run it, in what order, how you handle heterogeneity, and how you keep the app compatible during the rollout.
Read replicas
Core rule
Run schema migrations on the primary only. Replicas get DDL via replication (physical or logical, depending on DB). Your job is to ensure the migration is replication-safe and doesn’t break reads while it’s propagating.
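One cheap safety net for the "primary only" rule is to have the migration runner refuse to execute against a standby. A minimal sketch, assuming a Postgres-style driver: pg_is_in_recovery() is a real Postgres function (true on a standby); the connection object and DDL are illustrative stand-ins.

```python
# Sketch: refuse to run DDL against a replica (Postgres flavor).
# pg_is_in_recovery() returns true on a standby node.

class NotPrimaryError(RuntimeError):
    pass

def assert_primary(conn) -> None:
    """Raise unless `conn` points at the primary."""
    cur = conn.cursor()
    cur.execute("SELECT pg_is_in_recovery()")
    (in_recovery,) = cur.fetchone()
    if in_recovery:
        raise NotPrimaryError("refusing to run DDL: this node is a replica")

def run_migration(conn, ddl: str) -> None:
    assert_primary(conn)   # fail fast if someone pointed us at a replica
    cur = conn.cursor()
    cur.execute(ddl)       # replicas receive the change via replication
```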
Practical strategy
- Expand → Deploy → Backfill → Switch reads → Contract
- Expand: add nullable columns / new tables / additive indexes
- Deploy: code that can handle both old+new schema
- Backfill: do heavy writes in batches (throttle)
- Switch reads: if needed, start reading new fields once backfill is complete and replicas caught up
- Contract: drop old columns/indexes later
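The expand → backfill → contract phases above ship as separate migration files in separate releases. A minimal sketch with Postgres-flavored SQL in strings; the `users` table and column names are purely illustrative.

```python
# Sketch: one logical change (merge first/last name into full_name)
# split into independently deployable phases.

# Release 1 (expand): additive and nullable, so old code keeps working
# and replicas replay it cheaply.
EXPAND = [
    "ALTER TABLE users ADD COLUMN full_name text",
]

# Between releases (backfill): run by a throttled batch job, never as
# one giant UPDATE; the IS NULL guard makes reruns idempotent.
BACKFILL = [
    "UPDATE users SET full_name = first_name || ' ' || last_name "
    "WHERE id BETWEEN %(lo)s AND %(hi)s AND full_name IS NULL",
]

# Release N (contract): only after every reader uses the new column.
CONTRACT = [
    "ALTER TABLE users DROP COLUMN first_name",
    "ALTER TABLE users DROP COLUMN last_name",
]
```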
Replica lag aware rollout
- Before enabling new read paths:
- ensure replicas have replayed the migration (monitor replication lag)
- If your app reads from replicas:
- guard new reads behind a feature flag until lag is acceptable
- or “read-your-writes” path: route reads to primary for operations that depend on just-written schema/data
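The lag-aware gating above can be sketched as a small routing function. The lag threshold and the `needs_own_writes` flag are assumptions; how you measure lag is DB-specific (e.g. from pg_stat_replication on a Postgres primary).

```python
# Sketch: route reads to the primary until replica lag is acceptable,
# or whenever the caller needs read-your-writes semantics.

MAX_LAG_SECONDS = 5.0  # assumption: tune for your workload

def choose_read_target(replica_lag_s: float,
                       needs_own_writes: bool,
                       new_read_path_enabled: bool) -> str:
    if needs_own_writes:
        # read-your-writes: the replica may not have the row yet
        return "primary"
    if new_read_path_enabled and replica_lag_s > MAX_LAG_SECONDS:
        # replica may not have replayed the migration yet
        return "primary"
    return "replica"
```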
DDL choices that won’t wreck replicas
- Prefer non-blocking operations (DB-specific):
- Postgres: CREATE INDEX CONCURRENTLY (cannot run inside a transaction); careful with long-running DDL
- MySQL: online DDL options if supported
- Avoid huge table rewrites during peak hours; they can create replication delay and cascade into stale reads.
Monitoring checks (must-have)
- “Migration applied on primary” (schema history table)
- Replica replay status / lag
- Errors on replicas (DDL replication conflicts are rare with physical replication, more likely with logical decoding / filtered replication)
Sharded systems
Here the “primary-only” rule becomes primary-per-shard, and orchestration becomes the whole game.
1) Treat each shard as its own database with its own migration state
You need:
- Per-shard schema history (Flyway/Liquibase tables exist on every shard)
- A migration orchestrator that can:
- discover shards
- apply migrations shard-by-shard
- report progress and failures
- retry safely
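The orchestrator responsibilities above can be sketched as one loop with per-shard state. `apply_migration` stands in for "run Flyway/Liquibase against this shard"; real shard discovery and state storage would be services, not an in-memory dict.

```python
# Sketch: apply a migration shard-by-shard, record per-shard outcome,
# retry transient failures, and stop the rollout on a hard failure.

from typing import Callable, Dict, List

def migrate_all(shards: List[str],
                apply_migration: Callable[[str], None],
                state: Dict[str, str],
                max_retries: int = 2) -> Dict[str, str]:
    for shard in shards:
        if state.get(shard) == "done":
            continue                    # idempotent: skip migrated shards
        for _attempt in range(max_retries + 1):
            try:
                apply_migration(shard)
                state[shard] = "done"
                break
            except Exception:
                state[shard] = "failed"
        if state[shard] == "failed":
            break                       # stop the rollout; investigate
    return state
```

Because the loop skips shards already marked "done", re-running it after a fix retries only the failed and untouched shards.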
2) Order of operations (typical)
Canary → small batch → full rollout
- Run migrations on 1 shard (or a dedicated canary shard)
- Then N shards at a time (control blast radius)
- Then the rest
Throttle concurrency to protect:
- DB CPU/IO
- shared infrastructure (storage, network)
- downstream systems (replication, CDC, analytics)
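The canary → batched rollout above amounts to splitting the shard list into waves. A minimal sketch; the batch size is a knob you would tune against the blast-radius and throttling concerns listed.

```python
# Sketch: plan rollout waves -- one canary shard, then batches of N,
# then whatever remains.

from typing import List

def rollout_waves(shards: List[str], batch_size: int = 4) -> List[List[str]]:
    if not shards:
        return []
    canary, rest = shards[:1], shards[1:]
    waves = [canary]
    for i in range(0, len(rest), batch_size):
        waves.append(rest[i:i + batch_size])
    return waves
```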
3) Handle heterogeneous shard versions
In real life, shards drift. Your application must be compatible with:
- shard at version V
- shard at V+1 (during rollout)
- sometimes V+2 (if rollouts overlap)
So you design migrations and code with backward/forward compatibility:
- Additive changes first (expand)
- Reads tolerate missing columns (or use fallback)
- Writes dual-write if needed
- Contract only after all shards are confirmed upgraded
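The compatibility rules above can be sketched at the application layer. Rows are plain dicts here and the column names are illustrative: version V has only `full_name`, V+1 adds `display_name`.

```python
# Sketch: reads fall back when the new column is missing; writes
# dual-write during the rollout.

def read_display_name(row: dict) -> str:
    # Works against shards at V (no display_name) and V+1.
    return row.get("display_name") or row["full_name"]

def build_write(full_name: str, shard_has_new_column: bool) -> dict:
    # Dual-write: populate the new column only where it exists.
    row = {"full_name": full_name}
    if shard_has_new_column:
        row["display_name"] = full_name
    return row
```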
4) Data migrations in sharded environments
Avoid “update the whole shard in one transaction”.
- Backfill in small batches with checkpoints:
WHERE id > :last_id ORDER BY id LIMIT 10000
- Make it idempotent (rerunnable)
- Use rate limiting and time windows
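The checkpointed-backfill pattern above can be sketched as a resumable loop. `fetch_batch` and `apply_batch` stand in for SQL like "SELECT ... WHERE id > :last_id ORDER BY id LIMIT :n" plus a guarded UPDATE; the sleep is a crude rate limiter.

```python
# Sketch: idempotent, checkpointed backfill over an id-ordered table.

import time

def backfill(fetch_batch, apply_batch, save_checkpoint,
             last_id: int = 0, batch_size: int = 10_000,
             sleep_s: float = 0.0) -> int:
    """Run until no rows remain; return the final checkpoint id."""
    while True:
        rows = fetch_batch(last_id, batch_size)
        if not rows:
            return last_id
        apply_batch(rows)            # must be idempotent (rerunnable)
        last_id = rows[-1]["id"]
        save_checkpoint(last_id)     # a crashed run resumes from here
        if sleep_s:
            time.sleep(sleep_s)      # throttle to protect the shard
```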
5) “Global” constraints are not global anymore
Uniqueness, sequences, and FKs across shards are tricky:
- Global uniqueness typically requires:
- snowflake/UUID IDs
- or keyspace allocation per shard
- Cross-shard FKs: usually avoided; enforced at app level or via async validation
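Keyspace allocation per shard can be sketched as disjoint id ranges: each shard draws from its own local sequence, offset by the shard index, so ids are globally unique without coordination. The range width is an assumption to size for your lifetime write volume.

```python
# Sketch: per-shard keyspace allocation for globally unique ids.

RANGE = 1_000_000_000  # assumption: ids reserved per shard

def id_range(shard_index: int) -> range:
    start = shard_index * RANGE
    return range(start, start + RANGE)

def next_id(shard_index: int, local_counter: int) -> int:
    # local_counter is the shard's own monotonically increasing sequence
    assert local_counter < RANGE, "shard exhausted its keyspace"
    return shard_index * RANGE + local_counter
```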
6) Tooling patterns
- Flyway/Liquibase still work — but you don’t run them once; you run them per shard.
- Common orchestrators:
- pipeline step that iterates shards
- K8s Job per shard (or per shard batch)
- internal “Schema Service” (more mature orgs)
7) Failure strategy (what interviewers love)
- If shard 17 fails:
- stop the rollout
- keep app compatible with mixed versions
- fix migration
- re-run only failed shards (idempotent migrations make this boring)
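The "re-run only failed shards" step falls out of the per-shard schema history: anything not confirmed done goes into the retry set. A trivial sketch, safe precisely because the migrations are idempotent.

```python
# Sketch: compute the retry set from per-shard migration state.

from typing import Dict, List

def shards_to_rerun(all_shards: List[str], state: Dict[str, str]) -> List[str]:
    return [s for s in all_shards if state.get(s) != "done"]
```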
One clean “interview answer” you can say
“For read replicas, migrations run on the primary and replicate; the main concern is making schema changes additive and lag-aware so replica reads don’t break while the change propagates. For sharded systems, each shard has its own migration state, and we run migrations via an orchestrator with canary + batched rollout, keeping the application backward/forward compatible across mixed shard versions using an expand–migrate–contract approach and idempotent backfills.”