2) Check migration metadata tables (source of truth)
This tells you whether it ran, failed, or got stuck.
Flyway
Query `flyway_schema_history`:
- `success = false` rows → the failed migration
- `installed_rank` / `version` to see order
- `checksum` and `execution_time` for clues
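A minimal query for this, using Flyway's default history table and column names:

```sql
-- Failed runs have success = false; the order shows what ran before the failure
SELECT installed_rank, version, description,
       checksum, execution_time, success
FROM flyway_schema_history
ORDER BY installed_rank DESC
LIMIT 20;
```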
Liquibase
Query:
- `DATABASECHANGELOG` → last executed changeset, `EXECTYPE`
- `DATABASECHANGELOGLOCK` → is it locked, and by whom?
Common pipeline failure: Liquibase lock left behind by a killed job → next run blocks.
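Both checks in one place, assuming Liquibase's default table names:

```sql
-- Last applied changesets and how they were executed
SELECT id, author, filename, dateexecuted, exectype
FROM databasechangelog
ORDER BY dateexecuted DESC;

-- Is a lock held, and by which host/process?
SELECT id, locked, lockgranted, lockedby
FROM databasechangeloglock;
```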
3) Reproduce in the same environment the pipeline used
Pipeline failures often depend on:
- DB version/params
- privileges
- data volume
- statement timeouts
- transaction settings
Repro checklist
- Same Docker image / migration tool version (Flyway/Liquibase)
- Same JDBC URL params
- Same migration user/role
- Same schema/search_path
- Same baseline/placeholder values
If the pipeline spins an ephemeral DB: pull the same compose/Testcontainers config locally.
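For example, a local repro with the same Flyway image the pipeline uses might look like this (image tag, paths, and credentials are placeholders; adjust networking for your setup):

```bash
docker run --rm \
  -v "$PWD/migrations:/flyway/sql" \
  flyway/flyway:10 \
  -url="jdbc:postgresql://host.docker.internal:5432/appdb" \
  -user=migration_user \
  -password="$MIGRATION_PASSWORD" \
  migrate
```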
4) Classify the failure (most common buckets)
A) Syntax / compatibility
- Wrong SQL dialect for the DB
- Using non-transactional DDL inside a transaction (e.g. Postgres `CREATE INDEX CONCURRENTLY`)
Fix: adjust SQL, split into separate migrations, or mark non-transactional properly.
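For example, on Postgres this statement fails if the tool wraps it in a transaction, so it typically gets its own migration marked non-transactional (e.g. Liquibase's `runInTransaction="false"` on the changeset):

```sql
-- Must run outside a transaction block; give it its own migration
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id
    ON orders (user_id);
```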
B) Permission / ownership
- “must be owner of relation”, “permission denied”
Fix: run under proper migration role; ensure role owns objects or has required grants.
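A sketch of the usual remedies (role, schema, and table names are placeholders):

```sql
-- Either transfer ownership to the migration role...
ALTER TABLE app.orders OWNER TO migration_role;
-- ...or grant what the migration actually needs
GRANT CREATE, USAGE ON SCHEMA app TO migration_role;
```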
C) Locking / timeouts
- “could not obtain lock”, “lock wait timeout”, deadlock
Fix:
- Make DDL less blocking (concurrent/online)
- Increase `lock_timeout` / `statement_timeout` carefully
- Run off-peak or use expand–migrate–contract
- For Liquibase: clear the stale lock (with tooling)
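On Postgres, setting timeouts at the top of a migration makes it fail fast instead of queueing behind long transactions (values are illustrative):

```sql
SET lock_timeout = '5s';         -- give up quickly if the table is busy
SET statement_timeout = '10min'; -- cap any single statement
ALTER TABLE orders ADD COLUMN archived_at timestamptz;
```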
D) Data issues (DML fails)
- constraint violations, nulls, duplicate keys
Fix:
- precondition checks (Liquibase preconditions)
- backfill in batches
- make migration idempotent / safe for rerun
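Idempotence plus batching in sketch form (Postgres syntax; table and column names are placeholders):

```sql
-- Safe to rerun
ALTER TABLE orders ADD COLUMN IF NOT EXISTS status text;

-- Backfill in bounded chunks; the runner loops until 0 rows are updated
UPDATE orders
SET status = 'unknown'
WHERE id IN (
    SELECT id FROM orders WHERE status IS NULL LIMIT 10000
);
```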
E) “Half-applied” state
- Some statements executed, then failure
Fix approach depends on DB + tool:
- If the migration is transactional and failed → usually rolled back
- If non-transactional statements were used → you may need fix-forward scripts
5) Safe recovery actions (what to do next)
Flyway
- If a migration failed, Flyway records it as failed.
- Typical recovery:
- Fix the migration or add a new fix migration (preferred if already applied elsewhere)
- If the failed row remains: `flyway repair` (after you’re sure about the state)
- Re-run `flyway migrate`
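The typical command sequence, once you've confirmed the history state:

```bash
flyway info     # shows the failed entry in the history table
flyway repair   # removes failed entries and realigns checksums
flyway migrate  # re-run from the fixed state
```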
Liquibase
- If lock is stuck: use Liquibase commands (preferred) rather than manual DB edits:
- `liquibase releaseLocks` (if your setup supports it)
- or, as a last resort, fix `DATABASECHANGELOGLOCK` carefully
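If tooling isn't available and you must clean up by hand, the commonly used statement is (default table name assumed; verify the lock really is stale first):

```sql
UPDATE databasechangeloglock
SET locked = FALSE, lockgranted = NULL, lockedby = NULL
WHERE id = 1;
```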
- If a changeset partially applied: usually fix-forward with a new changeset.
Rule: on shared envs, avoid “editing history”; prefer new migration.
6) Pipeline hardening (so debugging is rare)
Add a “preflight” stage:
- `validate` (checksums, changelog correctness)
- an `updateSQL` / `dryRunOutput` artifact
- run migrations on an ephemeral DB from scratch
- optionally run against a restored snapshot nightly (big-data/perf catch)
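A preflight stage can be as small as this (Liquibase flavor shown; Flyway's `validate` and dry-run output are the equivalents, and the artifact path is a placeholder):

```bash
liquibase validate                           # changelog + checksum sanity
liquibase updateSQL > artifacts/pending.sql  # SQL that *would* run, kept as an artifact
```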
Also ensure:
- one migration runner (job/step) to avoid concurrency
- DB connection + lock timeouts are explicit and logged
- artifacts include: generated SQL, tool version, DB version
7) Interview-ready answer (tight)
“First I identify the exact failing migration/version from pipeline logs and check the schema history tables (`flyway_schema_history` or `DATABASECHANGELOG`/`DATABASECHANGELOGLOCK`) to see whether it failed, partially applied, or left a lock. Then I reproduce using the same tool version, config, and DB role. Most issues fall into syntax/compatibility, permissions, or locking/timeouts; I fix forward with a new migration when history is shared, and use Flyway `repair` or Liquibase lock release only when I’m sure it’s a metadata/lock problem. Finally I harden the pipeline with validate + dry-run SQL artifacts and an ephemeral-DB migration test stage.”