Java / DB Migration Tools: What’s the best approach to seed data with a migration tool?

The “best” approach depends on what kind of data you’re seeding. The senior answer: don’t treat all seed data the same. Split it into categories and choose the mechanism that matches each category’s lifecycle.


1) Classify seed data first

A) Reference / canonical data (good for migrations)

Examples:

  • roles: ADMIN, USER
  • countries, currencies
  • feature-flag defaults
  • permission matrices (if stable)
  • system config keys required for startup

✅ This belongs in migration tooling because:

  • the app may not start without it
  • it must be consistent across environments
  • it’s part of the “schema contract”

B) Demo / test / local-only data (don’t put in prod migrations)

Examples:

  • fake users, sample orders
  • mock catalog entries
  • dev fixtures

❌ This should be injected via:

  • separate scripts
  • test setup
  • profiles
  • Testcontainers init scripts
  • dedicated “dev seed” task
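As one sketch of such a separate dev-seed script (table names, columns, and the file path are illustrative), a file mounted into a local Postgres container’s init directory rather than shipped as a prod migration:

```sql
-- dev/seed-dev-data.sql  (hypothetical path; runs only in local/dev environments)
-- Fake users for local development; idempotent so repeated container starts are safe.
INSERT INTO app_user (email, display_name)
VALUES
  ('alice@example.test', 'Alice Demo'),
  ('bob@example.test',   'Bob Demo')
ON CONFLICT (email) DO NOTHING;
```

Because it never enters the migration history, you can edit or reset it freely without touching prod.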

C) Environment-specific operational data (avoid in migrations)

Examples:

  • prod API keys
  • customer-specific config
  • partner identifiers

❌ Keep in secrets/config management, not DB migrations.


2) Golden rule: seed data must be idempotent

Your seed scripts must be safe to run:

  • more than once
  • on partially seeded environments
  • after a rollback/restore
  • during blue/green deploys

Postgres example

INSERT INTO role(code)
VALUES ('ADMIN'), ('USER')
ON CONFLICT (code) DO NOTHING;

MySQL example

INSERT IGNORE INTO role(code) VALUES ('ADMIN'), ('USER');

Or use an explicit upsert keyed on a unique constraint.
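A sketch of an explicit upsert in Postgres, assuming a `role` table with a unique `code` column and a descriptive `label` column the migration should keep authoritative:

```sql
INSERT INTO role (code, label)
VALUES ('ADMIN', 'Administrator'),
       ('USER',  'Standard user')
ON CONFLICT (code) DO UPDATE
SET label = EXCLUDED.label;  -- rerunning refreshes the label, never duplicates rows
```

In MySQL the same idea is expressed with `INSERT ... ON DUPLICATE KEY UPDATE`.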

📌 Interview phrase:

“Seed data should be idempotent and based on natural keys, not generated IDs.”

3) Prefer “natural keys” and stable identifiers

Bad:

  • insert role, then later code refers to role_id = 1

Good:

  • refer by code = 'ADMIN'
  • or use UUIDs you control
  • or insert with fixed IDs only if you own the full lifecycle

This avoids drift when different environments generate IDs differently.
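For example, wiring a bootstrap account to a role by natural key rather than a hard-coded ID (table and column names are illustrative):

```sql
-- Grant ADMIN to a bootstrap account without assuming any role_id value.
INSERT INTO user_role (user_id, role_id)
SELECT u.id, r.id
FROM app_user u
JOIN role r ON r.code = 'ADMIN'
WHERE u.email = 'ops@example.test'
ON CONFLICT DO NOTHING;
```

The subselect resolves the IDs at runtime, so the script works even when environments have generated different surrogate keys.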

4) Keep seed operations small and fast

Avoid:

  • inserting 100k rows in a migration
  • heavy transforms
  • long transactions

For big datasets:

  • use batch jobs / ETL
  • or preloaded snapshots
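If a large reference dataset genuinely must be loaded, a bulk-load mechanism outside the versioned migration path keeps deploys fast. A minimal sketch using Postgres `COPY` (table and file path are illustrative):

```sql
-- Run as a one-off operational job, not inside a versioned migration.
COPY postal_code (code, city, region)
FROM '/data/postal_codes.csv'
WITH (FORMAT csv, HEADER true);
```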

Migration tools run in the deploy path, so keep them fast and predictable.

5) Tool-specific best practices

Flyway

Best practice for reference seed data:

  • separate files:
    • V10__create_tables.sql
    • V11__seed_reference_data.sql

If data needs periodic refresh:

  • consider repeatable migrations carefully:
    • R__views.sql
    • R__seed_data.sql ⚠️ usually no, unless it’s fully idempotent and “authoritative”

In Flyway, repeatable migrations rerun whenever their checksum changes; that rerun can surprise you.

Liquibase

Liquibase gives you more control:

  • Use <loadData> for CSV-based seed data (nice for reference tables)
  • Use contexts/labels:
    • context="dev" for dev-only seeds
    • no context for prod-required reference data

Also:

  • Use preconditions if necessary (but don’t hide required data behind conditions unless you really mean it)
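In Liquibase’s SQL-formatted changelog style, a context-gated dev seed next to an ungated reference seed might look roughly like this (author names, changeset IDs, and tables are illustrative; verify the header syntax against the Liquibase docs for your version):

```sql
--liquibase formatted sql

--changeset alice:seed-roles
INSERT INTO role (code) VALUES ('ADMIN'), ('USER')
ON CONFLICT (code) DO NOTHING;

--changeset alice:seed-dev-users context:dev
INSERT INTO app_user (email, display_name)
VALUES ('dev@example.test', 'Dev User')
ON CONFLICT (email) DO NOTHING;
```

Running with `--contexts=dev` applies both changesets; running without it applies only the reference seed.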

6) Recommended pattern: reference seed in migrations + fixtures elsewhere

What I’d do in a real project

  • Reference data (roles, currencies) → migrations (idempotent)
  • Dev/demo → separate “seed task”:
    • Gradle task / Maven profile
    • docker-compose init scripts
    • Testcontainers init
  • Test data → test setup (JUnit / integration test bootstrap)

7) Interview-ready answer (30 seconds)

“I seed only canonical reference data through the migration tool, and I keep it idempotent using natural keys and upserts so reruns are safe. I avoid putting demo or environment-specific data into migrations; those go into separate dev/test seed mechanisms or config management. For Flyway I prefer versioned seed migrations; for Liquibase I often use contexts/labels and loadData for reference tables.”
