SQL.What is CDC (Change Data Capture)? ?

CDC (Change Data Capture) is a technique for capturing changes in a database (INSERT, UPDATE, DELETE) and streaming them to other systems without querying tables directly.

In simple words:

Instead of asking the DB “what changed?”, CDC listens to the DB’s change log.


Why CDC exists (the real problem it solves)

Without CDC, people do things like:

  • periodic SELECT * WHERE updated_at > last_run
  • batch jobs every N minutes
  • triggers on tables

Problems:

  • missed updates
  • race conditions
  • heavy load
  • hard to scale
  • brittle logic

CDC fixes this by:

  • capturing every committed change
  • in correct order
  • with minimal overhead

How CDC works (Postgres example)

PostgreSQL already has a log of all changes:

  • WAL (Write-Ahead Log)

CDC tools:

  1. Read WAL
  2. Decode row-level changes
  3. Emit events to Kafka (or another stream)

Your app never queries the tables for changes.

Typical CDC pipeline

Postgres (OLTP)
  ↓ WAL
CDC Tool (Debezium)
  ↓ Kafka
Consumers:
  - OLAP loader (ClickHouse)
  - Cache updater
  - Search indexer
  - Audit / ML / monitoring

Key idea:

One write → many downstream consumers

What exactly is captured?

For each change:

  • operation type: c (insert), u (update), d (delete)
  • before / after values
  • table name
  • primary key
  • transaction metadata
  • commit order

Example (simplified):

{
  "op": "c",
  "table": "transactions",
  "after": {
    "id": 123,
    "client_id": 42,
    "amount": 100
  }
}

CDC vs “just consume Kafka twice”

AspectCDCDual Kafka Consumers
Source of truthDatabaseKafka
Captures manual DB changes
OrderingTransactionalPer topic/partition
ReprocessingStrongDepends on design
ComplexityMediumLow–Medium
ConsistencyVery strongApp-level

Senior rule of thumb:

If DB is the source of truth → CDC
If Kafka is the source of truth → dual consume

Why CDC is perfect for OLTP → OLAP

CDC guarantees:

  • no missed rows
  • no double counting (with correct handling)
  • exact history
  • low impact on OLTP

This is why almost all serious data platforms use CDC.

Common CDC tools

  • Debezium (most popular)
  • Maxwell (MySQL)
  • GoldenGate (Oracle)
  • Native cloud CDC (AWS DMS, GCP Datastream)

Important limitations (know these!)

1️⃣ Schema evolution matters

  • Column rename / type change affects CDC consumers
  • You need schema versioning

2️⃣ Large transactions

  • Huge updates produce huge CDC events
  • OLAP pipelines must handle bursts

3️⃣ Deletes are special

  • Soft delete vs hard delete must be decided
  • OLAP must understand delete semantics

Interview-ready answer (short & clean)

“CDC, or Change Data Capture, is a technique to stream database changes by reading the database’s transaction log instead of querying tables. It captures inserts, updates, and deletes in commit order and is commonly used to replicate data from OLTP systems to OLAP, caches, or search indexes.”


Senior bonus insight ⭐

CDC is:

  • event sourcing at the database level
  • but without rewriting the app

That’s why teams love it.

This entry was posted in Без рубрики. Bookmark the permalink.