CDC (Change Data Capture) is a technique for capturing changes in a database (INSERT, UPDATE, DELETE) and streaming them to other systems without querying tables directly.
In simple words:
Instead of asking the DB “what changed?”, CDC listens to the DB’s change log.
Why CDC exists (the real problem it solves)
Without CDC, people do things like:
- periodic
SELECT * WHERE updated_at > last_run - batch jobs every N minutes
- triggers on tables
Problems:
- missed updates
- race conditions
- heavy load
- hard to scale
- brittle logic
CDC fixes this by:
- capturing every committed change
- in correct order
- with minimal overhead
How CDC works (Postgres example)
PostgreSQL already has a log of all changes:
- WAL (Write-Ahead Log)
CDC tools:
- Read WAL
- Decode row-level changes
- Emit events to Kafka (or another stream)
Your app never queries the tables for changes.
Typical CDC pipeline
Postgres (OLTP)
↓ WAL
CDC Tool (Debezium)
↓ Kafka
Consumers:
- OLAP loader (ClickHouse)
- Cache updater
- Search indexer
- Audit / ML / monitoring
Key idea:
One write → many downstream consumers
What exactly is captured?
For each change:
- operation type:
c(insert),u(update),d(delete) - before / after values
- table name
- primary key
- transaction metadata
- commit order
Example (simplified):
{
"op": "c",
"table": "transactions",
"after": {
"id": 123,
"client_id": 42,
"amount": 100
}
}
CDC vs “just consume Kafka twice”
| Aspect | CDC | Dual Kafka Consumers |
|---|---|---|
| Source of truth | Database | Kafka |
| Captures manual DB changes | ✅ | ❌ |
| Ordering | Transactional | Per topic/partition |
| Reprocessing | Strong | Depends on design |
| Complexity | Medium | Low–Medium |
| Consistency | Very strong | App-level |
Senior rule of thumb:
If DB is the source of truth → CDC
If Kafka is the source of truth → dual consume
Why CDC is perfect for OLTP → OLAP
CDC guarantees:
- no missed rows
- no double counting (with correct handling)
- exact history
- low impact on OLTP
This is why almost all serious data platforms use CDC.
Common CDC tools
- Debezium (most popular)
- Maxwell (MySQL)
- GoldenGate (Oracle)
- Native cloud CDC (AWS DMS, GCP Datastream)
Important limitations (know these!)
1️⃣ Schema evolution matters
- Column rename / type change affects CDC consumers
- You need schema versioning
2️⃣ Large transactions
- Huge updates produce huge CDC events
- OLAP pipelines must handle bursts
3️⃣ Deletes are special
- Soft delete vs hard delete must be decided
- OLAP must understand delete semantics
Interview-ready answer (short & clean)
“CDC, or Change Data Capture, is a technique to stream database changes by reading the database’s transaction log instead of querying tables. It captures inserts, updates, and deletes in commit order and is commonly used to replicate data from OLTP systems to OLAP, caches, or search indexes.”
Senior bonus insight ⭐
CDC is:
- event sourcing at the database level
- but without rewriting the app
That’s why teams love it.