SQL.What is CDC (Change Data Capture) ?

CDC (Change Data Capture) is a technique for capturing changes in a database (INSERT, UPDATE, DELETE) and streaming them to other systems without querying tables directly.

In simple words:

Instead of asking the DB “what changed?”, CDC listens to the DB’s change log.


Why CDC exists (the real problem it solves)

Without CDC, people do things like:

  • periodic SELECT * WHERE updated_at > last_run
  • batch jobs every N minutes
  • triggers on tables

Problems:

  • missed updates
  • race conditions
  • heavy load
  • hard to scale
  • brittle logic

CDC fixes this by:

  • capturing every committed change
  • in correct order
  • with minimal overhead

How CDC works (Postgres example)

PostgreSQL already has a log of all changes:

  • WAL (Write-Ahead Log)

CDC tools:

  1. Read WAL
  2. Decode row-level changes
  3. Emit events to Kafka (or another stream)

Your app never queries the tables for changes.
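
For orientation, here is a minimal sketch of the Postgres-side setup a logical-decoding CDC tool typically needs (the publication and slot names are illustrative; changing wal_level requires a restart):

-- Allow logical decoding of the WAL
ALTER SYSTEM SET wal_level = logical;

-- Publication the CDC connector subscribes to
CREATE PUBLICATION cdc_pub FOR TABLE transactions;

-- Replication slot that remembers how far the reader got in the WAL
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');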

Typical CDC pipeline

Postgres (OLTP)
  ↓ WAL
CDC Tool (Debezium)
  ↓ Kafka
Consumers:
  - OLAP loader (ClickHouse)
  - Cache updater
  - Search indexer
  - Audit / ML / monitoring

Key idea:

One write → many downstream consumers

What exactly is captured?

For each change:

  • operation type: c (insert), u (update), d (delete)
  • before / after values
  • table name
  • primary key
  • transaction metadata
  • commit order

Example (simplified):

{
  "op": "c",
  "table": "transactions",
  "after": {
    "id": 123,
    "client_id": 42,
    "amount": 100
  }
}

CDC vs “just consume Kafka twice”

Aspect                     | CDC           | Dual Kafka Consumers
---------------------------|---------------|---------------------
Source of truth            | Database      | Kafka
Captures manual DB changes | Yes           | No
Ordering                   | Transactional | Per topic/partition
Reprocessing               | Strong        | Depends on design
Complexity                 | Medium        | Low–Medium
Consistency                | Very strong   | App-level

Senior rule of thumb:

If DB is the source of truth → CDC
If Kafka is the source of truth → dual consume

Why CDC is perfect for OLTP → OLAP

CDC guarantees:

  • no missed rows
  • no double counting (with correct handling)
  • exact history
  • low impact on OLTP

This is why almost all serious data platforms use CDC.

Common CDC tools

  • Debezium (most popular)
  • Maxwell (MySQL)
  • GoldenGate (Oracle)
  • Native cloud CDC (AWS DMS, GCP Datastream)

Important limitations (know these!)

1️⃣ Schema evolution matters

  • Column rename / type change affects CDC consumers
  • You need schema versioning

2️⃣ Large transactions

  • Huge updates produce huge CDC events
  • OLAP pipelines must handle bursts

3️⃣ Deletes are special

  • Soft delete vs hard delete must be decided
  • OLAP must understand delete semantics

Interview-ready answer (short & clean)

“CDC, or Change Data Capture, is a technique to stream database changes by reading the database’s transaction log instead of querying tables. It captures inserts, updates, and deletes in commit order and is commonly used to replicate data from OLTP systems to OLAP, caches, or search indexes.”


Senior bonus insight ⭐

CDC is:

  • event sourcing at the database level
  • but without rewriting the app

That’s why teams love it.


SQL.If I need to write small income/outcome operations for clients from Kafka and then show big reports over them, what should I do ? Shouldn't it be one DB, so I don't have to write to both OLTP and OLAP databases ?

Recommended architecture for “income/outcome ops + big reports”

1) OLTP is the source of truth (writes)

Your service consumes Kafka events (income/outcome) and writes them to an OLTP database:

  • Postgres (or similar)
  • Proper constraints, FK, idempotency keys, transactions
  • Optimized for correctness + small/fast queries

This DB is where you can answer:

  • “What is the current balance?”
  • “Show last 50 transactions”
  • “Did we already process this Kafka message?”

2) OLAP is for heavy reports (reads)

Big reports (aggregations over long periods, many dimensions) should read from an OLAP store:

  • ClickHouse / BigQuery / Redshift / Snowflake (depending on your stack)
  • Denormalized model, partitions, pre-aggregations
  • Handles huge scans + GROUP BY efficiently

3) Data moves from OLTP → OLAP automatically

You have 2 common patterns:

✅ Pattern A: CDC (best in many cases)

  • Debezium reads Postgres WAL (change data capture)
  • Publishes changes to Kafka
  • OLAP ingestor consumes and loads to ClickHouse/warehouse

Benefits:

  • App writes only OLTP
  • Strong audit trail
  • Near real-time analytics

✅ Pattern B: Dual consume from Kafka (also common)

  • Your OLTP writer service consumes Kafka → writes Postgres
  • Separately, an analytics pipeline consumes Kafka → writes OLAP

Benefits:

  • No CDC complexity
  • Analytics can evolve independently

Tradeoff:

  • You must handle duplicates/out-of-order in both sinks

Should it be “one DB” to avoid writing twice?

One DB is great until reports get heavy. Then it becomes a bottleneck.

If you keep everything in Postgres:

You’ll eventually hit:

  • long-running aggregation queries blocking resources
  • cache eviction
  • I/O pressure from full scans
  • pressure to add “analytics indexes” that slow writes

So the principle is:

One write path (OLTP), many read models (OLAP, caches, materialized views).

This is basically CQRS-ish: separate write model from read model.


Practical “staged” approach (what I’d do in a real team)

Stage 1 (MVP / moderate data)

  • Postgres only
  • Keep raw operations table
  • Use:
    • indexes on (client_id, time)
    • maybe a few materialized views / summary tables updated incrementally

Good if reports are not insane.
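
A minimal Postgres sketch of this stage (table and column names are assumptions): one index for per-client history plus a pre-aggregated daily summary kept in a materialized view.

CREATE INDEX idx_ops_client_time ON operations (client_id, created_at);

CREATE MATERIALIZED VIEW daily_client_totals AS
SELECT client_id,
       date_trunc('day', created_at) AS day,
       SUM(amount) AS total_amount,
       COUNT(*)    AS ops_count
FROM operations
GROUP BY client_id, date_trunc('day', created_at);

-- Unique index lets the view be refreshed without blocking readers
CREATE UNIQUE INDEX ON daily_client_totals (client_id, day);

-- Refresh on a schedule (note: this recomputes the view, it is not truly incremental)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_client_totals;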

Stage 2 (reports grow / data grows)

Add OLAP:

  • Keep OLTP as truth
  • Stream data into ClickHouse (or your OLAP)
  • Dashboards/reports query OLAP only

Key engineering requirements (Kafka → DB)

Idempotency (must-have)

Kafka is at-least-once in practice. You will see duplicates.

Use:

  • event_id (unique) with UNIQUE(event_id)
  • or (client_id, external_tx_id) unique constraint

Write pattern:

  • insert transaction row (or upsert)
  • update aggregates if insert succeeded
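
A minimal sketch of this idempotent write path in Postgres (table and column names are assumptions). If the INSERT affects zero rows, the event was already processed and aggregates must not be updated again:

-- Ledger table keyed by the dedup id carried in the Kafka message
CREATE TABLE ledger_transactions (
    event_id   uuid PRIMARY KEY,
    client_id  bigint      NOT NULL,
    amount     numeric     NOT NULL,
    event_time timestamptz NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Re-delivered messages become no-ops instead of duplicate rows
INSERT INTO ledger_transactions (event_id, client_id, amount, event_time)
VALUES ($1, $2, $3, $4)
ON CONFLICT (event_id) DO NOTHING;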

Ordering / time

Events can arrive late/out of order:

  • store event time separately from processing time
  • OLAP reports should use event time with late-arrival handling

Exactly-once? Don’t chase it blindly

For most systems:

  • at-least-once + idempotent writes is the winning combo.

A concrete blueprint (clean)

  1. Kafka topic: transactions
  2. Consumer service ledger-writer:
    • validates, deduplicates (event_id)
    • writes to Postgres ledger_transactions
    • (optional) updates balances table
  3. Analytics pipeline:
    • either CDC from Postgres
    • or another Kafka consumer
    • loads into OLAP fact table fact_transactions
  4. Reports service queries OLAP

Interview-ready answer

“I wouldn’t make the app write to both OLTP and OLAP as sources of truth. I’d write once to OLTP and replicate to OLAP using CDC or a separate Kafka consumer. OLTP stays correct and transactional; OLAP is optimized for large aggregations and reports.”


SQL.What is OLTP, OLAP ?

Short, interview-ready definition

OLTP systems are optimized for many small, fast transactions (reads/writes).
OLAP systems are optimized for complex analytical queries over large datasets.

1️⃣ OLTP — Online Transaction Processing


What it is

Systems that support day-to-day operations of an application.

Typical operations

  • Insert / update single rows
  • Read by primary key
  • Short transactions
  • High concurrency

Example queries

SELECT * FROM users WHERE id = 42;

INSERT INTO orders(user_id, amount) VALUES (42, 100);

UPDATE accounts SET balance = balance - 10 WHERE id = 7;

Characteristics

  • Many concurrent users
  • Low latency (milliseconds)
  • Strong consistency
  • Small result sets
  • Heavy use of indexes

Typical databases

  • PostgreSQL
  • MySQL
  • Oracle
  • SQL Server

Backend example

  • User registration
  • Payments
  • Order creation
  • Profile updates

2️⃣ OLAP — Online Analytical Processing


What it is

Systems designed for analysis, reporting, and decision making.

Typical operations

  • Full table scans
  • Aggregations
  • Grouping over millions/billions of rows
  • Read-heavy

Example queries

SELECT country, SUM(amount)
FROM orders
GROUP BY country;

SELECT DATE_TRUNC('month', created_at), COUNT(*)
FROM users
GROUP BY 1;

Characteristics

  • Few users
  • Long-running queries
  • Large result sets
  • Mostly reads
  • Columnar storage often used

Typical databases

  • ClickHouse
  • BigQuery
  • Redshift
  • Snowflake

Backend example

  • Dashboards
  • BI reports
  • Trend analysis
  • Forecasting

3️⃣ OLTP vs OLAP — side-by-side

Aspect       | OLTP             | OLAP
-------------|------------------|------------------------------
Goal         | Run the business | Analyze the business
Query type   | Simple, short    | Complex, long
Data volume  | Small per query  | Huge per query
Reads/Writes | Read + Write     | Mostly Read
Concurrency  | Very high        | Low
Schema       | Normalized       | Denormalized (star/snowflake)
Index usage  | Heavy            | Often minimal
Joins        | Few, selective   | Many, wide

4️⃣ Why this matters for backend developers

🔴 Wrong expectations = bad performance

Mistakes:

  • Running OLAP queries on OLTP DB
  • Adding 10 indexes to speed up reports
  • Blaming “Postgres is slow”

Reality:

OLTP and OLAP optimize opposite things.

🔴 Join algorithms differ

  • OLTP
    • Nested loop + index scan
    • Fast PK/FK joins
  • OLAP
    • Hash joins
    • Merge joins
    • Full scans are normal

🔴 Schema design differs

  • OLTP:
    • Normalized
    • Strong constraints
    • Minimal redundancy
  • OLAP:
    • Denormalized
    • Pre-aggregated
    • Redundancy is fine
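
For illustration, a denormalized fact table in plain SQL (all names are assumptions): dimension attributes are copied onto the fact row so reports can group without wide joins.

CREATE TABLE fact_orders (
    order_id         bigint,
    order_date       date,
    customer_id      bigint,
    customer_country text,     -- copied from the customer dimension
    product_sku      text,
    product_category text,     -- copied from the product dimension
    amount           numeric
);

-- Typical report: one big scan + GROUP BY, no joins
SELECT customer_country, product_category, SUM(amount)
FROM fact_orders
GROUP BY customer_country, product_category;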

5️⃣ Modern architecture (very important)

Most real systems use both:

OLTP DB  →  CDC / ETL  →  OLAP Warehouse

Example:

  • Postgres → Kafka → ClickHouse
  • App writes to OLTP
  • Analytics reads from OLAP

Senior insight:

Don’t mix workloads — separate them.


6️⃣ Interview-ready final answer (say this cleanly)

“OLTP systems handle many short, concurrent transactions like inserts and updates, while OLAP systems are optimized for large analytical queries and aggregations.
They differ in access patterns, schema design, indexing, and join strategies.
As a backend developer, mixing OLAP workloads into an OLTP database leads to performance problems, so the workloads should be separated.”


SQL.What is full scan / index scan ?

1️⃣ Full Scan (a.k.a. Sequential Scan)

Definition

A full scan reads the entire table from start to end, checking every row.

In PostgreSQL it’s called:

Seq Scan

Example

SELECT *
FROM orders
WHERE status = 'CANCELLED';

If:

  • No index on status, or
  • Optimizer thinks many rows match

Plan:

Seq Scan on orders
  Filter: status = 'CANCELLED'

Characteristics

✅ Pros

  • Very simple
  • Fast for small tables
  • Efficient for large result sets
  • Sequential I/O (disk-friendly)

❌ Cons

  • Reads everything
  • Wastes work if only few rows match

When optimizer chooses full scan

  • Table is small
  • Condition is not selective (returns many rows)
  • No usable index
  • Index would require many random reads
  • Stale statistics make index look expensive

2️⃣ Index Scan

Definition

An index scan uses an index to locate matching rows and then fetches them from the table.

In PostgreSQL:

Index Scan

Example

CREATE INDEX idx_orders_status ON orders(status);

SELECT *
FROM orders
WHERE status = 'CANCELLED';

Plan:

Index Scan using idx_orders_status on orders
  Index Cond: status = 'CANCELLED'

How it works (simplified)

  1. Traverse B-tree index
  2. Find matching keys
  3. Fetch corresponding rows from table (by TID)

Characteristics

✅ Pros

  • Excellent for selective queries
  • Reads only matching rows

❌ Cons

  • Random I/O (can be expensive)
  • Becomes slower when many rows match

3️⃣ Index-Only Scan (important senior detail)

Definition

Index-only scan reads data directly from the index without touching the table.

Plan:

Index Only Scan

Example

CREATE INDEX idx_orders_status_created
ON orders(status, created_at);

SELECT status, created_at
FROM orders
WHERE status = 'CANCELLED';

If:

  • All columns are in index
  • Visibility map allows it

Plan:

Index Only Scan

🔥 Fastest option.

4️⃣ Why optimizer sometimes prefers full scan over index scan

This is a very common interview trap.

Example

SELECT *
FROM orders
WHERE status IN ('NEW', 'PROCESSING', 'CANCELLED');

If this matches 70% of rows:

  • Index scan:
    • Many random fetches
  • Full scan:
    • Sequential read once

👉 Full scan wins.

Rule of thumb:

Index scan is good for few rows, full scan for many rows.

5️⃣ Full Scan vs Index Scan (comparison)

Aspect            | Full Scan  | Index Scan
------------------|------------|---------------
Reads             | All rows   | Matching rows
I/O               | Sequential | Random
Selectivity       | Bad        | Good
Small tables      | Fine       | Overkill
Large result sets | Good       | Bad

6️⃣ What decides which one is chosen?

Optimizer uses:

  • Table size
  • Index availability
  • Predicate selectivity
  • Statistics (NDV, histograms)
  • Cost of random vs sequential I/O
  • LIMIT
  • Correlation (index order vs physical order)
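
A quick way to inspect these inputs yourself in Postgres (continuing the orders example; the SET is for experiments only, never production):

-- Statistics the planner relies on for the status column
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Which scan was chosen, and how the estimate compares to reality
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'CANCELLED';

-- Forbid seq scans temporarily to see the cost of the index alternative
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'CANCELLED';
RESET enable_seqscan;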

7️⃣ How joins relate to scans (important)

Nested Loop + Index Scan

Nested Loop
  -> Seq Scan users
  -> Index Scan orders(user_id)

Classic OLTP pattern.

Hash Join + Seq Scans

Hash Join
  -> Seq Scan orders
  -> Seq Scan users

Classic analytics pattern.

8️⃣ Interview-ready answer (clean & short)

“A full scan reads the entire table and is efficient when many rows are needed or the table is small.
An index scan uses an index to locate matching rows and is efficient for selective queries.
The optimizer chooses between them based on statistics, selectivity, and I/O cost — sometimes a full scan is faster than using an index.”

9️⃣ Senior bonus insight ⭐

If someone says:

“Index scan is always faster”

🚩 That’s a red flag.

Correct mindset:

Indexes reduce search, not work.


SQL.How does the DBMS choose the plan ?

SELECT *
FROM orders o
JOIN users u ON o.user_id = u.id;

What the optimizer considers

1) Sizes and selectivity

  • If orders is huge and users is smaller → hash join is often cheapest (build hash on users, probe with orders).
  • If orders is filtered down to a few rows (e.g., WHERE o.id = ?) → nested loop is usually best.

2) Indexes

  • If there’s an index on orders.user_id and users.id and the planner expects few matching rows → nested loop + index lookup can win.
  • If both sides can be read in sorted order (via indexes) and planner thinks it’s cheaper than hashing → merge join can win.

3) Memory (work_mem) and spill risk

  • Hash join needs memory. If it expects spill to disk (big hash) it may prefer merge join (sort + merge) or nested loop.
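
You can see this on your own instance: the Hash node in EXPLAIN ANALYZE reports buckets, batches, and memory usage, and more than one batch means the hash spilled to disk (the work_mem value below is just an example):

SHOW work_mem;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN users u ON o.user_id = u.id;
-- Look at the Hash node in the output:
--   Buckets: ...  Batches: 1  Memory Usage: ...   -> hash fits in memory
--   Batches: 8 (or more)                          -> spilled to disk

-- For experiments only: give this session more memory and re-check the plan
SET work_mem = '256MB';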

4) Existing order

  • If one or both inputs are already coming out sorted for other reasons (e.g., index scan for a filter), merge join becomes attractive.

5) Statistics

If stats are stale or wrong, the planner may pick a “weird” plan.


Typical outcomes in real life

✅ Most common for orders JOIN users in OLTP

Nested loop (especially if query is selective):

SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.id = 123;

Plan idea:

  • Index lookup on orders.id → 1 row
  • For that row, index lookup users.id → 1 row
    Nested loop is perfect here.

✅ Common for big scans / analytics

Hash join:

  • Seq Scan orders
  • Hash users
  • Hash Join

✅ Sometimes happens

Merge join:

  • if both sides can be streamed sorted (or cheaply sorted)
  • or hash join is estimated expensive / memory constrained

How to prove it on your DB (Postgres)

Run:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN users u ON o.user_id = u.id;

You’ll see one of:

  • Hash Join
  • Nested Loop
  • Merge Join

To test alternatives:

SET enable_mergejoin = off;
SET enable_hashjoin  = off;
SET enable_nestloop  = off;

(only for experiments, not production)


Interview-ready phrasing

“Equality join doesn’t imply hash join. Hash join is one possible physical algorithm. Postgres may choose nested loop if it expects few rows and has indexes, or merge join if inputs are sorted or hashing is costly. The choice depends on statistics, indexes, and memory.”


SQL.JoinExamples

1️⃣ Equality joins (= joins)

Example 1: classic PK → FK join

SELECT *
FROM orders o
JOIN users u ON o.user_id = u.id;

Join condition: =

Can use:

  • Nested Loop + index
  • Hash Join
  • Merge Join

Most common join type in OLTP

Example 2: joining by business key

SELECT *
FROM products p
JOIN inventory i ON p.sku = i.sku;

Again:

  • Pure equality
  • Ideal for hash join if tables are large
  • Ideal for nested loop if index exists and one side is small

Key property

Equality joins are the most optimizer-friendly joins.

2️⃣ Range joins (BETWEEN, intervals)

Example 1: time ranges

SELECT *
FROM events e
JOIN shifts s
  ON e.event_time BETWEEN s.start_time AND s.end_time;

This is a range join:

  • One row matches a range in another table

Algorithms:

  • ❌ Hash join → cannot be used
  • ✅ Merge join
  • ⚠️ Nested loop (often expensive)

Example 2: pricing tiers

SELECT *
FROM orders o
JOIN price_ranges p
  ON o.amount BETWEEN p.min_amount AND p.max_amount;

Classic analytics / billing scenario.

Key property

Range joins require ordering, not hashing.

3️⃣ > / < joins (inequality joins)

Example 1: temporal precedence

SELECT *
FROM trades t1
JOIN trades t2
  ON t1.trade_time < t2.trade_time;

Example 2: hierarchy / ranking

SELECT *
FROM employees e1
JOIN employees e2
  ON e1.salary > e2.salary;

Algorithms:

  • ❌ Hash join → impossible
  • ✅ Merge join
  • ⚠️ Nested loop (often disastrous)

Key property

Any join that is not = immediately removes hash join from consideration.

4️⃣ What are “statistics” for the optimizer?

Statistics are metadata about the data, used to estimate query cost.

Typical stats:

  • Table row count
  • Column value distribution
  • Number of distinct values
  • NULL fraction
  • Min / max values
  • Histograms (value frequency buckets)
  • Correlation between column order and physical storage

Example: why stats matter

SELECT *
FROM orders
WHERE status = 'CANCELLED';

If optimizer knows:

  • status has 5 values
  • 'CANCELLED' = 0.1% of rows

→ Index scan is cheap.

If stats are wrong:

  • Planner thinks 50%
    → Full table scan chosen
    → Query suddenly slow

5️⃣ How statistics are collected

PostgreSQL (typical flow)

  1. Run ANALYZE manually:

ANALYZE orders;

  2. Or automatically via autovacuum, controlled by:

autovacuum_analyze_threshold
autovacuum_analyze_scale_factor

  3. The DB samples rows (not a full scan)
  4. Builds histograms and estimates

Important:

  • Stats are approximate
  • Based on sampling
  • Can be stale after massive data changes

When stats become wrong

  • Bulk INSERT / DELETE
  • Skewed data growth
  • Partition pruning issues
  • Changing data patterns (hot vs cold values)

Senior habit:

If query plan looks insane — check statistics first.
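
A practical sketch of that habit in Postgres (orders is the running example; the statistics target below is illustrative):

-- When were statistics last refreshed, and how much changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Refresh manually after a bulk load
ANALYZE orders;

-- For a skewed column, collect a larger sample before re-analyzing
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;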

6️⃣ Who should presort data for a merge join?

This is a great interview trick question.

Answer:

The database engine decides.
It will either:

  • use already sorted data (via index scan), or
  • sort the data itself (explicit sort step)

Case 1: Data already sorted (best case)

CREATE INDEX idx_orders_created_at ON orders(created_at);

SELECT *
FROM orders o
JOIN shipments s
  ON o.created_at < s.shipped_at;

Planner may choose:

  • Index scan → already sorted
  • Merge join without extra sort

Case 2: Data not sorted

Planner inserts steps:

Seq Scan → Sort → Merge Join

Sorting cost can dominate the query.

❗ Should developers presort with ORDER BY?

No

SELECT *
FROM orders o
JOIN shipments s
  ON o.created_at < s.shipped_at
ORDER BY o.created_at;

  • ORDER BY is applied after the join
  • It does NOT help the merge join
  • It often makes things worse

Correct senior thinking

  • Developers:
    • create indexes
    • write sargable predicates
  • Optimizer:
    • chooses sort vs index vs hash

SQL is declarative — you describe what, not how.


7️⃣ Interview-ready summary (say this cleanly)

“Equality joins use = and can be executed with nested loop, hash join, or merge join.
Range and inequality joins (BETWEEN, <, >) cannot use hash join and rely on merge join or nested loops.
The optimizer chooses algorithms based on statistics like row counts and value distributions, which are collected via ANALYZE or autovacuum.
For merge joins, the database decides whether to use existing order from indexes or to sort data itself — developers should not manually presort.”


SQL.What are join algos ?

Short definition (interview-ready)

Join algorithms are the physical strategies a database engine uses to execute a JOIN efficiently, depending on data size, indexes, memory, and join condition.

Logical JOIN (INNER / LEFT / …) ≠ how it is executed.


The Big Three (must-know)

1️⃣ Nested Loop Join (NLJ)

How it works

for each row in A:
    for each row in B:
        if join_condition matches:
            emit row

Variants

  • Simple nested loop
  • Index nested loop (very important)

When used

  • Small tables
  • One side is tiny
  • Index exists on join column

Cost

  • O(N × M) (bad without index)
  • With index: O(N × log M)

Typical usage

  • OLTP queries
  • PK–FK joins with index

Interview insight

Nested loop is slow only if the inner side is not indexed.
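
A minimal sketch of what makes the inner side cheap (assuming the usual orders/users schema):

-- Index on the FK column turns the inner lookup into an index probe
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Selective query: one users row, then one index probe per matching order
EXPLAIN
SELECT *
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.id = 42;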

2️⃣ Hash Join (workhorse of analytics)

How it works

  1. Build hash table from smaller input
  2. Probe with rows from larger input
build hash(B.key → row)
for each row in A:
    lookup hash(A.key)

When used

  • Large tables
  • Equality joins (= only)
  • No useful indexes

Cost

  • O(N + M)
  • Needs memory (or spills to disk)

Cannot be used for

  • <, >, range joins
  • Non-equality predicates

Typical usage

  • Data warehouses
  • Big scans
  • Batch analytics

Interview insight

Hash join is fast but memory-hungry.

3️⃣ Merge Join (sorted join)

How it works

  1. Both inputs are sorted by join key
  2. Walk through them like merge-sort
A: 1 3 5 7
B: 3 4 5 8
→ walk in order

When used

  • Inputs already sorted
  • Index scan provides sorted order
  • Range joins supported

Cost

  • O(N + M)
  • Sorting cost if not already sorted

Typical usage

  • When both sides have B-tree indexes
  • Large ordered scans

Interview insight

Merge join is great when data is already ordered — otherwise sorting kills it.

Summary table

Algorithm   | Needs Index | Needs Memory | Supports = | Supports <, > | Typical Use
------------|-------------|--------------|------------|---------------|-----------------
Nested Loop | Optional    | Low          | Yes        | Yes           | OLTP, small data
Hash Join   | No          | High         | Yes        | No            | Large analytics
Merge Join  | Optional    | Medium       | Yes        | Yes           | Ordered data

How the optimizer chooses (important)

The query planner considers:

  • Table sizes (statistics)
  • Index availability
  • Join condition (= vs range)
  • Memory (work_mem, join_buffer)
  • Cost estimates

👉 You don’t choose the algorithm directly (usually).

You influence it via:

  • Indexes
  • Query shape
  • Hints (some DBs)
  • Statistics freshness

Real backend implications (this is where seniors shine)

🔴 Bad schema → bad join algorithm

  • Missing index → hash join instead of nested loop
  • Nullable FK → broken plans
  • Wrong statistics → terrible join order

🔴 ORM-generated SQL

  • Too many joins → hash join explosion
  • Accidental cartesian joins
  • Unpredictable execution plans

PostgreSQL-specific notes (bonus)

Postgres supports:

  • Nested Loop
  • Hash Join
  • Merge Join

You can see which one is used via:

EXPLAIN ANALYZE

And even disable one for testing:

SET enable_hashjoin = off;

Interview-ready final answer (say this smoothly)

“JOIN algorithms are the physical execution strategies used by the database engine.
The main ones are nested loop, hash join, and merge join.
Nested loop is good for small or indexed joins, hash join is efficient for large equality joins but uses memory, and merge join works well when inputs are already sorted.
The optimizer chooses based on statistics, indexes, and join conditions.”


SQL.What is the problem with using NULL in the IN operator ?

This is a classic SQL trap and a favorite interview question. Let’s break it down cleanly and precisely.


Short answer (what interviewers want to hear)

IN breaks when the subquery contains NULL because SQL uses three-valued logic.
Any comparison with NULL becomes UNKNOWN, which can make the whole predicate evaluate to false.


The core problem: SQL is NOT boolean

SQL has three-valued logic:

  • TRUE
  • FALSE
  • UNKNOWN ← caused by NULL

And WHERE keeps only rows where the condition is TRUE.


What people expect vs what actually happens

Example data

orders
-------
user_id
-------
1
2
NULL

Query

SELECT *
FROM users
WHERE id IN (SELECT user_id FROM orders);

Logical expansion (important!)

id = 1 OR id = 2 OR id = NULL

Now remember:

id = NULL  --> UNKNOWN

So the expression becomes:

(TRUE OR FALSE OR UNKNOWN)

Since TRUE OR UNKNOWN is still TRUE, plain IN usually behaves as expected: matching rows pass, and non-matching rows evaluate to UNKNOWN, which WHERE filters out just like FALSE.

The REAL killer: NOT IN

SELECT *
FROM users
WHERE id NOT IN (SELECT user_id FROM orders);

Expansion

id <> 1 AND id <> 2 AND id <> NULL

But:

id <> NULL --> UNKNOWN

So the whole condition becomes:

TRUE AND TRUE AND UNKNOWN --> UNKNOWN

And UNKNOWN is filtered out by WHERE.

Result: ❌ ZERO ROWS

Even though logically you expected some users.
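
A self-contained demo you can run to see it (temp tables, illustrative data):

CREATE TEMP TABLE users  (id int);
CREATE TEMP TABLE orders (user_id int);

INSERT INTO users  VALUES (1), (2), (3);
INSERT INTO orders VALUES (1), (NULL);

-- Zero rows: the comparison with NULL makes the predicate UNKNOWN for users 2 and 3
SELECT * FROM users
WHERE id NOT IN (SELECT user_id FROM orders);

-- Returns users 2 and 3, as intended
SELECT * FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);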

Why this happens (formal explanation)

  • IN / NOT IN is syntactic sugar for comparisons
  • NULL means “unknown value”, not “empty”
  • SQL cannot decide truth → returns UNKNOWN
  • WHERE only keeps TRUE

Why EXISTS works correctly

SELECT *
FROM users u
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.id
);

Key difference:

  • EXISTS checks row existence
  • No equality comparison with NULL
  • No three-valued logic explosion

This is why seniors say:

“Use EXISTS, not IN, for subqueries.”

Safe alternatives (this is important)

✅ Option 1: Filter NULLs explicitly

WHERE id IN (
  SELECT user_id
  FROM orders
  WHERE user_id IS NOT NULL
)

✅ Option 2: Use EXISTS (preferred)

WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.id
)

❌ Avoid

NOT IN (subquery_that_may_return_null)

Interview-ready crisp answer (2–3 sentences)

“The problem with IN is that if the subquery returns NULL, SQL’s three-valued logic turns comparisons into UNKNOWN.
This is especially dangerous with NOT IN, which can filter out all rows unexpectedly.
That’s why EXISTS is safer for correlated subqueries.”


Senior-level bonus insight ⭐

This bug often appears when:

  • Foreign keys are nullable (bad schema design)
  • Legacy tables violate relational integrity
  • Developers rely on ORM-generated SQL

👉 Another reason why proper foreign keys + NOT NULL constraints matter.


SQL.How does the relational model differ from object-oriented models, and why does this matter for backend development?

1️⃣ What the relational model is (Codd)

The relational model describes data as:

  • Relations (tables) → unordered sets of tuples (rows)
  • Attributes → typed columns
  • Primary keys → tuple identity
  • Foreign keys → integrity between relations
  • Declarative operations (relational algebra → SQL)

Key idea:

Data is independent of how it is accessed or stored

There is no navigation, only set-based queries.

2️⃣ What the object-oriented model is

The OO model represents the world as:

  • Objects with identity
  • Encapsulation (state + behavior)
  • References / pointers
  • Inheritance & polymorphism
  • Graph navigation (order.getCustomer().getAddress())

Key idea:

Data is accessed by navigating object graphs

3️⃣ Core differences (this is the heart)

Aspect      | Relational Model          | Object-Oriented Model
------------|---------------------------|----------------------------------
Identity    | Value-based (primary key) | Reference-based (object identity)
Structure   | Flat relations            | Object graphs
Access      | Declarative (SQL)         | Navigational
Duplication | Normalized, avoided       | Common (denormalized fields)
Integrity   | Enforced by DB            | Enforced by code
Behavior    | Data only                 | Data + behavior
Ordering    | Unordered sets            | Ordered collections

4️⃣ The Object-Relational Impedance Mismatch

This mismatch is fundamental, not accidental.

Example

Order order = orderRepository.findById(id);
String city = order.getCustomer().getAddress().getCity();

This looks like one operation in OO.

But in relational terms:

  • orders
  • customers
  • addresses

and touching them means multiple joins (or multiple queries), lazy loading, N+1 risks, transaction boundaries, and caching issues.

Objects want graphs. Databases want sets.
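
For illustration, roughly what that single line of navigation can turn into at the SQL level (table and column names are assumptions): either one query per hop, or one set-based join.

-- Lazy loading: one query per hop through the object graph
SELECT * FROM orders    WHERE id = :order_id;
SELECT * FROM customers WHERE id = :customer_id;
SELECT * FROM addresses WHERE id = :address_id;

-- Set-based alternative: one round trip
SELECT a.city
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN addresses a ON a.id = c.address_id
WHERE o.id = :order_id;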

5️⃣ Why this matters for backend developers

🔴 Performance traps

  • N+1 queries
  • Over-fetching / under-fetching
  • Accidental Cartesian products
  • LazyInitializationException

If you think OO only, you will:

  • write slow code
  • blame the database
  • add caches instead of fixing queries

🔴 Transaction boundaries

  • DB transactions operate on rows
  • ORM sessions operate on object graphs

Mistake:

“I loaded the entity, why is it detached now?”

Correct thinking:

“The DB transaction ended, the object graph is no longer managed.”


🔴 Equality & identity bugs

entityA.equals(entityB)
  • OO: reference vs logical equality
  • DB: primary key identity

Classic bugs:

  • equals/hashCode before ID assigned
  • composite keys
  • mutable keys in HashSet

🔴 Schema vs domain design

Bad approach:

“Let’s design entities first and generate schema.”

Better approach:

“Design the relational model first, then map a domain model on top.”

Why?

  • DB is long-lived
  • Code is disposable
  • Schema mistakes are expensive

6️⃣ How senior engineers think about this

Senior backend engineers separate concerns:

  • DB model → optimized for consistency, querying, constraints
  • Domain model → optimized for behavior and use cases
  • DTOs → optimized for API contracts
  • ORM → mapping tool, not a magic layer

ORM is a convenience, not a leak-free abstraction layer

7️⃣ Interview-ready short answer (say this confidently)

“The relational model is set-based and declarative, focusing on data integrity and relations between values, while the object-oriented model is navigational and reference-based, focusing on behavior and object graphs.
This mismatch matters because ORMs can hide performance costs, transaction boundaries, and identity issues.
As a backend developer, you must think relationally when designing schemas and queries, and use OO models consciously on top — not assume they map naturally.”

8️⃣ Bonus: red flags interviewers listen for 🚩

If a candidate says:

  • “Hibernate handles it”
  • “The ORM will optimize it”
  • “We don’t need foreign keys”
  • “Objects map 1-to-1 to tables”

👉 That’s mid-level thinking.


SQL.What guarantees does the relational model provide, and what does SQL not guarantee?

What the relational model guarantees

The relational model defines what states of data are valid.

These guarantees are logical, not performance-related.


1️⃣ Set-based data (no duplicates by default)

  • Relations are sets, not lists
  • Rows have no inherent order
  • Duplicates are eliminated conceptually

Example:

SELECT a FROM t;

Logically:

  • Result is an unordered set

➡️ Order only exists if explicitly defined.

2️⃣ Entity identity via keys

The relational model guarantees:

  • Every row can be uniquely identified
  • Keys uniquely identify tuples

This is why:

  • Candidate keys exist
  • Primary keys exist

Without keys → relation is not well-defined.

3️⃣ Referential integrity

Relationships between relations are:

  • Explicit
  • Enforceable
  • Consistent

A foreign key guarantees:

  • Child rows reference existing parent rows
  • Orphan rows are impossible

This is a core relational guarantee.

4️⃣ Integrity constraints

The model guarantees enforcement of:

  • Domain constraints (NOT NULL, type rules)
  • Key constraints (PRIMARY KEY, UNIQUE)
  • Referential constraints (FOREIGN KEY)

Invalid states are unrepresentable.
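
A small sketch of declaring these guarantees (schema is illustrative):

CREATE TABLE users (
    id    bigint PRIMARY KEY,        -- key constraint: identity
    email text NOT NULL UNIQUE       -- domain + key constraints
);

CREATE TABLE orders (
    id      bigint  PRIMARY KEY,
    user_id bigint  NOT NULL REFERENCES users(id),   -- referential integrity
    amount  numeric NOT NULL CHECK (amount > 0)      -- domain constraint
);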

5️⃣ Declarative semantics

You specify:

  • What result you want
  • Not how to get it

This enables:

  • Query optimization
  • Execution plan changes
  • Storage evolution

6️⃣ Closure (important theory point)

Relational operations:

  • Take relations as input
  • Produce relations as output

This allows:

  • Query composition
  • Subqueries
  • Views

What SQL does NOT guarantee

SQL is a language, not a correctness oracle.

Many things developers assume are explicitly not guaranteed.


❌ 1️⃣ Row order (unless ORDER BY)

SELECT * FROM users;

❌ No guaranteed order
✔ Only ORDER BY defines order

Indexes do not guarantee order.

❌ 2️⃣ Execution order of clauses

Logical SQL order ≠ execution order:

SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...

You cannot rely on evaluation order inside WHERE.


❌ 3️⃣ Deterministic evaluation of expressions

WHERE f(x) AND g(x)

SQL does not guarantee:

  • Left-to-right evaluation
  • Short-circuiting

The optimizer may reorder conditions.

❌ 4️⃣ Absence of duplicates (unless enforced)

SELECT col FROM table;

  • Duplicates may appear
  • SQL works on bags (multisets)

Only DISTINCT or constraints remove duplicates.

❌ 5️⃣ Consistency without constraints

SQL does not guarantee:

  • Uniqueness without UNIQUE
  • Referential integrity without FOREIGN KEY
  • Valid ranges without CHECK

If you don’t declare it, SQL won’t enforce it.

❌ 6️⃣ Stable execution plans

The same query:

  • May use different indexes tomorrow
  • May switch join algorithms
  • May behave differently after stats change

Execution plans are not stable contracts.

❌ 7️⃣ Isolation without transactions

SQL does not guarantee:

  • Atomicity
  • Isolation
  • Consistency

Unless:

BEGIN;
...
COMMIT;

Even then — isolation level matters.

Big mental model (interview gold)

Relational Model          | SQL
--------------------------|----------------------------
Defines valid data states | Defines how to query data
Guarantees correctness    | Makes no implicit promises
Based on sets             | Operates on multisets
Orderless                 | Order must be declared
Declarative               | Optimizer-controlled

One-sentence interview answer (perfect)

The relational model guarantees data correctness through keys, constraints, referential integrity, and declarative semantics, while SQL does not guarantee row order, execution order, uniqueness, consistency, or stable plans unless they are explicitly specified.

Common interview traps ❌

  • Assuming index order = result order
  • Relying on WHERE clause evaluation order
  • Assuming SQL enforces business rules automatically
  • Confusing logical correctness with physical execution