SQL.What are join algos ?

Short definition (interview-ready)

Join algorithms are the physical strategies a database engine uses to execute a JOIN efficiently, depending on data size, indexes, memory, and join condition.

Logical JOIN (INNER / LEFT / …) ≠ how it is executed.


The Big Three (must-know)

1️⃣ Nested Loop Join (NLJ)

How it works

for each row in A:
    for each row in B:
        if join_condition matches:
            emit row

Variants

  • Simple nested loop
  • Index nested loop (very important)

When used

  • Small tables
  • One side is tiny
  • Index exists on join column

Cost

  • O(N × M) (bad without index)
  • With index: O(N × log M)

Typical usage

  • OLTP queries
  • PK–FK joins with index

Interview insight

Nested loop is slow only if the inner side is not indexed.

2️⃣ Hash Join (workhorse of analytics)

How it works

  1. Build hash table from smaller input
  2. Probe with rows from larger input
build hash(B.key → row)
for each row in A:
    lookup hash(A.key)

When used

  • Large tables
  • Equality joins (= only)
  • No useful indexes

Cost

  • O(N + M)
  • Needs memory (or spills to disk)

Cannot be used for

  • <, >, range joins
  • Non-equality predicates

Typical usage

  • Data warehouses
  • Big scans
  • Batch analytics

Interview insight

Hash join is fast but memory-hungry.

3️⃣ Merge Join (sorted join)

How it works

  1. Both inputs are sorted by join key
  2. Walk through them like merge-sort
A: 1 3 5 7
B: 3 4 5 8
→ walk in order

When used

  • Inputs already sorted
  • Index scan provides sorted order
  • Range joins supported

Cost

  • O(N + M)
  • Sorting cost if not already sorted

Typical usage

  • When both sides have B-tree indexes
  • Large ordered scans

Interview insight

Merge join is great when data is already ordered — otherwise sorting kills it.

Summary table

AlgorithmNeeds IndexNeeds MemorySupports =Supports <,>Typical Use
Nested LoopOptionalLowOLTP, small data
Hash JoinHighLarge analytics
Merge JoinOptionalMediumOrdered data

How the optimizer chooses (important)

The query planner considers:

  • Table sizes (statistics)
  • Index availability
  • Join condition (= vs range)
  • Memory (work_mem, join_buffer)
  • Cost estimates

👉 You don’t choose the algorithm directly (usually).

You influence it via:

  • Indexes
  • Query shape
  • Hints (some DBs)
  • Statistics freshness

Real backend implications (this is where seniors shine)

🔴 Bad schema → bad join algorithm

  • Missing index → hash join instead of nested loop
  • Nullable FK → broken plans
  • Wrong statistics → terrible join order

🔴 ORM-generated SQL

  • Too many joins → hash join explosion
  • Accidental cartesian joins
  • Unpredictable execution plans

PostgreSQL-specific notes (bonus)

Postgres supports:

  • Nested Loop
  • Hash Join
  • Merge Join

You can see which one is used via:

EXPLAIN ANALYZE

And even disable one for testing:

SET enable_hashjoin = off;

Interview-ready final answer (say this smoothly)

“JOIN algorithms are the physical execution strategies used by the database engine.
The main ones are nested loop, hash join, and merge join.
Nested loop is good for small or indexed joins, hash join is efficient for large equality joins but uses memory, and merge join works well when inputs are already sorted.
The optimizer chooses based on statistics, indexes, and join conditions.”

This entry was posted in Без рубрики. Bookmark the permalink.