SQL.JoinExamples

1️⃣ Equality joins (= joins)

Example 1: classic PK → FK join

SELECT *
FROM orders o
JOIN users u ON o.user_id = u.id;

Join condition: =

Can use:

  • Nested Loop + index
  • Hash Join
  • Merge Join

Most common join type in OLTP

Example 2: joining by business key

SELECT *
FROM products p
JOIN inventory i ON p.sku = i.sku;

Again:

  • Pure equality
  • Ideal for hash join if tables are large
  • Ideal for nested loop if index exists and one side is small

Key property

Equality joins are the most optimizer-friendly joins.

2️⃣ Range joins (BETWEEN, intervals)

Example 1: time ranges

SELECT *
FROM events e
JOIN shifts s
  ON e.event_time BETWEEN s.start_time AND s.end_time;

This is a range join:

  • One row matches a range in another table

Algorithms:

  • ❌ Hash join → cannot
  • ✅ Merge join
  • ⚠️ Nested loop (often expensive)

Example 2: pricing tiers

SELECT *
FROM orders o
JOIN price_ranges p
  ON o.amount BETWEEN p.min_amount AND p.max_amount;

Classic analytics / billing scenario.

Key property

Range joins require ordering, not hashing.

3️⃣ > / < joins (inequality joins)

Example 1: temporal precedence

SELECT *
FROM trades t1
JOIN trades t2
  ON t1.trade_time < t2.trade_time;

Example 2: hierarchy / ranking

SELECT *
FROM employees e1
JOIN employees e2
  ON e1.salary > e2.salary;

Algorithms:

  • ❌ Hash join → impossible
  • ✅ Merge join
  • ⚠️ Nested loop (often disastrous)

Key property

Any join that is not = immediately removes hash join from consideration.

4️⃣ What are “statistics” for the optimizer?

Statistics are metadata about the data, used to estimate query cost.

Typical stats:

  • Table row count
  • Column value distribution
  • Number of distinct values
  • NULL fraction
  • Min / max values
  • Histograms (value frequency buckets)
  • Correlation between column order and physical storage

Example: why stats matter

SELECT *
FROM orders
WHERE status = 'CANCELLED';

If optimizer knows:

  • status has 5 values
  • 'CANCELLED' = 0.1% of rows

→ Index scan is cheap.

If stats are wrong:

  • Planner thinks 50%
    → Full table scan chosen
    → Query suddenly slow

5️⃣ How statistics are collected

PostgreSQL (typical flow)

  1. ANALYZE
ANALYZE orders;

2. Or automatically via autovacuum

autovacuum_analyze_threshold
autovacuum_analyze_scale_factor
  1. DB samples rows (not full scan)
  2. Builds histograms and estimates

Important:

  • Stats are approximate
  • Based on sampling
  • Can be stale after massive data changes

When stats become wrong

  • Bulk INSERT / DELETE
  • Skewed data growth
  • Partition pruning issues
  • Changing data patterns (hot vs cold values)

Senior habit:

If query plan looks insane — check statistics first.

6️⃣ Who should presort data for a merge join?

This is a great interview trick question.

Answer:

The database engine decides.
It will either:

  • use already sorted data (via index scan), or
  • sort the data itself (explicit sort step)

Case 1: Data already sorted (best case)

CREATE INDEX idx_orders_created_at ON orders(created_at);



SELECT *
FROM orders o
JOIN shipments s
  ON o.created_at < s.shipped_at;

Planner may choose:

  • Index scan → already sorted
  • Merge join without extra sort

Case 2: Data not sorted

Planner inserts steps:

Seq Scan → Sort → Merge Join

Sorting cost can dominate the query.

❗ Should developers presort with ORDER BY?

No

SELECT *
FROM orders o
JOIN shipments s
  ON o.created_at < s.shipped_at
ORDER BY o.created_at;

ORDER BY is applied after join

Does NOT help merge join

Often makes things worse

Correct senior thinking

  • Developers:
    • create indexes
    • write sargable predicates
  • Optimizer:
    • chooses sort vs index vs hash

SQL is declarative — you describe what, not how.


7️⃣ Interview-ready summary (say this cleanly)

“Equality joins use = and can be executed with nested loop, hash join, or merge join.
Range and inequality joins (BETWEEN, <, >) cannot use hash join and rely on merge join or nested loops.
The optimizer chooses algorithms based on statistics like row counts and value distributions, which are collected via ANALYZE or autovacuum.
For merge joins, the database decides whether to use existing order from indexes or to sort data itself — developers should not manually presort.”

This entry was posted in Без рубрики. Bookmark the permalink.