Database.Middle.What is a hash join?

A hash join is a join algorithm used by database engines to efficiently combine rows from two tables based on a join condition, typically equality (e.g., A.id = B.id).

🧠 How a Hash Join Works

It operates in two main phases:

🔨 1. Build Phase

  • The DB scans the smaller table (called the build input) and builds an in-memory hash table using the join key.
  • Each row is inserted into a hash bucket based on the hash of its join key.

🔍 2. Probe Phase

  • The DB then scans the larger table (called the probe input).
  • For each row, it computes the hash of the join key and probes the hash table to find matches.

🧩 Example:

SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;

If customers is smaller, it becomes the build table.

The DB hashes all customers.id values into a hash table.

Then it scans orders and for each orders.customer_id, looks it up in the hash table.

✅ When Hash Join Is Preferred

ScenarioWhy Hash Join Works Well
Join condition is equality (=)Works best for equi-joins
One table is smallFits easily into memory as hash map
No useful index existsDoesn’t rely on indexes
Parallel processingHash joins are easily parallelizable

🔁 Comparison with Other Join Types

Join TypeWorks Best When…Notes
Nested LoopOne table is very small or has an indexSimple but slow for large joins
Merge JoinBoth inputs are sorted on the join keyRequires sorted inputs or extra sorting
Hash JoinEqui-join with at least one small or unsorted inputGreat for large, unsorted datasets

❌ Limitations of Hash Join

  • Can use a lot of memory if the build table is large.
  • For very large joins, may spill to disk (slows down performance).
  • Not suited for non-equality joins (e.g., <, BETWEEN).

📦 Variants

  • In-memory hash join – fully in memory (fastest)
  • Grace hash join – spills to disk in partitions if the build table doesn’t fit in memory
  • Hybrid hash join – mixes in-memory and partitioning
This entry was posted in Без рубрики. Bookmark the permalink.