A hash join is a join algorithm used by database engines to efficiently combine rows from two tables based on a join condition, typically equality (e.g., A.id = B.id
).
🧠 How a Hash Join Works
It operates in two main phases:
🔨 1. Build Phase
- The DB scans the smaller table (called the build input) and builds an in-memory hash table using the join key.
- Each row is inserted into a hash bucket based on the hash of its join key.
🔍 2. Probe Phase
- The DB then scans the larger table (called the probe input).
- For each row, it computes the hash of the join key and probes the hash table to find matches.
🧩 Example:
SELECT *
FROM orders
JOIN customers ON orders.customer_id = customers.id;
If customers
is smaller, it becomes the build table.
The DB hashes all customers.id
values into a hash table.
Then it scans orders
and for each orders.customer_id
, looks it up in the hash table.
✅ When Hash Join Is Preferred
Scenario | Why Hash Join Works Well |
---|---|
Join condition is equality (= ) | Works best for equi-joins |
One table is small | Fits easily into memory as hash map |
No useful index exists | Doesn’t rely on indexes |
Parallel processing | Hash joins are easily parallelizable |
🔁 Comparison with Other Join Types
Join Type | Works Best When… | Notes |
---|---|---|
Nested Loop | One table is very small or has an index | Simple but slow for large joins |
Merge Join | Both inputs are sorted on the join key | Requires sorted inputs or extra sorting |
Hash Join | Equi-join with at least one small or unsorted input | Great for large, unsorted datasets |
❌ Limitations of Hash Join
- Can use a lot of memory if the build table is large.
- For very large joins, may spill to disk (slows down performance).
- Not suited for non-equality joins (e.g.,
<
,BETWEEN
).
📦 Variants
- In-memory hash join – fully in memory (fastest)
- Grace hash join – spills to disk in partitions if the build table doesn’t fit in memory
- Hybrid hash join – mixes in-memory and partitioning