Database.Advanced.What is bloom filter

🧠 Key Concept

A Bloom filter is a bit array of fixed size, with multiple hash functions. You use it to:

  • Insert: Map an item to a few bit positions and set those to 1.
  • Query: Check if all the bit positions for an item are 1.

This means:

  • If even one bit is 0, the item is definitely not in the set.
  • If all bits are 1, the item might be in the set (→ false positive possible).

⚙️ Structure:

  • Bit array bits[] of size m (e.g., 10,000 bits, all initialized to 0).
  • k independent hash functions (e.g., 3 different hashes per item).

✅ Example — Step-by-Step

Let’s say:

  • Bit array size m = 10
  • Number of hash functions k = 3
  • Hash functions output indexes from 0 to 9

1. Insert item “apple”

Suppose the 3 hash functions give:

  • hash1("apple") % 10 = 1
  • hash2("apple") % 10 = 4
  • hash3("apple") % 10 = 7

Now set bits at positions 1, 4, and 7:

Index:   0 1 2 3 4 5 6 7 8 9
Bits:    0 1 0 0 1 0 0 1 0 0

2. Insert item “grape”

Hash outputs:

  • hash1("grape") % 10 = 2
  • hash2("grape") % 10 = 4
  • hash3("grape") % 10 = 5

Set bits at positions 2, 4, and 5:

Index:   0 1 2 3 4 5 6 7 8 9
Bits:    0 1 1 0 1 1 0 1 0 0

3. Check if “apple” is in the set

Compute same hashes:

  • 1, 4, 7 → bits are 1, 1, 1 → ✅ Might be in set (probably is)

4. Check if “peach” is in the set

Hash outputs:

  • hash1("peach") % 10 = 3 → bit 3 is 0

So: ❌ “peach” is definitely not in the set

🔄 Why False Positives?

Now suppose you check “lemon” and the hash values happen to hit 1, 2, 4 — all of which are 1 due to other elements, not because of “lemon”. So:

→ Bloom filter returns “maybe in the set”, even though “lemon” was never added.

This is called a false positive.

🧮 Formula for False Positive Probability

The false positive rate can be approximated by:

Where:

  • n = number of inserted elements
  • k = number of hash functions
  • m = number of bits

💡 Real World Usage:

  • Web browsers: “Have I seen this URL before?”
  • Databases: “Should I even try looking on disk?”
  • Distributed systems: Avoid network requests for absent keys
  • Blockchain: Bitcoin uses it to filter transactions

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It is fast and memory-efficient, but allows false positives (it may say “yes” even if the element isn’t in the set) — however, false negatives are impossible (if it says “no”, the element is definitely not in the set).


🔧 How it works:

  1. A Bloom filter uses:
    • A bit array of size m (initially all 0s).
    • k independent hash functions, each maps an input to one of the m bits.
  2. To add an item:
    • Hash the item with each of the k hash functions.
    • Set each resulting bit position in the array to 1.
  3. To check if an item is in the set:
    • Hash the item with the same k functions.
    • If all k corresponding bits are 1, the item may be in the set.
    • If any bit is 0, the item is definitely not in the set.

✅ Example:

  • Add “apple” → hash1, hash2, …, set bits 10, 23, 45 to 1.
  • Add “banana” → set bits 7, 23, 90.
  • Check “apple” → bits 10, 23, 45 → all 1 → “maybe in set”
  • Check “grape” → bit 18 is 0 → “definitely not in set”

⚖️ Pros and Cons:

ProsCons
Very space-efficientAllows false positives
Fast insert and lookupCannot delete elements (basic version)
No need to store actual dataNeeds tuning for optimal m and k

🧠 Applications:

  • Database systems (e.g., avoid unnecessary disk reads)
  • Web caching
  • Spell checkers
  • Network intrusion detection
  • Big data systems (e.g., Apache HBase, Apache Cassandra, Google Bigtable)
This entry was posted in Без рубрики. Bookmark the permalink.