Probabilistic Data Structures in Redis: HyperLogLog and Bloom Filters
A deep dive into the mathematical principles behind HyperLogLog and its Redis implementation
This article dives deep into the mathematical principles of HyperLogLog and its Redis implementation. For Redis core concepts (data structures, threading model, persistence, clustering, etc.), see Redis Related Concepts.
Bloom Filter#
- What is the principle of Bloom filters? Use cases?
- A Bloom filter implementation:
Bloom Filter Implementation
import java.util.BitSet;
public class MyBloomFilter {
/**
* Size of the bit array
*/
private static final int DEFAULT_SIZE = 2 << 24;
/**
* Through this array, 6 different hash functions can be created
*/
private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};
/**
* Bit array. Elements in the array can only be 0 or 1
*/
private BitSet bits = new BitSet(DEFAULT_SIZE);
/**
* Array storing classes containing hash functions
*/
private SimpleHash[] func = new SimpleHash[SEEDS.length];
/**
* Initialize array of classes containing hash functions, each class has a different hash function
*/
public MyBloomFilter() {
// Initialize multiple different Hash functions
for (int i = 0; i < SEEDS.length; i++) {
func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
}
}
/**
* Add element to bit array
*/
public void add(Object value) {
for (SimpleHash f : func) {
bits.set(f.hash(value), true);
}
}
/**
* Check if specified element exists in bit array
*/
public boolean contains(Object value) {
boolean ret = true;
for (SimpleHash f : func) {
ret = ret && bits.get(f.hash(value));
}
return ret;
}
/**
* Static inner class for hash operations
*/
public static class SimpleHash {
private int cap;
private int seed;
public SimpleHash(int cap, int seed) {
this.cap = cap;
this.seed = seed;
}
/**
* Calculate hash value
*/
public int hash(Object value) {
int h;
return (value == null) ? 0 : Math.abs((cap - 1) & seed * ((h = value.hashCode()) ^ (h >>> 16)));
}
}
}javaHyperLogLog#
- Understanding the principle of HyperLogLog juejin ↗
- First, data is converted to a bit string through a hash function (Redis uses 64 bits, so it can represent a range of ). 0 represents the reverse side of a coin, otherwise it’s heads. You can estimate how many experiments were conducted in total based on the maximum number of times heads appeared in multiple coin flip experiments. Similarly, you can estimate how much data was stored based on the maximum position k_max where 1 appeared in the stored data after conversion.
- In Redis, HyperLogLog is set to: m(buckets)=16834, p=6, L=16834 * 6. Memory usage = 16834 * 6 / 8 / 1024 = 12K
- When storing, value is hashed to 64 bits, i.e., a 64-bit bit string. The first 14 bits (14= ) are used for bucketing. Converting the first 14 bits from binary to decimal gives the bucket number. Since the complete value bit string is 64 bits, after subtracting 14, 50 bits remain. In extreme cases, the position where 1 appears is at position 50, i.e., position is 50. At this point index = 50. Then convert index to binary, which is: 110010 (can be stored)
- Following the above approach, different values will be set to different buckets. If they end up in the same bucket, i.e., the first 14 bits are the same but the position where 1 appears later is different, then compare whether the original index is larger than the new index. If yes, replace. If no, keep unchanged.
- Finally, all 16384 buckets corresponding to a key have been set with many values, and each bucket has a k_max. When calling pfcount, according to the estimation method introduced earlier, you can calculate how many times the key has been set with values, which is the statistical value.
- Value is converted to a 64-bit bit string and finally recorded in each bucket according to the above approach. 64 bits converted to decimal is: 2^64. HyperLogLog only uses: 16384 * 6 /8 / 1024 K of storage space to count up to 2^64 numbers.
- How to estimate the value? Formula? Gemini ↗
- For multiple rounds of experiments: Initial formula (m is the number of experiment rounds, R is the harmonic mean: ):
- The ordinary average is: (k_max_1 + … + k_max_m)/m=
- For the final statistical formula:
About the derivation of for HyperLogLog: deepseek ↗
I. Definition of Bernoulli Trial and Meaning of Probability #
Bernoulli trial is defined as “continuously flipping a coin until heads appears, recording the number of flips required”. This “Bernoulli trial” actually belongs to the category of geometric distribution. Specifically:
- Success probability of a single trial: The probability of getting heads on each coin flip is , and tails is .
- Probability of number of flips : If the first heads appears on the -th flip, then the first flips are all tails, and the -th flip is heads. Therefore: This is the probability mass function of the geometric distribution.
II. Maximum Likelihood Estimation Derivation Process#
1. Problem Modeling#
- Goal: Estimate the value of by observing the maximum number of flips in trials.
- Key probability formula: The probability of observing a maximum number of flips of is:
Explanation:
- First term : All trials have flip counts not exceeding .
- Second term : All trials have flip counts not exceeding .
- Subtracting the two gives “all trial counts do not exceed , and at least one equals ” probability.
2. Log-Likelihood Function#
To simplify calculation, take the natural logarithm of the probability:
3. Taylor Expansion Approximation#
When is large, is very small, and we can use the approximation (first-order Taylor expansion):
Substituting, the probability simplifies to:
4. Maximizing the Likelihood Function#
Let , then the probability expression becomes:
Take the derivative with respect to and set it to zero:
Solving:
Therefore:
5. Engineering Approximation#
In the problem, the constant factor is ignored, simplified to:
This is because:
- Order of magnitude matching: When is large, and are of the same order of magnitude.
- Integer estimation: In practice, needs to be an integer, taking is more concise.
III. Intuitive Verification#
Example: When #
- Theoretical estimate: .
- Actual probability calculation: Comparing probabilities for and , we find that the probability is highest when , verifying the reasonableness of the estimate.
IV. Summary#
-
Bernoulli trial probability: The probability of flipping times in a single trial is .
-
Maximum likelihood estimation: By maximizing the probability of observing , we derive .
-
Engineering simplification: Ignoring the constant factor , we get the intuitive integer relationship .