Package org.forgerock.bloomfilter
ForgeRock Bloom Filters
Implementations of thread-safe, scalable and rolling Bloom Filters. These are Set-like data structures that can scale to very large numbers of entries while only using a small amount of memory (a few bits) per element. The trade-off is that the set membership operation may report false positives (i.e., it may claim that an item is a member of the set when it isn't). The probability of false positives can be tuned by increasing the amount of memory used.
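As a rule of thumb (the standard Bloom Filter sizing formula, independent of this API), achieving a false positive probability p requires roughly -ln(p) / (ln 2)^2 bits per element. A minimal sketch of the arithmetic:

// Standard Bloom Filter sizing formula; illustrative only, not part of this API.
static double bitsPerElement(double falsePositiveProbability) {
    return -Math.log(falsePositiveProbability) / (Math.log(2) * Math.log(2));
}
// bitsPerElement(0.01) is roughly 9.6 bits, i.e. about 1.2 bytes per element.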
The BloomFilter interface describes the general contract of bloom filters in more detail, and the BloomFilters utility class provides static factory and builder methods for constructing bloom filters for various requirements.
Example
import java.nio.charset.StandardCharsets;
import com.google.common.hash.Funnels; // Guava funnels

BloomFilter<CharSequence> blacklistedSessions =
        BloomFilters.create(Funnels.stringFunnel(StandardCharsets.UTF_8))
                .withInitialCapacity(10000)          // Initial size
                .withCapacityGrowthFactor(2.0)       // Double size when full
                .withFalsePositiveProbability(0.01)  // 1% probability of false positives
                .withWriteBatchSize(1000)            // Batch writes
                .build();

blacklistedSessions.add("Some session token");

if (blacklistedSessions.mightContain("Some other session")) {
    // Take steps to confirm whether the token is actually blacklisted:
    // mightContain may report false positives, but never false negatives.
}
Scalable and Rolling Bloom Filters
Beyond fixed-capacity Bloom Filters, whose probability of false positives rises rapidly once they reach capacity, this package also provides scalable and rolling Bloom Filters. The former are an efficient and flexible implementation of the classic Scalable Bloom Filters paper by Almeida et al., Information Processing Letters, 101(6), pp. 255–261, 2007. The latter are a time-limited variation on this idea, whereby buckets in the scalable bloom filter can expire over time, freeing up memory; expired buckets are then recycled, keeping overall memory usage bounded.
Scalable Bloom Filters are useful for storing sets of objects where you do not know a priori the number of elements you might need to store. By dynamically expanding the capacity of the Bloom Filter, as well as reducing the false positive probability of subsequent buckets according to a geometric series, the Scalable Bloom Filter can expand to accommodate orders of magnitude more elements than originally estimated, while preserving the overall configured false positive probability. Use the BloomFilters.BloomFilterBuilder.withFalsePositiveProbabilityScaleFactor(double) and BloomFilters.BloomFilterBuilder.withCapacityGrowthFactor(double) builder methods to configure the scale factors for false positive probability and capacity in these implementations. The defaults (0.8 and 2.0 respectively) provide a good trade-off between memory growth and performance.
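For instance, a sketch of setting these factors explicitly (the values shown are just the documented defaults, and the initial capacity is illustrative):

BloomFilter<CharSequence> seenTokens =
        BloomFilters.create(Funnels.stringFunnel(StandardCharsets.UTF_8))
                .withInitialCapacity(1000)                     // initial estimate; the filter can grow well beyond this
                .withCapacityGrowthFactor(2.0)                 // each new bucket doubles the capacity
                .withFalsePositiveProbabilityScaleFactor(0.8)  // later buckets get geometrically tighter error budgets
                .withFalsePositiveProbability(0.01)            // overall false positive probability to preserve
                .build();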
Use the BloomFilters.BloomFilterBuilder.withExpiryStrategy(org.forgerock.bloomfilter.ExpiryStrategy) method to configure how elements in your Bloom Filter expire. By default, elements do not expire.
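A hedged sketch of enabling expiry: the lambda below assumes ExpiryStrategy can be satisfied by a single method mapping an element to its expiry timestamp, which is an assumption made for illustration rather than the documented contract; consult the ExpiryStrategy Javadoc for the real interface.

import java.util.concurrent.TimeUnit;

// Assumption for illustration: ExpiryStrategy<T> reduces to one method that
// returns the expiry time (epoch millis) of a newly added element.
ExpiryStrategy<CharSequence> oneHourExpiry =
        token -> System.currentTimeMillis() + TimeUnit.HOURS.toMillis(1);

BloomFilter<CharSequence> recentlySeen =
        BloomFilters.create(Funnels.stringFunnel(StandardCharsets.UTF_8))
                .withInitialCapacity(10000)
                .withExpiryStrategy(oneHourExpiry) // expired buckets can be recycled over time
                .build();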
Concurrency Strategies
The implementations provided are currently all thread-safe, and adopt a flexible approach to concurrency control. Two concurrency strategies are currently supported:
- SYNCHRONIZED - uses synchronized blocks to ensure mutual exclusion of critical sections. For fixed-capacity bloom filters, all methods are mutually exclusive. For scalable and rolling bloom filters, individual buckets are independently synchronized, allowing some additional concurrency. This implementation provides moderate read and write performance.
- COPY_ON_WRITE - provides very fast read access at the cost of extremely expensive write operations, which must make a copy of (part of) the Bloom Filter to apply any changes. Write batching (see below) can be used to amortize write costs over many operations; however, this implementation will still create additional temporary garbage and pressure on the garbage collector. Suitable for situations in which read performance (mightContain) is paramount and writes are relatively rare (and can tolerate increased latency).
Use the BloomFilters.BloomFilterBuilder.withConcurrencyStrategy(org.forgerock.bloomfilter.ConcurrencyStrategy) method to specify the concurrency strategy to use. The default is COPY_ON_WRITE.
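For example, a write-heavy deployment might select the SYNCHRONIZED strategy instead of the default (a sketch assuming the strategies above are constants on the ConcurrencyStrategy type named in the builder method):

BloomFilter<CharSequence> writeHeavy =
        BloomFilters.create(Funnels.stringFunnel(StandardCharsets.UTF_8))
                .withInitialCapacity(10000)
                .withConcurrencyStrategy(ConcurrencyStrategy.SYNCHRONIZED) // moderate reads, cheaper writes
                .build();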
Write Batching
To compensate for the relatively poor write performance of COPY_ON_WRITE concurrency (see previous section), the implementation supports write batching. When enabled, individual calls to the BloomFilter.add(java.lang.Object) method are buffered in a traditional concurrent collection class until the write batch size is reached. At this point, the buffer is flushed to the underlying bloom filter implementation in a single operation. This amortizes the cost of copying the underlying collection, at the cost of increased worst-case latencies (when the buffer is flushed) and increased memory churn. The implementation is highly optimised, supporting very high throughput of both writes and reads, and so is a good choice when throughput is paramount and occasional high write latencies can be tolerated. Use BloomFilters.BloomFilterBuilder.withWriteBatchSize(int) to enable write batching.
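To illustrate the general pattern (this is a standalone sketch of write batching, not the library's internal implementation), adds can accumulate in a concurrent queue and be drained once a full batch is reached:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative write-batching wrapper; a simplified sketch, not the library's internal code. */
final class BatchingAdder<E> {
    private final ConcurrentLinkedQueue<E> buffer = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pending = new AtomicInteger();
    private final BloomFilter<E> filter;
    private final int batchSize;

    BatchingAdder(BloomFilter<E> filter, int batchSize) {
        this.filter = filter;
        this.batchSize = batchSize;
    }

    void add(E element) {
        buffer.add(element);
        if (pending.incrementAndGet() % batchSize == 0) {
            // Drain a full batch in one pass, so the expensive copy-on-write
            // update is paid once per batch rather than once per element.
            E next;
            while ((next = buffer.poll()) != null) {
                filter.add(next); // a real flush would apply the whole batch in a single operation
            }
        }
    }
}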
Class Summary
- BloomFilter<E> - General interface contract for implementations of Bloom Filters.
- BloomFilters - Factory methods for creating bloom filters with various requirements.
- BloomFilters.BloomFilterBuilder - Builder for constructing and configuring Bloom Filter implementations.
- BloomFilters.RollingBloomFilterBuilder - Builder pattern for Rolling Bloom Filters, which are Scalable Bloom Filters whose elements can expire, allowing space to be reclaimed over time.
- BloomFilters.ScalableBloomFilterBuilder - Builder pattern for Scalable Bloom Filters.
- BloomFilterStatistics - Provides a snapshot of the current statistics and configuration of a Bloom Filter implementation.
- ConcurrencyStrategy - Strategy that determines how thread-safety of bloom filters should be managed.
- ConcurrentRollingBloomFilter - A thread-safe implementation of a Bloom Filter that can expand over time to accommodate arbitrary numbers of elements, while also allowing old elements to be deleted after they have expired.
- ExpiryStrategy - Strategy pattern for determining when elements added to a ConcurrentRollingBloomFilter should expire.