hexops / xorfilter

xorfilter: Zig implementation of Xor Filters

This is a Zig implementation of Xor Filters and Fuse Filters, which are faster and smaller than Bloom and Cuckoo filters and allow for quickly checking if a key is part of a set.

Benefits of Zig implementation
Research papers
Usage
Serialization
Note about extremely large datasets
Special thanks
Changelog

Benefits of Zig implementation

The two primary algorithms of interest here are:

Xor8 (recommended, has no more than a 0.3% false-positive probability)
Fuse8 (better than xor+ variants when you have > 100 million keys)

Thanks to Zig's bit-width integers and type system, many more bit variants - any that is log2 - is supported as well via e.g. Xor(u4) or Fuse(u4):

xor4, xor16, xor32, xor64, etc.
fuse4, fuse16, etc.

Note, however, that Zig represents e.g. u4 as a full byte. The more exotic bit-widths u4, u20, etc. are primarily interesting for more compact serialization.

Research papers

Blog post: Xor Filters: Faster and Smaller Than Bloom Filters.

Xor Filters (arxiv paper):

Thomas Mueller Graf, Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122

Fuse Filters (arxiv paper), as described by @jbapple:

For large enough sets of keys, Dietzfelbinger & Walzer's fuse filters, described in "Dense Peelable Random Uniform Hypergraphs", can accomodate fill factors up to 87.9% full, rather than 1 / 1.23 = 81.3%.

Usage

Decide if you want to use Xor8 or Fuse8 (you probably want Xor8): "Should I use xor filters or fuse filters?".
Convert your keys into u64 values. If you have strings, structs, etc. then use something like Zig's std.hash_map.getAutoHashFn to convert your keys to u64 first. "It is not important to have a good hash function, but collisions should be unlikely (~1/2^64)."
Your keys must be unique, or else filter construction will fail. If you don't have unique keys, you can use the xorfilter.AutoUnique(u64)(keys) helper to deduplicate in typically O(N) time complexity, see the tests in src/unique.zig for more info.

Here is an example:

const xorfilter = @import("../xorfilter/src/main.zig")

test "mytest" {
    const allocator = std.heap.page_allocator;

    // Initialize the xor filter with room for 10000 keys.
    const size = 10000; // room for 10000 keys
    const filter = try xorfilter.Xor8.init(allocator, size);
    defer filter.deinit();

    // Generate some consecutive keys.
    var keys = try allocator.alloc(u64, size);
    defer allocator.free(keys);
    for (keys) |key, i| {
        keys[i] = i;
    }

    // If your keys are not unique, make them so:
    keys = xorfilter.Unique(u64)(keys);

    // If this fails, your keys are not unique or allocation failed.
    try filter.populate(allocator, keys[0..]);

    // Now we can quickly test for containment!
    testing.expect(filter.contain(1) == true);
}

(you can just add this project as a Git submodule in yours for now, as Zig's official package manager is still under way.)

Serialization

To serialize the filters, you only need to encode the three struct fields:

pub fn Xor(comptime T: type) type {
    return struct {
        seed: u64,
        blockLength: u64,
        fingerprints: []T,
...

T will be the chosen fingerprint size, e.g. u8 for Xor8.

Look at std.io.Writer and std.io.BitWriter for ideas on actual serialization.

(The same is true of the fuse filter, replacing blockLength with segmentLength.)

Should I use xor filters or fuse filters?

Xor8 is the recommended default, and has no more than a 0.3% false-positive probability. If you have > 100 million keys, fuse8 may be better.

My non-expert understanding is that fuse filters are more compressed and optimal than xor+ filters with extremely large sets of keys based on[1][2][3]. You should use them in place of xor+, and refer to the xor filter paper for whether or not you are at a scale that requires xor+/fuse filters.

Note that the fuse filter algorithm does require a large number of unique keys in order for population to succeed, see FastFilter/xor_singleheader#21 - if you have few (<~125k consecutive) keys creation will fail.

Note about extremely large datasets

This implementation supports key iterators, so you do not need to have all of your keys in-memory, see Xor8.populateIter and Fuse8.populateIter.

If you intend to use a xor filter with datasets of 100m+ keys, there is a possible faster implementation for construction found in the C implementation xor8_buffered_populate which is not yet implemented here.

Benchmarks

Benchmarks were ran on both a 2019 Macbook Pro and Windows 10 (WSL2) desktop machine using e.g.:

zig run -O ReleaseFast src/benchmark.zig -- --xor 8 --num-keys 1000000

Benchmarks: 2019 Macbook Pro (1M - 100M keys)

CPU: 2.3Ghz Intel Core i9
Memory: 16 GB 2667 MHz DDR4
Zig version: ``0.8.0-dev.1032+8098b3f84`

Algorithm	# of keys	populate	time per containment check	fpp (estimated)	bits per entry (memory)
xor4	1M	92ms	28ns/check	0.0625333000	9.8
xor8	1M	124ms	29ns/check	0.0039010000	9.8
xor16	1M	106ms	30ns/check	0.0000140000	19.7
xor32	1M	99ms	33ns/check	0.0000000000	39.4
fuse8	1M	96ms	28ns/check	0.0039010000	9.8
fuse16	1M	93ms	30ns/check	0.0000140000	19.7

xor4	10M	1.6s	90ns/check	0.0626137000	9.8
xor8	10M	1.6s	89ns/check	0.0039369000	9.8
xor16	10M	1.6s	105ns/check	0.0000173000	19.7
xor32	10M	1.6s	119ns/check	0.0000000000	39.4
fuse8	10M	1.6s	92ns/check	0.0039369000	9.8
fuse16	10M	1.6s	113ns/check	0.0000173000	19.7

xor4	100M	23s	128ns/check	0.0625772000	9.8
xor8	100M	20s	125ns/check	0.0039238000	9.8
xor16	100M	21s	136ns/check	0.0000147000	19.7
xor32	100M	22s	135ns/check	0.0000000000	39.4
fuse8	100M	21s	124ns/check	0.0039238000	9.8
fuse16	100M	22s	126ns/check	0.0000147000	19.7

Benchmarks: Windows 10 WSL2 Desktop (1M - 250M keys)

CPU: 3.79Ghz AMD Ryzen 9 3900X
Memory: 32 GB 2133 MHz DDR4
Zig version: ``0.8.0-dev.1039+bea791b63`

Algorithm	# of keys	populate	time per containment check	fpp (estimated)	bits per entry (memory)
xor4	1M	112ms	23ns/check	0.0625333000	9.8
xor8	1M	112ms	23ns/check	0.0039010000	9.8
xor16	1M	117ms	24ns/check	0.0000140000	19.7
xor32	1M	119ms	25ns/check	0.0000000000	39.4
fuse8	1M	112ms	23ns/check	0.0039010000	9.8
fuse16	1M	115ms	24ns/check	0.0000140000	19.7

xor4	10M	1.6s	39ns/check	0.0626137000	9.8
xor8	10M	1.6s	38ns/check	0.0039369000	9.8
xor16	10M	1.7s	126ns/check	0.0000173000	19.7
xor32	10M	1.8s	158ns/check	0.0000000000	39.4
fuse8	10M	1.6s	38ns/check	0.0039369000	9.8
fuse16	10M	1.7s	127ns/check	0.0000173000	19.7

xor4	100M	21s	175ns/check	0.0625772000	9.8
xor8	100M	20s	175ns/check	0.0039238000	9.8
xor16	100M	20s	180ns/check	0.0000147000	19.7
xor32	100M	20s	190ns/check	0.0000000000	39.4
fuse8	100M	20s	181ns/check	0.0039238000	9.8
fuse16	100M	20s	183ns/check	0.0000147000	19.7

xor4	250M	1.1min	194ns/check	0.0625503000	9.8
xor8	250M	1.2min	190ns/check	0.0038876000	9.8
xor16	250M	1.2min	196ns/check	0.0000125000	19.7
xor32	250M	1.1min	203ns/check	0.0000000000	39.4
fuse8	250M	1.1min	199ns/check	0.0038876000	9.8
fuse16	250M	1.1min	203ns/check	0.0000125000	19.7

Special thanks

Thomas Mueller Graf and Daniel Lemire - for their excellent research into xor filters, xor+ filters, their C implementation, and more.
Martin Dietzfelbinger and Stefan Walzer - for their excellent research into fuse filters.
Jim Apple - for their C implementation[1] of fuse filters
Andrew Gutekanst - for providing substantial help in debugging several issues in the Zig implementation.

If it was not for the above people, I (@slimsag) would not have been able to write this implementation and learn from the excellent C implementation. Please credit the above people if you use this library.

Changelog

The API is generally finalized, but we may make some adjustments as Zig changes or we learn of more idiomatic ways to express things. We will release v1.0 once Zig v1.0 is released.

v0.8.0: initial release with support for Xor and Fuse filters of varying bit sizes, key iterators, serialization, and a slice de-duplication helper.

hexops / xorfilter

README.md

xorfilter: Zig implementation of Xor Filters

Benefits of Zig implementation

Research papers

Usage

Serialization

Should I use xor filters or fuse filters?

Note about extremely large datasets

Benchmarks

Special thanks

Changelog

About

Releases

Packages

Contributors 2

Languages

hexops / xorfilter

Launching GitHub Desktop

Launching GitHub Desktop

Launching Xcode

Launching Visual Studio Code

Latest commit

Git stats

Files

README.md

xorfilter: Zig implementation of Xor Filters

Benefits of Zig implementation

Research papers

Usage

Serialization

Should I use xor filters or fuse filters?

Note about extremely large datasets

Benchmarks

Special thanks

Changelog

About

Resources

License

Releases

Packages 0

Contributors 2

Languages

Packages