Bitsets in Practice: Cache Lines, Masks, and SIMD Tradeoffs
Real benchmark numbers for bitset scan strategies, plus layout and SIMD tradeoffs.
Bitsets are still one of the highest-leverage structures in systems code, but they are easy to benchmark badly.
I re-ran the benchmarks behind this post with real numbers on February 17, 2026 using Quick Bench on x86, and updated the guidance to reflect what actually mattered.
TL;DR from measured runs
On this Quick Bench x86 run (single CPU, 2.3 GHz, clang 17, C++20, -O3), for 65,536-bit sets:
- In dense scans, AVX2/AVX-512 improved over bit-by-bit, but only modestly (about 1.40x to 1.45x).
- In sparse late-hit scans, moving from bit-by-bit to chunked countr_zero was the big win (~73x to ~74x).
- For sparse late-hit, AVX2/AVX-512 added little over the scalar chunked path in this environment.
The key point: algorithm shape and cache behavior usually dominate before ISA-specific SIMD does.
What Bitsets Buy You (and What They Don’t)
Bitsets are excellent for:
- dense membership/flag state
- fast boolean algebra over sets
- compact predicate transport
Bitsets do not automatically solve:
- random payload gather cost
- bad locality in adjacent data
- branch-heavy scan loops
The storage is cheap. The surrounding access pattern often is not.
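As a concrete instance of the "fast boolean algebra over sets" point above, here is a minimal sketch (not from the benchmark code; the set names are made up for illustration):

```cpp
#include <bitset>
#include <iostream>

int main() {
  // Hypothetical predicate sets over 256 items: "active" and "dirty".
  std::bitset<256> active, dirty;
  active.set(3).set(42).set(200);
  dirty.set(42).set(7);

  // Set algebra runs roughly one word-wide AND/OR per 64 bits under the hood.
  const std::bitset<256> needs_flush = active & dirty;  // intersection
  const std::bitset<256> touched     = active | dirty;  // union

  std::cout << "needs_flush count: " << needs_flush.count() << "\n";  // 1 (bit 42)
  std::cout << "touched count: "     << touched.count()     << "\n";  // 4
}
```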
Layout and Mask Semantics
```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
  std::cout << "sizeof(bitset<8>): " << sizeof(std::bitset<8>) << "\n";
  std::cout << "sizeof(bitset<64>): " << sizeof(std::bitset<64>) << "\n";
  std::cout << "sizeof(uint8_t): " << sizeof(uint8_t) << "\n";
}
```
std::bitset size/layout is implementation-defined, so validate on your target toolchain and ABI before making memory assumptions.
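If you do depend on a particular footprint, a compile-time guard makes the assumption explicit. This is a minimal sketch, assuming you expect std::bitset<64> to pack into a single 64-bit word on your target:

```cpp
#include <bitset>
#include <cstdint>

// Assumption being checked, not a guarantee from the standard:
// this build's std::bitset<64> occupies exactly one 64-bit word.
static_assert(sizeof(std::bitset<64>) == sizeof(uint64_t),
              "std::bitset<64> is larger than expected on this toolchain/ABI");
```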
Mask semantics should also be explicit, especially around zero:
None should mean “no flags set,” and predicates should decide whether mask == 0 is allowed. Most production bugs I see here are semantic, not microarchitectural.
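A small sketch of what explicit semantics can look like in practice; the flag names and helpers are hypothetical, not from any particular codebase:

```cpp
#include <cstdint>

// Hypothetical flag set; None == 0 means "no flags set" by definition.
enum class TaskFlags : uint64_t {
  None      = 0,
  Ready     = 1ULL << 0,
  Pinned    = 1ULL << 1,
  Cancelled = 1ULL << 2,
};

constexpr uint64_t raw(TaskFlags f) { return static_cast<uint64_t>(f); }

// The predicates decide explicitly how mask == 0 is treated:
// has_any(None) is false, has_all(mask, None) is trivially true.
constexpr bool has_any(uint64_t mask) { return mask != 0; }
constexpr bool has_all(uint64_t mask, uint64_t required) {
  return (mask & required) == required;
}

static_assert(!has_any(raw(TaskFlags::None)));
static_assert(has_all(raw(TaskFlags::Ready) | raw(TaskFlags::Pinned),
                      raw(TaskFlags::Ready)));
```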
Scan Paths That Matter
For scalar scans, start with chunked words and a zero guard:
```cpp
#include <bit>
#include <cstdint>
#include <optional>

std::optional<uint32_t> first_set_u64(uint64_t w) {
  if (w == 0) return std::nullopt;
  return static_cast<uint32_t>(std::countr_zero(w));
}

std::optional<uint32_t> first_set_chunks(const uint64_t* words, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    uint64_t w = words[i];
    if (w == 0) continue;
    return i * 64u + static_cast<uint32_t>(std::countr_zero(w));
  }
  return std::nullopt;
}
```
If you drop to builtins/intrinsics directly (__builtin_ctz*, _tzcnt_u64), keep the zero check. Many forms are undefined on zero input.
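A hedged sketch of the guarded builtin path, assuming a GCC/Clang target where __builtin_ctzll is available (its result is undefined for a zero argument):

```cpp
#include <cstdint>
#include <optional>

// Same contract as first_set_u64 above, but via the compiler builtin.
// __builtin_ctzll(0) is undefined behavior, so the zero check stays.
std::optional<uint32_t> first_set_u64_builtin(uint64_t w) {
  if (w == 0) return std::nullopt;
  return static_cast<uint32_t>(__builtin_ctzll(w));
}
```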
Benchmark Setup (Quick Bench, Reproducible)
Source file used for this update:
benchmarks/quick_bench_bitset_scan.cpp
Quick Bench settings used:
- compiler: clang 17
- std: C++20
- optim: O3
- ISA variants tested: scalar flags, AVX2 flags, AVX-512 flags
Environment:
- x86 container host (1 CPU, 2300 MHz reported by harness)
- benchmark library build type: release
- workload: 65,536-bit sets (1024 x uint64_t)
Cases:
- Dense: random non-zero first word (early hit)
- SparseLate: single set bit in the last word (late hit)
Algorithms compared:
- bit_by_bit: checks every bit
- ctz_scan: word scan + std::countr_zero
- ctz_unrolled4: same as above, 4-word unroll
- avx2_scan: AVX2 chunk probe + scalar countr_zero on winning lane
- avx512_scan: AVX-512 chunk probe + scalar countr_zero on winning lane
Results
Numbers below are from the run on February 17, 2026.
I am using cpu_time from Quick Bench because real_time was scheduler-noisy on this 1-vCPU environment.
| Case | bit_by_bit (ns) | ctz_scan (ns) | ctz_unrolled4 (ns) | avx2_scan (ns) | avx512_scan (ns) |
|---|---|---|---|---|---|
| Dense | 242.52 | 181.76 | 177.64 | 173.34 | 166.93 |
| SparseLate | 1855886.53 | 25498.49 | 24984.23 | 25095.40 | 24930.73 |
Speedup vs bit_by_bit:
| Case | ctz_scan | ctz_unrolled4 | avx2_scan | avx512_scan |
|---|---|---|---|---|
| Dense | 1.33x | 1.37x | 1.40x | 1.45x |
| SparseLate | 72.78x | 74.28x | 73.95x | 74.44x |
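The speedups are simply the ratio of cpu_time values from the first table; for example, SparseLate ctz_scan is 1855886.53 / 25498.49 ≈ 72.8x.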
Interpretation:
- Dense early-hit scans can benefit from SIMD, but this workload showed incremental gains, not order-of-magnitude gains.
- Sparse late-hit scans are dominated by the shift from bit-by-bit to chunked scan strategy.
- Once on chunked scan, AVX2/AVX-512 were nearly tied with scalar unrolled in this run; memory/access pattern dominated.
SIMD Tradeoffs (What This Means in Practice)
SIMD is still valuable, but only after the scalar/chunk baseline is right:
- If your bottleneck is a bit scan over contiguous words, chunked scalar can already be excellent.
- If your bottleneck is sparse/random payload gather, SIMD mask/compact alone often does not fix memory stalls.
- Keep ISA portability in mind: AVX2/AVX-512 code paths need fallbacks, and many deployments are still mixed.
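One common way to handle the fallback problem is runtime dispatch. This is a minimal sketch, assuming a GCC/Clang x86-64 build; it reuses the same probe idea as the benchmark code below, but it is not the code that was benchmarked.

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#endif

// Portable scalar path: chunked words + countr_zero, always available.
static uint32_t scan_scalar(const uint64_t* words, size_t n) {
  for (size_t w = 0; w < n; ++w)
    if (words[w] != 0)
      return static_cast<uint32_t>(w * 64 + std::countr_zero(words[w]));
  return UINT32_MAX;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// AVX2 path compiled per-function via the target attribute, so the rest of
// the translation unit does not require -mavx2.
__attribute__((target("avx2")))
static uint32_t scan_avx2(const uint64_t* words, size_t n) {
  const __m256i zero = _mm256_setzero_si256();
  size_t w = 0;
  for (; w + 4 <= n; w += 4) {
    const __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(words + w));
    const __m256i eq = _mm256_cmpeq_epi64(v, zero);
    const unsigned eq_mask =
        static_cast<unsigned>(_mm256_movemask_pd(_mm256_castsi256_pd(eq)));
    if (eq_mask != 0xFu) {  // at least one non-zero 64-bit lane
      const size_t lane = std::countr_zero(~eq_mask & 0xFu);
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(words[w + lane]));
    }
  }
  for (; w < n; ++w)
    if (words[w] != 0)
      return static_cast<uint32_t>(w * 64 + std::countr_zero(words[w]));
  return UINT32_MAX;
}
#endif

using ScanFn = uint32_t (*)(const uint64_t*, size_t);

// Pick the widest path the host actually supports.
static ScanFn select_scan() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if (__builtin_cpu_supports("avx2")) return scan_avx2;  // runtime CPU check
#endif
  return scan_scalar;  // portable fallback
}

uint32_t first_set_dispatched(const uint64_t* words, size_t n) {
  static const ScanFn fn = select_scan();  // decided once, then cached
  return fn(words, n);
}
```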
Practical Decision Guide
- tiny fixed flags (<= 64 bits): integer masks (uint64_t); see the sketch after this list
- medium fixed compile-time sets: std::bitset<N>
- large dense runtime sets: custom chunked bitset (uint64_t words)
- large sparse sets: roaring/compressed bitmaps
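For the tiny-flags case, a plain integer mask is usually enough; a minimal sketch with hypothetical bit positions:

```cpp
#include <bit>
#include <cstdint>

// Hypothetical flag positions for a <= 64-entry flag set.
constexpr uint64_t kReady   = 1ULL << 0;
constexpr uint64_t kPinned  = 1ULL << 1;
constexpr uint64_t kRetired = 1ULL << 5;

int main() {
  uint64_t mask = 0;                             // "no flags set"
  mask |= kReady | kRetired;                     // set
  mask &= ~kRetired;                             // clear
  const bool pinned    = (mask & kPinned) != 0;  // test
  const int population = std::popcount(mask);    // how many flags are set
  (void)pinned;
  (void)population;
}
```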
Most common mistake: choosing SIMD first and layout second. Do the opposite.
Checklist
- Define mask semantics explicitly, especially None.
- Guard ctz/clz paths against zero input.
- Validate std::bitset layout assumptions on your exact STL/ABI.
- Establish a clean chunked scalar baseline before SIMD.
- Benchmark realistic sparsity and cache conditions, then decide.
Benchmark Code
benchmarks/quick_bench_bitset_scan.cpp

```cpp
#include <benchmark/benchmark.h>

#include <bit>
#include <cstdint>
#include <random>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
#include <immintrin.h>
#define QB_X86 1
#else
#define QB_X86 0
#endif

namespace {

constexpr size_t kWordsPerSet = 1024;  // 65,536 bits
constexpr size_t kSetCount = 64;

struct Dataset {
  std::vector<uint64_t> dense;
  std::vector<uint64_t> sparse;
};

const Dataset& get_dataset() {
  static const Dataset ds = [] {
    Dataset out;
    out.dense.assign(kSetCount * kWordsPerSet, 0);
    out.sparse.assign(kSetCount * kWordsPerSet, 0);
    std::mt19937_64 rng(0xD00D1234ULL);
    std::uniform_int_distribution<uint64_t> full_dist(1, UINT64_MAX);
    std::uniform_int_distribution<uint32_t> bit_dist(0, 63);
    for (size_t i = 0; i < kSetCount; ++i) {
      uint64_t* dense_set = out.dense.data() + i * kWordsPerSet;
      dense_set[0] = full_dist(rng);
      for (size_t w = 1; w < kWordsPerSet; ++w) {
        dense_set[w] = full_dist(rng);
      }
      uint64_t* sparse_set = out.sparse.data() + i * kWordsPerSet;
      sparse_set[kWordsPerSet - 1] = 1ULL << bit_dist(rng);
    }
    return out;
  }();
  return ds;
}

uint32_t scan_bit_by_bit(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    for (uint32_t b = 0; b < 64; ++b) {
      if ((word >> b) & 1ULL) return static_cast<uint32_t>(w * 64 + b);
    }
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_unrolled4(const uint64_t* words, size_t word_count) {
  size_t w = 0;
  for (; w + 4 <= word_count; w += 4) {
    const uint64_t w0 = words[w + 0];
    if (w0 != 0) return static_cast<uint32_t>((w + 0) * 64 + std::countr_zero(w0));
    const uint64_t w1 = words[w + 1];
    if (w1 != 0) return static_cast<uint32_t>((w + 1) * 64 + std::countr_zero(w1));
    const uint64_t w2 = words[w + 2];
    if (w2 != 0) return static_cast<uint32_t>((w + 2) * 64 + std::countr_zero(w2));
    const uint64_t w3 = words[w + 3];
    if (w3 != 0) return static_cast<uint32_t>((w + 3) * 64 + std::countr_zero(w3));
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_avx2(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX2__)
  const __m256i zero = _mm256_setzero_si256();
  size_t w = 0;
  for (; w + 4 <= word_count; w += 4) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(words + w));
    const __m256i eq = _mm256_cmpeq_epi64(v, zero);
    const unsigned eq_mask =
        static_cast<unsigned>(_mm256_movemask_pd(_mm256_castsi256_pd(eq)));
    if (eq_mask != 0xF) {
      const unsigned nz_mask = (~eq_mask) & 0xF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_unrolled4(words, word_count);
#endif
}

uint32_t scan_word_ctz_avx512(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX512F__) && defined(__AVX512VL__)
  const __m512i zero = _mm512_setzero_si512();
  size_t w = 0;
  for (; w + 8 <= word_count; w += 8) {
    const __m512i v = _mm512_loadu_si512(reinterpret_cast<const void*>(words + w));
    const __mmask8 eq_mask = _mm512_cmpeq_epi64_mask(v, zero);
    if (eq_mask != 0xFF) {
      const unsigned nz_mask = (~static_cast<unsigned>(eq_mask)) & 0xFF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_avx2(words, word_count);
#endif
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_Dense(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.dense.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_SparseLate(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.sparse.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

BENCHMARK(BM_Dense<scan_bit_by_bit>);
BENCHMARK(BM_Dense<scan_word_ctz>);
BENCHMARK(BM_Dense<scan_word_ctz_unrolled4>);
BENCHMARK(BM_Dense<scan_word_ctz_avx2>);
BENCHMARK(BM_Dense<scan_word_ctz_avx512>);
BENCHMARK(BM_SparseLate<scan_bit_by_bit>);
BENCHMARK(BM_SparseLate<scan_word_ctz>);
BENCHMARK(BM_SparseLate<scan_word_ctz_unrolled4>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx2>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx512>);

}  // namespace
```
Footnotes
- quick-bench, used for the benchmark runs.
- Google Benchmark, the benchmarking harness and library tooling.
- std::countr_zero reference.
- __builtin_ctz and related compiler builtins.