Bitsets in Practice: Cache Lines, Masks, and SIMD Tradeoffs

Real benchmark numbers for bitset scan strategies, plus layout and SIMD tradeoffs.

Bitsets are still one of the highest-leverage structures in systems code, but they are easy to benchmark badly.

I re-ran the benchmarks behind this post on February 17, 2026 using Quick Bench on x86, and updated the guidance to reflect what actually mattered.

TL;DR from measured runs

On this Quick Bench x86 run (single CPU, 2.3 GHz, clang 17, C++20, -O3), for 65,536-bit sets:

  • In dense scans, AVX2/AVX-512 improved over bit-by-bit, but only modestly (about 1.40x to 1.45x).
  • In sparse late-hit scans, moving from bit-by-bit to chunked countr_zero was the big win (~73x to ~74x).
  • For sparse late-hit, AVX2/AVX-512 added little over the scalar chunked path in this environment.

The key point: algorithm shape and cache behavior usually dominate before ISA-specific SIMD does.

What Bitsets Buy You (and What They Don’t)

Bitsets are excellent for:

  • dense membership/flag state
  • fast boolean algebra over sets
  • compact predicate transport

Bitsets do not automatically solve:

  • random payload gather cost
  • bad locality in adjacent data
  • branch-heavy scan loops

The storage is cheap. The surrounding access pattern often is not.
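
As a small illustration of both points, here is a sketch (the field names and sizes are hypothetical) where the boolean algebra over the bitsets is cheap, while the payload walk that follows is where the real cost lives:

#include <bitset>
#include <cstddef>
#include <vector>

// Hypothetical 65,536-row table with two flag sets kept as bitsets.
constexpr std::size_t kRows = 65536;

// The set algebra itself is compact and branch-free...
std::bitset<kRows> active_and_dirty(const std::bitset<kRows>& active,
                                    const std::bitset<kRows>& dirty) {
  return active & dirty;
}

// ...but visiting the matching rows still pays full gather cost on the payload.
double sum_selected(const std::bitset<kRows>& selected,
                    const std::vector<double>& payload) {
  double total = 0.0;
  for (std::size_t i = 0; i < kRows && i < payload.size(); ++i) {
    if (selected.test(i)) total += payload[i];  // scattered loads dominate here
  }
  return total;
}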

Layout and Mask Semantics

#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
  std::cout << "sizeof(bitset<8>): " << sizeof(std::bitset<8>) << "\n";
  std::cout << "sizeof(bitset<64>): " << sizeof(std::bitset<64>) << "\n";
  std::cout << "sizeof(uint8_t): " << sizeof(uint8_t) << "\n";
}

std::bitset size/layout is implementation-defined, so validate on your target toolchain and ABI before making memory assumptions.
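
If code elsewhere quietly depends on the common "one 64-bit word per 64 bits" layout, it is worth pinning that assumption at compile time. A minimal sketch; the expected size is an assumption about your toolchain, which is exactly what the check documents:

#include <bitset>
#include <cstdint>

// Not guaranteed by the standard; this turns a silent layout assumption
// into a loud build failure if the toolchain/ABI disagrees.
static_assert(sizeof(std::bitset<64>) == sizeof(std::uint64_t),
              "std::bitset<64> is not a single 64-bit word on this toolchain");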

Mask semantics should also be explicit, especially around zero:

None should mean “no flags set,” and predicates should decide whether mask == 0 is allowed. Most production bugs I see here are semantic, not microarchitectural.
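
One way to keep that explicit is to give the empty mask a name and force predicates to state their policy for it. A sketch with hypothetical flag names:

#include <cstdint>

// Hypothetical flag mask; kNone is a named value, not an accident.
enum class Flags : std::uint64_t {
  kNone   = 0,
  kActive = 1ULL << 0,
  kDirty  = 1ULL << 1,
};

constexpr std::uint64_t raw(Flags f) { return static_cast<std::uint64_t>(f); }

// Policy 1: an empty required-set matches everything (trivially true for kNone).
constexpr bool matches_all(std::uint64_t state, Flags required) {
  return (state & raw(required)) == raw(required);
}

// Policy 2: reject the empty mask outright.
constexpr bool matches_all_nonempty(std::uint64_t state, Flags required) {
  return required != Flags::kNone && matches_all(state, required);
}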

Scan Paths That Matter

For scalar scans, start with chunked words and a zero guard:

#include <bit>
#include <cstdint>
#include <optional>

// Zero guard up front: std::countr_zero(0) is well-defined (returns 64),
// but an empty word should report "no bit found", not bit 64.
std::optional<uint32_t> first_set_u64(uint64_t w) {
  if (w == 0) return std::nullopt;
  return static_cast<uint32_t>(std::countr_zero(w));
}

// Chunked scan: skip zero words, then resolve the bit index in the winner.
std::optional<uint32_t> first_set_chunks(const uint64_t* words, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    uint64_t w = words[i];
    if (w == 0) continue;
    return i * 64u + static_cast<uint32_t>(std::countr_zero(w));
  }
  return std::nullopt;
}

If you drop to builtins/intrinsics directly (__builtin_ctz*, _tzcnt_u64), keep the zero check: the __builtin_ctz* family is undefined for zero input, and a uniform guard keeps the scalar and intrinsic paths interchangeable.
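
A thin wrapper keeps the guard in one place. A sketch assuming GCC/Clang (__builtin_ctzll), with std::countr_zero as the portable path:

#include <bit>
#include <cstdint>

// Returns 64 for a zero word instead of hitting undefined behavior:
// __builtin_ctzll(0) is undefined, std::countr_zero(uint64_t{0}) is 64.
inline unsigned safe_ctz64(std::uint64_t w) {
#if defined(__GNUC__) || defined(__clang__)
  return w == 0 ? 64u : static_cast<unsigned>(__builtin_ctzll(w));
#else
  return static_cast<unsigned>(std::countr_zero(w));
#endif
}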

Benchmark Setup (Quick Bench, Reproducible)

Source file used for this update:

  • benchmarks/quick_bench_bitset_scan.cpp

Quick Bench settings used:

  • compiler: clang 17
  • std: C++20
  • optim: O3
  • ISA variants tested: scalar flags, AVX2 flags, AVX-512 flags

Environment:

  • x86 container host (1 CPU, 2300 MHz reported by harness)
  • benchmark library build type: release
  • workload: 65,536-bit sets (1024 x uint64_t)

Cases:

  • Dense: random non-zero first word (early hit)
  • SparseLate: single set bit in the last word (late hit)

Algorithms compared:

  • bit_by_bit: checks every bit
  • ctz_scan: word scan + std::countr_zero
  • ctz_unrolled4: same as above, 4-word unroll
  • avx2_scan: AVX2 chunk probe + scalar countr_zero on winning lane
  • avx512_scan: AVX-512 chunk probe + scalar countr_zero on winning lane

Results

Numbers below are from the run on February 17, 2026.
I am using cpu_time from Quick Bench because real_time was scheduler-noisy on this 1-vCPU environment.

Case         bit_by_bit (ns)   ctz_scan (ns)   ctz_unrolled4 (ns)   avx2_scan (ns)   avx512_scan (ns)
Dense        242.52            181.76          177.64               173.34           166.93
SparseLate   1855886.53        25498.49        24984.23             25095.40         24930.73

Speedup vs bit_by_bit:

Case         ctz_scan   ctz_unrolled4   avx2_scan   avx512_scan
Dense        1.33x      1.37x           1.40x       1.45x
SparseLate   72.78x     74.28x          73.95x      74.44x

Interpretation:

  1. Dense early-hit scans can benefit from SIMD, but this workload showed incremental gains, not order-of-magnitude gains.
  2. Sparse late-hit scans are dominated by the shift from bit-by-bit to chunked scan strategy.
  3. Once on chunked scan, AVX2/AVX-512 were nearly tied with scalar unrolled in this run; memory/access pattern dominated.

SIMD Tradeoffs (What This Means in Practice)

SIMD is still valuable, but only after the scalar/chunk baseline is right:

  • If your bottleneck is a bit scan over contiguous words, chunked scalar can already be excellent.
  • If your bottleneck is sparse/random payload gather, SIMD mask/compact alone often does not fix memory stalls.
  • Keep ISA portability in mind: AVX2/AVX-512 code paths need fallbacks, and many deployments are still mixed.
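
As a sketch of what a fallback path can look like (function names are illustrative; __builtin_cpu_supports is GCC/Clang-specific and x86-only), dispatch once at startup and keep scalar as the default:

#include <bit>
#include <cstddef>
#include <cstdint>

using ScanFn = std::uint32_t (*)(const std::uint64_t*, std::size_t);

// Minimal scalar baseline (same chunked countr_zero idea as above).
std::uint32_t scan_scalar(const std::uint64_t* words, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (words[i] != 0)
      return static_cast<std::uint32_t>(i * 64 + std::countr_zero(words[i]));
  return UINT32_MAX;
}

// Stand-in for an AVX2 build of the scan; in a real project this would be the
// intrinsic version, compiled with the appropriate target flags or attributes.
std::uint32_t scan_avx2(const std::uint64_t* words, std::size_t n) {
  return scan_scalar(words, n);
}

// Pick the widest supported path once at startup; default to scalar.
ScanFn pick_scan() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return scan_avx2;
#endif
  return scan_scalar;
}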

Practical Decision Guide

  • tiny fixed flags (<= 64 bits): integer masks (uint64_t)
  • medium fixed compile-time sets: std::bitset<N>
  • large dense runtime sets: custom chunked bitset (uint64_t words; see the sketch below)
  • large sparse sets: roaring/compressed bitmaps

Most common mistake: choosing SIMD first and layout second. Do the opposite.
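
For the "large dense runtime sets" row, the custom structure does not need to be elaborate. A minimal sketch; the API and names are illustrative, not a drop-in library:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Large dense runtime set: uint64_t words with an explicit chunked scan.
class ChunkedBitset {
 public:
  explicit ChunkedBitset(std::size_t bits)
      : bits_(bits), words_((bits + 63) / 64, 0) {}

  void set(std::size_t i)        { words_[i / 64] |=  (1ULL << (i % 64)); }
  void reset(std::size_t i)      { words_[i / 64] &= ~(1ULL << (i % 64)); }
  bool test(std::size_t i) const { return (words_[i / 64] >> (i % 64)) & 1ULL; }

  // Chunked scan: skip zero words, then countr_zero on the first non-zero one.
  std::size_t first_set() const {
    for (std::size_t w = 0; w < words_.size(); ++w)
      if (words_[w] != 0)
        return w * 64 + static_cast<std::size_t>(std::countr_zero(words_[w]));
    return bits_;  // "not found" sentinel: one past the last valid index
  }

 private:
  std::size_t bits_;
  std::vector<std::uint64_t> words_;
};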

Checklist

  1. Define mask semantics explicitly, especially None.
  2. Guard ctz/clz paths against zero input.
  3. Validate std::bitset layout assumptions on your exact STL/ABI.
  4. Establish a clean chunked scalar baseline before SIMD.
  5. Benchmark realistic sparsity and cache conditions, then decide.

Benchmark Code

benchmarks/quick_bench_bitset_scan.cpp
#include <benchmark/benchmark.h>

#include <bit>
#include <cstdint>
#include <random>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
#include <immintrin.h>
#define QB_X86 1
#else
#define QB_X86 0
#endif

namespace {

constexpr size_t kWordsPerSet = 1024;  // 65,536 bits
constexpr size_t kSetCount = 64;

struct Dataset {
  std::vector<uint64_t> dense;
  std::vector<uint64_t> sparse;
};

const Dataset& get_dataset() {
  static const Dataset ds = [] {
    Dataset out;
    out.dense.assign(kSetCount * kWordsPerSet, 0);
    out.sparse.assign(kSetCount * kWordsPerSet, 0);
    std::mt19937_64 rng(0xD00D1234ULL);
    std::uniform_int_distribution<uint64_t> full_dist(1, UINT64_MAX);
    std::uniform_int_distribution<uint32_t> bit_dist(0, 63);
    for (size_t i = 0; i < kSetCount; ++i) {
      // Dense case: every word is a non-zero random value, so the first word
      // is an early hit.
      uint64_t* dense_set = out.dense.data() + i * kWordsPerSet;
      dense_set[0] = full_dist(rng);
      for (size_t w = 1; w < kWordsPerSet; ++w) {
        dense_set[w] = full_dist(rng);
      }
      // SparseLate case: a single set bit in the last word, so every scan
      // walks all 1024 words before hitting.
      uint64_t* sparse_set = out.sparse.data() + i * kWordsPerSet;
      sparse_set[kWordsPerSet - 1] = 1ULL << bit_dist(rng);
    }
    return out;
  }();
  return ds;
}

uint32_t scan_bit_by_bit(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    for (uint32_t b = 0; b < 64; ++b) {
      if ((word >> b) & 1ULL) return static_cast<uint32_t>(w * 64 + b);
    }
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_unrolled4(const uint64_t* words, size_t word_count) {
  size_t w = 0;
  for (; w + 4 <= word_count; w += 4) {
    const uint64_t w0 = words[w + 0];
    if (w0 != 0) return static_cast<uint32_t>((w + 0) * 64 + std::countr_zero(w0));
    const uint64_t w1 = words[w + 1];
    if (w1 != 0) return static_cast<uint32_t>((w + 1) * 64 + std::countr_zero(w1));
    const uint64_t w2 = words[w + 2];
    if (w2 != 0) return static_cast<uint32_t>((w + 2) * 64 + std::countr_zero(w2));
    const uint64_t w3 = words[w + 3];
    if (w3 != 0) return static_cast<uint32_t>((w + 3) * 64 + std::countr_zero(w3));
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_avx2(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX2__)
  const __m256i zero = _mm256_setzero_si256();
  size_t w = 0;
  // Probe 4 words (256 bits) per iteration; movemask yields one bit per lane.
  for (; w + 4 <= word_count; w += 4) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(words + w));
    const __m256i eq = _mm256_cmpeq_epi64(v, zero);
    const unsigned eq_mask = static_cast<unsigned>(_mm256_movemask_pd(_mm256_castsi256_pd(eq)));
    if (eq_mask != 0xF) {
      const unsigned nz_mask = (~eq_mask) & 0xF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_unrolled4(words, word_count);
#endif
}

uint32_t scan_word_ctz_avx512(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX512F__) && defined(__AVX512VL__)
  const __m512i zero = _mm512_setzero_si512();
  size_t w = 0;
  // Probe 8 words (512 bits) per iteration; the compare produces a lane mask.
  for (; w + 8 <= word_count; w += 8) {
    const __m512i v = _mm512_loadu_si512(reinterpret_cast<const void*>(words + w));
    const __mmask8 eq_mask = _mm512_cmpeq_epi64_mask(v, zero);
    if (eq_mask != 0xFF) {
      const unsigned nz_mask = (~static_cast<unsigned>(eq_mask)) & 0xFF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_avx2(words, word_count);
#endif
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_Dense(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.dense.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_SparseLate(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.sparse.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

BENCHMARK(BM_Dense<scan_bit_by_bit>);
BENCHMARK(BM_Dense<scan_word_ctz>);
BENCHMARK(BM_Dense<scan_word_ctz_unrolled4>);
BENCHMARK(BM_Dense<scan_word_ctz_avx2>);
BENCHMARK(BM_Dense<scan_word_ctz_avx512>);
BENCHMARK(BM_SparseLate<scan_bit_by_bit>);
BENCHMARK(BM_SparseLate<scan_word_ctz>);
BENCHMARK(BM_SparseLate<scan_word_ctz_unrolled4>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx2>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx512>);

}  // namespace

Footnotes

  1. Quick Bench, used to run the benchmarks consistently.
  2. Google Benchmark, the harness/library behind the measurements.
  3. Reference documentation for std::countr_zero.
  4. Compiler builtins such as __builtin_ctz*; see the GCC/Clang builtins documentation.