Bitsets in Practice: Cache Lines, Masks, and SIMD Tradeoffs

Real benchmark numbers for bitset scan strategies, plus layout and SIMD tradeoffs.

Bitsets are still one of the highest-leverage structures in systems code, but they are easy to benchmark badly.

I re-ran the benchmarks behind this post on February 17, 2026 using Quick Bench on x86, and updated the guidance to reflect what actually mattered.

TL;DR from measured runs

On this Quick Bench x86 run (single CPU, 2.3 GHz, clang 17, C++20, -O3), for 65,536-bit sets:

  • In dense scans, AVX2/AVX-512 improved over bit-by-bit, but only modestly (about 1.40x to 1.45x).
  • In sparse late-hit scans, moving from bit-by-bit to chunked countr_zero was the big win (~73x to ~74x).
  • For sparse late-hit, AVX2/AVX-512 added little over the scalar chunked path in this environment.

The key point: algorithm shape and cache behavior usually dominate before ISA-specific SIMD does.

What Bitsets Buy You (and What They Don’t)

Bitsets are excellent for:

  • dense membership/flag state
  • fast boolean algebra over sets
  • compact predicate transport

Bitsets do not automatically solve:

  • random payload gather cost
  • bad locality in adjacent data
  • branch-heavy scan loops

The storage is cheap. The surrounding access pattern often is not.
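
As a small illustration of both points, here is a sketch (the field names and sizes are hypothetical) where the boolean algebra over the bitsets is cheap, while the payload walk that follows is where the real cost lives:

#include <bitset>
#include <cstddef>
#include <vector>

// Hypothetical 65,536-row table with two flag sets kept as bitsets.
constexpr std::size_t kRows = 65536;

// The set algebra itself is compact and branch-free...
std::bitset<kRows> active_and_dirty(const std::bitset<kRows>& active,
                                    const std::bitset<kRows>& dirty) {
  return active & dirty;
}

// ...but visiting the matching rows still pays full gather cost on the payload.
double sum_selected(const std::bitset<kRows>& selected,
                    const std::vector<double>& payload) {
  double total = 0.0;
  for (std::size_t i = 0; i < kRows && i < payload.size(); ++i) {
    if (selected.test(i)) total += payload[i];  // scattered loads dominate here
  }
  return total;
}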

Layout and Mask Semantics

#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
  std::cout << "sizeof(bitset<8>): " << sizeof(std::bitset<8>) << "\n";
  std::cout << "sizeof(bitset<64>): " << sizeof(std::bitset<64>) << "\n";
  std::cout << "sizeof(uint8_t): " << sizeof(uint8_t) << "\n";
}

std::bitset size/layout is implementation-defined, so validate on your target toolchain and ABI before making memory assumptions.
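
If code elsewhere quietly depends on the common "one 64-bit word per 64 bits" layout, it is worth pinning that assumption at compile time. A minimal sketch; the expected size is an assumption about your toolchain, which is exactly what the check documents:

#include <bitset>
#include <cstdint>

// Not guaranteed by the standard; this turns a silent layout assumption
// into a loud build failure if the toolchain/ABI disagrees.
static_assert(sizeof(std::bitset<64>) == sizeof(std::uint64_t),
              "std::bitset<64> is not a single 64-bit word on this toolchain");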

Mask semantics should also be explicit, especially around zero:

None should mean “no flags set,” and predicates should decide whether mask == 0 is allowed. Most production bugs I see here are semantic, not microarchitectural.
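
One way to keep that explicit is to give the empty mask a name and force predicates to state their policy for it. A sketch with hypothetical flag names:

#include <cstdint>

// Hypothetical flag mask; kNone is a named value, not an accident.
enum class Flags : std::uint64_t {
  kNone   = 0,
  kActive = 1ULL << 0,
  kDirty  = 1ULL << 1,
};

constexpr std::uint64_t raw(Flags f) { return static_cast<std::uint64_t>(f); }

// Policy 1: an empty required-set matches everything (trivially true for kNone).
constexpr bool matches_all(std::uint64_t state, Flags required) {
  return (state & raw(required)) == raw(required);
}

// Policy 2: reject the empty mask outright.
constexpr bool matches_all_nonempty(std::uint64_t state, Flags required) {
  return required != Flags::kNone && matches_all(state, required);
}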

Scan Paths That Matter

For scalar scans, start with chunked words and a zero guard:

#include <bit>
#include <cstdint>
#include <optional>

// Zero guard up front: std::countr_zero(0) is well-defined (returns 64),
// but an empty word should report "no bit found", not bit 64.
std::optional<uint32_t> first_set_u64(uint64_t w) {
  if (w == 0) return std::nullopt;
  return static_cast<uint32_t>(std::countr_zero(w));
}

// Chunked scan: skip zero words, then resolve the bit index in the winner.
std::optional<uint32_t> first_set_chunks(const uint64_t* words, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    uint64_t w = words[i];
    if (w == 0) continue;
    return i * 64u + static_cast<uint32_t>(std::countr_zero(w));
  }
  return std::nullopt;
}

If you drop to builtins/intrinsics directly (__builtin_ctz*, _tzcnt_u64), keep the zero check: the __builtin_ctz* family is undefined for zero input, and a uniform guard keeps the scalar and intrinsic paths interchangeable.
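
A thin wrapper keeps the guard in one place. A sketch assuming GCC/Clang (__builtin_ctzll), with std::countr_zero as the portable path:

#include <bit>
#include <cstdint>

// Returns 64 for a zero word instead of hitting undefined behavior:
// __builtin_ctzll(0) is undefined, std::countr_zero(uint64_t{0}) is 64.
inline unsigned safe_ctz64(std::uint64_t w) {
#if defined(__GNUC__) || defined(__clang__)
  return w == 0 ? 64u : static_cast<unsigned>(__builtin_ctzll(w));
#else
  return static_cast<unsigned>(std::countr_zero(w));
#endif
}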

Benchmark Setup (Quick Bench, Reproducible)

Source file used for this update:

  • benchmarks/quick_bench_bitset_scan.cpp

Quick Bench settings used:

  • compiler: clang 17
  • std: C++20
  • optim: O3
  • ISA variants tested: scalar flags, AVX2 flags, AVX-512 flags

Environment:

  • x86 container host (1 CPU, 2300 MHz reported by harness)
  • benchmark library build type: release
  • workload: 65,536-bit sets (1024 x uint64_t)

Cases:

  • Dense: random non-zero first word (early hit)
  • SparseLate: single set bit in the last word (late hit)

Algorithms compared:

  • bit_by_bit: checks every bit
  • ctz_scan: word scan + std::countr_zero
  • ctz_unrolled4: same as above, 4-word unroll
  • avx2_scan: AVX2 chunk probe + scalar countr_zero on winning lane
  • avx512_scan: AVX-512 chunk probe + scalar countr_zero on winning lane

Results

Numbers below are from the run on February 17, 2026.
I am using cpu_time from Quick Bench because real_time was scheduler-noisy on this 1-vCPU environment.

Case         bit_by_bit (ns)   ctz_scan (ns)   ctz_unrolled4 (ns)   avx2_scan (ns)   avx512_scan (ns)
Dense        242.52            181.76          177.64               173.34           166.93
SparseLate   1855886.53        25498.49        24984.23             25095.40         24930.73

Speedup vs bit_by_bit:

Case         ctz_scan   ctz_unrolled4   avx2_scan   avx512_scan
Dense        1.33x      1.37x           1.40x       1.45x
SparseLate   72.78x     74.28x          73.95x      74.44x

Interpretation:

  1. Dense early-hit scans can benefit from SIMD, but this workload showed incremental gains, not order-of-magnitude gains.
  2. Sparse late-hit scans are dominated by the shift from bit-by-bit to chunked scan strategy.
  3. Once on chunked scan, AVX2/AVX-512 were nearly tied with scalar unrolled in this run; memory/access pattern dominated.

SIMD Tradeoffs (What This Means in Practice)

SIMD is still valuable, but only after the scalar/chunk baseline is right:

  • If your bottleneck is a bit scan over contiguous words, chunked scalar can already be excellent.
  • If your bottleneck is sparse/random payload gather, SIMD mask/compact alone often does not fix memory stalls.
  • Keep ISA portability in mind: AVX2/AVX-512 code paths need fallbacks, and many deployments are still mixed.
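
As a sketch of what a fallback path can look like (function names are illustrative; __builtin_cpu_supports is GCC/Clang-specific and x86-only), dispatch once at startup and keep scalar as the default:

#include <bit>
#include <cstddef>
#include <cstdint>

using ScanFn = std::uint32_t (*)(const std::uint64_t*, std::size_t);

// Minimal scalar baseline (same chunked countr_zero idea as above).
std::uint32_t scan_scalar(const std::uint64_t* words, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (words[i] != 0)
      return static_cast<std::uint32_t>(i * 64 + std::countr_zero(words[i]));
  return UINT32_MAX;
}

// Stand-in for an AVX2 build of the scan; in a real project this would be the
// intrinsic version, compiled with the appropriate target flags or attributes.
std::uint32_t scan_avx2(const std::uint64_t* words, std::size_t n) {
  return scan_scalar(words, n);
}

// Pick the widest supported path once at startup; default to scalar.
ScanFn pick_scan() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return scan_avx2;
#endif
  return scan_scalar;
}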

Practical Decision Guide

  • tiny fixed flags (<= 64 bits): integer masks (uint64_t)
  • medium fixed compile-time sets: std::bitset<N>
  • large dense runtime sets: custom chunked bitset (uint64_t words; see the sketch below)
  • large sparse sets: roaring/compressed bitmaps

Most common mistake: choosing SIMD first and layout second. Do the opposite.
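
For the "large dense runtime sets" row, the custom structure does not need to be elaborate. A minimal sketch; the API and names are illustrative, not a drop-in library:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Large dense runtime set: uint64_t words with an explicit chunked scan.
class ChunkedBitset {
 public:
  explicit ChunkedBitset(std::size_t bits)
      : bits_(bits), words_((bits + 63) / 64, 0) {}

  void set(std::size_t i)        { words_[i / 64] |=  (1ULL << (i % 64)); }
  void reset(std::size_t i)      { words_[i / 64] &= ~(1ULL << (i % 64)); }
  bool test(std::size_t i) const { return (words_[i / 64] >> (i % 64)) & 1ULL; }

  // Chunked scan: skip zero words, then countr_zero on the first non-zero one.
  std::size_t first_set() const {
    for (std::size_t w = 0; w < words_.size(); ++w)
      if (words_[w] != 0)
        return w * 64 + static_cast<std::size_t>(std::countr_zero(words_[w]));
    return bits_;  // "not found" sentinel: one past the last valid index
  }

 private:
  std::size_t bits_;
  std::vector<std::uint64_t> words_;
};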

Checklist

  1. Define mask semantics explicitly, especially None.
  2. Guard ctz/clz paths against zero input.
  3. Validate std::bitset layout assumptions on your exact STL/ABI.
  4. Establish a clean chunked scalar baseline before SIMD.
  5. Benchmark realistic sparsity and cache conditions, then decide.

Benchmark Code

benchmarks/quick_bench_bitset_scan.cpp
#include <benchmark/benchmark.h>

#include <bit>
#include <cstdint>
#include <random>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
#include <immintrin.h>
#define QB_X86 1
#else
#define QB_X86 0
#endif

namespace {

constexpr size_t kWordsPerSet = 1024;  // 65,536 bits
constexpr size_t kSetCount = 64;

struct Dataset {
  std::vector<uint64_t> dense;
  std::vector<uint64_t> sparse;
};

const Dataset& get_dataset() {
  static const Dataset ds = [] {
    Dataset out;
    out.dense.assign(kSetCount * kWordsPerSet, 0);
    out.sparse.assign(kSetCount * kWordsPerSet, 0);
    std::mt19937_64 rng(0xD00D1234ULL);
    std::uniform_int_distribution<uint64_t> full_dist(1, UINT64_MAX);
    std::uniform_int_distribution<uint32_t> bit_dist(0, 63);
    for (size_t i = 0; i < kSetCount; ++i) {
      // Dense case: every word is a non-zero random value, so the first word
      // is an early hit.
      uint64_t* dense_set = out.dense.data() + i * kWordsPerSet;
      dense_set[0] = full_dist(rng);
      for (size_t w = 1; w < kWordsPerSet; ++w) {
        dense_set[w] = full_dist(rng);
      }
      // SparseLate case: a single set bit in the last word, so every scan
      // walks all 1024 words before hitting.
      uint64_t* sparse_set = out.sparse.data() + i * kWordsPerSet;
      sparse_set[kWordsPerSet - 1] = 1ULL << bit_dist(rng);
    }
    return out;
  }();
  return ds;
}

uint32_t scan_bit_by_bit(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    for (uint32_t b = 0; b < 64; ++b) {
      if ((word >> b) & 1ULL) return static_cast<uint32_t>(w * 64 + b);
    }
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz(const uint64_t* words, size_t word_count) {
  for (size_t w = 0; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_unrolled4(const uint64_t* words, size_t word_count) {
  size_t w = 0;
  for (; w + 4 <= word_count; w += 4) {
    const uint64_t w0 = words[w + 0];
    if (w0 != 0) return static_cast<uint32_t>((w + 0) * 64 + std::countr_zero(w0));
    const uint64_t w1 = words[w + 1];
    if (w1 != 0) return static_cast<uint32_t>((w + 1) * 64 + std::countr_zero(w1));
    const uint64_t w2 = words[w + 2];
    if (w2 != 0) return static_cast<uint32_t>((w + 2) * 64 + std::countr_zero(w2));
    const uint64_t w3 = words[w + 3];
    if (w3 != 0) return static_cast<uint32_t>((w + 3) * 64 + std::countr_zero(w3));
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
}

uint32_t scan_word_ctz_avx2(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX2__)
  const __m256i zero = _mm256_setzero_si256();
  size_t w = 0;
  // Probe 4 words (256 bits) per iteration; movemask yields one bit per lane.
  for (; w + 4 <= word_count; w += 4) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(words + w));
    const __m256i eq = _mm256_cmpeq_epi64(v, zero);
    const unsigned eq_mask = static_cast<unsigned>(_mm256_movemask_pd(_mm256_castsi256_pd(eq)));
    if (eq_mask != 0xF) {
      const unsigned nz_mask = (~eq_mask) & 0xF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_unrolled4(words, word_count);
#endif
}

uint32_t scan_word_ctz_avx512(const uint64_t* words, size_t word_count) {
#if QB_X86 && defined(__AVX512F__) && defined(__AVX512VL__)
  const __m512i zero = _mm512_setzero_si512();
  size_t w = 0;
  // Probe 8 words (512 bits) per iteration; the compare produces a lane mask.
  for (; w + 8 <= word_count; w += 8) {
    const __m512i v = _mm512_loadu_si512(reinterpret_cast<const void*>(words + w));
    const __mmask8 eq_mask = _mm512_cmpeq_epi64_mask(v, zero);
    if (eq_mask != 0xFF) {
      const unsigned nz_mask = (~static_cast<unsigned>(eq_mask)) & 0xFF;
      const size_t lane = std::countr_zero(nz_mask);
      const uint64_t word = words[w + lane];
      return static_cast<uint32_t>((w + lane) * 64 + std::countr_zero(word));
    }
  }
  for (; w < word_count; ++w) {
    const uint64_t word = words[w];
    if (word == 0) continue;
    return static_cast<uint32_t>(w * 64 + std::countr_zero(word));
  }
  return UINT32_MAX;
#else
  return scan_word_ctz_avx2(words, word_count);
#endif
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_Dense(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.dense.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

template <uint32_t (*Fn)(const uint64_t*, size_t)>
static void BM_SparseLate(benchmark::State& state) {
  const auto& ds = get_dataset();
  for (auto _ : state) {
    uint64_t acc = 0;
    for (size_t i = 0; i < kSetCount; ++i) {
      acc += Fn(ds.sparse.data() + i * kWordsPerSet, kWordsPerSet);
    }
    benchmark::DoNotOptimize(acc);
  }
}

BENCHMARK(BM_Dense<scan_bit_by_bit>);
BENCHMARK(BM_Dense<scan_word_ctz>);
BENCHMARK(BM_Dense<scan_word_ctz_unrolled4>);
BENCHMARK(BM_Dense<scan_word_ctz_avx2>);
BENCHMARK(BM_Dense<scan_word_ctz_avx512>);
BENCHMARK(BM_SparseLate<scan_bit_by_bit>);
BENCHMARK(BM_SparseLate<scan_word_ctz>);
BENCHMARK(BM_SparseLate<scan_word_ctz_unrolled4>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx2>);
BENCHMARK(BM_SparseLate<scan_word_ctz_avx512>);

}  // namespace

Footnotes

  1. Quick Bench, used to run the benchmarks consistently.
  2. Google Benchmark, the harness/library behind the measurements.
  3. Reference documentation for std::countr_zero.
  4. Compiler builtins such as __builtin_ctz*; see the GCC/Clang builtins documentation.