Figuring out bitsets

Using raw SIMD with C++

We will quickly go over cases that aren't well supported by bitset and try to figure out how to improve performance. This is my first time looking seriously into the C++ STL, so please be kind about possible mistakes.

C++ bitset header

C++ provides a very handy data type in the form of bitset to help developers squeeze more performance out of their code.

Hence it makes sense to look into its capabilities, understand its advantages and limitations, and see what alternatives we have if it's inadequate for our use cases.

```cpp
// initializing bitsets
constexpr auto size = 8;
std::bitset<size> zero{"00000000"};
std::bitset<size> one{1};
```

The Performance: Cache

The performance of bitsets comes almost entirely from their space efficiency: by fitting in the L1 or L2 cache they tend to minimize cycles wasted fetching unnecessary data.

Hence they are commonly used in databases, for example to check column state. They are most common in document databases, where they often significantly speed up document attribute filtering, or mark deletions so that queries can avoid the cost of actually removing data.

Dealing with masks

The advantage of bitset shines once you define "mask"1 enums for your bitsets, giving you flag set and get operations while minimizing space usage. std::bitset supports all the default operations expected from numeric values, and its real advantage lies in supporting bitsets of arbitrary sizes, such as 30 or 100 bits.

Note: we can’t represent a zero state with a Mask

```cpp
#include <bitset>

// all states encoded separately
struct State {
  // doesn't work: masking any state with NONE yields zero,
  // so NONE can never be matched
  constexpr static std::bitset<4> NONE{0b0000};
  // the rest work fine
  constexpr static std::bitset<4> WALK{0b0001};
  constexpr static std::bitset<4> TALK{0b0010};
  constexpr static std::bitset<4> JUMP{0b0100};
  constexpr static std::bitset<4> DIVE{0b1000};
};

auto match(std::bitset<4> state, std::bitset<4> m) {
  return (state & m).any();
}

auto actions(std::bitset<4> state) {
  /* a bitset state can encode multiple states together */
}

int main() {
  std::bitset<4> none{State::NONE};
  std::bitset<4> walk{State::WALK};

  if (match(none, State::NONE)) { /* never reached: NONE doesn't even match itself */ }
  if (match(walk, State::NONE)) { /* never reached either, the zero-state mask is an err-or !! */ }

  // both walk and talk
  auto state = State::WALK | State::TALK;
  actions(state);

  // stop talking
  state &= ~State::TALK;
  actions(state);

  // start jumping
  state |= State::JUMP;
  actions(state);
}
```

Issues with 8-bit sets

There are a couple of issues with using small bitsets in C++. For a start, implementations don't use a storage type smaller than size_t, which leads to oversized bitsets whenever fewer than 33 bits are needed.

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
  auto size = sizeof(std::bitset<8>);
  std::cout << size << std::endl;  // outputs 8

  // whereas
  auto u8_size = sizeof(uint8_t);
  std::cout << u8_size << std::endl;  // outputs 1
}
```

This space usage makes 8-bit bitsets nearly unusable: with just 200 elements the extra space wasted is 200 * 7 bytes, well over 1 kB. That is a meaningful chunk of a typical 32 kB L1 data cache, so it's entirely possible for the overhead to cause frequent cache misses.

A solution would have been to specialize sizes below 33 bits to use smaller storage. Although that would have been surprising, it wouldn't be unprecedented, considering std::vector<bool> has been specialized in a similar fashion.

Issues with bit scanning in bitsets larger than 64 bits

Other than checking for all-zero or all-one bits, it's often very useful to be able to scan for the first zero or first one bit from either end.

This is easy to do with up to 64 bits: convert with to_ullong and then use the standard scan functions; for example, with C++20 you can use std::countr_zero or std::countl_zero.

But for values up to 256 bits it becomes quite difficult to reuse the same approach, as there is no primitive type that can hold 256 bits.

So to provide up to 256-bit support we would need to split the bitset into 64-bit chunks and scan them one chunk at a time.

Using TZCNT for scanning

Note: std::countr_zero already uses TZCNT2

Since Haswell, the x86 ISA provides the BMI extensions, which include TZCNT and let us perform a trailing-zero bit scan much more efficiently over larger types.

On ARM platforms we have CLZ (combined with RBIT to scan from the other end) and similar instructions as well.

```cpp
#include <cstdint>

// returns the count of trailing zeroes from the right hand side
// note: the result is undefined when bitset == 0, so guard for that
int least_significant_trailing_zeroes(uint64_t bitset) {
  // compiler built-in that internally uses the tzcnt instruction
  // supported by gcc and clang; on msvc use _tzcnt_u64
  return __builtin_ctzll(bitset);
}

// you can use it to scan a larger custom bitset by accessing each
// 64-bit chunk of the bitset one at a time
```

Take a look at Godbolt for the generated code.3

Aside: consider using uops.info for more information on ISA performance on specific architectures; it can be quite useful for understanding performance behaviours.

Using SIMD

With TZCNT the scanning performance is already decent and there is hardly any room left to improve it, but we can definitely improve other aspects.

Intel processors provide the AVX-512 instruction set extensions, which are quite useful4. On AMD they are available starting with Zen 4.

```cpp
// the headers depend on the platform
#include <immintrin.h>  // gcc/clang on x86
// #include <intrin.h>  // msvc
#include <cstdint>
#include <vector>

// Note: the std::vector return type only exists to make the snippets below
// easier to read and grok; for high-performance use cases you would likely
// want to use raw buffers and avoid the unnecessary costs

// fetches the data from the items list as per the bitmap
std::vector<int> fetch_with_bitmap(const int items[16], uint16_t bitset) {
  std::vector<int> ret;
  for (int i = 0; i < 16; i++) {
    if ((bitset & (1u << i)) != 0) {
      ret.push_back(items[i]);
    }
  }
  return ret;
}

// Note: a similar function can be used for picking data members
// efficiently by splitting the indexing and the actual data containers
//
// AVX-512 based bitset optimization for fetch with bitmap
std::vector<int> fetch_with_bitmap_avx512(const int items[16], uint16_t mask) {
  // select buffer
  int select[16] = {};
  // base indices that can be picked from the mask
  __m512i base_index = _mm512_setr_epi32(
      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  // to avoid downclocking due to wide-register (512-bit) over-usage,
  // consider using the 256-bit _mm256 variants (AVX-512VL)
  _mm512_mask_compressstoreu_epi32(select, mask, base_index);
  // counts the number of `1`s in the mask
  int len = _mm_popcnt_u32(mask);
  // only present for brevity, shouldn't be used in production code
  std::vector<int> ret;
  for (int i = 0; i < len; i++) {
    ret.push_back(items[select[i]]);
  }
  return ret;
}
```

Update (2024): Zen 4 already supports AVX-512, but Alder Lake and newer Intel consumer CPUs no longer ship with AVX-512 enabled by default.

So is bitset perf acceptable?

From the above quick research we discover quite a few cases that make bitset awkward for some of its expected use cases. So I can only recommend that people judge acceptable performance on the merits of their own application's requirements.

But for me std::bitset has ended up wholly unacceptable, and I would much rather write my own version of std::bitset than stick to the STL one.

Still, you should be aware that this is a breakdown of the libc++ bitset; the behaviour of your vendor-specific STL or bitset library might suit your needs well enough.

Footnotes

  1. Masks (computing)

  2. LLVM BMI Patch for TZCNT

  3. rep bsf will decode to tzcnt if supported by the CPU.

  4. Intel AVX-512 Set Intrinsics Documentation