Figuring out bitsets
Using raw SIMD with C++
We will quickly go over cases that aren't well supported by bitset and try to figure out how to improve performance. This is my first time looking seriously into the C++ STL, so please be kind about possible mistakes.
C++ bitset
C++ provides a very awesome data type, bitset, in the <bitset> header, to help developers squeeze more performance out of their code. Hence it makes sense to look into its capabilities, try to understand its advantages and limitations, and see what alternatives we have if it's inadequate for our use cases.
```cpp
#include <bitset>

// initializing bitsets
constexpr auto size = 8;
std::bitset<size> zero{"00000000"};
std::bitset<size> one{1};
```
The Performance: Cache
The performance of bitsets comes largely from their memory space efficiency: by fitting in the L1 or L2 cache, they minimize cycles wasted fetching unnecessary data.

Hence they are commonly used in databases, for example to track column state. They are most common in document databases, where they often significantly speed up document attribute filtering, or make deletes cheap by avoiding the cost of actually removing data on every query.
Dealing with masks
The advantage of bitset shines once you have "mask"1 enums for your bitsets, providing flag set and get operations while minimizing space usage. std::bitset supports all the default features expected from numeric values, and its advantage lies in the ability to support bitsets of arbitrary sizes, such as 30 or 100.
Note: we can’t represent a zero state with a Mask
```cpp
#include <bitset>

// all states encoded separately
struct State {
    // doesn't work as WALK is also NONE after applying the mask
    constexpr static std::bitset<4> NONE{0b0000};
    // the rest work fine
    constexpr static std::bitset<4> WALK{0b0001};
    constexpr static std::bitset<4> TALK{0b0010};
    constexpr static std::bitset<4> JUMP{0b0100};
    constexpr static std::bitset<4> DIVE{0b1000};
};

auto match(std::bitset<4> state, std::bitset<4> m) {
    return (state & m) == m;
}

auto actions(std::bitset<4> state) {
    /* a bitset state can encode multiple states together */
}

int main() {
    std::bitset<4> none{State::NONE};
    std::bitset<4> walk{State::WALK};

    if (match(none, State::NONE)) { /* true */ }
    if (match(walk, State::NONE)) { /* also true, error!! */ }

    // both walk and talk
    auto state = State::WALK | State::TALK;
    actions(state);

    // stop talking
    state &= ~State::TALK;
    actions(state);

    // start jumping
    state |= State::JUMP;
    actions(state);
}
```
Issues with 8-bit sets
There are a couple of issues with using small bitsets in C++. For a start, it doesn't allow a backing type with size less than size_t, which leads to oversized bitsets for anything smaller than 33 bits.
```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
    auto size = sizeof(std::bitset<8>);
    std::cout << size << std::endl;    // outputs 8

    // whereas
    auto u8_size = sizeof(uint8_t);
    std::cout << u8_size << std::endl; // outputs 1
}
```
This means the space overhead of 8-bit bitsets makes them entirely unusable: with just 200 elements, the total extra space used would be over 200 * 7 bytes, which is well over 1 kB. And assuming a 2 kB L1 budget, it's entirely possible for this to cause frequent cache misses.
A solution would have been to specialize sizes less than 33 bits to allow smaller backing types. Although that would have been surprising, it wouldn't be unprecedented, considering std::vector<bool> has been specialized in a similar fashion.
Issues with bit scanning bitsets larger than 64 bits
Other than checks for all-zero or all-one bits, it's often also very useful to be able to scan for the first zero or first one bit from either end.
This is easy to do with up to 64 bits: use to_ullong and then scan the resulting value; for example, with C++20 you can use std::countr_zero or std::countl_zero.
But for values of up to 256 bits it becomes quite difficult to reuse the same approach, as there is no default primitive type that can fit 256 bits. So to provide up-to-256-bit support, we would need to scan the set in 64-bit chunks.
Using TZCNT for scanning

Note: std::countr_zero already uses TZCNT2
The x86 BMI extensions (available since Haswell) provide TZCNT, which helps us perform trailing-zero bit scans a lot more efficiently over larger types. On ARM platforms we have CLZ and similar instructions as well.
```cpp
#include <cstdint>

// returns the count of trailing zeroes from the right hand side
// note: the result is undefined when bitset is 0
auto least_significant_trailing_zeroes(uint64_t bitset) {
    // compiler built-in that lowers to the tzcnt instruction,
    // supported by gcc and clang
    return __builtin_ctzll(bitset);
    // note: on msvc it's _tzcnt_u64
}

// you can use it to scan a larger custom bitset by accessing
// each 64-bit chunk of the bitset one at a time
```
Take a look at Godbolt for the generated code.3
Aside: Consider using uops.info for more information regarding ISA performance with specific architectures, it can be quite useful in understanding performance behaviours.
Using SIMD
With TZCNT the performance is already decent, and there is hardly any room to improve scanning itself, but we can definitely improve other aspects.
Intel processors have the AVX-512 instruction sets, which are quite useful4. They are also available on recent AMD Zen CPUs.
```cpp
// the headers depend on the platform (msvc wants <intrin.h>)
#include <immintrin.h>
#include <cstdint>
#include <vector>

// Note: the std::vector return type only exists to make the snippets below
// easier to read and grok; for high performance use cases you would likely
// want to use raw buffers and avoid the unnecessary costs

// fetches the data from the items list as per the bitmap
std::vector<int> fetch_with_bitmap(const int items[16], uint16_t bitset) {
    std::vector<int> ret;
    for (int i = 0; i < 16; i++) {
        if ((bitset & (1 << i)) != 0) {
            ret.push_back(items[i]);
        }
    }
    return ret;
}

// Note: a similar function can be used for picking data members
// efficiently by splitting the indexing and actual data containers
//
// AVX-512 based bitset optimization for fetch with bitmap
std::vector<int> fetch_with_bitmap_avx512(const int items[16], uint16_t mask) {
    // select buffer for the compressed indices
    int select[16] = {};

    // base indices that can be picked from the mask
    __m512i base_index = _mm512_setr_epi32(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

    // to avoid downclocking due to overuse of wide (512-bit)
    // registers, consider using _mm256 from AVX2 instead
    _mm512_mask_compressstoreu_epi32(select, mask, base_index);

    // counts the number of `1`s in the mask
    int len = __builtin_popcount(mask);

    // only present for brevity, shouldn't be used in production code
    std::vector<int> ret;
    for (int i = 0; i < len; i++) {
        ret.push_back(items[select[i]]);
    }
    return ret;
}
```
Update (2024): Zen 4 already supports AVX-512, but Alder Lake and newer Intel CPUs have not shipped with AVX-512 enabled by default.
So is bitset perf acceptable?
From the quick research above, we discover that there are quite a few cases which make using bitset awkward for some of its expected use cases. So I can only recommend that people judge acceptable performance for their applications on the merits of their own requirements.
But for me, std::bitset has ended up wholly unacceptable, and I would much rather write my own version of std::bitset than stick to the STL.
Still, be aware that this is a breakdown of the libc++ bitset; the behaviour of your vendor-specific STL or bitset library might suit your needs well enough.
Footnotes
- rep bsf will decode to tzcnt if supported by the CPU. ↩