From the dumbest case-conversion to SIMD — a benchmark journey
We start character-case conversion with the dumbest algorithm with 26 if-else branches and gradually improve the performance until we reach a SIMD case-conversion algorithm.
Case conversion is a complex topic that looks simple. We’re going to explore its edges, write multiple versions, speak about their benefits and pitfalls, and finally benchmark them.
The code repository#
You can clone the repository from “https://github.com/switchcaselab/code”, which contains all the source code and build scripts you will need.
The repository currently only contains the “from-the-dumbest-case-conversion-to-simd-a-benchmark-journey” project, but I plan to add more projects gradually. I try to share dependencies across projects so it will be easier for you to run them.
C++ projects use CMake as a build system and VCPkg for downloading and compiling dependencies.
Project setup#
First, you should set up CMake. You can download the latest version from their website here or install it through your OS package manager.
Second, you should initialize Vcpkg. If you are using Windows, there is a bootstrap-vcpkg.bat script in the repository; if you are using Linux, there is a bootstrap-vcpkg.sh script. Running the appropriate script will initialize Vcpkg.
Third, you can now generate build scripts using CMake presets. Go to the project directory and run the following commands:
cmake --preset release
cmake --build --preset releaseFourth, there is a case-conversion executable in the cmake-build-release directory. You can run it to see the benchmark results.
The dumbest version#
Let’s define the problem first. Suppose we have a string of ASCII characters. It may contain printable characters, control characters, etc. Sometimes we need to convert the case of characters, but non-alphabet characters should stay as is. For example, the host part of website URLs is case-insensitive; therefore, browsers almost always convert the host part to lowercase before continuing processing. Our goal is to convert the case of each character. For example, ‘A’ should become ‘a’, ‘B’ becomes ‘b’ and so on.
This leads us to the simplest implementation, which uses a long chain of if-else branches.
inline void ToLower_Dumbest(string& s)
{
for (char& c : s)
{
if (c == 'A') c = 'a';
else if (c == 'B') c = 'b';
else if (c == 'C') c = 'c';
else if (c == 'D') c = 'd';
else if (c == 'E') c = 'e';
else if (c == 'F') c = 'f';
else if (c == 'G') c = 'g';
else if (c == 'H') c = 'h';
else if (c == 'I') c = 'i';
else if (c == 'J') c = 'j';
else if (c == 'K') c = 'k';
else if (c == 'L') c = 'l';
else if (c == 'M') c = 'm';
else if (c == 'N') c = 'n';
else if (c == 'O') c = 'o';
else if (c == 'P') c = 'p';
else if (c == 'Q') c = 'q';
else if (c == 'R') c = 'r';
else if (c == 'S') c = 's';
else if (c == 'T') c = 't';
else if (c == 'U') c = 'u';
else if (c == 'V') c = 'v';
else if (c == 'W') c = 'w';
else if (c == 'X') c = 'x';
else if (c == 'Y') c = 'y';
else if (c == 'Z') c = 'z';
}
}In this version, the code loops over all characters and checks each against 26 uppercase characters. If it matches, the code replaces it with its lowercase counterpart.
The dumb version#
Since we previously established that a switch statement is typically faster than a long chain of if-else branches for enumerable variables, the natural question is: why not just use a switch statement?
Let’s implement the previous function with a switch.
inline void ToLower_Dumb(string& s)
{
for (char& c : s)
{
switch (c)
{
case 'A': c = 'a';
break;
case 'B': c = 'b';
break;
case 'C': c = 'c';
break;
case 'D': c = 'd';
break;
case 'E': c = 'e';
break;
case 'F': c = 'f';
break;
case 'G': c = 'g';
break;
case 'H': c = 'h';
break;
case 'I': c = 'i';
break;
case 'J': c = 'j';
break;
case 'K': c = 'k';
break;
case 'L': c = 'l';
break;
case 'M': c = 'm';
break;
case 'N': c = 'n';
break;
case 'O': c = 'o';
break;
case 'P': c = 'p';
break;
case 'Q': c = 'q';
break;
case 'R': c = 'r';
break;
case 'S': c = 's';
break;
case 'T': c = 't';
break;
case 'U': c = 'u';
break;
case 'V': c = 'v';
break;
case 'W': c = 'w';
break;
case 'X': c = 'x';
break;
case 'Y': c = 'y';
break;
case 'Z': c = 'z';
break;
}
}
}This implementation is easy to understand; it replaces every character that matches an uppercase letter with its lowercase counterpart.
The normal version#
Since characters are enumerable and alphabet characters are in a sequence (see here), we can easily use logical operators to determine if a character is in the uppercase character range. Furthermore, we can use arithmetic operators to offset the character to its lowercase equivalent.
inline void ToLower_Normal(string& s)
{
for (char& c : s)
{
if (c >= 'A' && c <= 'Z')
c = static_cast<char>(c - 'A' + 'a');
}
}In this version, it checks if the character is in the range of A-Z, and then offsets the character using the difference between ‘a’ and ‘A’.
Note: (‘a’ - ‘A’) is 32(0x20).
The benchmark journey (Part One)#
Let’s set up a benchmarking system to benchmark our implementations.
To benchmark the code, I used the google/benchmark library. It’s very easy to use. It has three main parts: including the header, writing the benchmark function and using the benchmark main macro.
Google benchmark quick look#
- Header: The main google benchmark header is
benchmark/benchmark.h. It contains all the macros that we need. - BENCHMARK macro: It is used to register a function as a benchmark function.
- BENCHMARK_MAIN macro: This macro implements the main function, automatically runs the benchmarks, and prints the outputs to stdout.
Benchmark function#
A benchmark function is a static function with a void return type. It takes a state argument that controls the benchmark loop. To differentiate between normal functions and benchmark functions, we use a BM_ prefix for the function names.
static void BM_ToLower_Dumbest(benchmark::State& state)
{
for (auto _ : state)
{
std::string s = input;
ToLower_Dumbest(s);
benchmark::DoNotOptimize(s);
}
state.SetBytesProcessed(state.iterations() * static_cast<int64_t>(input.size()));
}BM_ToLower_Dumbest is a benchmark function.
To register the benchmark function, we use the BENCHMARK macro like in the code snippet below:
BENCHMARK(BM_ToLower_Dumbest);The benchmark result#
The code for the benchmark is available in the repository, so I didn’t include it here. However, before showing the results, I need to define the input value that you can see in snippet 4.
We need a long enough string of characters to measure the actual time of the algorithm rather than the noises around it. Therefore, I chose a length of 4096(4K) characters for the input string.
Now we need to generate the string. The easiest way is to use the string(size_t count, char ch) constructor, so let’s go with that.
const string input(4096, 'A');The rest is easy, just run the benchmark and get the results.
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 1474 ns | 1413 ns | 497,778 | 2.70 GiB/s |
| BM_ToLower_Dumb | 3543 ns | 3530 ns | 203,636 | 1.08 GiB/s |
| BM_ToLower_Normal | 1163 ns | 1074 ns | 640,000 | 3.55 GiB/s |
Wait, what just happened? How is it possible that the switch statement is slower than the if-else branches?
The answer lies in the input. Because we chose A and fill the entire string with it, and our if-else chain starts by checking for A, every character is resolved in the very first if branch. A single, highly predictable if branch is extremely fast.
So, what if we choose Z instead of A?
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 46784 ns | 46039 ns | 14,933 | 0.08 GiB/s |
| BM_ToLower_Dumb | 3087 ns | 2888 ns | 248,889 | 1.32 GiB/s |
| BM_ToLower_Normal | 1258 ns | 1221 ns | 448,000 | 3.13 GiB/s |
What a beautiful result. It shows us the difference between switch and if-else branches. The if-else chain’s speed drops dramatically, but switch statement’s speed stays stable across tests.
But this is also not a fair test. Let’s choose a random string, as shown in Snippet 5.
const string input(
"mEHogujvUDQWkHfobSdEfZkmWbjFHsKFdBeafwKoydYSDicqhBGkOfPsyXnNjhTYaboTtQhmPFiEtvVHCGtjfNMBZgevjGVHwewekcFMmPmFamRcvvWrFmmHuRROCRGpFZsnCsVvNZbaQjAlYjbSfOIGplnczRUoyHNYcztRStyqSdvdGkjNEEXOReDATiQtSVIiTRVkddPksnNjebIlDaorSBjzJBnSYCiMdxMeIIPTKpSBsQNFouPDfUnRMupzeTudnOQPMqwWDYYqywDwtYnNCAccZhtwCzcxBraXdYlIggQYllBrJpnVjUfjphRpkTwUzIGvEsYYpBKodXsFBIcGXLKVxPVdDSbHJwfNrvxIXarjDvFPBXiqLQXavJGvNYAouhffkwpoCWFOgOtHzRLPZxoZIYbVbBiiKYjOQIrXtFfWLWovijPgVyJSqVkHYLtntNwoIzwFrngcyYOntoRTwaAFJkjgmchIGDVgYWJKmjhJvRSKLIHfiIpEZaELKLlNzoowuMgqYLLrRdzwCsblRzUgpONTZxqxUTTSQEbLFqJhetCaPdCVIrsntqaPvXknABSFGrMNGnKGKdlEmVtPyTxzzTfetBMGDxasFOZqzagrNQeWUhcvBEbGESqZlRFhexIlHLFdDwkTWenZXUEeZmyGxsylsZbPMPuzfEDpTwomlEyMZPHljQygdaDtPrhyxJhXdhzrlrfNVDsNjEyUhJCtcPYGKBXUIMhsDgRptNbLNNmeUJfZCAsytewffQEMPyAzLdYTiUgVaOPQVSkWqfzCmHEwvHUVojENxFpcAjrfnwujnoeEAIKUZNyNffXHIqNzwoEZgsXwHMbTuHwhYdrzkWHaHNpYVBCqIgCdjGRRmMDXpkEHUNImNIxktufKkGEfFuNHWTEbnWtHCEyBHFTJfyoRzYlJJrsuFDGMBDgFKzXMudQxUnNeeMZQOPZkNNxjQbOeFKFFQvUNPuGJcKBGjbHVuTkMtFPfaabAYelIYxZDZbCMjlqxLjGcGZyqoRtJkTQxakMqpKbHlxVBPMqQZsBUEfDoDxhzrkGAvGHXtmHOEbNHedFSUylGXbyJkYAYTlDUKohgKcukrNYdgDIdAxzDqBWOjxxdfiOmmOdSEZEtLfweKQMfDIHLqsFcLSWzSkYFrdVBDHFLqAONaAqynrWPfisHLCGzRkvrukYjsDggLmudRaditLqWNkYkavfeSoFQTYelxBEEnipoAjeBetWJPJzuhqJaCClQuNNtFHVPYqpnNJaBHJhoUMbluOzkICxImuImLtMNllylXUzZFNWjrSjMVvjtmIZpbYhZldsrXRfXVSjHZMhSRnWwqyqKTnIOBnDQdEmjSOZmkRowpOzGCwlZoHbTJGytcMQuZtaLjRbbiKLxjpSqKFeWlNeEmLDUKbFDnxPjaFZKVvFtbhcdrusitVBgiyYUcwIVaMRUKOAzlotKfeCqKkCPFdvXRrdJUxfXAVSalyovhVboNGEIUDntqwrflnzBUxlmzjxCfFMfZBbRpjpziCxvDEHTblyFkXukWrwBKcILesDuQXmpvJbVyuxsuBUflrkqbfdImzpZDQrJiTKmhxrqVLwpDPoeGHhtJhEmmDiWNfTpBJodQTyuwStLEADxxxRDQPbfiaAVcggItirKGGAycpVATpvttlNZTaCGTlbKDaEgQKKjwBJUJBPYlJcrkdpLwLhZskGGJRuUxIJIDWPTnIClfsXNCGtgGIIuwEiBeWcvpvhMLdOIRajtDSwTLKBDKgEbSqYVEpMqLKBnHTpvVYBMaKsmARnGUMEGoUZDlQRcEEGUiVqwFSwqFOzqTQGTgzkBxJUQuIbWpSXRjUfBTBlGwdxopTFEsnsMuSxDWnwxyjrqeRFGbtMzlIEqSBTffhPgcVvrWGqVvLjxEodmQwZLfHkOQaLrOXciVmWIiDYhSVUZbBQZMqVKSvbIvHoSzlrJSSPIrTmujXopGVefIcBitANJTdYQtNyinQQLRxJdyIZsCGANeJcUgdtpOZUQAsaHQPpBJhCvlYfQItCFgHYEOMmkQqMkwZiDxDBTrKqoMLGBSfkpkAZDMjwwbkephdtpPQbhUumbUlLxYrUfhcbVBFbcvBdsRgMvNeUMsnuZDRvbBWTNUuIUSBjtNCwITvicvrVuzKHLNtrEYzhPjgCRumvLfatTCXPhDiyMbzlRWFCqkmfmSgPetFXfnNPfZFlfMrqVPmanxJAyUZZUxvkUknTmDJJaqdcpwMVHttXRxCYiROMKcWmSMjIuHWbMuTJREiDjENknbUGapWhbcNRyhOhSycHOHaKTixGpkDXwbRNzHedVjxZKCifwzThfONXozCrNzyyLUtMSumWOqvMVGQrtzSwskYINnZVMJNeehbvGyaRhKHNfoViNJKfwtBCsdEjbQpdaXdMBlcHpPuBfGEOLzPYZuswhtwmEcjnbvCCGGJJAFVqBePNrWiJJfMzOdajdBXtNRGpCeIIzwUcSxbHgmbIoPeERcyOxZQQSZFWniEOdiKrJAFXsuwhRVGWNAlhMawGvSycsPucHOAvNIWXXYwPpsHoSETTwHquhxjoFraiebWPtxHnasRcuLTKKmLJxKjNWpzuIMSPcATjAxrMENCHkOpPXOovxGFaegUGEiWXZgTCaWexjTYxDEKOyzleiPfTHJkwkQcUQjzQtGrrrcsoeqJxpWgbsdQSjEapfGXHHYNzNJLsZmuODTdyoEMdMOANuxRxNAFbcpTHZvRUsRfTGCCBAEfhzWHBipPjtdhJBdgoMnqwmtOOWTkpuKBpsoQCRhbsFLYzqBoYzLsQpPLtxQtJyqDsXCjUNLTFmdQwTskAvekGXbaAUXIzTKgqBiixYSCBqZmhInHhxpsnJHqCDiubIphjEmpobwVtmwPGTqkzjnVeyTwAPbOgSUhZuiksprjizGoZSNLdHTwKqqFzplShpuSqZdmveQoQBgOmUzplTilrqaTpIlLQchHjRctWzjGeSnvoZnzpDaZWSKWzqLrjDakFtYcKdhuGGXcSLBrivDCpLFtvJbjknoBwUOAUNTJBBMZGspFjnBOjxQOGJtahcvKwWOVuzrFUkOXyVgwcVFiTPsXjrhzopdghQXtYNolYftuVvLgBQrgkCUKDrlANJgNDbLqAbxNFJapcXnUVVAcDTaMeKtMbnDonhfinOEegxrQUCsCsHALxTTSWxrRKuxOgpoSNcNduFNvsOgKsNWGLNIpKSbOtNuaGXGdpTOpsNIpqTUOaNvwmlioPNUTBLRYKyFQfprewKGHYOHhdFnNqgLRHeBgwctuGlAwDbUooHvESxrvDCmWbByNTpBNBJmQjwrsopNhYCHdXFWDTTTHMnmzYwLuGgbtmSMmAzMtLPokjzmNQlLezoIhDQHFyCwCeudWlhCPlIztlGRPXoLUweUeRKIxvwVPlXioZqvigsnFWesFUmqASYBiYiYLEjaWFoFvPRUEGnRsuSVMtzJCXcfjsniSoxUiiGjiXnjaGkUinCdYVylZaYKsUqjNofbhCwUQuhgDUOYXPkwTCZOfBlfUNLIVfCOMhtHAidpMPJdHrfawzHRTiiVEluZNZaZyPSyYEePSIaTAWKdnuoXoLFwtkVJrDPOpNftXTMoCfqUTbSVtPrNNQwhCIAPUNAIXAYsHyTDozKtqlTvUwBhAgeKoxcDbRYSGqrnIzFuiPNJuSQiXTVkDBvFTXaymJNLUWzVHwbKiiXOmJpGsaJzBvhnHWswIpmRjlxarqwRvdFLOXIBmSCQutEKLEfWGnHLVhgEWVRLybMypZCWhDVZTKwKzMDJYnifoxmWTMLTGMmbrHsZIBylWYgPMIFzgrYKXnVDetfrkhxADvTmhLfcJdsNwELACIAnorloPqcOCWhDwddRIfUwctoeBJeDRfRrxKAcREuiQDnkVMgtbOjvVIGvmpzxYfFAPTFRwqnMmRYAUOWwEURDgGowuUbhnipacqzYSyqfWAjnkXVDpNaDBoMHANXYnggTskXNIrhFNkFkQpaZgoGcqGWJdSOjQaUNndjgcHSKwOPnrSVoOhtkhFWKKpwDZlcWOMzoQVqWpld");The final result.#
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 39463 ns | 38504 ns | 18,667 | 0.10 GiB/s |
| BM_ToLower_Dumb | 18952 ns | 18485 ns | 44,800 | 0.21 GiB/s |
| BM_ToLower_Normal | 2949 ns | 2783 ns | 235,789 | 1.37 GiB/s |
Although the switch approach is 2x faster than the if-else chain, the throughput of both drops significantly because the CPU’s branch predictor cannot predict the random string.
Getting clever#
As you saw, branches are very time-consuming, so the question is: Is it possible to solve the problem without comparison?
Let’s take a look at the binary representation of alphabet characters.
| Upper case character | Binary | Lower case character | Binary |
|---|---|---|---|
| A | 01000001 | a | 01100001 |
| B | 01000010 | b | 01100010 |
| C | 01000011 | c | 01100011 |
| … | … | … | … |
| Y | 01011001 | y | 01111001 |
| Z | 01011010 | c | 01111010 |
As you can see in Figure 4, the binary representation of uppercase and lowercase for each character is identical except the 5th bit, which is zero for uppercase and one for lowercase characters. So we can just toggle the 5th bit to convert an uppercase letter to lowercase and vice versa.
inline void ToLower_Clever(string& s)
{
constexpr auto mask = 0x20;
for (auto &c: s)
{
c |= mask;
}
}There are no branches anymore; these types of algorithms are called Branchless or Branch-free since they don’t rely on conditional jumps!
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 36454 ns | 35296 ns | 19,478 | 0.11 GiB/s |
| BM_ToLower_Dumb | 17058 ns | 17090 ns | 44,800 | 0.22 GiB/s |
| BM_ToLower_Normal | 2723 ns | 2727 ns | 263,529 | 1.40 GiB/s |
| BM_ToLower_Clever | 86.1 ns | 87.2 ns | 8,960,000 | 43.75 GiB/s |
That is unbelievable. Is it real, or is something off? Uh, I found it. There is a bug: because we didn’t limit the bitwise or operator to alphabet characters, it also modifies non-alphabet characters. For example, it converts [ to {. So, this version only works when you are absolutely sure your string contains only alphabet characters.
I’ve renamed the ToLower_Clever to ToLower_Clever_Partial and started digging for a branchless way to limit the formula to uppercase characters only. Finally, I found it.
inline void ToLower_Clever_Partial(string& s)
{
constexpr auto mask = 0x20;
for (auto &c: s)
{
c |= mask;
}
}
inline void ToLower_Clever(string& s)
{
constexpr auto GREATER_EQUAL_A = 0x80 - 'A';
constexpr auto GREATER_THAN_Z = 0x80 - ('Z' + 1);
constexpr auto HIGHEST_BIT_MASK = 0x80;
for (auto &c: s)
{
const auto lowSevenBits = c & ~HIGHEST_BIT_MASK;
const auto highestBit = c & HIGHEST_BIT_MASK;
const auto ge_A = lowSevenBits + GREATER_EQUAL_A;
const auto gt_Z = lowSevenBits + GREATER_THAN_Z;
const auto isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
c |= static_cast<char>(isUpper >> 2);
}
}Lines 1 to 8 show the previously renamed function. Let’s explain the algorithm.
There is a very beautiful trick that I’ll explain first.
Carry into bit 7: The first thing we need is to differentiate A..Z characters mathematically. Since all standard ASCII printable characters are less than 0x80 (128), we can use bit 7 (the highest bit) as a mathematical flag. We can offset the character to flip bit 7 if the current character is between A and Z. So we can categorize characters into four groups:
- Characters greater than or equal to
0x80: These characters’ bit 7 is already one. We can safely filter them out. - Characters less than
A: If we offset these characters by(0x80 - 'A'), bit 7 remains zero. - Characters greater than
Z: If we offset these characters by(0x80 - ('Z' + 1)), bit 7 flips to one. - Characters between
AandZ: If we offset them by(0x80 - 'A'), bit 7 flips to one and if we offset them by(0x80 - ('Z' + 1)), bit 7 remains zero.
That’s it! The characters that meet the conditions of the fourth group are our target. See the ToLower_Clever again.
- Line 17: This line clears bit 7 to prevent overflow during addition. We handle characters that originally had bit 7 set in the final step.
- Line 18: This line remembers whether the original character had bit 7 set.
- Line 19: Offsets the character by
(0x80 - 'A'). - Line 20: Offsets the character by
(0x80 - ('Z' + 1)). - Line 21: Calculates if the character is in group four and isolates bit 7. It keeps characters that are greater than or equal to
Aand are not greater thanZand originally had bit 7 as zero. It then ANDs the result with 0x80 to clear the lower 7 bits. SoisUppercan have two possible values: for characters betweenAandZit is0x80, otherwise it is zero. - Line 22: We need to flip the 5th bit of an uppercase character. By shifting
isUpperby two,0x80becomes0x20and voila!
The uppercase characters are converted to their lowercase counterparts like a piece of cake.
Why the bitwise trick works#
Branches, especially mispredicted branches, are very costly for CPUs. On the other hand, bitwise operations are very cheap for CPUs; some CPUs can even execute multiple bitwise operations in a single CPU clock cycle!
A quick look at the benchmark#
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 36454 ns | 35296 ns | 19,478 | 0.11 GiB/s |
| BM_ToLower_Dumb | 17058 ns | 17090 ns | 44,800 | 0.22 GiB/s |
| BM_ToLower_Normal | 2723 ns | 2727 ns | 263,529 | 1.40 GiB/s |
| BM_ToLower_Clever_Partial | 86.1 ns | 87.2 ns | 8,960,000 | 43.75 GiB/s |
| BM_ToLower_Clever | 1566 ns | 1538 ns | 497,778 | 2.48 GiB/s |
As you can see, the clever way is 2x faster than the normal algorithm. Also, the result of ToLower_Clever_Partial is insane; the compiler is doing some vectorization. You can see the compiled assembly code here.
The professional version — SWAR#
The algorithm of converting an uppercase character to lowercase is the same as ToLower_Clever. (Note: This assumes the string’s length is a multiple of eight characters (8 bytes)).
inline void ToLower_SWAR(string& s)
{
constexpr uint64_t GREATER_EQUAL_A = 0x0101010101010101ULL * (0x80 - 'A');
constexpr uint64_t GREATER_THAN_Z = 0x0101010101010101ULL * (0x80 - ('Z' + 1));
constexpr uint64_t HIGHEST_BIT_MASK = 0x8080808080808080ULL;
uint64_t index = 0;
char* data = s.data();
while (index < s.size())
{
uint64_t chunk;
memcpy(&chunk, data + index, 8);
const uint64_t lowSevenBits = chunk & ~HIGHEST_BIT_MASK;
const uint64_t highestBit = chunk & HIGHEST_BIT_MASK;
const uint64_t ge_A = lowSevenBits + GREATER_EQUAL_A;
const uint64_t gt_Z = lowSevenBits + GREATER_THAN_Z;
const uint64_t isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
chunk |= (isUpper >> 2);
memcpy(data + index, &chunk, 8);
index += 8;
}
}The GOAT#
If the string’s length is a multiple of 16 bytes, we can utilize SSE SIMD instructions. SSE uses 16-byte wide registers.
For more information on SSE functions and also wider versions like AVX, AVX2, etc, see this.
inline void ToLower_SSE2(string& s)
{
const auto SSE_GREATER_EQUAL_A = _mm_set1_epi8(0x80 - 'A');
const auto SSE_GREATER_THAN_Z = _mm_set1_epi8(0x80 - ('Z' + 1));
const auto SSE_HIGHEST_BIT_MASK = _mm_set1_epi8(static_cast<char>(0x80));
uint64_t index = 0;
char* data = s.data();
while (index < s.size())
{
auto chunk = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + index));
const auto lowSevenBits = _mm_andnot_si128(SSE_HIGHEST_BIT_MASK, chunk);
const auto highestBit = _mm_and_si128(chunk, SSE_HIGHEST_BIT_MASK);
const auto ge_A = _mm_add_epi8(lowSevenBits, SSE_GREATER_EQUAL_A);
const auto gt_Z = _mm_add_epi8(lowSevenBits, SSE_GREATER_THAN_Z);
const auto isUpper = _mm_and_si128(_mm_andnot_si128(gt_Z, ge_A),
_mm_andnot_si128(highestBit, SSE_HIGHEST_BIT_MASK));
chunk = _mm_or_si128(chunk, _mm_srli_epi16(isUpper, 2));
_mm_storeu_si128(reinterpret_cast<__m128i*>(data + index), chunk);
index += 16;
}
}Okay, let me go through it step by step.
- Line 11: Loads 16 characters to the
chunkvariable. The u in_mm_loadu_si128means the memory is unaligned. - Line 12: NOTs
SSE_HIGHEST_BIT_MASKand ANDs it withchunk. - Line 13: ANDs
chunkwithSSE_HIGHEST_BIT_MASK. - Line 15: Adds packed 8-bit integers in
lowSevenBitsandSSE_GREATER_EQUAL_A.
Register A: [ 0x41 | 0x42 | 0x20 | ... 13 more bytes ]
Register B: [ 0x3F | 0x3F | 0x3F | ... 13 more bytes ]
------------------------------------------
Result: [ 0x80 | 0x81 | 0x5F | ... 13 more bytes ]- Line 16: Adds packed 8-bit integers in
lowSevenBitsandSSE_GREATER_THAN_Z. - Line 17: Calculates
ge_A & ~gt_Z & SSE_HIGHEST_BIT_MASK & ~highestBit. - Line 18: Stores the result to
chunk.
The benchmark journey (Part Two)#
So, we implemented ToLower with SWAR and SIMD. Let’s see the results.
| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 36454 ns | 35296 ns | 19,478 | 0.11 GiB/s |
| BM_ToLower_Dumb | 17058 ns | 17090 ns | 44,800 | 0.22 GiB/s |
| BM_ToLower_Normal | 2723 ns | 2727 ns | 263,529 | 1.40 GiB/s |
| BM_ToLower_Clever_partial | 86.1 ns | 87.2 ns | 8,960,000 | 43.75 GiB/s |
| BM_ToLower_Clever | 1566 ns | 1538 ns | 497,778 | 2.48 GiB/s |
| BM_ToLower_SWAR | 402 ns | 399 ns | 1,723,077 | 9.56 GiB/s |
| BM_ToLower_SSE2 | 226 ns | 225 ns | 2,986,667 | 16.96 GiB/s |
| BM_ToUpper_SWAR | 404 ns | 401 ns | 1,792,000 | 9.51 GiB/s |
| BM_ToUpper_SSE2 | 226 ns | 225 ns | 2,986,667 | 16.96 GiB/s |
I also implemented two other functions using the C++ standard library and MSVC compiler-specific __ascii_tolower function.
inline void ToLower_std(string& s)
{
for (char& c : s)
{
c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
}
inline void ToLower_ASCII_Compiler_Specific(string& s)
{
for (char& c : s)
{
c = static_cast<char>(__ascii_tolower(static_cast<unsigned char>(c)));
}
}| Benchmark | Time | CPU | Iterations | Throughput |
|---|---|---|---|---|
| BM_ToLower_Dumbest | 36454 ns | 35296 ns | 19,478 | 0.11 GiB/s |
| BM_ToLower_Dumb | 17058 ns | 17090 ns | 44,800 | 0.22 GiB/s |
| BM_ToLower_Normal | 2723 ns | 2727 ns | 263,529 | 1.40 GiB/s |
| BM_ToLower_Clever_partial | 86.1 ns | 87.2 ns | 8,960,000 | 43.75 GiB/s |
| BM_ToLower_Clever | 1566 ns | 1538 ns | 497,778 | 2.48 GiB/s |
| BM_ToLower_SWAR_UB | 406 ns | 408 ns | 1,723,077 | 9.35 GiB/s |
| BM_ToLower_SWAR | 402 ns | 399 ns | 1,723,077 | 9.56 GiB/s |
| BM_ToLower_SSE2 | 226 ns | 225 ns | 2,986,667 | 16.96 GiB/s |
| BM_ToUpper_SWAR | 404 ns | 401 ns | 1,792,000 | 9.51 GiB/s |
| BM_ToUpper_SSE2 | 226 ns | 225 ns | 2,986,667 | 16.96 GiB/s |
| BM_ToLower_std | 13819 ns | 13811 ns | 49,778 | 0.28 GiB/s |
| BM_ToLower_ASCII_Compiler_Specific | 1876 ns | 1880 ns | 407,273 | 2.03 GiB/s |
Bonus — Generators#
Writing many if-else branches and switch-cases is very painful, so I wrote a Python script to generate them for me. I also included a random string generator in it.
from math import floor
from random import random
code = ""
for i in range(26):
code += f"if (c == '{chr(ord('A') + i)}') c = '{chr(ord('a') + i)}'; \n"
if i != 25:
code += "else "
print(code)
code = "switch(c) {\n"
for i in range(26):
code += f"\tcase '{chr(ord('A') + i)}': c = '{chr(ord('a') + i)}'; break;\n"
code += '}'
print(code)
code = "const string input = \""
for i in range(4096):
n = floor(random() * 52)
if n < 26:
code += chr(ord('A') + n)
else:
code += chr(ord('a') + n - 26)
code += "\";"
print(code)▸ stay subscribed
Liked this?
Drop your email and you'll get the next post when it's published. No tracking, one-click unsubscribe.