From the dumbest case-conversion to SIMD — a benchmark journey

We start character-case conversion with the dumbest algorithm with 26 if-else branches and gradually improve the performance until we reach a SIMD case-conversion algorithm.

18 min readSwitch Case

Case conversion is a complex topic that looks simple. We’re going to explore its edges, write multiple versions, speak about their benefits and pitfalls, and finally benchmark them.

The code repository#

You can clone the repository from “https://github.com/switchcaselab/code”, which contains all the source code and build scripts you will need.

The repository currently only contains the “from-the-dumbest-case-conversion-to-simd-a-benchmark-journey” project, but I plan to add more projects gradually. I try to share dependencies across projects so it will be easier for you to run them.

C++ projects use CMake as a build system and VCPkg for downloading and compiling dependencies.

Project setup#

First, you should set up CMake. You can download the latest version from their website here or install it through your OS package manager.

Second, you should initialize Vcpkg. If you are using Windows, there is a bootstrap-vcpkg.bat script in the repository; if you are using Linux, there is a bootstrap-vcpkg.sh script. Running the appropriate script will initialize Vcpkg.

Third, you can now generate build scripts using CMake presets. Go to the project directory and run the following commands:

cmake --preset release
cmake --build --preset release

Fourth, there is a case-conversion executable in the cmake-build-release directory. You can run it to see the benchmark results.

The dumbest version#

Let’s define the problem first. Suppose we have a string of ASCII characters. It may contain printable characters, control characters, etc. Sometimes we need to convert the case of characters, but non-alphabet characters should stay as is. For example, the host part of website URLs is case-insensitive; therefore, browsers almost always convert the host part to lowercase before continuing processing. Our goal is to convert the case of each character. For example, ‘A’ should become ‘a’, ‘B’ becomes ‘b’ and so on.

This leads us to the simplest implementation, which uses a long chain of if-else branches.

inline void ToLower_Dumbest(string& s)
{
    for (char& c : s)
    {
        if (c == 'A') c = 'a';
        else if (c == 'B') c = 'b';
        else if (c == 'C') c = 'c';
        else if (c == 'D') c = 'd';
        else if (c == 'E') c = 'e';
        else if (c == 'F') c = 'f';
        else if (c == 'G') c = 'g';
        else if (c == 'H') c = 'h';
        else if (c == 'I') c = 'i';
        else if (c == 'J') c = 'j';
        else if (c == 'K') c = 'k';
        else if (c == 'L') c = 'l';
        else if (c == 'M') c = 'm';
        else if (c == 'N') c = 'n';
        else if (c == 'O') c = 'o';
        else if (c == 'P') c = 'p';
        else if (c == 'Q') c = 'q';
        else if (c == 'R') c = 'r';
        else if (c == 'S') c = 's';
        else if (c == 'T') c = 't';
        else if (c == 'U') c = 'u';
        else if (c == 'V') c = 'v';
        else if (c == 'W') c = 'w';
        else if (c == 'X') c = 'x';
        else if (c == 'Y') c = 'y';
        else if (c == 'Z') c = 'z';
    }
}
Snippet 1.Dumbest implementation of ToLower

In this version, the code loops over all characters and checks each against 26 uppercase characters. If it matches, the code replaces it with its lowercase counterpart.

The dumb version#

Since we previously established that a switch statement is typically faster than a long chain of if-else branches for enumerable variables, the natural question is: why not just use a switch statement?

Let’s implement the previous function with a switch.

inline void ToLower_Dumb(string& s)
{
    for (char& c : s)
    {
        switch (c)
        {
        case 'A': c = 'a';
            break;
        case 'B': c = 'b';
            break;
        case 'C': c = 'c';
            break;
        case 'D': c = 'd';
            break;
        case 'E': c = 'e';
            break;
        case 'F': c = 'f';
            break;
        case 'G': c = 'g';
            break;
        case 'H': c = 'h';
            break;
        case 'I': c = 'i';
            break;
        case 'J': c = 'j';
            break;
        case 'K': c = 'k';
            break;
        case 'L': c = 'l';
            break;
        case 'M': c = 'm';
            break;
        case 'N': c = 'n';
            break;
        case 'O': c = 'o';
            break;
        case 'P': c = 'p';
            break;
        case 'Q': c = 'q';
            break;
        case 'R': c = 'r';
            break;
        case 'S': c = 's';
            break;
        case 'T': c = 't';
            break;
        case 'U': c = 'u';
            break;
        case 'V': c = 'v';
            break;
        case 'W': c = 'w';
            break;
        case 'X': c = 'x';
            break;
        case 'Y': c = 'y';
            break;
        case 'Z': c = 'z';
            break;
        }
    }
}
Snippet 2.Dumb implementation of ToLower

This implementation is easy to understand; it replaces every character that matches an uppercase letter with its lowercase counterpart.

The normal version#

Since characters are enumerable and alphabet characters are in a sequence (see here), we can easily use logical operators to determine if a character is in the uppercase character range. Furthermore, we can use arithmetic operators to offset the character to its lowercase equivalent.

inline void ToLower_Normal(string& s)
{
    for (char& c : s)
    {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
}
Snippet 3.Normal implementation of ToLower

In this version, it checks if the character is in the range of A-Z, and then offsets the character using the difference between ‘a’ and ‘A’.

Note: (‘a’ - ‘A’) is 32(0x20).

The benchmark journey (Part One)#

Let’s set up a benchmarking system to benchmark our implementations.

To benchmark the code, I used the google/benchmark library. It’s very easy to use. It has three main parts: including the header, writing the benchmark function and using the benchmark main macro.

Google benchmark quick look#

  • Header: The main google benchmark header is benchmark/benchmark.h. It contains all the macros that we need.
  • BENCHMARK macro: It is used to register a function as a benchmark function.
  • BENCHMARK_MAIN macro: This macro implements the main function, automatically runs the benchmarks, and prints the outputs to stdout.

Benchmark function#

A benchmark function is a static function with a void return type. It takes a state argument that controls the benchmark loop. To differentiate between normal functions and benchmark functions, we use a BM_ prefix for the function names.

static void BM_ToLower_Dumbest(benchmark::State& state)
{
    for (auto _ : state)
    {
        std::string s = input;
        ToLower_Dumbest(s);
        benchmark::DoNotOptimize(s);
    }
    state.SetBytesProcessed(state.iterations() * static_cast<int64_t>(input.size()));
}
Snippet 4.Google benchmark example

BM_ToLower_Dumbest is a benchmark function.

To register the benchmark function, we use the BENCHMARK macro like in the code snippet below:

BENCHMARK(BM_ToLower_Dumbest);

The benchmark result#

The code for the benchmark is available in the repository, so I didn’t include it here. However, before showing the results, I need to define the input value that you can see in snippet 4.

We need a long enough string of characters to measure the actual time of the algorithm rather than the noises around it. Therefore, I chose a length of 4096(4K) characters for the input string. Now we need to generate the string. The easiest way is to use the string(size_t count, char ch) constructor, so let’s go with that.

const string input(4096, 'A');

The rest is easy, just run the benchmark and get the results.

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest1474 ns1413 ns497,7782.70 GiB/s
BM_ToLower_Dumb3543 ns3530 ns203,6361.08 GiB/s
BM_ToLower_Normal1163 ns1074 ns640,0003.55 GiB/s
Figure 1.Google benchmark example with 4096 'A' characters

Wait, what just happened? How is it possible that the switch statement is slower than the if-else branches?

The answer lies in the input. Because we chose A and fill the entire string with it, and our if-else chain starts by checking for A, every character is resolved in the very first if branch. A single, highly predictable if branch is extremely fast.

So, what if we choose Z instead of A?

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest46784 ns46039 ns14,9330.08 GiB/s
BM_ToLower_Dumb3087 ns2888 ns248,8891.32 GiB/s
BM_ToLower_Normal1258 ns1221 ns448,0003.13 GiB/s
Figure 2.Google benchmark example with 4096 'Z' characters

What a beautiful result. It shows us the difference between switch and if-else branches. The if-else chain’s speed drops dramatically, but switch statement’s speed stays stable across tests.

But this is also not a fair test. Let’s choose a random string, as shown in Snippet 5.

const string input(
    "mEHogujvUDQWkHfobSdEfZkmWbjFHsKFdBeafwKoydYSDicqhBGkOfPsyXnNjhTYaboTtQhmPFiEtvVHCGtjfNMBZgevjGVHwewekcFMmPmFamRcvvWrFmmHuRROCRGpFZsnCsVvNZbaQjAlYjbSfOIGplnczRUoyHNYcztRStyqSdvdGkjNEEXOReDATiQtSVIiTRVkddPksnNjebIlDaorSBjzJBnSYCiMdxMeIIPTKpSBsQNFouPDfUnRMupzeTudnOQPMqwWDYYqywDwtYnNCAccZhtwCzcxBraXdYlIggQYllBrJpnVjUfjphRpkTwUzIGvEsYYpBKodXsFBIcGXLKVxPVdDSbHJwfNrvxIXarjDvFPBXiqLQXavJGvNYAouhffkwpoCWFOgOtHzRLPZxoZIYbVbBiiKYjOQIrXtFfWLWovijPgVyJSqVkHYLtntNwoIzwFrngcyYOntoRTwaAFJkjgmchIGDVgYWJKmjhJvRSKLIHfiIpEZaELKLlNzoowuMgqYLLrRdzwCsblRzUgpONTZxqxUTTSQEbLFqJhetCaPdCVIrsntqaPvXknABSFGrMNGnKGKdlEmVtPyTxzzTfetBMGDxasFOZqzagrNQeWUhcvBEbGESqZlRFhexIlHLFdDwkTWenZXUEeZmyGxsylsZbPMPuzfEDpTwomlEyMZPHljQygdaDtPrhyxJhXdhzrlrfNVDsNjEyUhJCtcPYGKBXUIMhsDgRptNbLNNmeUJfZCAsytewffQEMPyAzLdYTiUgVaOPQVSkWqfzCmHEwvHUVojENxFpcAjrfnwujnoeEAIKUZNyNffXHIqNzwoEZgsXwHMbTuHwhYdrzkWHaHNpYVBCqIgCdjGRRmMDXpkEHUNImNIxktufKkGEfFuNHWTEbnWtHCEyBHFTJfyoRzYlJJrsuFDGMBDgFKzXMudQxUnNeeMZQOPZkNNxjQbOeFKFFQvUNPuGJcKBGjbHVuTkMtFPfaabAYelIYxZDZbCMjlqxLjGcGZyqoRtJkTQxakMqpKbHlxVBPMqQZsBUEfDoDxhzrkGAvGHXtmHOEbNHedFSUylGXbyJkYAYTlDUKohgKcukrNYdgDIdAxzDqBWOjxxdfiOmmOdSEZEtLfweKQMfDIHLqsFcLSWzSkYFrdVBDHFLqAONaAqynrWPfisHLCGzRkvrukYjsDggLmudRaditLqWNkYkavfeSoFQTYelxBEEnipoAjeBetWJPJzuhqJaCClQuNNtFHVPYqpnNJaBHJhoUMbluOzkICxImuImLtMNllylXUzZFNWjrSjMVvjtmIZpbYhZldsrXRfXVSjHZMhSRnWwqyqKTnIOBnDQdEmjSOZmkRowpOzGCwlZoHbTJGytcMQuZtaLjRbbiKLxjpSqKFeWlNeEmLDUKbFDnxPjaFZKVvFtbhcdrusitVBgiyYUcwIVaMRUKOAzlotKfeCqKkCPFdvXRrdJUxfXAVSalyovhVboNGEIUDntqwrflnzBUxlmzjxCfFMfZBbRpjpziCxvDEHTblyFkXukWrwBKcILesDuQXmpvJbVyuxsuBUflrkqbfdImzpZDQrJiTKmhxrqVLwpDPoeGHhtJhEmmDiWNfTpBJodQTyuwStLEADxxxRDQPbfiaAVcggItirKGGAycpVATpvttlNZTaCGTlbKDaEgQKKjwBJUJBPYlJcrkdpLwLhZskGGJRuUxIJIDWPTnIClfsXNCGtgGIIuwEiBeWcvpvhMLdOIRajtDSwTLKBDKgEbSqYVEpMqLKBnHTpvVYBMaKsmARnGUMEGoUZDlQRcEEGUiVqwFSwqFOzqTQGTgzkBxJUQuIbWpSXRjUfBTBlGwdxopTFEsnsMuSxDWnwxyjrqeRFGbtMzlIEqSBTffhPgcVvrWGqVvLjxEodmQwZLfHkOQaLrOXciVmWIiDYhSVUZbBQZMqVKSvbIvHoSzlrJSSPIrTmujXopGVefIcBitANJTdYQtNyinQQLRxJdyIZsCGANeJcUgdtpOZUQAsaHQPpBJhCvlYfQItCFgHYEOMmkQqMkwZiDxDBTrKqoMLGBSfkpkAZDMjwwbkephdtpPQbhUumbUlLxYrUfhcbVBFbcvBdsRgMvNeUMsnuZDRvbBWTNUuIUSBjtNCwITvicvrVuzKHLNtrEYzhPjgCRumvLfatTCXPhDiyMbzlRWFCqkmfmSgPetFXfnNPfZFlfMrqVPmanxJAyUZZUxvkUknTmDJJaqdcpwMVHttXRxCYiROMKcWmSMjIuHWbMuTJREiDjENknbUGapWhbcNRyhOhSycHOHaKTixGpkDXwbRNzHedVjxZKCifwzThfONXozCrNzyyLUtMSumWOqvMVGQrtzSwskYINnZVMJNeehbvGyaRhKHNfoViNJKfwtBCsdEjbQpdaXdMBlcHpPuBfGEOLzPYZuswhtwmEcjnbvCCGGJJAFVqBePNrWiJJfMzOdajdBXtNRGpCeIIzwUcSxbHgmbIoPeERcyOxZQQSZFWniEOdiKrJAFXsuwhRVGWNAlhMawGvSycsPucHOAvNIWXXYwPpsHoSETTwHquhxjoFraiebWPtxHnasRcuLTKKmLJxKjNWpzuIMSPcATjAxrMENCHkOpPXOovxGFaegUGEiWXZgTCaWexjTYxDEKOyzleiPfTHJkwkQcUQjzQtGrrrcsoeqJxpWgbsdQSjEapfGXHHYNzNJLsZmuODTdyoEMdMOANuxRxNAFbcpTHZvRUsRfTGCCBAEfhzWHBipPjtdhJBdgoMnqwmtOOWTkpuKBpsoQCRhbsFLYzqBoYzLsQpPLtxQtJyqDsXCjUNLTFmdQwTskAvekGXbaAUXIzTKgqBiixYSCBqZmhInHhxpsnJHqCDiubIphjEmpobwVtmwPGTqkzjnVeyTwAPbOgSUhZuiksprjizGoZSNLdHTwKqqFzplShpuSqZdmveQoQBgOmUzplTilrqaTpIlLQchHjRctWzjGeSnvoZnzpDaZWSKWzqLrjDakFtYcKdhuGGXcSLBrivDCpLFtvJbjknoBwUOAUNTJBBMZGspFjnBOjxQOGJtahcvKwWOVuzrFUkOXyVgwcVFiTPsXjrhzopdghQXtYNolYftuVvLgBQrgkCUKDrlANJgNDbLqAbxNFJapcXnUVVAcDTaMeKtMbnDonhfinOEegxrQUCsCsHALxTTSWxrRKuxOgpoSNcNduFNvsOgKsNWGLNIpKSbOtNuaGXGdpTOpsNIpqTUOaNvwmlioPNUTBLRYKyFQfprewKGHYOHhdFnNqgLRHeBgwctuGlAwDbUooHvESxrvDCmWbByNTpBNBJmQjwrsopNhYCHdXFWDTTTHMnmzYwLuGgbtmSMmAzMtLPokjzmNQlLezoIhDQHFyCwCeudWlhCPlIztlGRPXoLUweUeRKIxvwVPlXioZqvigsnFWesFUmqASYBiYiYLEjaWFoFvPRUEGnRsuSVMtzJCXcfjsniSoxUiiGjiXnjaGkUinCdYVylZaYKsUqjNofbhCwUQuhgDUOYXPkwTCZOfBlfUNLIVfCOMhtHAidpMPJdHrfawzHRTiiVEluZNZaZyPSyYEePSIaTAWKdnuoXoLFwtkVJrDPOpNftXTMoCfqUTbSVtPrNNQwhCIAPUNAIXAYsHyTDozKtqlTvUwBhAgeKoxcDbRYSGqrnIzFuiPNJuSQiXTVkDBvFTXaymJNLUWzVHwbKiiXOmJpGsaJzBvhnHWswIpmRjlxarqwRvdFLOXIBmSCQutEKLEfWGnHLVhgEWVRLybMypZCWhDVZTKwKzMDJYnifoxmWTMLTGMmbrHsZIBylWYgPMIFzgrYKXnVDetfrkhxADvTmhLfcJdsNwELACIAnorloPqcOCWhDwddRIfUwctoeBJeDRfRrxKAcREuiQDnkVMgtbOjvVIGvmpzxYfFAPTFRwqnMmRYAUOWwEURDgGowuUbhnipacqzYSyqfWAjnkXVDpNaDBoMHANXYnggTskXNIrhFNkFkQpaZgoGcqGWJdSOjQaUNndjgcHSKwOPnrSVoOhtkhFWKKpwDZlcWOMzoQVqWpld");
Snippet 5.Random input string

The final result.#

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest39463 ns38504 ns18,6670.10 GiB/s
BM_ToLower_Dumb18952 ns18485 ns44,8000.21 GiB/s
BM_ToLower_Normal2949 ns2783 ns235,7891.37 GiB/s
Figure 3.Benchmark with random string

Although the switch approach is 2x faster than the if-else chain, the throughput of both drops significantly because the CPU’s branch predictor cannot predict the random string.

Getting clever#

As you saw, branches are very time-consuming, so the question is: Is it possible to solve the problem without comparison?

Let’s take a look at the binary representation of alphabet characters.

Upper case characterBinaryLower case characterBinary
A01000001a01100001
B01000010b01100010
C01000011c01100011
Y01011001y01111001
Z01011010c01111010
Figure 4.Binary representation of alphabet characters

As you can see in Figure 4, the binary representation of uppercase and lowercase for each character is identical except the 5th bit, which is zero for uppercase and one for lowercase characters. So we can just toggle the 5th bit to convert an uppercase letter to lowercase and vice versa.

inline void ToLower_Clever(string& s)
{
    constexpr auto mask = 0x20;
    for (auto &c: s)
    {
        c |= mask;
    }
}
Snippet 6.Branchless ToLower Partial

There are no branches anymore; these types of algorithms are called Branchless or Branch-free since they don’t rely on conditional jumps!

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest36454 ns35296 ns19,4780.11 GiB/s
BM_ToLower_Dumb17058 ns17090 ns44,8000.22 GiB/s
BM_ToLower_Normal2723 ns2727 ns263,5291.40 GiB/s
BM_ToLower_Clever86.1 ns87.2 ns8,960,00043.75 GiB/s
Figure 5.Result of ToLower_Clever benchmark

That is unbelievable. Is it real, or is something off? Uh, I found it. There is a bug: because we didn’t limit the bitwise or operator to alphabet characters, it also modifies non-alphabet characters. For example, it converts [ to {. So, this version only works when you are absolutely sure your string contains only alphabet characters.

I’ve renamed the ToLower_Clever to ToLower_Clever_Partial and started digging for a branchless way to limit the formula to uppercase characters only. Finally, I found it.

inline void ToLower_Clever_Partial(string& s)
{
    constexpr auto mask = 0x20;
    for (auto &c: s)
    {
        c |= mask;
    }
}
 
inline void ToLower_Clever(string& s)
{
    constexpr auto GREATER_EQUAL_A = 0x80 - 'A';
    constexpr auto GREATER_THAN_Z = 0x80 - ('Z' + 1);
    constexpr auto HIGHEST_BIT_MASK = 0x80;
    for (auto &c: s)
    {
        const auto lowSevenBits = c & ~HIGHEST_BIT_MASK;
        const auto highestBit = c & HIGHEST_BIT_MASK;
        const auto ge_A = lowSevenBits + GREATER_EQUAL_A;
        const auto gt_Z = lowSevenBits + GREATER_THAN_Z;
        const auto isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
        c |= static_cast<char>(isUpper >> 2);
    }
}
Snippet 7.Branchless ToLower

Lines 1 to 8 show the previously renamed function. Let’s explain the algorithm.

Carry into bit 7 detects “is x ≥ K?”

Add (0x80 − K) to a 7-bit value. The top bit of the sum is the answer.

bit 76543210

Case A · x = ‘A’ (0x41), K = ‘A’ → expect x ≥ K

x01000001= 0x41 (‘A’)+T00111111= 0x3F (0x80 − ‘A’)=10000000= 0x80 → bit 7 SET ✓

x ≥ K

Case B · x = ’@’ (0x40), K = ‘A’ → expect x < K

bit 76543210x01000000= 0x40 (’@’)+T00111111= 0x3F (0x80 − ‘A’)=01111111= 0x7F → bit 7 NOT set

x < K

T = “threshold tag” = 0x80 − K. Strip bit 7 of x first; otherwise the add overflows ambiguously.

Figure 6.The 'carry into bit 7' range-detection trick

There is a very beautiful trick that I’ll explain first.

Carry into bit 7: The first thing we need is to differentiate A..Z characters mathematically. Since all standard ASCII printable characters are less than 0x80 (128), we can use bit 7 (the highest bit) as a mathematical flag. We can offset the character to flip bit 7 if the current character is between A and Z. So we can categorize characters into four groups:

  1. Characters greater than or equal to 0x80: These characters’ bit 7 is already one. We can safely filter them out.
  2. Characters less than A: If we offset these characters by (0x80 - 'A'), bit 7 remains zero.
  3. Characters greater than Z: If we offset these characters by (0x80 - ('Z' + 1)), bit 7 flips to one.
  4. Characters between A and Z: If we offset them by (0x80 - 'A'), bit 7 flips to one and if we offset them by (0x80 - ('Z' + 1)), bit 7 remains zero.

That’s it! The characters that meet the conditions of the fourth group are our target. See the ToLower_Clever again.

  • Line 17: This line clears bit 7 to prevent overflow during addition. We handle characters that originally had bit 7 set in the final step.
  • Line 18: This line remembers whether the original character had bit 7 set.
  • Line 19: Offsets the character by (0x80 - 'A').
  • Line 20: Offsets the character by (0x80 - ('Z' + 1)).
  • Line 21: Calculates if the character is in group four and isolates bit 7. It keeps characters that are greater than or equal to A and are not greater than Z and originally had bit 7 as zero. It then ANDs the result with 0x80 to clear the lower 7 bits. So isUpper can have two possible values: for characters between A and Z it is 0x80, otherwise it is zero.
  • Line 22: We need to flip the 5th bit of an uppercase character. By shifting isUpper by two, 0x80 becomes 0x20 and voila!

The uppercase characters are converted to their lowercase counterparts like a piece of cake.

Why the bitwise trick works#

Branches, especially mispredicted branches, are very costly for CPUs. On the other hand, bitwise operations are very cheap for CPUs; some CPUs can even execute multiple bitwise operations in a single CPU clock cycle!

A quick look at the benchmark#

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest36454 ns35296 ns19,4780.11 GiB/s
BM_ToLower_Dumb17058 ns17090 ns44,8000.22 GiB/s
BM_ToLower_Normal2723 ns2727 ns263,5291.40 GiB/s
BM_ToLower_Clever_Partial86.1 ns87.2 ns8,960,00043.75 GiB/s
BM_ToLower_Clever1566 ns1538 ns497,7782.48 GiB/s
Figure 7.Result of corrected ToLower_Clever benchmark

As you can see, the clever way is 2x faster than the normal algorithm. Also, the result of ToLower_Clever_Partial is insane; the compiler is doing some vectorization. You can see the compiled assembly code here.

The professional version — SWAR#

The algorithm of converting an uppercase character to lowercase is the same as ToLower_Clever. (Note: This assumes the string’s length is a multiple of eight characters (8 bytes)).

inline void ToLower_SWAR(string& s)
{
    constexpr uint64_t GREATER_EQUAL_A = 0x0101010101010101ULL * (0x80 - 'A');
    constexpr uint64_t GREATER_THAN_Z = 0x0101010101010101ULL * (0x80 - ('Z' + 1));
    constexpr uint64_t HIGHEST_BIT_MASK = 0x8080808080808080ULL;
 
    uint64_t index = 0;
    char* data = s.data();
    while (index < s.size())
    {
        uint64_t chunk;
        memcpy(&chunk, data + index, 8);
 
        const uint64_t lowSevenBits = chunk & ~HIGHEST_BIT_MASK;
        const uint64_t highestBit = chunk & HIGHEST_BIT_MASK;
 
        const uint64_t ge_A = lowSevenBits + GREATER_EQUAL_A;
        const uint64_t gt_Z = lowSevenBits + GREATER_THAN_Z;
        const uint64_t isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
 
        chunk |= (isUpper >> 2);
        memcpy(data + index, &chunk, 8);
        index += 8;
    }
}
Snippet 8.SWAR ToLower Implementation
ToLower SWAR implementation diagram
Figure 8.ToLower SWAR implementation diagram

The GOAT#

If the string’s length is a multiple of 16 bytes, we can utilize SSE SIMD instructions. SSE uses 16-byte wide registers.

For more information on SSE functions and also wider versions like AVX, AVX2, etc, see this.

inline void ToLower_SSE2(string& s)
{
    const auto SSE_GREATER_EQUAL_A = _mm_set1_epi8(0x80 - 'A');
    const auto SSE_GREATER_THAN_Z = _mm_set1_epi8(0x80 - ('Z' + 1));
    const auto SSE_HIGHEST_BIT_MASK = _mm_set1_epi8(static_cast<char>(0x80));
 
    uint64_t index = 0;
    char* data = s.data();
    while (index < s.size())
    {
        auto chunk = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + index));
        const auto lowSevenBits = _mm_andnot_si128(SSE_HIGHEST_BIT_MASK, chunk);
        const auto highestBit = _mm_and_si128(chunk, SSE_HIGHEST_BIT_MASK);
 
        const auto ge_A = _mm_add_epi8(lowSevenBits, SSE_GREATER_EQUAL_A);
        const auto gt_Z = _mm_add_epi8(lowSevenBits, SSE_GREATER_THAN_Z);
        const auto isUpper = _mm_and_si128(_mm_andnot_si128(gt_Z, ge_A),
                                           _mm_andnot_si128(highestBit, SSE_HIGHEST_BIT_MASK));
        chunk = _mm_or_si128(chunk, _mm_srli_epi16(isUpper, 2));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + index), chunk);
        index += 16;
    }
}
Snippet 9.SSE ToLower Implementation

Okay, let me go through it step by step.

  • Line 11: Loads 16 characters to the chunk variable. The u in _mm_loadu_si128 means the memory is unaligned.
  • Line 12: NOTs SSE_HIGHEST_BIT_MASK and ANDs it with chunk.
  • Line 13: ANDs chunk with SSE_HIGHEST_BIT_MASK.
  • Line 15: Adds packed 8-bit integers in lowSevenBits and SSE_GREATER_EQUAL_A.
Register A: [ 0x41 | 0x42 | 0x20 | ... 13 more bytes ]
Register B: [ 0x3F | 0x3F | 0x3F | ... 13 more bytes ]
            ------------------------------------------
Result:     [ 0x80 | 0x81 | 0x5F | ... 13 more bytes ]
  • Line 16: Adds packed 8-bit integers in lowSevenBits and SSE_GREATER_THAN_Z.
  • Line 17: Calculates ge_A & ~gt_Z & SSE_HIGHEST_BIT_MASK & ~highestBit.
  • Line 18: Stores the result to chunk.

The benchmark journey (Part Two)#

So, we implemented ToLower with SWAR and SIMD. Let’s see the results.

BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest36454 ns35296 ns19,4780.11 GiB/s
BM_ToLower_Dumb17058 ns17090 ns44,8000.22 GiB/s
BM_ToLower_Normal2723 ns2727 ns263,5291.40 GiB/s
BM_ToLower_Clever_partial86.1 ns87.2 ns8,960,00043.75 GiB/s
BM_ToLower_Clever1566 ns1538 ns497,7782.48 GiB/s
BM_ToLower_SWAR402 ns399 ns1,723,0779.56 GiB/s
BM_ToLower_SSE2226 ns225 ns2,986,66716.96 GiB/s
BM_ToUpper_SWAR404 ns401 ns1,792,0009.51 GiB/s
BM_ToUpper_SSE2226 ns225 ns2,986,66716.96 GiB/s
Figure 9.Benchmark all implementations

I also implemented two other functions using the C++ standard library and MSVC compiler-specific __ascii_tolower function.

inline void ToLower_std(string& s)
{
    for (char& c : s)
    {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
}
 
inline void ToLower_ASCII_Compiler_Specific(string& s)
{
    for (char& c : s)
    {
        c = static_cast<char>(__ascii_tolower(static_cast<unsigned char>(c)));
    }
 
}
BenchmarkTimeCPUIterationsThroughput
BM_ToLower_Dumbest36454 ns35296 ns19,4780.11 GiB/s
BM_ToLower_Dumb17058 ns17090 ns44,8000.22 GiB/s
BM_ToLower_Normal2723 ns2727 ns263,5291.40 GiB/s
BM_ToLower_Clever_partial86.1 ns87.2 ns8,960,00043.75 GiB/s
BM_ToLower_Clever1566 ns1538 ns497,7782.48 GiB/s
BM_ToLower_SWAR_UB406 ns408 ns1,723,0779.35 GiB/s
BM_ToLower_SWAR402 ns399 ns1,723,0779.56 GiB/s
BM_ToLower_SSE2226 ns225 ns2,986,66716.96 GiB/s
BM_ToUpper_SWAR404 ns401 ns1,792,0009.51 GiB/s
BM_ToUpper_SSE2226 ns225 ns2,986,66716.96 GiB/s
BM_ToLower_std13819 ns13811 ns49,7780.28 GiB/s
BM_ToLower_ASCII_Compiler_Specific1876 ns1880 ns407,2732.03 GiB/s
Figure 10.Benchmark all implementations + std and specific MSVC compiler __ascii_tolower

Bonus — Generators#

Writing many if-else branches and switch-cases is very painful, so I wrote a Python script to generate them for me. I also included a random string generator in it.

from math import floor
from random import random
 
code = ""
for i in range(26):
    code += f"if (c == '{chr(ord('A') + i)}') c = '{chr(ord('a') + i)}'; \n"
    if i != 25:
        code += "else "
print(code)
 
code = "switch(c) {\n"
for i in range(26):
    code += f"\tcase '{chr(ord('A') + i)}':  c = '{chr(ord('a') + i)}'; break;\n"
code += '}'
print(code)
 
code = "const string input = \""
for i in range(4096):
    n = floor(random() * 52)
    if n < 26:
        code += chr(ord('A') + n)
    else:
        code += chr(ord('a') + n - 26)
code += "\";"
print(code)

▸ stay subscribed

Liked this?

Drop your email and you'll get the next post when it's published. No tracking, one-click unsubscribe.