ascii SIMD benchmark journey lowercase uppercase

From the dumbest case-conversion to SIMD — a benchmark journey

We start character-case conversion with the dumbest algorithm with 26 if-else branches and gradually improve the performance until we reach a SIMD case-conversion algorithm.

May 17, 202619 min readSwitch Case

Case conversion is a complex topic that looks simple. We’re going to explore its edges, write multiple versions, speak about their benefits and pitfalls, and finally benchmark them.

The dumbest version#

Let’s define the problem first. Suppose we have a string of ASCII characters. It may contain printable characters, control characters, etc. Sometimes we need to convert the case of characters, but non-alphabet characters should stay as is. For example, the host part of website URLs is case-insensitive; therefore, browsers almost always convert the host part to lowercase before continuing processing. Our goal is to convert the case of each character. For example, ‘A’ should become ‘a’, ‘B’ becomes ‘b’ and so on.

This leads us to the simplest implementation, which uses a long chain of if-else branches.

inline void ToLower_Dumbest(string& s)
{
    for (char& c : s)
    {
        if (c == 'A') c = 'a';
        else if (c == 'B') c = 'b';
        else if (c == 'C') c = 'c';
        else if (c == 'D') c = 'd';
        else if (c == 'E') c = 'e';
        else if (c == 'F') c = 'f';
        else if (c == 'G') c = 'g';
        else if (c == 'H') c = 'h';
        else if (c == 'I') c = 'i';
        else if (c == 'J') c = 'j';
        else if (c == 'K') c = 'k';
        else if (c == 'L') c = 'l';
        else if (c == 'M') c = 'm';
        else if (c == 'N') c = 'n';
        else if (c == 'O') c = 'o';
        else if (c == 'P') c = 'p';
        else if (c == 'Q') c = 'q';
        else if (c == 'R') c = 'r';
        else if (c == 'S') c = 's';
        else if (c == 'T') c = 't';
        else if (c == 'U') c = 'u';
        else if (c == 'V') c = 'v';
        else if (c == 'W') c = 'w';
        else if (c == 'X') c = 'x';
        else if (c == 'Y') c = 'y';
        else if (c == 'Z') c = 'z';
    }
}

Snippet 1.Dumbest implementation of ToLower

In this version, the code loops over all characters and checks each against 26 uppercase characters. If it matches, the code replaces it with its lowercase counterpart.

The dumb version#

Since we previously established that a switch statement is typically faster than a long chain of if-else branches for integral or enumerable variables, the natural question is: why not just use a switch statement?

Let’s implement the previous function with a switch.

inline void ToLower_Dumb(string& s)
{
    for (char& c : s)
    {
        switch (c)
        {
        case 'A': c = 'a';
            break;
        case 'B': c = 'b';
            break;
        case 'C': c = 'c';
            break;
        case 'D': c = 'd';
            break;
        case 'E': c = 'e';
            break;
        case 'F': c = 'f';
            break;
        case 'G': c = 'g';
            break;
        case 'H': c = 'h';
            break;
        case 'I': c = 'i';
            break;
        case 'J': c = 'j';
            break;
        case 'K': c = 'k';
            break;
        case 'L': c = 'l';
            break;
        case 'M': c = 'm';
            break;
        case 'N': c = 'n';
            break;
        case 'O': c = 'o';
            break;
        case 'P': c = 'p';
            break;
        case 'Q': c = 'q';
            break;
        case 'R': c = 'r';
            break;
        case 'S': c = 's';
            break;
        case 'T': c = 't';
            break;
        case 'U': c = 'u';
            break;
        case 'V': c = 'v';
            break;
        case 'W': c = 'w';
            break;
        case 'X': c = 'x';
            break;
        case 'Y': c = 'y';
            break;
        case 'Z': c = 'z';
            break;
        }
    }
}

Snippet 2.Dumb implementation of ToLower

This implementation is easy to understand; it replaces every character that matches an uppercase letter with its lowercase counterpart.

The normal version#

Since characters are enumerable and alphabet characters are in a sequence (see here), we can easily use logical operators to determine if a character is in the uppercase character range. Furthermore, we can use arithmetic operators to offset the character to its lowercase equivalent.

inline void ToLower_Normal(string& s)
{
    for (char& c : s)
    {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
}

Snippet 3.Normal implementation of ToLower

In this version, it checks if the character is in the range of A-Z, and then offsets the character using the difference between ‘a’ and ‘A’.

Note: (‘a’ - ‘A’) is 32(0x20).

The benchmark journey begins#

Let’s set up a small benchmark harness for the implementations.

To benchmark the code, I used the google/benchmark library. It’s very easy to use. It has three main parts: including the header, writing the benchmark function and using the benchmark main macro.

Google benchmark quick look#

Header: The main google benchmark header is benchmark/benchmark.h. It contains all the macros that we need.
BENCHMARK macro: It is used to register a function as a benchmark function.
BENCHMARK_MAIN macro: This macro implements the main function, automatically runs the benchmarks, and prints the outputs to stdout.

Benchmark function#

A benchmark function is a static function with a void return type. It takes a state argument that controls the benchmark loop. To differentiate between normal functions and benchmark functions, we use a BM_ prefix for the function names.

static void BM_ToLower_Dumbest(benchmark::State& state)
{
    for (auto _ : state)
    {
        std::string s = input;
        ToLower_Dumbest(s);
        benchmark::DoNotOptimize(s);
    }
    state.SetBytesProcessed(state.iterations() * static_cast<int64_t>(input.size()));
}

Snippet 4.Google benchmark example

BM_ToLower_Dumbest is a benchmark function.

To register the benchmark function, we use the BENCHMARK macro like in the code snippet below:

BENCHMARK(BM_ToLower_Dumbest);

The benchmark result#

The code for the benchmark is available in the repository, so I didn’t include it here. However, before showing the results, I need to define the input value that you can see in snippet 4.

We need a string long enough to measure the actual time of the algorithm rather than the surrounding benchmark noise. I chose 4096 characters which is 4 KiB. Now we need to generate the string. The easiest way is to use the string(size_t count, char ch) constructor, so let’s go with that.

const string input(4096, 'A');

The rest is easy, just run the benchmark and get the results.

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	1474 ns	1413 ns	497,778	2.70 GiB/s
BM_ToLower_Dumb	3543 ns	3530 ns	203,636	1.08 GiB/s
BM_ToLower_Normal	1163 ns	1074 ns	640,000	3.55 GiB/s

Figure 1.Google benchmark example with 4096 'A' characters

Wait, what just happened? How is it possible that the switch statement is slower than the if-else branches?

The answer lies in the input. Because we chose A and fill the entire string with it, and our if-else chain starts by checking for A, every character is resolved in the very first if branch. A single, highly predictable if branch is extremely fast.

So, what if we choose Z instead of A?

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	46784 ns	46039 ns	14,933	0.08 GiB/s
BM_ToLower_Dumb	3087 ns	2888 ns	248,889	1.32 GiB/s
BM_ToLower_Normal	1258 ns	1221 ns	448,000	3.13 GiB/s

Figure 2.Google benchmark example with 4096 'Z' characters

What a beautiful result. It shows us the difference between switch and if-else branches. The if-else chain’s speed drops dramatically, but switch statement’s speed stays stable across tests.

But this is also not a fair test. Let’s choose a random string, as shown in Snippet 5.

const string input(
    "mEHogujvUDQWkHfobSdEfZkmWb...(4070 more)"); //the literal is abbreviated.

Snippet 5.Random input string

Random input result.#

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	39463 ns	38504 ns	18,667	0.10 GiB/s
BM_ToLower_Dumb	18952 ns	18485 ns	44,800	0.21 GiB/s
BM_ToLower_Normal	2949 ns	2783 ns	235,789	1.37 GiB/s

Figure 3.Benchmark with random string

Although the switch approach is 2x faster than the if-else chain, the throughput of both drops significantly because the CPU’s branch predictor cannot predict the random string.

The random input hurts every branch dependent version, including ToLower_Normal. That raises the next question: can we solve this without conditional jumps?

Getting clever#

Let’s take a look at the binary representation of alphabet characters.

Upper case character	Binary	Lower case character	Binary
A	01000001	a	01100001
B	01000010	b	01100010
C	01000011	c	01100011
…	…	…	…
Y	01011001	y	01111001
Z	01011010	z	01111010

Figure 4.Binary representation of alphabet characters

As you can see in Figure 4, the binary representation of uppercase and lowercase for each character is identical except the 5th bit, which is zero for uppercase and one for lowercase characters. So we can just toggle the 5th bit to convert an uppercase letter to lowercase and vice versa.

Why branchless can help#

Branches, especially mispredicted branches, are very costly for CPUs. On the other hand, bitwise operations are very cheap for CPUs; some CPUs can even execute multiple bitwise operations in a single CPU clock cycle!

The implementations#

inline void ToLower_Clever(string& s)
{
    constexpr auto mask = 0x20;
    for (auto &c: s)
    {
        c |= mask;
    }
}

Snippet 6.Branchless ToLower Partial

There are no branches anymore; these types of algorithms are called Branchless or Branch-free since they don’t rely on conditional jumps!

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	36454 ns	35296 ns	19,478	0.11 GiB/s
BM_ToLower_Dumb	17058 ns	17090 ns	44,800	0.22 GiB/s
BM_ToLower_Normal	2723 ns	2727 ns	263,529	1.40 GiB/s
BM_ToLower_Clever	86.1 ns	87.2 ns	8,960,000	43.75 GiB/s

Figure 5.Result of ToLower_Clever benchmark

That result looks unbelievable, and it is. The implementation has a bug: because the bitwise OR is not limited to alphabetic characters, it also modifies punctuation. For example, it converts [ to {. So, this version only works when you are absolutely sure your string contains only alphabet characters.

I’ve renamed the ToLower_Clever to ToLower_Clever_Partial and started digging for a branchless way to limit the formula to uppercase characters only. Finally, I found it. I’m going to write a complex looking code below, then I’ll explain it.

inline void ToLower_Clever_Partial(string& s)
{
    constexpr auto mask = 0x20;
    for (auto &c: s)
    {
        c |= mask;
    }
}
 
inline void ToLower_Clever(string& s)
{
    constexpr auto GREATER_EQUAL_A = 0x80 - 'A';
    constexpr auto GREATER_THAN_Z = 0x80 - ('Z' + 1);
    constexpr auto HIGHEST_BIT_MASK = 0x80;
    for (auto &c: s)
    {
        const auto lowSevenBits = c & ~HIGHEST_BIT_MASK;
        const auto highestBit = c & HIGHEST_BIT_MASK;
        const auto ge_A = lowSevenBits + GREATER_EQUAL_A;
        const auto gt_Z = lowSevenBits + GREATER_THAN_Z;
        const auto isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
        c |= static_cast<char>(isUpper >> 2);
    }
}

Snippet 7.Branchless ToLower

Play with the visualizer first; the explanation below walks through the same variables.

branchless ascii trace

Bitwise ToLower visualizer

Step through one byte at a time and watch the range mask become the lowercase bit.

C++ execution traceline 1/6

01maskconst auto lowSevenBits = c & ~HIGHEST_BIT_MASK;

02guardconst auto highestBit = c & HIGHEST_BIT_MASK;

03>= Aconst auto ge_A = lowSevenBits + GREATER_EQUAL_A;

04> Zconst auto gt_Z = lowSevenBits + GREATER_THAN_Z;

05rangeconst auto isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;

06writec |= static_cast<char>(isUpper >> 2);

input

output

lowSevenBits0x48 / 72

01001000

highestBit0x00 / 0

00000000

ge_A0x87 / 135

10000111

gt_Z0x6D / 109

01101101

isUpper0x80 / 128

10000000

final c0x68 / 104

01101000

current step/Clear bit 7 before the range checks.

step 1/72

Lines 1 to 8 show the previously renamed function. Let’s explain the algorithm.

There is a very beautiful trick that I’ll explain first.

Carry into bit 7: The first thing we need is to differentiate A..Z characters mathematically. Since all standard ASCII printable characters are less than 0x80 (128), we can use bit 7 (the highest bit) as a mathematical flag. We can offset the character to flip bit 7 if the current character is between A and Z. So we can categorize characters into four groups:

Characters greater than or equal to 0x80: These characters’ bit 7 is already one. We can safely filter them out.
Characters less than A: If we offset these characters by (0x80 - 'A'), bit 7 remains zero.
Characters greater than Z: If we offset these characters by (0x80 - ('Z' + 1)), bit 7 flips to one.
Characters between A and Z: If we offset them by (0x80 - 'A'), bit 7 flips to one and if we offset them by (0x80 - ('Z' + 1)), bit 7 remains zero.

That’s it! The characters that meet the conditions of the fourth group are our target. See the ToLower_Clever again.

Line 17: This line clears bit 7 to prevent overflow during addition. We handle characters that originally had bit 7 set in the final step.
Line 18: This line remembers whether the original character had bit 7 set.
Line 19: Offsets the character by (0x80 - 'A').
Line 20: Offsets the character by (0x80 - ('Z' + 1)).
Line 21: Calculates if the character is in group four and isolates bit 7. It keeps characters that are greater than or equal to A and are not greater than Z and originally had bit 7 as zero. It then ANDs the result with 0x80 to clear the lower 7 bits. So isUpper can have two possible values: for characters between A and Z it is 0x80, otherwise it is zero.
Line 22: We need to flip the 5th bit of an uppercase character. By shifting isUpper by two, 0x80 becomes 0x20 and voila!

Carry into bit 7 detects “is x ≥ K?”

Add (0x80 − K) to a 7-bit value. The top bit of the sum is the answer.

bit 76543210

Case A · x = ‘A’ (0x41), K = ‘A’ → expect x ≥ K

x01000001= 0x41 (‘A’)+T00111111= 0x3F (0x80 − ‘A’)=10000000= 0x80 → bit 7 SET ✓

x ≥ K

Case B · x = ’@’ (0x40), K = ‘A’ → expect x < K

bit 76543210x01000000= 0x40 (’@’)+T00111111= 0x3F (0x80 − ‘A’)=01111111= 0x7F → bit 7 NOT set

x < K

T = “threshold tag” = 0x80 − K. Strip bit 7 of x first; otherwise the add overflows ambiguously.

Figure 6.The 'carry into bit 7' range-detection trick

The uppercase characters are converted to their lowercase counterparts like a piece of cake.

A quick look at the benchmark#

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	36454 ns	35296 ns	19,478	0.11 GiB/s
BM_ToLower_Dumb	17058 ns	17090 ns	44,800	0.22 GiB/s
BM_ToLower_Normal	2723 ns	2727 ns	263,529	1.40 GiB/s
BM_ToLower_Clever_Partial	86.1 ns	87.2 ns	8,960,000	43.75 GiB/s
BM_ToLower_Clever	1566 ns	1538 ns	497,778	2.48 GiB/s

Figure 7.Result of corrected ToLower_Clever benchmark

As you can see, the clever way is 2x faster than the normal algorithm. Also, the result of ToLower_Clever_Partial is insane; the compiler is doing some vectorization. You can see the compiled assembly code here.

The professional version — SWAR#

The algorithm of converting an uppercase character to lowercase is the same as ToLower_Clever. (Note: This assumes the string’s length is a multiple of eight characters (8 bytes)).

inline void ToLower_SWAR(string& s)
{
    constexpr uint64_t GREATER_EQUAL_A = 0x0101010101010101ULL * (0x80 - 'A');
    constexpr uint64_t GREATER_THAN_Z = 0x0101010101010101ULL * (0x80 - ('Z' + 1));
    constexpr uint64_t HIGHEST_BIT_MASK = 0x8080808080808080ULL;
 
    uint64_t index = 0;
    char* data = s.data();
    while (index < s.size())
    {
        uint64_t chunk;
        memcpy(&chunk, data + index, 8);
 
        const uint64_t lowSevenBits = chunk & ~HIGHEST_BIT_MASK;
        const uint64_t highestBit = chunk & HIGHEST_BIT_MASK;
 
        const uint64_t ge_A = lowSevenBits + GREATER_EQUAL_A;
        const uint64_t gt_Z = lowSevenBits + GREATER_THAN_Z;
        const uint64_t isUpper = ge_A & ~gt_Z & HIGHEST_BIT_MASK & ~highestBit;
 
        chunk |= (isUpper >> 2);
        memcpy(data + index, &chunk, 8);
        index += 8;
    }
}

Snippet 8.SWAR ToLower Implementation

Figure 8.ToLower SWAR implementation diagram

Another quick look at the benchmark results#

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	36454 ns	35296 ns	19,478	0.11 GiB/s
BM_ToLower_Dumb	17058 ns	17090 ns	44,800	0.22 GiB/s
BM_ToLower_Normal	2723 ns	2727 ns	263,529	1.40 GiB/s
BM_ToLower_Clever_partial	86.1 ns	87.2 ns	8,960,000	43.75 GiB/s
BM_ToLower_Clever	1566 ns	1538 ns	497,778	2.48 GiB/s
BM_ToLower_SWAR	402 ns	399 ns	1,723,077	9.56 GiB/s

Figure 9.Another quick look at the benchmark results

By using bitwise operations to process multiple characters at once, SWAR is almost 4x faster than the corrected clever version. Modern CPUs usually have registers wider than 64 bits, from 128-bit SSE to 512-bit AVX-512, so we can apply the same idea to more bytes at once.

The GOAT#

If the string’s length is a multiple of 16 bytes, we can utilize SSE SIMD instructions. SSE uses 16-byte wide registers.

For more information on SSE functions and also wider versions like AVX, AVX2, etc, see this.

inline void ToLower_SSE2(string& s)
{
    const auto SSE_GREATER_EQUAL_A = _mm_set1_epi8(0x80 - 'A');
    const auto SSE_GREATER_THAN_Z = _mm_set1_epi8(0x80 - ('Z' + 1));
    const auto SSE_HIGHEST_BIT_MASK = _mm_set1_epi8(static_cast<char>(0x80));
 
    uint64_t index = 0;
    char* data = s.data();
    while (index < s.size())
    {
        auto chunk = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + index));
        const auto lowSevenBits = _mm_andnot_si128(SSE_HIGHEST_BIT_MASK, chunk);
        const auto highestBit = _mm_and_si128(chunk, SSE_HIGHEST_BIT_MASK);
 
        const auto ge_A = _mm_add_epi8(lowSevenBits, SSE_GREATER_EQUAL_A);
        const auto gt_Z = _mm_add_epi8(lowSevenBits, SSE_GREATER_THAN_Z);
        const auto isUpper = _mm_and_si128(_mm_andnot_si128(gt_Z, ge_A),
                                           _mm_andnot_si128(highestBit, SSE_HIGHEST_BIT_MASK));
        chunk = _mm_or_si128(chunk, _mm_srli_epi16(isUpper, 2));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + index), chunk);
        index += 16;
    }
}

Snippet 9.SSE ToLower Implementation

Okay, let me go through it step by step.

Line 11: Loads 16 characters to the chunk variable. The u in _mm_loadu_si128 means the memory is unaligned.
Line 12: NOTs SSE_HIGHEST_BIT_MASK and ANDs it with chunk.
Line 13: ANDs chunk with SSE_HIGHEST_BIT_MASK.
Line 15: Adds packed 8-bit integers in lowSevenBits and SSE_GREATER_EQUAL_A.

Register A: [ 0x41 | 0x42 | 0x20 | ... 13 more bytes ]
Register B: [ 0x3F | 0x3F | 0x3F | ... 13 more bytes ]
            ------------------------------------------
Result:     [ 0x80 | 0x81 | 0x5F | ... 13 more bytes ]

Line 16: Adds packed 8-bit integers in lowSevenBits and SSE_GREATER_THAN_Z.
Line 17: Calculates ge_A & ~gt_Z & SSE_HIGHEST_BIT_MASK & ~highestBit.
Line 18: Stores the result to chunk.

Yet another look at the benchmark results#

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	36454 ns	35296 ns	19,478	0.11 GiB/s
BM_ToLower_Dumb	17058 ns	17090 ns	44,800	0.22 GiB/s
BM_ToLower_Normal	2723 ns	2727 ns	263,529	1.40 GiB/s
BM_ToLower_Clever_partial	86.1 ns	87.2 ns	8,960,000	43.75 GiB/s
BM_ToLower_Clever	1566 ns	1538 ns	497,778	2.48 GiB/s
BM_ToLower_SWAR	402 ns	399 ns	1,723,077	9.56 GiB/s
BM_ToLower_SSE2	226 ns	225 ns	2,986,667	16.96 GiB/s

Figure 10.Benchmark all implementations

So, we implemented ToLower with SWAR and SIMD, and saw the significant improvement in performance.

Library and compiler-specific baselines#

I also implemented two other functions using the C++ standard library and MSVC compiler-specific __ascii_tolower function, and included the BM_ToUpper_SSE2 and BM_ToUpper_SWAR in the code and in the benchmarks table.

inline void ToLower_std(string& s)
{
    for (char& c : s)
    {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
}
 
inline void ToLower_ASCII_Compiler_Specific(string& s)
{
    for (char& c : s)
    {
        c = static_cast<char>(__ascii_tolower(static_cast<unsigned char>(c)));
    }
 
}

Benchmark	Time	CPU	Iterations	Throughput
BM_ToLower_Dumbest	36454 ns	35296 ns	19,478	0.11 GiB/s
BM_ToLower_Dumb	17058 ns	17090 ns	44,800	0.22 GiB/s
BM_ToLower_Normal	2723 ns	2727 ns	263,529	1.40 GiB/s
BM_ToLower_Clever_partial	86.1 ns	87.2 ns	8,960,000	43.75 GiB/s
BM_ToLower_Clever	1566 ns	1538 ns	497,778	2.48 GiB/s
BM_ToLower_SWAR_UB	406 ns	408 ns	1,723,077	9.35 GiB/s
BM_ToLower_SWAR	402 ns	399 ns	1,723,077	9.56 GiB/s
BM_ToLower_SSE2	226 ns	225 ns	2,986,667	16.96 GiB/s
BM_ToUpper_SWAR	404 ns	401 ns	1,792,000	9.51 GiB/s
BM_ToUpper_SSE2	226 ns	225 ns	2,986,667	16.96 GiB/s
BM_ToLower_std	13819 ns	13811 ns	49,778	0.28 GiB/s
BM_ToLower_ASCII_Compiler_Specific	1876 ns	1880 ns	407,273	2.03 GiB/s

Figure 11.Benchmark all implementations + std and specific MSVC compiler __ascii_tolower

So, we implemented ToLower with SWAR and SIMD, and saw the significant improvement in performance even in comparison with std library and compiler-specifics. So, let’s conclude the journey.

Conclusion#

Case conversion throughput chart ordered by throughput — Figure 12.Throughput ordered high to low

The main lesson is not that everyone should hand-write SSE2 for case conversion. Most code should use the boring, correct library path. But when the input domain is narrow, the constraints are explicit, and the hot path matters, a small amount of byte-level reasoning can beat both branch-heavy code and some compiler-specific helpers.

Bonus: generators#

Writing many if-else branches and switch-cases is very painful, so I wrote a Python script to generate them for me. I also included a random string generator in it.

from math import floor
from random import random
 
code = ""
for i in range(26):
    code += f"if (c == '{chr(ord('A') + i)}') c = '{chr(ord('a') + i)}'; \n"
    if i != 25:
        code += "else "
print(code)
 
code = "switch(c) {\n"
for i in range(26):
    code += f"\tcase '{chr(ord('A') + i)}':  c = '{chr(ord('a') + i)}'; break;\n"
code += '}'
print(code)
 
code = "const string input = \""
for i in range(4096):
    n = floor(random() * 52)
    if n < 26:
        code += chr(ord('A') + n)
    else:
        code += chr(ord('a') + n - 26)
code += "\";"
print(code)

Appendixes#

The code repository#

You can clone the repository from “https://github.com/switchcaselab/code”, which contains all the source code and build scripts you will need.

The repository currently only contains the “from-the-dumbest-case-conversion-to-simd-a-benchmark-journey” project, but I plan to add more projects gradually. I try to share dependencies across projects so it will be easier for you to run them.

C++ projects use CMake as a build system and VCPkg for downloading and compiling dependencies.

Project setup#

First, you should set up CMake. You can download the latest version from their website here or install it through your OS package manager.

Second, you should initialize Vcpkg. If you are using Windows, there is a bootstrap-vcpkg.bat script in the repository; if you are using Linux, there is a bootstrap-vcpkg.sh script. Running the appropriate script will initialize Vcpkg.

Third, you can now generate build scripts using CMake presets. Go to the project directory and run the following commands:

cmake --preset release
cmake --build --preset release

Fourth, there is a case-conversion executable in the cmake-build-release directory. You can run it to see the benchmark results.

▸ stay subscribed

Liked this?

Drop your email and you'll get the next post when it's published. No tracking, one-click unsubscribe.