Writing a Web Server in assembly from scratch

How hard is it to write a web server in pure assembly, using nothing but raw OS system calls? Sounds cool. In this post, I'm going to go through it and explain each step.

May 26, 2026 · 17 min read · Switch Case

Recently, I saw an HN post in which someone had built a web server entirely in assembly. In the post description, they said their reason for doing it was "to give meaning to my life". That made me wonder: how hard is it, actually? I know that writing a full-fledged web server is a pretty hard task even in a high-level programming language, but a toy web server may not be that hard. Let’s give it a try.

Choosing an assembler

There are many assemblers out there, some more famous than others, such as NASM, FASM, YASM, and MASM. Among them, GAS (the GNU Assembler) is the one GCC uses to convert its generated assembly into machine code. It belongs to the GNU project, supports multiple syntaxes, has built-in macro support, and targets many CPU architectures.

I chose GAS simply because it is already installed alongside gcc.

Project setup

A good thing about assembly is that you don’t need much to start a project: a simple text editor such as Kate, VS Code, or Mousepad is sufficient. However, to keep things organized, I will explain my folder structure below.

Bin: This directory will contain object files (.o) and executable files.
Run.sh: A build script to automate compiling and running source files.
Other directories: To make it easier to track each step of the project, I will create a separate directory for each step.
Source files: Each step’s directory contains two files, Server.s and Functions.s.
Functions.s: This file acts as our lightweight standard library. It contains utility functions that are not exclusive to this project and can be shared with other projects.
Server.s: This file contains the main application logic and entry point for the web server.

You can also find the source code on the blog’s GitHub repository.

The Run.sh script

As I mentioned, this script assembles the source files using GAS, then passes the generated object files to ld (the linker) to build the final executable, and finally runs it.

#!/bin/bash
 
OBJS=()
 
for arg in "$@"; do
    name=$(basename "$arg")
    as -o "$PWD/Bin/$name.o" "$arg"
    OBJS+=("$PWD/Bin/$name.o")
done
 
name=$(basename "$1")
ld -o "$PWD/Bin/$name.e" "${OBJS[@]}"
"$PWD/Bin/$name.e"
 
echo "Exit Code: $?"

Snippet 1. Run.sh script: assembles, links, and executes the code.

Line 1: The shebang (#!/bin/bash). This tells the operating system which interpreter to use to execute the script.
Line 3: OBJS is a list that will store the generated object file names.
Line 5: Loops over all the command-line arguments passed to the script.
Line 6: Extracts the filename and stores it in the name variable.
Line 7: Assembles the source file and stores the generated object file in the Bin directory.
Line 8: Saves the object file path to the OBJS list.
Line 11: Takes the filename of the first argument as the name of the executable.
Line 12: Links all the object files and builds the final executable.
Line 13: Executes the executable!
Line 15: Prints the exit code of the executable.

Step 1 - Getting started: Just exit properly

There are a few ground rules that we should establish before we start coding.

Assembly Syntax: The two most popular assembly syntaxes are AT&T and Intel. The default syntax for GAS is AT&T. However, since the AT&T syntax can be a bit more complicated, I chose Intel syntax for this project.
Entry point: By default, the linker uses a global _start symbol as the program’s entry point, so each assembly application must define one. This is the low-level equivalent of the C++ main function.
System Calls: System calls are functions provided by the OS kernel. Each system call has a unique ID number and expects a specific set of parameters. I previously put together a Linux system calls reference here.
Calling convention: Under the Linux x86_64 calling convention (System V AMD64 ABI), the first six integer or pointer arguments of a function are passed in registers in a fixed order: rdi, rsi, rdx, rcx, r8, r9. The return value comes back in rax. You can read more details about it here (System V AMD64 ABI).

.intel_syntax noprefix
.global _start
_start:
    mov rdi, 0
    call Exit

Snippet 2. Minimum assembly code to run and exit properly.

Line 1: Switches the assembler syntax to Intel.
Lines 2, 3: Declare and define the application’s entry point.
Line 4: Moves our exit code (0) into the rdi register (the first argument slot).
Line 5: Calls our custom Exit function.

Let’s take a quick look at Functions.s.

.intel_syntax noprefix
.global Exit
 
Exit:
    mov rax, 60
    syscall

Snippet 3. A sample of Functions.s showing how the exit system call is wrapped with the global Exit function.

Line 2: Tells the assembler that the Exit function is global, making it accessible from other files (like Server.s).
Lines 4 to 6: Defines the implementation of the Exit function.
Line 5: Loads the rax register with 60, which is the Linux system call number for sys_exit.
Line 6: Executes the syscall instruction, which hands control over to the OS kernel.
Missing ret instruction: Because the exit system call immediately terminates the application process, control never returns to this function. It is the absolute last piece of code the application executes, so a ret instruction is unnecessary.
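The wrapper in Snippet 3 can be this short because the first three function-argument registers (rdi, rsi, rdx) are also the first three system call argument registers, so the wrapper only has to load the call number into rax. The following Python model of that mapping is an illustration, not code from the project: `syscall_registers` is a hypothetical helper, and the numbers are the x86-64 Linux ones.

```python
# Illustrative model only: which register carries what when a wrapper such as
# Exit executes `syscall` on x86-64 Linux. `syscall_registers` is a
# hypothetical helper, not part of the project.

# A few x86-64 Linux system call numbers (they differ on other architectures).
SYSCALL_NUMBERS = {
    "read": 0, "write": 1, "close": 3,
    "socket": 41, "accept": 43, "bind": 49, "listen": 50,
    "exit": 60,
}

# Argument registers for the `syscall` instruction, in order. Ordinary function
# calls use rcx for the fourth argument; system calls use r10 instead.
SYSCALL_ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def syscall_registers(name, *args):
    """Return the register contents a wrapper would set up before `syscall`."""
    if len(args) > len(SYSCALL_ARG_REGISTERS):
        raise ValueError("a system call takes at most six arguments")
    registers = {"rax": SYSCALL_NUMBERS[name]}
    registers.update(zip(SYSCALL_ARG_REGISTERS, args))
    return registers


print(syscall_registers("exit", 0))          # {'rax': 60, 'rdi': 0}
print(syscall_registers("socket", 2, 1, 0))  # rax = 41, then rdi, rsi, rdx
```

The only difference between the two conventions that matters in practice is the fourth argument (rcx versus r10), which none of the calls used in this post need.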

Step 2 - Opening a socket

Sockets are resources managed and provided by the operating system kernel. To open a new socket, we use the socket system call which takes three parameters:

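domain: Specifies the address family, which determines the communication protocol (IPv4, IPv6, etc.).
type: Defines how data is transmitted (stream, datagram, etc.).
protocol: Selects a specific protocol within the chosen family and type. For our project, this is always zero, which picks the default (TCP for an IPv4 stream socket).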
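For comparison, the same three parameters appear in Python's socket module, which is a thin layer over the same system call. This sketch only confirms that the numeric values used in the assembly (2 and 1) match the named constants on Linux; `open_tcp_socket` is a hypothetical helper name.

```python
import socket


def open_tcp_socket():
    """Open and return an IPv4 TCP socket, like socket(2, 1, 0) in assembly."""
    # domain = AF_INET, type = SOCK_STREAM, protocol = 0 (default for the type)
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)


# On Linux these constants have the numeric values used in the assembly.
print(int(socket.AF_INET), int(socket.SOCK_STREAM))  # 2 1

sock = open_tcp_socket()
print(sock.fileno() >= 0)  # True: success yields a non-negative descriptor
sock.close()
```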

.intel_syntax noprefix
.global _start
 
_start:
    # Allocate 4 bytes on the stack for the server socket
    sub rsp, 4
    
    mov rdi, 2 # socket domain = AF_INET (IPv4)
    mov rsi, 1 # socket type = SOCK_STREAM (TCP)
    mov rdx, 0 # socket protocol
    call Socket # call the socket syscall
    mov DWORD PTR [rsp], eax # error | socket is stored in eax
    
    cmp eax, 0 # check whether eax < 0
    jl  1f # jump to exit
 
    mov rdi, rax # prepare to close the socket stored in rax.
    call Close # call the close syscall
 
    xor rax, rax # rax = 0 # clear rax to close cleanly.
 
    1: mov rdi, rax # Exit with status code
    call Exit

Snippet 4. Opening a TCP socket using the socket system call.

The code is self-explanatory, except for two lines: line 6 and line 15.

To allocate a local variable in assembly, we reserve memory for it on the stack. The stack grows downward, and the rsp register points to its top, which is the lowest address currently in use. Therefore, to allocate memory, we simply subtract the required size from rsp.

There are two types of labels in assembly: global labels and local labels.

Global labels: Use an explicit name, such as _start or a function name, that can be referenced directly.
Local labels: Use a number as the identifier. The code references them by combining that number with the direction of the target: 1f means the nearest label 1 forward, and 1b the nearest one backward.

So, line 6 allocates 4 bytes (32 bits) of memory on the stack, and line 15 jumps forward to the nearest label 1 if eax is less than zero (negative), which is how a system call signals an error.
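Snippet 4 branches to label 1 when eax is negative and then passes rax to Exit. On Linux, a failing system call returns a negated errno value, and the parent process only sees the low 8 bits of the exit argument. The Python sketch below works through that arithmetic; both function names are hypothetical and exist only for this illustration.

```python
import errno
import os


def describe_syscall_result(result):
    """Interpret a raw Linux system call return value."""
    if result >= 0:
        return "ok"
    code = -result  # the kernel returns -errno on failure
    return f"{errno.errorcode.get(code, 'UNKNOWN')}: {os.strerror(code)}"


def exit_status_seen_by_shell(value):
    """Only the low 8 bits of the exit argument reach the parent process."""
    return value & 0xFF


print(describe_syscall_result(3))               # ok
print(describe_syscall_result(-errno.EACCES))   # starts with "EACCES"
print(exit_status_seen_by_shell(-13))           # 243
```

So when Run.sh prints a large exit code such as 243, subtracting it from 256 gives the errno value (13 in this case).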

Step 3 - Binding the socket to an address

The bind system call accepts three parameters:

sockfd: The socket file descriptor returned by the socket syscall.
sockaddr: A pointer to the sockaddr_in structure that defines which IP and port the socket should bind to.
addrlen: The size of the sockaddr_in structure, in bytes.

This is a bit tricky, since we need to allocate enough memory space for the sockaddr_in, fill it with IP and port information, and pass its pointer to the bind system call.
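To make the memory layout concrete, here is a small Python sketch that packs the same 16 bytes the assembly has to write by hand. It assumes the x86-64 Linux layout of sockaddr_in; `pack_sockaddr_in` is a hypothetical helper name.

```python
import socket
import struct


def pack_sockaddr_in(ip, port):
    """Build the 16-byte sockaddr_in layout that bind expects on x86-64 Linux."""
    return struct.pack(
        "<H2s4s8s",
        socket.AF_INET,            # sin_family: 2 bytes, host (little-endian) order
        struct.pack("!H", port),   # sin_port:   2 bytes, network (big-endian) order
        socket.inet_aton(ip),      # sin_addr:   4 bytes, network order
        b"\x00" * 8,               # sin_zero:   8 bytes of padding
    )


addr = pack_sockaddr_in("0.0.0.0", 1337)
print(len(addr))   # 16
print(addr.hex())  # 02000539000000000000000000000000
```

The four fields sit at offsets 0, 2, 4, and 8 within the structure. Only the port and the address are stored in network byte order; the family field stays in host order.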

.intel_syntax noprefix
.global _start
 
_start:
    # Allocate 4 bytes for the server socket [rsp]
    # Allocate 16 bytes for the sockaddr_in struct [rsp + 4]
    sub rsp, 20
    
    # Create a server socket.
 
    mov rdi, 2 # socket domain = AF_INET (IPv4)
    mov rsi, 1 # socket type = SOCK_STREAM (TCP)
    mov rdx, 0 # socket protocol
    call Socket # call the socket syscall
    mov DWORD PTR [rsp], eax # error | socket is stored in eax
    
    cmp eax, 0 # check whether eax < 0
    jl  1f # jump to exit
 
    # Bind the server socket to 0.0.0.0:1337
    mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
    mov ax, 1337 # Server port number
    xchg ah, al # little-endian to big-endian
    mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
    mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
    mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
 
    mov edi, DWORD PTR [rsp] 
    lea rsi, [rsp + 4]
    mov rdx, 16
    call Bind
 
    cmp eax, 0 # check if bind failed.
    jl 1f
 
    # Close the server socket 
    mov edi, [rsp] # prepare to close the socket stored in [rsp].
    call Close # call the close syscall
 
    xor rax, rax # rax = 0 # clear rax to close cleanly.
 
    1: mov rdi, rax # Exit with status code
    call Exit

Snippet 5. Binding the socket to the address 0.0.0.0:1337.

Line 7: Since the size of sockaddr_in is 16 bytes, I allocate 16 more bytes for it.
Line 23: Swaps the two bytes of the port number, because sin_port must be stored in network byte order (big-endian) while x86 is little-endian.
Line 29: lea (load effective address) is like mov, except that it stores the address computed from its second operand rather than the memory contents at that address. So, it loads rsp + 4 into the rsi register.
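The byte swap on line 23 is easy to check outside of assembly. This short Python sketch reproduces what `xchg ah, al` does to the 16-bit value in ax; `swap16` is a hypothetical name used only here.

```python
import socket


def swap16(value):
    """Swap the two bytes of a 16-bit value, like `xchg ah, al` does on ax."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


port = 1337                   # 0x0539
print(hex(swap16(port)))      # 0x3905

# On a little-endian machine such as x86-64, this is exactly what htons does.
print(socket.htons(port) == swap16(port))
```

Writing 0x3905 to memory as a little-endian word stores the bytes 05 39, which is the big-endian encoding of 1337 that the kernel expects.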

Step 4 - Listening on the port and accepting connections

This step is straightforward. The code will allocate three memory blocks: a 4-byte block for the client’s socketfd, another 4-byte block for the client’s addrlen, and a 16-byte block for the client’s sockaddr_in structure. Then it invokes the listen and accept system calls.

.intel_syntax noprefix
.global _start
 
_start:
    # Allocate 4 bytes for the server socket [rsp]
    # Allocate 16 bytes for the server sockaddr_in struct [rsp + 4]
    # Allocate 16 bytes for the client sockaddr_in struct [rsp + 20]
    # Allocate 4 bytes for the client addrlen [rsp + 36]
    # Allocate 4 bytes for the client socket [rsp + 40]
    sub rsp, 44
    
    # Create a server socket.
 
    mov rdi, 2 # socket domain = AF_INET (IPv4)
    mov rsi, 1 # socket type = SOCK_STREAM (TCP)
    mov rdx, 0 # socket protocol
    call Socket # call the socket syscall
    mov DWORD PTR [rsp], eax # error | socket is stored in eax
    
    cmp eax, 0 # check whether eax < 0
    jl  1f # jump to exit
 
    # Bind the server socket to 0.0.0.0:1337
    mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
    mov ax, 1337 # Server port number
    xchg ah, al # little-endian to big-endian
    mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
    mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
    mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
 
    mov edi, DWORD PTR [rsp] # sockfd = server socket
    lea rsi, [rsp + 4] # addr = server sockaddr
    mov rdx, 16 # addrlen = 16
    call Bind
 
    cmp eax, 0 # check if bind failed.
    jl 1f
 
    # Listen on the server socket
    mov edi, DWORD PTR [rsp] # sockfd = server socket
    mov esi, 0 # backlog = 0
    call Listen
 
    cmp eax, 0 # check if listen failed.
    jl 1f
 
    # Accept the first connection
    mov DWORD PTR [rsp + 36], 16 # addrlen must start as sizeof(sockaddr_in)
    mov edi, DWORD PTR [rsp]  # sockfd = server socket
    lea rsi, [rsp + 20] # addr = client addr
    lea rdx, [rsp + 36] # addrlen = client addrlen
    call Accept   
    mov DWORD PTR [rsp + 40], eax # error | socket is stored in eax
 
    cmp eax, 0 # check if accept failed.
    jl 1f
 
    # Close the client socket 
    mov edi, [rsp + 40] # prepare to close the socket stored in [rsp + 40].
    call Close # call the close syscall
 
    # Close the server socket 
    mov edi, [rsp] # prepare to close the socket stored in [rsp].
    call Close # call the close syscall
 
    xor rax, rax # rax = 0 # clear rax to close cleanly.
 
    1: mov rdi, rax # Exit with status code
    call Exit

Snippet 6. Listening on the socket and accepting connections.

accept returns a new file descriptor for each client connection. The kernel reclaims open descriptors when the process exits, but a server that keeps running must close each client socket explicitly, or it will leak resources.
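The relationship between the listening socket and the per-client socket can be seen in a few lines of Python. This is a self-contained sketch, not the project's code: it connects to itself over loopback and uses an OS-assigned port (port 0) rather than 1337, so it does not collide with a running server.

```python
import socket

# Listening socket on an ephemeral loopback port (port 0 lets the OS choose).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

# A client in the same process, just to have something to accept.
client = socket.create_connection(server.getsockname())

conn, peer = server.accept()   # conn is a new socket for this one client
distinct = conn.fileno() != server.fileno()
print(distinct)                # True: two separate descriptors

conn.close()                   # close the per-client descriptor first
client.close()
server.close()                 # then the listening socket
```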

Step 5 - Serving the First Request

Up to this point, we have opened a socket, bound it to the server address, and configured it to listen for and accept incoming connections. Our next objective is to send a response back to the client.

The HTTP response header format is documented on the MDN Web Docs. For this implementation, I’ve defined a minimal static HTTP response within the .data segment, which you can see in the code snippet below at line 90.

.intel_syntax noprefix
.global _start
 
.text
_start:
    # Allocate 4 bytes for the server socket [rsp]
    # Allocate 16 bytes for the server sockaddr_in struct [rsp + 4]
    # Allocate 16 bytes for the client sockaddr_in struct [rsp + 20]
    # Allocate 4 bytes for the client addrlen [rsp + 36]
    # Allocate 4 bytes for the client socket [rsp + 40]
    # Allocate 1024 bytes for the read buffer [rsp + 44]
    sub rsp, 1068
    
    # Create a server socket.
 
    mov rdi, 2 # socket domain = AF_INET (IPv4)
    mov rsi, 1 # socket type = SOCK_STREAM (TCP)
    mov rdx, 0 # socket protocol
    call Socket # call the socket syscall
    mov DWORD PTR [rsp], eax # error | socket is stored in eax
    
    cmp eax, 0 # check whether eax < 0
    jl  1f # jump to exit
 
    # Bind the server socket to 0.0.0.0:1337
    mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
    mov ax, 1337 # Server port number
    xchg ah, al # little-endian to big-endian
    mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
    mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
    mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
 
    mov edi, DWORD PTR [rsp] # sockfd = server socket
    lea rsi, [rsp + 4] # addr = server sockaddr
    mov rdx, 16 # addrlen = 16
    call Bind
 
    cmp eax, 0 # check if bind failed.
    jl 1f
 
    # Listen on the server socket
    mov edi, DWORD PTR [rsp] # sockfd = server socket
    mov esi, 0 # backlog = 0
    call Listen
 
    cmp eax, 0 # check if listen failed.
    jl 1f
    
    3:
        mov DWORD PTR [rsp + 36], 16 # accept the next connection; addrlen must start as sizeof(sockaddr_in)
        mov edi, DWORD PTR [rsp]  # sockfd = server socket
        lea rsi, [rsp + 20] # addr = client addr
        lea rdx, [rsp + 36] # addrlen = client addrlen
        call Accept   
        mov DWORD PTR [rsp + 40], eax # error | socket is stored in eax
 
        cmp eax, 0 # check if accept failed.
        jl 2f
        
        4: 
            # Read the client request
            mov edi, DWORD PTR [rsp + 40] # fd = client socket
            lea rsi, [rsp + 44]
            mov rdx, 1024        
            call Read
            cmp eax, 1024
            je  4b
        
        lea rdi, A_RESPONSE
        call StrLen # calculate length of A_RESPONSE
 
        mov edi, DWORD PTR [rsp + 40] # fd = client socket
        lea rsi, A_RESPONSE # buf = A_RESPONSE
        mov rdx, rax # count = length of A_RESPONSE
        call Write 
 
        # Close the client socket 
        mov edi, [rsp + 40] # prepare to close the socket stored in [rsp + 40].
        call Close # call the close syscall
        jmp 3b
    2: 
        # Close the server socket 
        mov edi, [rsp] # prepare to close the socket stored in [rsp].
        call Close # call the close syscall     
    1:
        mov rdi, rax # Exit with status code
        call Exit
 
.data
A_RESPONSE: .asciz "HTTP/1.1 200 OK\r\nServer: SwitchCase\r\nConnection: close\r\n\r\nThis is our first response."

Snippet 7. Sending the response back to the client.

Line 73: Moves the address of A_RESPONSE to the rsi register.
Line 90: Defines A_RESPONSE as a string constant. The .asciz directive tells the assembler that the string must be null-terminated.
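The structure of A_RESPONSE is easier to see when it is split apart. The Python sketch below rebuilds the same bytes, measures them the way a C-style strlen would (an assumption about how the project's StrLen behaves), and separates the headers from the body at the blank line.

```python
# The same bytes as A_RESPONSE; .asciz appends the trailing NUL byte.
A_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: SwitchCase\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"This is our first response."
    b"\x00"
)


def str_len(buf):
    """Count bytes up to, but not including, the first NUL (like a C strlen)."""
    return buf.index(b"\x00")


length = str_len(A_RESPONSE)
head, _, body = A_RESPONSE[:length].partition(b"\r\n\r\n")

print(length)                   # 85 bytes are written to the socket
print(len(body))                # 27 bytes of body
print(head.split(b"\r\n")[0])   # b'HTTP/1.1 200 OK'
```

There is no Content-Length header, so the client relies on Connection: close and treats the end of the connection as the end of the body.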

Steps 6, 7, 8

To keep this blog post easy to read, I won’t list the code for the remaining steps here; you can find it in the repository.

Step 6 - Reading the user’s request

There are two system calls commonly used to read from a socket: recvfrom and read. The read syscall is easier to use, but on a blocking socket it does not return until some data is available. Therefore, we loop over read until it returns fewer bytes than our buffer size, which signals that the request has been consumed. In the rare situation where the request size is an exact multiple of our buffer size, the extra read finds nothing left and blocks until the client sends more data or closes the connection.
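One common way around the exact-multiple case, which is not what the project does, is to stop reading at the blank line that ends the HTTP headers instead of waiting for a short read. The Python sketch below shows the idea on a connected socket pair; `read_request_head` and its `limit` parameter are hypothetical names, and request bodies (for example on POST) are deliberately ignored.

```python
import socket

HEADER_END = b"\r\n\r\n"


def read_request_head(conn, bufsize=1024, limit=65536):
    """Read until the blank line that ends the HTTP headers, or until EOF."""
    data = b""
    while HEADER_END not in data and len(data) < limit:
        chunk = conn.recv(bufsize)
        if not chunk:   # the peer closed the connection
            break
        data += chunk
    return data


# Demonstration with a connected socket pair instead of a real client.
client, server_side = socket.socketpair()

# A request padded to exactly 1024 bytes: the case that stalls a
# "loop until a short read" strategy.
start = b"GET / HTTP/1.1\r\nX-Pad: "
request = start + b"a" * (1024 - len(start) - len(HEADER_END)) + HEADER_END
client.sendall(request)

head = read_request_head(server_side)
print(len(head), head.endswith(HEADER_END))  # 1024 True

client.close()
server_side.close()
```

The `limit` guard keeps a misbehaving client from growing the buffer without bound, at the cost of truncating very large header blocks.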

I benchmarked the web server after this step. The results were surprising.

Server Software:        SwitchCase
Server Hostname:        localhost
Server Port:            1337

Document Path:          /index.html
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   0.548 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      850000 bytes
HTML transferred:       270000 bytes
Requests per second:    18253.50 [#/sec] (mean)
Time per request:       0.055 [ms] (mean)
Time per request:       0.055 [ms] (mean, across all concurrent requests)
Transfer rate:          1515.18 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       1
Waiting:        0    0   0.0      0       1
Total:          0    0   0.0      0       1

I was curious how fast a minimal Python alternative to this code would be.

import socket
 
s = socket.socket()
s.bind(("127.0.0.1", 1337))
s.listen(1)
 
while True:
    conn, addr = s.accept()
    conn.recv(1024)  # read the HTTP request, but do not parse it
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Server: SwitchCase\r\n"
        b"Connection: close\r\n"
        b"\r\n"
        b"This is our first response."
    )
    conn.close()

And the benchmark.

Server Software:        SwitchCase
Server Hostname:        localhost
Server Port:            1337

Document Path:          /index.html
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   0.570 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      850000 bytes
HTML transferred:       270000 bytes
Requests per second:    17534.11 [#/sec] (mean)
Time per request:       0.057 [ms] (mean)
Time per request:       0.057 [ms] (mean, across all concurrent requests)
Transfer rate:          1455.47 [Kbytes/sec] received

Step 7 - Serving Files

This step was very fun. I wrote a minimal request header parser to extract the file path, rewrite it to be relative to the current directory, and finally serve it.

There was nothing special; however, two things are worth mentioning:

Request Method: The code supports the GET and POST HTTP methods. To check the buffer for those methods, I used a SWAR (SIMD within a register) technique that compares the method string as a single 4-byte integer.
In-place Path Editing: I modify the file path directly inside the read buffer. As a result, the remainder of the HTTP header is truncated and discarded.
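Both ideas can be sketched in Python. This is a guess at the general shape, not the project's actual code: it assumes the compared words are "GET " (with the trailing space) and "POST", and that the path is made relative by overwriting the space before it with a dot. `method_of` and `path_in_place` are hypothetical names.

```python
# Assumption: the four bytes compared are "GET " (with the trailing space)
# and "POST"; the project's exact constants may differ.
GET_WORD = int.from_bytes(b"GET ", "little")    # 0x20544547
POST_WORD = int.from_bytes(b"POST", "little")   # 0x54534F50


def method_of(buf):
    """Classify a request by comparing its first 4 bytes as one integer."""
    word = int.from_bytes(buf[:4], "little")    # like `mov eax, DWORD PTR [buf]`
    if word == GET_WORD:
        return "GET"
    if word == POST_WORD:
        return "POST"
    return None


def path_in_place(buf):
    """Turn 'GET /x HTTP/1.1...' into the C string './x' inside buf itself."""
    start = buf.index(b" ")            # the space just before the path
    end = buf.index(b" ", start + 1)   # the space just after the path
    buf[start] = ord(".")              # "/x" becomes "./x"
    buf[end] = 0                       # NUL terminator; the rest is abandoned
    return bytes(buf[start:end])


request = bytearray(b"GET /index.html HTTP/1.1\r\nHost: localhost\r\n\r\n")
print(method_of(request))       # GET
print(hex(GET_WORD))            # 0x20544547
print(path_in_place(request))   # b'./index.html'
```

One integer comparison replaces a byte-by-byte string compare, and rewriting the buffer avoids allocating a second one for the path. A real server would also have to reject paths containing "..", which this sketch does not attempt.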

Step 8 - Concurrency

I implemented concurrency using the fork system call. That, in turn, led me to signal handling, which is needed to prevent zombie processes.
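As a rough Python illustration of the fork-per-connection idea and the zombie problem: one option on Linux is to ignore SIGCHLD, which makes the kernel reap finished children automatically. Whether the project does this or installs a handler that waits for children is not stated, so treat the choice below as an assumption; the helper names are hypothetical.

```python
import os
import signal

# Assumption: ignoring SIGCHLD is one way to avoid zombies on Linux; the
# project may instead install a handler that waits for its children.
previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)


def handle_in_child(work):
    """Fork, run `work` in the child, and return the child's pid to the parent."""
    pid = os.fork()
    if pid == 0:          # child process
        try:
            work()
        finally:
            os._exit(0)   # never fall back into the parent's code
    return pid            # parent process


def was_auto_reaped(pid):
    """Wait for the child; with SIGCHLD ignored it leaves nothing to collect."""
    try:
        os.waitpid(pid, 0)    # blocks until the child is gone
    except ChildProcessError:
        return True           # ECHILD: the kernel already reaped it
    return False


child = handle_in_child(lambda: None)
reaped = was_auto_reaped(child)
print(reaped)   # True on Linux

signal.signal(signal.SIGCHLD, previous)   # restore the earlier disposition
```

In a server, the parent would call something like `handle_in_child` right after accept, close its own copy of the client socket, and go straight back to accepting.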

Conclusion

Building a toy web server in assembly is surprisingly straightforward, requiring a mere 332 lines of code. The real challenge was implementing concurrency via fork. It demanded a deep dive into the system documentation to handle signals properly and ensure child processes were cleanly reaped, keeping the system free of zombie processes.
