Writing a Web Server in assembly from scratch
How hard is it to write a web server in pure assembly, using nothing but raw OS system calls? Sounds cool. In this post, I'm going to go through it and explain each step.
Recently, I saw an HN post in which someone had written a web server entirely in assembly. In the post description, they mentioned that their reason for doing this was to give meaning to their life. That made me wonder: how hard is it, actually? I know that writing a full-fledged web server is a pretty hard task even in a high-level programming language. But a toy web server may not be that hard. Let’s give it a try.
Choosing an assembler#
There are many assemblers out there, some more famous than others, such as NASM, FASM, YASM, and MASM. Among them, GAS (the GNU Assembler) is the one that GCC uses to convert its generated assembly into machine code. It belongs to the GNU project, supports both AT&T and Intel syntax, has built-in macro support, and targets many CPU architectures.
I chose to use GAS simply because it already comes bundled with gcc.
Project setup#
A good thing about assembly is that you don’t need much to start a project: a simple text editor like Kate, VS Code, or Mousepad is sufficient. However, to keep things organized, I will explain my folder structure below.
- Bin: This directory will contain object files (.o) and executable files.
- Run.sh: A build script to automate compiling and running source files.
- Other directories: To make it easier to track each step of the project, I will create a separate directory for each step.
- Source files: Each step’s directory contains two files, Server.s and Functions.s.
- Functions.s: This file acts as our lightweight standard library. It contains utility functions that are not exclusive to this project and can be shared with other projects.
- Server.s: This file contains the main application logic and entry point for the web server.
You can also find the source code on the blog’s GitHub repository.
The Run.sh script#
As I mentioned, this script compiles the source files using GAS, then passes the generated object files to ld (the linker) to build the final executable, and finally runs it.
#!/bin/bash

OBJS=()

for arg in "$@"; do
    name=$(basename "$arg")
    as -o "$PWD/Bin/$name.o" "$arg"
    OBJS+=("$PWD/Bin/$name.o")
done

name=$(basename "$1")
ld -o "$PWD/Bin/$name.e" "${OBJS[@]}"
"$PWD/Bin/$name.e"
echo "Exit Code: $?"
- Line 1: The shebang (#!/bin/bash). This tells the operating system which interpreter to use to execute the script.
- Line 3: OBJS is an array that will store the generated object file names.
- Line 5: Loops over all the command-line arguments passed to the script.
- Line 6: Extracts the filename and stores it in the name variable.
- Line 7: Assembles the source file and stores the generated object file in the Bin directory.
- Line 8: Appends the object file path to the OBJS array.
- Line 12: Links all the object files and builds the final executable, named after the first argument.
- Line 13: Runs the executable.
- Line 14: Prints the exit code of the executable.
Step 1 - Getting started: Just exit properly#
There are a few ground rules that we should establish before we start coding.
- Assembly Syntax: The two most popular assembly syntaxes are AT&T and Intel. The default syntax for GAS is AT&T. However, since the AT&T syntax can be a bit more complicated, I chose Intel syntax for this project.
- Entry point: Each assembly application must define a global _start symbol as its entry point. This is the low-level equivalent of the C++ main function.
- System Calls: System calls are functions provided by the OS kernel. Each system call has a unique ID number and expects a specific set of parameters. I previously put together a Linux system calls reference here.
- Calling convention: Under the Linux x86_64 calling convention (System V AMD64 ABI), function arguments are passed via registers in a specific order: rdi, rsi, rdx, etc. The syscall instruction takes its arguments in almost the same registers (only the fourth differs: r10 instead of rcx), so a wrapper like Exit can pass its arguments through untouched. You can read more details about it here (System V AMD64 ABI).
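To make the idea of a numbered system call concrete, here is a small Python sketch (Python is used only for illustration; the server itself stays in assembly). It invokes getpid by its raw number through libc's generic syscall() wrapper. The number 39 is the x86-64 Linux value, raw_getpid is a name made up for this example, and on any other platform the helper simply returns None.

```python
import ctypes
import os
import platform
import sys

SYS_GETPID = 39  # x86-64 Linux syscall number for getpid (sys_exit is 60)


def raw_getpid():
    """Call getpid by number via libc's syscall(); returns None off x86-64 Linux."""
    if sys.platform != "linux" or platform.machine() != "x86_64":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # The first argument is the syscall number (what the assembly puts in rax).
    return libc.syscall(ctypes.c_long(SYS_GETPID))


if __name__ == "__main__":
    print(raw_getpid(), os.getpid())
```

On x86-64 Linux both printed values should be the same process ID, which shows that the number alone is enough to select the kernel function.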
.intel_syntax noprefix
.global _start
_start:
mov rdi, 0
call Exit
- Line 1: Switches the assembler syntax to Intel.
- Lines 2, 3: Declares and defines the application’s entry point.
- Line 4: Moves our exit code (0) into the rdi register (the first argument slot).
- Line 5: Calls our custom Exit function.
Let’s take a quick look at Functions.s.
.intel_syntax noprefix
.global Exit

Exit:
mov rax, 60
syscall
- Line 2: Tells the assembler that the Exit function is global, making it accessible from other files (like Server.s).
- Lines 4 to 6: Define the implementation of the Exit function.
- Line 5: Loads the rax register with 60, which is the Linux system call number for sys_exit.
- Line 6: Executes the syscall instruction to hand control over to the OS kernel.
- Missing ret instruction: Because the exit system call immediately terminates the application process, control never returns to this function. It is the absolute last piece of code the application executes, so a ret instruction is unnecessary.
Step 2 - Opening a socket#
Sockets are resources managed and provided by the operating system kernel. To open a new socket, we use the socket system call which takes three parameters:
- domain: Specifies the address family used for communication (IPv4, IPv6, etc.).
- type: Defines how data is transmitted (Stream, Datagram, etc.).
- protocol: Selects a specific protocol within the chosen domain and type. For our project, this will always be set to zero, which picks the default (TCP for an IPv4 stream socket).
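For comparison, the same three parameters can be sketched in Python, whose socket module is a thin layer over this system call. The numeric values 2 and 1 are the ones used in the assembly listing; the constant and function names below are only chosen for this example.

```python
import socket

AF_INET = 2           # domain: IPv4
SOCK_STREAM = 1       # type: stream (TCP)
DEFAULT_PROTOCOL = 0  # protocol: let the kernel pick the default


def open_tcp_socket():
    """Equivalent of socket(2, 1, 0); raises OSError where the syscall would report an error."""
    return socket.socket(AF_INET, SOCK_STREAM, DEFAULT_PROTOCOL)


if __name__ == "__main__":
    s = open_tcp_socket()
    # The descriptor is a small non-negative integer, just like the value returned in eax.
    print(s.fileno() >= 0, s.family == socket.AF_INET, s.type == socket.SOCK_STREAM)
    s.close()
```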
.intel_syntax noprefix
.global _start
_start:
# Allocate 4 bytes on the stack for the server socket
sub rsp, 4
mov rdi, 2 # socket domain = AF_INET (IPv4)
mov rsi, 1 # socket type = SOCK_STREAM (TCP)
mov rdx, 0 # socket protocol
call Socket # call the socket syscall
mov DWORD PTR [rsp], eax # error | socket is stored in eax
cmp eax, 0 # check whether eax < 0
jl 1f # jump to exit
mov rdi, rax # prepare to close the socket stored in rax.
call Close # call the close syscall
xor rax, rax # rax = 0, so the process exits with status 0
1: mov rdi, rax # exit status = rax
call Exit
The code is self-explanatory, except for two lines: line 6 (sub rsp, 4) and line 15 (jl 1f).
To allocate a local variable in assembly, we need to reserve memory for it on the stack. The rsp register points to the top of the stack, and on x86-64 the stack grows toward lower addresses; therefore, to allocate memory, we can simply subtract the required size from rsp.
There are two types of labels in assembly: local labels and global labels.
- Global labels: Use an explicit identifier that can be referenced directly, such as _start or a function name.
- Local labels: Use a number as an identifier. The code references them by combining that number with the direction of the target: f for forward or b for backward (1f, 1b).
So, Line 6 allocates 4 bytes (32-bit) of memory on the stack, and Line 15 tells the assembler to jump forward to the nearest label 1 if eax is less than zero (negative).
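One practical consequence of the error path above: a raw Linux system call reports failure by returning a negative errno value, and the listing passes that value straight to Exit. The kernel keeps only the low 8 bits of an exit status, so the number printed by Run.sh is 256 minus the errno. Here is a small Python sketch for decoding it, assuming the Socket wrapper in Functions.s returns the raw syscall result unchanged. The helper names are hypothetical.

```python
import errno


def encode_exit_status(raw_return):
    """What the shell reports when a negative syscall result is used as the exit code."""
    return raw_return & 0xFF


def decode_exit_status(status):
    """Map an exit status produced that way back to the errno name."""
    if status == 0:
        return "success"
    err = 256 - status
    return errno.errorcode.get(err, "unknown errno %d" % err)


if __name__ == "__main__":
    status = encode_exit_status(-errno.EACCES)
    print(status, decode_exit_status(status))
```

On Linux, where EACCES is 13, this prints 243 EACCES. So an "Exit Code: 243" from the build script would point to a permission error rather than to a crash.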
Step 3 - Binding the socket to an address#
The bind system call accepts three parameters:
- sockfd: The socket file descriptor returned from the socket syscall.
- sockaddr: A pointer to the sockaddr_in structure that defines which IP and port the socket should bind to.
- addrlen: The size of the sockaddr_in structure.
This is a bit tricky, since we need to allocate enough memory for the sockaddr_in structure, fill it with the IP and port information, and pass a pointer to it to the bind system call.
.intel_syntax noprefix
.global _start
_start:
# Allocate 4 bytes for the server socket [rsp]
# Allocate 16 bytes for the sockaddr_in struct [rsp + 4]
sub rsp, 20
# Create a server socket.
mov rdi, 2 # socket domain = AF_INET (IPv4)
mov rsi, 1 # socket type = SOCK_STREAM (TCP)
mov rdx, 0 # socket protocol
call Socket # call the socket syscall
mov DWORD PTR [rsp], eax # error | socket is stored in eax
cmp eax, 0 # check whether eax < 0
jl 1f # jump to exit
# Bind the server socket to 0.0.0.0:1337
mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
mov ax, 1337 # Server port number
xchg ah, al # little-endian to big-endian
mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
mov edi, DWORD PTR [rsp]
lea rsi, [rsp + 4]
mov rdx, 16
call Bind
cmp eax, 0 # check if bind failed.
jl 1f
# Close the server socket
mov edi, [rsp] # fd = server socket stored at [rsp]
call Close # call the close syscall
xor rax, rax # rax = 0, so the process exits with status 0
1: mov rdi, rax # exit status = rax
call Exit
- Line 7 (sub rsp, 20): Since the size of sockaddr_in is 16 bytes, I allocate 16 more bytes for it.
- Line 23 (xchg ah, al): Swaps the byte order of the port number, because sin_port must be stored in network (big-endian) byte order.
- Line 29 (lea rsi, [rsp + 4]): lea is like mov, except that it loads the address of its second operand instead of the value stored there. So it loads rsp + 4 into the rsi register.
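The 16-byte layout that the listing builds at [rsp + 4] can be checked with a short Python sketch. It packs the same four fields with the struct module and reproduces the xchg ah, al byte swap in plain arithmetic. The function names build_sockaddr_in and swap16 are made up for this illustration.

```python
import struct


def swap16(value):
    """Swap the two bytes of a 16-bit value, like xchg ah, al."""
    return ((value & 0xFF) << 8) | (value >> 8)


def build_sockaddr_in(port, addr=0):
    """Pack sockaddr_in the way the listing lays it out on the stack."""
    sin_family = struct.pack("<H", 2)     # AF_INET, stored in host (little-endian) order
    sin_port = struct.pack(">H", port)    # port in network (big-endian) order
    sin_addr = struct.pack(">I", addr)    # 0 means 0.0.0.0, i.e. all interfaces
    sin_zero = bytes(8)                   # padding up to 16 bytes
    return sin_family + sin_port + sin_addr + sin_zero


if __name__ == "__main__":
    raw = build_sockaddr_in(1337)
    print(len(raw), raw.hex())
    # Storing the swapped value little-endian yields the same two bytes.
    print(struct.pack("<H", swap16(1337)) == struct.pack(">H", 1337))
```

1337 is 0x0539, so the port bytes in memory are 05 39, and the whole structure is 02 00 05 39 followed by twelve zero bytes.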
Step 4 - Listening on the port and accepting connections#
This step is straightforward. The code will allocate three memory blocks: a 4-byte block for the client’s socketfd, another 4-byte block for the client’s addrlen, and a 16-byte block for the client’s sockaddr_in structure. Then it invokes the listen and accept system calls.
.intel_syntax noprefix
.global _start
_start:
# Allocate 4 bytes for the server socket [rsp]
# Allocate 16 bytes for the server sockaddr_in struct [rsp + 4]
# Allocate 16 bytes for the client sockaddr_in struct [rsp + 20]
# Allocate 4 bytes for the client addrlen [rsp + 36]
# Allocate 4 bytes for the client socket [rsp + 40]
sub rsp, 44
# Create a server socket.
mov rdi, 2 # socket domain = AF_INET (IPv4)
mov rsi, 1 # socket type = SOCK_STREAM (TCP)
mov rdx, 0 # socket protocol
call Socket # call the socket syscall
mov DWORD PTR [rsp], eax # error | socket is stored in eax
cmp eax, 0 # check whether eax < 0
jl 1f # jump to exit
# Bind the server socket to 0.0.0.0:1337
mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
mov ax, 1337 # Server port number
xchg ah, al # little-endian to big-endian
mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
mov edi, DWORD PTR [rsp] # sockfd = server socket
lea rsi, [rsp + 4] # addr = server sockaddr
mov rdx, 16 # addrlen = 16
call Bind
cmp eax, 0 # check if bind failed.
jl 1f
# Listen on the server socket
mov edi, DWORD PTR [rsp] # sockfd = server socket
mov esi, 0 # backlog = 0
call Listen
cmp eax, 0 # check if listen failed.
jl 1f
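# Accept the first connection
mov DWORD PTR [rsp + 36], 16 # client addrlen = 16 (accept expects it to be initialized)
mov edi, DWORD PTR [rsp] # sockfd = server socket
lea rsi, [rsp + 20] # addr = client addr
lea rdx, [rsp + 36] # addrlen = pointer to client addrlen
call Accept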
mov DWORD PTR [rsp + 40], eax # error | socket is stored in eax
cmp eax, 0 # check if accept failed.
jl 1f
# Close the client socket
mov edi, [rsp + 40] # fd = client socket stored at [rsp + 40]
call Close # call the close syscall
# Close the server socket
mov edi, [rsp] # fd = server socket stored at [rsp]
call Close # call the close syscall
xor rax, rax # rax = 0, so the process exits with status 0
1: mov rdi, rax # exit status = rax
call Exit
Since accept returns a new file descriptor for the client socket, it should be closed explicitly once we are done with it. The kernel does release all descriptors when the process exits, but as soon as the server handles more than one connection, every descriptor left open is a leak.
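The listen/accept sequence can be tried out in a few lines of Python. This is only a sketch of the same system calls, not a translation of the listing: it binds to port 0 so the kernel picks a free port (the post uses 1337), and it connects to itself so that accept has something to return. The function name accept_one is hypothetical.

```python
import socket


def accept_one():
    """Listen on a free local port, connect to it, and accept that one connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # assumption: port 0 instead of the post's 1337
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, peer = server.accept()   # conn is a new descriptor, separate from server

    result = (conn.fileno() != server.fileno(), peer == client.getsockname())

    # Both the accepted socket and the listening socket are closed explicitly.
    conn.close()
    client.close()
    server.close()
    return result


if __name__ == "__main__":
    print(accept_one())
```

The first value confirms that accept hands back a descriptor of its own, and the second that the address it fills in is the client's address.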
Step 5 - Serving the First Request#
Up to this point, we have opened a socket, bound it to the server address, and configured it to listen for and accept incoming connections. Our next objective is to send a response back to the client.
The HTTP response format is documented on the MDN Web Docs. For this implementation, I’ve defined a minimal static HTTP response within the .data section, which you can see at the end of the code snippet below (line 79).
.intel_syntax noprefix
.global _start
.text
_start:
# Allocate 4 bytes for the server socket [rsp]
# Allocate 16 bytes for the server sockaddr_in struct [rsp + 4]
# Allocate 16 bytes for the client sockaddr_in struct [rsp + 20]
# Allocate 4 bytes for the client addrlen [rsp + 36]
# Allocate 4 bytes for the client socket [rsp + 40]
# Allocate 1024 bytes for the read buffer [rsp + 44]
sub rsp, 1068
# Create a server socket.
mov rdi, 2 # socket domain = AF_INET (IPv4)
mov rsi, 1 # socket type = SOCK_STREAM (TCP)
mov rdx, 0 # socket protocol
call Socket # call the socket syscall
mov DWORD PTR [rsp], eax # error | socket is stored in eax
cmp eax, 0 # check whether eax < 0
jl 1f # jump to exit
# Bind the server socket to 0.0.0.0:1337
mov WORD PTR [rsp + 4], 2 # sin_family = AF_INET
mov ax, 1337 # Server port number
xchg ah, al # little-endian to big-endian
mov WORD PTR [rsp + 6], ax # sin_port = htons(1337)
mov DWORD PTR [rsp + 8], 0 # sin_addr = 0
mov QWORD PTR [rsp + 12], 0 # sin_zero = 0
mov edi, DWORD PTR [rsp] # sockfd = server socket
lea rsi, [rsp + 4] # addr = server sockaddr
mov rdx, 16 # addrlen = 16
call Bind
cmp eax, 0 # check if bind failed.
jl 1f
# Listen on the server socket
mov edi, DWORD PTR [rsp] # sockfd = server socket
mov esi, 0 # backlog = 0
call Listen
cmp eax, 0 # check if listen failed.
jl 1f
3:
# Accept the next connection
mov DWORD PTR [rsp + 36], 16 # client addrlen = 16 (accept expects it to be initialized)
mov edi, DWORD PTR [rsp] # sockfd = server socket
lea rsi, [rsp + 20] # addr = client addr
lea rdx, [rsp + 36] # addrlen = pointer to client addrlen
call Accept
mov DWORD PTR [rsp + 40], eax # error | socket is stored in eax
cmp eax, 0 # check if accept failed.
jl 2f
4:
# Read the client request
mov edi, DWORD PTR [rsp + 40] # fd = client socket
lea rsi, [rsp + 44]
mov rdx, 1024
call Read
cmp eax, 1024
je 4b
lea rdi, A_RESPONSE
call StrLen # calculate length of A_RESPONSE
mov edi, DWORD PTR [rsp + 40] # fd = client socket
lea rsi, A_RESPONSE # buf = A_RESPONSE
mov rdx, rax # count = length of A_RESPONSE
call Write
# Close the client socket
mov edi, [rsp + 40] # fd = client socket stored at [rsp + 40]
call Close # call the close syscall
jmp 3b
2:
# Close the server socket
mov edi, [rsp] # prepare to close the socket stored in [rsp].
call Close # call the close syscall
1:
mov rdi, rax # Exit with status code
call Exit
.data
A_RESPONSE: .asciz "HTTP/1.1 200 OK\r\nServer: SwitchCase\r\nConnection: close\r\n\r\nThis is our first response."
- Line 62 (lea rsi, A_RESPONSE): Loads the address of A_RESPONSE into the rsi register.
- Line 79 (the A_RESPONSE definition): Defines A_RESPONSE as a string constant. The .asciz directive tells the assembler to append a terminating null byte to the string.
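The StrLen helper lives in Functions.s and is not shown in this post. Assuming it counts bytes up to the terminating null byte that .asciz appends, the length passed to Write can be reproduced with a short Python sketch. The function name str_len is my own stand-in for it.

```python
A_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: SwitchCase\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"This is our first response."
    b"\x00"  # the terminator that .asciz appends
)


def str_len(buf):
    """Number of bytes before the first null byte (assumed behaviour of StrLen)."""
    return buf.index(b"\x00")


length = str_len(A_RESPONSE)
headers, _, body = A_RESPONSE[:length].partition(b"\r\n\r\n")
print(length, len(body))  # prints: 85 27
```

So Write sends 85 bytes per response, 27 of which are the body. The null byte itself is not sent.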
Steps 6, 7, 8#
To keep this blog post easy to read, I will push the code for the other steps to the repository.
Step 6 - Reading the user’s request#
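There are two system calls commonly used to read from a socket: recvfrom and read. While the read syscall is easier to use, it blocks the process until data is available.
Therefore, we cannot simply keep calling read until nothing is left. Instead, we loop over read until it returns fewer bytes than our buffer size, which we take to mean that the request has been fully consumed. In the rare situation where the request size is an exact multiple of our buffer size, the last read fills the buffer, the loop calls read once more, and the process blocks until the client sends more data or closes the connection.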
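The read loop and its blind spot can be sketched in Python over a local socket pair. read_until_short mirrors the strategy described above, except that it keeps every chunk, whereas the assembly loop reuses one buffer. read_until_blank_line is an alternative that is not from this post: it stops once the blank line ending the HTTP headers has arrived. All function names are hypothetical.

```python
import socket

BUF_SIZE = 1024  # same size as the read buffer in the assembly listing


def read_until_short(conn, buf_size=BUF_SIZE):
    """Stop at the first read that returns fewer bytes than the buffer size."""
    data = b""
    while True:
        chunk = conn.recv(buf_size)
        data += chunk
        if len(chunk) < buf_size:
            return data  # a short read (or 0 bytes at EOF) ends the loop


def read_until_blank_line(conn, buf_size=BUF_SIZE):
    """Alternative, not from the post: stop once the header terminator has arrived."""
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(buf_size)
        if not chunk:  # peer closed the connection
            break
        data += chunk
    return data


def demo(request, reader):
    """Feed a request through a local socket pair and return what the reader collects."""
    client, server = socket.socketpair()
    try:
        client.sendall(request)
        return reader(server)
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    short = b"GET / HTTP/1.1\r\n\r\n"
    print(demo(short, read_until_short) == short)

    # A request of exactly BUF_SIZE bytes would leave read_until_short waiting
    # for more data, so only the terminator-based reader is run on it.
    head = b"GET / HTTP/1.1\r\nX-Pad: "
    exact = head + b"a" * (BUF_SIZE - len(head) - 4) + b"\r\n\r\n"
    print(len(exact), demo(exact, read_until_blank_line) == exact)
```

Looking for the terminator costs a scan of the buffer on each iteration, which is probably why a toy server can get away with the simpler length check.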
I benchmarked the web server after this step. The results were surprising.
Server Software: SwitchCase
Server Hostname: localhost
Server Port: 1337
Document Path: /index.html
Document Length: 27 bytes
Concurrency Level: 1
Time taken for tests: 0.548 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 850000 bytes
HTML transferred: 270000 bytes
Requests per second: 18253.50 [#/sec] (mean)
Time per request: 0.055 [ms] (mean)
Time per request: 0.055 [ms] (mean, across all concurrent requests)
Transfer rate: 1515.18 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       1
Waiting:        0    0   0.0      0       1
Total:          0    0   0.0      0       1
I was curious how fast a minimal Python alternative to this code would be.
import socket
s = socket.socket()
s.bind(("127.0.0.1", 1337))
s.listen(1)
while True:
    conn, addr = s.accept()
    conn.recv(1024) # read the HTTP request, but do not parse it
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Server: SwitchCase\r\n"
        b"Connection: close\r\n"
        b"\r\n"
        b"This is our first response."
    )
    conn.close()
And the benchmark.
Server Software: SwitchCase
Server Hostname: localhost
Server Port: 1337
Document Path: /index.html
Document Length: 27 bytes
Concurrency Level: 1
Time taken for tests: 0.570 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 850000 bytes
HTML transferred: 270000 bytes
Requests per second: 17534.11 [#/sec] (mean)
Time per request: 0.057 [ms] (mean)
Time per request: 0.057 [ms] (mean, across all concurrent requests)
Transfer rate: 1455.47 [Kbytes/sec] received
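Putting the two ab runs side by side takes a little arithmetic. The sketch below uses only numbers that appear in the two outputs above; the variable names are mine.

```python
asm_rps = 18253.50     # "Requests per second" of the assembly server
python_rps = 17534.11  # "Requests per second" of the Python server

speedup = asm_rps / python_rps
gap_us = (1 / python_rps - 1 / asm_rps) * 1_000_000  # time saved per request, in microseconds
bytes_per_response = 850000 // 10000                 # "Total transferred" / "Complete requests"

print(f"speedup: {speedup:.3f}x")
print(f"saved per request: {gap_us:.2f} us")
print(f"bytes per response: {bytes_per_response}")
```

With these figures the assembly version comes out roughly 4% ahead, a bit over 2 microseconds per request, in a single run at concurrency level 1. The 85 bytes per response match the static response: 58 bytes of headers plus the 27-byte body. A gap this small would be consistent with most of the per-request time being spent in the kernel's socket handling rather than in user-space code, though one short run is not enough to say so with confidence.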
Step 7 - Serving Files#
This step was very fun. I wrote a minimal request header parser to extract the file path, rewrite it to be relative to the current directory, and finally serve it.
There was nothing special; however, two things are worth mentioning:
- Request Method: The code supports the GET and POST HTTP methods. To check the buffer for those methods, I implemented a SWAR technique that compares the method string as a single 4-byte integer.
- In-place Path Editing: I modify the file path right inside the read buffer. As a result, the remainder of the HTTP header is truncated and discarded.
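Both ideas can be sketched in Python. The repository holds the real implementation; what follows is my guess at the approach, with made-up names. It assumes GET is matched together with its trailing space so that it fills four bytes, and that the in-place edit overwrites the space before the path with a dot and null-terminates the path.

```python
GET_WORD = int.from_bytes(b"GET ", "little")   # 0x20544547
POST_WORD = int.from_bytes(b"POST", "little")  # 0x54534F50


def method_of(buf):
    """Identify the method with one 32-bit comparison instead of a byte-by-byte loop."""
    if len(buf) < 4:
        return None
    word = int.from_bytes(buf[:4], "little")  # like: mov eax, DWORD PTR [buf]
    if word == GET_WORD:
        return "GET"
    if word == POST_WORD:
        return "POST"
    return None


def path_in_place(buf):
    """Turn 'GET /x HTTP/1.1' into './x' followed by a null byte, inside the same buffer."""
    start = buf.index(b" ") + 1    # first byte of the request path
    end = buf.index(b" ", start)   # the space after the path
    buf[start - 1] = ord(".")      # overwrite the space before '/' with '.'
    buf[end] = 0                   # terminate the path; the rest of the header is cut off
    return bytes(buf[start - 1:end])


if __name__ == "__main__":
    request = bytearray(b"GET /index.html HTTP/1.1\r\nHost: localhost\r\n\r\n")
    print(method_of(request), path_in_place(request))
```

Because the edit happens in the read buffer itself, no second buffer is needed for the file name, at the cost of losing everything after the path.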
Step 8 - Concurrency#
I implemented concurrency using the fork system call. It led me to signal handling to prevent zombie processes.
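The post does not say which signal-handling approach the repository uses, so the following Python sketch shows one common option rather than the author's code: a SIGCHLD handler that reaps every finished child with a non-blocking waitpid. In a server, the child would handle one client connection before exiting. All names are hypothetical.

```python
import os
import signal
import time

reaped = {}  # pid -> exit code of children collected by the handler


def on_sigchld(signum, frame):
    """Reap every child that has exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left
        if pid == 0:
            return  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)


def spawn_and_reap(exit_code=0, timeout=5.0):
    """Fork a child that exits immediately and wait until the handler has reaped it."""
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: a server would serve one client here
        deadline = time.monotonic() + timeout
        while pid not in reaped and time.monotonic() < deadline:
            time.sleep(0.01)
        return pid, reaped.get(pid)
    finally:
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGCHLD, restore)


if __name__ == "__main__":
    print(spawn_and_reap(3))
```

Without the handler (or an equivalent such as ignoring SIGCHLD), each finished child would stay in the process table as a zombie until the parent exits.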
Conclusion#
Building a toy assembly web server is surprisingly straightforward, requiring a mere 332 lines of code. The real challenge lay in implementing concurrency via fork. It demanded a deep dive into the system documentation to properly navigate signal handling and ensure child processes were cleanly reaped, keeping the system free of zombie processes.