Like many people who have backgrounds in higher level languages like JavaScript and Ruby, one thing that really attracted me to Rust was the ability to get “closer to the metal”. While Rust offers plenty of high level abstractions, it certainly makes you think a bit more about lower level concerns like memory allocation than JavaScript or Ruby do. But of course, you can always go deeper, and learning more about the abstraction layer underneath Rust can be a great way to really understand what makes Rust tick.
In this series, we’ll explore the world of assembly language from the perspective of a Rust developer. We’ll treat the compiler as a black box and see what kind of assembly instructions get produced from standard, run-of-the-mill Rust code. Doing this should get us a bit closer to understanding what’s actually happening on our machine (though, of course, the stack goes even deeper than the assembly language abstraction layer).
The Setup
Assembly Language Variety
As assembly has a close relationship to the actual machine code for a particular computer architecture and is therefore not a platform agnostic abstraction, we have to choose which variety we’ll be exploring. For us this will be x86-64 assembly, which is the architecture you’re most likely to find on desktop and server computers today. Hopefully at some point in this series we’ll take what we’ve learned and see if we can apply it to another machine architecture.
A given computer architecture can have different syntax flavors for its assembly code, allowing us to write and read different assembly language syntax which assembles to the same machine instructions. For our purposes, we’ll be looking at the Intel x86-64 syntax, which is often contrasted with the AT&T syntax. While Wikipedia says that the Intel syntax is more common in the Windows world while AT&T is more common in Unix circles, in my limited experience I’ve seen the Intel syntax used more often even in Unix contexts.
Godbolt
To explore assembly code, we can use a plethora of tools, but one that I find the most convenient for rapid exploration is Matt Godbolt’s Compiler Explorer. This tool allows us to write Rust code, and have it compile automatically and show us the relevant assembly code complete with color coordinated highlighting indicating which parts of our code produce which parts of the assembly output. The compiler explorer uses Intel syntax by default.
What You Need to Know
I’d really love this to be as accessible as possible, but I do assume some background knowledge. You should have a passing, high-level familiarity with the following concepts:
- The stack: a growable stack data structure that contains stack frames, each of which holds the local variables for a function call and gets “automatically” cleaned up when the function returns.
- Registers: very small (64 bits on a 64-bit machine) memory storage on the CPU where data can be manipulated.
- Memory: each process gets its own memory space that contains static data, the code being executed, the stack, and some space for dynamically allocated memory known as the heap. Memory can be thought of as a long array of bytes that starts at index (better known as address) 0 and goes all the way to address 2^64 - 1.
- Basic Rust: we’re only writing three lines of Rust but it still helps to have familiarity with Rust, C, or C++.
Ok, now that we’re all on the same page, let’s get started:
INC - debug
In this post, we’re going to explore a very simple Rust library that provides one function, `inc`, which takes in a `u8`, adds one to it (wrapping around if it goes beyond 255), and then returns the result:
pub fn inc(n: u8) -> u8 {
n.wrapping_add(1)
}
If you’re not familiar with `wrapping_add`, it simply wraps the number around when it overflows, unlike `+`, which panics on overflow in debug mode (`+` and `wrapping_add` behave the same in release mode).
Go to the compiler explorer, make sure you select Rust from the language drop down menu (as C++ is the default), and copy in the Rust program to the panel on the left. For this post, we’ll be using Rust version 1.40.0. If you use a different version of the compiler it’s possible you may see different results.
On the right hand side of the screen, you should see the following:
core::num::<impl u8>::wrapping_add:
sub rsp, 2
add dil, sil
mov byte ptr [rsp + 1], dil
mov al, byte ptr [rsp + 1]
mov byte ptr [rsp], al
mov al, byte ptr [rsp]
add rsp, 2
ret
example::inc:
push rax
movzx edi, dil
mov esi, 1
call core::num::<impl u8>::wrapping_add
mov byte ptr [rsp + 7], al
mov al, byte ptr [rsp + 7]
pop rcx
ret
This is quite a bit of assembly code just to add 1 to a number! Don’t worry, we’ll see later on that we can easily turn this code into just two instructions. In the meantime, this assembly has lots of interesting bits to it.
Let’s explore this by first looking at the code underneath the `example::inc` label. The `example::inc:` text is what’s known as a label. A label names a piece of memory - in this case, our `inc` function. We can use the label in our assembly code to refer to the location in memory where our `example::inc` function sits.
The Function Prologue
The first instruction in our `example::inc` function is `push rax`, which pushes whatever value is in the `rax` register onto the stack. In more precise terms, 8 is first subtracted from `rsp` (the stack pointer register, which always contains the address of the top of the stack), and then the value in `rax` is copied to the location `rsp` now points to.
`rsp` and `rax` are 64-bit registers known as “general purpose registers”, but this is a bit of a misnomer since, as we’ve seen, `rsp` has the special purpose of pointing to the top of the stack. You should take a sec to read about the different registers on an x86-64 machine and how they “contain” smaller versions of themselves (e.g., `rax` “contains” a 32-bit register named `eax`, a 16-bit register named `ax`, and two 8-bit registers named `ah` and `al`).
So why does `push` subtract 8 from `rsp`? For historical reasons the stack grows downward, meaning the top of the stack is at a lower memory address than the bottom. If you want to grow the stack, you need to move the top to an even lower address by subtracting from it. The reason 8 is subtracted is that this is the size in bytes of `rax` (8 bytes is 64 bits) - so we’re moving the stack pointer just beyond the value we pushed onto the stack.
But what’s the purpose of all this? Well it turns out that we do this to uphold an important part of the function calling convention.
Aside: ABIs and Calling Conventions
An ABI (or application binary interface) is the binary interface between two binary modules. In other words if two pieces of actual machine code need to talk with each other, there’s a whole host of things they need to agree upon in order to do so successfully. One such thing is a calling convention which is an agreed upon way for how functions are called.
x86 assembly only has two instructions dedicated to functions: `call` for calling a function and `ret` for returning from a function. `call` pushes the next instruction’s location onto the stack and `ret` pops that address off the stack and jumps to that location. But this isn’t enough to handle all function calls. Where do the function arguments go? Where does the return value go? These need to be agreed upon so we can call functions with arguments and return values. We’ll explore these questions in depth in this series.
While Rust may change which calling convention it uses between releases of the compiler, it needs to have a consistent way inside of a binary to call functions. It seems that as of Rust 1.40.0, Rust is using the System V ABI, at least for its function calling convention. We’ll be exploring what this actually entails in great depth over this series, so don’t worry if this seems fuzzy. We simply need to know what the caller of a function and the called function itself need to do to allow functions to be called successfully.
One thing that the System V calling convention dictates is that the stack be 16-byte aligned - meaning that the stack pointer (i.e., `rsp`) should be divisible by 16. Why this is, I’m not entirely sure, but it needs to be this way. If you have the answer, let me know! Since we’re inside the `example::inc` function, we know that `call` was the last instruction executed. Because `call` pushes 8 bytes (i.e., a 64-bit return address) onto the stack, the stack must no longer be 16-byte aligned. To correct for this, we can either subtract 8 from `rsp` or we can push something else that’s 8 bytes big onto the stack, which accomplishes the same thing. Apparently, Rust and LLVM believe `push` is a better choice than subtracting, but I’m not really sure why.
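The realignment is easy to check with a bit of arithmetic. Assuming a hypothetical stack pointer value that is 16-byte aligned at the call site:

```rust
fn main() {
    let mut rsp: u64 = 0x7fff_ffff_e000; // hypothetical, 16-byte aligned
    assert_eq!(rsp % 16, 0);

    rsp -= 8; // `call` pushes the 8-byte return address
    assert_eq!(rsp % 16, 8); // now misaligned

    rsp -= 8; // `push rax` in the prologue
    assert_eq!(rsp % 16, 0); // aligned again, ready to call another function
}
```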
It turns out that there’s usually a bit of ceremony that a function must do when it’s first called to make sure everything is in order so the actual function body can execute successfully. In the case of `example::inc` this was just one instruction, but for other functions it may involve more, many of which we’ll see later in this series. This ceremony is referred to as the function’s prologue. As we’ll see later, there’s usually also a function epilogue which cleans things up at the end of the function.
Aside: Naked Functions
As a side note: there’s actually an experimental feature in Rust called “naked functions” which allows the programmer to tell the compiler not to include the function’s prologue and epilogue.
Phew… that’s a lot of explanation for one instruction! How long is this post going to be?! Well hopefully things should pick up a bit more from here.
Calling core::num::<impl u8>::wrapping_add
The next three instructions all have to do with calling the function wrapping_add
:
movzx edi, dil ;; "copy" `dil` into `edi` and zero extend
mov esi, 1 ;; copy 1 into `esi`
call core::num::<impl u8>::wrapping_add ;; call `wrapping_add`
In order to call a function, we have to prepare the function arguments. In the System V calling convention, the registers `rdi`, `rsi`, `rdx`, `rcx`, `r8`, and `r9` (and their smaller variants) are used to store integer function arguments (with the stack being used for any additional arguments after that).
The first instruction, `movzx`, copies the contents of the 8-bit register `dil` into `edi`. If you’ve read a bit about x86-64 registers, you may have noticed that `dil` is the 8-bit version of `edi`, which is itself the 32-bit version of the 64-bit register `rdi`. As `rdi` is the first function argument register, `dil` must contain the first (and only) argument to the `example::inc` function.
The `movzx` instruction will “zero extend” `dil` into `edi` (the “zx” stands for “zero extend”; its sibling `movsx` sign extends instead, copying the most significant bit into the upper bits). “Zero extension” is the process by which the upper bits of the larger representation are filled with zeros. For example, when zero extending `0b1000_0001` to 16 bits, it becomes `0b0000_0000_1000_0001`. I assume this is done so the upper bits of `edi` don’t contain stale data from earlier computations.
Next, `1` is copied into `esi`. Notice that we’ve now filled `edi` with the contents of `dil` (the argument to `example::inc`) and `esi` with 1. `edi` and `esi` are the (32-bit versions of the) first two function argument registers. We’ve set up the arguments to `wrapping_add`, which we’re now ready to call using `call` which, as we learned above, pushes the next instruction’s address onto the stack and jumps to the label provided - in our case, `wrapping_add`.
The wrapping_add Function
Now we enter the `wrapping_add` function:
sub rsp, 2 ;; make room on stack
add dil, sil ;; do the addition
mov byte ptr [rsp + 1], dil ;; copy result to the stack at `rsp + 1`
mov al, byte ptr [rsp + 1] ;; copy that value back to `al`
mov byte ptr [rsp], al ;; copy `al` to top of stack
mov al, byte ptr [rsp] ;; copy that back to `al`
add rsp, 2 ;; restore the stack pointer
ret ;; jump back
The first thing the function does is its prologue: `sub rsp, 2`, which subtracts 2 from `rsp` (the stack pointer) and stores the new value in `rsp`. We’re going to use 2 bytes of the stack in this function, so we’re making room.
Next, the reason our function was called in the first place happens: the two function arguments, `dil` and `sil`, get added together.
What happens after this is a bit strange, and I’m not sure why this code got generated. With `mov byte ptr [rsp + 1], dil`, `dil` gets copied to the address `rsp + 1` - the second byte of the 2-byte region we just reserved (remember, `rsp` points at the top of the stack, so `rsp + 1` is one byte away from the top). Then with `mov al, byte ptr [rsp + 1]`, we turn around and copy that byte into `al` (one of the 8-bit registers inside of `rax`). Then, strangely, we do the same dance again, this time using the byte at the very top of the stack. We’ve essentially used 4 instructions just to copy the value from `dil` into `al`. Why this code was generated this way, I’m not sure, though I suspect eliminating these redundant moves requires additional optimization passes which the compiler skips in debug mode.
At any rate, the System V calling convention dictates that return values are found in `rax`. Since the result of our addition is now in `rax`’s 8-bit variant `al`, we’re done!
Finally, in the epilogue, we restore `rsp` back to what it was before the prologue by adding 2 to it, and then `ret` pops the return address off the stack and jumps to it. If this function seemed a bit wasteful, it was, but it’s over!
Finishing Up
We have all the tools in our toolbox to understand the rest of the `example::inc` function:
mov byte ptr [rsp + 7], al ;; move return value to 8th byte in stack
mov al, byte ptr [rsp + 7] ;; move that value back to `al`
pop rcx ;; epilogue: pop top of stack
ret ;; return
The call to `wrapping_add` ended with the result in `al`. For some reason (probably similar to what happened in `wrapping_add`) we copy `al` to the 8th byte of the stack and then immediately copy it back to `al`.
Finally, we must complete our epilogue and undo what we did in the prologue, namely pop the top of the stack. I believe we pop into `rcx` because it’s not being used; we could have popped into any other unused register and things would still work. Finally, we return!
We’re done! 🎉 We just did a lot in order to add one to a number. Surely, we can do better, right? Turns out we can, by increasing the level of optimization.
INC - release
The Godbolt compiler explorer uses rustc directly and does not turn on any optimizations. If you’re familiar with Rust, then you know that by default rustc doesn’t do much optimization. Usually with Cargo we would add the `--release` flag and our code would get optimized (at the expense of longer compilation times), but with rustc we have to pass a different flag, `-C opt-level=3`, which tells rustc to apply the maximum level of optimization. In the compiler explorer, we can pass this flag in the “compiler options” box. Doing so, we should see dramatically different output:
example::inc:
lea eax, [rdi + 1]
ret
Wow! We now have essentially 1 instruction (plus `ret` to return from our function). `lea eax, [rdi + 1]` does everything we need. `lea` stands for “load effective address”, and here it’s being used in a way that’s not really in line with its name. The “normal” use of `lea` is to compute an address and load it into a destination register. Clearly `rdi + 1` is not an address we intend to dereference, but that’s ok - `lea` just does the arithmetic without touching memory. It takes the contents of `rdi`, which we know holds the argument to our `example::inc` function, adds 1 to it, and stores the result in `eax`, where our return value is expected to be.
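One nice detail: `lea` doesn’t need any explicit wrapping logic. The addition happens in 32 bits, and since the caller only reads the low 8 bits of `eax` (i.e., `al`), the truncation back to a byte wraps for free. We can mimic that in Rust:

```rust
fn main() {
    let n: u8 = 255;
    // do the addition in a wider type (like `lea` computing in 32 bits)...
    let wide = (n as u32) + 1;
    // ...then keep only the low 8 bits (like the caller reading `al`)
    let result = wide as u8;
    assert_eq!(result, n.wrapping_add(1)); // truncation == wrapping
    println!("{}", result); // prints 0
}
```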
We’re done in 1 instruction. 🎉
Conclusion
That was a jam-packed first look at the x86-64 assembly that Rust produces in both debug and release mode. Hopefully you’ve learned some neat x86-64 instructions, a bit about the System V calling convention, and some of the funny things that non-optimized Rust code does. If you enjoyed this, please let me know!