COMS 4995 Advanced Systems Programming

x86-64 Assembly

x86 Data Formats & Registers

x86-64 has 16 64-bit general purpose registers, as presented in CSAPP Figure 3.2:

csapp-fig3.2

Aside from being accessed in its 8-byte entirety, the lower 4, 2, and 1 byte(s) of each register can also be accessed through different names.
Even though these are general-purpose registers, some of them tend to have a specific use because of programming conventions. Furthermore, certain instructions implicitly use certain registers. We’ll discuss this in more detail throughout the lecture.

Many x86 assembly instructions can operate on different widths of data. The operand width is specified by a suffix on the instruction name. CSAPP Figure 3.1 details these suffixes:

csapp-fig3.1

The following sequence of mov instructions demonstrates how to manipulate data at different widths:

movabsq $0x0011223344556677, %rax  # %rax = 0011223344556677
movb $-1, %al                      # %rax = 00112233445566FF
movw $-1, %ax                      # %rax = 001122334455FFFF
movl $-1, %eax                     # %rax = 00000000FFFFFFFF
movq $-1, %rax                     # %rax = FFFFFFFFFFFFFFFF

The first mov fills %rax with a quadword (64-bit) quantity, 0x0011223344556677. Regular mov instructions can only take a 32-bit immediate (i.e., literal) value. movabs allows you to specify a 64-bit immediate value.

The next four mov instructions set the lower 1, 2, 4, and 8 bytes of %rax to -1, respectively. Each mov instruction is augmented with the appropriate data-width suffix. Sometimes the suffix is omitted because it can be deduced from the register name.

Note that the movb and movw instructions leave the rest of the bits in the register unchanged, but movl has the side effect of zeroing out the higher-order 4 bytes in the register. x86-64 dictates that any instruction that writes a 32-bit register also zeroes the higher-order 4 bytes of the corresponding 64-bit register.
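The partial-register behavior can be modelled in plain C. The following is a minimal sketch, not real register access: the helper names write_al, write_ax, and write_eax are hypothetical, and a uint64_t stands in for %rax.

```c
#include <stdint.h>

/* Hypothetical helpers: each one models a mov into a sub-register of %rax
 * and returns the resulting 64-bit register value. */

/* movb into %al: replace the low byte, keep the upper 7 bytes. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}

/* movw into %ax: replace the low 2 bytes, keep the upper 6 bytes. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* movl into %eax: the old contents are discarded entirely, because a
 * 32-bit register write zeroes the upper 4 bytes. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax; /* old value does not survive */
    return (uint64_t)v;
}
```

Feeding 0x0011223344556677 through write_al, write_ax, and write_eax with an all-ones value should reproduce the register contents annotated in the mov sequence above.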

Memory Access

The mov instruction is used to transfer data between registers and memory. In this section, we’ll cover the x86 syntax used to refer to memory locations. The leaq instruction, often written simply as lea since it almost always operates on 8-byte quadwords (addresses), is used to perform pointer arithmetic. It shares the same operand syntax as mov, but doesn’t dereference memory; it just calculates the address. lea stands for “load effective address”.

CSAPP 3.8.2 gives some examples of using mov and lea instructions with various forms of memory addressing. For the following examples, assume the following:

The starting address of integer array E is stored in %rdx
An integer index i is stored in %rcx
The examples store the result in either %eax for data or %rax for pointers
The notation M[addr] refers to the data stored at addr in memory

csapp-3.8.2

As a side note, this memory access syntax is so convenient that compilers often use lea to perform arithmetic unrelated to memory access.
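To make the difference between mov and lea concrete, here is a small C sketch of the general operand form D(base, index, scale), whose address is base + index * scale + D. The function names are hypothetical; the sketch only models the address arithmetic, with E and i playing the roles of %rdx and %rcx from the assumptions above.

```c
#include <stdint.h>

/* Hypothetical model of the operand form D(base, index, scale). */
uint64_t effective_address(int64_t disp, uint64_t base,
                           uint64_t index, uint64_t scale) {
    return base + index * scale + (uint64_t)disp;
}

/* Like leaq (%rdx,%rcx,4), %rax: computes &E[i], no memory access. */
int *lea_elem(int *E, long i) {
    return (int *)(uintptr_t)effective_address(0, (uint64_t)(uintptr_t)E,
                                               (uint64_t)i, 4);
}

/* Like movl (%rdx,%rcx,4), %eax: computes the same address, then
 * dereferences it to load E[i]. */
int mov_elem(int *E, long i) {
    return *lea_elem(E, i);
}

/* Like leaq (%rdi,%rdi,2), %rax: plain arithmetic (x + 2*x), unrelated
 * to any memory access. */
long times3(long x) {
    return (long)effective_address(0, (uint64_t)x, (uint64_t)x, 2);
}
```

The last helper illustrates the compiler trick mentioned above: the addressing hardware adds and scales in one instruction, so lea doubles as a cheap multiply-and-add.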

Stack Operations

The push and pop instructions are examples of instructions that operate on specific registers. The %rsp register holds the address of the top of the stack. The push instruction decreases %rsp (to grow the stack) and then copies a value to the memory location pointed to by %rsp. The pop instruction retrieves the value at the memory location pointed to by %rsp and then increases %rsp (to shrink the stack).

CSAPP Figure 3.9 illustrates these semantics:

csapp-fig3.9
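The push and pop semantics can also be sketched in C. This is a toy model under stated assumptions: machine_t, pushq, and popq are hypothetical names, the stack is a 64-byte array, rsp is an offset into that array rather than a real address, and there is no overflow or underflow checking.

```c
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* Hypothetical toy machine: a tiny stack plus a stack pointer. */
typedef struct {
    uint8_t  mem[STACK_BYTES];
    uint64_t rsp; /* offset of the current top of stack within mem */
} machine_t;

/* The stack starts empty, with rsp one past the highest address. */
void machine_init(machine_t *m) {
    memset(m->mem, 0, sizeof m->mem);
    m->rsp = STACK_BYTES;
}

/* pushq: decrement rsp by 8 first, then store the value at the new top. */
void pushq(machine_t *m, uint64_t v) {
    m->rsp -= 8;
    memcpy(&m->mem[m->rsp], &v, sizeof v);
}

/* popq: load the value at the top first, then increment rsp by 8. */
uint64_t popq(machine_t *m) {
    uint64_t v;
    memcpy(&v, &m->mem[m->rsp], sizeof v);
    m->rsp += 8;
    return v;
}
```

Note that the stack grows toward lower addresses, so pushing makes rsp smaller and popping makes it larger.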

Stack Frame

CSAPP Figure 3.25 depicts what the stack looks like when function P() calls function Q():

csapp-fig3.25

P’s stack frame:

Before calling Q(), P() prepares the arguments. The first six arguments are passed through registers, %rdi, %rsi, %rdx, %rcx, %r8, %r9, in that order. The rest of the arguments are pushed onto P()’s stack frame, as depicted above.
When P() executes the call instruction, it pushes the return address onto its stack frame before jumping to Q().

Q’s stack frame:

Registers %rbx, %rbp, %r12-%r15 are classified as “callee-saved” registers, meaning that the caller expects them to have the original values when the callee returns. If Q() intends to use them, it must first save them on the stack so they can later be restored.
Q() then allocates space for its local variables in its stack frame.
If Q() invokes another function that takes more than 6 parameters, it will prepare the arguments on its stack frame.
Q() will store its return value in %rax (not depicted).
When Q() returns via the ret instruction, it will pop off the saved return address from P()’s stack frame and jump to it.
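As an illustration of the argument-passing rules above, consider a callee with more than six integer arguments. The functions sum8 and call_sum8 are hypothetical, and the register assignments in the comments assume the convention described above with no inlining; an optimizing compiler may well eliminate the call altogether.

```c
/* Hypothetical callee with eight integer arguments.
 *   a1 -> %rdi   a2 -> %rsi   a3 -> %rdx
 *   a4 -> %rcx   a5 -> %r8    a6 -> %r9
 *   a7, a8 -> placed on the caller's stack frame
 * The result is returned in %rax. */
long sum8(long a1, long a2, long a3, long a4,
          long a5, long a6, long a7, long a8) {
    return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8;
}

/* Hypothetical caller: plays the role of P(), with sum8() as Q(). */
long call_sum8(void) {
    return sum8(1, 2, 3, 4, 5, 6, 7, 8);
}
```

Compiling this with gcc -S -O0 and looking at call_sum8 is one way to see the two stack-passed arguments being set up before the call instruction.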

%rbp (not depicted) is known as the frame pointer. It points to the start of the current stack frame. You’ll typically see these instructions in a function:

push   %rbp              # Save the old frame pointer on the stack
mov    %rsp,%rbp         # Set the new frame pointer

# <function body>

mov    %rbp,%rsp         # Roll up the stack to %rbp
pop    %rbp              # Restore the old frame pointer
ret

The compiler may optimize away parts of or the entirety of the stack frame. For example, leaf procedures (i.e., functions that don’t call other functions) may not have a stack frame.

Code Walkthrough

Consider the following C program, divided into two source files:

sum.c:

long sum(long a, long b) {
    return a + b;
}

long sum_array(long *p, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        s = sum(s, p[i]);
    }
    return s;
}

main.c:

#include <stdio.h>

long sum_array(long *p, int n);

int main() {
    long a[5] = {0, 1, 2, 3, 4};
    long sum = sum_array(a, 5);
    printf("sum=%ld\n", sum);
}

With the x86 assembly essentials we’ve just covered, we can now dive into compiler-generated x86 assembly for this simple C program.

When gcc compiles a C source file into an object file, it first translates the C code into assembly code, and then invokes the assembler to translate the assembly code into machine code. One can generate the intermediate assembly code using the -S flag (e.g., gcc -S code.c).

The object file is encoded using the ELF binary format, which we will cover later in the class. We can use a disassembler to read the code region of the object file as assembly code (e.g., objdump -d code.o).

`sum.o`

The disassembly of sum.o, heavily annotated by us, is shown below:

sum.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <sum>:
   0:	f3 0f 1e fa          	endbr64 
   4:	55                   	push   %rbp              # Save the old frame pointer on the stack
   5:	48 89 e5             	mov    %rsp,%rbp         # Set the new frame pointer
   8:	48 89 7d f8          	mov    %rdi,-0x8(%rbp)   # Stores arguments a and b on the stack
   c:	48 89 75 f0          	mov    %rsi,-0x10(%rbp)  #   without adjusting %rsp
  10:	48 8b 55 f8          	mov    -0x8(%rbp),%rdx   # Loads stack copies of a and b into %rdx and %rax
  14:	48 8b 45 f0          	mov    -0x10(%rbp),%rax  #   Note that %rax holds function return value
  18:	48 01 d0             	add    %rdx,%rax         # b += a
  1b:	5d                   	pop    %rbp              # Restore the old frame pointer
  1c:	c3                   	ret                      # Return to caller

000000000000001d <sum_array>:
  1d:	f3 0f 1e fa          	endbr64 
  21:	55                   	push   %rbp
  22:	48 89 e5             	mov    %rsp,%rbp
  25:	48 83 ec 20          	sub    $0x20,%rsp           # Allocate 32 bytes on the stack
  29:	48 89 7d e8          	mov    %rdi,-0x18(%rbp)     # Store arguments p and n on the stack
  2d:	89 75 e4             	mov    %esi,-0x1c(%rbp)
  30:	48 c7 45 f8 00 00 00 	movq   $0x0,-0x8(%rbp)      # s = 0
  37:	00 
  38:	c7 45 f4 00 00 00 00 	movl   $0x0,-0xc(%rbp)      # i = 0
  3f:	eb 2e                	jmp    6f <sum_array+0x52>  # Jump to 0x6f
  41:	8b 45 f4             	mov    -0xc(%rbp),%eax      # %eax = i
  44:	48 98                	cltq                        # %rax = (long) %eax
  46:	48 8d 14 c5 00 00 00 	lea    0x0(,%rax,8),%rdx    # Calculate byte-offset (i * sizeof(long))
  4d:	00 
  4e:	48 8b 45 e8          	mov    -0x18(%rbp),%rax     # %rax = p
  52:	48 01 d0             	add    %rdx,%rax            # p += byte-offset
  55:	48 8b 10             	mov    (%rax),%rdx          # %rdx = *p
  58:	48 8b 45 f8          	mov    -0x8(%rbp),%rax      # %rax = s
  5c:	48 89 d6             	mov    %rdx,%rsi            # %rsi = *p (second arg)
  5f:	48 89 c7             	mov    %rax,%rdi            # %rdi = s (first arg)
  62:	e8 00 00 00 00       	call   67 <sum_array+0x4a>  # Call sum() (address not resolved yet)
  67:	48 89 45 f8          	mov    %rax,-0x8(%rbp)      # Store return value into s
  6b:	83 45 f4 01          	addl   $0x1,-0xc(%rbp)      # i++
  6f:	8b 45 f4             	mov    -0xc(%rbp),%eax      # %eax = i
  72:	3b 45 e4             	cmp    -0x1c(%rbp),%eax     # Compare i and n
  75:	7c ca                	jl     41 <sum_array+0x24>  # Jump to 0x41 if i < n
  77:	48 8b 45 f8          	mov    -0x8(%rbp),%rax      # %rax = s as return value
  7b:	c9                   	leave                       # mov %rbp,%rsp then pop %rbp
  7c:	c3                   	ret

A few additional notes:

We compile the code using -O0 -fno-omit-frame-pointer to ensure the compiler does not perform any optimizations and to ensure that %rbp is used as the frame pointer.
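sum() is a leaf procedure. Note that, apart from saving %rbp, it does not build out its stack frame by subtracting from %rsp like sum_array() does; it stores its copies of a and b below %rsp, in the 128-byte “red zone” that the System V ABI reserves for this purpose.
The 4-byte displacements in the call instructions are zero at this point, which is why the disassembler shows each call targeting the very next instruction. They will be fixed up by the linker when the final executable is built.
sum_array() allocates 32 bytes for local variables (which only take up 24 bytes), most likely to keep %rsp 16-byte aligned at the call to sum().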
Conditional jumps (like jl used in sum_array()) usually follow some comparison instructions (like cmp). The result of a comparison instruction is stored as several bits in the %rflags register. Subsequent conditional jumps refer to this register to determine if the jump should be taken.
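The cmp/jl pairing can be sketched in C. The model below is a simplification: flags_t, cmp32, and jl_taken are hypothetical names, and only three condition flags are tracked (zero, sign, and overflow). In AT&T syntax, cmp S, D computes D - S and discards the result, keeping only the flags; jl jumps when the sign and overflow flags differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the condition flags set by cmp. */
typedef struct {
    bool zf; /* zero flag: result was zero */
    bool sf; /* sign flag: result's top bit was set */
    bool of; /* overflow flag: signed subtraction overflowed */
} flags_t;

/* Models cmpl src, dst: compute dst - src in 32 bits, keep only flags. */
flags_t cmp32(int32_t src, int32_t dst) {
    uint32_t d = (uint32_t)dst, s = (uint32_t)src;
    uint32_t r = d - s;
    flags_t f;
    f.zf = (r == 0);
    f.sf = (r >> 31) & 1;
    /* Signed overflow: operands differ in sign, and the result's sign
     * differs from dst's sign. */
    f.of = (((d ^ s) & (d ^ r)) >> 31) & 1;
    return f;
}

/* Models jl: taken when dst < src as signed values, i.e. SF != OF. */
bool jl_taken(flags_t f) {
    return f.sf != f.of;
}
```

For the loop in sum_array(), cmp -0x1c(%rbp),%eax followed by jl corresponds to jl_taken(cmp32(n, i)), which is true exactly when i < n.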

Optimized `sum.o` (optional)

If we recompile sum.c using -O1 (for optimization level 1), the generated machine code changes; see our annotations in-line:

sum.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <sum>:
   0:	f3 0f 1e fa          	endbr64 
   4:	48 8d 04 37          	lea    (%rdi,%rsi,1),%rax  # %rax = a + (b * 1)
   8:	c3                   	ret    

0000000000000009 <sum_array>:
   9:	f3 0f 1e fa          	endbr64 
   d:	85 f6                	test   %esi,%esi              # Test n
   f:	7e 20                	jle    31 <sum_array+0x28>    # Jump to 0x31 if n <= 0
  11:	48 89 f8             	mov    %rdi,%rax              # %rax = p
  14:	8d 56 ff             	lea    -0x1(%rsi),%edx        # %edx = n - 1
  17:	48 8d 4c d7 08       	lea    0x8(%rdi,%rdx,8),%rcx  # %rcx = (p + 8(n-1)) + 8, addr of one-past last element of array
  1c:	ba 00 00 00 00       	mov    $0x0,%edx              # %edx = 0
  21:	48 03 10             	add    (%rax),%rdx            # %rdx += *p
  24:	48 83 c0 08          	add    $0x8,%rax              # p++
  28:	48 39 c8             	cmp    %rcx,%rax              # Compare p and addr of one-past last element of array
  2b:	75 f4                	jne    21 <sum_array+0x18>    # Jump to 0x21 if not at end yet
  2d:	48 89 d0             	mov    %rdx,%rax              # %rdx -> %rax as return value
  30:	c3                   	ret    
  31:	ba 00 00 00 00       	mov    $0x0,%edx              # %edx = 0
  36:	eb f5                	jmp    2d <sum_array+0x24>    # Unconditional jump to 0x2d

A few things to note:

sum() has been optimized to simply use the lea instruction to perform the addition.
sum_array() no longer has a stack frame because all local variables are stored in registers.

The loop logic has been rearranged to essentially the following:

long *end = p + n;
for (; p != end; p++) { ... }

sum_array() doesn’t even call sum() anymore; it has inlined the addition.
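Putting these observations together, the optimized machine code behaves roughly like the following C function. This is our reading of the disassembly rather than compiler output; the name sum_array_opt is hypothetical.

```c
/* Hypothetical C rendering of the -O1 version of sum_array(). */
long sum_array_opt(long *p, int n) {
    long s = 0;           /* mov $0x0,%edx */
    if (n <= 0) {         /* test %esi,%esi ; jle */
        return s;
    }
    long *end = p + n;    /* lea 0x8(%rdi,%rdx,8),%rcx with %edx = n - 1 */
    do {
        s += *p;          /* add (%rax),%rdx  (inlined sum()) */
        p++;              /* add $0x8,%rax */
    } while (p != end);   /* cmp %rcx,%rax ; jne */
    return s;             /* mov %rdx,%rax ; ret */
}
```

The up-front n <= 0 check is what allows the loop itself to test its condition only at the bottom, saving one jump per iteration compared with the -O0 version.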

`main.o`

The disassembly of main.o, heavily annotated by us, is shown below:

main.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:	f3 0f 1e fa          	endbr64 
   4:	55                   	push   %rbp
   5:	48 89 e5             	mov    %rsp,%rbp
   8:	48 83 ec 30          	sub    $0x30,%rsp
   c:	48 c7 45 d0 00 00 00 	movq   $0x0,-0x30(%rbp)    # a[0] = 0
  13:	00 
  14:	48 c7 45 d8 01 00 00 	movq   $0x1,-0x28(%rbp)    # a[1] = 1
  1b:	00 
  1c:	48 c7 45 e0 02 00 00 	movq   $0x2,-0x20(%rbp)    # a[2] = 2
  23:	00 
  24:	48 c7 45 e8 03 00 00 	movq   $0x3,-0x18(%rbp)    # a[3] = 3
  2b:	00 
  2c:	48 c7 45 f0 04 00 00 	movq   $0x4,-0x10(%rbp)    # a[4] = 4
  33:	00 
  34:	48 8d 45 d0          	lea    -0x30(%rbp),%rax    # %rax = a
  38:	be 05 00 00 00       	mov    $0x5,%esi           # %esi = 5 (second arg)
  3d:	48 89 c7             	mov    %rax,%rdi           # %rdi = a (first arg)
  40:	e8 00 00 00 00       	call   45 <main+0x45>      # Call sum_array()
  45:	48 89 45 f8          	mov    %rax,-0x8(%rbp)     # Store return value in sum
  49:	48 8b 45 f8          	mov    -0x8(%rbp),%rax     # %rax = sum
  4d:	48 89 c6             	mov    %rax,%rsi           # %rsi = %rax (second arg)
  50:	48 8d 05 00 00 00 00 	lea    0x0(%rip),%rax      # %rax = address of fmt string (unresolved)
  57:	48 89 c7             	mov    %rax,%rdi           # %rdi = %rax (first arg)
  5a:	b8 00 00 00 00       	mov    $0x0,%eax           # %al = 0: no vector-register args to variadic printf()
  5f:	e8 00 00 00 00       	call   64 <main+0x64>      # Call printf()
  64:	b8 00 00 00 00       	mov    $0x0,%eax           # %eax = 0 (main's return value)
  69:	c9                   	leave  
  6a:	c3                   	ret

Note that this object file was compiled using -fno-stack-protector. gcc will protect functions that it deems vulnerable to buffer overflow attacks with a special guard value. The guard is initialized when a function is entered and then checked when the function exits. We disabled this feature to keep the main() function short.

`main`

Once the object files are linked together to build the executable, the addresses in the call instructions are resolved. For example, sum_array()’s call to sum() is now resolved as an address relative to the instruction pointer:

1213:       48 89 c7                mov    %rax,%rdi
1216:       e8 99 ff ff ff          call   11b4 <sum>
121b:       48 89 45 f8             mov    %rax,-0x8(%rbp)

The address of sum() is calculated as 0x121b (the address of the next instruction) + 0xffffff99 (the 4-byte signed integer following the e8 call instruction, in little-endian order), or 0x121b - 0x67, which results in 0x11b4. This jump target encoding relative to the instruction pointer is known as “PC-relative” encoding.
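The same calculation can be written as a short C function. This is a sketch for the 5-byte e8 form of call only; call_target is a hypothetical helper, and it takes the instruction bytes along with the address at which the instruction is located.

```c
#include <stdint.h>

/* Hypothetical helper: decode the target of a 5-byte "e8 rel32" call
 * located at insn_addr. */
uint64_t call_target(const uint8_t insn[5], uint64_t insn_addr) {
    /* Assemble the 4 displacement bytes, stored in little-endian order. */
    uint32_t raw = (uint32_t)insn[1]
                 | ((uint32_t)insn[2] << 8)
                 | ((uint32_t)insn[3] << 16)
                 | ((uint32_t)insn[4] << 24);

    /* Interpret the displacement as a signed 32-bit integer. */
    int64_t rel = (raw & UINT32_C(0x80000000))
                ? (int64_t)raw - INT64_C(0x100000000)
                : (int64_t)raw;

    /* The displacement is relative to the next instruction's address. */
    uint64_t next = insn_addr + 5;
    return next + (uint64_t)rel;
}
```

Applied to the bytes e8 99 ff ff ff at address 0x1216, this yields 0x11b4, the address of sum(). Applied to an unresolved call such as e8 00 00 00 00 at offset 0x62 in sum.o, it yields 0x67, which is why objdump displays that call as targeting the instruction immediately after it.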

GDB (GNU Debugger)

GDB is a command-line tool that you can use to debug your program. It lets you pause program execution at arbitrary points and inspect memory, register contents, variable values, etc. There are many online tutorials on how to use GDB effectively. Also see CSAPP Section 3.10.2 and Figure 3.39 for an introduction to GDB and a list of useful commands.

Last Updated: 2024-03-12