User:Schol-r-lea/Brief Tutorial On Data Structures In Assembly Language

From OSDev Wiki
Jump to navigation Jump to search

N.B.: this was originally created as a post on the DevShed forum in December 2006. I am posting it here for now in order to edit it for a more general use later.

This tutorial gives a general overview of how data structures can be implemented in various assembly languages, with examples for a number of commonly used assemblers, with special attention being given to the Intel and AT&T (GAS) syntaces for x86 protected mode. The examples are all for 32-bit systems, with a matching C `int` type; for other system bit widths, the sizes will need to be adjusted as appropriate.


General Concepts

Conceptually at least, a structure in assembly language is like a structure in any other language: a fixed group of heterogeneous variables which have their own names but which are part of a larger composite variable. For example, in C++, a struct might look like

   struct {
      int foo, bar;
      char baz;
   } qux;

The variables inside the braces would all be part of the composite variable `qux`; to access one part of `qux`, say `foo`, you would use `quux.foo` or something similar. These are what are known as the fields or members of the structure.

It's much the same in assembly language, except of course that the types of the variables are implicit rather than explicit. Thus, it is up to the programmers using the data structures to ensure that the field sizes are correct, and the instructions operating on them are appropriate to the data they represent.

Most assemblers have support for defining structures explicitly, via directives, as will be explained below. These directives allow the assembly programmer to define how a structure is laid out in memory, and in most case can also define a structure 'type' as a series of memory offsets from the first field of the structure.

It is possible to define structures even if the assembler has no such directives, however. In that case, the structure definition can be created by a set of programming conventions; the programmer would be responsible for deciding what those conventions are. So, if we agree to use labels which start with the name of the structure as a whole, followed by a period, and then the name of the element in question, you would end up with a 'structure' that consists of a series of explicit labels and/or offsets.

Most of this discussion will focus on the convention-based approach, both to demonstrate the principles more explicitly, and to avoid tying the discussion too tightly to the syntaces of any specific assemblers. This will be followed by brief explanations of the structure directives in some of the more popular assemblers, and how they can be used to the same effect, and how they can improve the clarity of the structure definitions.

while it is preferable to use the directives appropriate to the assembler when possible, in general the result will be the same in terms of how the structure gets laid out in memory.

Structures by Convention

In order to manage a structure in the absence of assembler-supported structure directives, we need to first decide how we want to describe our structures. One simple approach, which most assemblers will permit, is to define the fields - or more generally, the offsets for the fields - by using the name of the structure or structure type, a period, and the name of the field. This is similar to the way fields are accessed in many high-level languages, and most assemblers with directive-based structures use this approach or one very similar to it.


Layout in Memory of a Unique Structure

Defining the fields of a structure becomes simply a matter of defining the location in memory of the structure itself, then defining the locations of the fields.

MASM syntax NASM syntax GAS Syntax (any)
   qux:              
   ; blind label marking 
   ; the start of the
   ; struct, the same actual 
   ; addr. as qux.foo in
   ; this case.
   
   ; the struct's fields:
   qux.foo: DWORD 1 ; int
   qux.bar: DWORD 1 ; int
   qux.baz: BYTE  1 ; char 
   align 4
   qux:              
   ; blind label marking 
   ; the start of the
   ; struct, the same actual 
   ; addr. as qux.foo in
   ; this case.
   
   ; the struct's fields:
   qux.foo: resd 1 # int
   qux.bar: resd 1 # int
   qux.baz: resb 1 # char 
   align 4
   qux:              
   # blind label marking 
   # the start of the
   # struct, the same actual 
   # addr. as qux.foo in
   # this case.
   
   # the struct's fields:
   qux.foo: .space 4 # int
   qux.bar: .space 4 # int
   qux.baz: .space 1 # char 
   .align 4


Note that because `baz` was a one byte value, we need to pad three bytes after it to ensure word alignment. The structure itself is 9 bytes long, but the whole memory use is 12 byte when you count the padding. This will be important later.


Accessing a Uniquely-Defined Structure's Fields

Since in this model, the fields are just ordinary labels which happen to follow a naming convention, they would be used in the same manner as any other globally defined memory variables.


Defining a Structure Type

Now, you might say that this is a lot of pointless work for a single structure; while you can treat the composite variable as a single structure, you really have three separate values, with separate labels. In this case, it is true that they could have been just as easily treated as separate data values; however, this is equally true in the C++ example above. However, declaring a single struct in this manner is useful for showing the logical and conceptual connections between the elements; it shows that they are represent something as a whole.

That having been said, the real power in the idea of data structures is not from declaring a single structure in this manner, but from declaring types of structures. Again, in C++, you could define a structure type:

   struct quux {
      int foo, bar;
      char baz;
   };
   
   quux flarp, *frink;

Now, instead of having a single struct named quux, we define a sort of pattern or template (not a template in the C++ sense, but bear with me) for defining structures of the same type. In assembly language, this sort of structure type would be defined not as a series of labels, but as a series of offsets, like so:

x86, Intel syntax GAS Syntax (any)
   ; size of quux structs, incl. padding
   quux.sizeof equ 12  
   quux.foo    equ 0   ; 1st element size
   quux.bar    equ 4   ; 2nd element size
   quux.baz    equ 8   ; 3rd element size
   # size of quux structs, incl. padding
   quux.sizeof .= 12  
   quux.foo    .= 0   # 1st element size
   quux.bar    .= 4   # 2nd element size
   quux.baz    .= 8   # 3rd element size

You would then define a variable the size of the whole structure type to hold the structure:

MASM syntax NASM syntax GAS Syntax (any)
   flarp  BYTE quux.sizeof dup(?)
   flarp  resb quux.sizeof
   flarp  .space quux.sizeof


Note that a pointer to a struct would still be the regular size of a pointer, which for most 32-bit architectures is 4 bytes:

MASM NASM
   SECTION .DATA
   frink    dd 0
   
   SECTION .CODE
   ; ... do things...
       lea eax, flarp
       mov frink, eax
   section [data]
   frink    dd 0
   
   section [text]
   ; ... do things...
       lea eax, flarp
       mov frink, eax


GAS, x86 GAS, ARM GAS. MIPS
   .section .data
   frink .long 0
   
   .section .text
   # ... do things...
       lea flarp, %eax
       movl %eax, frink
   .section .data
   frink .word 0
   
   .section .text
   @ ...do things...
       ldr %r0, =frink
       ldr %r1, =flarp
       str %r0, [%r1]
   .section .data
   frink .word 0
   
   .section .text
   # ...do things...
       la %t0, flarp
       la %t1, frink
       sw %t0, 0(%t1) 


There may be variations on this, such as using data pools in ARM assembly or different forms of indirection on the x86, but this is the general model.

Using an Instance of a Structure Type

At this point, the structure variables that make up flarp are implicitly defined by the offsets rather than defined by explicit labels. To use the struct you would get a pointer to the start of the struct variable, then use that pointer plus the defined offsets to access the structure elements.


Intel GAS, x86
       mov ebx, flarp
       mov eax, [ebx+quux.baz]
       movl flarp, %ebx
       movl (%ebx, quux.baz), %eax  
ARM, GAS syntax MIPS, GAS syntax
       ldr %r0, 
       lw %t0, flarp
       lb %t1, quux.baz(%t0)


Dynamically Allocated Structures

Now we're ready to see the real advantage of structures: defining values at runtime, either on the stack or in the heap. To dynamically allocate a structure of a given type, you just allocate the space need for it, and use the same offset-based approach to accessing the fields as before.

Stack Allocation

Most languages allocate the local variables for functions on the system stack, allowing them to be created and freed in a simple, consistent manner. To define a structure on the stack, you would do the same as you would a scalar variable, which is to allocate memory for it as part of the function's stack frame.

ARM, GAS syntax MIPS, GAS syntax
       addi $sp, $sp, -20
       sw   $fp, $fp.fp($sp)
       move $fp, $sp
       sw   $ra, fp.ra($fp)
       addi $t0, $fp, -20   # location of the struct on the stack
       la   $t1, frink
       sw   $t0, ($t1)      # frink now points to the structure
   

Note that the 12 bytes has to be in addition to anything else you are saving on the stack, such as the frame pointer, the return address, the s-registers, and so on; the struct (and any other local variables) would come after everything else being saved. This is also why we padded the struct to word alignment rather than halfword alignment - in the MIPS processor, values on the stack are always assigned in word increments (the same applies to heap structures).

Just as in C, you have to be careful with pointers to values on the stack; once the stack is cleared, the variable is gone, and trying to use a pointer to it after that may severely corrupt your stack. This can be a tricky problem, as it will only become apparent if the stack grows up to that point in memory again; it may not show up for several function calls later.

Heap Allocation

As with stack allocation, heap allocation works in much the same way as it does for scalar values; it simple is a matter of allocating a large enough block of memory to store the structure.

Since heap allocation is dependent on the operating system being used, any examples given would reflect the OS used. For this reason, no example here can apply to a new OS. For this reason, the examples will cover the use for four established systems (Windows, MacOS, FreeBSD, and Linux) in order to contrast any differences in how they need to be handled. There is also separate page where OS developers can explain their own system's memory handling methods, to highlight the different methods which new OS developers might consider using (this page will include examples for the established systems as well).

MASM NASM, Windows GAS, x86 Windows
NASM, x86 MacOS
NASM, x86 Linux GAS, x86 Linux


GAS, ARM Linux GAS, MIPS Linux
       li   $a0, quux.sizeof
       li   $v0, malloc      # malloc == syscall 9
       syscall
       sw   $v0, frink       # save the pointer to the allocated space

As always the case for heap allocation, you have to be careful to retain at least one pointer to the structure as a whole, in order to free it later, and you should not try to access a structure pointer after the structure has been freed.

Using Heap Structures for Linked Data Structures

With heap variables, you can also create more elaborate data structures using structs and pointers, such as a linked list or tree. For example, let's consider a binary tree of lookup keys and values, in which each node has a 15-character string (16 bytes total to allow for a zero-delimiter for a maximum size key), a 50-byte data buffer (note the unaligned size!), and two 32-bit pointers for left and right branches:

   struct KeyNode {
       char key[16];
       uint8_t data[50];
       KeyNode *left, *right; 
   };


the general data structure would be:


MASM / NASM GAS (any)
   KeyNode.sizeof equ 132
   KeyNode.key    equ 0
   KeyNode.data   equ 16
   align 4
   KeyNode.left   equ 68
   KeyNode.right  equ 100
   KeyNode.sizeof = 132
   KeyNode.key    = 0
   KeyNode.data   = 16
   .align 4
   KeyNode.left   = 68
   KeyNode.right  = 100

The code to manage them might be:


MASM NASM, Windows GAS, x86 Windows
NASM, x86 MacOS
NASM, x86 Linux GAS, x86 Linux
ARM, GAS syntax MIPS, GAS syntax
   # create a node  
       li   $a0, KeyNode.sizeof
       li   $v0, malloc
       syscall
       move $a0, $v0
       jal  insert
   
   # ...
   
   insert:
   

Structures Using Assembler Directives

Up to now, this discussion has focused on manually defining structures and their fields using the same methods used in defining scalar values, by allotting space based on the size of the structure and using hand-computed offsets for defining and accessing their fields. However, just as it makes sense for the assembler to compute the addresses and offsets for labels and jumps or references to them, there would be a clear advantage for the programmer if the assembler allowed the structures field addresses or offsets be automatically calculated. Most modern assemblers do indeed have some way of doing this, though the specifics often vary greatly.

Let's take the same C++ structure type as before, and apply the different assemblers' techniques for them.


Defining and Using a Structure Type

   struct quux {
      int foo, bar;
      char baz;
   } flarp, *frink;


MASM syntax NASM syntax FASM Syntax
   quux STRUCT ALIGN DWORD
       foo dd  ?
       bar dd  ?
       baz db  ?
   quux ENDS
   
   
   struc quux:              
       .foo: resd 1
       .bar: resd 1
       .baz: resb 1
       alignb 4
   endstruc
   
   flarp: 
     istruc quux
       at .foo, dd 0
       at .bar, dd 0
       at .baz, db '\00'
     iend
   frink: dd 0 ; null-init'ed ptr
   struc quux  foo, bar, baz 
   {              
       .foo dd foo 
       .bar dd bar
       .baz db baz
   }
     
   frink quux 0, 0, '\00'
   flarp dd 0  ; null-init'ed ptr
GAS Syntax (any)
   .struct quux
       foo: .space 4
       bar: .space 4
       baz: .space 1 
   .end_struct
   

Declaring Structures

Now, it is worth recalling that this is just a way of imposing a structured perspective on untyped memory. Most assemblers do not enforce data types, for scalars or structures, and it is up to the programmer to ensure that they are using the data in the intended manner, regardless of whether it is in a structure or not.


Forth Assemblers, High Level Assemblers and Macro-Extensible Assemblers

Most assemblers strive to maintain a close parallel between the assembly mnemonics and data declarations, and the resulting machine code that is actually executed. While most assembly languages do not have a strict one-to-one correspondence between the mnemonics and the opcodes, they do try to keep it as close to them as is feasible; ands while they will automatically calculate things such as label addresses, constant arithmetic in immediate arguments and indexes, and sizes for data, how those computed values are used is still up to the coder. In general, those abstraction mechanisms which assemblers do provide - structs and macros being the main examples - they rarely allow them to change the behavior or syntax of the language.

There are a number of exceptions to this, however. These fall into three main categories:

  • Forth Assemblers - these are assemblers designed to work with Forth interpreters and compilers, and are used for creating executable Forth words which are automatically added to the current Forth dictionary. They essentially are used to extend the language to add support for hardware-dependent low-level operations, and are entirely integrated into the Forth system they are a part of. Generally speaking, these will operate on data structures defined in Forth, rather than having a separate structure mechanism and syntax.
  • High-Level Assemblers - these are an intermediate between a conventional assembler and a high-level language. Most of the examples in this category use standard assembly mnemonics for the majority of the code but providing some higher-level operations such as loops and function returns to simplify the parts which are relatively common and reduce some of the repetitiveness of assembly code. Many of these, though not all, are aimed at making learning assembly language easier for novices. Several of them are actually called "High Level Assembly" or "HLA", including the best know one for x86 machines, Hyde's HLA, which is associated with the book The Art of Assembly by the same author.
    • Alternate-Syntax Assemblers - This is a subclass of HLAs which eschew conventional mnemonics for a high-level style syntax, while still only allowing operators which directly correspond to the machine code. Some of these are also universal assembly languages, which provide operators corresponding to the machine code operators of a specific machine, or a subset of them common to many different architectures, but do so in a general syntax similar to some high-level language which it is modeled after (usually C) and can usually be ported to and reassembled on other systems with only some relatively minor adjustments (something not normally possible with assembly code). One of the languages using the name "C--" is an assembler of this type.
  • Macro-Extensible Assemblers - these are assemblers whose macro facilities are capable of extending and altering the syntax of the language itself, sometimes to the point of becoming, in effect, a high-level language implemented as a set of macros rather than a compiler or interpreter. These were a subject of experimentation in the 1960s, during the period when compiler technology was in its infancy - these powerful macro assemblers seemed to offer an alternative to the problem of implementing a full compiler. The best known example of this is the SNOBOL4 'interpreter', which would convert a line or several lines of SNOBOL into an assembly language program through syntax transformations, assemble it, and execute it before returning to its REPL. These have mostly disappeared since the early 1970s, as compiler design improved to the point where writing a compiler was no longer more difficult than writing a macro preprocessor capable of this together with the macros for doing the transformations.