User:Combuster/Object File Architecture

From OSDev Wiki

When source code is compiled in parts, the compiler cannot fill in every detail itself. It delegates the rest to the linker, which is aware of all the components, or even to the runtime linker. Because the compiler does not know where code and data will end up, it has to attach extra information to the machine code so that the missing details can be filled in later.

Symbols & Relocations

The core mechanic of the intermediate representation depends on having two parts. Symbols represent locations in the code, and relocations represent places in the code where a reference is made to such a symbol. When you link an application, the actual locations of most symbols become known, and the linker can make sure that every reference to a symbol is replaced with the correct value. If a symbol is referenced but cannot be found, the linker reports an error. Once linking is done, the executable has the correct values, and neither the symbols nor the relocations are needed to run the application. That doesn't mean they are not useful: having the symbols allows you to map addresses back to their original names. Having the relocations allows you to move the executable around in memory, which is essential in some implementations of shared libraries, where you want to avoid conflicts between libraries that were linked to load at the same address. Using position-independent code is a common alternative, and the trade-off is typically between the load-time overhead of relocating and the runtime overhead of position-independent code. There is typically memory overhead as well, but that is mostly dictated by the implementation details rather than by the choice itself.
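As a rough illustration of the two parts described above, the following C sketch models a symbol table and a relocation record. The struct layouts and the function name find_symbol are hypothetical and heavily simplified. Real object formats store more fields and use their own layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified records for illustration only. */
struct symbol {
    const char *name;    /* name emitted by the compiler */
    uint32_t    address; /* final address, known once the linker has placed it */
};

struct relocation {
    size_t      offset;  /* where in the code the reference sits */
    const char *name;    /* the symbol being referenced */
};

/* Look a symbol up by name. Returns NULL when the symbol is not defined
   anywhere, which is the case where a linker would report an error. */
const struct symbol *find_symbol(const struct symbol *table, size_t count,
                                 const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}
```

In this sketch, resolving a relocation means calling find_symbol with the relocation's name and then writing the returned address at the relocation's offset.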

Relocation styles

The way relocations are provided depends largely on the file format and the architecture of the platform. The various architectures share some common ground, and I will show some assembly examples.

Absolute references

The simplest way to use a symbol is to point directly at the exact location where the data is stored. Most data references work in this fashion. For example:

extern int variable_not_defined_here;
extern int other_variable;
(...)
int value = variable_not_defined_here + other_variable;

On i386, the resulting machine code takes the following form:

mov eax, [variable_not_defined_here]
add eax, [other_variable]

When we assemble this, the assembler doesn't know the locations yet, so it has to provide placeholder values:

   0:	a1 00 00 00 00       mov    eax, [0]
   5:	03 05 00 00 00 00    add    eax, [0]

If we tried to run that as is, we would be accessing a null pointer. Therefore we have to tell the linker to replace those values later:

contents
   0:	a1 00 00 00 00       mov    eax, [0]
   5:	03 05 00 00 00 00    add    eax, [0]

relocations
   1:   absolute     variable_not_defined_here
   7:   absolute     other_variable

Note that the relocations reference offsets 1 and 7, which is where the address bytes start, not where the instructions start. The linker doesn't want to deal with assembly at all, and especially not with disassembling code to find out where the values should be put. Therefore, we simply give it the exact location in the code where the changed value is supposed to go. The linker then determines what the address should be and substitutes it:

 8048080:	a1 90 90 04 08       	mov    eax, [0x8049090]
 8048085:	03 05 94 90 04 08    	add    eax, [0x8049094]

The linker decided that the code should be at 0x08048080 and the data at 0x08049090. If the application does indeed get loaded there, these instructions will now work as expected.
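The substitution step can be sketched in C as follows. This is a minimal illustration, not how any particular linker is implemented: the function names write_le32 and apply_absolute are hypothetical, and the sketch assumes a 32-bit little-endian target where the placeholder bytes are zero, so it simply overwrites them. Some formats instead add the symbol address to whatever value is already stored at that location.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, as i386 expects. */
void write_le32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xff);
    dst[1] = (uint8_t)((value >> 8) & 0xff);
    dst[2] = (uint8_t)((value >> 16) & 0xff);
    dst[3] = (uint8_t)((value >> 24) & 0xff);
}

/* Apply one absolute relocation (hypothetical helper): overwrite the four
   placeholder bytes at `offset` with the final address of the symbol. */
void apply_absolute(uint8_t *code, size_t code_size, size_t offset,
                    uint32_t symbol_address)
{
    /* The relocation must lie entirely inside the code. */
    assert(offset <= code_size && code_size - offset >= 4);
    write_le32(code + offset, symbol_address);
}
```

Applying this to the 11 bytes of contents above, with offset 1 and address 0x08049090, then offset 7 and address 0x08049094, yields the byte sequences a1 90 90 04 08 and 03 05 94 90 04 08 from the linked output.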