ARM Overview

ARM is a family of instruction set architectures based on the RISC design, developed by a single company, ARM Holdings.

Because ARM is a family of architectures and not a single architecture, it can be found in a wide range of electronic devices, from small embedded systems built around ARM MCUs, through smartphones, tablets and MP3 players, to low-power servers. When you look at a device with an ARM processing unit (the term processing unit is more accurate because ARM is found in microcontrollers as well as microprocessors), there are two things that matter: the architecture and the core.

So far (Q3 2014) eight ARM architecture versions have been released or announced (some of them with extended variants): seven of them are 32-bit, and the latest is 64-bit but user-space compatible with the 32-bit instruction set (making it possible to run 32-bit user processes, yet not 32-bit operating systems without virtualization). Very broadly speaking, each new architecture version adds features to all cores (with exceptions) that had already been tried in some cores of a previous architecture. Not every feature of a previous architecture must be available again in the next one, and not every new version of a technology added in a previous architecture must be compatible with the old one. The simplest reason one could think of is that designing processors is not like writing software: when a program decides whether to process an old file format or a new one, it makes its decision and then executes only one code path, but backward compatibility of (for example) instruction formats may mean more transistors, and more transistors usually result in more heat.

You must know that you cannot buy an ARM processor the way you would buy an Intel or AMD part. ARM Holdings is a company that designs an architecture, writes an Architecture Reference Manual, then designs a core and writes a Technical Reference Manual. Finally, it licenses the designed core to chip-making companies and releases the manuals to the public. From there, a chip-making company designs a processing unit - a microcontroller (MCU), a System on a Chip (SoC), an FPGA, whatever - and it may manufacture the silicon itself or order thousands of chips from a semiconductor manufacturer. All of the processors you will see in tablets or smartphones are SoCs, which act as a sort of motherboard with a processor. They contain logic for driving peripherals (Ethernet, USB, SD/MMC cards, SPI, I2C, audio), and they may contain a GPU for graphics coprocessing or an FPGA for custom logic.

Many uses mean either a few multifunctional (and therefore complex and power-consuming) devices, or many simple devices suited to just one kind of operation. ARM processors, as RISC devices, chose simplicity over complexity, and therefore there are many different cores with different instruction sets in use. To have just one assembler to rule them all, ARM defined the Unified Assembly Language, which can be translated for any ARM core.

In the most recent versions, ARM cores are divided into three main lines:

  • Cortex-M cores, used for really small devices, usually with on-chip memory and simpler operations
  • Cortex-R cores, used for real-time devices
  • Cortex-A cores, used for applications in multifunctional devices like smartphones, TVs or maybe computers.

Apple machines use custom ARM cores, and so do some Nvidia boards.

Overview

ARM processors operate in various modes: User, Fast Interrupt, Interrupt, Supervisor, Abort, Undefined, and System. Each mode has its own stack and environment that it lives in.

While the ARM manuals distinguish between these modes, the distinction lies not in their associated privilege levels but in their purposes. There are essentially two privilege levels in ARM; the manuals refer to modes as either 'privileged' or 'non-privileged'.

Of the modes listed above, 'User' is the only non-privileged mode, while the others are collectively the privileged modes for the architecture.

Why are there so many modes? And what's the difference between the 'Interrupt' mode and the 'Fast Interrupt' mode? Will I have to write a special sort of abstraction which splits my kernel's interrupt handling away from the rest of its operation? No.

These modes are simply the state the processor is left in when a particular type of exception occurs, or a particular type of IRQ.

As copied from the ARM7TDMI datasheet (3.6 Operating Modes):

Mode Description
User (usr) Normal ARM execution state
FIQ (fiq) Designed to support data transfer or channel process
IRQ (irq) Used for general purpose interrupt handling
Supervisor (svc) Protected mode for the operating system
Abort mode (abt) Entered after data or instruction prefetch abort
System (sys) A privileged user mode for the operating system
Undefined (und) Entered when an undefined instruction is executed

Exceptions, IRQs and Software Interrupts on ARMv4 and up

The ARMv4+ architectures define seven exception vectors, contrasted with the x86's grand total of 256. Also note that the kernel does not get to choose which vector the syscall entry goes to: ARM provides a single 'SWI' instruction for software interrupts, and no 'INT N' instruction, so syscalls are fixed to one particular vector.

Now we get to go into the idea of the various privileged modes of the processor: When the SWI instruction is executed, the ARM processor vectors into the kernel provided vector table, and jumps to the hardware fixed vector which is set aside for the SWI instruction. The kernel is expected to install the syscall strapping code on this vector, since there is no other logical choice.

However, instead of reading which privilege level or privilege settings to apply for a particular vector, as on x86 (I'm referring to the GDT selector present in each x86 IDT entry), the idea of modes is employed: for an SWI instruction, the ARM CPU automatically enters 'Supervisor' mode.

Please remember from now on that 'Supervisor' mode is the standard mode the kernel is expected to operate from. 'System' mode is not switched to by any public vector; based on the way the manual refers to it, and the fact that none of the defined exceptions actually switch to 'System' mode, it is somewhat analogous to the PC's SMI (System Management Interrupt, which switches to System Management Mode).

From 'Supervisor' mode, the kernel would process the SWI, then return to userspace.

Abort mode is switched to by the processor on encountering the equivalent of a page fault on x86. The processor switches to a privileged mode, specifically 'Abort', one of those listed above. Most ARM kernels just take note of the fact that the processor was in Abort mode on kernel entry, then switch to 'Supervisor' mode and service the exception.

Interrupt mode and Fast Interrupt mode are almost the same, except that FIQ mode is given its own banked set of registers. The registers are actually switched from the ones you normally refer to, although they keep the same names: r8, for example, is still written as r8 in your assembly, but in FIQ mode it is a physically different register, so data left in r8 in User mode will not be visible in the r8 seen from FIQ mode. This banking lets FIQ handlers execute without having to save as much context.

Essentially, as on x86, you would have a piece of interrupt-controller hardware which can be configured by the kernel. The kernel, knowing that it has only seven vectors, two of which are designated for IRQs (the IRQ and FIQ vectors), assigns each device's interrupt to one of those two. For a device configured to signal on the IRQ line, the processor, on receiving the interrupt, enters privileged execution in the IRQ mode state.

For an interrupt received on the FIQ vector, the processor would enter privileged mode, but in the FIQ mode state.

'Undefined' mode is switched to when the CPU encounters an undefined instruction. Based on the tone of the ARMv4 manual, and the fact that it fairly plainly implies this, the exception was really meant to let the kernel emulate instructions on behalf of the usermode process and then return.

Note regarding the 'Spectre' exploit

ARM have recently added a new CPU instruction to their specification that mitigates cache speculation side-channel exploits. You can read more about this design patch and the recommended actions to take regarding Spectre here: https://developer.arm.com/-/media/Files/pdf/Cache_Speculation_Side-channels.pdf?revision=8b5a5f33-c686-4b00-8186-187dd2910355

Page Description
http://www.embedded.com/design/prototyping-and-development/4006695/How-to-use-ARM-s-data-abort-exception Good article on the data and instruction ABORT exceptions

Registers

ARM processors have a somewhat large number of registers. The ARM7, for example, has 37 registers, 31 of those being 32-bit general registers, and 6 of those being status registers. Some are only usable by certain modes.

Unlike on x86, important operating registers are exposed as ordinary general-purpose registers: r15 is 'pc', the program counter, and r13 is 'sp', the stack pointer.

Along with the general purpose registers there is also the CPSR, the 'Current Program Status Register'. This register keeps track of the current operating mode, whether interrupts are enabled, and so on. The operating system can read and write it using the MRS/MSR instructions (see below). There are several other system registers, which can for instance be used to identify the ARM board.
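
As a quick illustration (a minimal sketch, assuming GCC inline assembly in 32-bit ARM mode; the function names are made up for this example), the current mode lives in CPSR bits [4:0] and can be read like this. The MRS/MSR syntax is covered again in the inline assembly section further down.

#include <stdint.h>

static inline uint32_t cpsr_read(void)
{
    uint32_t r;

    /* volatile: the CPSR can change behind the compiler's back */
    asm volatile ("mrs %0, cpsr" : "=r" (r));
    return r;
}

static inline uint32_t current_mode(void)
{
    return cpsr_read() & 0x1F;   /* 0x10 usr, 0x11 fiq, 0x12 irq, 0x13 svc,
                                    0x17 abt, 0x1B und, 0x1F sys */
}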

Mode           R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13 (SP)  R14 (LR)  R15 (PC)
User32/System  R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13       R14       R15
FIQ32          R0 R1 R2 R3 R4 R5 R6 R7 R8FIQ  R9FIQ  R10FIQ R11FIQ R12FIQ R13FIQ    R14FIQ    R15
Supervisor32   R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13SVC    R14SVC    R15
Abort32        R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13ABT    R14ABT    R15
IRQ32          R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13IRQ    R14IRQ    R15
Undefined32    R0 R1 R2 R3 R4 R5 R6 R7 R8     R9     R10    R11    R12    R13UNDEF  R14UNDEF  R15

Each mode shares some registers with other modes. Registers in ARM instructions are normally specified using 4 bits, which can represent 16 registers, and as you can see there are exactly 16 registers you can reference from each mode. The mnemonic names are shown across the top header as R0 through R15; aliases for R13, R14 and R15 are given in parentheses as SP, LR and PC. Using the mnemonic R15 is the same as using PC, but keep in mind that this is only relevant to the assembler/compiler: the processor itself only sees the 4-bit value encoded in the instruction. The value represented by these 4 bits can be obtained by simply removing the R prefix, so R5 is encoded as 0101, R6 as 0110, and R7 as 0111. The suffixes IRQ, UNDEF, ABT, SVC and FIQ are for display only; they are relevant only to this table and are not accepted by any assembler/compiler. For example, R13FIQ, R13IRQ, R13UNDEF, R13SVC and R13ABT simply show that internally the processor maps R13 (SP) to a different location in its register file in each of those modes.
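
As a worked example of the 4-bit register fields, take the instruction "mov r0, r3" from the GCC listing further below, which assembles to 0xE1A00003 (the macro names here are made up for illustration):

#define MOV_R0_R3  0xE1A00003u               /* "mov r0, r3" from the listing below        */
#define RD(insn)   (((insn) >> 12) & 0xFu)   /* destination register field, bits [15:12]   */
#define RM(insn)   ((insn) & 0xFu)           /* source register field, bits [3:0]          */
/* RD(MOV_R0_R3) == 0 (r0) and RM(MOV_R0_R3) == 3 (r3) */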

Calling Convention Cheat Sheets

Here is a quick overview of common calling conventions. Note that the calling conventions are usually more complex than represented here (for instance, how is a large struct returned? How about a struct that fits in two registers? How about va_list's?). Look up the specifications if you want to be certain. It may be useful to write a test function and use gcc -S to see how the compiler generates code, which may give a hint of how the calling convention specification should be interpreted.

Return value:           R0, R1
Parameter registers:    R0, R1, R2, R3
Additional parameters:  stack (R13)
Stack alignment:        8 bytes (see note 1)
Scratch registers:      R0, R1, R2, R3, R12
Preserved registers:    R4, R5, R6, R7, R8, R9, R10, R11, R13
Return address:         R14

Note 1: Stack is 8 byte aligned at all times outside of prologue/epilogue of non-leaf function. Leaf functions can have 4 byte aligned stacks. 8 byte alignment is required for ldrd/strd to function on the stack, which mostly only comes into play when using varargs with 64-bit integers. Care must be taken in interrupts and exceptions to align the stack to 8 byte before calling C code.

For details look at the EABI specs.
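
If you want to see the convention in action, here is a minimal test function (the file and function names are only illustrative), assuming an arm-none-eabi GCC cross-compiler: compile it with "arm-none-eabi-gcc -O2 -S aapcs_test.c" and inspect the generated assembly. Per the summary above, a through d should arrive in R0-R3, e should arrive on the stack, and the result should leave in R0.

/* aapcs_test.c - compile with -S and read the generated assembly */
int aapcs_test(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}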

Instructions

Overview
  • Instruction length is fixed: 32 bits in ARM (A32) mode, 16 bits in Thumb mode.
  • Strays from pure RISC design by including some multi-cycle instructions.
  • Uses condition fields encoded directly in instructions to reduce the performance loss from pipeline flushes.
  • May produce larger code than CISC designs because of the RISC design, but Thumb mode uses 16-bit instructions (a more limited instruction set) to perform 32-bit operations, producing a smaller code footprint at reduced performance.
  • Most instructions execute in one or a few clock cycles, making development of real-time software easier.
  • ARM and Thumb modes can be switched on the fly (worth noting!).

Loading a large immediate value into a register can be interestingly different from x86/x64. An immediate value is a value literally encoded into the instruction. For example, x86/x64 compatible processors support loading a full 32-bit immediate (also called a constant) into an arbitrary register; the ARM A32 and Thumb instruction sets do not. Instead, one must use a series of instructions to get the same effect, or store the value in memory outside the instruction stream. In essence one may consider storage of the value outside the instruction stream much the same as storage inside it, but there are differences, and this is something to keep in mind about ARM. Any compiler or assembler can take care of this for you; if you are compiling C/C++ code these details are transparent, but it is good information to know and understand, especially for system software development.

Here is an example of machine code produced by GCC to load a register with a 32-bit value. As you can see, the immediate value is technically outside the instruction stream (on x86/x64 the complete value would have been encoded into the instruction). Also note how each instruction has a constant length. The instruction doing the work is LDR, which uses the PC register and an immediate offset to place the value 0x12345678 into register r3.

00000000 <main>:
   0:	e1a0c00d 	mov	ip, sp
   4:	e92dd800 	push	{fp, ip, lr, pc}
   8:	e24cb004 	sub	fp, ip, #4
   c:	e59f300c 	ldr	r3, [pc, #12]	; 20 <main+0x20>
  10:	e1a00003 	mov	r0, r3
  14:	e24bd00c 	sub	sp, fp, #12
  18:	e89d6800 	ldm	sp, {fp, sp, lr}
  1c:	e12fff1e 	bx	lr
  20:	12345678 	.word	0x12345678

An example of loading the register using only immediates encoded within the instruction stream itself:

              
mov	r8, #0x78
add	r8, r8, #0x56 << 8
add	r8, r8, #0x34 << 16
add	r8, r8, #0x12 << 24


Almost all instructions support conditional execution, which is encoded directly into the instruction. On an x86/x64 architecture, when a series of instructions needs to be conditionally executed you will usually find a branch instruction, which creates a completely separate code path for the processor to follow if the branch is taken. ARM supports branching too, but in addition almost every individual instruction can be executed conditionally: instead of potentially flushing and refilling the pipeline because of a branch, the processor keeps fetching instructions that carry a condition field and simply executes or skips them. There are 16 condition codes, one of which is reserved, leaving 15 usable conditions. So instead of branching just to execute a few instructions, those few instructions can remain in the execution path and will or will not be executed according to their condition codes.

Here are the condition codes:

Mnemonic Binary Description
EQ 0000 Z flag set (equal)
NE 0001 Z flag clear (not equal)
HS 0010 C flag set (unsigned higher or same)
LO 0011 C flag clear (unsigned lower)
MI 0100 N flag set (negative)
PL 0101 N flag clear (non-negative)
VS 0110 V flag set (overflow)
VC 0111 V flag clear (no overflow)
HI 1000 C flag set and Z flag clear (unsigned higher)
LS 1001 C flag clear or Z flag set (unsigned lower or same)
GE 1010 N and V both set, or N and V both clear (signed greater than or equal)
LT 1011 N set and V clear, or N clear and V set (signed less than)
GT 1100 Z clear, and either N and V both set or N and V both clear (signed greater than)
LE 1101 Z set, or N set and V clear, or N clear and V set (signed less than or equal)
AL 1110 always (no condition really)
NV 1111 reserved condition
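
As a small illustration of conditional execution (a sketch only, assuming GCC inline assembly in ARM/A32 mode; in Thumb-2 the same idea needs an IT block, and the function name is made up), here is a signed max() computed without a branch, using the GE and LT codes from the table above:

static inline int max_no_branch(int a, int b)
{
    int r;

    asm ("cmp   %[a], %[b]\n\t"
         "movge %[r], %[a]\n\t"    /* executes only if a >= b (signed) */
         "movlt %[r], %[b]"        /* executes only if a <  b          */
         : [r] "=&r" (r)
         : [a] "r" (a), [b] "r" (b)
         : "cc");
    return r;
}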

Memory

Many ARM processors come equipped with an MMU, a full memory protection scheme for their 4 GB address space, and a TLB.

Paging

The paging scheme used by ARM processors uses a two-level page table. The first-level page table is 16 KB in size and each entry covers a 1 MB region (a section), which can either be mapped directly or point to a second-level page table. Second-level page tables are 1 KB in size and each entry covers a 4 KB region (a page). The TLB also supports 16 MB sections and 64 KB pages; for those, the right bits have to be set in the first- or second-level page table entry and the entry repeated 16 times. A hardware page table walk through any of the repeated entries then adds a TLB entry of the larger size. On ARMv4 and ARMv5, pages may also be divided into what the manual calls 'subpages': 1 KB subpages for a 4 KB page and 16 KB subpages for a 64 KB page (in both cases a quarter of the page size). This is deprecated on ARMv6, where an alternative page table format can be configured that has no subpages but offers extra access permission bits and provisions for physical address extensions (for more than 4 GB of physical memory) instead.
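
The address-splitting arithmetic this implies is easy to write down. The following is only a sketch of the index calculations for the format described above (the names are illustrative, and descriptor attribute bits are deliberately left out):

#include <stdint.h>

/* 16 KB first-level table: 4096 word-sized entries, each covering 1 MB,
   together covering 4096 * 1 MB = 4 GB of virtual address space. */
#define L1_ENTRIES 4096u

/* 1 KB second-level table: 256 word-sized entries, each covering 4 KB,
   together covering 256 * 4 KB = 1 MB (one first-level entry's worth). */
#define L2_ENTRIES 256u

static inline uint32_t l1_index(uint32_t va)    { return va >> 20; }           /* which 1 MB section */
static inline uint32_t l2_index(uint32_t va)    { return (va >> 12) & 0xFFu; } /* which 4 KB page    */
static inline uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }        /* offset within page */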

The ARM processor supports two independent page tables. The first page table can be shortened from the full 16 KB down to as little as 128 bytes, in powers of two, and is used for the lower part of the virtual address space. The second page table is always 16 KB in size and is used for addresses beyond the end of the first page table. If the first page table already covers all of the virtual address space (that is, it is full size, 16 KB), the second page table is never used and can be left unset.

The Architecture Reference Manuals propose using the shortened first page table for each process's individual address space while keeping the second page table static for the kernel. For small processes (using 2 GB or less) this results in smaller page tables and therefore less overhead per process.

Note: The ARMv6 Architecture Reference Manuals state clearly, multiple times, that subpages are deprecated and should not be used. Subpages have been removed in the ARMv7 Architecture Reference Manuals and are no longer available. So for maximum portability, you may decide to stick to 4KB and 64KB page sizes.

Or you can have two sets of abstractions for your ARM Memory Manager port: One MM for ARMv4 and v5, and one for ARMv6 and up. It's all up to you.

Memory Detection

Memory detection is very different if you are coming from an x86/x64 background. ARM cores are used in many embedded applications, and the board a core sits on does not need to be compatible with other boards, because almost any production board with an ARM core was likely custom-designed just for its purpose. It is quite possible to use a generic board, but for many embedded applications there may not even be an operating system.

Memory detection mechanisms may therefore be non-existent. Instead, your operating system may have the memory size encoded into it at compile time for a specific board known to have a certain amount of memory, or it may load (or compile in) a board-specific driver which the kernel can query for the amount of installed memory. There are many approaches and these are just two ideas; the important part is to understand that, unlike on x86/x64 compatible systems, you will likely find no standard mechanism to poll the amount of memory.

It may, however, be possible to probe memory and recover from faults using processor exceptions. Even so, this may not tell you whether a region is flash, memory-mapped I/O, RAM, or ROM, depending on how the board was designed: some ROM external to the core may allow writes to silently fail, and some regions may need a special unlock sequence before they can be written, which can turn memory auto-detection code into a pile of corner cases.

But if you think you can do it, then by all means try, because you might figure out a way. This section is just meant to give you a general feeling for what you may be getting into, and is not intended to keep any good ideas of yours from becoming something valuable.

Exceptions

Note: For some reason, the ARM people use the terms 'interrupt' and 'exception' as if they were the same.

For exceptions, ARM uses a table similar to the IVT of the real mode x86. The table consists of a number of 32-bit entries. Each entry is an instruction (ARM instructions are 4 bytes in length) that jumps to the appropriate handler.

Take note of that, and understand the design impact it imposes: On x86, the hardware vector table holds the addresses of handler routines. On ARM, the hardware vector table holds actual instructions. These instructions must fit into 4 bytes. This is actually not a big deal since all ARM instructions (assuming ARM mode and not Thumb, or Jazelle) are actually 4 bytes anyway. The general idea is to emulate the behaviour of something like the x86 and simply place a jump instruction into the actual vector table, so that upon indexing into the table, the ARM processor is made to act as if it jumped to an address contained in the jump instruction. From there, consistency is obtained, and portability is eased.

Also used are various devices to vector interrupts. Two such devices are the Generic Interrupt Controller and the Vectored Interrupt Controller.

Coding Gotchas

Heap Pointers Need To Be Aligned

I cannot say at this time how many processors support unaligned memory access natively, but where it is supported it is generally more expensive than aligned access. Many structures with fields larger than one byte are allocated on the heap, so you can see how big a performance hit unaligned allocations could cause. Not to mention the bug that would be introduced when running on a processor which does not handle unaligned access gracefully; at best it raises an exception of some sort so the code crashes and exposes the problem.
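
A common way to avoid the problem is to round every allocation up to an alignment boundary inside the heap allocator. A minimal sketch (the names and the choice of 8 bytes are just an example; 8 also satisfies the stack rule mentioned in the calling convention section above):

#include <stdint.h>

#define HEAP_ALIGN 8u   /* example choice; keeps 64-bit fields and ldrd/strd happy */

/* Round value up to the next multiple of align (align must be a power of two). */
static inline uintptr_t align_up(uintptr_t value, uintptr_t align)
{
    return (value + (align - 1)) & ~(align - 1);
}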

Missing Division Functions (__aeabi_uidivmod, __aeabi_idiv)/No Division Support

This is caused by using GCC and not linking with libgcc. You NEED TO link with libgcc when using GCC. For information about why you should link and information about libgcc, read Libgcc and GCC_Cross-Compiler.

You must also make sure you use the libgcc that has been compiled for your target machine/architecture/platform. See Libgcc for more information. You can produce the correct library for GCC by reading GCC_Cross-Compiler.
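
For reference, a freestanding link line that pulls in libgcc might look something like the following (assuming an arm-none-eabi cross toolchain; the file names are illustrative, and -lgcc must come after your object files):

    arm-none-eabi-gcc -ffreestanding -nostdlib kernel.o -o kernel.elf -lgcc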

This section is just if you want a more in depth explanation and discussion.

This issue is actually complicated, because different CPUs may not support hardware division at all, or may support it only through the floating point unit, only in Thumb mode, only in ARM mode, or some combination of these. Also, in certain situations you may find different methods faster or more compact, depending on whether you care about code size or performance. libgcc handles all of these situations and provides the needed symbols; the following sources can give you a more in-depth understanding:

The following links talk about different methods in handling this problem: 
    http://forum.osdev.org/viewtopic.php?f=1&t=23857&p=212244
    http://stackoverflow.com/questions/8348030/how-does-one-do-integer-signed-or-unsigned-division-on-arm
Also, some extra information that may be useful:
    http://www.linkedin.com/groups/ARM-cores-hardware-division-85447.S.242517259
A discussion about this section, and also at the end an example of libgcc's version of the divide function:
    http://forum.osdev.org/viewtopic.php?f=8&t=27767
The source for libgcc and source of the division emulation function:
    https://github.com/mirrors/gcc/blob/master/libgcc/udivmodsi4.c
    https://github.com/mirrors/gcc/blob/master/libgcc/

Unaligned Memory Access And Byte Order

Various newer ARM cores support unaligned memory access, but a specific bit or two have to be set or cleared in the system control register first. I have not checked whether QEMU's emulation of ARMv7 supports this, so keep that in mind.

Also, it is recommended to have a look at this article on Endianness which will help clarify and potentially gives a second source of information.

Let us imagine we have a board with 8 bytes of RAM, as depicted below:

Value A B C D E F G H
Address 0 1 2 3 4 5 6 7

A little endian machine reads the least significant byte (LSB) first. So, if we made a 32-bit (word sized) read at address 0 on the memory above, that would yield "DCBA". On a big endian machine, which reads the most significant byte (MSB) first, you would get "ABCD". It all depends on how you visualize memory addresses - whether they flow to the right (big endian), or towards the left (little endian).

Some CPU platforms even support either mode. For example, the ARM7TDMI-S has a pin BIGEND to switch from little to big endian. (Although I could not find the QEMU option to enable this feature.)

ARM cores have two other differences from x86/x64:

  • They do not handle unaligned memory accesses in any defined manner. It is specified on page 4-32 of the ARM7TDMI-S Data Sheet that if bit 0 of the address on a half-word access is high, then the results are undefined.
  • Accessing half-words in little endian mode will swap the half-words in the word.

An example for the latter:

/* uint32 and uint16 are assumed to be 32- and 16-bit unsigned typedefs
   (e.g. from <stdint.h>). Address 0x0 stands for the start of the example
   memory pictured above. */
uint32        *a;
uint16        *b;

a = (uint32*)0x0;
b = (uint16*)0x0;

a[0] = 0x12345678;                          /* one 32-bit (word) store */
printf("%#04x --> %#04x\n", b[0], b[1]);    /* two 16-bit (half) loads */

This gives 0x5678 --> 0x1234, while in big endian mode you would get 0x1234 --> 0x5678.

Memory contents (addresses 0-7): ED CB A9 87 78 9A BC DE

Word size (32-bit) access:
Offset  Register (Little Endian)  Register (Big Endian)
00      0x87A9CBED                0xEDCBA987
01      UNALIGNED - UNDEFINED
10      UNALIGNED - UNDEFINED
11      UNALIGNED - UNDEFINED

Half-word size (16-bit) access:
Offset  Register (Little Endian)  Register (Big Endian)
00      0xA987                    0xEDCB
01      UNALIGNED - UNDEFINED
10      0xEDCB                    0xA987
11      UNALIGNED - UNDEFINED

Byte size (8-bit) access:
Offset  Register (Little Endian)  Register (Big Endian)
00      0xED                      0xED
01      0xCB                      0xCB
10      0xA9                      0xA9
11      0x87                      0x87

From reading the data sheet it appears that if operating in big endian mode, half-word accesses are reversed to be more natural, as you would expect on the x86/x64 architecture. But since I cannot actually test it at the moment, I am hoping I got it right.

Word access with an offset of 10b (0x2) may be defined, but I am not sure because the data sheet does not really say. It may employ some of the mechanisms used for loading half-words. (Someone please correct this if it is wrong.)

The reason memory access has to be aligned is because unaligned access requires additional access cycles, due to the way memory modules work (see DDR datasheet below), and would increase the complexity of load / store operations (and the processor core).

You could simulate an unaligned memory access on the ARM7TDMI-S, but you would have to make separate loads from two memory locations. A compiler could probably emit code to do this automatically, but the checks whether an access is unaligned or not would slow down all memory accesses. Some code is provided in the ARM7TDMI-S Data Sheet on page 4-35 for such "checked" memory access, when you do not know if the address will be aligned or non-aligned.

Branch Instruction In Vector Table

Keep in mind when working with ARMv4 and compatible cores that each entry in the vector table is 32 bits, and that the value in each slot is not the address the CPU should jump to.

Instead it is an instruction. This can be a little confusing when you first start out, because it seems to make sense for it to be an address, but it really is an actual 32-bit instruction. It can be any instruction, but common usage is an LDR (load) instruction which loads the PC from a literal within 4 KB of the instruction, or a B (branch), whose 24-bit field encodes a signed word offset (effectively a 26-bit byte offset).

Also, if you use a B instruction, remember that it is relative to the instruction's own address plus 8. So for each index in the table, subtract index*4 from the target address, then subtract an extra 8, so that index*4 + 8 is subtracted in total. The 8 comes from prefetching (the pipeline having already been filled).

Failure to realize this can lead to service routines that work long enough for you to think they are working, then bug out days down the road and leave you looking through a lot of added code.

#define ARM4_XRQ_RESET   0x00
#define ARM4_XRQ_UNDEF   0x01
#define ARM4_XRQ_SWINT   0x02
#define ARM4_XRQ_ABRTP   0x03
#define ARM4_XRQ_ABRTD   0x04
#define ARM4_XRQ_RESV1   0x05
#define ARM4_XRQ_IRQ     0x06
#define ARM4_XRQ_FIQ     0x07

/*
    Will install a branch instruction for the 
    interrupt vector for the ARM platform.
*/
void arm4_xrqinstall(uint32 ndx, void *addr)
{
    uint32      *v;

    /* The vector table lives at address 0x0; each slot holds one instruction. */
    v = (uint32*)0x0;
    /* 0xEA000000 is an unconditional B; its low 24 bits hold the signed word
       offset relative to the vector's address + 8 (see above). */
    v[ndx] = 0xEA000000 | ((((uintptr)addr - 8 - (4 * ndx)) >> 2) & 0x00FFFFFF);
}

It can be a wise idea to populate all entries of the table. If the table is filled with zeros, each vector holds the instruction "andeq r0, r0, r0", which does nothing; if the CPU jumps to an unpopulated vector it will effectively execute a NOP and fall through to the next vector, which can cause a confusing bug in your code. At least point any unused vectors to a dummy function that will notify you that an unhandled exception has occurred!
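
A sketch of that advice, reusing the arm4_xrqinstall function above (arm4_xrq_unhandled is a hypothetical catch-all routine that just reports the unexpected exception and halts):

void arm4_xrq_unhandled(void);   /* hypothetical: report the exception and halt */

void arm4_xrqinstall_defaults(void)
{
    uint32      ndx;

    /* Point every vector at the catch-all first; overwrite the vectors you
       actually implement afterwards. */
    for (ndx = ARM4_XRQ_RESET; ndx <= ARM4_XRQ_FIQ; ++ndx) {
        arm4_xrqinstall(ndx, (void *)arm4_xrq_unhandled);
    }
}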

Specifying CPSR using AS (Binutils/GCC) Syntax

You have to use cpsr not %%cpsr or any other form.

uint32 arm4_cpsrget()
{
    uint32      r;

    /* volatile: the CPSR can change behind the compiler's back */
    asm volatile ("mrs %[ps], cpsr" : [ps]"=r" (r));
    return r;
}

void arm4_cpsrset(uint32 r)
{
    asm volatile ("msr cpsr, %[ps]" : : [ps]"r" (r));
}

/* Bit 7 and 6 have to be logic-low/unset/zero/cleared to enable the interrupt type. */
void arm4_xrqenable_fiq()
{
    arm4_cpsrset(arm4_cpsrget() & ~(1 << 6));
}

void arm4_xrqenable_irq()
{
    arm4_cpsrset(arm4_cpsrget() & ~(1 << 7));
}

GCC (Chars Not Signed)

In some cases, GCC targeting the ARM architecture may not handle char values as you might expect.

Consider the case:

void foo(char *a)
{
    *a = -1;
}

void bar()
{
    char      z;
    short     x;

    x = 0;
    foo(&z);

    x = x + z;
}

The expected value for x would be -1 (0xFFFF in a two's complement 16-bit representation), but instead you may be surprised, and spend quite a few hours, when you get 255 (0x00FF) as the result. Here is the explanation.

  4c:	ebfffffe 	bl	0 <foo>
  50:	e55b300f 	ldrb	r3, [fp, #-15]
  54:	e1a02003 	mov	r2, r3
  58:	e15b30be 	ldrh	r3, [fp, #-14]
  5c:	e0823003 	add	r3, r2, r3

The x86/x64 target causes GCC to treat plain char as signed char (the behavior most people expect), but when targeting ARM you may be surprised to find that char is treated as unsigned char. You must write signed char explicitly, and then the compiler will generate code to perform the addition correctly.

The link below explains this in more detail. In case you decide to skip the reading: code which assumes plain char is signed can be considered incorrect and buggy, as one poster stated, so it is a good idea to watch for this for portability's sake.
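
If you need plain-char code to behave as signed, the portable fix is to say so explicitly (this snippet is only illustrative); alternatively GCC accepts -fsigned-char, and int8_t from <stdint.h> works as well:

void foo_fixed(signed char *a)
{
    *a = -1;    /* now sign-extends to -1 when added to a wider type */
}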

Mailing List Post

Emulators

Targeting Multiple ARM Based Devices

For information about targeting multiple ARM based devices, see here.

Tutorials And Starting Points

Page Brief Description
BeagleBoard Tutorial on bare-metal [OS] development on the Texas Instruments BeagleBoard. Written specifically for the BeagleBoard-xM Rev C.
Integrator Barebones This is a tutorial. Read Getting Started and Beginner Mistakes. Should have reasonable knowledge of ARM assembly, but may be able to get by on less.
Integrator-CP QEMU PL110 16-Bit Color Example Quick, dirty, and to the point on getting the QEMU PL110 16-bit color frame buffer working. Will likely not work on real hardware, but will be very close. This is just to get someone started. Formed as a tutorial, but a very short one.
PL050 PS/2 Controller Information about interfacing a PS/2 device specifically the mouse, and starts with using the PL050. See links at bottom if PL050 is not of interest.
IRQ, Timer, And PIC This demonstrates using exceptions (specifically the IRQ exception), Timer, And PIC. It also provides a decent base for hacking with the ARM and/or QEMU.
ELK Pages (Thin ARM) The experimental learning kernel pages aimed to take someone gradually through the process of building a functional kernel using possibly (in later part of series) experimental designs and implementations that differ from the standard and conventional design in certain areas.
ARM_RaspberryPi Description and details on the commonly used Raspberry Pi boards
QEMU realview-pb-a board This page gives you some information about programming for the realview-pb-a board under QEMU. Actual hardware could be different. It also contains a link to the datasheet (which might be hard to find).

Highly Useful External Resources