
IRQ, Timer, PIC, Tasks, And MM On The Integrator-CP

Well, hopefully you are coming from some of the earlier examples. But if not, here are the links.

Page                           Description
IRQ, Timer, And PIC            This demonstration just uses the IRQ, timer, and PIC.
IRQ, Timer, PIC, And Tasks     This shows how to switch between tasks using the timer, and builds from the previous page.

Author

I, Pancakes, wrote this to help jump-start you into developing for the ARM using QEMU or even a real piece of hardware. I have written software for both emulators and real hardware, but this has only been tested on QEMU so far. Please make any needed changes if you find problems. Also, let me know at [email protected] if you find this useful, have comments, or suggestions.

Note

If you are working with VMSAv7, such as on the Cortex-A9, then there is no bit to disable sub-pages (there is no backwards compatibility). In the code below I disable sub-pages, so you should omit setting bit 23 of the control register when enabling paging on VMSAv7 systems.
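
For example, a minimal sketch of how you might guard that bit, using the arm4_tlbgetctrl and arm4_tlbsetctrl helpers defined further down this page (the VMSAV7 macro here is a hypothetical compile-time switch, not something defined elsewhere on this page):

	uint32 ctrl;

	ctrl = arm4_tlbgetctrl() | 0x1;		/* bit 0 enables the MMU */
#ifndef VMSAV7
	ctrl |= 0x800000;			/* bit 23 disables sub-pages (pre-VMSAv7 only) */
#endif
	arm4_tlbsetctrl(ctrl);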

About

I am going to extend the latter demonstration of task switching. But first we are going to need to add in some basic memory management, and then progress to virtual memory afterwards.

Adding A Heap Implementation

I am going to use a fairly simple heap implementation, Bitmap Heap Implementation Enhanced. I would recommend that you read about it because we will be using all of its features to implement a memory management system, and it will be needed to implement virtual memory. Also, even if you use your own implementation you might find it helpful to have a general idea of what the code does. It might even be self-explanatory for some readers.

To compile and link, I would recommend that you place the structures and prototypes in a header called kheap_bm.h and place the implementation into kheap_bm.c. Then, when compiling the kernel code, just include the header and link with the kheap_bm.o that you should produce, and everything should work great.
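
As a rough sketch, these are the prototypes kheap_bm.h would need to expose for the code on this page, inferred from how the functions are called below (double-check them against the Bitmap Heap Implementation Enhanced page; the structure definitions also come from there):

	/* KHEAPBM and KHEAPBLOCKBM are defined on the heap implementation page */
	void	k_heapBMInit(KHEAPBM *heap);
	int	k_heapBMAddBlock(KHEAPBM *heap, uintptr addr, uint32 size, uint32 bsize);
	int	k_heapBMAddBlockEx(KHEAPBM *heap, uintptr addr, uint32 size, uint32 bsize,
				KHEAPBLOCKBM *b, uint8 *bm, uint8 isBMInside);
	void*	k_heapBMAlloc(KHEAPBM *heap, uint32 size);
	void*	k_heapBMAllocBound(KHEAPBM *heap, uint32 size, uint32 bound);
	int	k_heapBMFree(KHEAPBM *heap, void *ptr);
	void	k_heapBMSet(KHEAPBM *heap, uintptr addr, uint32 size, uint8 rval);
	uint32	k_heapBMGetBMSize(uintptr size, uint32 bsize);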

Adding The Basic Memory Initialization

We are going to be basing everything from one of the previous demonstrations shown in the links at the top of this page. To make this page easier to read I am only going to duplicate the code needed to demonstrate. The first modification we are going to make is to the KSTATE structure.

typedef struct _KSTATE {
	KTHREAD			threads[0x10];
	uint8			threadndx;	
	uint8			iswitch;
	KHEAPBM			hphy;		/* physical page heap */
	KHEAPBM			hchk;		/* data chunk heap */
	KHEAPBM			husr;		/* user physical page heap */
} KSTATE;

I have added three new fields: hphy, hchk, and husr. The hphy heap will serve as our physical memory manager working with 4KB pages, while hchk will work with smaller allocations; husr will hold physical pages for user space (it is used a little further down). If you are using the older demonstration you may be missing the KSTATE structure completely.

Let's make some defines:

/* the initial kernel stack and the exception stack */
#define KSTACKSTART 0x2000		/* descending */
#define KSTACKEXC   0x3000		/* descending */
/* somewhere to place the kernel state structure */
#define KSTATEADDR	0x3000
/* 
	RAM is assumed to start at 0x0, but we have to leave room for a little setup code, and
	depending on how much physical memory (KRAMSIZE) we are using we might have to adjust
	KRAMADDR to give a bit more. For example if KRAMSIZE is 4GB then KRAMADDR needs to be
	larger than 1MB, preferably 2MB at least.
*/
#define KRAMADDR	0x200000
#define KRAMSIZE	(1024 * 1024 * 8)
/* 
	kernel virtual memory size 
	
	KMEMINDEX can be 1 through 7
	1 = 2GB
	2 = 1GB
	3 = 512MB
	4 = 256MB
	5 = 128MB
	6 = 64MB
	7 = 32MB
	
	This is the maximum amount of virtual memory per kernel space.
*/
#define KMEMINDEX   3
#define KMEMSIZE	(0x1000000 << (8 - KMEMINDEX))
/* the size of a physical memory page */
#define KPHYPAGESIZE		4096
/* the block size of the chunk heap (kmalloc, kfree) */
#define KCHKHEAPBSIZE		16
/* minimum block size for chunk heap */
#define KCHKMINBLOCKSZ		(1024 * 1024)

You may already have some defines. Just replace them and add any new ones. One important one is KSTATEADDR, which defines where our kernel state resides.
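
As a quick sanity check of the KMEMSIZE formula, here is the arithmetic written out (nothing new, just the shift evaluated for a couple of values):

	/* KMEMSIZE = 0x1000000 << (8 - KMEMINDEX) */
	/* KMEMINDEX = 3  ->  0x1000000 << 5 = 0x20000000 = 512MB */
	/* KMEMINDEX = 1  ->  0x1000000 << 7 = 0x80000000 = 2GB */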

Now, let us add some entries into our main function of our play kernel.

void start() {
	.........
	KSTATE      *ks;
	uint8       *bm;

	ks = (KSTATE*)KSTATEADDR;

	/* create physical page heap */
	k_heapBMInit(&ks->hphy);
	k_heapBMInit(&ks->hchk);
	k_heapBMInit(&ks->husr);
	/* get a bit of memory to start with for small chunk */
	k_heapBMAddBlock(&ks->hchk, 4 * 7, KRAMADDR - (4 * 7), KCHKHEAPBSIZE);
	
	/* state structure */
	k_heapBMSet(&ks->hchk, KSTATEADDR, sizeof(KSTATE), 5);
	/* stacks (can free KSTACKSTART later) */
	k_heapBMSet(&ks->hchk, KSTACKSTART - 0x1000, 0x1000, 6);
	k_heapBMSet(&ks->hchk, KSTACKEXC - 0x1000, 0x1000, 7);
	
	/*
		I like to identity map my page tables in kernel space because it makes them
		easy to walk. To do so I have to try to reserve physical memory in the lower
		half. To do this I have two heaps of physical 4K pages. 
		
		The user heap is any memory past the KMEMSIZE, and anything below is the kernel
		heap. Pages are first allocated out of the user heap for any user space process,
		and if that fails then they are allocated from the kernel heap.
	*/
	if (KRAMSIZE > KMEMSIZE) {
		/* let hphy represent lower memory for kernel so it can be identity mapped */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KMEMSIZE - KRAMADDR, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->hphy, KRAMADDR, KMEMSIZE - KRAMADDR, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);
		/* let husr represent upper memory which does not need to be identity mapped */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KRAMSIZE - KMEMSIZE, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->husr, KMEMSIZE, KRAMSIZE - KMEMSIZE, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);		
	} else {
		/* add block but place header in chunk heap to keep alignment */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KRAMSIZE - KRAMADDR, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->hphy, KRAMADDR, KRAMSIZE - KRAMADDR, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);
	}
		
	/* 
		remove kernel image region 
		
		This ensures it does not reside in any of them. Because KRAMADDR can change, we can not
		be sure which heap the image resides in, or whether it spans more than one, so to be
		safe we remove it from all of them.
	*/
	k_heapBMSet(&ks->hphy, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);
	k_heapBMSet(&ks->hchk, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);
	k_heapBMSet(&ks->husr, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);
    ..............

I prefer to have the exception handlers set up before this code. I do not show that above, but you can simply move their initialization code above this code block. The first objective is to create a small allocation heap, hchk, to handle small allocations, then place the remaining RAM as 4KB pages into our second heap known as hphy. The phy part stands for physical, and the chk stands for chunk. The hchk heap uses 16-byte blocks and hphy uses 4K blocks.

You should notice that first I call:

	k_heapBMAddBlock(&ks->hchk, 4 * 7, KRAMADDR - (4 * 7), KCHKHEAPBSIZE);

This adds a region of memory starting after the exception table (4 * 7 bytes) that extends to the beginning of what I define as remaining RAM with KRAMADDR. RAM actually starts at 0x0, so do not let the name mislead you. It is just that I had no better name for it.

Next:

	/* state structure */
	k_heapBMSet(&ks->hchk, KSTATEADDR, sizeof(KSTATE), 5);
	/* stacks (can free KSTACKSTART later) */
	k_heapBMSet(&ks->hchk, KSTACKSTART - 0x1000, 0x1000, 6);
	k_heapBMSet(&ks->hchk, KSTACKEXC - 0x1000, 0x1000, 7);

Here we set some regions as used in the bitmap. They contain the kernel's initial stack and the exception stack. My exceptions are non-re-entrant so I can get away with one stack for now. Also notice that our stacks grow downward, which is why I subtract 0x1000; it just makes the code loading the sp register simpler. I consider each stack to be 4KB, but you could change this to a lower amount if you desired. If you have read the Bitmap Heap Implementation Enhanced page you will know that I am distinguishing each specific region with a different identifier in the bitmap by using 5, 6, and 7. I only remove them from hchk because I think it is fairly safe to assume KRAMADDR will not be lower than KSTACKEXC, but you could add some type of check to be sure someone compiling your kernel does not happen to make KRAMADDR too low.
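
For illustration, this is roughly how those defines relate to the sp register in your startup or exception setup code (a sketch; where exactly you load sp depends on your earlier code):

	/* the exception stack descends from KSTACKEXC, so the 4KB region we
	   marked used above runs from KSTACKEXC - 0x1000 up to KSTACKEXC */
	asm volatile("mov sp, %[ps]" : : [ps]"r" (KSTACKEXC));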

Finally:

	if (KRAMSIZE > KMEMSIZE) {
		/* let hphy represent lower memory for kernel so it can be identity mapped */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KMEMSIZE - KRAMADDR, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->hphy, KRAMADDR, KMEMSIZE - KRAMADDR, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);
		/* let husr represent upper memory which does not need to be identity mapped */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KRAMSIZE - KMEMSIZE, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->husr, KMEMSIZE, KRAMSIZE - KMEMSIZE, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);		
	} else {
		/* add block but place header in chunk heap to keep alignment */
		bm = (uint8*)k_heapBMAlloc(&ks->hchk, k_heapBMGetBMSize(KRAMSIZE - KRAMADDR, KPHYPAGESIZE));
		k_heapBMAddBlockEx(&ks->hphy, KRAMADDR, KRAMSIZE - KRAMADDR, KPHYPAGESIZE, (KHEAPBLOCKBM*)k_heapBMAlloc(&ks->hchk, sizeof(KHEAPBLOCKBM)), bm, 0);
	}
	
	/* remove kernel image region */
	k_heapBMSet(&ks->hphy, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);
	k_heapBMSet(&ks->hchk, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);
	k_heapBMSet(&ks->husr, (uintptr)&_BOI, (uintptr)&_EOI - (uintptr)&_BOI, 8);

One very important difference is that I allocate the block header and bitmap outside of that memory region using k_heapBMAddBlockEx. The Ex function adds more control over where the structures are placed. I also create three heaps. The hphy heap is physical memory that can be used by the kernel for identity mapping, and husr is memory that can be used for either user space or kernel space. I use hphy for allocating memory for the page tables because they need to be identity mapped in my kernel, just to make it easier to walk the tables. The husr heap is for any pages not inside kernel space. The husr heap could be empty, and this is okay since we will just use pages from hphy.

Then I remove the region of memory where the kernel image is loaded. You could manually enter values here, but I used symbols created by the linker script to mark the beginning and end of the kernel. I remove it from hphy, hchk, and husr because KRAMADDR could change, and it is actually possible the kernel image could span any of them. So just to be safe I remove it from all three heaps.
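
For reference, a minimal sketch of the C side of those symbols (_BOI and _EOI themselves are defined in your linker script, for example with _BOI = .; before the sections and _EOI = .; after them):

	/* these have no storage of their own; only their addresses matter */
	extern uint8 _BOI;	/* beginning of kernel image, set by the linker script */
	extern uint8 _EOI;	/* end of kernel image, set by the linker script */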

kmalloc and kfree

At this point you can now implement a kmalloc and kfree if you like:

void kfree(void *ptr) {
	KSTATE			*ks;
	
	ks = (KSTATE*)KSTATEADDR;

	k_heapBMFree(&ks->hchk, ptr);
}

void* kmalloc(uint32 size) {
	void			*ptr;
	KSTATE			*ks;
	uint32			_size;	
	
	ks = (KSTATE*)KSTATEADDR;
	
	/* attempt first allocation try (will fail on very first) */
	ptr = k_heapBMAlloc(&ks->hchk, size);
	
	/* try adding more memory if failed */
	if (!ptr) {
		if (size < KCHKMINBLOCKSZ / 2) {
			/* we need to allocate blocks at least this size */
			_size = KCHKMINBLOCKSZ;
		} else {
			/* this is bigger than KCHKMINBLOCKSZ, so let us double it to be safe */
			/* then round the allocation up to a whole number of physical pages */
			_size = size * 2;
			_size = (_size / KPHYPAGESIZE) * KPHYPAGESIZE < _size ? _size / KPHYPAGESIZE + 1 : _size / KPHYPAGESIZE;
			_size = _size * KPHYPAGESIZE;
		}
		ptr = k_heapBMAlloc(&ks->hphy, _size);
		if (!ptr) {
			/* no more physical pages */
			return 0;
		}
		/* try allocation once more, should succeed */
		k_heapBMAddBlock(&ks->hchk, (uintptr)ptr, _size, KCHKHEAPBSIZE);
		ptr = k_heapBMAlloc(&ks->hchk, size);
	}
	
	return ptr;
}
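
A quick usage sketch (ksprintf and kserdbg_puts are the same debug helpers used elsewhere on this page):

	uint8		*p;
	char		buf[128];

	/* grab 64 bytes from the chunk heap */
	p = (uint8*)kmalloc(64);
	if (!p) {
		kserdbg_puts("out of memory\n");
	} else {
		p[0] = 0xAA;
		ksprintf(buf, "allocated at %x\n", p);
		kserdbg_puts(buf);
		/* hand it back */
		kfree(p);
	}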

A Quick Primer On Virtual Memory

If you're brand new to virtual memory you might have a difficult time grasping what in the world is going on. So let me try to explain and then give you some links.

Basically, for starters the connection between CPU and memory looks kinda like this:

        CPU <-----> MEMORY

Now, that is simple, right? The CPU accesses RAM directly. Well, let's add something else in there.

        CPU <-----> MMU/TLB <-----> MEMORY

Basically, the MMU has all address lines (at least) pass through it. This gives it the ability to modify an address and change it. The TLB is a buffer that caches either whole entries of the in-memory table or parts of it. When you set the base address of a translation table, the MMU uses that table to translate addresses. You are basically working with virtual memory addresses when you enable the MMU, and it translates them into physical addresses. This process is transparent to the CPU. The CPU has no idea the address is being changed, which is why you have to invalidate caches, and even invalidate the entries the MMU has cached, when you change the mapping through your tables.

Virtual memory allows you to do things such as disk swapping, where you pretend you have more memory in the machine than you really do by swapping unused sections to disk and replacing them with newly accessed sections. A section can be any size and depends on your MMU configuration and your architecture. Virtual memory also allows you to present each application with a different view of memory, and it can handle fragmentation of your physical memory by arranging physical pages that may be separated into a contiguous range using virtual memory mapping. Also, on most architectures virtual memory allows you to enforce security by restricting access to certain areas of memory with different modes, such as read only, or read/write only for privileged code.
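
To make the translation concrete, here is a sketch of the arithmetic a 1MB-section table walk performs (this models what the MMU does; the variable names are just for illustration):

	uint32 virt = 0x80123456;

	/* with 1MB sections the top 12 bits index the first level table */
	uint32 index  = virt >> 20;		/* 0x801 */
	uint32 offset = virt & 0xFFFFF;		/* 0x23456, offset within the section */

	/* the entry supplies the physical base; the offset passes through:
	   physical = (table[index] & ~0xFFFFF) | offset */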

Your best place for further reading is to search for virtual memory and read up on it. I am afraid I can only explain so much here, but I hope I have given you the basics to get you on your feet a little instead of feeling completely lost.

TODO: add links to virtual memory in our wiki

Tinkering With Virtual Memory On the ARM complying with VMSA v6

I conform to the VMSA v6 specification. This deals with 16MB super-sections, 1MB sections, 64K entries, and 4K entries. If you're coming from an X86/X64 background you're going to find it almost identical, except instead of having 4MB entries in your first level table you have 1MB. The most flexible configuration is having your first level table map whole 1MB sections or use a coarse table to map 4KB pages in that 1MB region. The 64K and 16MB entries are essentially similar to those, except 16MB super-sections are used in the first level table and 64K large pages in the second level table.

Now that we have some basic memory management we can proceed to the fun stuff. Let's start really small until we get a grasp of what we are working with.

We need a few utility functions to make life easier:

void arm4_tlbset0(uint32 base) {
	asm("mcr p15, 0, %[tlb], c2, c0, 0" : : [tlb]"r" (base));
}

void arm4_tlbset1(uint32 base) {
	asm("mcr p15, 0, %[tlb], c2, c0, 1" : : [tlb]"r" (base));
}

void arm4_tlbsetmode(uint32 val) {
	asm("mcr p15, 0, %[tlb], c2, c0, 2" : : [tlb]"r" (val));
}

void arm4_tlbsetdom(uint32 val) {
	asm("mcr p15, 0, %[val], c3, c0, 0" : : [val]"r" (val));
}

uint32 arm4_tlbgetctrl() {
	uint32			ctrl;
	/* read the control register straight into the output operand */
	asm("mrc p15, 0, %[ctrl], c1, c0, 0" : [ctrl]"=r" (ctrl));
	return ctrl;
}

void arm4_tlbsetctrl(uint32 ctrl) {
	asm("mcr p15, 0, %[ctrl], c1, c0, 0" : : [ctrl]"r" (ctrl));
}

Defines

	/* first level table */
	#define TLB_FAULT			0x000		/* entry is unmapped (bits 31:2 can be used for anything) but access generates an ABORT */
	#define TLB_SECTION			0x002		/* entry maps 1MB chunk */
	#define TLB_COARSE			0x001		/* sub-table */
	#define TLB_DOM_NOACCESS	0x00		/* generates fault on access */
	#define TLB_DOM_CLIENT		0x01		/* checked against permission bits in TLB entry */
	#define TLB_DOM_RESERVED	0x02		/* reserved */
	#define TLB_DOM_MANAGER		0x03		/* permission bits are not checked (all access allowed) */
	#define TLB_STDFLAGS		0xc00		/* normal flags */
	/* second level coarse table */
	#define TLB_C_LARGEPAGE		0x1			/* 64KB page */
	#define TLB_C_SMALLPAGE		0x2			/* 4KB page */
	/* AP (access permission) flags for coarse table [see page 731 in ARM_ARM] */
	#define TLB_C_AP_NOACCESS	(0x00<<4)	/* no access */
	#define TLB_C_AP_PRIVACCESS	(0x01<<4)	/* only system access  RWXX */
	#define TLB_C_AP_UREADONLY	(0x02<<4)	/* user read only  RWRX */
	#define TLB_C_AP_FULLACCESS	(0x03<<4)	/* RWRW */	
	/* AP (access permission) flags [see page 709 in ARM_ARM; more listed] */
	#define TLB_AP_NOACCESS		(0x00<<10)	/* no access */
	#define TLB_AP_PRIVACCESS	(0x01<<10)	/* only system access  RWXX */
	#define TLB_AP_UREADONLY	(0x02<<10)	/* user read only  RWRX */
	#define TLB_AP_FULLACCESS	(0x03<<10)	/* RWRW */

Example Enabling Paging

You might notice the term TLB, which stands for translation look-aside buffer. That is what we will be working with to implement paging. If you are coming from the X86/X64 architecture you might have already read about or tinkered with paging, and it is basically the same on the ARM. One major difference is that ARM can work with two paging tables at the same time. The intent is for one to be used for the kernel and the other to be process specific. I am going to use them backwards (in this example): the docs state that TTBR0 should be process specific and TTBR1 operating system specific, but TTBR1 gets mapped into the upper part of the 32-bit address space and TTBR0 into the lower, and since our kernel lives low in this code example I am going to use TTBR0 for it. It keeps the demonstration simple.

You might want to consider a higher half kernel, meaning your kernel is linked to run in upper memory, not lower memory, because it provides a number of advantages. My example below is reversed to keep things simple since we are already loaded low in memory.

You can also further divide the space as follows (not evenly):

TTBR0      TTBR1
2GB        2GB
1GB        3GB
512MB      3584MB
...        ...

The split can actually reach 32MB for TTBR0 and 4064MB for TTBR1. We are just going to divide it 2GB/2GB for our initial tinkering.

Code
	/*
		This says:
		1. Use the hphy heap.
		2. Allocate 16KB (1024 * 16)
		3. Make sure the 14 least significant bits are zero (16KB aligned).
	*/
	ktlb = k_heapBMAllocBound(&ks->hphy, 1024 * 16, 14);
	utlb = k_heapBMAllocBound(&ks->hphy, 1024 * 16, 14);
	
	/* 
		enable the usage of utlb when bit 31 of virtual address is not zero 
	*/
	arm4_tlbsetmode(1);
	
	/* 
           We are going to identity map 2GB of space.

           We only need to set 2048 entries, because with the mode set to 1 the
           remaining entries of the ktlb are ignored. But AFAIK the utlb still
           needs an entire 16KB table, which the code below seems to confirm.
        */
	for (x = 0; x < 2048; ++x) {
		ktlb[x] = (x << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	}
	
	/* 
		Do a little trick to map the same memory twice but in both
		page tables (TTBR0 and TTBR1)
	*/
	utlb[0x800] = (1 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	utlb[0x801] = (1 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	ktlb[1] = (2 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	ktlb[2] = (2 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	
	kserdbg_puts("OK\n");
	
	/* load location of TLB */
	arm4_tlbset1((uintptr)utlb);
	arm4_tlbset0((uintptr)ktlb);
	/* set that all domains are checked against the TLB entry access permissions */
	arm4_tlbsetdom(0x55555555);
	/* enable the MMU (0x1) and disable subpages (0x800000) */
	arm4_tlbsetctrl(arm4_tlbgetctrl() | 0x800001);
	
	kserdbg_puts("OK2\n");
	
	/* check trick - as you can see the same 1MB chunk is mapped twice */
	a = (uint32*)0x80000000;
	b = (uint32*)0x80100000;
	a[0] = 0x12345678;
	ksprintf(buf, "utlb:%x\n", b[0]);
	kserdbg_puts(buf);

	a = (uint32*)0x100000;
	b = (uint32*)0x200000;
	a[0] = 0x87654321;
	ksprintf(buf, "ktlb:%x\n", b[0]);
	kserdbg_puts(buf);

	kserdbg_puts("DONE\n");
	for(;;);

We have mapped one region twice at 1MB and 2MB, and mapped another region twice at 2048MB and 2049MB. We verify it by writing to one region and reading from the other. Now, this has been tested in QEMU only. On real hardware we could have missed a thing or two. But this is the basics, and if you have done it correctly it should work! Congratulations, you have enabled paging and not crashed your emulator (or maybe real hardware)!

The CPU actually supports second level tables for finer grained mapping, and even some other configurations that provide mapping of 16MB regions. I just used 1MB sections because they are easier to understand at first.

arm4_tlbsetmode(1)

This value can be between 0 and 7. I am using a 1 in the above code.

Basically, 0 disables the usage of TTBR1, which means we only use a single table. In that case you would have to not only swap the table on each process change, but also propagate any changes to kernel space into the table of every process.

Any value greater than 0 enables the usage of TTBR1. But, it also represents the number of highest bits in the address that are used to select between TTBR1 and TTBR0. For example:

VALUE		BITS IF ZERO USE TTBR0      TTBR0 ADDRESSABLE SPACE     TTBR0 SIZE IN ENTRIES (1MB SECTIONS)
0			-----						4GB (TTBR0)					4096
1			31							2GB (TTBR0)					2048
2			31:30						1GB (TTBR0)					1024
3			31:29						512MB (TTBR0)				512
4			31:28						256MB (TTBR0)				256
5			31:27						128MB (TTBR0)				128
6			31:26						64MB (TTBR0)				64
7			31:25						32MB (TTBR0)				32

If you set the appropriate mode and all the bits in the range are zero then it will use TTBR0, and if any are set it will use TTBR1. If you set the value to 0 then it will ONLY use TTBR0.

Each value also affects how much of the TTBR0 table we have to initialize. So if we use the value 2, then we only have to have 1024 entries (of 1MB sections). However, TTBR1 will always need the full number of entries.
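
As a sketch, here is the selection rule in C form (a model of what the MMU does, not code your kernel needs):

	/* for a given N (1 through 7) an address is translated through TTBR1
	   when any of the top N bits are set, otherwise through TTBR0 */
	int uses_ttbr1(uint32 addr, uint32 n) {
		return n != 0 && (addr >> (32 - n)) != 0;
	}

	/* number of 1MB section entries the TTBR0 table needs for a given N */
	uint32 ttbr0_entries(uint32 n) {
		return 4096 >> n;
	}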

From my research it seems that even though the ARM manual recommends TTBR0 be used for user space and TTBR1 for kernel space, you can swap them if desired. The major downside is that you will be allocating a full 16KB for each process at a minimum if using 1MB sections with (or without) second level tables (a second level table further divides the 1MB).

On the upside, you can give more address space to user space. Of course you could use some of the TTBR0 space for user space, but in that case you would either have to share that space with all processes (or groups of processes) or swap out both TTBR0 and TTBR1, which would seem to be a disadvantage because you would lose the ability to keep a portion of memory (kernel space) validated in the TLB for a performance increase.

arm4_tlbsetdom(0x55555555);

Each entry can be assigned a domain value. There are only 16 possible domains. Each domain can be assigned a rough permission, and this can be changed on the fly.

	#define TLB_DOM_NOACCESS	0x00		/* generates fault on access */
	#define TLB_DOM_CLIENT		0x01		/* checked against permission bits in TLB entry */
	#define TLB_DOM_RESERVED	0x02		/* reserved */
	#define TLB_DOM_MANAGER		0x03		/* permission bits are not checked (all access allowed) */

Each domain is represented by 2 bits, with domain zero starting at the least significant bits. See page 742 in the ARM architecture manual.

	#define TLB_DOM(x)			(x<<5)		/* domain value (16 unique values) */
	#define DOM_TLB(x,y)		(x<<(y*2))	/* helper macro for setting appropriate bits in domain register */

The TLB_DOM macro allows you to set the domain value in a TLB entry, while DOM_TLB lets you set the appropriate bits in the domain register. For example:

	arm4_tlbsetdom(DOM_TLB(TLB_DOM_CLIENT, 0));

The above code would set domain 0 to use the permission bits specified in the TLB entry.

	arm4_tlbsetdom(DOM_TLB(TLB_DOM_CLIENT, 0) | DOM_TLB(TLB_DOM_MANAGER, 1));

The above code would set domain 0 to use the permission bits specified in the TLB entry, and domain 1 to override the permission bits and allow any CPU mode to access the memory specified by TLB entries marked as domain 1.

Access Bits For TLB Entry

	uint32			*tlb = .....;
	
	tlb[0] = (1 << 20) | TLB_AP_NOACCESS | TLB_SECTION;
	tlb[1] = (1 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	tlb[2] = (1 << 20) | TLB_AP_UREADONLY | TLB_SECTION;
	tlb[3] = (1 << 20) | TLB_AP_FULLACCESS | TLB_SECTION;

Let me give a table of how this would affect memory access:

Virtual Address Range    Physical Address Range    Access Description
00000000 - 000FFFFF      00100000 - 001FFFFF       Any access by privileged CPU mode or user mode will generate a fault.
00100000 - 001FFFFF      00100000 - 001FFFFF       Any access by user mode will generate a fault.
00200000 - 002FFFFF      00100000 - 001FFFFF       User mode can read but not write.
00300000 - 003FFFFF      00100000 - 001FFFFF       Any mode can read and write.

Note that I mapped each 1MB virtual section to the same physical address range.

Coarse Table (Sub-Page)

In this example I create a coarse table with two 4KB entries, and I test that they are mapped at the end. I have a few debug printing statements littered throughout and decided to leave them in as an example.

	ktlb = (uint32*)k_heapBMAllocBound(&ks->hphy, 1024 * 16, 14);
	utlb = (uint32*)k_heapBMAllocBound(&ks->hphy, 1024 * 16, 14);

	ksprintf(buf, "ktlb:%x utlb:%x\n", ktlb, utlb);
	kserdbg_puts(buf);
	
	arm4_tlbsetmode(2);	
	/* 
		map 1GB of space - each entry is 1MB (we are setting N to 2 above) 
	*/
	for (x = 0; x < 1024; ++x) {
		ktlb[x] = (x << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	}
	
	/* 
		do a little trick to map the same memory twice but in both
		page tables (TTBR0 and TTBR1)
	*/
	utlb[0x800] = (1 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	utlb[0x801] = (1 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	ktlb[1] = (2 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	ktlb[2] = (2 << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	
	/* the coarse table must be aligned on a 1KB boundary */
	tlbsub = (uint32*)k_heapBMAllocBound(&ks->hphy, 1024, 10);
	ksprintf(buf, "tlbsub:%x\n", tlbsub);
	kserdbg_puts(buf);
	
	/* link the coarse table to our main table */
	ktlb[3] = (uint32)tlbsub | TLB_COARSE;
	
	/* clear coarse table (sub-table) */
	for (x = 0; x < 256; ++x) {
		tlbsub[x] = 0;
	}
	
	/* point both 4K entries at the same physical page; the physical base
	   goes in bits 31:12 of the entry (0x400000 is assumed to be a free page) */
	tlbsub[0] = TLB_C_SMALLPAGE | 0x400000 | TLB_C_AP_PRIVACCESS;
	tlbsub[1] = TLB_C_SMALLPAGE | 0x400000 | TLB_C_AP_PRIVACCESS;
	
	kserdbg_puts("OK\n");
	
	/* load location of TLB */
	arm4_tlbset1((uintptr)utlb);
	arm4_tlbset0((uintptr)ktlb);
	/* set that all domains are checked against the TLB entry access permissions */
	arm4_tlbsetdom(0x55555555);
	/* enable the MMU (0x1) and disable subpages (0x800000) */
	arm4_tlbsetctrl(arm4_tlbgetctrl() | 0x800001);
	
	kserdbg_puts("OK2\n");
	
	/* check sub-pages */
	a = (uint32*)0x300000;
	b = (uint32*)0x301000;
	a[0] = 0x22334411;
	ksprintf(buf, "small:%x\n", b[0]);
	kserdbg_puts(buf);
	
	/* check trick - as you can see the same 1MB chunk is mapped twice */
	a = (uint32*)0x80000000;
	b = (uint32*)0x80100000;
	a[0] = 0x12345678;
	ksprintf(buf, "utlb:%x\n", b[0]);
	kserdbg_puts(buf);

	a = (uint32*)0x100000;
	b = (uint32*)0x200000;
	a[0] = 0x87654321;
	ksprintf(buf, "ktlb:%x\n", b[0]);
	kserdbg_puts(buf);

	kserdbg_puts("DONE\n");
	for(;;);

Using 64K Entries In Coarse Table

I have done some experimenting with 64K page entries in the coarse table. Essentially, you have to replicate the 64K entry 16 times to map 64K.

Take the normal way to map 64K of memory in a coarse table with 4K entries:

	uint32			*ctbl = ....;
	uint32			offset;
	uint32			x;
	
	for (x = 0; x < 16; ++x) {
		ctbl[x] = (offset + (x * 4096)) | FLAGS | SMALLPAGE4K;
	}

Let us see the mapping of 64K using 64K entries.

	uint32			*ctbl = ....;
	uint32			offset;
	uint32			x;

	for (x = 0; x < 16; ++x) {
		ctbl[x] = offset | FLAGS | LARGEPAGE64K;
	}

Now, QEMU will not enforce the entry to be replicated 16 times.

It appears that real hardware may enforce the 16-entry replication in order to reduce memory accesses and the space used by the TLB cache. QEMU does not enforce this and will happily work just fine with blank entries in the 64K mapping, or even 4K mappings mixed in. Also, the manual states that the entry must be replicated 16 times. So the safe way is to always follow this rule, and it should work with any MMU, whether software emulated or hardware.
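
Here is a small sketch of a helper that always follows the replication rule (TLB_C_LARGEPAGE is the define from earlier; ctbl is the coarse table):

	/* map one 64KB large page the safe way: write the identical entry into
	   all 16 coarse-table slots it occupies */
	void ctbl_map64k(uint32 *ctbl, uint32 index, uint32 physical, uint32 flags) {
		uint32 x;

		index = index & ~15;	/* large pages start on a 16-entry boundary */
		for (x = 0; x < 16; ++x) {
			ctbl[index + x] = physical | flags | TLB_C_LARGEPAGE;
		}
	}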

Globally Tagged Entries

...shared

Protection And Cache Behavior

...

Working With Tables Per Process

Let us now see about working with page tables per process. We need some utility functions. Then we will take a look at actually initializing the page tables per process, and finally the code in our scheduler that will switch out tables for each process.

Utility Functions

To work with page tables I have come up with a few utility functions to make life easier. These functions only work with 4K coarse table mappings. One of the functions can understand 1MB sections, but that's it. They are just basic utility functions to give you an idea of how you could work with page tables.

#define KVMM_SUCCESS		1
#define KVMM_FAILURE		0
#define KVMM_UNMAP			0x80000000
#define KVMM_USER			0x40000000
#define KVMM_KERNEL			0x20000000

/*
	kernel kernel page alloc
	
	- only alloc from hphy
*/
void *kkpalloc() {
	KSTATE			*ks;
	
	ks = (KSTATE*)KSTATEADDR;
	return k_heapBMAlloc(&ks->hphy, KPHYPAGESIZE);
}

/*
	kernel user page alloc
	
	- first try to alloc from husr then hphy
*/
void *kupalloc() {
	KSTATE			*ks;
	void			*p;
	char			buf[128];
	
	ks = (KSTATE*)KSTATEADDR;
	p = k_heapBMAlloc(&ks->husr, KPHYPAGESIZE); 
	if (!p) {
		p = k_heapBMAlloc(&ks->hphy, KPHYPAGESIZE);
		ksprintf(buf, "p:%x\n", p);
		kserdbg_puts(buf);
	}
	
	return p;
}

/* Initialize the paging table structure. */
int kvmm_init(KVMMTABLE *table) {
	uint32		x;
	KSTATE		*ks;
	char		buf[128];
	
	ks = (KSTATE*)KSTATEADDR;	
	table->table = (uint32*)k_heapBMAllocBound(&ks->hphy, 1024 * 16, 14);
	if (!table->table) {
		return KVMM_FAILURE;
	}
	
	kserdbg_putc('&');
	
	ksprintf(buf, "table.table:%x\n", table->table);
	kserdbg_puts(buf);
	
	for (x = 0; x < 1024 * 4; ++x) {
		table->table[x] = 0;
	}
	
	return KVMM_SUCCESS;
}

/* Map a contiguous range of count pages starting at a virtual and physical address */
int kvmm_map(KVMMTABLE table, uintptr virtual, uintptr physical, uintptr count, uintptr flags) {
	uintptr		x, v, p, y;
	uint32		*t, *st;
	KSTATE		*ks;
	uint8		unmap;
	
	ks = (KSTATE*)KSTATEADDR;
	
	unmap = 0;
	if (flags & KVMM_UNMAP) {
		/* remove the special flag since it is not part of the hardware entry */
		flags = flags & ~KVMM_UNMAP;
		unmap = 1;
	}
	
	/* this entire loop could be optimized, but it is simple and straight forward */
	v = virtual >> 12;
	p = physical >> 12;
	t = table.table;
	for (x = 0; x < count; ++x) {
		if ((t[v >> 8] & 3) == 0) {
			/* create table (1KB, allocated on a 1KB boundary) */
			t[v >> 8] = (uint32)k_heapBMAllocBound(&ks->hphy, 1024, 10);
			if (!t[v >> 8]) {
				return 0;
			}
			/* get sub-table */
			st = (uint32*)t[v >> 8];
			t[v >> 8] |= TLB_COARSE;
			/* clear table (all entries throw a fault) */
			for (y = 0; y < 256; ++y) {
				st[y] = 0;
			}
		} else {
			/* get sub-table */
			st = (uint32*)(t[v >> 8] & ~0x3ff);
		}
		
		if (unmap) {
			/* unmap page */
			st[v & 0xff] = 0;
		} else if ((st[v & 0xff] & 3) == 0) {
			/* map 4K page, but only if the slot is not already mapped */
			st[v & 0xff] = (p << 12) | flags | TLB_C_SMALLPAGE;
		}
		
		++v;
		++p;
	}
	
	return 1;
}

/* Find a region of size within the specified boundaries. */
int kvmm_getunusedregion(KVMMTABLE table, uintptr *virtual, uintptr count, uintptr lowerLimit, uintptr upperLimit) {
	uintptr		v, c, s;
	uint32		*t, *st;
	KSTATE		*ks;
	
	ks = (KSTATE*)KSTATEADDR;
	
	/* this entire loop could be optimized, but it is simple and straight forward */
	s = lowerLimit;	/* start of current candidate region */
	c = 0;		/* count of free pages found so far */
	v = lowerLimit >> 12;
	t = table.table;
	while (v < (upperLimit >> 12)) {
		if ((t[v >> 8] & 3) == 0) {
			/* whole 1MB region is unmapped so this page is free */
			c++;
		} else {
			/* get sub-table */
			st = (uint32*)(t[v >> 8] & ~0x3ff);
			
			if ((st[v & 0xff] & 3) == 0) {
				c++;
			} else {
				/* page in use; the candidate region restarts after it */
				s = (v + 1) << 12;
				c = 0;
			}
		}
		
		/* if we found a region of size */
		if (c >= count) {
			*virtual = s;
			return 1;
		}
		
		++v;
	}
	
	*virtual = 0;
	return 0;
}

/*
	This does not support 64K or 16MB mappings, only 1MB and 4K.
*/
int kvmm_getphy(KVMMTABLE table, uintptr virtual, uintptr *out) {
	uint32			*t;
	char			buf[128];
	
	/* not mapped */
	if ((table.table[virtual >> 20] & 3) == 0) {
		*out = 0;
		return 0;
	}
	
	ksprintf(buf, "table.table[virtual >> 20]:%x\n", table.table[virtual >> 20]);
	kserdbg_puts(buf);
	
	/* get 1MB section address */
	if ((table.table[virtual >> 20] & 3) == TLB_SECTION) {
		*out = table.table[virtual >> 20] & ~0xFFFFF;
		return 1;
	}
	
	/* get level 2 table */
	t = (uint32*)(table.table[virtual >> 20] & ~0x3ff);
	
	virtual = (virtual >> 12) & 0xFF;

	ksprintf(buf, "t[%x]:%x\n", virtual, t[virtual]);
	kserdbg_puts(buf);
	
	/* not mapped on level 2 */
	if ((t[virtual] & 3) == 0) {
		*out = 0;
		return 0;
	}
	
	/* get 4K mapping */
	*out = t[virtual] & ~0xfff;
	return 1;
}

/* Map fresh pages from the physical page accounting system. */
int kvmm_allocregion(KVMMTABLE table, uintptr virtual, uintptr count, uintptr flags) {
	uintptr		x, v;
	uint32		*t, *st;
	KSTATE		*ks;
	uint8		kspace;
	uint32		y;
	void		*p;

	/*
		If we are mapping in user space then let us try to use pages from the
		husr heap, and if for kernel then pull from hphy. 
	*/
	kspace = 1;
	if (flags & KVMM_USER) {
		kspace = 0;
	} else {
		if (!(flags & KVMM_KERNEL)) {
			/* force specification of either kernel space or user space mapping */
			return 0;
		}
	}
	/* remove flags */
	flags = flags & ~KVMM_KERNEL;
	flags = flags & ~KVMM_USER;
	
	ks = (KSTATE*)KSTATEADDR;
	/* this entire loop could be optimized, but it is simple and straight forward */
	v = virtual >> 12;
	t = table.table;
	for (x = 0; x < count; ++x) {
		if ((t[v >> 8] & 3) == 0) {
			/* create table (1KB, allocated on a 1KB boundary) */
			t[v >> 8] = (uint32)k_heapBMAllocBound(&ks->hphy, 1024, 10);
			if (!t[v >> 8]) {
				/* memory failure */
				return 0;
			}
			st = (uint32*)t[v >> 8];
			
			t[v >> 8] |= TLB_COARSE;
			
			/* clear table (all entries throw a fault) */
			for (y = 0; y < 256; ++y) {
				st[y] = 0;
			}
		} else {
			/* get sub-table */
			st = (uint32*)(t[v >> 8] & ~0x3ff);
		}
		
		if ((st[v & 0xff] & 3) != 0) {
			/* return failure because we are mapping over already mapped memory */
			return 0;
		}
		
		/* map a page */
		if (kspace) {
			kserdbg_putc('1');
			p = kkpalloc();
		} else {
			kserdbg_putc('2');
			p = kupalloc();
		}
		if (!p) {
			/* no physical pages left */
			return 0;
		}
		st[v & 0xff] = (uintptr)p | flags | TLB_C_SMALLPAGE;
		/* increment to next virtual address */
		++v;
	}	
	return 1;	
}
/* Unmap pages and hand them back to the physical page accounting system. */
int kvmm_freeregion(KVMMTABLE table, uintptr virtual, uintptr count) {
	uintptr		x, v, p;
	uint32		*t, *st;
	KSTATE		*ks;
	
	ks = (KSTATE*)KSTATEADDR;
	
	/* this entire loop could be optimized, but it is simple and straight forward */
	v = virtual >> 12;
	t = table.table;
	for (x = 0; x < count; ++x, ++v) {
		if ((t[v >> 8] & 3) != 0) {
			/* get sub-table */
			st = (uint32*)(t[v >> 8] & ~0x3ff);
			
			/* unmap a page */
			if ((st[v & 0xff] & 3) != 0) {
				/* determine if it is a kernel or user page */
				p = (st[v & 0xff] >> 12) << 12;
				if (p >= KMEMSIZE) {
					/* user */
					if (!k_heapBMFree(&ks->husr, (void*)p)) {
						return 0;
					}
				} else {
					/* kernel */
					if (!k_heapBMFree(&ks->hphy, (void*)p)) {
						return 0;
					}
				}
				/* clear the entry so the page is actually unmapped */
				st[v & 0xff] = 0;
			}
		}
	}	
	
	return 1;
}

Initializing Page Tables Per Process

Our code, if you have been following from IRQ, Timer, PIC, And Tasks, included a simple scheduler that swapped state for each thread. What we want to do now is include a virtual address space for each thread, and eventually upgrade the system to support the design of processes and threads. For now just consider each thread a process, and later we will implement actual threads and processes. This just helps keep the code simple at first, which makes reading and understanding easier.

Our first step is to expand our thread structure with a field for the level one page table.

typedef struct _KVMMTABLE {
	uint32			*table;
} KVMMTABLE;
	
typedef struct _KTHREAD {
	uint8			valid;
	uint32			r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, sp, lr, cpsr, pc;
	KVMMTABLE		vmm;
} KTHREAD;

Next, let's add a field for the kernel level one page table.

typedef struct _KSTATE {
	KTHREAD			threads[0x10];
	uint8			threadndx;	
	uint8			iswitch;
	KHEAPBM			hphy;		/* kernel physical page heap */
	KHEAPBM			hchk;		/* data chunk heap */
	KHEAPBM			husr;		/* user physical page heap */
	KVMMTABLE		vmmk;		/* kernel virtual memory map */
} KSTATE;

Now, we need to initialize the kernel page table.

	/* initialize the kernel memory map */
	kvmm_init(&ks->vmmk);
	
	/* identity map the whole 4GB space (this may cover more than the actual RAM) */
	for (x = 0; x < 4096; ++x) {
		// TLB_AP_FULLACCESS
		ks->vmmk.table[x] = (x << 20) | TLB_AP_PRIVACCESS | TLB_SECTION;
	}

This just identity maps everything, which keeps it simple at the moment. We can go back later and actually make the kernel page in only what it needs, but that would complicate our existing code, so I want to either do that later or leave it as an exercise for you. Also worth noting: our utility functions will be unable to work with the kernel table because it uses 1MB sections, and they do not handle those properly.

Also, we used KMEMINDEX to calculate KMEMSIZE, which defines the maximum size of the kernel space. So make sure we also use it to divide the address space:

arm4_tlbsetmode(KMEMINDEX);

Here we initialize the address space of a thread, and we copy some executable code to it.

	ks->threads[0].pc = 0x80000000;         /* set program counter */
	ks->threads[0].valid = 1;
	ks->threads[0].cpsr = 0x60000000 | ARM4_MODE_USER;
	ks->threads[0].sp = 0x90001000;         /* set stack pointer */
	ks->threads[0].r0 = 0xa0000000;		/* argument to function we copied */
	
        /* initialize the level one table */
	kvmm_init(&ks->threads[0].vmm);
        /* map in the serial hardware MMIO */
	kvmm_map(ks->threads[0].vmm, 0xa0000000, 0x16000000, 1, TLB_C_AP_FULLACCESS);
        /* alloc and map a 4K page for the code */
	kvmm_allocregion(ks->threads[0].vmm, 0x80000000, 1, KVMM_USER | TLB_C_AP_FULLACCESS);
        /* alloc and map a 4K page for the stack */
	kvmm_allocregion(ks->threads[0].vmm, 0x90000000, 1, KVMM_USER | TLB_C_AP_FULLACCESS);
        /* get the physical page (going to be mapped in kernel space) */
	kvmm_getphy(ks->threads[0].vmm, 0x80000000, &page);
	/* copy some code there */
	for (x = 0; x < 1024; ++x) {
		((uint8*)page)[x] = ((uint8*)&thread1)[x];
	}

ARM passes the first few arguments by register, so this enables us to pass an argument in R0. We also map in the serial hardware MMIO so the thread can send some letters to the output to demonstrate that it is running. The argument is the address of the serial device's memory mapped output register for sending a character. We are supposed to check if it is ready to receive another character, but QEMU does not seem to be affected, so this should work fine on it. It is mainly just to demonstrate it working and is not for production usage.
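
On real hardware you would poll before writing. A minimal sketch, assuming the Integrator-CP's PL011 UART (flag register at offset 0x18, transmit-FIFO-full bit 0x20; verify these against the PL011 documentation):

	void serial_putc(uintptr base, uint32 c) {
		volatile uint32	*uart = (volatile uint32*)base;

		/* wait while the transmit FIFO is full (UARTFR bit TXFF) */
		while (uart[0x18 >> 2] & 0x20);
		/* write the character to the data register (UARTDR at offset 0) */
		uart[0] = c;
	}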

One VERY important part to understand is that we get the physical page (which we assume is mapped into kernel space), since I am assuming you are running with less RAM than KMEMSIZE. This is just to keep it simple. We can go back later and improve on this, but for now it will work well enough for you to move forward. If you have more memory than KMEMSIZE then this would not work, because the page would be allocated from the husr heap, and that would be above KMEMSIZE, meaning the kernel will not have it identity mapped. I will cover additional ways to access the memory of a thread/process later. For now just make QEMU have less than KMEMSIZE of RAM.

Here is the thread.

void thread1(uintptr serdbgout) {
	int			x;
	
	for (;;) {
		for (x = 0; x < 0xfffff; ++x);
		((uint32*)serdbgout)[0] = 72;
	}
}

Next, we need some code in our scheduler to do this before it returns (after loading the new thread's state).

            ......
			/* set TLB table for user space (it can be zero for kernel) */
			arm4_tlbset1((uintptr)kt->vmm.table);
			/* 
				Invalidate all unlocked entries...
				
				..according to the manual there may be a better way to invalidate,
				only some entries per process. But, for now this should work.
				
				If you do not do this then the TLB does not flush and old entries
				from the previous process will still be in the TLB cache.
			*/
			asm("mcr p15, #0, r0, c8, c7, #0");
			/* go back through normal interrupt return process */
			return;

The above happens after we have restored the registers on the exception stack and the two hidden registers, and right before we return from the IRQ generated by the timer hardware. We also invalidate the entire TLB; I assume this invalidates cached entries for both the TTBR0 and TTBR1 tables.
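
If you prefer a named helper over inline assembly at the call site, a sketch wrapping the same instruction used above:

	/* invalidate the entire unified TLB (entries from both TTBR0 and TTBR1) */
	void arm4_tlbinvalidate() {
		asm("mcr p15, 0, r0, c8, c7, 0");
	}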

Lets Page Out The Kernel Space

Ok, you now should have a minimally functional kernel with user application isolation, memory management, and tasking. Now, you can continue on with your current memory management system and go quite far, or you could tear it down and build something better but more complicated. The previous design has some limitations, mainly if user applications exhaust the husr memory and start grabbing memory from hphy. A large app could essentially end up hogging all the memory the kernel can use and render the system unusable, even if you terminated other tasks. One method to handle this case is to take each page from hphy that has been allocated to user space and copy it (while the system is interrupted) into a page from husr, then change the mapping to point to the husr page. Essentially, swapping pages from hphy to husr.

Also, having unneeded memory mapped into kernel space poses a hazard from bugs. Mainly, your kernel could have a bug that makes stray writes and corrupts other processes, causing really hard to find bugs. By not mapping unneeded memory you make it more likely that you will catch such bugs, because the stray writes will hit invalid memory locations and fault.

If you want to explore a different solution to managing virtual memory without identity mapping the kernel space, then check out the page VMM2, which builds from this page.