User:Dbstream/Paging on x86

From OSDev Wiki

Paging is a mechanism by which a processor translates virtual addresses to physical addresses. The address space is divided into fixed-size blocks of memory, where one such block is a page. Usually, translations from virtual to physical addresses are implemented with a kind of recursive data structure, the page tables. The processor keeps a register that points to a root page table. A page table can be conceptually thought of as an array of pointers to pages or next-level page tables. When a virtual address is to be translated, different parts of it function as indices into the different page table levels.

Paging modes

There are many different paging modes on x86 CPUs that depend on the state of the processor. These are listed below:

CR0.PG   CR4.PAE    CR4.LA57   EFER.LME   Paging mode
  0      (ignored)  (ignored)  (ignored)  paging disabled
  1         0       (ignored)  (ignored)  32-bit paging
  1         1       (ignored)     0       PAE paging
  1         1          0          1       4-level paging
  1         1          1          1       5-level paging

This article only deals with long-mode paging (4-level and 5-level paging).

Paging (long-mode)

A page table is a page which contains an array of 512 64-bit values. Each value encodes permissions and flags, such as whether a page is writable or cacheable, and, most importantly, a pointer to a page, which can be the next level of page tables. In 4-level paging, the cr3 register of the CPU points to a PML4. The PML4 entries point to PML3s, whose entries point to PML2s, whose entries point to PML1s, whose entries point to physical pages of memory. In 5-level paging, there is another level of indirection: the cr3 register points to a PML5, whose entries point to PML4s. This extends the virtual address space from 48 bits to 57 bits.

WARNING: the Intel and AMD manuals refer to the PML3, PML2 and PML1 as page directory pointer (PDP), page directory (PD) and page table (PT). For simplicity, this article refers to them by different names in a more consistent naming scheme.

The format of a page table entry is as follows:

+63----------+62-----------------+MAXPHYADDR---------------------+11-------+8-------+7------------+6------+5---------+4----+3----+2-----+1---------+0--------+
| no-execute |      ignored      | pointer to page or page table | ignored | global | pat or size | dirty | accessed | pcd | pwt | user | writable | present |
+----------63+-------MAXPHYADDR+1+-----------------------------12+--------9+-------8+------------7+------6+---------5+----4+----3+-----2+---------1+--------0+

(MAXPHYADDR is the number of bits of physical address space supported. It is a number between 36 and 52.)

See the Intel(R) 64 and IA-32 Architectures Software Developer's Manual for information about what each of these bits does.

In long mode, an address looks like this:

+63---------+56-----+47-----+38-----+29-----+20-----+11----------------+
| canonical | PML5e | PML4e | PML3e | PML2e | PML1e | offset into page |
+---------57+-----48+-----39+-----30+-----21+-----12+-----------------0+

If 4-level paging is used, the PML5e field doesn't exist and is instead part of the canonical region. The canonical part of an address is effectively a sign extension of the rest of the address: every one of its bits must equal the most significant bit below it (bit 47 for 4-level paging, bit 56 for 5-level paging).

Detecting support for 5-level paging

Support for 5-level paging (LA57, 57-bit linear addresses) can be detected using CPUID leaf 7 subleaf 0. If bit 16 is set in ecx, LA57 is supported.

bool is_la57_supported(void)
{
        if (cpuid_a(0) < 7)
                /* CPUID leaf 7 is not supported. */
                return false;

        /*
         * NOTE: when implementing 'cpuid_c()', make sure ecx is
         * zero before the CPUID instruction is executed. Otherwise,
         * information from the wrong subleaf will be returned.
         */
        return (cpuid_c(7) & (1 << 16)) ? true : false;
}

Enabling LA57 cannot be done from within long mode, so there are three options:

1. Have a protected-mode trampoline that enables it.
2. Enable it before your kernel enters long mode.
3. Depend on the bootloader to enable it.

With a modern bootloader like Limine, LA57 will typically already be enabled if the CPU supports it. For less modern bootloaders and protocols, the second approach may be needed. It is described below:

        /* Assumption: we are somewhere in a protected-mode boot stub. */

        /* Detect if we support LA57. */
        xorl %eax, %eax
        cpuid
        cmpl $7, %eax
        jb .Lno_la57

        movl $7, %eax
        xorl %ecx, %ecx
        cpuid
        testl $(1 << 16), %ecx
        jz .Lno_la57

        /* We have LA57. */
        movl %cr4, %eax
        orl $(__CR4_PAE | __CR4_LA57), %eax
        movl $boot_pml5, %edx
        jmp 1f

.Lno_la57:      /* We do not have LA57. */
        movl %cr4, %eax
        orl $__CR4_PAE, %eax
        movl $boot_pml4, %edx

1:      /* %eax holds the %cr4 we want, and %edx holds the %cr3 we want */
        movl %eax, %cr4
        movl %edx, %cr3

Translation lookaside buffer (TLB)

TODO: describe the TLB and invalidation

Page attribute table (PAT)

TODO: describe the page attribute table and cache modes.

Example: Mapping a 4KiB page with 4-level and 5-level paging

The code below is an example of how to map a 4KiB page of memory. It maps the page read-write, supervisor-only, executable, and non-global. It assumes the following about your kernel:

- alloc_page returns the physical address of a page full of zeroed memory.

- phys_to_virt converts a physical address to a virtual address in some kind of higher-half direct map.

- addr_t is a typedef that matches uint64_t.

- root_page_table is the page table in use (pointed to by cr3) and is a volatile uint64_t *.

- use_5level_paging is non-zero if 5-level paging is being used, otherwise zero.

enum pte_bits {
        PG_PRESENT      = 1 << 0,
        PG_WRITE        = 1 << 1
};

void map_page(addr_t virt, addr_t phys)
{
        /* how many page tables to traverse before we get to the PML1 */
        int l = use_5level_paging ? 4 : 3;

        /* points to the PML5e or PML4e that manages 'virt' */
        volatile uint64_t *table = root_page_table + ((virt >> (12 + 9 * l)) & 0x1ff);

        /*
         * In a loop, reduce the scope of 'table' by traversing downward and
         * allocating new page tables as necessary.
         */
        while(l) {
                uint64_t value = *table;
                if(value & PG_PRESENT) {
                        /* page table already exists */
                        value &= 0x7ffffffffffff000;
                } else {
                        /* allocate a new page table */
                        value = alloc_page();
                        *table = value | PG_PRESENT | PG_WRITE;
                }

                l--;
                table = (volatile uint64_t *) phys_to_virt(value) + ((virt >> (12 + 9 * l)) & 0x1ff);
        }

        /* 'table' now points to the PTE we're interested in modifying */
        *table = phys | PG_PRESENT | PG_WRITE;
}