Kernel Multitasking

From OSDev Wiki

When a task (process, thread, application, program, ...) is using the CPU, some of the task's state is stored in the CPU - things like the contents of registers, the address of the task's current stack, the current instruction pointer, which virtual address space the task is using, etc.

A task switch involves saving any state that the previous task had in the CPU (that the task may have modified) and then loading any state that the next task expects to be in the CPU. Typically a kernel stores some of this state on the task's (kernel) stack and some of it in data structures in kernel-space (e.g. in a "task control block" structure).
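As a rough illustration of such a structure, here is a minimal C sketch of a thread control block. Only the three fields named ESP, CR3 and ESP0 are implied by the "switch_tasks" example later in this article; the field order, the offset constants and the extra "next" and "state" fields are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for a 32-bit 80x86 kernel. The first three fields
   correspond to TCB.ESP, TCB.CR3 and TCB.ESP0 in the assembly example;
   everything else is an assumption for illustration. */
typedef struct thread_control_block {
    uint32_t esp;    /* saved kernel stack pointer while the task is not running */
    uint32_t cr3;    /* physical address of the task's page directory */
    uint32_t esp0;   /* top of the task's kernel stack (copied into the TSS) */
    struct thread_control_block *next;  /* assumed: link used by the scheduler */
    int state;                          /* assumed: e.g. ready, running, blocked */
} thread_control_block;

/* Offsets the assembly side would have to agree with (assumed values). */
enum { TCB_ESP = 0, TCB_CR3 = 4, TCB_ESP0 = 8 };

/* Catch any mismatch between the C layout and the offsets at compile time. */
static_assert(offsetof(thread_control_block, esp)  == TCB_ESP,  "ESP offset");
static_assert(offsetof(thread_control_block, cr3)  == TCB_CR3,  "CR3 offset");
static_assert(offsetof(thread_control_block, esp0) == TCB_ESP0, "ESP0 offset");
```

Keeping the fields that the assembly touches at the start of the structure, and checking their offsets at compile time, is one way to keep the C and assembly views of the structure in step.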

There are two very different ways of implementing multi-tasking, depending on how the kernel stacks are done.

Kernel Stack Per Task

In most kernels, each task has its own kernel stack, which is used whenever the task moves from user-space (and its user stack) into the kernel (and its kernel stack). In this case, task switches only ever happen after something else (an IRQ, exception, or system call) has already caused the task to switch to kernel code and its kernel stack, which means that task switches only ever occur between tasks that are running kernel code.

For example (NASM syntax, 32-bit 80x86, "cdecl" calling convention, single-CPU only/globals used for "per CPU" variables, no support for FPU/MMX/SSE/AVX, untested):

;C declaration:
;   void switch_tasks(thread_control_block *next_thread);
;
;WARNING: Caller is expected to disable IRQs before calling, and enable IRQs again after function returns

switch_tasks:

    ;Save previous task's state

    ;Notes:
    ;  For cdecl; EAX, ECX, and EDX are already saved by the caller and don't need to be saved again
    ;  EIP is already saved on the stack by the caller's "CALL" instruction
    ;  The task isn't able to change CR3 so it doesn't need to be saved
    ;  Segment registers are constants (while running kernel code) so they don't need to be saved

    push ebx
    push esi
    push edi
    push ebp

    mov edi,[current_task_TCB]    ;edi = address of the previous task's "thread control block"
    mov [edi+TCB.ESP],esp         ;Save ESP for previous task's kernel stack in the thread's TCB

    ;Load next task's state

    mov esi,[esp+(4+1)*4]         ;esi = address of the next task's "thread control block" (parameter passed on stack)
    mov [current_task_TCB],esi    ;Current task's TCB is the next task TCB

    mov esp,[esi+TCB.ESP]         ;Load ESP for next task's kernel stack from the thread's TCB
    mov eax,[esi+TCB.CR3]         ;eax = address of page directory for next task
    mov ebx,[esi+TCB.ESP0]        ;ebx = address for the top of the next task's kernel stack
    mov [TSS.ESP0],ebx            ;Adjust the ESP0 field in the TSS (used by CPU for CPL=3 -> CPL=0 privilege level changes)
    mov ecx,cr3                   ;ecx = previous task's virtual address space

    cmp eax,ecx                   ;Does the virtual address space need to be changed?
    je .doneVAS                   ; no, virtual address space is the same, so don't reload it and cause TLB flushes
    mov cr3,eax                   ; yes, load the next task's virtual address space
.doneVAS:

    pop ebp
    pop edi
    pop esi
    pop ebx

    ret                           ;Load next task's EIP from its kernel stack

Note that the low-level task switch code should be written in pure assembly. If it's written in inline assembly in the middle of a C function then the compiler may add its own "function prologue" and "function epilogue" code that changes the layout of the stack. This is important because when a new task is being created the kernel needs to put values on the new task's kernel stack to match the values that "switch_tasks" expects to pop off of the new task's stack.
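To make the last point concrete, the following C sketch fills in the initial frame that "switch_tasks" would consume the first time it switches to a new task: four register values in pop order, then the address that "ret" jumps to. The stack is modelled as an array of 32-bit words so the layout can be checked on any host; the function name, the zero initial register values and the extra "exit" return address are assumptions, not part of the original example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INITIAL_FRAME_WORDS 6

/* Builds the initial frame at the top of a new task's kernel stack.
   stack[0] is the lowest address; the stack grows towards index 0.
   Returns the index of the word that the saved ESP should point at. */
size_t prepare_initial_stack(uint32_t *stack, size_t words,
                             uint32_t entry_eip, uint32_t exit_eip)
{
    assert(words >= INITIAL_FRAME_WORDS);

    size_t sp = words;          /* empty stack: one past the highest word */
    stack[--sp] = exit_eip;     /* assumed: return address seen by the entry function */
    stack[--sp] = entry_eip;    /* consumed by "ret" in switch_tasks */
    stack[--sp] = 0;            /* EBX, popped fourth */
    stack[--sp] = 0;            /* ESI, popped third */
    stack[--sp] = 0;            /* EDI, popped second */
    stack[--sp] = 0;            /* EBP, popped first */
    return sp;
}
```

In a real kernel the value stored in the TCB's ESP field would be the stack's base address plus four times the returned index. If the push/pop order in "switch_tasks" ever changes, this frame has to change with it.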

Also note that (for kernels primarily written in higher level languages) the low-level task switch code may be the only part of the scheduler that needs to be rewritten when porting the kernel to a different architecture.

This low-level task switch code may be called directly from other places in the kernel (e.g. when a task is being pre-empted and the caller knows exactly which task is pre-empting the current task); but may also be called by higher-level code that selects a task to switch to (e.g. when the currently running task is being terminated and the kernel doesn't know which task to switch to).
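A minimal sketch of such higher-level code is shown below, assuming a simple round-robin ready queue. The queue functions, the "schedule" function and the cut-down TCB are hypothetical, and "switch_tasks" is replaced by a C stand-in that only records the switch so the selection logic can be followed on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down TCB with only what the ready queue needs (assumed fields). */
typedef struct thread_control_block {
    int id;
    struct thread_control_block *next;   /* next task in the ready queue */
} thread_control_block;

thread_control_block *current_task_TCB = NULL;
thread_control_block *ready_head = NULL;
thread_control_block *ready_tail = NULL;

/* Stand-in for the assembly routine: only records which task is current. */
void switch_tasks(thread_control_block *next_thread)
{
    current_task_TCB = next_thread;
}

/* Appends a task to the end of the ready queue. */
void ready_enqueue(thread_control_block *t)
{
    assert(t != NULL);
    t->next = NULL;
    if (ready_tail != NULL) {
        ready_tail->next = t;
    } else {
        ready_head = t;
    }
    ready_tail = t;
}

/* Removes and returns the task at the front of the queue, or NULL. */
thread_control_block *ready_dequeue(void)
{
    thread_control_block *t = ready_head;
    if (t != NULL) {
        ready_head = t->next;
        if (ready_head == NULL) {
            ready_tail = NULL;
        }
        t->next = NULL;
    }
    return t;
}

/* Picks the next ready task and switches to it. The caller is assumed to
   have disabled IRQs. Pass false when the current task is terminating or
   blocking and must not be put back on the ready queue. */
void schedule(bool current_still_runnable)
{
    thread_control_block *next = ready_dequeue();
    if (next == NULL) {
        return;   /* nothing else is ready (a real kernel would need an idle task) */
    }
    if (current_still_runnable && current_task_TCB != NULL) {
        ready_enqueue(current_task_TCB);
    }
    switch_tasks(next);
}
```

Code that already knows which task should run next would bypass "schedule" and call "switch_tasks" directly.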

Kernel Stack Per CPU

For some rare kernels (mostly only a few micro-kernels), there is only one kernel stack per CPU (and each task doesn't have its own kernel stack). In this case task switching becomes radically different because the kernel's stack can't be used to store any task's state. Instead, the user-space state has to be saved in some kind of data structure ("thread control block") immediately after any switch from user-space to kernel-space; and the user-space state has to be loaded from that structure immediately before any switch from kernel-space to user-space.

In this case the kernel typically has a "thread to return to" variable (one for each CPU), and any kernel code can modify this variable when it decides a task switch should happen.
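The idea can be modelled in C as follows. This is only a single-CPU sketch under stated assumptions: the register set, the "kernel_entry", "kernel_exit" and "request_switch" names are hypothetical, and a real kernel would do the save and load in assembly and would also handle the address space, FPU state and so on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed set of user-space registers saved on every kernel entry. */
typedef struct user_state {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp, eip, eflags;
} user_state;

typedef struct thread {
    user_state regs;   /* user-space state while the thread is not running */
} thread;

thread *current_thread = NULL;        /* thread that entered the kernel */
thread *thread_to_return_to = NULL;   /* thread whose state is loaded on exit */

/* Immediately after a switch from user-space to kernel-space:
   save the user-space state into the thread's control block. */
void kernel_entry(const user_state *from_cpu)
{
    assert(current_thread != NULL);
    current_thread->regs = *from_cpu;
    thread_to_return_to = current_thread;   /* default: resume the same thread */
}

/* Any kernel code can ask for a task switch by changing the variable. */
void request_switch(thread *t)
{
    assert(t != NULL);
    thread_to_return_to = t;
}

/* Immediately before a switch from kernel-space to user-space:
   load the state of whichever thread was selected. */
void kernel_exit(user_state *to_cpu)
{
    assert(thread_to_return_to != NULL);
    current_thread = thread_to_return_to;
    *to_cpu = current_thread->regs;
}
```

Note that nothing is kept on the kernel stack between "kernel_entry" and "kernel_exit", which is what allows one stack to be shared by every task on that CPU.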

Note that "kernel stack per CPU" is not recommended for beginners; partly because it's only beneficial in rare cases (e.g. it'd be silly for a monolithic kernel), and partly because it becomes a lot more complex as soon as you try to overcome the latency problems caused by having a non-preemptable kernel (e.g. expensive operations, like creating a new process, may need to be broken up into multiple smaller/less expensive pieces that are separated by explicit "preemption points").
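One way to picture explicit preemption points is an operation whose progress lives in a data structure rather than on the kernel stack, so it can stop between pieces and be resumed later. The sketch below uses clearing a run of pages as the expensive operation; the "clear_job" structure, both function names and the "switch_requested" flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Progress of a long-running operation, kept outside the kernel stack. */
typedef struct clear_job {
    unsigned char *base;   /* start of the memory to clear */
    size_t page_size;      /* size of one piece of work, in bytes */
    size_t total_pages;    /* number of pieces in total */
    size_t next_page;      /* first piece that has not been done yet */
} clear_job;

/* Does at most one piece of work. Returns true once the job is finished. */
bool clear_job_step(clear_job *job)
{
    assert(job != NULL);
    if (job->next_page < job->total_pages) {
        memset(job->base + job->next_page * job->page_size, 0, job->page_size);
        job->next_page++;
    }
    return job->next_page == job->total_pages;
}

/* Runs pieces until the job is done or a task switch has been requested.
   Returns true if finished, false if it stopped at a preemption point. */
bool clear_job_run(clear_job *job, const volatile bool *switch_requested)
{
    for (;;) {
        if (clear_job_step(job)) {
            return true;
        }
        if (*switch_requested) {
            return false;   /* preemption point: progress is saved in *job */
        }
    }
}
```

When "clear_job_run" returns false, the kernel can switch tasks and call it again later with the same job to continue where it stopped.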


See Also

Articles

Threads
