Windows NT

From OSDev Wiki

This page is under construction! This page or section is a work in progress and may thus be incomplete. Its content may be changed in the near future.

TODO: Introduction

Brief History

TODO

System Architecture

The Windows NT architecture consists of vertical layers of modules that together form the monolithic NT kernel.

Windows Architecture.

As depicted in the diagram, the kernel itself comprises two layers: the core kernel (a kernel within the kernel) responsible for CPU-related aspects like scheduling and interrupt handling, and the executive layer, implementing higher-level services.

Beneath the NT kernel lies the Hardware Abstraction Layer (HAL), providing an abstraction for low-level hardware details such as access to device registers, interrupt controllers, and Direct Memory Access (DMA) operations.

Starting with Windows 10, the system by default includes the lowest layer, the Hyper-V hypervisor, which adds orthogonal separation into Normal and Secure worlds through virtualization and the concept of Virtual Trust Levels (VTLs), collectively forming Virtualization-Based Security (VBS). VBS provides support for a number of OS security features, such as VBS-based memory enclaves, Secure Kernel Control Flow Guard, Device Guard, Credential Guard, and others.

Let us examine each layer individually:

The Hypervisor and Secure Kernel

The Windows Hypervisor is the lowest software layer and is part of Microsoft’s virtualization solution, Hyper-V. Hyper-V consists of two components: the Hyper-V Type 1 hypervisor (also called the Windows Hypervisor) and the virtualization stack. The Hyper-V hypervisor serves as a software layer between the hardware and one or more operating systems, providing them with an isolated execution environment known as a partition.

Partitions do not have direct access to the physical processor; instead, they execute threads and handle interrupts on abstractions of physical processors called virtual processors.

The Windows Hypervisor always has at least one root partition with a Windows system, which is well aware of the hypervisor environment. Optimizations made to the root system running under the hypervisor are referred to as enlightenments. For example, in cases of long busy-wait spin-lock loops, the hypervisor can be notified to make scheduling decisions for another virtual processor on the same physical processor until the wait condition is satisfied.
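
The spin-lock enlightenment can be illustrated with a small user-mode sketch. It is not the actual Windows implementation: the threshold value, the `spinsim_` names, and the counter that stands in for the hypervisor notification are all assumptions made for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Arbitrary illustrative threshold, not a real Windows value. */
#define SPINSIM_NOTIFY_THRESHOLD 1000u

typedef struct {
    atomic_flag  held;
    atomic_ulong long_spin_notifications; /* stands in for the hypervisor call */
} spinsim_lock;

/* Hypothetical placeholder for "tell the hypervisor we are still spinning". */
static void spinsim_notify_long_spin(spinsim_lock *lock)
{
    atomic_fetch_add(&lock->long_spin_notifications, 1);
}

/* Spin for at most max_spins failed attempts; returns true once the lock is taken. */
bool spinsim_acquire(spinsim_lock *lock, unsigned max_spins)
{
    unsigned spins = 0;
    while (atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire)) {
        spins++;
        /* After a long busy-wait, give the hypervisor a chance to run
           another virtual processor on this physical processor. */
        if (spins % SPINSIM_NOTIFY_THRESHOLD == 0)
            spinsim_notify_long_spin(lock);
        if (spins >= max_spins)
            return false;
    }
    return true;
}

void spinsim_release(spinsim_lock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

In a real enlightened kernel the notification would be a call into the hypervisor rather than a counter; the sketch only shows where in the spin loop such a call would sit.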

Virtual Secure Mode architecture.

As previously mentioned, the hypervisor also provides support for a secure execution environment called Virtual Secure Mode (VSM), managed by the Secure Kernel (proxy kernel) and inaccessible to software in the Normal World. The Secure Kernel provides services for the Isolated User Mode (IUM), which, in turn, offers an isolated environment for trusted processes (trustlets) and secure memory enclaves.

The virtualization stack runs in the root operating system and manages virtual machines, provides device support services to child partitions, and handles communication between root and child partitions (which do not have direct access to hardware resources). Among other functions, the virtualization stack provides support for fully isolated containers that require fast startup, minimal memory footprint, and often share memory among multiple containers. Such containers leverage the memory manager of the root OS and serve as the foundation for features like Windows Sandbox and WSL 2.

The Hardware Abstraction Layer

Initially, the HAL was a loadable kernel module, hal.dll. In recent versions of Windows, including Windows 11, the HAL has been merged into ntoskrnl.exe; hal.dll no longer contains actual code and remains only for driver compatibility.

The Hardware Abstraction Layer is a thin software layer designed to hide hardware details from higher-level kernel layers, ensuring portability. This includes differences in the implementation of low-level synchronization primitives, access to device registers, DMA operations, timer management, ACPI (Advanced Configuration and Power Interface) support, and more.

Instead of directly accessing hardware, the NT kernel and device drivers maintain portability by invoking services from the HAL when they require platform-dependent information. For instance, the memory manager and the kernel obtain information about hardware details related to cache and memory locality in this manner.

The Kernel

As mentioned earlier, the Windows kernel (ntoskrnl.exe) consists of two layers, and the lower layer of the kernel is somewhat confusingly referred to as the kernel. It can also be abbreviated as “Ke” (functions in this layer have the “Ke” prefix), and it should not be confused with the user-mode Windows subsystem libraries called kernel32.dll and kernelbase.dll.

The kernel layer separates mechanisms from executive policies and serves as a layer between the processor and the rest of the kernel. The primary task of the kernel layer is to provide a set of CPU management abstractions. This includes threads and their scheduling, low-level processor synchronization, interrupt and exception handling. These topics will be discussed in detail in the individual chapters.

In addition to abstracting the processor, the kernel layer provides a set of functions to the NT executive. The executive treats threads (and other shareable resources) as executive objects. These objects carry additional requirements and policy overhead, including object handles, security descriptor checks, resource quotas, and more. This overhead is avoided in the kernel layer, which implements a set of low-level primitive objects that serve as a foundation for the more complex and specialized executive objects.

Kernel Objects

Kernel objects can be divided into two categories: control objects and dispatcher objects. Kernel objects have a prefix of “K” indicating they are part of the core kernel, for example, KTHREAD, KEVENT, KDPC.

Control objects provide an abstraction over processor execution. Dispatcher objects serve as the basic synchronization mechanism provided for the executive layer. This set includes objects like events, gates, mutants (OS/2 mutexes), queues, priority queues, semaphores, threads, processes, and timers. They all share a common synchronization structure called DISPATCHER_HEADER.

The kernel provides a programming interface for synchronization with dispatcher objects, allowing waiting on them using functions like KeWaitForSingleObject/KeWaitForMultipleObjects. The executive subsystem provides a similar class of APIs but interacts with dispatcher objects indirectly through object handles using functions like NtWaitForXxx and ObWaitForXxx. User-mode synchronization objects accessible through the Windows API also derive their synchronization capabilities from dispatcher objects. We will discuss waiting on dispatcher objects in more detail in the Synchronization chapter along with other synchronization primitives.
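
The idea of a common header shared by all dispatcher objects can be sketched as follows. This is a simplified user-mode model, not the real DISPATCHER_HEADER layout: the `dispsim_` types, the two object kinds, and the non-blocking `dispsim_try_wait` function are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for DISPATCHER_HEADER. */
typedef enum { DISPSIM_EVENT, DISPSIM_SEMAPHORE } dispsim_type;

typedef struct {
    dispsim_type type;
    long         signal_state; /* greater than zero means "signaled" */
} dispsim_header;

/* Every object starts with the header, so a pointer to the object
   can be treated as a pointer to its header. */
typedef struct {
    dispsim_header header;
} dispsim_event;

typedef struct {
    dispsim_header header;
    long           limit;
} dispsim_semaphore;

void dispsim_event_init(dispsim_event *e, bool signaled)
{
    e->header.type = DISPSIM_EVENT;
    e->header.signal_state = signaled ? 1 : 0;
}

void dispsim_semaphore_init(dispsim_semaphore *s, long count, long limit)
{
    s->header.type = DISPSIM_SEMAPHORE;
    s->header.signal_state = count;
    s->limit = limit;
}

/* Generic non-blocking wait: works on any object through its header.
   Returns true if the wait would be satisfied immediately. */
bool dispsim_try_wait(void *object)
{
    dispsim_header *h = object;
    if (h->signal_state <= 0)
        return false;
    switch (h->type) {
    case DISPSIM_SEMAPHORE:
        h->signal_state--; /* a satisfied wait consumes one count */
        break;
    case DISPSIM_EVENT:
        break;             /* modeled as an event that stays signaled */
    }
    return true;
}
```

A single wait routine can serve every object type because it only inspects the shared header; the type-specific side effect is applied after the wait is satisfied. A real wait would block the thread instead of returning false.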

Typical examples of control objects include software interrupts: DPCs (Deferred Procedure Calls) and APCs (Asynchronous Procedure Calls).

IRQL

The IRQL (Interrupt Request Level) is an important software mechanism for prioritizing software and hardware interrupts using a level-based priority scheme (levels 0 through 15 on AMD64 and ARM64 systems).

IRQLs.

Unlike thread priorities, interrupt priority is an attribute of the interrupt source. Each processor has its own IRQL, which dynamically changes during kernel code execution and determines which interrupts can be serviced by that processor. IRQLs are also used for synchronization within the kernel. The kernel provides functions like KeLowerIrql and KeRaiseIrql to lower and raise the IRQL, respectively (on AMD64, IRQL is stored in the CR8 register).

When a processor is running at a specific IRQL, interrupts at that IRQL and lower are masked (blocked). Masked interrupts remain pending and are serviced once the IRQL drops to a sufficiently low level, or they may be directed to another processor.

All user-mode code and a sizable portion of kernel-mode code run at PASSIVE level (0), and the system strives to stay at this level to minimize delays. The kernel dispatcher (another name for the thread scheduler in NT) runs at DPC/DISPATCH level (2). Consequently, at this level and above, the processor operates in a single-threaded cooperative mode, and context switching is not possible. Another rule is that at DPC/DISPATCH level and above, code must not reference memory that could be paged out: resolving a page fault requires waiting for disk I/O, and waiting is not possible at these levels.

IRQLs from 3 to 11 are used by devices, as assigned by the Plug and Play interrupt arbiter and bus drivers. The SYNCH level (12) is used for internal synchronization of the dispatcher. The CLOCK (13) and IPI (14) levels are used for clock interrupts and inter-processor interrupt delivery (as well as for recovery from Machine Check Exceptions), respectively. The highest IRQL, PROFILE/HIGH (15), blocks all maskable interrupts and is used by the system during halting after a critical error.
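
The masking behavior of IRQLs can be modeled in a few lines of user-mode C. This is only a sketch under simplifying assumptions: one pending interrupt per level, a `serviced` bitmask used purely for observation, and hypothetical `irqlsim_` names that do not correspond to real kernel routines.

```c
#include <stdint.h>

#define IRQLSIM_LEVELS 16

typedef struct {
    uint8_t  irql;     /* current level of this simulated processor, 0..15 */
    uint16_t pending;  /* bit n set: an interrupt at level n is waiting    */
    uint16_t serviced; /* bit n set: an interrupt at level n has been run  */
} irqlsim_cpu;

/* Service every pending interrupt whose level is above the current IRQL,
   highest level first. Each handler "runs" at its own level. */
static void irqlsim_deliver_pending(irqlsim_cpu *cpu)
{
    for (int level = IRQLSIM_LEVELS - 1; level > cpu->irql; level--) {
        uint16_t bit = (uint16_t)(1u << level);
        if (cpu->pending & bit) {
            uint8_t saved = cpu->irql;
            cpu->pending &= (uint16_t)~bit;
            cpu->irql = (uint8_t)level; /* handler executes at this level */
            cpu->serviced |= bit;
            cpu->irql = saved;          /* return to the interrupted level */
        }
    }
}

/* Raise the IRQL and return the previous level. */
uint8_t irqlsim_raise(irqlsim_cpu *cpu, uint8_t new_irql)
{
    uint8_t old = cpu->irql;
    if (new_irql > old)
        cpu->irql = new_irql;
    return old;
}

/* Lower the IRQL; interrupts that were masked may now be delivered. */
void irqlsim_lower(irqlsim_cpu *cpu, uint8_t new_irql)
{
    cpu->irql = new_irql;
    irqlsim_deliver_pending(cpu);
}

/* An interrupt source at the given level (1..15) requests service. */
void irqlsim_request(irqlsim_cpu *cpu, uint8_t level)
{
    cpu->pending |= (uint16_t)(1u << level);
    irqlsim_deliver_pending(cpu);
}
```

With the processor raised to level 2, a request at level 2 stays pending while a request at level 5 is serviced at once; lowering the IRQL back to 0 then delivers the level 2 interrupt.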

Software Interrupts

DPCs

DPCs are used to reduce the time spent executing ISRs (Interrupt Service Routines) in response to interrupts. Only minimal critical operations are performed in the ISR, and the remaining work can be deferred by queuing DPCs. DPC objects represent the further work to be done once the interrupt request level decreases and any remaining hardware interrupts are processed. Other common examples of DPC usage include responding to timer interrupts and interrupting the execution of the current thread at the end of its quantum.

The software interrupt associated with DPCs is delivered at DPC/DISPATCH IRQL, which is lower than all hardware IRQLs. When the IRQL is about to drop to APC or PASSIVE level, the queued DPCs are executed first, blocking all non-hardware-related processing (which is why they are often used for the immediate execution of high-priority system code). Only when the entire DPC queue has been emptied does the IRQL drop, allowing thread execution to resume.
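
The draining of the DPC queue can be sketched in user-mode C. The model below assumes a single processor with a fixed-size FIFO queue; the `dpcsim_` names and the example routine are hypothetical and are not the kernel's own data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define DPCSIM_DISPATCH_LEVEL 2
#define DPCSIM_MAX_DPCS 16

typedef void (*dpcsim_routine)(void *context);

typedef struct {
    dpcsim_routine routine;
    void          *context;
} dpcsim_dpc;

typedef struct {
    int        irql;
    dpcsim_dpc queue[DPCSIM_MAX_DPCS]; /* circular FIFO of deferred work */
    size_t     head;
    size_t     count;
} dpcsim_processor;

/* Run every queued DPC at DISPATCH level, then restore the previous IRQL. */
static void dpcsim_drain(dpcsim_processor *p)
{
    int saved = p->irql;
    p->irql = DPCSIM_DISPATCH_LEVEL;
    while (p->count > 0) {
        dpcsim_dpc d = p->queue[p->head];
        p->head = (p->head + 1) % DPCSIM_MAX_DPCS;
        p->count--;
        d.routine(d.context);
    }
    p->irql = saved;
}

/* Queue deferred work, typically from an ISR running at a device IRQL. */
bool dpcsim_queue(dpcsim_processor *p, dpcsim_routine routine, void *context)
{
    if (p->count == DPCSIM_MAX_DPCS)
        return false;
    size_t tail = (p->head + p->count) % DPCSIM_MAX_DPCS;
    p->queue[tail] = (dpcsim_dpc){ routine, context };
    p->count++;
    /* Below DISPATCH level nothing masks the software interrupt. */
    if (p->irql < DPCSIM_DISPATCH_LEVEL)
        dpcsim_drain(p);
    return true;
}

/* Lower the IRQL; the queue is emptied before threads may run again. */
void dpcsim_lower_irql(dpcsim_processor *p, int new_irql)
{
    p->irql = new_irql;
    if (new_irql < DPCSIM_DISPATCH_LEVEL)
        dpcsim_drain(p);
}

/* Example DPC routine: counts how many times it ran. */
void dpcsim_count(void *context)
{
    ++*(int *)context;
}
```

DPCs queued while the processor is at a device IRQL stay in the queue; they run only when the IRQL is lowered below DISPATCH level, and the lower level takes effect for other code only after the queue is empty.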

It is important to avoid excessively long DPCs and ISRs because threads cannot run due to preemption, leading to undesirable latencies in the system (especially in cases like real-time audio). Windows has several mechanisms to combat potential thread starvation caused by DPCs. One of them is called DPC Watchdog, which is tasked with monitoring execution at elevated IRQL (DPC/DISPATCH and higher). DPC Watchdog enforces time limits (soft and hard) for both individual DPC execution and high IRQL execution in general. If the soft limits are exceeded, the system collects additional profiling information for further analysis, and if the hard limits are exceeded, the system crashes with a DPC_WATCHDOG_VIOLATION stop code.

Additionally, in Windows 11 the kernel stores a short history of DPC runtime as a hash table to identify long-running DPCs. It also maintains a special DPC delegate thread for each processor in addition to idle threads. In the case of a long DPC, the kernel can reschedule it to the DPC delegate thread, switching the DPC to its own dedicated thread. This way, preempted threads are much less likely to experience starvation.

APCs

APCs are like DPCs (and somewhat akin to signals in Unix) in that they defer the processing of a procedure. However, unlike DPCs, which operate in the context of the processor, APCs run in the context of a specific thread. APCs operate at a lower IRQL than DPCs, so they can be preempted by high-priority threads, can perform waits, and have access to pageable memory. APCs are used to signal asynchronous I/O completion and to suspend, resume, and terminate threads.

APC objects, like DPCs, are organized into queues, with two queues per thread in the case of APCs (one queue for kernel-mode APCs and another for user-mode ones). Normal user-mode APCs invoke a procedure only when the thread is in a waiting state and specifically in an alertable state. On the other hand, the execution of kernel-mode APCs occurs instantly in the context of the running thread and is delivered as a software interrupt at the APC IRQL, preempting the PASSIVE level execution.

In Windows 11, there are also special user-mode APCs, which are entirely asynchronous and execute even if the thread is not in an alertable state.
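
The difference between the two per-thread APC queues can be sketched as follows. This user-mode model covers only kernel-mode APCs and normal user-mode APCs (not the special user-mode variant), treats the thread as the currently running one, and uses hypothetical `apcsim_` names throughout.

```c
#include <stdbool.h>
#include <stddef.h>

#define APCSIM_MAX 8

typedef void (*apcsim_routine)(void *context);

typedef struct {
    apcsim_routine routine;
    void          *context;
} apcsim_apc;

typedef struct {
    apcsim_apc items[APCSIM_MAX];
    size_t     count;
} apcsim_queue;

/* Two queues per thread: one for kernel-mode and one for user-mode APCs. */
typedef struct {
    apcsim_queue kernel_queue;
    apcsim_queue user_queue;
} apcsim_thread;

static bool apcsim_push(apcsim_queue *q, apcsim_routine r, void *context)
{
    if (q->count == APCSIM_MAX)
        return false;
    q->items[q->count].routine = r;
    q->items[q->count].context = context;
    q->count++;
    return true;
}

/* Run queued APCs in FIFO order; returns how many were delivered. */
static size_t apcsim_run_all(apcsim_queue *q)
{
    size_t n = q->count;
    for (size_t i = 0; i < n; i++)
        q->items[i].routine(q->items[i].context);
    q->count = 0;
    return n;
}

/* Kernel-mode APC: delivered at once in the context of the running thread. */
bool apcsim_queue_kernel_apc(apcsim_thread *t, apcsim_routine r, void *context)
{
    if (!apcsim_push(&t->kernel_queue, r, context))
        return false;
    apcsim_run_all(&t->kernel_queue);
    return true;
}

/* Normal user-mode APC: only queued; delivery waits for an alertable wait. */
bool apcsim_queue_user_apc(apcsim_thread *t, apcsim_routine r, void *context)
{
    return apcsim_push(&t->user_queue, r, context);
}

/* The thread enters a wait. Returns the number of user APCs delivered,
   which is zero unless the wait is alertable. */
size_t apcsim_wait(apcsim_thread *t, bool alertable)
{
    return alertable ? apcsim_run_all(&t->user_queue) : 0;
}

/* Example APC routine: counts how many times it ran. */
void apcsim_increment(void *context)
{
    ++*(int *)context;
}
```

A user-mode APC queued to the thread is not delivered by a non-alertable wait; it runs only when the thread performs an alertable wait, whereas a kernel-mode APC runs as soon as it is queued.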

The Executive

The second and upper layer of the NT kernel is the executive layer. It is the largest part of the kernel, mostly architecture-independent (except for the memory manager) and written mostly in the C programming language (with some components using modern C++). The executive is composed of numerous components or modules that are built upon the foundation provided by the core kernel. The executive components are often referred to as managers because they manage specific aspects of the kernel, such as memory, objects, I/O, power, and more. Almost all components export an interface from ntoskrnl.exe for use by driver developers, which makes the executive look like a kernel-mode library.

Windows implements an object-based model to provide unified and secure access to executive services. An object is an instance of a specific statically defined object type, such as threads, files, and processes. A component of the executive layer of the kernel called the object manager manages executive objects and is a pivotal part of the system (hence, we will delve into it in detail shortly).

The Executive Components

The process manager manages the creation and termination of threads and processes, along with the policies that service them. These policies are implemented on top of kernel thread and process objects. Processes act as containers for a set of threads and private address space. The process manager also provides support for jobs, objects that allow control over multiple processes as a single group.

The memory manager implements a demand-paged virtual memory management scheme. It manages the mapping of virtual address spaces to physical memory, controls physical memory allocation, and handles paging memory in and out. The memory manager is the largest and probably most complex component of the kernel and works in conjunction with the prefetcher and store manager.

The store manager serves as an additional lower layer of the memory manager, responsible for optimizing backing compressed stores. It implements memory compression and provides a simple key-value interface for operations on compressed store pages.

A separate component implements technologies tightly integrated with the memory manager: the prefetcher and SuperFetch. The logical prefetcher is a prefetching engine that tracks the OS boot process and application startup to accelerate consistent scenarios. SuperFetch, on the other hand, implements proactive memory management using background prefetching with low-priority I/O operations when there is available RAM, effectively caching potentially useful data in memory.

The I/O manager provides an abstraction for devices and manages I/O operations related to them including completion and cancellation of I/O operations. Additionally, it defines and implements an asynchronous communication model based on I/O packets for drivers. The I/O manager is also responsible for kernel-mode telemetry and crash dumps.

The cache manager optimizes file-based I/O performance by maintaining cached pages of the file system in main memory. Caching itself is implemented by the memory manager using mapped files. In some other operating systems caching works at the physical/block level, as in Unix, where the system caches physically addressed blocks of the raw disk volume. In NT, there's centralized virtual-address caching at the level of logical/virtual files organizing cached pages in terms of their location in the files.

The security reference monitor is a centralized component responsible for implementing access controls according to the Trusted Computer System Evaluation Criteria (TCSEC) standard. It is responsible for defining access tokens that represent a security context, performing access checks on objects, managing privileges, and generating security audit events.

The configuration manager implements the registry, which serves as a centralized repository for information required for OS booting, configuration, system settings, and more. Instead of using numerous configuration files, NT employs a unified method for system configuration with a stable API, support for multiple users, filters, callbacks, and a centralized location making backup and recovery easier.

The power manager handles various power states of devices and the system, coordinates transitions between these power states, and implements special shutdown modes like hibernation and standby. Additionally, the power manager includes several subcomponents: the processor power manager, responsible for managing CPU power states and core parking, the Power Framework (PoFx), which coordinates power states at the device level, and the terminal topology manager that manages terminals on devices.

Advanced Local Procedure Calls (ALPC) is a high-speed, scalable, and secure inter-process communication (IPC) mechanism. It operates by sending messages in various modes, either synchronously or asynchronously, and implements a combination of different message passing techniques depending on the message size (message copying or temporary allocation of shared memory sections).

In modern Windows, ALPC is used extensively throughout the system. Even the most basic Windows program maintains at least one active ALPC connection. For instance, ALPC is used when a process starts up to establish communication with the CSRSS subsystem. ALPC is also employed by the user-mode power manager to communicate with the kernel-mode power manager, as well as by user-mode drivers in general for communication with the kernel. The core messaging mechanisms in the new WinUI frameworks also utilize ALPC. Although ALPC is an internal and undocumented mechanism, its APIs are exported.

In addition to ALPC, Windows also offers a publisher/subscriber registration mechanism called the Windows Notification Facility (WNF). It works by notifying interested parties about the existence of certain events or states.

A server publishes the name of a WNF state, which is protected by a standard security descriptor, and updates some data associated with the state. Clients with sufficient access subscribe to updates for the WNF state. State updates are associated with customizable payload data (up to 4 KB). WNF state names are internally represented by a 64-bit identifier that encodes the version number, lifetime, and scope. Many WNF states are well-known, which means that they are pre-provisioned for use by OS components.

Since WNF state instances contain a fixed amount of preallocated data, there is no queuing of data, which avoids resource management problems associated with message-based IPC. Subscribers are only guaranteed to see the latest version of a state instance. This state-based approach gives WNF an advantage: publishers and subscribers are decoupled and can be started or stopped independently of each other.
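
The state-based model can be illustrated with a small sketch. It assumes a single state instance with a preallocated 4 KB buffer and a counter that increases on every update; the `wnfsim_` names and the polling interface are hypothetical and are not the real WNF API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WNFSIM_MAX_DATA 4096

/* One state instance: fixed storage, no queue of past updates. */
typedef struct {
    unsigned long change_stamp; /* incremented on every publish */
    size_t        size;
    unsigned char data[WNFSIM_MAX_DATA];
} wnfsim_state;

typedef struct {
    unsigned long last_seen_stamp;
} wnfsim_subscriber;

/* Overwrite the state data; earlier data is simply replaced. */
bool wnfsim_publish(wnfsim_state *s, const void *data, size_t size)
{
    if (size > WNFSIM_MAX_DATA)
        return false;
    if (size > 0)
        memcpy(s->data, data, size);
    s->size = size;
    s->change_stamp++;
    return true;
}

/* Copy out the latest data if it changed since this subscriber last looked.
   Returns false when there is nothing new or the buffer is too small. */
bool wnfsim_poll(const wnfsim_state *s, wnfsim_subscriber *sub,
                 void *out, size_t out_capacity, size_t *out_size)
{
    if (sub->last_seen_stamp == s->change_stamp)
        return false;
    if (s->size > out_capacity)
        return false;
    if (s->size > 0)
        memcpy(out, s->data, s->size);
    *out_size = s->size;
    sub->last_seen_stamp = s->change_stamp;
    return true;
}
```

If the publisher updates the state twice before a subscriber looks, the subscriber observes only the second value. The publisher never needs to know whether any subscriber exists, which is what decouples the two sides.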

WNF is widely used throughout the system. For example, the power manager uses WNF to publish events such as the lid opening or closing and the screen turning on or off. WNF is not exposed through the Windows subsystem API, but this is probably subject to change.

The kernel runtime library provides a wide range of helper functions in kernel mode, including common data structures like hash tables and RB trees, string processing, compression algorithms, and the kernel heap management library.

The executive support is a component of the executive layer that provides various executive support routines. This includes support for allocating system memory from paged/non-paged pools, various synchronization primitives like push locks and fast mutexes, interlocked memory access, WNF, and worker threads.

Partial user-mode debugging support is implemented in the kernel by a component called the debugging framework. It provides capabilities for registering and listening to debug events and managing a special debug object.

The kernel shim engine allows dynamic modifications to be applied to old drivers. It maintains a special database of driver compatibility shims that can be applied during driver loading. The engine can hook drivers, specifically the Import Address Table (IAT), driver callback functions, and I/O operations.

The hypervisor library deals with optimizations when the system is running under a hypervisor and the VSL library provides support for a secure virtual environment, isolated user mode, and Hypervisor-Protected Code Integrity (HVCI).

Kernel mode driver development is a complex task where even a minor error can lead to system corruption and crashes. Therefore, an automated method of directed testing and verification is needed. For this purpose, the Driver Verifier was developed, a mechanism that helps find and isolate many common driver issues. It offers a range of testing options, including I/O verification and logging, forced IRQL checks, pool allocation tracking, memory pressure simulation, and more.

Event Tracing for Windows (ETW) is the primary tracing technology in Windows, enabling the provisioning, utilization, and management of event logging and tracing. ETW encompasses three types of components: controllers that start and stop event tracing sessions and enable providers, providers that contain the tracing instrumentation and emit events, and consumers that receive the tracing information, either in real time or from log files. In later builds of Windows 10, static ETW tracing was supplemented with DTrace, a dynamic tracing tool originally developed for the Solaris operating system, but unlike ETW, implemented in a separate driver (dtrace.sys) rather than in the kernel itself.

Windows Diagnostic Infrastructure (WDI) assists in diagnosing and resolving common problem scenarios using automated system monitoring built on top of the well-known ETW. An example of a WDI diagnostic scenario could be tracking memory leaks in the system and preventing resource exhaustion.

Windows Hardware Error Architecture (WHEA) is a unified, portable platform for reporting and recovering from hardware errors.

All NT components follow a unified naming scheme, which includes adding a prefix indicating ownership by a specific component or its part. The table below lists most of the prefixes. It is worth noting that internal functions use a modified prefix: either the letter “p” (“private”) is appended, or the last letter of the prefix is replaced with “i” (“internal”). For example, Mm and Mi, Ke and Ki, Ps and Psp, Hal and Halp.

Different prefixes.
Prefix    Component
Accel     Accelerator Library
Alpc      Advanced LPC
AnFw      Animation Framework
ApiSet    API Sets
Arb       Arbiter
Bapd      Boot Application Persistent Data
Bcd       Boot Configuration Data
Bg        Boot Graphics
Cc        Cache Manager
Cm        Configuration Manager
Dbgk      User-mode Debugging Support
Dif       Driver Instrumentation Framework
Em        Errata Manager
Etw       Event Tracing for Windows
Ex        Executive
FsRtl     File System Run-Time Library
Hal       Hardware Abstraction Layer
Hv        Hive Manager
Hvi       Hypervisor Interface
Hvl       Hypervisor Library
Iaa       In-memory Analytics Accelerator
Inbv      Initialization Boot Video
Init      System Initialization
Io        I/O Manager
Iov       I/O Verifier
Ium       Isolated User Mode
Kasan     Kernel Address Sanitizer
Ke        Kernel
Kse       Kernel Shim Engine
Lsa       Local Security Authority
Mm        Memory Manager
Nt        System Services
Ob        Object Manager
Pdc       Power Dependency Coordinator
Pf        Prefetcher
Pnp       Plug and Play
Po        Power Manager
PoFx      Power Framework
Ppm       Processor Power Manager
Ps        Process Manager
Rtl       Run-Time Library
Sdb       Shim Database
Se        Security Reference Monitor
Sk        Secure Kernel
Sm        Store Manager
Verifier  Driver Verifier
Vm        Virtual Machine
Vsl       VSM Library
Wdi       Windows Diagnostic Infrastructure
Wer       Windows Error Reporting
Whea      Windows Hardware Error Architecture
Wmi       Windows Management Instrumentation
Wnf       Windows Notification Facility
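
The naming scheme lends itself to a simple heuristic for telling public routines from internal ones. The sketch below assumes that internal variants are formed either by appending "p" to the component prefix (Ps and Psp, Hal and Halp) or by replacing its last letter with "i" (Ke and Ki, Mm and Mi), as the examples above suggest; the `ntname_` helpers are hypothetical and the rule is not guaranteed to hold for every routine.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef enum { NTNAME_OTHER, NTNAME_PUBLIC, NTNAME_INTERNAL } ntname_kind;

/* True if name starts with prefix and the next character begins a new
   capitalized word, e.g. "Ke" + "WaitForSingleObject". */
static bool ntname_has_prefix(const char *name, const char *prefix, size_t len)
{
    return strncmp(name, prefix, len) == 0 &&
           isupper((unsigned char)name[len]);
}

/* Classify a routine name relative to one public component prefix. */
ntname_kind ntname_classify(const char *name, const char *public_prefix)
{
    size_t len = strlen(public_prefix);
    char internal[16];

    if (len == 0 || len + 2 > sizeof internal)
        return NTNAME_OTHER;

    /* Variant 1: append "p", e.g. Ps -> Psp. */
    memcpy(internal, public_prefix, len);
    internal[len] = 'p';
    internal[len + 1] = '\0';
    if (ntname_has_prefix(name, internal, len + 1))
        return NTNAME_INTERNAL;

    /* Variant 2: replace the last letter with "i", e.g. Ke -> Ki. */
    internal[len - 1] = 'i';
    internal[len] = '\0';
    if (ntname_has_prefix(name, internal, len))
        return NTNAME_INTERNAL;

    if (ntname_has_prefix(name, public_prefix, len))
        return NTNAME_PUBLIC;
    return NTNAME_OTHER;
}
```

For instance, KeWaitForSingleObject is classified as public for the prefix Ke, while a made-up name such as KiExample or PspExample is classified as internal for Ke and Ps respectively.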

The Object Manager

As previously mentioned, NT implements a unified and consistent interface for managing system resources and data structures through objects.

The object manager is a centralized kernel component responsible for managing objects. It handles their naming, creation, deletion, tracking, and protection in a unified manner. The list of executive objects includes files, processes, threads, jobs, sections, mutexes, semaphores, registry keys, ALPC ports, and many others. Some of these objects were previously mentioned as kernel objects. Indeed, executive versions of these objects encapsulate the dispatcher object of the corresponding type, adding executive policies (with names prefixed by "E" instead of “K,” indicating executive, for example, EPROCESS, ETHREAD). Additionally, the object manager manages more specialized objects like sessions, kernel transactions, and power requests. In total, there are around seventy different types of executive objects (some of which are internal to the executive and not accessible from outside).

Object Structure

An object itself is just a data structure in virtual memory accessible in kernel mode. Typically, these data structures are used to build more complex data structures. Each object always consists of an object header and a body specific to the object type. The object manager manages fields of the object header, while the format and content of the object body are managed by the corresponding executive component.

The structure of an executive object that contains a kernel object.

The structure of the object header includes:

  • The object's reference count. Objects are reference counted so that their lifetime can be tracked properly.
  • The number of open object handles.
  • A push lock for per-object locking. Thus, manipulation of one object does not prevent simultaneous manipulation of any other object of the same type.
  • The object type index, which identifies the type object holding the attributes common to all objects of that type.
  • Debugging and tracing flags, as well as object attribute flags.
  • Quota charges associated with the object. Quotas are used by the memory manager to control the size of the working set, and by the process manager to control constraints if any are set.
  • A security descriptor.
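
The interplay between the reference count and the handle count can be sketched as follows. This is a simplified model, not the real object header: it assumes that every open handle also holds one reference, and the `objsim_` names and the `deleted` flag are introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical, reduced object header: only the two counters. */
typedef struct {
    long pointer_count; /* all references, including one per open handle */
    long handle_count;  /* number of open handles                        */
    bool deleted;       /* set when the last reference goes away         */
} objsim_header;

/* A new object starts with one reference held by its creator. */
void objsim_init(objsim_header *h)
{
    h->pointer_count = 1;
    h->handle_count = 0;
    h->deleted = false;
}

void objsim_reference(objsim_header *h)
{
    h->pointer_count++;
}

/* Dropping the last reference ends the object's lifetime. */
void objsim_dereference(objsim_header *h)
{
    if (--h->pointer_count == 0)
        h->deleted = true;
}

/* Opening a handle counts as both a handle and a reference. */
void objsim_open_handle(objsim_header *h)
{
    h->handle_count++;
    objsim_reference(h);
}

void objsim_close_handle(objsim_header *h)
{
    h->handle_count--;
    objsim_dereference(h);
}
```

In this model an object survives as long as either a handle or a kernel-mode pointer reference remains; it is destroyed only when the creator has dropped its reference and the last handle has been closed.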

Additionally, an object may contain certain subheaders. Each subheader is optional and is present only under certain conditions. In some cases, the object may also have a footer in addition to the header and body.

Handles

User-mode programs can access kernel-mode objects using handles. Handles are opaque values returned by many system APIs. The object manager converts handles into references to specific kernel-mode data structures representing objects, using a handle table for this purpose. To be more precise, each handle value is an index into this table multiplied by four, which leaves the two low-order bits free for special per-handle bits (the first entry corresponds to handle value 4, the second to 8, and so on).
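
The relationship between a handle value and its table index can be written down directly. The sketch assumes that the per-handle bits occupy the two low-order bits and are ignored when the table entry is located; the `handlesim_` helpers are hypothetical names.

```c
#include <stdint.h>

#define HANDLESIM_TAG_BITS 2u
#define HANDLESIM_TAG_MASK ((uintptr_t)((1u << HANDLESIM_TAG_BITS) - 1u))

/* Handle value for a table index: the index multiplied by four. */
uintptr_t handlesim_from_index(uintptr_t index)
{
    return index << HANDLESIM_TAG_BITS;
}

/* Table index for a handle value: the low bits are dropped. */
uintptr_t handlesim_index(uintptr_t handle)
{
    return handle >> HANDLESIM_TAG_BITS;
}

/* The per-handle bits carried in the low-order positions. */
uintptr_t handlesim_tag(uintptr_t handle)
{
    return handle & HANDLESIM_TAG_MASK;
}
```

Under this assumption the handle values 4, 5, 6, and 7 all select table index 1 and differ only in their low-order bits, while index 3 corresponds to the handle value 12.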

Every process has its own handle table, including the system process that hosts kernel threads and whose handle table is protected from user mode. The kernel handle table contains kernel handles not accessible from user mode.

The handle tables are represented as a tree structure, which can dynamically expand as needed, up to three levels, holding a maximum of 16M handles.

The Object Manager Namespace

The object manager allows objects to be named, providing a means to distinguish objects from each other, query specific objects, and share objects among processes. The object manager implements a hierarchical namespace for objects. If a process wants to share an object with other processes, the object can be added to the global namespace. Unlike Unix, the namespace in NT is abstract, internal, and hidden from users but remains an important part of the system.

The hierarchy of the namespace is formed by object directories and symbolic links. The object namespace is modeled after the OS/2 file naming convention, where directory names in a path are separated by a backslash. Name lookups can optionally be case-insensitive.

The namespace is extensible, allowing anyone to define namespace extensions using the parse procedure. For instance, in this way the I/O manager extends the namespace to remote files, enabling access to named objects from other computers. Parse is one of the procedures that can be defined for each object type upon creation, along with open, close, delete, query, and security.
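
A name lookup in a hierarchical, backslash-separated namespace can be sketched as follows. The model is deliberately minimal: directories are static arrays of child nodes, symbolic links and parse procedures are not modeled, and the `nssim_` types and functions are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A namespace node: a directory if it has children, otherwise a leaf. */
typedef struct nssim_node {
    const char              *name;
    const struct nssim_node *children;
    size_t                   child_count;
} nssim_node;

/* Compare one path component (not NUL-terminated) with a node name. */
static bool nssim_name_equals(const char *part, size_t part_len,
                              const char *name, bool ignore_case)
{
    if (strlen(name) != part_len)
        return false;
    for (size_t i = 0; i < part_len; i++) {
        int a = (unsigned char)part[i];
        int b = (unsigned char)name[i];
        if (ignore_case) {
            a = tolower(a);
            b = tolower(b);
        }
        if (a != b)
            return false;
    }
    return true;
}

/* Resolve an absolute path such as "\Dir\Name" starting at root.
   Returns NULL if the path is relative or a component is missing. */
const nssim_node *nssim_lookup(const nssim_node *root, const char *path,
                               bool ignore_case)
{
    if (path[0] != '\\')
        return NULL;

    const nssim_node *current = root;
    const char *p = path + 1;
    while (*p != '\0') {
        const char *end = strchr(p, '\\');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        const nssim_node *next = NULL;

        for (size_t i = 0; i < current->child_count; i++) {
            if (nssim_name_equals(p, len, current->children[i].name,
                                  ignore_case)) {
                next = &current->children[i];
                break;
            }
        }
        if (next == NULL)
            return NULL;
        current = next;
        p += len;
        if (*p == '\\')
            p++; /* skip the separator before the next component */
    }
    return current;
}
```

The `ignore_case` parameter mirrors the optional case insensitivity of lookups. In the real object manager, a lookup that reaches an object whose type defines a parse procedure would hand the remainder of the path to that procedure instead of continuing the directory walk.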

Win32k

To improve performance, a massive portion of the Win32 graphical subsystem was moved into kernel mode as the win32k.sys module in NT 4.0. This indeed enhanced performance on the hardware available at that time. However, it also introduced architectural flaws and security issues in the form of numerous vulnerabilities. Fortunately, this is subject to change, and most likely, an upcoming Windows release will move win32k back into user mode.

Device Drivers

Device drivers in NT are dynamic link libraries loadable by the executive. While they are primarily used for specific hardware devices, drivers serve as a general mechanism for kernel-mode extensions and may not necessarily be associated with a hardware device (software drivers). For example, as mentioned earlier, a part of the Win32 subsystem is loaded as a kernel-mode driver.

In the NT I/O system, a device driver can either perform all the necessary operations independently or several drivers can be “stacked” together to organize an I/O data flow path for each device instance. This data flow path is known as a device stack, and in such cases, I/O requests may pass through a sequence of drivers, with each driver handling its part of the work.

Additionally, filter drivers can be inserted into the device stack, which handle preprocessing or postprocessing of I/O operations. Filters can address issues such as modifying driver behavior without altering the driver itself or implementing entirely new functionality for a specific device.
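
The flow of a request through a device stack can be sketched with function pointers. This is a user-mode toy model rather than the real I/O manager interface: the `stacksim_` types, the three example drivers, and the integer payload are all assumptions made for illustration.

```c
#include <stddef.h>

/* A toy I/O request travelling down the stack. */
typedef struct {
    int value;          /* payload that drivers may transform   */
    int layers_visited; /* how many drivers have seen it        */
    int completed;      /* set by the driver that finishes it   */
} stacksim_request;

typedef struct stacksim_device stacksim_device;
typedef void (*stacksim_dispatch)(stacksim_device *dev, stacksim_request *req);

struct stacksim_device {
    stacksim_dispatch dispatch; /* the owning driver's handler             */
    stacksim_device  *lower;    /* next device down the stack, or NULL     */
};

/* Pass the request to the next lower device, if there is one. */
void stacksim_call_lower(stacksim_device *dev, stacksim_request *req)
{
    if (dev->lower != NULL)
        dev->lower->dispatch(dev->lower, req);
}

/* Upper filter driver: preprocesses the request, then passes it on. */
void stacksim_filter_dispatch(stacksim_device *dev, stacksim_request *req)
{
    req->layers_visited++;
    req->value *= 2;
    stacksim_call_lower(dev, req);
}

/* Function driver: does its part of the work, then passes it on. */
void stacksim_function_dispatch(stacksim_device *dev, stacksim_request *req)
{
    req->layers_visited++;
    req->value += 1;
    stacksim_call_lower(dev, req);
}

/* Lowest driver in the stack: completes the request. */
void stacksim_bus_dispatch(stacksim_device *dev, stacksim_request *req)
{
    (void)dev;
    req->layers_visited++;
    req->completed = 1;
}

/* Send a request to the top of a device stack. */
void stacksim_send(stacksim_device *top, stacksim_request *req)
{
    top->dispatch(top, req);
}
```

Stacking a filter on top of a function driver on top of a bus driver and sending a request with value 5 yields the value 11 after three layers: the filter doubles it before the function driver adds one. The filter changes the observed behavior without any modification to the drivers beneath it.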

Another category of device drivers in NT is filesystem drivers. Each volume for a filesystem is treated as a device object created as part of the device stack for that volume. Special filesystem filter drivers can be directed at filesystem drivers, which can be used for purposes such as encryption or antivirus scanning.

To facilitate driver development, the Windows Driver Model (WDM) was introduced, allowing for the creation of device drivers following a unified model. However, WDM is closely tied to the operating system, with drivers directly interacting with system functions and structures. On top of this model, a set of frameworks known as Windows Driver Frameworks (WDF) has been developed. These frameworks simplify many concepts by handling interactions with the kernel, allowing developers to focus more on driver requirements and functionality.

Boot and Initialization

TODO

Processes and Threads

TODO

Synchronization

TODO

Thread Scheduling

TODO

Memory Management

TODO

Caching

TODO

I/O Management

TODO

Power Management

TODO

Virtualization

TODO
