There is no fixed way to implement video handling, and many hobby operating systems develop their UI as a unit, at least initially. However, for a number of practical reasons, most systems end up separating the details of communicating with the hardware from the process of rendering the image, and that in turn from the management of higher-level concerns such as window layout. In most systems, this leads to a layered approach that could be called the graphics stack, analogous to the networking stack.
(NB: this should not be confused with the graphics pipeline, which is the sequence of operations which a GPU applies to an image being rendered.)
A Generalized Graphics Stack
There is no fixed model for how to layer the graphics stack, though most are fairly similar. A general model for a graphics stack, from highest to lowest levels, might look like this:
- Application Layer
- Interoperation Layers
  - Desktop Management Layer
  - Window Management Layer
- Presentation Layers
  - Compositing Layer
  - Widget Toolkit Layer
  - Rendering Layer
- Display Layers
  - Device Driver Layer
  - Hardware Layer
This is close to most of the well-known graphics stacks, such as those of Microsoft Windows, macOS, the X Window System, and Wayland, though not identical to any of them. The order of layers may differ, some layers may be missing or merged together, and some systems may have additional layers or even multiple stacks.
Furthermore, some layers may be more in parallel than sequential. Presentation, in particular, can involve some fairly complex relationships.
A note on X Window System
The X Window System differs from most of the others in how its graphics stack is organized, as it was originally designed as a protocol for networked video, rather than as a specific display manager. X uses separate Client and Server stacks, each with its own Display Layers (even when used for rendering locally, as is the more common use case today), and splits some aspects of the remaining layers between the two. The Client is the (possibly remote) program requesting that something be displayed, and the Server is the local system rendering the image. While this might seem to be a reversal of the usual client/server relationship, it makes sense if you view the client as the program requesting a service from a remote system. Further, the Client-Server relationship in X is potentially many-to-many, meaning that there may be several Server stacks rendering for a single Client stack, while each Server may in turn be connected to other Clients and have to compose the graphics from all of them into a desktop. Finally, different applications may need different degrees of control, meaning that several details cannot be assigned to either the Server or the Client in advance, but need to be negotiated between the two at the start of the operation.
Device Driver Layer
At the lowest software level, we have the device drivers, which communicate with the actual hardware. A driver needs to be able to work either with a specific display device - the video memory, the GPU if any, the video signal generators, and even the monitor - or with some common subset of functionality that the device shares with otherwise disparate adapters. However, this does not mean that the driver must do all the work alone. The VESA VBE Core standard defines a minimal interface to the hardware as a BIOS extension, which a compliant video adapter should provide as a way of interfacing with the hardware without needing any proprietary details of the adapter.
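As a rough illustration of what the driver layer ultimately hands to the layers above it, the sketch below models a linear framebuffer: a block of memory plus a width, a height, and a pitch (bytes per scanline). The `LinearFramebuffer` class and its method names are hypothetical, a `bytearray` stands in for mapped video memory, and a 32-bit pixel format with blue in the lowest byte is assumed; a real driver would read these values from the mode information reported by the adapter.

```python
class LinearFramebuffer:
    """Toy stand-in for a linear framebuffer (all names are hypothetical)."""

    BYTES_PER_PIXEL = 4  # assumption: 32 bits per pixel, stored as B, G, R, X

    def __init__(self, width, height, pitch=None):
        self.width = width
        self.height = height
        # The pitch (bytes per scanline) can be larger than
        # width * BYTES_PER_PIXEL because adapters may pad each row,
        # so it is stored separately rather than derived from the width.
        self.pitch = pitch if pitch is not None else width * self.BYTES_PER_PIXEL
        if self.pitch < width * self.BYTES_PER_PIXEL:
            raise ValueError("pitch is too small for the given width")
        self.memory = bytearray(self.pitch * height)  # stands in for video memory

    def offset(self, x, y):
        """Byte offset of pixel (x, y) from the start of the framebuffer."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("pixel outside the visible area")
        return y * self.pitch + x * self.BYTES_PER_PIXEL

    def put_pixel(self, x, y, red, green, blue):
        i = self.offset(x, y)
        self.memory[i:i + 4] = bytes((blue, green, red, 0))

    def get_pixel(self, x, y):
        i = self.offset(x, y)
        blue, green, red, _ = self.memory[i:i + 4]
        return red, green, blue
```

The point to take away is the address calculation in `offset`: rows are stepped by the pitch, not by the visible width, which is a common source of skewed output in a first framebuffer driver.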
Somewhere here you would find things like the Mesa driver framework and the Linux Direct Rendering Manager (DRM). This level, which doesn't have a formal name in most systems as far as I know, is the first abstraction layer that a software system (not necessarily the operating system itself) provides to give a uniform model for drawing pixels on the screen, while still exposing the underlying hardware. The split between 2-D and 3-D often starts around here, as a 3-D renderer generally needs a lot more direct hardware access than a 2-D one.
Rendering Layer
The next level is the renderer, which is where individual images or sections of images get created. This is where you really see 3-D becoming a separate thing, as most systems prior to, say, 2007 would have used strictly 2-D rendering for everything that didn't specifically require 3-D, because practical real-time 3-D rendering needed hardware acceleration at the time. There has been a wheel of reincarnation for graphics rendering, cycling from CPU-driven rendering to dedicated GPU rendering and back, with current systems often using both for different purposes. Note, however, that this wheel has been rolling since the very first days of computer graphics in the early 1960s, so it is a good guess that this won't be the last word on the subject.
For example, the Mesa library used in many Linux systems started out in the 1990s as a software 3-D renderer, but is currently used to abstract the rendering process, allowing it to use hardware acceleration while providing software rendering as a fallback mode.
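The general idea of hardware rendering with a software fallback can be sketched as a renderer that asks each backend in turn whether it is usable. This is not Mesa's API; the class names, the `available` check, and the `fill_rect` operation are all hypothetical, and the "hardware" backend here only pretends to talk to a device.

```python
class SoftwareBackend:
    """CPU rendering: always usable, but does the work pixel by pixel."""
    name = "software"

    def available(self):
        return True

    def fill_rect(self, surface, x, y, w, h, color):
        for row in range(y, y + h):
            for col in range(x, x + w):
                surface[row][col] = color


class HardwareBackend:
    """Placeholder for an accelerated backend (no real device is touched)."""
    name = "hardware"

    def __init__(self, device_present):
        self.device_present = device_present

    def available(self):
        return self.device_present

    def fill_rect(self, surface, x, y, w, h, color):
        # A real backend would queue a fill command for the GPU; this
        # placeholder writes whole row spans to stand in for that.
        for row in range(y, y + h):
            surface[row][x:x + w] = [color] * w


def choose_backend(backends):
    """Return the first usable backend, in order of preference."""
    for backend in backends:
        if backend.available():
            return backend
    raise RuntimeError("no rendering backend available")
```

Because both backends expose the same `fill_rect` call, the layers above do not need to know which one was picked: `choose_backend([HardwareBackend(False), SoftwareBackend()])` quietly returns the software renderer.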
This is where you need to decide how you are going to handle the differences between rendering 2-D images such as basic windows and widgets, and the more impressive but also more processing-intensive 3-D rendering. While the fact that you can treat 2-D as a special case of 3-D means it is tempting to use 3-D for everything, that approach has some significant downsides, especially on older hardware. You may need to consider where you can use less general 2-D rendering to avoid a lot of unnecessary work for the hardware.
You also need to look at how you separate different renderable elements such as glyphs (letters, digits, text symbols, etc.), widgets (window borders, menus, icons, the mouse pointer), 2-D images such as drawings and pictures, 3-D manipulable objects, etc. This relates to, and raises an issue for, the next layer of the stack, the compositor. However, before that I need to mention another part of this layer, the widget toolkit.
Widget Toolkit Layer
The widget toolkit is the set of primitive widgets - window frames, menus, drawing spaces, textboxes, text areas, radio buttons, checkboxes, etc. - that a window manager uses. In most designs, this is not a separate layer from the renderer, but side-by-side with it, and the widgets have to work together with the compositor as well.
Compositing Layer
The compositor is the part that combines the individual elements being rendered into the instantaneous display state, that is, the screen as it is at a given moment. In a 2-D design, this is usually done by the renderer directly, but 3-D UIs almost always have a separate compositor.
OK, quick history lesson. Early 2-D windowing systems generally composited in situ, that is, directly into the display. However, while this was feasible with the stroke-vector displays of the 1960s, or on raster displays that used fixed cells drawn from tables of glyphs such as PLATO and the majority of text-oriented terminals, this was problematic for bitmapped video systems even from the outset, as it meant that a large block of memory - often as much as 30% of system memory in the days of the Alto and 128K Macintosh - had to be set aside for the video, and the timing of drawing had to be synced with the vertical refresh in order to avoid flicker.
While double buffering was part of the answer, it ran into issues with time: copying that much data would take longer than the vertical refresh interval, so a workable double buffer needed to be implemented in hardware. You would have to dedicate two buffers' worth of memory in hardware (one to drive the video, and the other to draw to), and the display would need even more hardware to let it switch which of the video buffers was driving it. Pretty much every video system today supports this as a matter of course. However, this did nothing for the case where you have to copy a bitmapped image from general memory - something loaded from a file, say - into the drawing video buffer.
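The key property of hardware double buffering is that switching buffers changes which one is scanned out, without copying any pixels. A minimal sketch of that idea follows; the `DoubleBufferedDisplay` class is hypothetical, the buffers are plain `bytearray`s at an assumed one byte per pixel, and `flip` stands in for whatever register write a real adapter would need at vertical refresh.

```python
class DoubleBufferedDisplay:
    """Two equally sized buffers whose roles swap on each flip."""

    def __init__(self, width, height):
        size = width * height  # assumption: one byte per pixel
        self.buffers = [bytearray(size), bytearray(size)]
        self.front = 0  # index of the buffer currently driving the display

    @property
    def back(self):
        """Index of the buffer that is safe to draw into."""
        return 1 - self.front

    def draw_buffer(self):
        return self.buffers[self.back]

    def scanout_buffer(self):
        return self.buffers[self.front]

    def flip(self):
        """Swap the roles of the two buffers; no pixel data is copied."""
        self.front = self.back
```

Drawing always goes to `draw_buffer()`, so a half-finished frame is never visible. After `flip()`, the old front buffer becomes the new draw target and still holds the frame before last, which is why many renderers redraw or re-blit everything they need each frame.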
In order to cut the time further, developers at Xerox PARC developed a technique called 'BitBLT' (bit block transfer), in which a rectangular block of an image is copied into the video buffer and combined with what is already there, often through a mask so that only the pixels selected by the mask are drawn. Other techniques, such as hardware sprites (which were drawn directly to the screen, bypassing the video buffer entirely) were also developed, but were mostly used in dedicated gaming and video editing systems.
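A masked block transfer can be sketched in a few lines. This is a simplified model, not the original PARC implementation: bitmaps are lists of rows of integers rather than packed bits, and the `bitblt` function, its parameters, and the `op` callback (standing in for a raster operation) are hypothetical names chosen for illustration.

```python
def bitblt(dest, dest_x, dest_y, src, mask=None, op=lambda s, d: s):
    """Copy the block `src` into `dest` with its top-left at (dest_x, dest_y).

    `op` combines a source pixel with the existing destination pixel; the
    default simply overwrites. Where `mask` holds 0, the destination is
    left untouched. Anything falling outside `dest` is clipped.
    """
    dest_h = len(dest)
    dest_w = len(dest[0]) if dest else 0
    for sy, src_row in enumerate(src):
        dy = dest_y + sy
        if not 0 <= dy < dest_h:
            continue  # this source row lands outside the destination
        for sx, pixel in enumerate(src_row):
            dx = dest_x + sx
            if not 0 <= dx < dest_w:
                continue  # this source column lands outside the destination
            if mask is not None and not mask[sy][sx]:
                continue  # masked out: keep the existing destination pixel
            dest[dy][dx] = op(pixel, dest[dy][dx])
```

Passing `op=lambda s, d: s ^ d` gives an XOR transfer, which was a popular trick for cursors because applying it a second time restores the original pixels.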
I mention all this to get to compositing. Up until 2006 or so, compositing for a window manager was done mainly as a 2-D operation, and generally was focused on a) determining which parts of the display had changed, b) determining which parts of the screen were visible, and c) blitting the visible sections of a window that were being changed to the draw buffer. This was generally easier for a tiling window manager, as there was no z-ordering: no windows overlapped, so everything could be drawn, and you could divide the windows into those which had changed and those which hadn't. Layering (stacking) window managers were a little more complicated because some windows might obscure parts of others, but generally it wasn't too difficult. Even so, 2-D hardware acceleration was still very useful for this, even if it wasn't absolutely necessary.
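The three steps above can be sketched with plain rectangle arithmetic. The following is one possible approach rather than how any particular window manager does it: rectangles are `(left, top, right, bottom)` tuples with exclusive right and bottom edges, and the function names are hypothetical.

```python
def intersect(a, b):
    """Intersection of two (left, top, right, bottom) rectangles, or None."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left >= right or top >= bottom:
        return None
    return (left, top, right, bottom)


def subtract(rect, hole):
    """Parts of `rect` not covered by `hole`, as up to four rectangles."""
    overlap = intersect(rect, hole)
    if overlap is None:
        return [rect]
    left, top, right, bottom = rect
    o_left, o_top, o_right, o_bottom = overlap
    pieces = []
    if top < o_top:
        pieces.append((left, top, right, o_top))          # strip above
    if o_bottom < bottom:
        pieces.append((left, o_bottom, right, bottom))    # strip below
    if left < o_left:
        pieces.append((left, o_top, o_left, o_bottom))    # strip to the left
    if o_right < right:
        pieces.append((o_right, o_top, right, o_bottom))  # strip to the right
    return pieces


def regions_to_blit(damage, window, windows_above):
    """Visible parts of `window` that lie inside the damaged rectangle."""
    start = intersect(damage, window)
    visible = [start] if start else []
    for upper in windows_above:
        visible = [piece for rect in visible for piece in subtract(rect, upper)]
    return visible
```

For a tiling window manager `windows_above` is always empty, so the result is just the damaged part of the window; for overlapping windows, each window higher in the z-order punches a hole in what needs to be blitted.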
With the introduction of 3-D layered UIs such as Aqua and Aero, the issue of combining things became much more complex, leading to the need for a separate compositor layer. Most major window managers today have a 3-D compositor, and for a time it was almost impossible to get good performance from one without a dedicated GPU, meaning software rendering was out of the question even for the basic GUI, leading to issues that previously were mostly seen in gaming.
However, because this still has a performance cost, there are several Linux desktop managers (see below) which only use 2-D compositing, a number of which (e.g., Notion, Awesome) are (or initially were) specifically tiling systems designed to avoid applying significant amounts of compositing. Many of these are also designed to be used keyboard-only, for those who prefer to avoid using the mouse, which is a separate issue but is often seen as going hand in hand with the performance concerns.
Window Management Layer
Getting back on track, we now get to the window manager itself, which is the part that actually decides where to put each rendered component, sets things related to the way widgets interact, and just generally, well, manages the windows. This is what X Window System was from the outset, and it acts as the glue between the lower level aspects of the GUI and the more abstract parts such as the desktop manager.
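As a small example of "deciding where to put each rendered component", here is a sketch of a placement policy for a tiling window manager. The master-and-stack layout it uses (first window on the left, the rest stacked in a column on the right) is just one common arrangement picked for illustration, and the `tile` function and its parameters are hypothetical.

```python
def tile(screen_w, screen_h, window_ids, master_fraction=0.5):
    """Return {window_id: (x, y, w, h)} for a master-and-stack layout."""
    if not window_ids:
        return {}
    if len(window_ids) == 1:
        # A lone window simply gets the whole screen.
        return {window_ids[0]: (0, 0, screen_w, screen_h)}
    master_w = int(screen_w * master_fraction)
    layout = {window_ids[0]: (0, 0, master_w, screen_h)}
    stack = window_ids[1:]
    slot_h = screen_h // len(stack)
    for i, wid in enumerate(stack):
        y = i * slot_h
        # The last window absorbs any rounding remainder so that no rows
        # at the bottom of the screen are left uncovered.
        h = screen_h - y if i == len(stack) - 1 else slot_h
        layout[wid] = (master_w, y, screen_w - master_w, h)
    return layout
```

Since no two rectangles overlap, the output of a policy like this can be handed straight to the renderer without any visibility calculation, which is the simplification tiling systems rely on.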
Desktop Management Layer
The next layer is the desktop manager, and this is what most people are actually thinking of when they talk of a GUI, and of the differences between Windows, Mac, and the various Linux desktops such as KDE, GNOME, Unity, Xfce, Cinnamon, MATE, and so forth.
Application Layer
The Application Layer is the part which is internal to the program itself.