There is no fixed way to implement video handling, and many hobby operating systems develop their UI as a single unit, at least initially. However, for a number of practical reasons, most systems end up separating the details of communicating with the hardware from the process of rendering the image, and that in turn from the management of higher-level concerns such as window layout. In most systems, this leads to a layered approach that could be called the graphics stack, analogous to the networking stack.
A Generalized Graphics Stack
There is no fixed model for how to layer the graphics stack, though most are fairly similar. A general model for a graphics stack might look like this:
- Application Layer
- Interoperation Layers
  - Desktop Management Layer
  - Window Management Layer
- Presentation Layers
  - Compositing Layer
  - Widget Toolkit Layer
  - Rendering Layer
- Display Layers
  - Device Driver Layer
  - Hardware Layer
This is close to most of the well-known graphics stacks, such as those of Microsoft Windows, macOS, the X Window System, and Wayland, though not identical to any of them. The order of layers may differ, some layers may be missing or merged together, and some systems may have additional layers or even multiple stacks. For example, the X Window System has separate Client and Server stacks, each with its own Display Layers, and splits some aspects of the remaining stacks between the Client, which is the program requesting the display, and the Server, which actually does the display. Since the Client-Server relationship in X is potentially many-to-many, and even when it isn't, different applications may need different degrees of control, several details cannot be said to belong strictly to either the Server or the Client.
Furthermore, some layers may be more in parallel than sequential. Presentation, in particular, can involve some fairly complex relationships.
A more detailed review of these proposed layers is as follows:
Device Driver Layer
At the lowest software level, we have the device drivers, which communicate with the actual hardware. These need to be able to work with either the specific display devices - the video memory, the GPU if any, the video signal generators, and even the monitor - or some common subset of functionality that they share with other, disparate adapters. However, this does not mean that the driver must do all the work alone. The VESA VBE Core standard defines a minimal standard interface to the hardware as a BIOS extension, which a compliant video adapter should provide as a way of interfacing with the hardware without needing any proprietary details of the adapter.
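To make the driver's most basic job concrete, here is a minimal sketch of writing a pixel into a linear framebuffer of the kind a VBE mode can expose. It assumes a 32-bits-per-pixel mode with a 0x00RRGGBB layout; the `fb_info` struct, its field names, and both function names are hypothetical, loosely mirroring the sort of information a mode query reports (resolution, bytes per scan line, framebuffer address).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a linear framebuffer (names are assumptions). */
struct fb_info {
    uint8_t *base;    /* address where the framebuffer is mapped */
    uint32_t width;   /* visible pixels per row */
    uint32_t height;  /* visible rows */
    uint32_t pitch;   /* bytes per scan line; may be larger than width * 4 */
};

/* Pack 8-bit channels as 0x00RRGGBB (assumed pixel layout). */
static inline uint32_t fb_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Write one pixel. Returns 0 on success, -1 if (x, y) is off screen. */
static int fb_put_pixel(const struct fb_info *fb, uint32_t x, uint32_t y,
                        uint32_t color)
{
    if (x >= fb->width || y >= fb->height)
        return -1;

    /* Rows are addressed by pitch, not width, because of possible padding. */
    volatile uint32_t *row =
        (volatile uint32_t *)(fb->base + (size_t)y * fb->pitch);
    row[x] = color;
    return 0;
}
```

The point of using the pitch rather than the width for row addressing is that adapters are free to pad each scan line, so the two values cannot be assumed equal.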
Somewhere here you would find things like the Mesa driver framework and the Direct Rendering Manager (DRM). This level, which doesn't have a formal name in most systems, at least as far as I know, is the first abstraction layer that a software system (not necessarily the operating system itself) provides to give a uniform model for drawing pixels on the screen, while still exposing the underlying hardware. The split between 2-D and 3-D often starts around here, as a 3-D renderer generally needs a lot more direct hardware access than a 2-D one.
Rendering Layer
The next level is the renderer. This is where you really see 3-D becoming a separate thing, as most systems prior to, say, 2007 would have used strictly 2-D rendering for everything that didn't specifically require 3-D rendering, because practical real-time 3-D rendering needed hardware acceleration at the time. As Brendan has pointed out before, the Wheel of Reincarnation for graphics rendering has been swinging towards CPU-driven rendering since 2012 or so, though dedicated rendering hardware is still dominant at the moment. Note, however, that the graphics rendering Wheel of Reincarnation has been rolling since the very first days of computer graphics in the early 1960s, so it is a good guess that this won't be the last word on the subject.
Anyway, Mesa proper started out in the 1990s as a software 3-D renderer, but it is now used mainly as an abstraction over the rendering hardware, with software rendering serving more as a fallback mode.
This is where you need to decide how you are going to handle the differences between rendering 2-D images such as basic windows and widgets, and the more impressive but also more processing-intensive 3-D rendering. Because you can treat 2-D as a special case of 3-D, it is tempting to use 3-D for everything, but that approach has some significant downsides, especially on older hardware; you may need to consider where you can use less general 2-D rendering to avoid a lot of hardware crunching.
You also need to look at how you separate different renderable elements such as glyphs (letters, digits, text symbols, etc.), widgets (window borders, menus, icons, the mouse pointer), 2-D images such as drawings and pictures, 3-D manipulatable objects, etc. This relates to, and raises issues for, the next layer of the stack, the compositor. However, before that I need to mention another part of this layer, the widget toolkit.
Widget Toolkit Layer
The widget toolkit is the set of primitive widgets - window frames, menus, drawing spaces, textboxes, text areas, radio buttons, checkboxes, etc. - that a window manager uses. This is not a separate layer from the renderer, but sits side-by-side with it, and the widgets have to work together with the compositor.
Compositing Layer
The compositor is the part that combines the individual elements being rendered into the instantaneous display state, that is, the screen as it is at a given moment. In a 2-D design, this is usually done by the renderer directly, but 3-D UIs almost always have a separate compositor.
OK, quick history lesson. Early 2-D windowing systems generally composited in situ, that is, directly into the display. However, while this was feasible with the stroke-vector displays of the 1960s, or on raster displays that used fixed cells drawn from tables of glyphs such as PLATO and the majority of text-oriented terminals, it was problematic for bitmapped video systems even from the outset, as it meant that a large block of memory - often as much as 30% of system memory in the days of the Alto and 128K Macintosh - had to be set aside for the video, and the timing of drawing had to be synced with the vertical refresh in order to avoid flicker.
While double buffering was part of the answer, it ran into issues with time - copying that much data would take longer than the vertical refresh, so a workable double buffer needed to be done by hardware. You would have to dedicate two buffers' worth of memory in hardware (one to drive the video, and the other to draw to), and the display would need even more hardware to let it switch which of the video buffers was driving it in order to make it work. Pretty much every video system today supports this as a matter of course. However, this did nothing for when you have to copy a bitmapped image from general memory - something loaded from a file, say - into the drawing video buffer.
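The flipping scheme described above can be sketched as follows. This is only an illustration under stated assumptions: both pages live in ordinary memory here, the tiny fixed resolution is arbitrary, and all names (`double_buffer`, `db_back`, `db_front`, `db_flip`, `db_clear_back`) are hypothetical. On real hardware, the flip would reprogram the display start address, normally during vertical blanking, instead of just toggling an index.

```c
#include <stdint.h>

#define DB_WIDTH  8   /* arbitrary demo resolution */
#define DB_HEIGHT 4

/* Two pages: one is scanned out, the other is drawn to. */
struct double_buffer {
    uint32_t pages[2][DB_WIDTH * DB_HEIGHT];
    int front;  /* index of the page currently driving the display */
};

/* The page that is safe to draw into. */
static uint32_t *db_back(struct double_buffer *db)
{
    return db->pages[1 - db->front];
}

/* The page currently being displayed. */
static const uint32_t *db_front(const struct double_buffer *db)
{
    return db->pages[db->front];
}

/* Swap roles without copying any pixels. */
static void db_flip(struct double_buffer *db)
{
    db->front = 1 - db->front;
}

/* Fill the back page with a single color. */
static void db_clear_back(struct double_buffer *db, uint32_t color)
{
    uint32_t *p = db_back(db);
    for (int i = 0; i < DB_WIDTH * DB_HEIGHT; i++)
        p[i] = color;
}
```

Nothing is copied at flip time, which is the whole point: the cost of presenting a frame no longer depends on the size of the buffer.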
In order to cut the time further, they developed Bit BLT (bit block transfer), a method in which a rectangular block of a source bitmap is combined with the destination using a boolean operation, often through a mask, so that only the part of the image selected by the mask is drawn to the video buffer. Other techniques, such as hardware sprites (which were drawn directly to the screen, bypassing the video buffer entirely) were also developed, but were mostly used in dedicated gaming and video editing systems.
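As a rough illustration of the masked-transfer idea, the sketch below copies a source block onto a destination, writing only the pixels that the mask selects. It is a simplification: classic Bit BLT works with boolean raster operations on bit planes, while this version uses one mask byte per 32-bit pixel. The `surface` struct and the `blit_masked` function are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical 32-bpp surface with tightly packed rows. */
struct surface {
    uint32_t *pixels;
    int width;
    int height;
};

/*
 * Copy src onto dst with its top-left corner at (dx, dy).
 * mask has the same dimensions as src, one byte per pixel;
 * a pixel is written only where its mask byte is non-zero.
 * Pixels that would land outside dst are skipped.
 */
static void blit_masked(struct surface *dst, const struct surface *src,
                        const uint8_t *mask, int dx, int dy)
{
    for (int sy = 0; sy < src->height; sy++) {
        int y = dy + sy;
        if (y < 0 || y >= dst->height)
            continue;
        for (int sx = 0; sx < src->width; sx++) {
            int x = dx + sx;
            if (x < 0 || x >= dst->width)
                continue;
            if (mask[sy * src->width + sx])
                dst->pixels[y * dst->width + x] =
                    src->pixels[sy * src->width + sx];
        }
    }
}
```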
I mention all this to get to compositing. Up until 2006 or so, compositing for a window manager was done mainly as a 2-D action, and was generally focused on a) determining which parts of the display had changed, b) determining which parts of the screen were observable, and c) blitting the observable sections of a window that had changed to the draw buffer. This was generally easier for a tiling window manager, as there was no z-ordering - no windows overlapped, so everything could be drawn, and you could divide the windows into those which had changed and those which hadn't. Layering window managers were a little more complicated because some windows might obscure parts of others, but generally it wasn't too difficult. Even so, 2-D hardware acceleration was still very useful for this, even if it wasn't absolutely necessary.
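The changed-and-observable test at the heart of this kind of 2-D compositing comes down to rectangle intersection. The sketch below covers only the simple case with no overlapping windows (as in a tiling window manager): it clips a window's changed region against the window and the screen. The `rect` struct and both function names are hypothetical; handling windows that obscure one another would require subtracting the rectangles of the windows above, which is not shown.

```c
#include <stdbool.h>

/* Axis-aligned rectangle: top-left corner plus size. */
struct rect {
    int x, y, w, h;
};

/* Store the overlap of a and b in *out; return false if they don't overlap. */
static bool rect_intersect(struct rect a, struct rect b, struct rect *out)
{
    int x0 = a.x > b.x ? a.x : b.x;
    int y0 = a.y > b.y ? a.y : b.y;
    int x1 = (a.x + a.w) < (b.x + b.w) ? (a.x + a.w) : (b.x + b.w);
    int y1 = (a.y + a.h) < (b.y + b.h) ? (a.y + a.h) : (b.y + b.h);

    if (x1 <= x0 || y1 <= y0)
        return false;

    out->x = x0;
    out->y = y0;
    out->w = x1 - x0;
    out->h = y1 - y0;
    return true;
}

/*
 * Find the region that needs blitting: the part of the changed
 * (damaged) area that lies inside both the window and the screen.
 * All rectangles are in screen coordinates.
 */
static bool damage_to_blit(struct rect screen, struct rect window,
                           struct rect damage, struct rect *out)
{
    struct rect visible;
    if (!rect_intersect(screen, window, &visible))
        return false;
    return rect_intersect(visible, damage, out);
}
```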
With the introduction of 3-D layered UIs such as Aqua and Aero, the issue of combining things became much more complex, leading to the need for a separate compositor layer. Most major window managers today have a 3-D compositor, and for a time it was almost impossible to get good performance from one without a dedicated GPU, meaning software rendering was out of the question even for the basic GUI, leading to issues that previously were mostly seen in gaming.
Window Management Layer
Getting back on track, we now get to the window manager itself, which is the part that actually decides where to put each rendered component, sets things related to the way widgets interact, and just generally, well, manages the windows. This is what X Window System was from the outset, and it acts as the glue between the lower level aspects of the GUI and the more abstract parts such as the desktop manager.
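One small piece of what "managing the windows" can mean in a layering window manager is keeping track of stacking order. The following sketch assumes a plain array ordered from bottom-most to top-most; the `window` struct and the `wm_window_at` and `wm_raise` functions are hypothetical and not taken from any real window manager.

```c
#include <stddef.h>

/* Hypothetical window record: an identifier plus its screen rectangle. */
struct window {
    int id;
    int x, y, w, h;
};

/*
 * Find the top-most window containing the point (px, py).
 * windows[0] is the bottom of the stack, windows[count - 1] the top.
 * Returns NULL if the point is over the bare desktop.
 */
static const struct window *wm_window_at(const struct window *windows,
                                         size_t count, int px, int py)
{
    for (size_t i = count; i-- > 0; ) {
        const struct window *w = &windows[i];
        if (px >= w->x && px < w->x + w->w &&
            py >= w->y && py < w->y + w->h)
            return w;
    }
    return NULL;
}

/* Move windows[index] to the top of the stack, keeping the others in order. */
static void wm_raise(struct window *windows, size_t count, size_t index)
{
    if (index >= count)
        return;
    struct window top = windows[index];
    for (size_t i = index; i + 1 < count; i++)
        windows[i] = windows[i + 1];
    windows[count - 1] = top;
}
```

A typical use would be to call `wm_window_at` on a mouse click and then `wm_raise` on the result, so the clicked window ends up on top.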
Desktop Management Layer
The next layer is the desktop manager, and this is what most people are actually thinking of when they talk of a GUI, and of the differences between Windows, Mac, and the various Linux desktops such as KDE, GNOME, Unity, Xfce, Cinnamon, MATE, and so forth.
Application Layer
The Application Layer is the part which is internal to the program itself.