3D Renderer Basics
3D graphics is something most operating systems want to support one day. Unfortunately, video drivers that support GPUs are extremely complex to implement, and it's hard to find information that covers the basics of software rendering.
The goal here is to describe the simplest 3D rendering pipeline possible, and then show ideas to improve on it, to give the reader (you) a basic understanding. Please note we do not intend this to be "state-of-the art" and will cover nothing advanced and is also not intended to be maths tutorial and won't include formulas (these can be found with a simple web search).
Before a 3D renderer can do anything it needs data describing a scene to render. A scene shall comprise a camera, a world, and objects within that world.
To describe a camera you need the camera's location (x, y, z), which direction it is facing (a "front vector") and which direction is the top of the camera ("up vector"). You can think of a vector as a displacement in 3D space (amount to add to x, amount to add to y, amount to add to z). For the mathematics you'll want later it's best to use unit vectors for the camera's "front vector" and "up vector". A unit vector is a vector where the distance is 1. To convert a vector into a unit vector divide each component by the distance (e.g. "distance = sqrt(x*x + y*y + z*z); x /= distance; y /= distance; z /= distance;").
A camera might also have other data associated with it (e.g. field of view).
The world is a special kind of object that may be literally nothing (a coordinate system only), or may describe things that can not move (e.g. terrain, sky). The world needs no direction vectors (it is the primary coordinate system that everything else is relative to), but otherwise can be the same as an object.
An object comprises another coordinate system (the object's "local coordinate system"); several values to describe where the object is in the world's coordinate system, one or meshes, plus the data that each mesh needs. For simplicity (initially) you can assume that an object only ever has one mesh.
The values to describe where the object is in the world's coordinate system are a coordinate describing where the object's local coordinate system's origin would end up in the world's coordinate system; plus a "front vector" and "up vector" to describe the object's orientation within the world's coordinate system.
A mesh comprises a list of coordinates, then a list of polygons.
Coordinates are just a location within a coordinate system, typically described as 3 values "(x, y, z)" with no additional information. For performance it is recommended that coordinates are stored in "structure of arrays" format (and not "array of structures" format). What this means is that you do something like "float x[COORDS]; float y[COORDS]; float z[COORDS];" and not "struct coord myCoords[COORDS];". The reason for this is that it makes it much easier to use SIMD (SSE, AVX) to process many coords.
Note: Sometime people get "coordinate" and "vertex" confused or use them interchangeably. In my opinion this is wrong. For this wiki page, a vertex is a coordinate with additional information, and is not the same as a coordinate that has no addition information.
For simplicity, we will assume there are only 2 kinds of polygons - textured polygons and gradient polygons.
Every polygon has a set of flags, where one/some of the flags are used to determine what type of polygon it is (and the other flags are used for other things described later), and every polygon has a "number of vertices". Everything else depends on the type of polygon.
For textured polygons the polygon also has some kind of "texture reference". Each vertex used by a textured polygon has a reference to the coordinate (in the meshes' list of coordinates) plus a location in the texture (typically a "(u, v)" coordinate within the texture the polygon uses).
Note that for repeating patterns the texture coordinate may be larger than the texture's dimensions. When drawing polygons on the screen later on, the renderer will do something like "u = u % texture_width; v = v % texture_height" for each pixel, so that if the"(u, v)" coordinate within the polygon's vertex is larger than the size of the texture then the texture's ends up being repeated across the polygon. This is very beneficial for things like (e.g.) brick walls, where the same "brick texture" might be repeated over a large area, because that large area can be described with one polygon rather than many polygons.
For gradient polygons; each vertex has a reference to the coordinate (in the meshes' list of coordinates) plus a colour. When drawing polygons on the screen later on, the renderer will use interpolation to determine a pixel's colour. For example, if the polygon is a triangle and has a top vertex that is blue, a left vertex that is red and a right vertex that is green; then we'd find the points on the left edge and right edge where a horizontal line through the pixel would intersect those edges and use the vertices' colours and the distances between those vertices along the edge to determine a "left colour" and a "right colour"; then use distances that the pixel is between the left and right (and the "left colour" and a "right colour") to determine the pixel's colour.
Coordinate Systems and Transformations
There are multiple different coordinate systems (each object's local coordinate system, the world's coordinate system, the camera's coordinate system, then the screen's coordinate system). One of the first things the 3D renderer will need to do is transform all coordinates (from objects and the world) into screen coordinates. This involves is multiple steps (because there multiple coordinate systems).
Most steps involve building a rotation matrix (from the "front vector" and "up vector" of one coordinate system) and a translation matrix (from the "location of origin in the other coordinate system"); and then multiplying each coordinate (in each object's meshes' coordinate list) by each matrix in order. However, this can be optimised by multiplying the matrices together, so you only end up multiplying a coordinate by a single "all steps combined" transformation matrix.
The first step is building a transformation matrix to convert coordinates in the camera's coordinate system into screen coordinates. This is special and unlike the other steps. For this you need a projection matrix (which essentially does "x = x/z; y = y/z"), a scaling matrix (to scale coordinates to suit the screen's resolution, possibly derived from the camera's field of view), and a translation matrix (which subtracts half the screen's horizontal resolution from x values and subtracts half the screen's vertical resolution from y values). These 3 matrices should be combined (multiplied) to form the final "camera to screen transformation matrix".
Note: This is a simplification used to keep things in a "nicer to understand" order. See the warning in "Viewing Volume Warning" section below about problems involving "z = 0" and division by zero.
The second step is building a transformation matrix to convert coordinates in the world's coordinate system into the camera's coordinate system. This involves creating a rotation matrix from the camera's "front vector" and "up vector", creating a translation matrix from the camera's location in world space, then combining these 2 matrices with the "camera to screen transformation matrix" from the previous step to get a final "world to screen transformation matrix".
Once you have the "world to screen transformation matrix" it can be applied to world coordinates to convert them to screen coordinates, and this is what makes the world special (because you need one less step to create the transformation matrix for the world's mesh).
The third step is, for each object in the world, use the object's "front vector" and "up vector" to create a rotation matrix, and the object's position to create a translation matrix; then combine the object's 2 matrices with the "world to screen transformation matrix" to get a "object to screen transformation matrix" for the object. Once you have the "object to screen transformation matrix" for an object it can be applied to the object's coordinates to convert them to screen coordinates.
When you've finished all of this (applied the transformation matrices to the world and all objects in the world), all coordinates will be screen coordinates.
Back Face Culling
To improve performance we want to discard ("cull") polygons that can't be seen as early as possible (after doing the least unnecessary work for them). For opaque objects that have thickness, some polygons will be on the object's front side and some will be on the object's back side, and the polygons on the object's back side can't be seen (and can be culled). It turns out it's easy to test for this.
If the vertices of a polygon are in a specific order (e.g. clockwise when looking at the polygon), after coordinates are transformed into screen coordinates you can test if the polygon's screen coordinates are in the same specific order (e.g. clockwise) by using any 3 coordinates in sequence (e.g. just the first 3 coordinates) to find 2 vectors ("first to middle" and "middle to last"), and then calculate the vector dot product of those 2 vectors. The sign of the vector dot product will determine if the polygon is a front face or a back face. If it is a back face, then the polygon can be discarded.
Note: You may want to use one of the polygon's flags for "if this flag is set, discard polygon if it's facing backwards". This allows you to disable back face culling for some polygons (e.g. if you want a "zero thickness" fence that looks the same from both sides, then you can use one polygon for both sides instead of having 2 polygons). Also; if you know that the polygon will be a front face for some other reason (e.g. because it's part of the sky and you don't allow the camera to get higher than the sky) you can clear the flag to bypass the overhead of the ("known unnecessary") back face culling test.
Viewing Volume Culling
If all of a polygon's z coordinates are negative, then the entire polygon must be behind the camera and impossible to see, and that polygon can be culled. This is (should be) very easy to understand.
Now imagine a "viewing volume" described by a left plane, a top plane, a right plane, a bottom plane, a near plane and a far plane; where everything inside that volume is visible and everything outside that volume is not visible. With screen coordinates, the "near plane" is described by "z = 0" so if all of a polygon's z coordinates are negative the polygon must be outside the viewing volume and the polygon can be culled.
The other planes are similar. The far plane can be described by "z = some far away constant" and if all of a polygon's z coordinates are larger than "some far away constant" then the polygon must be outside the viewing volume and the polygon can be culled.
The left plane can be described by "x = 0" (and if all of a polygon's z coordinates are negative then the polygon must be outside the viewing volume and the polygon can be culled). The right plane can be described by "x = horizontal_resolution". The top plane can be described by "y = 0". The bottom plane can be described by "y = vertical_resolution".
Viewing Volume Clipping
If some of a polygon's z coordinates are negative but some aren't; then you can find the points on the polygons edges that intersect with "z = 0" and clip the polygon at those new points. The same applies to all other planes that describe the viewing volume.
To do this, start by finding the intersection points and then calculating the other data that a vertex needs - for a gradient polygon you need to calculate the colour for the vertex at the intersection point, and for a textured polygon you need to calculate the texture coordinate for the new vertex at the intersection point. The new vertices can be inserted into the polygon's list of vertices. Then it starts to get slightly tricky.
There are 2 types of polygon shapes - convex and concave. A convex polygon is a polygon where any line that passes through the polygon can only intersect with 2 points on the polygon's circumference. A concave polygon is a polygon where a line that passes through the polygon might intersect with more than 2 points on the polygon's circumference. Convex polygons complicate things throughout the renderer, and for this reason it might be a good idea to say "all polygons must be convex" (and not bother supporting conc(left, right, top, bottom)ave polygons).
If a polygon is convex, then you will have found (at most) 2 intersection points and created 2 new vertices at those intersection points; and all of the original polygons vertices that fail your clipping test (e.g. have a negative z if you're clipping to "z = 0") can be discarded.
If a polygon is concave, then you may have found 2 or more intersection points and created 2 or more vertices at those intersection points. If you found 2 intersection points then it becomes the same as convex polygons (simply discard any vertices that fail your clipping test). Otherwise you have a choice: you can leave the polygon alone and deal with "incomplete clipping" later (during rasterisation); or you can deal with it during viewing volume clipping. If you do deal with this case during viewing volume clipping then it gets complicated, partly because you can end up with multiple polygons.
Note 1: It is entirely possible to not do viewing volume clipping at all, and always leave it until later (during rasterisation). However, it's better for performance to clip a smaller number of whole polygons and avoid the need to do clipping on a much larger number of line fragments.
Note 2: Depending on what you felt like doing (e.g. if you support convex polygons, etc); it may be a good idea to use a polygon flag to indicate if the polygon was fully clipped or partially clipped, so that (during rasterisation) you don't do a bunch of unnecessary clipping tests for polygons that were fully clipped.
Viewing Volume Warning
Above I've described the near plane (used for viewing volume culling and viewing volume clipping) as "z = 0". In practice this is a bad idea because you end up having to deal with the possibility of division by zero. One of the places where you will have to worry about division by zero (caused by "z = 0") is in the projection matrix mentioned above (in the "Coordinate Systems and Transformations" section), where that projection matrix does something like "x = x/z; y = y/z".
For this reason it's much better to use "z = some small positive value" (instead of "z = 0") for the near plane (used for viewing volume culling and viewing volume clipping); and then apply the projection matrix after viewing volume culling and viewing volume clipping. However this complicates things again, because it makes volume culling and viewing volume clipping complicated for the other (left, right, top, bottom) planes.
The practical solution is to do viewing volume culling and viewing volume clipping for the near plane (and maybe the far plane too); then apply the projection matrix (after you know that z must be a non-zero positive number for all remaining coordinates); then do viewing volume culling and viewing volume clipping for the remaining (left, right, top, bottom, and maybe far) planes after that.
The idea of rasterisation is to convert each polygon into horizontal line fragments. In the most simplest way, this can be done by finding the coordinates with the highest and lowest y value, then doing a "for each y from lowest to highest" loop where you find the left and right intersections for each horizontal line fragment.
For convex polygons the simplest way isn't very good - it can be improved a lot by keeping track of "left edge" and "right edge" and calculating the intersection from that, switching to the next edge when you y goes beyond an edge's vertex.
For concave polygons it gets complicated again - for each y you can multiple intersections (where the number of intersections is always even). For example, for a shape like the letter "W", near the bottom of the shape you might get 4 pairs of intersections, resulting in 4 horizontal line fragments.
Because we're using 2 different types of polygons (textured polygons and gradient polygons) we will end up with 2 different types of line fragments (textured fragments and gradient fragments).
For textured polygons you want to know the "screen y" for the fragment, which texture it is, the left intersection's x and its texture coordinates (u1, v1), and the right intersection's x and its texture coordinates (u2, v2).
For gradient polygons you want to know the "screen y" for the fragment, the left intersection's x and its colour, and the right intersection's x and its colour.
Note: To avoid distortion caused by "non-linear depth" you might also want to know the left and right intersection's z for each fragment (if you take that into account later when determining what colour each pixel would be). Because correcting this distortion can be complicated I'd recommend not bothering initially (and worrying about it later, after you've got the renderer working).
Once you have fragments, you can draw them on the screen (sort of). The basic idea is to have a loop that goes from the fragment's left x to the fragment right x; where you find the pixel at the current x (and the fragment's y), then determine the colour for that pixel from the rest of the fragment's data.
For textured fragments you want to use the current x to find the texture coordinate between "left intersection (u1, v1)" and "right intersection (u2, v2)", and then do texture coordinate reduction mentioned earlier for repetition ("u = u % texture_width; v = v % texture_height"), then find the colour of the textel at the resulting texture coordinate and use that for the pixel.
For gradient fragments you want to use the current x, "left colour" and "right colour" to determine the pixel's colour.
If you simply draw fragments, then you're going to get an ugly mess - polygons that are behind other polygons may be drawn as if they were in front instead. Occlusion is all about preventing that problem. There are several different ways to do occlusion, each with their own advantages/disadvantages.
One way to do occlusion sounds simple - just draw everything in "back to front" order. In practice, determining the correct order to draw polygons can be expensive when there's many polygons, and can be very complicated (up to "impossible without splitting polygons up" - e.g. if polygon A is in front of polygon B, polygon B is in front of polygon C, and polygon C is in front of polygon A).
Another way to do occlusion is to use a buffer to store the "current Z for each pixel" that is initially (at the start of rendering a frame) set to "all z coordinates are far away". When you're drawing a fragment you determine the z coordinate for the pixel (for that fragment) and compare it with the z value in the buffer. If the z value in the buffer is smaller then this pixel is behind something and not drawn, and if the z value in the buffer is larger then this pixel is in front of whatever was previously drawn and you set the pixel's colour and also set the z value in the buffer.
There are a few problem with this. The first problem is when you the z value for a fragment's pixel is equal to the z value in the buffer (e.g. due to precision loss/rounding), you can't know if the pixel should be kept or not. This leads to a problem called "z fighting". The second problem is performance - for every pixel of every fragment, you determine if the pixel should be "kept for now" and if the pixel is kept you do work to determine the pixel's colour; but there is no guarantee the pixel won't be overwritten later and the work you did to determine the pixel's colour will be a waste of time.
Imagine if, for each horizontal screen line, you have a list of fragments for that y. Initially these lists will all be empty; and during rasterisation you insert fragments into the list for that y.
When you insert a fragment in the list for that y, you can determine if the fragment is in front of or behind any existing fragments for that y. If the new fragment is (completely or partially) behind another fragment then the new fragment can be (completely or partially) discarded, and if other fragments are (completely or partially) behind the new fragment you can (completely or partially) discard the other fragments.
After you've done this for all fragments; you end up with a list of fragments that are not occluded, and then you'd render those fragments (and never determine the colour of a pixel that can't be seen).
This can be a complicated for memory management (e.g. using "malloc()" and "free()" for every fragment would be unusable slow), and takes some cleverness to do the fragment insertion quickly (possibly starting with storing fragments "starting x order" and using binary search).
It is possible to do occlusion before rasterisation, and this can (in theory) be much better for performance because it avoids doing all unnecessary work (in rasteristation and fragment rendering) for everything that's occluded.
This is extremely complex. For each pair of polygons, determine which polygon is closer and which is further away, then create a "masking polygon" by mapping the closer polygon onto the plane of the further polygon, then subtract the masking polygon from the further polygon. Note that polygon subtraction can lead to concave polygons (which most people would've decided to avoid) and "donuts polygons" (where a polygon has an outer circumference and an inner circumference and needs 2 vertex lists to describe).
Object Meshes Revisited
Earlier I mentioned that each object might have one or more meshes; and you might be wondering why an object might want multiple meshes. There are a few reasons for this.
Meshes For Different Distances
One reason for multiple meshes is that maybe you want to have one detailed mesh (with lots of polygons) for when the object is close to the camera, and then another simpler mesh (with fewer polygons) for when the object is far away from the camera. This improves performance when objects are too far away for the details to matter. You can extend this idea to many meshes at different levels of detail.
Spherical Mesh Culling
Imagine an object that has an exterior and an interior (like a space ship, or car, or submarine, or house, or...). When the camera is outside the object there will be polygons in the interior that are impossible to see (and if there's no windows all interior polygons will be impossible to see). When the camera is inside the object there will be polygons on the exterior that are impossible to see (and if there's no windows all exterior polygons will be impossible to see).
Now assume you have 2 spheres centered at the object's origin, where each sphere is stored as its radius. The radius of the first sphere is determined by distance from the object's origin to the furthermost interior polygon coordinate; and if the camera is further from the object's origin than this sphere's radius then the camera must be outside the object.
The radius of the second sphere is determined by the distance from the object's origin to the closest exterior polygon coordinate. If the camera is closer to the object's origin than this sphere's radius then the camera must be inside the object.
Now imagine you have 3 meshes. The first mesh is for polygons that can only be visible when the camera is outside the object; the second mesh is for polygons that can be visible regardless of where the camera is (e.g. exterior polygons that can be seen from the interior through a window); and the third mesh is for polygons that can only be visible when the camera is inside the object.
This gives us 3 cases (camera definitely inside object, camera definitely outside object, and camera may be inside or outside) where you can determine which of these cases it is extremely quickly (by using "distance from camera to object's origin" and comparing it to the radius of each sphere); where for 2 of those cases you can cull entire groups of polygons very early in the renderer (before you do anything else).
Partial Frame Rendering
There are a few cases where you might be able to avoid redrawing everything for a frame from scratch.
Note that it may be possible to support all of these, and may be beneficial to have a preliminary "assessment" step that determines how much needs to be rendered when multiple things have changed since last frame.
If a texture changed (but camera and objects remain in the same position) you can (at a minimum) start rendering the new frame at the "rasterisation" step. For this to work you need to keep the data used for rasterisation (the "screen coords" and lists of polygons that weren't culled) until after you know if it will/won't be needed for the next frame. This can potentially be improved by determining which area/s of the screen may have been effected by the texture that was changed, and only rendering that area of the screen.
If you used "polygon occlusion"; then you could rasterise and redraw the polygons that used the modified texture and nothing else.
If you used "fragment buffer"; then you can skip redraw the fragments that used the modified texture and nothing else.
If an object moves (but camera and textures remain the same); then you can determine the area of the screen that was effected (e.g. use the object's bounding sphere, it's previous location and its new location to determine a bounding rectangle on the screen) and then only render that area of the screen.==
(Small) Camera Rotation
For 2 of the 3 possible rotations (excluding "roll"), if the camera rotates (but objects and textures remain the same) then you can scroll the previous frame's pixels in the corresponding direction. This creates "not rendered before" area/s on the left or right of the screen and/or on the top or bottom of the screen; where you only need to render the "not rendered before" area/s.
Rendering Distance Cube
You can have a set of 6 textures (north, east, south, west, up, down) that form a cube, and use these textures to represent everything that is past the rendering distance. Because these textures represent "past the rendering distance" they can either be drawn first (so that all polygons occlude them), or they can be drawn last (after rendering everything else, if there's no data for a pixel then get the pixel's colour from the rendering distance cube). The benefit of this is that it can be a relatively cheap way to give the illusion of huge open areas (without the expense of using meshes/polygons to draw things like distant mountains, etc).
These textures can be replaced. For example, if the camera moves a certain distance then maybe you change to a different set of 6 textures.
These textures can be changed. For example, maybe the "top" texture is dynamically generated clouds that move slowly.
These textures can be morphed. For example, maybe you have "midday textures", "dusk textures", "midnight textures" and "dawn textures"; and depending on the time of day you blend 2 textures together (e.g. if the time is 20% of the way between midday and dusk, then you take 80% of a pixel from "midday" and the other 20% from "dusk").
These textures can be dynamically generated. For example, maybe (for the "north" texture) you have mountain in the distance, but there's a large hot air balloon flying over that mountain, so you regenerate the "north texture" every 250 ms by taking the original and drawing a "hot air balloon sprite" into it at a different positions.
Static Distant Impostors
For objects in the distance (especially things like trees where it doesn't matter so much which angle you look at them from); you can draw a single static textured polygon (an "impostor") instead of drawing the object itself (many polygons) to improve performance (reduce the number of polygons you have to deal with). When the camera gets close enough, then you switch to the object and stop using the impostor.
Dynamic Distant Impostors
For objects in the distance, you can render the object all by itself to dynamically generate a suitable distant impostor (and then use that dynamically generated distant impostor in the same way you would've used a static distant impostor). This takes a little care to determine when to update the distant impostor from the object.