Optimizing
Optimization is the practice of taking functioning code and modifying it to make it execute faster. Optimization is already discussed very thoroughly elsewhere for various userland problem domains, so below are a few areas that are particularly important for operating system developers to consider.
Note that to fully understand the ideas on this page, it is important to have a fairly good grasp of assembly (x86 examples will be used), and a thorough knowledge of the x86 architecture is a must. But you've already read the Intel CPU manuals, haven't you?
When and What to Optimize
Please take the following list with some common sense: your kernel might be different.
Things to Optimize
- Deeply-nested loops. If possible, hoist work out of the inner loop, or eliminate some of the nesting entirely. An instruction removed from a loop body has a much more significant speed impact than an instruction removed from a general routine: if you run through a loop one million times with just one unnecessary line of code inside it, that line executes a million times, wasting a significant fraction of a second, especially on slower computers. (A sketch of this technique follows this list.)
- Code that is run often.
- Graphics code.
- API code.
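As a concrete sketch of the first point, the functions below (all names are illustrative, not from any particular kernel) scale each byte of a buffer; the second version hoists the loop-invariant multiply-and-divide out of the hot loop by precomputing a lookup table once:

```c
#include <stddef.h>

/* Naive version: a multiply and a divide on every iteration. */
void scale_row_slow(unsigned char *row, size_t width, int num, int den)
{
    for (size_t x = 0; x < width; x++)
        row[x] = (unsigned char)((row[x] * num) / den);
}

/* Hoisted version: a byte has only 256 possible values, so compute
   each result once up front; the loop body becomes one table load. */
void scale_row_fast(unsigned char *row, size_t width, int num, int den)
{
    unsigned char table[256];
    for (int v = 0; v < 256; v++)
        table[v] = (unsigned char)((v * num) / den);

    for (size_t x = 0; x < width; x++)
        row[x] = table[row[x]];
}
```

For a one-million-byte row, the slow version performs a million divisions; the fast one performs 256, plus a million cheap loads.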
Things not to bother Optimizing
- Boot code. Code that only runs once (and doesn't take very long, such as loading an IDT) is not worth optimizing.
Things not to Optimize
- Untested or nonworking code. If you try to prematurely optimize, you will almost certainly make any problems you might have worse. 'First make it run, then make it run fast.'
- Code that needs to retain clarity. It is generally hard to keep optimized code as neat and understandable as code written with readability in mind. Comments can help (and should be used anyway, especially to record why an optimization was made).
- Crash recovery code. This will depend on your personal stance. It may be a good idea to err on the side of caution and concentrate on stability of the recovery code, not speed.
Compiler Optimization
Modern compilers (such as GCC) provide built-in optimization logic, which is available to you without having to write (or remove) a single line of code. Enabling these compiler optimizations might seem to break previously functioning code; it is important to remember that this happens because of bugs that have been in your code all along and are merely exposed by the optimizations. One common example is omitting the volatile keyword where it is necessary.
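A classic instance of such a latent bug is a flag shared with an interrupt handler. In the sketch below (the names are illustrative), the volatile qualifier is what keeps an optimizing compiler from caching the flag in a register and turning the wait loop into an infinite loop:

```c
/* Set asynchronously by the interrupt handler. Without 'volatile',
   -O2 may read timer_fired once and spin forever on the stale value. */
volatile int timer_fired = 0;

void timer_interrupt_handler(void)
{
    timer_fired = 1;
}

void wait_for_timer(void)
{
    while (!timer_fired)
        ;   /* 'volatile' forces a fresh load from memory each pass */
}
```

At -O0 the unqualified version happens to work, because every access goes to memory anyway; the optimizer merely exposes the missing qualifier.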
Another side-effect of optimization is that source-level debugging might not always produce the expected results, as variables or whole lines of code are optimized away.
GCC
To use compiler optimizations in GCC, simply add one of the following -O switches to the command line when compiling:
| Switch | Use |
|---|---|
| -O0 | Disable optimization. (This is the default.) |
| -O1 | Try to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. |
| -O2 | Perform nearly all supported optimizations that do not involve a space-speed tradeoff. |
| -O3 | Perform nearly all supported optimizations, including those involving a space-speed tradeoff. |
| -Ofast | Optimize for speed. This includes all optimizations from -O3, as well as aggressive optimizations that may violate strict compliance with language standards. (For example, using a fused multiply-add instruction for floating-point arithmetic instead of multiplying and adding separately.) |
| -Os | Optimize for size. This includes all optimizations from -O2 that do not typically increase code size, plus additional size-reducing optimizations. |
| -Og | Turn on optimizations that do not interfere with debugging (see the GCC docs). Available since GCC 4.8. |
Numerous sources suggest the existence of further optimization levels, such as -O4, -O6, or -O99, but only the switches listed above (-O0 through -O3, -Ofast, -Os, and, since GCC 4.8, -Og) are documented and can be relied upon.
For more information, read the GCC manual; it lists the exact optimizations done at each level, plus some additional options not enabled at any of the above optimization levels.
Clang
Clang supports all of the same optimization flags as GCC, in addition to -Oz, which is similar to -Os but reduces code size even more.
ToDo: Describe Link-time optimizations.
Inline Functions
The inline keyword is a way of requesting that the compiler insert a function's code at the call site instead of generating a call to it. This means that if you create an inline function and then call it from another function, the compiler will try to replace the call with the function body (similar to a preprocessor define, but with none of its disadvantages).
Inline functions save the clock cycles involved in actually calling a function (the argument and return-address pushes, the call and return jumps). Compared to preprocessor defines, they add type safety and avoid the side-effect hazards that make macros dangerous. Inlining can also be turned on and off using compiler options, so you can measure the size/speed tradeoff. On the downside, a normal function called from five different locations resides in memory only once; if you inline that function, its body resides in memory five times. Your code becomes faster, but it also becomes bigger. It is important to note that the inline keyword is merely a request to the compiler, which the compiler can choose to ignore if it deems the size/speed tradeoff detrimental.
It is best to use the inline keyword only on small functions, where it is known to work best and to provide the largest speed boost. A typical example of such a small function is an accessor or mutator in a C++ class, or any short function that simply reads or sets a value. The inline keyword is specified before the function's return type, just like the static keyword:
```c
inline int max_of_three(int x, int y, int z)
{
    if (x > y && x > z)
        return x;
    return (y > z) ? y : z;
}
```
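In plain C, where the linkage rules for a bare inline are subtle, a common idiom is to define such small functions as static inline in a header, so every translation unit gets its own inlinable copy. A minimal sketch (the header name is illustrative):

```c
/* max3.h -- illustrative header name */
#ifndef MAX3_H
#define MAX3_H

/* 'static inline' gives each .c file that includes this header its
   own copy to inline, sidestepping C99's tricky rules for plain
   'inline' at link time. */
static inline int max_of_three(int x, int y, int z)
{
    if (x > y && x > z)
        return x;
    return (y > z) ? y : z;
}

#endif /* MAX3_H */
```

GCC also provides __attribute__((always_inline)) for the rare cases where you want to override the compiler's own inlining heuristics.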
The Cache
- Main article: CPU Caches
The CPU's cache is a small, high-bandwidth store of currently used data. On a modern CPU, it usually takes 2 cycles to read or write the level 1 (L1) cache (compared with 1 cycle for a register), 5 cycles for the level 2 (L2) cache, and perhaps around 30 cycles for main memory. As you can see, accessing main memory is very expensive compared to accessing a register or the cache. This speed comes at a size cost, however: L1 cache is only around 8 KiB per processor core, with L2 cache sitting at around 4 MiB to 12 MiB per Core 2 chip. For more information, see the Wikipedia article on CPU caches.
Most processors try to keep the most recently used data in the caches (though the exact method can differ; check the manuals). For this reason, it pays to keep in mind what data is likely to be in the caches at any given time. For example (and this is a very simplistic example), if your multitasking code switches processes too frequently, you will waste a higher percentage of time on cache 'flushes', refilling the caches with each new process's working set.
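Access order alone can make the difference. In the sketch below (the array size is arbitrary for the example), both functions touch exactly the same bytes, but the row-major version walks memory sequentially and reuses every cache line it fetches, while the column-major version strides across lines and misses far more often:

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static unsigned char grid[ROWS][COLS];

/* Cache-friendly: consecutive iterations touch consecutive bytes,
   so each cache line fetched from memory is fully used. */
void clear_row_major(void)
{
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            grid[r][c] = 0;
}

/* Cache-hostile: consecutive iterations are COLS bytes apart, so
   nearly every access pulls in a new cache line. */
void clear_col_major(void)
{
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            grid[r][c] = 0;
}
```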
CPU Pipelines
This page or section is a stub. You can help the wiki by accurately contributing to it.
See Also
Articles
Threads
- 01000101 asks about optimizing memcpy() and memset()
- A discussion on the best way to do multi-processor scheduling