Floating Point Number


Introduction

Floating point numbers are a way to represent real numbers in computer memory, which (usually) operates on binary digits. Unlike fixed point numbers, which have a fixed number of digits before and after the radix point, floating point numbers have a fixed number of significant digits: the leading digits that carry an accurate approximation of some value. Floating point numbers are inherently imprecise because they store an approximation of a value rather than the value itself. This requires care when writing code, in both low-level and high-level languages, to avoid precision errors that may break your program.
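As a small illustration of such a precision error, the following Python sketch adds two decimal fractions that have no exact binary representation. The helper name `nearly_equal` and the tolerance of 1e-9 are assumptions chosen for this example, not a recommendation for every program.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation, so each literal
# is rounded to the nearest double before the addition even happens.
total = 0.1 + 0.2


def nearly_equal(a, b, rel_tol=1e-9):
    """Compare two floats with a relative tolerance instead of ==."""
    return math.isclose(a, b, rel_tol=rel_tol)


if __name__ == "__main__":
    print(total == 0.3)              # exact comparison of the rounded values
    print(nearly_equal(total, 0.3))  # tolerant comparison
```

The exact comparison fails while the tolerant one succeeds, which is why equality tests on computed floating point values are usually replaced by a tolerance check.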

Because floating point numbers are represented differently from integers, they cannot be processed with ordinary integer instructions. Luckily, modern processors come with a dedicated on-chip FPU (Floating Point Unit) coprocessor that quickly performs floating point calculations in hardware. On x86, the FPU and its instructions are named 'x87' for historical reasons. Embedded devices may or may not come with an FPU, depending on their complexity and price range. Check your spec! (Before FPUs became an affordable upgrade for desktop computers, floating point emulation libraries were either preinstalled or installable on computers. Sometimes these libraries even required manual installation by the end user!)

This article explains the IEEE 754 standard for floating point number representation. The standard is extremely common, is not x86-specific, and defines multiple 'precisions' (representation sizes) that all work the same way and differ only in the number of bits used.

Representation

All floating point numbers are a variation on the theme of x = m * b^e. That is, they store a mantissa m, whose magnitude is limited to be at least 1 and less than the base, and an exponent e. The value is formed by raising a constant base b to the exponent and then multiplying by the mantissa.
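For base 2, this decomposition can be sketched in Python with `math.frexp`. Note that `frexp` returns a mantissa between 0.5 and 1, so the hypothetical helper `decompose` below rescales it to the 1-to-2 convention used on this page. It assumes a finite, nonzero input.

```python
import math


def decompose(x):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2.

    Assumes x is finite and nonzero.
    """
    m, e = math.frexp(x)    # frexp gives 0.5 <= |m| < 1
    return m * 2.0, e - 1   # rescale to the 1 <= |m| < 2 convention


if __name__ == "__main__":
    for value in (10.0, 0.15625, -6.0):
        m, e = decompose(value)
        print(value, "=", m, "* 2 **", e)
```

For example, 10.0 decomposes into a mantissa of 1.25 and an exponent of 3, since 1.25 * 2^3 = 10.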

In most implementations, the base is 2 or 10. The latter is called "decimal float" and is a bit of an outlier. There are many software implementations of it, but in hardware it has only been implemented on the IBM POWER6 and newer, the IBM System z9 and newer, and the Fujitsu SPARC64. Most of these also remain compatible with "binary float" (the name for base 2 floating point), and almost all hardware with FP support, including PCs, supports binary float.

Therefore the rest of this page will focus on IEEE 754 binary floating point.

IEEE 754

This standard defines the format of binary floating point numbers. Every FP number is a bit structure with one sign bit in the highest position, followed by a group of exponent bits, with the remaining bits forming the mantissa. Since raw bits on their own only form unsigned integers, these fields need to be interpreted.
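The field layout can be inspected with a short Python sketch that reinterprets a single-precision value as a 32-bit unsigned integer. The helper name `fields32` is made up for this example; the field widths (1 sign bit, 8 exponent bits, 23 mantissa bits) are those of the single-precision format.

```python
import struct


def fields32(x):
    """Split a single-precision float into (sign, exponent, mantissa) bit fields."""
    # Pack as a 32-bit float, then reread the same four bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # highest bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still biased
    mantissa = bits & 0x7FFFFF       # lowest 23 bits
    return sign, exponent, mantissa


if __name__ == "__main__":
    print(fields32(1.0))
    print(fields32(-2.5))
```

The returned exponent and mantissa are the raw stored fields, so they still need the interpretation described in the rest of this section.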

If all exponent bits are set, the value is special. If the mantissa is zero, the value is infinite; otherwise the value is not-a-number (NaN). The highest mantissa bit selects between quiet NaN and signaling NaN (on x86, set means quiet and clear means signaling). Using a signaling NaN in an arithmetic operation raises an invalid-operation exception. On x86 this exception is masked by default, in which case the operation simply produces a quiet NaN.
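The special-value rules can be sketched as a classifier that works on the raw 32-bit pattern of a single-precision number. The function name `classify32` is hypothetical, and the quiet/signaling distinction assumes the convention used on x86, where a set top mantissa bit means quiet. Working on raw bits avoids any conversion that might turn a signaling NaN into a quiet one.

```python
def classify32(bits):
    """Classify a raw 32-bit single-precision pattern.

    Assumes the x86 convention: top mantissa bit set means quiet NaN.
    """
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent != 0xFF:
        return "finite"           # normal, subnormal or zero
    if mantissa == 0:
        return "infinity"         # the sign bit selects +inf or -inf
    if mantissa & 0x400000:       # highest mantissa bit
        return "quiet NaN"
    return "signaling NaN"


if __name__ == "__main__":
    for pattern in (0x3F800000, 0x7F800000, 0x7FC00000, 0x7F800001):
        print(hex(pattern), classify32(pattern))
```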

Otherwise the exponent is stored as a biased integer. That is, we define a constant called the bias, which consists of all one bits and is one bit shorter than the exponent field (2^(k-1) - 1 for a k-bit exponent, so 127 for 8 bits). The exponent stored in the FP number is the actual exponent plus this bias. This allows negative exponents to be stored without two's complement or a separate sign.
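A bias that is "all bits set and one bit shorter than the exponent" works out to 2^(k-1) - 1 for a k-bit exponent field. The following Python sketch computes it and converts between actual and stored exponents; the function names are chosen for this example only.

```python
def bias(exponent_bits):
    """Bias for an exponent field of the given width: all ones, one bit shorter."""
    return (1 << (exponent_bits - 1)) - 1


def stored_exponent(actual, exponent_bits):
    """Value written into the exponent field for a given actual exponent."""
    return actual + bias(exponent_bits)


def actual_exponent(stored, exponent_bits):
    """Actual exponent recovered from the stored (biased) field."""
    return stored - bias(exponent_bits)


if __name__ == "__main__":
    for width in (8, 11, 15):
        print(width, "exponent bits -> bias", bias(width))
```

With 8 exponent bits, an actual exponent of 0 is stored as 127 and an actual exponent of -126 is stored as 1.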

Since the mantissa of a binary float has to be at least 1 and less than 2, the leading "1." of the mantissa is implicit. The stored mantissa bits are the first bits after the binary point. Therefore, if the stored mantissa is interpreted as an integer, it has to be divided by 2 raised to the bit width of the mantissa, and then 1 is added. The only special case is an exponent field of all zeros: then the implicit integer part is 0 instead of 1, and the exponent is the same as for a stored exponent of 1. Such numbers are not in normal form and are therefore called "subnormal".
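Putting the pieces together, a single-precision bit pattern can be decoded by hand as sketched below. The helper `decode32` is hypothetical and rejects infinities and NaNs to stay short. For subnormals it uses an actual exponent of 1 - bias = -126, the same exponent as the smallest normal numbers.

```python
def decode32(bits):
    """Decode a raw 32-bit single-precision pattern into a Python float.

    Handles normal and subnormal numbers (and zero); raises on inf/NaN.
    """
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:
        raise ValueError("infinity or NaN")
    fraction = mantissa / 2 ** 23          # bits after the binary point
    if exponent == 0:
        # Subnormal: implicit integer part is 0, exponent fixed at 1 - 127.
        return sign * fraction * 2.0 ** -126
    # Normal: implicit integer part is 1, exponent has the bias removed.
    return sign * (1.0 + fraction) * 2.0 ** (exponent - 127)


if __name__ == "__main__":
    for pattern in (0x3F800000, 0xC0200000, 0x3E200000, 0x00000001):
        print(hex(pattern), decode32(pattern))
```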

Original (1985)

level | width | range | precision | exponent width | mantissa width
single precision | 32 bits | ±1.18x10^-38 to ±3.4x10^38 | approx 7 digits | 8 bits | 23 bits
double precision | 64 bits | ±2.23x10^-308 to ±1.8x10^308 | approx 15 digits | 11 bits | 52 bits
double extended precision | 79 bits | ±3.36x10^-4932 to ±1.19x10^4932 | approx 19 digits | 15 bits | 63 bits

Note that the values for double extended precision are minimums. x87 implements this type as an 80-bit data type and adds an explicit integer bit as the highest bit of the mantissa.
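The range of normal numbers follows from the exponent and mantissa widths alone. The Python sketch below derives it for formats whose limits still fit into a Python float (a double), so it covers single and double precision but not double extended. The function name `normal_range` is an assumption for this example.

```python
import sys


def normal_range(exponent_bits, fraction_bits):
    """Return (smallest normal, largest finite) magnitude for a binary format.

    Only usable for formats whose limits fit into a double.
    """
    bias = (1 << (exponent_bits - 1)) - 1
    smallest = 2.0 ** (1 - bias)                           # stored exponent 1, mantissa 0
    largest = (2.0 - 2.0 ** -fraction_bits) * 2.0 ** bias  # all mantissa bits set
    return smallest, largest


if __name__ == "__main__":
    print("single:", normal_range(8, 23))
    print("double:", normal_range(11, 52))
    # Cross-check against the limits Python reports for its own double type.
    print(normal_range(11, 52) == (sys.float_info.min, sys.float_info.max))
```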

2008

level | width | range | precision
half precision | 16 bits | ±6.10x10^-5 to ±6.55x10^4 | approx 3 digits
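Python's `struct` module understands half precision through the `e` format code (Python 3.6 and newer), which makes it easy to read the limits of the format directly from bit patterns. The helper name `half_from_bits` is made up for this sketch.

```python
import struct


def half_from_bits(bits):
    """Interpret a raw 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack(">e", struct.pack(">H", bits))[0]


if __name__ == "__main__":
    print(half_from_bits(0x3C00))  # one
    print(half_from_bits(0x7BFF))  # largest finite value
    print(half_from_bits(0x0400))  # smallest normal value
    print(half_from_bits(0x0001))  # smallest subnormal value
```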