# Floating Point Number

## Introduction

Floating point numbers are a way to represent real numbers in computer memory, which (usually) operates on binary digits. Unlike fixed point numbers, which have a fixed number of digits before and after the radix point, floating point numbers have a fixed number of significant digits: leading digits that carry an accurate approximation of some value. Floating point numbers are inherently imprecise because most values are stored as an approximation and not as the exact value. This requires care when writing code, in low-level and high-level languages alike, to avoid precision errors that may break your program.
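The following is a minimal sketch of such a precision error, assuming IEEE 754 double precision. The helper names `sum_is_exact` and `nearly_equal` are made up for this illustration. Neither 0.1 nor 0.2 can be stored exactly in binary, so their rounded sum differs from the stored value of 0.3.

```c
#include <stdbool.h>

/* Hypothetical helper: compares two doubles within an absolute tolerance. */
bool nearly_equal(double a, double b, double tol) {
    double diff = a - b;
    if (diff < 0)
        diff = -diff;
    return diff <= tol;
}

/* Returns whether 0.1 + 0.2 compares equal to 0.3.
   With IEEE 754 doubles this is false. */
bool sum_is_exact(void) {
    volatile double a = 0.1, b = 0.2;  /* volatile keeps the addition at run time */
    return a + b == 0.3;
}
```

Comparing with a tolerance, as in `nearly_equal(0.1 + 0.2, 0.3, 1e-12)`, is one common way around this. A suitable tolerance depends on the magnitude of the values involved.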

Because floating point numbers are represented differently from integers, they cannot be processed with ordinary integer instructions. Luckily for you, modern processors come with a dedicated on-chip FPU (Floating Point Unit) co-processor that quickly performs floating point calculations in hardware. On x86, the FPU and its instructions are named 'x87' for historical reasons. Embedded devices may or may not come with an FPU, depending on the complexity and price range of these devices. Check your spec! (Before FPUs became an affordable upgrade for desktop computers, floating point emulation libraries would be either present or installable on computers. Sometimes these libraries even required manual installation by the end user!)

This article will explain the IEEE 754 standard for floating point number representation. This standard is extremely common, not x86-specific and has multiple 'precisions' or representation sizes which are all similar in how they work except for the number of bits used.

## Representation

All floating point numbers are a variation on the theme of x = m * b^e. That is, they all store a mantissa m, whose value is limited to be at least 1 and less than the base, and an exponent e. The whole number is formed by raising the constant base b to the exponent and then multiplying by the mantissa.
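As an illustration of x = m * b^e with b = 2, here is a small sketch that splits a value into its mantissa and exponent by repeated halving or doubling. The function name `split_base2` is made up for this example, and it assumes a positive, finite, normal input (it would loop forever on zero or infinity).

```c
/* Splits a positive, finite x into m and e such that
   x == m * 2^e and 1 <= m < 2. */
void split_base2(double x, double *m, int *e) {
    *e = 0;
    while (x >= 2.0) {   /* too large: halve and count up */
        x /= 2.0;
        ++*e;
    }
    while (x < 1.0) {    /* too small: double and count down */
        x *= 2.0;
        --*e;
    }
    *m = x;
}
```

For example, 10 is split into m = 1.25 and e = 3, since 1.25 * 2^3 = 10.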

In most implementations, the base will be 2 or 10. The latter is called "decimal float" and is a bit of an outlier. There are a lot of software implementations for it, but in hardware it was only implemented on the IBM POWER6 and newer, the IBM System z9 and newer, as well as the Fujitsu SPARC64. However, most of these also maintain compatibility with "binary float" (which is what base 2 floating point is called), and almost all hardware with FP support will support binary float, including PCs.

Therefore the rest of this page will focus on IEEE 754 binary floating point.

### IEEE 754

This standard defines the format of binary floating point numbers. All FP numbers are bit structures with one sign bit in the highest place, then a collection of exponent bits, and the rest of the number being the mantissa. Since bits on their own only form unsigned integers, these fields need to be interpreted.

If all exponent bits are set, the value is special. If the mantissa is zero, the value is infinite; otherwise the value is not-a-number (NaN). On platforms where this matters, the highest mantissa bit selects between quiet NaN and signaling NaN. Using a signaling NaN in an operation raises an invalid-operation exception. On x86 this exception is masked by default, in which case the operation silently produces a quiet NaN instead.
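The special values can be told apart by looking at the raw bits. The sketch below does this for single precision (8 exponent bits, 23 mantissa bits). The enum and function names are made up for this example, and it assumes the convention used on x86, where a set highest mantissa bit means quiet NaN; a few older architectures use the opposite convention.

```c
#include <stdint.h>

enum float_kind {
    KIND_FINITE,
    KIND_INFINITE,
    KIND_QUIET_NAN,
    KIND_SIGNALING_NAN
};

/* Classifies the raw bit pattern of a single-precision number. */
enum float_kind classify_single(uint32_t bits) {
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 exponent bits */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 mantissa bits */

    if (exponent != 0xFF)
        return KIND_FINITE;       /* not all exponent bits set */
    if (mantissa == 0)
        return KIND_INFINITE;     /* the sign bit selects +inf or -inf */
    /* Assumption: highest mantissa bit set means quiet NaN. */
    return (mantissa & 0x400000) ? KIND_QUIET_NAN : KIND_SIGNALING_NAN;
}
```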

If the exponent bits are not all set, the exponent is saved as a biased integer. That is, we define a number called the bias, which has all bits set and is one bit shorter than the exponent field (for example, 127 for an 8-bit exponent). The exponent saved in the FP number is the actual exponent plus this bias. This allows saving negative exponents without two's complement or a separate sign bit.
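The bias can be computed directly from the width of the exponent field. This is a small sketch with made-up function names, valid for exponent widths that fit comfortably in an `int`.

```c
/* Bias for a given exponent field width: all bits set,
   one bit shorter than the exponent field. */
int exponent_bias(int exponent_bits) {
    return (1 << (exponent_bits - 1)) - 1;
}

/* Value stored in the exponent field for a given actual exponent. */
int stored_exponent(int actual_exponent, int exponent_bits) {
    return actual_exponent + exponent_bias(exponent_bits);
}
```

This gives a bias of 127 for 8 exponent bits, 1023 for 11 bits and 16383 for 15 bits. An actual exponent of 52 in double precision is stored as 1075, or 0x433.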

Since the mantissa of a binary float has to be at least 1 and less than 2, the leading "1." is implicit and not stored. The stored mantissa bits are the first bits behind the binary point. Therefore, if the mantissa field is interpreted as an integer, it has to be divided by 2 raised to the bit width of the mantissa, and then 1 has to be added. The only special case is an exponent field of all zeros: then the implicit integer part is 0 instead of 1, and the exponent is the same as for an exponent field of 1. Such numbers are not in normal form and are therefore called "subnormal".
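Putting the pieces together, the following sketch decodes the raw bits of a finite single-precision number (8 exponent bits, 23 mantissa bits, bias 127) into its value. The function name `decode_single` is made up for this example. It returns a `double`, which can hold every single-precision value exactly, and it does not handle infinities or NaNs.

```c
#include <stdint.h>

/* Decodes the finite single-precision value stored in `bits`.
   An exponent field of 0xFF (infinity, NaN) is not handled here. */
double decode_single(uint32_t bits) {
    int sign = (int)(bits >> 31);
    int exponent = (int)((bits >> 23) & 0xFF);
    uint32_t fraction = bits & 0x7FFFFF;

    /* Mantissa field divided by 2^23: the bits behind the binary point. */
    double mantissa = fraction / 8388608.0;
    int e;
    if (exponent == 0) {
        e = 1 - 127;            /* subnormal: implicit integer part is 0 */
    } else {
        mantissa += 1.0;        /* normal: implicit leading 1 */
        e = exponent - 127;     /* remove the bias */
    }

    /* Multiply by 2^e; every step is exact in double precision. */
    double value = mantissa;
    for (; e > 0; --e)
        value *= 2.0;
    for (; e < 0; ++e)
        value /= 2.0;
    return sign ? -value : value;
}
```

For example, the bit pattern 0xC0200000 has the sign bit set, an exponent field of 128 and a mantissa field of 0x200000, which decodes to -(1 + 0.25) * 2^1 = -2.5.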

#### Original (1985)

| Level | Width | Range | Precision | Exponent width | Mantissa width |
|---|---|---|---|---|---|
| Single precision | 32 bits | ±1.18x10^-38 to ±3.4x10^38 | approx. 7 digits | 8 bits | 23 bits |
| Double precision | 64 bits | ±2.23x10^-308 to ±1.8x10^308 | approx. 15 digits | 11 bits | 52 bits |
| Double extended precision | 79 bits | ±3.36x10^-4932 to ±1.19x10^4932 | approx. 19 digits | 15 bits | 63 bits |

Note that the values for double extended precision are minimums. x87 implements this type as an 80-bit data type and adds an explicit integer bit as the highest bit of the mantissa.
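To illustrate the explicit integer bit, here is a sketch that builds the fields of the x87 encoding for a power of two. The struct and function names are made up, the struct only models the two fields and not the exact 10-byte memory image, and the bias of 16383 is assumed to follow the same rule as for the other formats (all bits set, one bit shorter than the 15-bit exponent).

```c
#include <stdint.h>

/* Fields of an x87 double extended number (not its memory layout). */
struct x87_extended {
    uint64_t significand;    /* bit 63 is the explicit integer bit */
    uint16_t sign_exponent;  /* bit 15 is the sign, bits 0-14 the exponent */
};

/* Encodes 2^e for an exponent e in the normal range. */
struct x87_extended x87_pow2(int e) {
    struct x87_extended r;
    r.significand = UINT64_C(1) << 63;        /* "1." stored explicitly */
    r.sign_exponent = (uint16_t)(e + 16383);  /* biased exponent, sign 0 */
    return r;
}
```

With these assumptions, 1.0 is encoded with an exponent field of 0x3FFF and a significand of 0x8000000000000000, whereas the single and double formats would store an all-zero mantissa field for the same value.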

#### 2008

| Level | Width | Range | Precision |
|---|---|---|---|
| Half precision | 16 bits | ±6.10x10^-5 to ±6.55x10^4 | approx. 3 digits |
| Quad precision | 128 bits | ±3.36x10^-4932 to ±1.19x10^4932 | approx. 34 digits |

## Conversions

The code examples in the following sections are valid C99, but not necessarily valid C++, where reading a union member other than the one last written is undefined behavior; C++ might require the use of memcpy() instead. The examples also assume that FPU endianness equals CPU endianness. They use hexadecimal floating-point constants, which are a feature of C99 and C++17.
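For reference, a memcpy()-based alternative to the union could look like the sketch below. The helper names `double_to_bits` and `bits_to_double` are made up for this example, and it assumes that `double` is a 64-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw bit pattern of a double. */
uint64_t double_to_bits(double d) {
    uint64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}

/* Builds a double from a raw bit pattern. */
double bits_to_double(uint64_t i) {
    double d;
    memcpy(&d, &i, sizeof d);
    return d;
}
```

Compilers typically turn such a fixed-size memcpy() into a plain register move, so this form usually costs nothing over the union.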

### Integer conversion

It is occasionally necessary to convert integers to floating point numbers and back again. Some architectures have no hardware support for that. While soft-float methods could be used to directly create the floating-point representation of an integer, it is simpler, shorter, and faster to create the floating-point representation of some number that is connected to the input and calculate the correct output from there.

In the case of the conversion of an unsigned 32-bit integer into a double-precision number, we can see that the output format has a 52-bit significand. If x is the input, then the number 2^52 + x has x as the low word of its representation, while the high word has all mantissa bits zero and a fixed exponent part of 0x433 (which represents 2^52). From that we can calculate the correct output just by subtracting the offset. So in C, the idea can be expressed as:

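```c
#include <stdint.h>

double utof(uint32_t x) {
    union {uint64_t i; double d;} u;
    /* Bit pattern of 2^52 + x: exponent field 0x433, x in the low word. */
    u.i = x | ((uint64_t)0x433 << 52);
    /* Remove the offset of 2^52 again. */
    return u.d - 0x1p52;
}
```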
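The claim about the representation of 2^52 + x can be checked from the other direction. The sketch below computes 2^52 + x with ordinary floating point arithmetic (so it relies on the very conversion that utof() avoids) and returns the resulting bit pattern. The function name `offset_bits` is made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Returns the bit pattern of the double 2^52 + x. */
uint64_t offset_bits(uint32_t x) {
    double d = 0x1p52 + (double)x;  /* exact: the sum fits in 53 bits */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

For any 32-bit x, the result is x in the low word combined with 0x433 in the exponent field, for example 0x43300000DEADBEEF for x = 0xDEADBEEF.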

For a signed 32-bit number, the input may be negative, but if the input is offset by 2^31, it will be nonnegative. After the conversion, that same offset has to be removed from the output number as well. In this case the addition can be expressed as an XOR instruction, since it only affects the highest bit.

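```c
double itof(int32_t x) {
    union {uint64_t i; double d;} u;
    /* XOR with 0x80000000 adds 2^31, making the low word nonnegative. */
    u.i = (((uint32_t)x) ^ 0x80000000) | ((uint64_t)0x433 << 52);
    /* Remove both offsets: 2^52 + 2^31. */
    return u.d - 0x1.000008p52;
}
```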
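The constant 0x1.000008p52 is the sum of the two offsets, which the following sketch spells out. The function names are made up for this example.

```c
#include <stdint.h>

/* The offset removed by itof(), built from its two parts. */
double itof_offset(void) {
    double exponent_part = 0x1p52;  /* value encoded by the 0x433 exponent field */
    double sign_shift = 0x1p31;     /* offset added by the XOR with 0x80000000 */
    return exponent_part + sign_shift;
}

/* Flipping the top bit equals adding 2^31 modulo 2^32. */
uint32_t bias_by_xor(int32_t x) {
    return (uint32_t)x ^ 0x80000000u;
}
```

The XOR maps the smallest input, -2^31, to 0 and the largest input, 2^31 - 1, to 0xFFFFFFFF, so the order of the inputs is preserved.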

For an unsigned 64-bit number, the same conversion can be done on the upper and lower halves separately, with the upper half getting an exponent of 84, which is 32 more than 52. Then the numbers can be added and their combined offsets removed. If the order of operations is correct, then no digits of the final number will be shifted out.

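```c
double ulltof(uint64_t x) {
    union {uint64_t i; double d;} hi, lo;
    /* 2^84 + upper half * 2^32 */
    hi.i = (x >> 32) | ((uint64_t)0x453 << 52);
    /* 2^52 + lower half */
    lo.i = (x & 0xFFFFFFFF) | ((uint64_t)0x433 << 52);
    /* Remove the combined offset 2^84 + 2^52 first, then add the lower half. */
    return hi.d - 0x1.00000001p84 + lo.d;
}
```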
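To see why the order of operations matters, the sketch below performs the same conversion with a switch that selects the order. The function name `ulltof_ordered` and its `offset_first` parameter are made up for this example, and it assumes arithmetic is carried out in IEEE 754 double precision without value-changing optimizations such as `-ffast-math`.

```c
#include <stdint.h>

/* Same conversion, with a selectable order of operations. */
double ulltof_ordered(uint64_t x, int offset_first) {
    union {uint64_t i; double d;} hi, lo;
    hi.i = (x >> 32) | ((uint64_t)0x453 << 52);         /* 2^84 + upper * 2^32 */
    lo.i = (x & 0xFFFFFFFF) | ((uint64_t)0x433 << 52);  /* 2^52 + lower */
    if (offset_first) {
        /* The subtraction is exact, so only the final addition rounds. */
        return (hi.d - 0x1.00000001p84) + lo.d;
    }
    /* A sum near 2^84 only resolves multiples of 2^32 in double
       precision, so the lower half is rounded away here. */
    return (hi.d + lo.d) - 0x1.00000001p84;
}
```

For x = 2^32 + 1, removing the offset first yields 4294967297, while adding the halves first yields 4294967296.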

Finally, for a signed 64-bit number, both ideas presented before combine. A signed 64-bit number can be seen as a multi-word number composed of two 32-bit numbers, of which the upper word is signed and the lower word unsigned. So it can be converted into a double-precision number like this:

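```c
double lltof(int64_t x) {
    union {uint64_t i; double d;} hi, lo;
    /* 2^84 + (signed upper half + 2^31) * 2^32 */
    hi.i = ((((uint64_t)x) >> 32) ^ 0x80000000) | ((uint64_t)0x453 << 52);
    /* 2^52 + unsigned lower half */
    lo.i = (x & 0xFFFFFFFF) | ((uint64_t)0x433 << 52);
    /* Remove the combined offset 2^84 + 2^63 + 2^52, then add the lower half. */
    return hi.d - 0x1.00000801p84 + lo.d;
}
```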
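The multi-word view can be sketched as follows. The function names are made up for this example, and the cast to `int32_t` assumes the usual two's complement wrap-around, which is implementation-defined in C99.

```c
#include <stdint.h>

/* Upper word, interpreted as signed (assumes two's complement wrap-around). */
int32_t high_word(int64_t x) {
    return (int32_t)(uint32_t)(((uint64_t)x) >> 32);
}

/* Lower word, interpreted as unsigned. */
uint32_t low_word(int64_t x) {
    return (uint32_t)x;
}

/* Rebuilds the number as hi * 2^32 + lo. */
int64_t recombine(int32_t hi, uint32_t lo) {
    return (int64_t)hi * 4294967296 + lo;
}
```

For example, -1 splits into an upper word of -1 and a lower word of 0xFFFFFFFF, and -1 * 2^32 + 4294967295 gives -1 again. The upper word then takes the signed path from itof() and the lower word the unsigned path from utof().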