Floating Point Number

Introduction

Floating point numbers are a way to represent real numbers in computer memory, which (usually) operates on binary digits. Unlike fixed point numbers, which have a fixed number of digits before and after the decimal mark, floating point numbers have a fixed number of significant digits: the leading digits that carry an accurate approximation of some value. Floating point numbers are inherently inexact because, in general, they store an approximation of a value rather than the value itself. This requires care when writing code in both low-level and high-level languages to avoid precision errors that may break your program.
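As a minimal illustration of such a precision error, the Python sketch below adds two decimal fractions that have no exact binary representation. It assumes that Python's float is an IEEE 754 double, which holds on all common platforms. The variable names are chosen for this example only.

```python
import math

total = 0.1 + 0.2          # neither 0.1 nor 0.2 is exactly representable in binary
exact_match = (total == 0.3)            # False: the rounding errors do not cancel
close_match = math.isclose(total, 0.3)  # True: compare with a relative tolerance instead

print(repr(total), exact_match, close_match)
```

Comparing with a tolerance rather than with `==` is the usual way to sidestep this class of bug. The right tolerance depends on the calculation.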

Because floating point numbers are represented differently from integers, they cannot be processed with ordinary integer instructions. Luckily for you, modern processors come with a dedicated on-chip FPU (Floating Point Unit) co-processor that quickly performs floating point calculations in hardware. On x86 the FPU and its instructions are named 'x87' for historical reasons. Embedded devices may or may not come with an FPU, depending on the complexity and price range of these devices. Check your spec! (Before FPUs became an affordable upgrade for desktop computers, floating point emulation libraries were either preinstalled or installable on computers. Sometimes these libraries even required manual installation by the end user!)
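To see why integer instructions cannot operate on floating point data, it helps to look at the raw bits. The sketch below uses Python's standard struct module to reinterpret a single precision value as a 32-bit unsigned integer. The helper name float32_bits is made up for this example.

```python
import struct

def float32_bits(value: float) -> int:
    """Return the 32-bit pattern of value stored as an IEEE 754 single."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits

print(hex(float32_bits(1.0)))   # 0x3f800000
print(hex(float32_bits(2.0)))   # 0x40000000
```

The integer 1 is stored as 0x00000001, while the floating point value 1.0 is stored as 0x3F800000, so an integer add or compare applied to these bits would produce meaningless results.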

This article will explain the IEEE 754 standard for floating point number representation. This standard is extremely common and not x86-specific. It defines multiple 'precisions', or representation sizes, which all work the same way and differ only in the number of bits used.
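The claim that the precisions differ only in bit counts can be sketched in code. The example below assumes the standard IEEE 754 binary layout, which this article has not yet described: one sign bit, followed by a biased exponent field, followed by a fraction field with an implicit leading 1. The field widths (8 and 23 bits for single, 11 and 52 bits for double) come from the standard. The function names are hypothetical, and only normal numbers are handled (no zeros, subnormals, infinities or NaNs).

```python
import struct

# (exponent bits, fraction bits) for two IEEE 754 binary formats
FORMATS = {"single": (8, 23), "double": (11, 52)}

def split_fields(bits: int, exp_bits: int, frac_bits: int):
    """Split a raw bit pattern into (sign, biased exponent, fraction)."""
    fraction = bits & ((1 << frac_bits) - 1)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (exp_bits + frac_bits)
    return sign, exponent, fraction

def decode_normal(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Rebuild the value of a normal number from its fields."""
    sign, exponent, fraction = split_fields(bits, exp_bits, frac_bits)
    bias = (1 << (exp_bits - 1)) - 1
    significand = 1 + fraction / (1 << frac_bits)   # implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

single_bits = struct.unpack("<I", struct.pack("<f", -6.5))[0]
double_bits = struct.unpack("<Q", struct.pack("<d", -6.5))[0]

print(decode_normal(single_bits, *FORMATS["single"]))
print(decode_normal(double_bits, *FORMATS["double"]))
```

The same two functions decode both formats. Only the pair of field widths passed in changes.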

Representation

IEEE 754

Original (1985)

level             width    range                          precision
single precision  32 bits  ±1.18x10^-38 to ±3.4x10^38     approx. 7 digits
double precision  64 bits  ±2.23x10^-308 to ±1.8x10^308   approx. 15 digits
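The range and precision figures follow directly from the number of exponent and fraction bits. As a rough sketch, the function below derives them under the assumption of the standard IEEE 754 binary layout (8 exponent and 23 fraction bits for single, 11 and 52 for double). The name format_limits is hypothetical, and the digit count is only an estimate obtained by rounding down.

```python
import math

def format_limits(exp_bits: int, frac_bits: int):
    """Return (smallest normal, largest finite, decimal digits) for a binary format."""
    bias = (1 << (exp_bits - 1)) - 1
    smallest_normal = 2.0 ** (1 - bias)
    # Largest finite value: all fraction bits set, largest non-reserved exponent.
    largest = (2.0 - 2.0 ** -frac_bits) * 2.0 ** bias
    # The stored fraction bits plus the implicit leading 1 give the significand
    # width; multiplying by log10(2) converts bits to decimal digits (rounded down).
    digits = math.floor((frac_bits + 1) * math.log10(2))
    return smallest_normal, largest, digits

for name, (e, f) in {"single": (8, 23), "double": (11, 52)}.items():
    lo, hi, digits = format_limits(e, f)
    print(f"{name}: {lo:.3g} to {hi:.3g}, about {digits} digits")
```

The lower bound here is the smallest normal number. Subnormal numbers extend the range further towards zero at reduced precision.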

2008

level             width    range                          precision
half precision    16 bits  ±6.10x10^-5 to ±6.55x10^4      approx. 3 digits
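The half precision limits can be checked with Python's struct module, whose "e" format code (available since Python 3.6) packs and unpacks IEEE 754 half precision values. The sketch below is only an illustration: the helper name to_half_and_back is made up, and 0x7BFF and 0x0400 are the bit patterns of the largest finite and smallest normal half precision numbers.

```python
import struct

def to_half_and_back(value: float) -> float:
    """Round value to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

largest = struct.unpack("<e", struct.pack("<H", 0x7BFF))[0]          # largest finite half
smallest_normal = struct.unpack("<e", struct.pack("<H", 0x0400))[0]  # smallest normal half

print(largest, smallest_normal)
print(to_half_and_back(3.14159))   # only about 3 significant digits survive
```

With just 10 stored fraction bits, 3.14159 comes back as 3.140625, which shows how coarse this format is compared with single or double precision.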