IEEE-754
IEEE-754 is the IEEE Standard for Floating-Point Arithmetic.
Floating point numbers can efficiently store very large or very small real numbers. They work like standard form (scientific notation), but in base 2 rather than base 10. In Java, these are the float and double types, which use 32 and 64 bits respectively.
Floating point numbers consist of a sign bit, an exponent and a fraction (also called the mantissa).
- The sign bit is the most significant bit and denotes whether the number is positive (0) or negative (1).
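- The exponent gives the power of 2 by which the mantissa is scaled, which is equivalent to shifting the binary point. It is stored as an unsigned number with a bias added. In a 32-bit number it is the 8 bits immediately after the sign bit.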
- The mantissa is a binary fraction. The most significant bit represents $2^{-1}$, the second $2^{-2}$ and so on. In a 32-bit number it is the remaining 23 bits.
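The value of a normalised floating point number is given by
$$ (-1)^{\text{sign bit}} \times 1.(\text{mantissa}) \times 2^{\text{exponent} - \text{bias}} $$
Here the exponent is the stored field read as an unsigned integer, and $1.(\text{mantissa})$ is the mantissa with an implicit leading 1 that is not stored. For a 32-bit number the bias is 127; for a 64-bit number it is 1023. Exponent fields of all zeros or all ones are reserved for special values.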
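The field layout and the formula above can be sketched in Java. This is a minimal illustration for 32-bit normalised numbers only; the class and method names (FloatFields, signBit, exponentField, mantissaField, valueOf) are made up for this example, while Float.floatToIntBits is the standard library call that exposes the raw bits.

```java
class FloatFields {
    // Sign: the most significant bit (bit 31).
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent field: the 8 bits after the sign (bits 30..23), still biased.
    static int exponentField(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Mantissa field: the remaining 23 bits (bits 22..0).
    static int mantissaField(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Applies (-1)^sign * 1.(mantissa) * 2^(exponent - 127).
    // Assumes a normalised number, i.e. an exponent field between 1 and 254.
    static double valueOf(int sign, int exponent, int mantissa) {
        double significand = 1.0 + mantissa / (double) (1 << 23);
        double magnitude = significand * Math.pow(2, exponent - 127);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float f = -6.5f; // -1.625 * 2^2
        int s = signBit(f);
        int e = exponentField(f);
        int m = mantissaField(f);
        System.out.println("sign=" + s + " exponent=" + e
                + " mantissa=" + Integer.toBinaryString(m));
        System.out.println("rebuilt value: " + valueOf(s, e, m));
    }
}
```

For a double the same idea applies with Double.doubleToLongBits, an 11-bit exponent, a 52-bit mantissa and a bias of 1023.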
Floating point to base 10
An example of a 32-bit floating point number is 00111110001000000000000000000000.
The sign bit is 0, the exponent is 01111100 and the mantissa is 01000000000000000000000.
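Only the second mantissa bit is set, so the value of the mantissa is $2^{-2} = 0.25$. The exponent field is $01111100_2 = 64 + 32 + 16 + 8 + 4 = 124$.
Adding the implicit leading 1 to the mantissa gives 1.25, so the value of the number is $(-1)^0 \times 1.25 \times 2^{124-127} = 1 \times 1.25 \times 2^{-3} = 0.15625$.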
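The same decoding steps can be followed in code. The sketch below is illustrative only: DecodeBits and decode are hypothetical names, and the method assumes a 32-character string describing a normalised number. It then cross-checks the result against the JVM's own reading of the bits using the standard Integer.parseUnsignedInt and Float.intBitsToFloat methods.

```java
class DecodeBits {
    // Decodes a 32-character string of 0s and 1s by hand.
    // Assumes a normalised number (exponent field between 1 and 254).
    static double decode(String bits) {
        int sign = Integer.parseInt(bits.substring(0, 1), 2);      // 1 bit
        int exponent = Integer.parseInt(bits.substring(1, 9), 2);  // 8 bits
        int mantissa = Integer.parseInt(bits.substring(9, 32), 2); // 23 bits

        // Interpret the 23 mantissa bits as a binary fraction in [0, 1).
        double fraction = mantissa / (double) (1 << 23);

        // Add the implicit leading 1 and scale by 2^(exponent - bias).
        double magnitude = (1.0 + fraction) * Math.pow(2, exponent - 127);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        String bits = "00111110001000000000000000000000";
        System.out.println("by hand: " + decode(bits));

        // Cross-check: let the JVM reinterpret the same 32 bits as a float.
        float builtIn = Float.intBitsToFloat(Integer.parseUnsignedInt(bits, 2));
        System.out.println("built in: " + builtIn);
    }
}
```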
Base 10 to floating point
Consider 38.125.
38.125 is 100110.001 in fixed point binary (38 is 100110 and 0.125 is 0.001). Normalising means moving the binary point until exactly one 1 remains to its left, which gives 1.00110001. The binary point has moved 5 places to the left, so the exponent is 5. We now have $38.125 = 1.00110001 \times 2^5$, which is of the form $1.(\text{mantissa}) \times 2^{\text{exponent} - \text{bias}}$.
Using 32-bit precision, the stored exponent is 127 + 5 = 132 = 10000100 and the mantissa is 00110001, padded on the right with zeros to 23 bits. The sign is 0, so the 32-bit floating point representation of 38.125 is 01000010000110001000000000000000.
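The normalise-then-pack procedure can be sketched as follows. EncodeFloat, encode and pad32 are hypothetical names for this illustration. The sketch assumes a finite, non-zero value in the normalised 32-bit range, and it truncates rather than rounds any value that needs more than 23 mantissa bits, so it is not a replacement for Float.floatToIntBits.

```java
class EncodeFloat {
    // Builds the 32-bit pattern by hand for a normalised value.
    static String encode(double value) {
        if (value == 0 || Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("only finite, non-zero values are handled");
        }
        int sign = value < 0 ? 1 : 0;
        double significand = Math.abs(value);

        // Normalise: scale by powers of 2 until 1 <= significand < 2.
        int exponent = 0;
        while (significand >= 2.0) { significand /= 2.0; exponent++; }
        while (significand < 1.0)  { significand *= 2.0; exponent--; }
        if (exponent < -126 || exponent > 127) {
            throw new IllegalArgumentException("outside the normalised 32-bit range");
        }

        // Drop the implicit leading 1 and keep 23 fraction bits (truncating).
        int mantissa = (int) ((significand - 1.0) * (1 << 23));

        // Pack sign, biased exponent and mantissa into one 32-bit word.
        int bits = (sign << 31) | ((exponent + 127) << 23) | mantissa;
        return pad32(bits);
    }

    // Left-pads the binary string with zeros to 32 characters.
    static String pad32(int bits) {
        return String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(encode(38.125));
        // Cross-check against the bits Java itself stores for 38.125f.
        System.out.println(pad32(Float.floatToIntBits(38.125f)));
    }
}
```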
Special values
These are the special values for a 32-bit precision floating point number.
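| Exponent field | Mantissa | Value |
|---|---|---|
| 0 | 0 | Zero (+0 or -0, depending on the sign bit) |
| 255 | 0 | Infinity (positive or negative, depending on the sign bit) |
| 0 | not 0 | Denormalised: $(-1)^{\text{sign bit}} \times 0.(\text{mantissa}) \times 2^{-126}$ |
| 255 | not 0 | Not a number (NaN) |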
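The table can be turned into a small classifier that looks only at the exponent and mantissa fields. This is a sketch for 32-bit values; FloatClass and classify are hypothetical names, and the returned strings are labels chosen for this example.

```java
class FloatClass {
    // Classifies a raw 32-bit pattern using the exponent and mantissa fields.
    static String classify(int bits) {
        int exponent = (bits >>> 23) & 0xFF;
        int mantissa = bits & 0x7FFFFF;
        if (exponent == 0) {
            return mantissa == 0 ? "zero" : "denormalised";
        }
        if (exponent == 255) {
            return mantissa == 0 ? "infinity" : "not a number";
        }
        return "normalised"; // exponent field 1..254
    }

    public static void main(String[] args) {
        float[] samples = {
            0.0f, -0.0f, Float.MIN_VALUE, 1.0f,
            Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Float.NaN
        };
        for (float f : samples) {
            System.out.println(f + " is " + classify(Float.floatToIntBits(f)));
        }
    }
}
```

Float.MIN_VALUE is the smallest positive float, which is a denormalised number, so it makes a convenient test value for that row.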