Module 10, 10 min

Fixed and Floating Point Representation

How hardware encodes fractions and real numbers

Two's-complement integers (covered in the foundations path) handle whole numbers, but DSP, graphics, and machine-learning hardware must also represent fractions and a very wide range of real values. Two encodings dominate: fixed point, which is an integer with an agreed scale factor, and floating point, which carries its own exponent. The choice between them largely determines the area, power, and accuracy of a datapath.

Fixed point: fractions without a float unit

Fixed point places an imaginary binary point at a chosen position and stores the value as a plain two's-complement integer. A Q-format label such as Q4.12 means 4 integer bits (sign bit included) and 12 fraction bits in a 16-bit word, so the represented value is the stored integer divided by 2^12 = 4096. The smallest step (the resolution) is 2^-12, about 0.000244, and the representable range runs from -8 up to 8 - 2^-12. Addition and subtraction are ordinary integer operations as long as both operands share the same Q format and you watch for overflow. Multiplication adds the integer and fraction bit counts, so a Q4.12 times Q4.12 product is Q8.24 and must be shifted right by 12 to return to 12 fraction bits. Keeping only the low 16 bits of the shifted value then discards the extra integer bits, so the result wraps if the true product falls outside the Q4.12 range.

verilog
// Q4.12 fixed-point multiply: realign the scale after multiplying.
// Each input is 16-bit signed (4 integer + 12 fraction bits).
module q4_12_mul (
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output wire signed [15:0] result
);
    wire signed [31:0] full = a * b;   // 32-bit product, format Q8.24
    // >>> is the arithmetic (signed) shift: drop 12 frac bits -> Q4.12.
    assign result = full >>> 12;       // upper integer bits truncated (wraps on overflow)
endmodule
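To make the scaling concrete, the multiply can be checked against one hand-computed value. The testbench below is a minimal sketch rather than part of the course material: the module name `q4_12_mul_tb` and the test values are assumptions chosen for illustration. It uses 1.5 (stored as 1.5 * 4096 = 6144) and 2.25 (stored as 9216), whose product 3.375 should be stored as 13824.

```verilog
// Illustrative self-checking testbench (hypothetical name and values).
module q4_12_mul_tb;
    reg  signed [15:0] a, b;
    wire signed [31:0] full   = a * b;        // Q8.24 product
    wire signed [15:0] result = full >>> 12;  // realigned to Q4.12

    initial begin
        a = 16'sd6144;    //  1.5  * 4096
        b = 16'sd9216;    //  2.25 * 4096
        #1;
        // 1.5 * 2.25 = 3.375, stored as 3.375 * 4096 = 13824
        if (result === 16'sd13824) $display("PASS  1.5 * 2.25 -> %0d", result);
        else                       $display("FAIL  1.5 * 2.25 -> %0d", result);

        a = -16'sd6144;   // -1.5 * 4096
        #1;
        // -1.5 * 2.25 = -3.375, stored as -13824; >>> preserves the sign
        if (result === -16'sd13824) $display("PASS -1.5 * 2.25 -> %0d", result);
        else                        $display("FAIL -1.5 * 2.25 -> %0d", result);

        $finish;
    end
endmodule
```

Dividing the stored result by 4096 recovers the real value, which is a quick sanity check when reading fixed-point waveforms.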

Floating point: range over uniform spacing

Floating point trades exact, uniform spacing for enormous dynamic range. An IEEE 754 normalized number equals (-1)^sign times 1.fraction times 2^(exponent - bias), where exponent is the stored (biased) field. The single-precision float32 format packs 1 sign bit, 8 exponent bits with a bias of 127, and 23 fraction bits. Normalized values have an implied leading 1, so the effective precision is 24 bits. That gives about 7 decimal digits and a normalized magnitude range from roughly 1.2 x 10^-38 up to 3.4 x 10^38. Reserved exponents carry special meaning: an all-zero exponent encodes zero (fraction zero) or a subnormal (fraction nonzero), and an all-one exponent encodes infinity (fraction zero) or NaN (fraction nonzero).
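The field layout and reserved exponents described above can be sketched as a small combinational decoder. This is an illustrative module, not a reference implementation: the name `fp32_classify` and its port names are assumptions made for the example.

```verilog
// Hypothetical helper: split a float32 word into fields and classify it.
module fp32_classify (
    input  wire [31:0] f,
    output wire        sign,
    output wire [7:0]  exponent,     // biased exponent (bias = 127)
    output wire [22:0] fraction,
    output wire [23:0] significand,  // fraction plus the hidden leading bit
    output wire        is_zero,
    output wire        is_subnormal,
    output wire        is_inf,
    output wire        is_nan
);
    assign sign     = f[31];
    assign exponent = f[30:23];
    assign fraction = f[22:0];

    wire exp_zero  = (exponent == 8'h00);
    wire exp_ones  = (exponent == 8'hFF);
    wire frac_zero = (fraction == 23'd0);

    // Hidden bit is 1 for normalized values and 0 for zero/subnormals.
    // (It is not meaningful when the word encodes infinity or NaN.)
    assign significand = {~exp_zero, fraction};

    assign is_zero      = exp_zero &  frac_zero;
    assign is_subnormal = exp_zero & ~frac_zero;
    assign is_inf       = exp_ones &  frac_zero;
    assign is_nan       = exp_ones & ~frac_zero;
endmodule
```

For example, the word 32'h3F800000 encodes 1.0: sign 0, exponent 127, fraction 0, so none of the four flags is set and the significand is 24'h800000.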

Format            Sign bits   Exponent bits   Fraction bits   Bias
float16 (half)    1           5               10              15
float32 (single)  1           8               23              127
float64 (double)  1           11              52              1023

Pro tip

For fixed point, remember that adds need matching Q formats and that a multiply adds the fraction bit counts, so a Q4.12 by Q4.12 product is Q8.24 and you must shift right by 12 to get back to Q4.12. For floats, know the float32 layout (1 sign, 8 exponent, 23 fraction bits, bias 127) cold; it comes up constantly.
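The matching-format rule for adds can be sketched as follows. This is a minimal illustration under assumptions: the second operand's Q8.8 format, the module name `q_align_add`, and the choice of a 21-bit Q9.12 result are all hypothetical, picked so that no bits are lost.

```verilog
// Illustrative only: add a Q4.12 value to a Q8.8 value by aligning
// both to 12 fraction bits in a wider word before adding.
module q_align_add (
    input  wire signed [15:0] x_q4_12,   // 4 integer + 12 fraction bits
    input  wire signed [15:0] y_q8_8,    // 8 integer +  8 fraction bits
    output wire signed [20:0] sum_q9_12  // 9 integer + 12 fraction bits
);
    // Sign-extend both operands to the result width first.
    wire signed [20:0] x_ext = x_q4_12;
    wire signed [20:0] y_ext = y_q8_8;

    // Shifting y left by 4 gives it 12 fraction bits (Q8.12, 20 bits).
    // One extra integer bit in the result absorbs any carry.
    assign sum_q9_12 = x_ext + (y_ext <<< 4);
endmodule
```

With x = 1.5 (6144 in Q4.12) and y = 2.25 (576 in Q8.8), the shifted y is 9216 and the sum is 15360, which is 3.75 * 4096.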

Watch out

Never test floating-point results with exact equality: values like 0.1 are not exactly representable in binary and rounding errors accumulate, so compare against a tolerance instead. With fixed point, the opposite trap bites: spacing is uniform but the range is small, so a multiply or a long sum overflows easily unless you carry guard bits or saturate.
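One way to combine a guard bit with saturation for a Q4.12 add is sketched below. The module name `q4_12_add_sat` is an assumption for illustration, and clamping to the extreme codes is only one possible overflow policy.

```verilog
// Illustrative saturating Q4.12 adder (hypothetical module name).
module q4_12_add_sat (
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output reg  signed [15:0] sum
);
    // One guard bit: the 17-bit sum of two 16-bit values cannot overflow.
    wire signed [16:0] wide = a + b;

    always @* begin
        if (wide > 17'sd32767)
            sum = 16'sh7FFF;      // clamp to the largest Q4.12 value (8 - 2^-12)
        else if (wide < -17'sd32768)
            sum = 16'sh8000;      // clamp to the smallest Q4.12 value (-8)
        else
            sum = wide[15:0];     // in range: the low 16 bits are exact
    end
endmodule
```

Without the clamp, adding 7.0 and 2.0 in Q4.12 would wrap around to -7.0; with it, the result sticks at the largest representable value, which is usually the less harmful error in a signal path.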