# Calculators on desktops (was Re: Octal)

der Mouse mouse at Rodents.Montreal.QC.CA
Sat Sep 2 22:48:18 CDT 2006

```
> I don't know what the IEEE format is, but I'd guess it would be
> Integer, Exponent, Exponent, Exponent?

As I think someone else mentioned, IEEE there stands for the (US)
Institute of Electrical and Electronics Engineers, the standards body
responsible for the format.

An IEEE floating-point number consists of three parts: a sign bit, an
exponent field, and a mantissa field, normally stored in that order.
The two common formats are 32-bit (single precision) and 64-bit (double
precision); the standard also provides for extended formats, and the
layout is fairly easy to extend to other sizes.

The sign bit is always one bit, and is 0 for positive and 1 for
negative.  (It's a sign-magnitude format; unlike the 2's-complement
that's now nigh universal for integers, negating a number here involves
nothing but flipping the sign bit.)

The exponent field's size depends on the format: 8 bits for 32-bit, 11
for 64-bit.  This exponent field is what's called `biased'.  A number
N>0 can always be written as N = M * (2^E) for M in [1,2) and some
integer E.  IEEE format stores not E but E+K, where K depends on the
size of the exponent field, such that E=0 corresponds to the exponent
field having high bit 0 and its other bits all 1.  For 32-bit, for
example, this means that E=0 corresponds to an exponent field of
01111111, or 127 decimal; this is sometimes called "excess-127".  An
exponent field of all 0 bits or all 1 bits is special; I'll mention
this more below.  If a number calls for an E value that's out of range,
that number is not representable (except for certain very small
numbers; see below).

The mantissa field occupies the balance of the bits: 23 for 32-bit, 52
for 64-bit.  Because M is >=1 and <2, it is of the form 1.xxxx... when
written in binary.  The high bit thus carries no information; the
format gains one more bit of precision by not actually storing it.  For
example, the number 3 corresponds to E=1 and M=1.1 (that M value being
in binary); the M value stored in the number is 100000...000, the 1 bit
before the binary point being the non-stored bit (often called the
hidden bit).

Because the hidden bit is hidden, it is not available to distinguish
between a nonzero number that's a power of two and a number that's
zero.  I mentioned above that an exponent field of all 0 bits was
special.  One of the things it's used for is representing zero: zero is
represented with all its bits - sign, exponent, and mantissa - 0.

This leaves a number of possible conditions unspecified: any bit
pattern with all exponent bits zero but some other bits set has no
defined meaning according to the above.  I'm a little hazy on exactly
what's what for them.  Part of this is because I also know another
floating-point format - VAX format - well, and it is very similar to
the above, differing mainly in its treatment of these "exceptional"
cases (and its lack of the all-1-bit-exponent special case).  It's been
too long since I looked at either; the details are blurring together in
my memory.

In VAX format, I think any bit pattern with sign and exponent bits zero
represents zero; any bit pattern with sign bit 1 and exponent bits 0 is
a "reserved operand", which makes the floating point unit raise an
exception when asked to do anything with it.

In IEEE format, there are denormalized numbers, which are numbers for
which the non-stored ("hidden") bit is 0, not 1.  These are the bit
patterns with exponent bits all 0 and mantissa bits not all 0; they
correspond to the above for values of E too small to represent with
nonzero bits in the exponent field.  The one remaining pattern, sign
bit 1 with everything else 0, is negative zero, which compares equal to
zero.  That accounts for all the bit patterns with exponent bits all 0.

This leaves exponent fields of all 1s.  IEEE format has other values,
notably infinities and NaNs; all-1 exponent fields represent them.
Positive infinity is exponent field all 1, sign bit 0, mantissa all 0;
negative infinity is the same but with sign bit 1.  NaNs are any
pattern with exponent bits all 1 and mantissa bits not all 0.  (There
is a distinction between "quiet NaNs" and "signaling NaNs"; it is made
with the high bit of the mantissa, not the sign bit, and on most
hardware that bit set means quiet.  A signaling NaN is like a VAX
reserved operand; a quiet NaN is similar, except that it simply results
in another NaN when you do arithmetic on it.  That includes multiplying
it by zero: the result is still a NaN.)

Infinity is used to represent values that are actually infinite, such
as the result of dividing any nonzero number by zero, and values that,
while not theoretically deserving to be called infinite, overflow the
available number range, such as 1e200*1e200 in 64-bit format.  There is
a negative infinity as well; this is a conventional number line, not a
projective line with only one point at infinity.

"NaN" stands for "Not a Number" and represents something that does not
exist on the real line at all.  NaNs can turn up in uninitialized data
that happens to contain such a bit pattern, but they are more usually
generated by operations with no meaningful result, such as 0/0,
sqrt(-1), or adding infinities of opposite signs.

This concludes today's class on floating point. :-)

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
X  Against HTML	       mouse at rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

```
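
The field layout described in the post can be checked by hand with a few lines of Python. What follows is only a minimal sketch for the 32-bit format, not anything from the original message: the names `decode_float32`, `classify`, and `value_from_fields` are hypothetical, and the only library used is the standard `struct` module, which rounds a Python float to 32 bits when packing with the `f` code.

```python
import struct

EXP_BITS = 8      # exponent field width for the 32-bit format
MANT_BITS = 23    # stored mantissa bits (the hidden bit is not among them)
BIAS = 127        # "excess-127": E=0 is stored as 01111111
EXP_ALL_ONES = (1 << EXP_BITS) - 1


def decode_float32(x):
    """Round x to 32-bit format and return its (sign, exponent, mantissa) fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> (EXP_BITS + MANT_BITS)
    exponent = (bits >> MANT_BITS) & EXP_ALL_ONES
    mantissa = bits & ((1 << MANT_BITS) - 1)
    return sign, exponent, mantissa


def classify(sign, exponent, mantissa):
    """Name the kind of value the fields encode (the sign does not affect the kind)."""
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormal"
    if exponent == EXP_ALL_ONES:
        return "infinity" if mantissa == 0 else "NaN"
    return "normal"


def value_from_fields(sign, exponent, mantissa):
    """Rebuild the numeric value from the fields; finite values only."""
    if exponent == EXP_ALL_ONES:
        raise ValueError("infinity or NaN has no finite value")
    fraction = mantissa / (1 << MANT_BITS)
    if exponent == 0:
        # Zero or denormal: hidden bit is 0, E pinned at its smallest value.
        magnitude = fraction * 2.0 ** (1 - BIAS)
    else:
        # Normal number: hidden bit is 1, so M = 1.xxxx...
        magnitude = (1.0 + fraction) * 2.0 ** (exponent - BIAS)
    return -magnitude if sign else magnitude
```

With these definitions, `decode_float32(3.0)` returns `(0, 128, 0x400000)`, matching the worked example in the post: E=1 is stored as 128, and the stored mantissa is 100000...000 with the leading 1 hidden. Values too large for the 32-bit range make `struct.pack` raise `OverflowError` rather than produce an infinity, so this sketch only sees an infinity when it is handed one.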