As we saw in the discussion of Boolean logic, we can perform arithmetic
on binary numbers using combinations of logical operations. Now we'll take
a somewhat closer look at the specifics of arithmetic operations.
Recall from assembly language that binary numbers have a power of two associated
with each of their digits. If we use this to number the digits as we write
them down, we get a numbering scheme in which a 32-bit number has bit 31 on
the left and bit 0 on the right.
In decimal notation we represent a negative number by writing a minus in
front of a positive number. With binary numbers we do not have the convenience
of such a notation, although we could declare one bit (say bit 31) to be
a sign bit and the rest of the number to be the quantity. Such a notation
is called signed-magnitude representation. Suppose we were using a four-bit
word. Then 6 would be represented as
0110
and -6 as
1110
This is easy to read, but it has the disadvantage that if we add the numbers
in this form, we do not get 0.
0110
1110
------
0100 with a 1 overflow.
So the result of 6 + (-6) appears to be 4 with overflow.
We don't particularly care about readability of binary numbers in the computer
because we rarely look at them anyway (we write software to translate them
into decimal notation). So we should really be more concerned with what
representation is most useful for the computer. A system that allows us
to use an adder to sum positive and negative numbers will simplify the design
of the ALU. There are two such systems of representation and they both depend
on the complement of a number.
The complement of a number in a given base can be defined as the difference
between each digit of the number and the maximum digit value for the base.
Thus, in base 10 the complement of 26 would be 73, because the maximum digit
value is 9. In base two, the complement is easy to compute -- it is just
the bitwise inverse of a number. Thus the complement of 11010 would be 00101.
This scheme is known as ones-complement and was used on machines designed
by Seymour Cray at Control Data Corporation. They used ones complement because
it made negating a number as fast and simple as passing it through a word-wide
bank of inverters. It is also a symmetric number system in that, for any
given word size, there are an equal number of positive and negative numbers.
The problem is that one of those numbers is -0. So a ones complement machine
has to recognize the presence of all 1's in a word (with a giant AND gate)
and invert the bits again whenever -0 appears. In ones complement, the above
addition would be represented by:
0110
1001
-----
1111 (-0) => 0000
Similarly 26 + 73 = 99 (-0) => 0
We can avoid the problem of -0 by using two's complement, in which we add
1 to the complement of a number to negate it.
0110
1010
-----
10000 => 0000 with overflow
The 10's complement of 26 would be 74.
So 26 + 74 = 100 => 0 with overflow
Thus we never have negative 0 to worry about, but we do have a minor asymmetry
-- there is one more negative number than positive number in our system.
The unique 0 is essentially taken from the positive side of the number line
in a two's complement system.
As we saw before, we can perform the +1 part of a two's complement by using
the carry-in to the low-order bit of an adder, so there is almost no cost
associated with this operation.
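To make this concrete, here is a small Python sketch (not from the book; the names are just for illustration) of two's complement negation done the way the hardware does it -- invert every bit, then add 1 by feeding a carry-in of 1 to the low-order bit of the adder:

WIDTH = 4
MASK = (1 << WIDTH) - 1                          # 0b1111 for a 4-bit word

def negate(x):
    """Two's complement of x within WIDTH bits: invert, then +1 via carry-in."""
    return (~x + 1) & MASK

def to_signed(x):
    """Interpret a WIDTH-bit pattern as a signed value."""
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x

six = 0b0110
minus_six = negate(six)                          # 0b1010
print(format(minus_six, "04b"))                  # 1010
print(format((six + minus_six) & MASK, "04b"))   # 0000 -- the carry out is discarded
print(to_signed(0b1000))                         # -8, the one extra negative number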
In two's complement, a high-order bit of 1 always signifies a negative number
(in ones complement, we have to check for -0 when the high order bit is
1). Thus, the high order bit is usually called the sign bit.
If we need to convert a two's complement number to a larger word size, we
simply extend the sign bit into the high order digits to produce a number
of the correct form. That is, if the number is positive, all of the higher
order bits in the wider word will be 0. If it is negative, they are all
1.
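A short Python sketch of sign extension (again just an illustration, not the book's notation): copying the sign bit into the new high-order positions preserves the value of the number.

def sign_extend(value, from_bits, to_bits):
    """Widen a from_bits-wide two's complement pattern to to_bits bits."""
    sign = (value >> (from_bits - 1)) & 1
    if sign:                     # negative: fill the new high-order bits with 1s
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value                 # positive: the new high-order bits stay 0

print(format(sign_extend(0b0110, 4, 8), "08b"))   # 00000110  (+6 stays +6)
print(format(sign_extend(0b1010, 4, 8), "08b"))   # 11111010  (-6 stays -6)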
Shifting a word right or left (which is equivalent to multiplying or
dividing by a power of 2) is used in multiplication and division and also
to align data on byte or word boundaries. It can be implemented in several
ways. The least expensive is with a shift register.
Basically a chain of D flip-flops is arranged with gates that connect their
inputs to an external source or to each other.
[drawing of D flip-flops with AND-OR circuit selecting one of two sources]
We can make this bidirectional by adding a third input that takes its data
from the flip-flop on the other side.
[drawing showing the third input]
The drawback of this approach is that it requires each shift of n places
to take n clock cycles. We can instead use a logic circuit that selects
any of the 32 input bits to go to a given output bit, with all of the other
output bits receiving a parallel selection.
[drawing of two rows of bits with lines connecting them]
[drawing of n AND gates feeding an n input OR gate -- one of the component
circuits]
This arrangement is clearly more expensive but it can compute any shift
in just one cycle (or even less, since it is not a clocked circuit). It
is called a barrel shifter.
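We can model the barrel shifter in a few lines of Python (a sketch only -- each list element stands in for one output of the mux bank): every output bit selects one input bit according to the shift amount, so any shift completes in a single pass through the logic.

WIDTH = 8

def barrel_shift_left(bits, amount):
    """bits is a list of 0/1 values with bits[0] the low-order bit."""
    # Output bit i takes input bit (i - amount); positions shifted past the
    # low end are filled with 0, just as the combinational mux bank would.
    return [bits[i - amount] if i - amount >= 0 else 0 for i in range(WIDTH)]

word = [1, 1, 0, 1, 0, 0, 0, 0]          # 0b00001011, low-order bit first
print(barrel_shift_left(word, 3))        # [0, 0, 0, 1, 1, 0, 1, 0]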
The book does a nice job of showing how an adder/subtractor/AND/OR/less-than
unit can be built, so it won't be repeated here. Of particular interest is
how zero detection and overflow detection are handled in Figure 4.17.
We'll just briefly go over the operation of the carry lookahead since we've
discussed the concept before. The basic idea behind its implementation is
that some clever algebra has been used to simplify the logic functions by
factoring out some common terms in each of the carry expressions.
These terms depend only on the bits being input at a given position -- not
on any of the other positions. They are called
generate = A AND B (because if A and B are both 1, then a carry is generated
regardless of the carry in to that position)
propagate = A OR B (because if either is a 1, then a carry in will be propagated
to the next stage; if both are 1, the carry in doesn't matter --
it is an inclusive propagate)
So we have
c1 = g0 + (p0 c0) that is, we get a carry if we have either a generate,
or a propagate with a carry in
c2 = g1 + (p1 c1) = g1 + (p1 g0) + (p1 p0 c0)
c3 = g2 + (p2 c2) = g2 + (p2 g1) + (p2 p1 g0) + (p2 p1 p0 c0)
c4 = g3 + (p3 c3) = g3 + (p3 g2) + (p3 p2 g1) + (p3 p2 p1 g0) + (p3 p2 p1 p0 c0)
As you can see, we can build each of the carry out computations from a circuit
with just two stages of gates, but the n-th bit contains a circuit with
an n+1 input OR gate and n AND gates whose input counts range from 2 to
n+1.
For an 8-bit adder this works OK, but for 32 bits the fan-in is too high.
So a hierarchical scheme is used instead -- it will usually consist of 4
or 8-bit carry lookahead adders and these feed another level of lookahead
circuitry that produces the carry in to each of the blocks at the same time.
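As a sketch of the idea (in Python rather than gates, and with made-up function names), here are the four block carries computed directly from the expanded equations above, each in effect just two gate levels deep:

def lookahead_carries(a_bits, b_bits, c0):
    """a_bits and b_bits are lists of 0/1, low-order bit first; returns c1..c4."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate terms
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate terms (inclusive)
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

# 3 + 5: all of the carries are available at once instead of rippling.
print(lookahead_carries([1, 1, 0, 0], [1, 0, 1, 0], 0))   # [1, 1, 1, 0]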
Again, the book does a good job of presenting very simple multipliers.
The basic idea is that we shift and add the multiplicand to a result register
repeatedly. At each shift position we look at a corresponding digit of the
multiplier and if it is 0 we skip the addition and if it is 1 we carry out
the addition.
More clever schemes recognize that as the data shifts around, space opens up
in the double-size result register that can be reused, avoiding the need
for extra registers. So some of the bits recirculate to fill
the space being vacated by the departing bits of the multiplicand and multiplier
as they shift across the registers.
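Here is the basic shift-and-add scheme as a Python sketch (it ignores the register-sharing tricks just described and simply uses a wide accumulator):

def shift_add_multiply(multiplicand, multiplier, width=8):
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:            # examine multiplier bit i
            product += multiplicand << i     # add the shifted multiplicand
    return product                           # on a 0 bit we only shift

print(shift_add_multiply(13, 11))            # 143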
The book also discusses Booth's algorithm, which notes that if the multiplier
contains a string of ones, for example
0000111111110000
then the partial sum that results from this string is equal to the multiplicand
times a 1 in the next higher bit position minus the multiplicand times a 1
in the lowest-order bit of the string. So if a number contains long strings of 1 bits,
we can avoid all of the intermediate shifts and adds and replace them by
one add and one subtract. The problem is that this algorithm is actually
slower for patterns like
0101010101010101
where an add and subtract would be done for each 1 in the multiplier.
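A sketch of radix-2 Booth recoding in Python (illustrative only; it treats the multiplier as a width-bit two's complement value): a run of 1s becomes one subtract at the low end of the run and one add just past its high end, while an alternating pattern forces an add or subtract at nearly every position.

def booth_multiply(multiplicand, multiplier, width=16):
    product = 0
    prev_bit = 0
    for i in range(width):
        bit = (multiplier >> i) & 1
        if (bit, prev_bit) == (1, 0):         # low end of a run of 1s
            product -= multiplicand << i
        elif (bit, prev_bit) == (0, 1):       # just past the high end of a run
            product += multiplicand << i
        prev_bit = bit
    return product

print(booth_multiply(7, 0b0000111111110000))  # 7 * 4080 = 28560, one add + one subtract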
We can build a really fast multiplier by creating a chain of adders that
each take the multiplicand as one input (controlled by the appropriate bit
of the multiplier) and take the output of the preceding stage as the other
input.
This scheme produces a result in the time that it takes the data to propagate
through all of the adder stages. It has as many adders as there are bits
in the multiplier, and the adders are as wide as the number of bits in the
multiplicand. Thus, for N-bit words, N-squared gates are required for a
carry propagate adder based circuit, and a CLA based circuit is even more
expensive.
One point to note about this approach is that if we put a register between
each pair of adders, and clock the data through the register, then we can
start a new multiply operation into this circuit with each clock cycle.
After N clock ticks the first result emerges and subsequent multiplies emerge
on each additional tick. Such an approach is called a pipelined multiplier
and is useful in machines that are doing vector operations.
Wallace proposed an alternative scheme that uses fewer gates. He recognized
that we don't have to complete all of the carry operations at each stage
before passing a value on to the next stage. As long as the carries
get added in before the final result is produced, we will have a valid sum
of all of the partial products.
Thus, in Wallace's scheme, all of the partial products are generated in
parallel by gating N copies of the N-bit multiplicand by the bits of
the multiplier into the inputs of a tree of "pseudo-adders". (Much
as in the diagram above, except that the pipelined stages are replaced by
the adder tree).
Each pseudo-adder is a bank of full adder circuits. But instead of propagating
the carry output to an adjacent bit within the pseudo-adder, the carry is
simply another output number:
A pseudo-adder block can be treated as a 3-input, 2-output combinational
circuit that produces two sums from three inputs. If we add the two sums
together, then we get the actual binary sum of the three inputs.
Wallace showed how to construct a tree of these adders. The following diagram
shows how 20 partial products could be summed together with pseudo-adders.
At the final stage, a carry lookahead adder is used to sum the last carry
and sum output pair.
Wallace tree adders can be constructed from larger adder blocks that take
more inputs and produce more outputs -- all that is required of the outputs
of a pseudo adder is that if you sum them up you get the sum of the inputs.
Thus, for example, one might find a convenient Boolean function that takes
15 inputs and generates 4 outputs whose sum is the sum of the 15 inputs.
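A pseudo-adder (often called a carry-save adder) is easy to model in Python -- this is only a sketch of the 3-input, 2-output behavior, not a gate-level design:

def carry_save_add(x, y, z):
    """Reduce three numbers to two whose sum equals x + y + z."""
    partial_sum = x ^ y ^ z                        # per-bit sums, no carry propagation
    carry = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, shifted to their weight
    return partial_sum, carry

s, c = carry_save_add(0b1011, 0b0110, 0b1101)
print(s + c, 0b1011 + 0b0110 + 0b1101)             # 30 30 -- only this last add propagates carries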
The book describes several variations on the same basic approach to division.
Start with the dividend and divisor aligned. Shift the dividend left and
subtract the divisor. If the result is positive, then the quotient gets
a 1 and the remainder replaces the dividend. If the result is negative,
then the quotient gets a 0 and the dividend is restored except that it is
still shifted by one.
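A Python sketch of this restoring division loop (assuming unsigned values that fit in the word; the names are just for illustration):

def restoring_divide(dividend, divisor, width=8):
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)   # shift in the next dividend bit
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial                  # subtraction succeeded: quotient bit is 1
            quotient = (quotient << 1) | 1
        else:
            quotient = quotient << 1           # restore by keeping the old remainder: bit is 0
    return quotient, remainder

print(restoring_divide(143, 11))               # (13, 0)
print(restoring_divide(100, 7))                # (14, 2)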
Given a fast multiplier, however, there are faster methods that can be applied
based on a Newton-Raphson iterative solver. Flynn showed that a first-order
Newton-Raphson formula can start with a 4-bit approximation and in
just 3 iterations achieve 32-bit precision. Each of the iterations takes
2 multiplies and some shifts and complements.
The initial value can be obtained with a table lookup that uses the high
order bits of the divisor as the address and produces the 4-bit approximation
to the reciprocal of the divisor. Thus, division of this form takes only about 6 times as
long as multiplication. If the multiplier is like the Wallace Tree, it may
be several times faster than the shift-and-subtract approach.
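The heart of the method is the reciprocal iteration x_{n+1} = x_n (2 - d x_n); each pass costs two multiplies plus the complement needed to form 2 - d x_n, and the number of correct bits roughly doubles. A Python sketch (the crude seed below stands in for the 4-bit table lookup):

def reciprocal(d, iterations=3):
    """Approximate 1/d for d normalized into [1, 2)."""
    x = 1.0 / 1.5                     # stand-in for the table-lookup seed
    for _ in range(iterations):
        x = x * (2.0 - d * x)         # two multiplies per iteration
    return x

d = 1.2345
print(reciprocal(d), 1.0 / d)         # the two values agree closely after three passes
# The quotient a / d is then just one more multiply: a * reciprocal(d).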
Although the book doesn't address the issue of computing sines, cosines,
etc. in hardware, they are actually included in many floating point coprocessors.
One popular method that involves only simple hardware (and is used, for
example, in most calculators) is the CORDIC (COordinate Rotation DIgital
Computer) approach developed by Volder.
He observed that if one rotates a vector by a decreasing series of positive
or negative angles, the vector eventually converges on a desired angle.
Further, if the angles are selected appropriately and at each stage either
a positive or negative version of the rotation is applied (whichever gives
the closer result), then the length of the resultant vector is a constant.
Thus, if the final X and Y values are multiplied by an appropriate set of
constants to change the length of the vector into a unit, then as we know
from trigonometry, the X and Y values of the unit vector at that angle are
the values of the cosine and sine, respectively.
The CORDIC method uses a table of constants. For each iteration it compares
the current angle to the desired angle, and either adds or subtracts the
constants for that iteration to the angle, X, and Y values. The precision
of the result increases by about one digit per iteration. At the end of this series
of add/subtract operations, one multiply is required to produce the desired
function.
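A Python sketch of CORDIC in rotation mode (illustrative; a real implementation would use fixed-point adds and shifts rather than floating point):

import math

def cordic_sin_cos(angle, iterations=24):
    """Approximate (sin(angle), cos(angle)) for angle in [-pi/2, pi/2]."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]    # the table of constants
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))               # length-correction constant
    x, y, z = 1.0, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0            # rotate toward the desired angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y * k, x * k                        # the final multiply scales to a unit vector

print(cordic_sin_cos(math.pi / 6))             # close to (0.5, 0.866...)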
As you probably recall from assembly language programming, the basic
concept of floating point representation is that it provides a portion
of a word that represents a mantissa, a part that represents an exponent,
and one bit for a sign. The sign of the exponent is usually subsumed by
a bias value that must be subtracted from the stored value to obtain the
correct exponent value. The advantage of the bias is that it makes the exponents
all positive which saves an explicit bit for the sign and also means that
the arithmetic (such as comparisons) does not have to treat positive and
negative exponents differently. However, after adding two exponents (as
in multiplication), the extra bias amount must be subtracted. Biased exponent
representations also have the advantage that zero is represented by a word
of all zeros.
The mantissa for a floating point number is usually normalized -- that is,
it is shifted left until the first one bit is in the high order position.
The exponent is adjusted by a corresponding amount to compensate for the
shifting. Because a normalized representation always has a one in the leading
position, we need not represent it explicitly. Thus, one more bit can be
used to extend the precision of the mantissa.
Addition involves aligning the mantissas of the two numbers (determining
the difference between their exponents and then shifting the smaller of
the two numbers right by that many positions, with the implicit one made
explicit), then adding the mantissas and if there is an overflow renormalizing
by shifting the result one to the right.
Subtraction is the same as addition except that one mantissa is subtracted
from the other after alignment, and the renormalization may involve an arbitrary
left shift of the result.
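To make the alignment and renormalization steps concrete, here is a Python sketch for adding two positive values held as (mantissa, exponent) pairs, with the leading 1 made explicit in an integer mantissa (the format here is invented for illustration, not IEEE):

def fp_add(m1, e1, m2, e2, precision=24):
    # Align: shift the mantissa with the smaller exponent to the right.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)
    # Add the aligned mantissas.
    m, e = m1 + m2, e1
    # Renormalize: an overflow out of the top bit costs one right shift.
    if m >= (1 << precision):
        m >>= 1
        e += 1
    return m, e

# 1.5 * 2^0 + 1.5 * 2^0 = 1.5 * 2^1
print(fp_add(0b110000000000000000000000, 0, 0b110000000000000000000000, 0))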
Multiplication does not require the mantissas to be aligned. They are multiplied,
the exponents are added, and the result is renormalized if necessary (a
shift of one position). Of course, the result is double the size of the
original values, so it must be shortened either through truncation or rounding.
(The other operations require this as well whenever a value is renormalized
and bits of precision are lost.)
Division similarly involves no prealignment of the mantissas. When the division
is performed, one exponent is subtracted from the other. Unlike multiplication,
the mantissa result of the division can require arbitrary shifting to be
renormalized.
The majority of computer systems today use the IEEE 754 standard that
was first adopted in 1985. The adoption of this standard was a rare show
of apparent cooperation among computer manufacturers. However, it was partly
a matter of pragmatics (handling floating point properly is complicated,
and adopting the standard saves both design cost and potentially embarrassing
errors), partly competition (it became a positive marketing point to be
able to claim IEEE compatibility), partly the effect that the microprocessor
generation of manufacturers had a clean slate to start from (most of them
did not have existing mainframe-based standards -- those that did were the
ones like DEC and IBM who entered the microprocessor business late), and
partly that the manufacturers uniformly underestimated the difficulty of
implementing the IEEE standard (so they publicly jumped onto the bandwagon
before they knew how much it would really cost).
Because the standard is well designed, its widespread adoption has been
a very positive experience for the software community. However, even ten
years after it was introduced, there is very little language support to
take advantage of the more sophisticated features of the standard, and some
of the features are either unimplemented or very costly to employ on some
machines.
The standard defines four formats for floating point numbers, of which two
are commonly used (single and double). In single-precision numbers there
is a one-bit sign, 23 bits in the mantissa (an implicit 24th bit is the
leading 1 in all mantissas, which isn't stored), and the 8-bit exponent
ranges from -126 to 127 with a bias of 127. In double-precision numbers
there is the one-bit sign, 53 bits of precision in the mantissa and the
exponent ranges from -1022 to 1023 with a bias of 1023.
The bias is a number that must be subtracted from the stored value of the
exponent to get its value for purposes of computation. Thus, the minimum
value that can appear in an exponent field of a proper single-precision
floating point number is 1 and the maximum value is 254. This leaves 0 and
255 for representing special cases. In double-precision, the exponent can
be 1 through 2046, with 0 and 2047 reserved for special cases.
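We can see the fields directly by pulling a single-precision number apart in Python (a sketch for normal numbers only; the helper name is made up):

import struct

def decode_single(value):
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    exponent = biased_exponent - 127        # remove the bias
    mantissa = (1 << 23) | fraction         # restore the hidden leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = decode_single(6.5)
print(sign, exponent, format(mantissa, "024b"))
# 0 2 110100000000000000000000  ->  +1.101 (binary) * 2^2 = 6.5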
The other two formats are "extended" versions of the two common
representations. They assume that a machine will devote an entire word (32
or 64 bits) to the mantissa and at least 10 (single) or 14 (double) bits
to the exponent field. The extended formats do not have defined biases.
When the exponent of an IEEE FP number is all ones (255 or 2047) and
the mantissa is all zeroes, the number is explicitly equal to infinity (+
or -, depending on the sign bit). The standard requires arithmetic such
as division by 0 to produce an infinity as a result, and inversely, division
by infinity to produce 0. Overflow can also produce an infinite result.
If the mantissa is nonzero, then the value is called "Not a Number"
or NaN. The NaN values can be used to represent results such as square root
of a negative. Arithmetic on NaN values should also produce NaN values as
output so that if an improper floating point value enters a string of calculations,
it cannot result in a valid floating point result.
When the exponent is zero, then the mantissa represents an unnormalized
or "subnormal" number. These are numbers that are smaller than
the number 1 x 2^-126. That is, the implicit leading 1 disappears and the
number of significant digits in the mantissa can decrease to fewer than
24 bits. These numbers are useful when we consider that two distinct numbers
with exponents of -126 can differ by an amount that is smaller than we
can otherwise represent. So if we try to compare the numbers by subtracting
them, we get an unnormalized number. In some systems, unnormalized numbers
are not allowed and the difference would simply be zero. Thus, it would
be possible to have two very small but different numbers that appear to
be equal when we compare them.
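We can watch gradual underflow happen with ordinary double-precision values in Python (the smallest normal double is 2^-1022; this is just a demonstration):

a = 1.5 * 2.0 ** -1022       # a normal number near the bottom of the range
b = 1.0 * 2.0 ** -1022       # the smallest normal double
difference = a - b           # 2^-1023, representable only as a subnormal
print(difference)            # about 1.1e-308, not 0.0
print(difference != 0.0)     # True -- the two small numbers do not appear equal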
While everyone agrees that having subnormal numbers enhances the accuracy
of calculations and reduces programming effort, hardware designers find
them to be a very difficult part of the standard to implement. Subnormals
are a special case that occurs relatively rarely, but they also require that
all computations check for the special combination of a zero exponent
and a nonzero mantissa and then turn off the usual prealignment and renormalization
hardware appropriately. Thus, implementing subnormals can both increase
the cost of a floating point unit and result in slower computation overall.
In some processors, for example the PowerPC, the programmer can disable
the use of subnormal numbers. In the PowerPC, when subnormals are disabled,
a subnormal result generates a normalized value and an exception is raised
so that the software can handle the underflow. Of course, the exception
can be ignored if speed is more desirable than accuracy. When subnormals
are enabled, no underflow exception is raised and the FPU takes care of
all of the special handling.
The IEEE standard has four different rounding modes. The usual mode is
to round to the nearest value, with a number that falls midway between two
others being rounded to the nearest value with an even (zero) low-order
digit. The other modes are round toward zero, round toward plus infinity,
and round toward minus infinity.
Rounding is based on two additional bits of precision that are kept in intermediate
stages of a floating point calculation. These are the guard and round bits.
To see why two bits are needed, consider that adding two numbers can require
an extra digit in order to know how to round them (as when one value is
shifted by one position with respect to the other) and that the result of
such an addition can also produce a carry out of the high order place. Thus,
we need to be able to store a result that is two bits longer than normal.
One problem with the "round to even" mode is that it complicates
the rounding process. For example, the result of a multiply might end with
a long string of zeros and then a one. Without the one, the result would
be rounded down to the nearest even value. But with the one, the result
does not fall half way between the two rounded possibilities -- it actually
falls closer to the odd number just above. Thus, the FPU has to keep track
of whether there are any one bits beyond the guard and round -- it does
this with a bit called the "sticky bit" because it is often implemented
by watching the low order bits as they are shifted out to the right during
normalization, and sticking to a one if any one bits are seen.
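A simplified Python sketch of round-to-nearest-even when the extra low-order bits of an intermediate mantissa are dropped (here the highest dropped bit plays the role of the round bit and the OR of everything below it is the sticky bit):

def round_nearest_even(mantissa, extra_bits):
    """Drop extra_bits low-order bits of mantissa, rounding to nearest even."""
    round_bit = (mantissa >> (extra_bits - 1)) & 1
    sticky = (mantissa & ((1 << (extra_bits - 1)) - 1)) != 0
    kept = mantissa >> extra_bits
    if round_bit and (sticky or (kept & 1)):
        kept += 1      # round up if past halfway, or exactly halfway with kept odd
    return kept

print(round_nearest_even(0b10110100, 3))   # 10110.100: a tie, kept value stays even -> 22
print(round_nearest_even(0b10110101, 3))   # 10110.101: the extra one pushes it past halfway -> 23
print(round_nearest_even(0b10111100, 3))   # 10111.100: a tie, rounds up to even -> 24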