Comparative Study of Parallel Multipliers Based on Recoding Techniques

A 5-bit recoding scheme reduces the number of partial products by a factor of four in an array multiplier, but at the same time increases the complexity of the recoding and partial product generation process. There is as yet no concrete evidence about the efficiency or deficiency of applying 5-bit recoding. In this study, we compare the area-time performance of VLSI implementations of 5-bit recoding array multipliers to their 3-bit recoding counterparts. Conditions under which a 5-bit recoding multiplier is more efficient, in terms of area and propagation time, than a 3-bit recoding one are derived for given word lengths. We then present an example 5-bit recoding circuit design and its VLSI layout that yields the area-time efficiency.


The instruction, fetched from the main memory, specifies the operation to be performed by the arithmetic units and the address of the operands in the memory to be operated upon. After the numbers have been obtained from the memory, the arithmetic unit performs the operation specified by the instruction.
The extensive development of large scale digital computers in the past few years has naturally been accompanied by the corresponding development of the mathematical techniques required for the most efficient use of these tools of research. In the 70's and 80's, multiplication in low-end computers was usually performed on adder units. However, with the advent of VLSI, the bulky multiplier can be realized on the processor chip. This direct implementation increases the multiplication speed.
This also fits the current trend of intensive scientific computation. For instance, important applications such as Digital Signal Processing, Image Processing, and circuit simulation depend on the speed of multiplication.
In many applications, the system performance depends upon the floating-point multiplication time. The speed of the integer multiplier has a direct effect on the speed of the floating-point multiplier, since the latter is built around the former. Designing a fast multiplier has been of great theoretical and practical interest to computer scientists and engineers. Various multiplication algorithms have been proposed and practically implemented.
The multiplication process can be divided into three different steps [2]:
• Adding each partial product to the accumulated sum.
• Shifting the multiplicand to the right.
• Adding the next partial product.
The above procedure repeats until all partial products have been added. Earlier, the number of ICs needed for multiplying two numbers varied from three to four.
The 64 × 64 multiplier required four different ICs to implement the three steps of multiplication [3]. As technology advanced, the implementation of a multiplier became possible on a single IC.
Most of the early computers used fixed-point arithmetic, but it presented difficulty in handling large scientific and engineering computations. Floating-point computation was thus proposed to overcome these difficulties, even though operations on data in scientific notation are more complex than fixed-point or integer operations. Since single-precision multiplication is less accurate than the integer operation, many floating-point operations are performed in double precision [4]. Most computers are equipped with both fixed-point and floating-point arithmetic processors. Hence, there is a need for both fixed-point and floating-point multipliers.
Conventionally, the Floating Point Unit (FPU) was external to the Central Processing Unit (CPU) as a coprocessor [5]. The current trend is toward putting the FPU on the processor chip. For example, the Intel 80486 integrates the numeric coprocessor and CPU on a single chip. The advent of RISC processors also brought the CPU core, cache memory and FPU onto one chip. Since the RISC concept greatly reduces the complexity of the processor, and at the same time the advent of VLSI increases the number of devices per VLSI chip, the demand for high-performance floating-point coprocessors has created a need for high-speed, small-area multipliers.
The multiplication of numbers in one's complement format and in two's complement format is treated differently. Until 1951, the result of the multiplication of two numbers had to be corrected if either or both of the operands were negative. Booth [7] derived an algorithm which produced the result without any need for the correction. The use of two's complement numbers significantly simplified the addition process. Ghest [8] described a multiplier chip which implements Booth's algorithm in a parallel mode. This algorithm produces a two's-complement product when multiplier and multiplicand are represented in two's-complement form. This was the first practical high-speed multiplier.
The processing revolution in digital computers has been augmented by even more dramatic advances in microelectronic circuitry in silicon. It has been a challenging task to develop a multiplier which can perform multiplication in time proportional to the logarithm of the word length of the operands, and which has a regular cellular array structure suitable for VLSI implementation. However, comparing the performance of different multiplier designs is extremely difficult. For instance, when a multiplier design claims a speed of 10 ns, there is no indication that this is a better design than one with a speed of 15 ns. Of course, we must first make sure that they are both of the same word length. Still there are too many variables in the domain of multiplier design. In this example, the 10 ns multiplier may be a GaAs array multiplier, while the 15 ns one may be a CMOS Booth multiplier.
The new submicron technology helps to achieve a few of the above requirements [9].
This work essentially points out the constraints in using different techniques for multiplication. The technology used here is not of great importance.

Chapter 2

Background Theory
Integer multiplication is inherently slower than integer addition or subtraction. However, by careful design, the hardware implementation of multiplication can be improved [10]. Multiplication is governed by two different processes: the crux of the multiplication process is the generation of the partial products and the summation of the partial products [11]. The "add and shift" approach is the straightforward technique to perform a multiplication when only adding and shifting resources are available.
For example, multiplying two unsigned binary numbers is performed as follows: When this operation is simulated in a software routine, each bit of the multiplier results in one add and one shift operation. Since most computers can add and shift in the same instruction cycle, at least n operations are required for n bits of multiplication. This sequential method is adequate for indirect implementation of multiplication. For high speed multiplication a combinatorial approach is needed.
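As a minimal behavioural sketch of the add-and-shift method (a software model, not a circuit; the function name is ours), each multiplier bit triggers at most one add and one shift:

```python
def shift_add_multiply(x: int, y: int, n: int = 8) -> int:
    """Multiply two unsigned n-bit integers by the add-and-shift method.

    For each bit of the multiplier y, taken LSB first, the shifted
    multiplicand x is added to the accumulated sum when that bit is 1.
    """
    acc = 0
    for i in range(n):          # one add and one shift per multiplier bit
        if (y >> i) & 1:
            acc += x << i       # multiplicand aligned to bit position i
    return acc
```

With n-bit operands the loop runs n times, matching the "at least n operations" count given above.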

In this approach partial products are formed simultaneously and then added concurrently. Each bit in the partial products is formed by ANDing the multiplicand bit with the multiplier bit [12]. While the partial product generation is fast, the summation of partial products is slow.
There are many methods to speed up the multiplication process. At the first stage, the speed-up can be accomplished by employing modified Booth recoding or k-bit recoding [11] to reduce the number of partial products, which ultimately reduces the partial product summation time. The speed-up of the summation of partial products may be achieved by using carry-save adders, Wallace trees [13], counters [14] and 4:2 compressors [15]. At the final stage, the time for adding the final two partial products can be further reduced [3]. To this end conditional-sum, carry-select [16] and carry-lookahead adders are used to reduce the multiplication time. For a certain range of word lengths the carry-lookahead adder has been shown to be the fastest [12].
Thus carry-lookahead is employed in the last stage of addition. Instead of waiting for the carry to ripple, the carry is added at a later stage. For an n × n array multiplier, n partial products are generated. This means an array of O(n^2) devices. Different circuits, structures and physical implementations of the partial product reduction array portion of the multiplier were analyzed in [17].

Generation of Partial Product
In multiplication, if one addition is performed for each one in the multiplier, the average multiplication would require half as many additions as there are bits in the multiplier. This can be improved considerably by the use of both addition and subtraction of the multiplicand. The rules determining when to add or subtract were developed, and the method of determining the number of operations to expect from the bit groupings was explained [11]. This resulted in a variable number of add cycles for fixed-length multipliers.
Substituting shift cycles for add cycles when the multiplier bit is zero can reduce the number of add cycles further (equation 2.1). The number as written would consist of coefficients only and would be written as A_n A_{n-1} ... A_1 A_0. If such a number contained the coefficients ...01111111110..., this part of the number would have the value 2^{n-1} + 2^{n-2} + ... + 2^{n-x}, where n is the position number of the highest-order one in the group (the lowest-order position in the number being designated zero) and x is the number of successive ones in the group. The numerical value of this expression may also be obtained from the expression 2^n - 2^{n-x}. Thus for any string of ones in a multiplier, the necessity of one addition for each bit can be replaced by one addition and one subtraction for each group. The only additional equipment required is a means of complementing the multiplicand to permit subtracting. A shift counter or some equivalent device was provided to keep track of the number of shifts and to recognize the completion of multiplication.
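The string-of-ones identity above can be checked numerically (a small sketch; the function name is ours):

```python
def string_of_ones_value(n: int, x: int) -> int:
    """Value of a run of x consecutive ones whose highest-order one
    contributes 2**(n-1), i.e. 2**(n-1) + 2**(n-2) + ... + 2**(n-x)."""
    return sum(2 ** (n - i) for i in range(1, x + 1))

# Booth's observation: the same value equals 2**n - 2**(n - x), so the
# x additions for the run collapse to one addition and one subtraction.
```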
The above method of multiplication is dependent on the number of ones in the multiplier. Multiplication using uniform shifts is considered more practical: the number of shifts then depends on the size of the multiplier rather than on the number of ones, so the number of cycles can be predicted.
Rules were developed for handling two-bit and three-bit multiplier grouping. The discussion of multipliers is incomplete without mentioning Booth. Booth [7] described a technique which uses recoding of binary numbers for multiplication.
The purpose of Booth's algorithm is to skip over a string of bits rather than to form a partial product for each bit. The process is independent of any foreknowledge of the signs of the numbers; since the two's complement representation of numbers is used almost universally, Booth's algorithm treats positive and negative numbers alike. Booth's algorithm scans two bits of the multiplier X at a time and generates partial products of the form -Y, 0, or +Y.
• If x_i x_{i-1} = 01, the multiplicand Y is added to the existing sum of partial products, and the result is shifted one bit to the right.
• If x_i x_{i-1} = 10, the multiplicand Y is subtracted from the existing sum of partial products, and the result is shifted one bit to the right.
• If x_i x_{i-1} = 00 or 11, no addition or subtraction is performed; the result is simply shifted one bit to the right.
The procedure requires the attachment of a dummy digit (x_{-1} = 0) to the right of the LSB before evaluating the binary pairs. In two-bit recoding each adjacent pair shares one bit, so that in every iteration only one bit of the multiplier retires. In this way an n-bit multiplier generates n partial products.
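The pair-scanning rule can be modelled as follows (a behavioural sketch assuming two's-complement input; digit +1 means "add Y", -1 means "subtract Y", 0 means "shift only"; the function name is ours):

```python
def booth_recode(x: int, n: int) -> list[int]:
    """Radix-2 Booth recoding of an n-bit two's-complement multiplier.

    With the dummy digit x_{-1} = 0 attached, each adjacent pair
    (x_i, x_{i-1}) yields the signed digit x_{i-1} - x_i:
    01 -> +1, 10 -> -1, 00/11 -> 0.  Digits are returned LSB first,
    and sum(d_i * 2**i) reconstructs the value of x.
    """
    prev = 0                      # dummy digit x_{-1}
    digits = []
    for i in range(n):
        cur = (x >> i) & 1        # Python's arithmetic shift handles negatives
        digits.append(prev - cur)
        prev = cur
    return digits
```

For x = 6 = 0110 the digits are [0, -1, 0, 1], i.e. 6 = -2 + 8, so one addition and one subtraction replace two additions.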
Another speed-up technique, a modification of Booth's method described by MacSorley [11], increases speed by reducing the number of partial products by a factor of two; this reduces the number of CSA stages and the gate count. The modified Booth algorithm halves the number of partial products because it scans three bits at a time: the multiplier is divided into substrings of 3 bits each, with adjacent substrings sharing a common bit. In every iteration, two bits of the multiplier retire. The summation time for all partial products is thus significantly reduced. The array multiplier implementation has a complexity of O(n^2/2).
A proof of the modified Booth algorithm was later given by Rubinfield [18]. However, the partial product generation is not trivial. Modified Booth recoding was later extended to recode 5 bits and higher [19]. Table 2.4 shows the signed-digit values for the 5-bit recoding. The multiplier is recoded in 5-bit groups. This corresponds to a radix-16 representation (k = 4) for X using the signed digits 0, ±1, ..., ±8. Each recoder cell is larger than before, but only n/4 of them are required, compared to n/2 for the modified Booth algorithm. The increase in the number of recoding bits leads to a significant reduction in the number of partial products but at the same time adds to the complexity of partial product generation. The generation of the odd multiples of Y (3Y, 5Y, 7Y) is difficult in this form of recoding.
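A behavioural sketch of the 5-bit (radix-16) recoding: the digit formula below is the generalized overlapped-group rule, and the function name is ours:

```python
def recode_radix16(x: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into n/4 signed
    radix-16 digits, each in {-8, ..., +8}.

    Digit j examines the 5-bit group x_{4j+3} .. x_{4j} plus the
    overlap bit x_{4j-1} (with x_{-1} = 0):
        d_j = -8*x_{4j+3} + 4*x_{4j+2} + 2*x_{4j+1} + x_{4j} + x_{4j-1}
    """
    def bit(i: int) -> int:
        return 0 if i < 0 else (x >> i) & 1

    return [-8 * bit(4*j + 3) + 4 * bit(4*j + 2)
            + 2 * bit(4*j + 1) + bit(4*j) + bit(4*j - 1)
            for j in range(n // 4)]
```

The value is recovered as sum(d_j * 16**j); only n/4 partial products remain, at the price of needing multiples up to ±8Y.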
All the other multiples (Y, 2Y, 4Y, 8Y) are generated by shifting Y (i.e., hardware displacement) to the left by 0, 1, 2, or 3 bits respectively. 6Y is obtained by displacing 3Y one position to the left. In [19] the handling of the odd multiples was done by using carry-select adders (adding power-of-two multiples) to obtain the required multiples; three such adders are needed to obtain the odd multiples. The 5-bit recoding was claimed to be an attractive choice for large multiplier sizes (48 × 48) because of its high-speed performance and, at the same time, good area cost [19]. However, the results shown in [19] may be misleading, since no actual comparative study exists.
Typically, for a k-bit recoding, n/(k-1) partial products are generated. These partial products are of the form -2^{k-2}Y, -(2^{k-2}-1)Y, ..., 0, ..., +(2^{k-2}-1)Y, +2^{k-2}Y. The main advantage of going to a higher-bit recoding lies in reducing the CSA array delay: by doubling the size of k, the delay of the CSA array is halved.
The most formidable obstacle in going to a higher-bit recoding is the generation of the odd multiples of the multiplicand, e.g. 3Y, 5Y, 7Y, etc. Every time the recoding size is increased by one bit, the number of digits required to represent the binary number is doubled. Thus an increase of one bit requires selectors that must select from twice as many digits as before. Moreover, the number of odd multiples of the multiplicand is doubled, requiring more carry-select adders to implement those multiples. Some multiples of Y can only be obtained by the addition of more than two power-of-2 multiples; the odd multiple 11Y, for example, is obtained by the addition of 8Y, 2Y and Y.
Such multiples can be formed by using carry-save adders to reduce the summands to two numbers and then carry-select adders to obtain the required multiple. Thus an additional carry-save adder is needed, apart from the carry-select adder, for generating the odd multiples in a higher-bit recoding. The hardware size increases exponentially as the value of k increases. The use of 9-bit recoding would offer only a 25% improvement in array delay over the 5-bit case, while requiring 1-of-128 selectors and the generation of 63 odd multiples of Y [20]. The complexity determined by the number of devices will then no longer estimate the actual implementation complexity accurately. For VLSI implementation, the distribution of all the partial products from their formation to the partial product summation array takes a fairly large area proportional to k. Thus the operations required, and the number and complexity of the circuits needed to implement them, grow very quickly, and soon prohibit a practical multiplier design using a higher-bit recoding.

Recoding Algorithm
The mathematical representation of the partial products is derived to consummate the discussion of recoding. Consider the case of unsigned numbers; let X represent the multiplicand, and Y = Y_{n-1} Y_{n-2} ... Y_1 Y_0 an integer multiplier, with the binary point following Y_0. The lowest action is derived from the multiplier bits Y_1, Y_0, 0. The next higher action is found from the multiplier bits Y_3, Y_2, Y_1, but the resulting action is shifted by 2. For the highest-order action with an unsigned multiplier, the action must be derived with a leading or padded zero. For an odd number of multiplier bits the last action will be defined by 0, Y_{n-1}, Y_{n-2}. Multipliers in two's complement form may be used directly in the algorithm. In this case the highest-order action is determined by Y_{n-1}, Y_{n-2}, Y_{n-3} for an even number of multiplier bits, and by Y_{n-1}, Y_{n-1}, Y_{n-2}, a sign-extended group, for an odd-sized multiplier.
Given an (n+1)-digit binary vector B = B_n B_{n-1} ... B_1 B_0, we wish to obtain an (n+1)-digit canonical SD vector D = D_n D_{n-1} ... D_1 D_0 with D_i ∈ {-1, 0, 1}. This process can be extended to generate two or even more signed digits at a time [12]. The radix-4 SD vector has the digit set {-2, -1, 0, 1, 2}. An SD representation of X in radix 2^k will have n/k signed digits. The value of X can be written as

X = Σ_{j=1}^{n/k} D_j 2^{k(j-1)}    (2.4)

The proof of generalized multibit recoding was given by H. Sam et al. [19]. After substituting the values of D_i in equation (2.2) and rearranging, the above equation can be represented in n-bit two's complement binary integer format.

Partial Product Reduction
The partial products generated by the different schemes must be added to form the required product of the two n-bit numbers. Different hardware units can be employed to add these partial products. Each partial product forms a row in the multiplication array, and each bit in a row is represented by a weight, ranging over 2^0, 2^1, ..., 2^n depending on the position of the bit. When one row is added to another, bits of the same weight must be added together. The number of rows that can be added depends on the complexity of the adding unit. This adding unit is a compressor, which receives inputs and generates outputs. The number of compressors required is a function of the number of columns and rows in the array. This reduction of partial products is done to accelerate the addition of the summands. The techniques for the reduction of partial products are discussed in the following subsections.

Array Multiplier
This is the most basic form of multiplier. The attractive feature of this multiplier is its regular array structure. Such arrays lead to a design that is easy to lay out efficiently and has a high throughput. Figure 2.1 shows an array multiplier of size n × n proposed by Braun [21]. It needs n(n - 1) full adders and n^2 AND gates to implement. Consider two unsigned binary integers A = a_{m-1} ... a_1 a_0 and B = b_{n-1} ... b_1 b_0 with values A_v and B_v, respectively. In a binary multiplication, the (m+n)-bit product P = p_{m+n-1} ... p_1 p_0 is formed by multiplying the multiplicand A by the multiplier B. The product P has the value

P_v = Σ_{i=0}^{m-1} Σ_{j=0}^{n-1} a_i b_j 2^{i+j}

Each of the partial products a_i b_j is called a summand. The arrangement of these summands forms a matrix layout. For operations other than unsigned multiplication, complementers have to be included. These operations are typically sign-magnitude, one's complement, and two's complement multiplication. The precomplementer converts the two operands to positive integers before they are multiplied by the core unsigned multiplication array. The postcomplementer converts the result back to the signed number representation if the two input operands do not agree in sign.
This form of multiplier employs half adders for the topmost row and full adders for the remaining part of the array to form the final product. Assuming that the numbers are in two's complement form, the time delay of this n-bit multiplier can be calculated by tracing the worst-case carry path. Although it has a very regular structure for implementation on a VLSI chip, the speed of this multiplier is not competitive. Better designs have been proposed that improve the speed of adding the summands effectively [22] [23].

Wallace Tree Reduction
The basic addition process employed in computers adds two numbers together. The possibility exists of adding more than two numbers in a single adder to produce a single sum. However, the logical complexity of the adder then grows faster than the resulting gain in speed.
This method introduced the addition of three numbers, producing two result numbers rather than one. These two numbers have different weights and are represented by a sum and a carry. The advantage of this adder is that it can operate without carry propagation along the digit stages. This form of adder is a simple full adder where the carry inputs are used for the third input number, and the carry outputs form the second output number.
Wallace [13] introduced a tree structure, an interconnection of carry-save adders that reduces n partial products to two operands. The principle of the carry-save adder is to reduce three bits of equal weight to two bits, one a sum and the other a carry. Several adders have to be provided to reduce the partial products to two.
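The reduction schedule implied by this principle can be sketched by counting carry-save levels (assuming each level compresses every complete group of three rows into two; the function name is ours):

```python
def wallace_stages(rows: int) -> int:
    """Number of carry-save (3:2) levels needed to reduce `rows`
    partial-product rows to the final two operands."""
    stages = 0
    while rows > 2:
        # every complete group of 3 rows becomes 2; leftover rows pass through
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages
```

For example, 9 rows reduce as 9 → 6 → 4 → 3 → 2, i.e. four levels, consistent with logarithmic rather than linear growth in delay.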
The maximum height of a column is n bits, and it can be divided into three-bit groups.
Each of these groups can be reduced simultaneously to two bits, resulting in a new column two-thirds of the original height. The new column is again divided into three-bit groups, and the process repeats until the final column height is two bits. In a simple carry-save array the time required varies linearly with the number of summands, whereas the delay to reduce the height of a Wallace tree is log_{3/2}(N/2) levels; the base is derived from the fact that in full-adder arrays, three partial products are compressed to two at a time.

The 4:2 compressor is a unit adder developed to enhance parallelism [24] and the speed of the array part during circuit design [15]. It can compress four partial products of the same weight into two new partial products. Since in a 4:2 compressor the four partial products are concurrently reduced to two, the rate at which the partial products are reduced is log_2(N/2). Although the 4:2 compressor is more complex, it is faster than the Wallace tree and yields a more regular tree structure. The equations that represent the outputs of the 4:2 compressor are given in [25].

Published multiplier implementations compare as follows:

Ref.   Area (mm × mm)   Delay    Recoding         Reduction method
[25]   1.5 × 0.4        —        —                Carry-save array
[26]   0.61 × 0.58      9.5 ns   Array            Carry-save adder
[27]   —                7.5 ns   Modified Booth   Modified array
[20]   5.8 × 6.3        120 ns   Modified Booth   Redundant binary addition
[15]   3.8 × 6.5        120 ns   Booth's          4:2 compressor
[28]   1.55 × 1.44      —        —                —

Chapter 3

Design and Area Evaluation
The area-time performance is the most important governing factor in VLSI devices as well as in any multiplier design [32]. The variety of existing designs is due to the effort to compromise between area and speed. These designs vary in the method by which the partial products are generated and the method by which these partial products are reduced to only two partial products. Confusion still exists when comparing two multiplier designs for their area-time performance. The set of design rules and the device characteristics of the chosen technology determine the area and speed of the circuit.
Obviously, a fair comparison must be based on the microelectronic architecture level to provide an unbiased basis.

Recoder Design
The recoder is one of the most important units in a fast multiplier. It performs the basic function of reducing the summands of the multiplier. Since the area occupied by the recoder has to be minimal, the design selected should not defeat the purpose of choosing the recoding method to generate the partial products. The gate delay governs the speed of the recoder; the design should have minimum delay for a fast multiplier. Designs of 3-bit and 5-bit recoders are discussed in this section.

3-bit recoder
The design of a 3-bit recoder was suggested by Ghest [8]. It is easily implementable, and different logic gates are used to make it area- and time-efficient. Figure 3.1 shows the design of a multiplier using a 3-bit recoder. The multiplier and the multiplicand are assumed to be n bits in length. Each encoder has three inputs coming from the multiplier Y. The output of the encoder generates three control signals to select one of the multiples of X: (0, ±X, ±2X). The three control signals are Boolean functions of the multiplier bits Y_n, Y_{n-1} and Y_{n-2}. For an n-bit multiplier, n/2 Booth encoders are required.

5-bit recoder
Since the 5-bit recoding technique is not universally accepted, the 5-bit recoder had to be designed to meet the requirements. Although a multiplier using 5-bit recoding was designed in [19], the design itself was not discussed. Figure 3.2 shows the design of a multiplier using a 5-bit recoder. There are n/4 5-bit encoders, each having five inputs from the multiplier X and generating the four control signals for the multiplexers.
These control signals are derived in Table 3.1 and are represented by S_4, S_3, S_2 and S_1.
The following equations, derived from Karnaugh maps, represent the four signals and the sign, which is controlled by k.
Two 4:1 multiplexers and one full adder are required to generate each bit, and there are 2(n + 3) multiplexers in each row. These multiplexers select different multiples of the multiplicand Y. The sum output of the adder generates one of the possible multiples of Y (0, ±1Y, ..., ±8Y). The k input to the adder controls the sign of the output. Thus the outputs of the adders form an array that is reduced by the different partial product reduction techniques.

Area Evaluation
Area is an important factor when the economy of the design is a concern.
The area could affect the size of a chip by a considerable amount. Here, we consider two partial product generation approaches: 3-bit recoding (or modified Booth recoding) [11] and 5-bit recoding [19]. The reduction of the partial products is done by using an array of full adders, a Wallace tree [13], or 4:2 compressors [24].

Notations
In the area evaluation, the area of a full adder is used as a unit of measurement.
The area of a full adder, a_f, is approximately double that of a half adder, a_h, and the area of a 4:2 compressor, a_c, is 8 units. We further assume that the area of a 3-bit recoder, a_r3, equals α units, and that the area of a 5-bit recoder, a_r5, equals αβ units. Here β is the ratio of the area of the 5-bit recoder to that of the 3-bit recoder. These assumptions can be written as:

a_h = 0.5 a_f    (3.12)
a_c = 8 a_f    (3.13)
a_r3 = α a_f    (3.14)
a_r5 = αβ a_f    (3.15)

The multipliers for 8, 12, 16, 24 and 32 bits are examined respectively. The partial products are distributively generated using 3-bit and 5-bit recoding, respectively.
The recoding can be distributive, or by shifting the multiplier bits in a centralized device. Although more area is needed when using the former, the time delay is substantially less. Hence there is a tradeoff between area and time. Distributive method of recoding is used for the analysis purpose as speed is the prime factor in high-speed computer arithmetic.

Multiplier Recoding
For a 3-bit recoded multiplier, n/2 recoders are required to generate all the partial products simultaneously. The generated partial products are (n+1) bits in length. For each row, a 3-bit recoder and n multiplexers are needed. Let the area of a multiplexer and two additional gates be γ_3 a_r3; then the area used for recoding at each row is

a_row3 = (1 + n γ_3) a_r3    (3.16)

For a 5-bit recoded multiplier, n/4 recoders are required to generate all the partial products simultaneously. Since the partial products range from -8Y to +8Y, the recoded partial products are (n+3) bits in length. In this case, n + 2 multiplexers are needed. Note that two four-to-one multiplexers are used in 5-bit recoding. Let the area of these multiplexers and the full adder be γ_5 a_r5. Then the area occupied by the recoding circuits at each row is

a_row5 = (1 + (n + 2) γ_5) a_r5    (3.17)
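As a numeric sketch, the per-row recoding areas can be totalled over all rows (n/2 rows for 3-bit recoding, n/4 for 5-bit); the α, β, γ_3 and γ_5 values used in the example below are placeholders, not figures from the text:

```python
def recoding_area(n: int, alpha: float, beta: float,
                  gamma3: float, gamma5: float) -> tuple[float, float]:
    """Total recoding area, in full-adder units a_f, for an n-bit multiplier.

    3-bit recoding: n/2 rows, each of area (1 + n*gamma3) * alpha
    5-bit recoding: n/4 rows, each of area (1 + (n+2)*gamma5) * alpha*beta
    """
    a3 = (n // 2) * (1 + n * gamma3) * alpha
    a5 = (n // 4) * (1 + (n + 2) * gamma5) * alpha * beta
    return a3, a5

# Example with placeholder ratios:
# recoding_area(16, alpha=5, beta=4, gamma3=0.25, gamma5=0.5) -> (200.0, 800.0)
```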

Array of Full/Half Adders
Once the partial products are obtained, they are added using a known partial-product reduction technique. An array of full adders, such as in Braun's multiplier [21], is a simple and straightforward approach. In that type of approach the partial products are generated without recoding; here, however, we assume recoding for the generation of partial products, and the array of bits is assumed to be uniform. The total area required for the modified array multiplier with 3-bit recoding is calculated from equation (3.20); substituting the above relations gives the area in full-adder units. With 5-bit recoding the array is (n + 3) bits wide, and the total area needed for the adder array follows from equation (3.24); after substituting the values of H_5(n) and F_5(n), the area is computed. Using the relationships from equation (3.14), the expression can be rewritten as equation (3.27). If 5-bit recoding is used, F_5w(n) full adders and H_5w(n) half adders are present in the Wallace tree. The area of the Wallace tree in an n-bit multiplier using 5-bit recoding is derived as equations (3.28) and (3.29).

4:2 Compressors
The 4:2 compressors are used to speed up the array part during circuit design. For the sake of simplicity, the full adders are replaced by 4:2 compressors to reduce four partial products to two; the fourth input of the first level is fed with a zero. In the case of 3-bit recoding, C_3(n) 4:2 compressors are required. The area occupied by 4:2 compressors in an n-bit multiplier is given by equation (3.30); using equation (3.14), it can be rewritten as equation (3.31). If 5-bit recoding is used, there are C_5(n) 4:2 compressors. The area occupied by 4:2 compressors in an n-bit multiplier using 5-bit recoding is derived as follows:

A_C5 = (n/4)(1 + (n + 2)γ_5) a_r5 + a_c C_5(n) + 0.5 H_5c(n) a_f    (3.32)

Using equations (3.13), (3.14) and (3.15):

A_C5 = αβ (n/4)(1 + (n + 2)γ_5) + 8 C_5(n) + 0.5 H_5c(n)    (3.33)

Again, C_3(n), H_3c(n), C_5(n) and H_5c(n) cannot be formulated systematically. The values used in the next section for the purpose of comparison are based on actual implementations.
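One common realization of the 4:2 compressor chains two full adders; the sketch below models it at the bit level (a generic construction for illustration, not the exact gate equations of [25]):

```python
def compressor_4_2(a: int, b: int, c: int, d: int, cin: int):
    """4:2 compressor built from two chained full adders.

    Four equal-weight bits plus a carry-in are reduced to a sum bit
    and two carry bits of double weight; cout depends only on a, b, c,
    so the cout/cin chain between columns does not ripple.
    """
    s1 = a ^ b ^ c
    cout = (a & b) | (b & c) | (a & c)       # carry to the next column
    s = s1 ^ d ^ cin
    carry = (s1 & d) | (d & cin) | (s1 & cin)
    return s, carry, cout                    # a+b+c+d+cin == s + 2*(carry+cout)
```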

Area Comparisons
The areas of the multipliers with 3-bit recoding and 5-bit recoding are compared in this section for the different techniques of partial product reduction. These areas give the cost estimation of the multipliers.

Array of Full/Half Adders
The reduction of partial products in the array multiplier is a deterministic function of n. Therefore the areas cannot be represented solely in full-adder units. The evaluation of array multipliers is done in order to give a complete treatment of the multipliers.

Wallace Tree
The areas required for n-bit multiplication using 3-bit and 5-bit recoding are set equal. Then the value of β can be calculated for a given value of α. The value of β obtained gives the ratio of the area of a 5-bit recoder to that of a 3-bit recoder required in order to have the same size of multiplier.
Equating A_W3 and A_W5 and solving for β, we obtain

β = [ (n/2)(1 + n γ_3)α + F_3w(n) - F_5w(n) + 0.5(H_3w(n) - H_5w(n)) ] / [ (n/4) α (1 + (n + 2)γ_5) ]

We assume a 2 µm p-well CMOS process. The value of α is varied from 3 to 8. The accompanying table shows the values of F_5w(n) and H_5w(n).
In other words, with the given parameters, if the area of a 5-bit recoder is less than four times that of a 3-bit recoder, the multiplier with 5-bit recoding is more efficient in area. Figure 3.5 shows several plots based on the above equations. Note that the plots represent the upper bounds of β for the given values of α and n. For example, when n = 16 and α = 5, we derive that β = 4.4. This means that when a 3-bit recoder is five times the size of a full adder, for n = 16, a 5-bit recoding multiplier using a Wallace tree is more area efficient if the size of a 5-bit recoder is less than 4.4 times the size of a 3-bit recoder.
We emphasize that this bound is based on two fundamental variables: the size of the 3-bit recoder and the size of the 5-bit recoder. The value of α must be based on the most compact layout of the 3-bit recoder and the full adder.

4:2 Compressors
The areas required for n-bit multiplication using 3-bit and 5-bit recoding are set equal. The value of β is calculated for different values of α, as was done for the Wallace tree.

Time Evaluations
Since fast arithmetic circuits are key elements of high-performance computers, the speed of the multiplier is a major concern in modern applications. Here, we examine the delay times of the various multiplier designs.
The timing analysis is carried out for the different stages of multiplication, which are subdivided into three:
• Recoding stage
• Partial products reduction stage
• Final adder stage
The delay time of the recoding stage is the time taken for the encoder to do the recoding and to generate the partial products. The delay of the partial products reduction depends on the recoding method and the array organization. The final stage is a parallel adder with a delay of Δ_cp; this delay is independent of the recoding method and of the organization of the partial products reduction.

Recoding Stage
The time taken for recoding can be represented by Δ_3r for a 3-bit recoder and Δ_5r for a 5-bit recoder. Since a distributive approach is assumed for the generation of partial products, the delay of the recoder is not a function of n. In [19] the recoding is assumed to be centralized; then the generation of odd multiples (3Y, 5Y, etc.) needs n-bit binary adders and the delay time is a function of n. This is obviously not suitable for high-speed multiplication, and such schemes will not be considered in the following discussions.
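A quick way to see why 5-bit recoding needs the odd multiples 3Y, 5Y and 7Y is to enumerate the digit sets. The sketch below is our own illustrative code, not the circuit's logic equations: it recodes an n-bit two's-complement operand with a 3-bit (radix-4) or 5-bit (radix-16) overlapped scan and checks the identity x = Σ d_k · r^k.

```python
def recode(x, n, scan):
    """Overlapped multi-bit (Booth-style) recoding of an n-bit
    two's-complement operand.  scan=3 gives radix-4 digits in [-2, 2];
    scan=5 gives radix-16 digits in [-8, 8]."""
    step = scan - 1                                   # bits consumed per digit
    bits = [(x >> i) & 1 for i in range(n + step)]    # zero padding on top
    digits = []
    for i in range(0, n, step):
        prev = bits[i - 1] if i > 0 else 0            # overlap bit
        d = prev + sum(bits[i + j] << j for j in range(step - 1))
        d -= bits[i + step - 1] << (step - 1)         # negatively weighted top bit
        digits.append(d)
    return digits

# The recoded digits reconstruct the operand: x = sum(d_k * r**k).
for x in range(256):
    value = x - 256 if x >= 128 else x                # 8-bit two's complement
    assert sum(d * 4**k for k, d in enumerate(recode(x, 8, 3))) == value
    assert sum(d * 16**k for k, d in enumerate(recode(x, 8, 5))) == value

# Radix-16 digits include ±3, ±5, ±7, so 3Y, 5Y and 7Y must be generated.
seen = {abs(d) for x in range(256) for d in recode(x, 8, 5)}
assert {3, 5, 7} <= seen
```

In a distributed scheme each digit's recoder works on its own 5-bit window in parallel, so the loop above corresponds to n/4 independent recoders rather than a serial pass over the operand.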

Partial Products Reduction Stage
The timing of an array multiplier is obtained by tracing the worst-case critical path from the point where the partial products are generated to the final reduction of these products into a single row. The delay in an array multiplier with a 3-bit recoder can be represented by the following equation.
(4.1)

Here δ is the time delay through one inverter. The term 8δ represents the delay in the last row, assuming that there is a CLA adder instead of allowing the carry to ripple through. The delay in the multiplier using a 5-bit recoder is

Δ_5am = Δ_5r + [(1/4)(7n − 28)(n − 8)]Δ_FA + 8δ + Δ_cp          (4.2)

From (4.1) and (4.2) it is observed that the simple array structure with 5-bit recoding is far better. However, this simple array structure is clearly not a candidate for high-speed multiplication; the importance of 5-bit recoding must be justified against state-of-the-art multiplier designs.
The number of carry save adder stages required to reduce the Wallace tree height from n to 2 is

(4.3)

The number of carry save adder stages required to reduce the partial products using 4:2 compressors is

(4.4)

The total delay in a 3-bit recoder and the partial product generation, Δ_3r, is the same for each row of partial product generation. Thus the total multiplication time needed for an n-bit number using modified Booth recoding and a Wallace tree is

(4.5)

The total multiplication time needed for an n-bit number using modified Booth recoding and 4:2 compressors is

(4.6)

The total delay in a 5-bit recoder and the partial product generation, Δ_5r, is the same for each row of partial product generation. The number of partial products generated in this case is n/4. Thus, the total multiplication time needed for an n-bit number using 5-bit recoding and a Wallace tree is

(4.7)

The difference in speed between the multipliers using a Wallace tree with 3-bit and with 5-bit recoding can be represented as

(4.8)
(4.9)

The total multiplication time needed for an n-bit number using 5-bit recoding and 4:2 compressors is

(4.10)

The 4:2 compressor can be realized using complex gates, and its delay is about 1.5 times that of a full adder. Therefore equations (4.6) and (4.10) can be rearranged to give

Δ_384 = Δ_3r + 1.5 log2(n/4) Δ_FA + Δ_cp          (4.11)
Δ_584 = Δ_5r + 1.5 log2(n/8) Δ_FA + Δ_cp          (4.12)

The difference in speed between the multipliers using 4:2 compressors with 3-bit and 5-bit recoding can be represented similarly, and the equations can be expressed in natural logarithms. For an n-bit multiplier, the time taken by the 3-bit and 5-bit recoders depends on the recoding technique used and on the way the partial products are generated.
Since recoding is done distributively, the n/2 partial products for 3-bit recoding and the n/4 partial products for 5-bit recoding are generated simultaneously. The delay of partial product reduction as the number of bits changes was studied for 3-bit and 5-bit recoders, and plots were obtained for the three different methods of reduction.
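The stage counts behind equations (4.3) and (4.4) can be estimated with a short script. The sketch below uses the standard carry-save recurrence (each stage maps groups of three rows to two) and the row-halving behavior of 4:2 compressors; the implementation-derived values used in the text may differ slightly.

```python
import math

def wallace_stages(rows):
    """Carry-save (3:2) stages needed to reduce `rows` rows to 2."""
    stages = 0
    while rows > 2:
        # each stage turns groups of 3 rows into 2; leftovers pass through
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages

def compressor_stages(rows):
    """4:2 compressor stages needed to reduce `rows` rows to 2."""
    return max(0, math.ceil(math.log2(rows / 2)))

for n in (8, 12, 16, 24, 32):
    r3, r5 = n // 2, n // 4        # rows after 3-bit / 5-bit recoding
    print(n, wallace_stages(r3), wallace_stages(r5),
          compressor_stages(r3), compressor_stages(r5))
```

For n = 32, 3-bit recoding leaves 16 rows (6 Wallace stages or 3 compressor stages) while 5-bit recoding leaves 8 rows (4 and 2), matching the log2(n/4) and log2(n/8) terms in (4.11) and (4.12).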

Comparison
The speeds of multipliers using the two recoders are compared for the Wallace tree and the 4:2 compressor. The difference in speed stays constant for the two types of recoders and is not a function of the number of bits. The technique used for reducing the partial products also plays an important role in determining the speed of the multiplier. The 4:2 compressor built from complex gates proved to be the fastest method for reducing the partial products. The total delay of the multiplier is obtained by adding the recoder delay to that of the reduction technique. The 3-bit recoder is always faster than the 5-bit recoder, but the total delay is a function of the number of bits n. t_wb and t_4b are the upper bounds for the time efficiency of 5-bit recoding multipliers. For a positive t_wb, the 5-bit recoding multiplier in question has a shorter propagation time than its 3-bit recoding counterpart; if the derived t_wb is negative, the 5-bit recoding multiplier is not as time efficient. The same principle applies to t_4b. Again we caution that the delay time of the 3-bit recoder used to derive t_wb or t_4b must be based on the best possible design for a true comparison.

Empirical Results
The VLSI layout of the recoder is done using the CAD tool MAGIC. The 3-bit recoder is designed according to [12]. The design of the 5-bit recoder is derived from truth tables.
These designs are implemented using the standard gates available in the cell library on the computers of the Electrical Engineering Department. In the trial layouts, the area of a 5-bit recoder is four times the area of a 3-bit recoder.

Chapter 5 Conclusion
In recent studies, the trade-off between speed and area is considered to be the bottleneck in multiplier design. A 10 ns multiplication time has been achieved by optimizing both the propagation time of the 4:2 compressor and that of the final adder [30]. It is also known that for VLSI applications, regularity of the signal flow and of the layout is very important [28]; this is even more so for submicrometer technology, where the devices are fast. In [33] a high-speed binary integer multiplication algorithm suitable for VLSI implementation was proposed, in which all intermediate results are represented in redundant binary representation and all additions are performed in the redundant binary number system, where the addition of two numbers can be performed without carry propagation in a constant time independent of the word length of the operands. The multibit overlapped scanning technique can be applied to the multiplication function, and possible actions to be taken in the design of such multipliers have been discussed [34].
In this study, we examine the microelectronic architectures of array multipliers (or hardware algorithms). The study is important in pointing out an unbiased approach to the comparison of multiplier designs. The speed-up contributed by technological innovation (reduction of feature size) and by fine-tuning at the switch or FET level is limited by the fundamental microelectronic architecture. In other words, it is much more important to understand the merit of an array multiplier before committing to it. The 3-bit and 5-bit recoders were implemented in VLSI to obtain concrete numbers for the study. The areas of the recoders were calculated from the layouts drawn in MAGIC. For the comparisons of areas, appropriate ratios were assigned to the full adders and the recoders. The number of full adders was calculated using the Wallace tree, and the areas for 8-, 12-, 16-, 24- and 32-bit multipliers were evaluated.
As far as area is concerned, 5-bit recoding is not efficient for multipliers smaller than 16 bits, given α = 7. In other words, an area-efficient 5-bit recoding multiplier can be achieved only for larger word lengths and with an efficient 5-bit recoder design and layout.
We caution that this conclusion was drawn without considering the areas required for the routing channels. Nonetheless, the comparison given here represents the best approximation so far.
For the timing analysis, a 4:2 compressor using a complex gate gives the best performance for both 3-bit and 5-bit recoders. The recoder delays were not lumped in with the delays for partial product reduction; when these delays are added to the partial product reduction delays, the total delay for the two stages can be modeled more accurately. Thus, the total timings also depend upon the recoding technique being used. In general, the total time required when using a 5-bit recoder is more than that required when using a 3-bit recoder.
When using 4:2 compressors, a 5-bit recoding multiplier may be efficient in area only for more than 32 input bits. Since evaluations for more than 32 bits were not performed here, this statement is based on the trend shown in the plots. In this configuration, 5-bit recoding multipliers are always faster than their 3-bit recoding counterparts. In this second set of evaluations, done with 4:2 compressors, the size of the multiplier should be more than 32 bits for area-efficient use of a 5-bit recoder. Note that this may be true only if the given recoder designs are used.
In the empirical results, we have demonstrated the approximate speeds of the 3-bit recoder and the 5-bit recoder. The 3-bit recoder design is apparently faster than the 5-bit recoder, but the ratio between them may vary depending upon the circuit