# Modified Multiply and Accumulate Unit with Hybrid Encoded Reduced Transition Activity Technique Equipped Multiplier and Low Power 0.13 $\mu \mathrm{m}$ Adder for Image Processing Applications 

S.Saravanan<br>Assistant Professor<br>Department of ECE, K.S.R. College of Technology<br>Tiruchengode-637215, India

M.Madheswaran<br>Professor<br>Department of ECE, Muthayammal Engineering<br>College, Rasipuram-647408, India


#### Abstract

This paper explores the design of a low power, high performance Multiply and Accumulate (MAC) unit with a Hybrid Encoded Reduced Transition Activity Technique (HERTAT) equipped multiplier and a low power $0.13 \mu \mathrm{~m}$ adder. The developed low power MAC unit is verified for image processing systems by exploiting insignificant bits in pixel values and the similarity of neighboring pixels in video streams. The proposed technique reduces dynamic power consumption by analyzing the bit patterns in the input data to reduce switching activity. If the number of 1's in the multiplier is less than or equal to three, the proposed encoding technique is used; otherwise the Booth technique is used. The proposed adder cell used in the MAC block consumes less power than previous adder designs. This high performance, low power MAC can be used in image processing. It is observed from device level simulation using TANNER 12.6 EDA that the proposed scheme reduces the switching activity in the MAC unit by up to $19 \%$ and saves up to $46 \%$ power.


## Categories and Subject Descriptors

B.8.2 [Performance Analysis and Design Aids]

## General Terms

Algorithms, Performance, Design

## Keywords

Low power, Booth Multiplier, MAC, RTAT.

## 1. INTRODUCTION

The growing popularity of portable and multimedia devices such as video phones and notebooks has motivated recent research on the design of low power VLSI circuits. Real-time image processing systems demand high computational power and a high data throughput rate, which limits the use of general purpose processors [1]. Application specific integrated circuits rely on efficient implementation of various arithmetic circuits for executing the specified algorithms. It is well known that as transistor density increases, the complexity of arithmetic circuits also increases and more energy is consumed. This has further motivated new concepts for designing low power VLSI circuits. Reducing power consumption while enhancing the circuit design poses challenges in implementing wireless multimedia and digital image processing systems, in which multiplication and multiplication-accumulation are the key computations. In the recent past, researchers have proposed various design methodologies for dynamic power reduction by minimizing switching activity [2].
Choi et al [3] proposed Partially Guarded Computation (PGC), which divides arithmetic units, e.g., adders and multipliers, into two parts and turns off the unused part to minimize power consumption. The reported results show that PGC can reduce power consumption by $10 \%$ to $44 \%$ in an array multiplier, with $30 \%$ to $36 \%$ area overhead, in speech related applications. A 32-bit 2's complement adder equipped with a Dynamic Range Determination (DRD) unit and a sign-extension unit was reported by Chen et al [4]. This design reduces the power dissipation of conventional adders for multimedia applications.
Later, Chen et al [5] presented a multiplier that uses the DRD unit to select the input operand with the smaller effective dynamic range to generate the Booth codes, reducing power dissipation by $30 \%$ compared with the conventional method. Benini et al [6] reported a technique for glitching power minimization that replaces a few existing gates with functionally equivalent ones that can be "frozen" by asserting a control signal. This method operates in the tightly restricted layout level environment and reduces total power dissipation by $6.3 \%$. The double-switch circuit-block switch scheme proposed by Henzler et al [7] is capable of reducing power dissipation by shortening the settling time during down time. Huang and Ercegovac [8] presented the arithmetic details of signal gating schemes and showed $10 \%$ to $45 \%$ power reduction for adders. Huang and Ercegovac [9] combined Signal Flow Optimization (SFO), a left-to-right leapfrog structure and an upper/lower split structure to optimize array multipliers, and reported that the new approach can save about $20 \%$ of the power dissipation. Wen et al [10] reported that, for a known output, some columns in the multiplier can be turned off, which reduces power consumption by $10 \%$ for random inputs. Chen and Chu [11] applied the spurious power suppression technique to both the compression tree and the modified Booth decoder to enlarge the power reduction. Ko et al [12] and Song and Micheli [13] investigated the full adder as the core element of complex arithmetic units such as adders, multipliers, dividers, exponentiation units and MAC units. Several combinations of static CMOS logic styles have been used to implement low-power one-bit adder cells. In general, the logic styles are broadly divided into two major categories, the complementary CMOS and the pass-transistor logic circuits. The complementary CMOS logic style uses the power lines as input, whereas pass-transistor logic uses separate input signals. However, one pass-transistor network is sufficient to implement the logic function.
The complementary CMOS full adder is based on the regular CMOS structure with pMOS pull-up and nMOS pull-down transistors [14]. The authors reported that the series transistors in the output stage form a weak driver, so additional buffers are required at the last stage to provide the necessary driving power to the cascaded cells. Chandrakasan and Brodersen [15] reported that the Complementary Pass-transistor Logic (CPL) full adder with swing restoration structure utilizes 32 transistors. A Transmission Function full Adder (TFA) based on the transmission function theory was presented by Zhuang and Hu [16]. Later, Weste and Eshraghian [17] presented a Transmission Gate Adder (TGA) using CMOS transmission gates, which are a special kind of pass-transistor logic circuit. The TGA is built from pMOS and nMOS transistors connected in parallel and controlled by complementary signals. Transmission gate logic requires at least double the number of transistors of standard pass-transistor logic to implement the same circuit. Hence research has focused on adder circuits with smaller transistor counts, most of which exploit non-full-swing pass transistors with the swing restored transmission gate technique. This is exemplified by the state-of-the-art 14T and 10T designs reported by Vesterbacka [18] and Bui et al [19]. These adders differ in their transistor counts and in the way their intermediate nodes are generated. Chang et al [20] proposed a hybrid style full adder circuit in which the sum and carry generation circuits are designed using hybrid logic styles. Full adders are used in a tree structure for high performance arithmetic circuits, and a cascaded simulation structure is introduced to evaluate the full adders in real time applications.
In view of the above, it is proposed to improve the performance of the MAC unit using a hybrid encoding technique. In this paper, a novel design method is proposed to reduce the switching activity and power consumption of the multiplier and adder used in the MAC by utilizing a hybrid encoding scheme.

## 2. POWER CONSUMPTION REDUCTION IN MAC

The architecture of the MAC with the power consumption reduction technique is shown in Fig 1. The major units of the MAC are the detection logic unit, which generates control signals according to the special conditions, the registers, the low power multiplier and the adder. The MAC unit is essential for kernel based processes, which require a large number of repetitive computational operations on a fixed window. The repetitive operations can be performed using parallel processing, which is expected to reduce the complexity and improve the performance.


Figure 1. Architecture of low power MAC unit

Images in video sequences are generally processed in raster scan order, hence neighboring pixels usually have the same values or very small deviations.


Figure 2. Illustration for pixels with its neighborhood relationship

This condition may apply in particular to video sequences captured in low light environments. An example of pixel values in a neighborhood of an image is shown in Fig 2. It can be seen that most of the pixels have the same value and differ only in the least significant part. This characteristic can be exploited to reduce switching activity in the design of arithmetic units.

In this research work, the power consumption of the MAC unit is reduced by incorporating a partial bit representation of the data in the multiplier and adder units. If the MSB or LSB part is zero, some operations in the MAC unit can be bypassed to reduce the switching activity. When such a condition is detected, an appropriate control sequence disables part or all of the data paths in the architecture. The low-power approach thus relies on this characteristic to reuse results or to bypass parts of the pixel values in the MAC computations. For the MAC operations on two consecutive pixels, switching activity in the MAC unit is reduced by applying the following design conditions. If the pixel values of two consecutive MAC operations are the same, the multiplier is disabled and the previous result at the output of the proposed low power multiplier is reused.

If the current pixel value is 0, the operation of both the proposed low power multiplier and the proposed low power adder is avoided and the previous result of the accumulator is reused. If part of the input is zero, the corresponding multiplier paths are frozen to reduce switching power. If the current pixel value is one, the operation of the multiplier is avoided and the current accumulated value is incremented by 1.
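The design conditions above can be illustrated with a minimal behavioural sketch in Python. It is not the hardware control logic: the function name `mac_with_bypass`, the single fixed second operand `coeff` and the activation counter are assumptions made for illustration. The pixel-equals-one case is modelled as adding the other operand without a multiplication, which coincides with an increment by 1 only when that operand is 1. The freezing of partial multiplier paths is not modelled.

```python
def mac_with_bypass(pixels, coeff):
    """Accumulate coeff * pixel over a pixel stream with bypass rules.

    Returns the accumulated value and the number of times the
    multiplier was actually activated (hypothetical counter).
    """
    acc = 0
    prev_pixel = None
    prev_product = 0
    multiplier_activations = 0
    for p in pixels:
        if p == 0:
            # Zero pixel: multiplier and adder stay idle, accumulator is reused.
            pass
        elif p == 1:
            # Pixel equal to one: skip the multiplier, add the other operand.
            acc += coeff
        elif p == prev_pixel:
            # Same pixel as the previous operation: reuse the previous product.
            acc += prev_product
        else:
            # General case: the multiplier is activated.
            prev_product = coeff * p
            multiplier_activations += 1
            acc += prev_product
        prev_pixel = p
    return acc, multiplier_activations


stream = [64, 64, 0, 65, 1, 66, 66]
total, used = mac_with_bypass(stream, coeff=3)
print(total, used)
```

The accumulated value is the same as for a plain MAC; only the number of multiplier activations changes, which is the quantity the bypass conditions aim to reduce.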

## 3. PROPOSED HYBRID ENCODED LOW POWER MULTIPLIER

In general, multiplication involves two operands, the multiplicand and the multiplier. In conventional shift and add multiplication, the number of Partial Products (PP) is equal to the number of bits in the multiplier. Booth recoding reduces the number of partial products by half. The proposed encoding technique reduces the partial products further, which in turn reduces the switching activity and power consumption. The operation is defined by the number of 1's in the multiplier and their positions. The proposed hybrid encoding rule is stated in Table 1 with details of the operation. If the number of 1's in the multiplier is less than or equal to three, control goes to the proposed multiplication technique; otherwise the multiplier is split into two parts and the number of 1's in each part is checked again. If the number of 1's in a part is more than three, control goes to Booth multiplication; otherwise it goes to the proposed multiplication technique. If the multiplier contains one 1, the operation in category A or B is executed, depending on its position. If the multiplier contains two 1's, the operation in category C or D is executed, depending on their positions.

Otherwise, if the number of 1's in the multiplier is three, the operation in category E or F is executed, depending on their positions.

Table 1. Hybrid encoding rule

| Number of 1's in the multiplier | Position of the 1's | Category | Operation |
| :---: | :---: | :---: | :---: |
| 1 | $1^{\text{st}}$ bit | A | Add 0 to multiplicand (M) |
| 1 | $i^{\text{th}}$ bit | B | Shift M left by $i-1$ and add 0 |
| 2 | $1^{\text{st}}$ and $i^{\text{th}}$ bit | C | Shift M left by $i-1$ and add M |
| 2 | $i^{\text{th}}$ and $(i+j)^{\text{th}}$ bit | D | Shift M left by $j$, add M and shift the result left by $i-1$ |
| 3 | $i^{\text{th}}$ ($i=1$), $j^{\text{th}}$ and $k^{\text{th}}$ bit | E | Shift M left by $k-j$, add M, shift the result left by $j-i$, add M and shift the result left by $i-1$ |
| 3 | $i^{\text{th}}$, $j^{\text{th}}$ and $k^{\text{th}}$ bit | F | Shift M left by $k-j$, add M, shift the result left by $j-i$, add M and shift the result left by $i-1$ |
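The encoding rule can be sketched behaviourally in Python as follows. This is a minimal sketch under stated assumptions, not the circuit itself: operands are treated as unsigned, bit positions are counted from 1 at the least significant bit, the function names are hypothetical, the Booth path is represented by a plain multiplication stand-in (`booth_placeholder`), and the way the results of the two multiplier parts are recombined (shift and add) is an assumption.

```python
def one_positions(x):
    """1-indexed positions of the 1 bits of x, least significant first."""
    return [i + 1 for i in range(x.bit_length()) if (x >> i) & 1]


def shift_add(m, multiplier):
    """Multiply m by a multiplier with at most three 1 bits.

    Starts at the most significant 1, then repeatedly shifts by the gap
    to the next lower 1 and adds M, and finally shifts by (lowest - 1).
    """
    pos = one_positions(multiplier)
    if not pos:
        return 0
    assert len(pos) <= 3, "shift_add handles at most three 1 bits"
    result = m
    for k in range(len(pos) - 1, 0, -1):
        result = (result << (pos[k] - pos[k - 1])) + m
    return result << (pos[0] - 1)


def booth_placeholder(m, part):
    """Stand-in for the Booth multiplier path (assumption, not Booth recoding)."""
    return m * part


def hybrid_multiply(m, multiplier, width=8):
    """Select between the shift-add path and the Booth stand-in."""
    if bin(multiplier).count("1") <= 3:
        return shift_add(m, multiplier)
    half = width // 2
    low = multiplier & ((1 << half) - 1)
    high = multiplier >> half
    total = 0
    for part, offset in ((low, 0), (high, half)):
        if bin(part).count("1") <= 3:
            total += shift_add(m, part) << offset
        else:
            total += booth_placeholder(m, part) << offset
    return total


print(hybrid_multiply(65, 34))
```

For a multiplier with 1's at positions 2 and 6, `shift_add` performs one shift by 4, one addition of M and a final shift by 1, which corresponds to category D with $i=2$ and $j=4$.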

### 3.1. Block Diagram of Proposed Hybrid Encoded Low Power Multiplier

The block diagram of the proposed hybrid encoded low power multiplier is shown in Fig 3.


Figure 3. Block diagram of proposed hybrid encoded low power multiplier

The operation of the proposed multiplier can be divided into hybrid encoding, multiplication and controlling. The proposed encoder works as per the method explained in Table 1. In the partial product compression, the partial products are added without carry propagation, and row bypassing is used when an entire row of the PP is zero. This is done by freezing the adder whenever this condition occurs, which is expected to reduce the switching activity and power consumption. In the final adder unit, a column bypassing provision is available to avoid unwanted addition operations. The detection logic circuit is used to detect the effective data range. If part of the input data has no impact on the final computed result, the data controlling circuit freezes that portion to avoid unnecessary switching transitions. A glue circuit can be used to control the carry and the sign extension unit, which manages the sign.
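Row bypassing can be illustrated with a short behavioural sketch. It models only the effect of skipping all-zero partial product rows on the number of adder activations; the carry-save compression, the column bypassing and the detection logic are not modelled, and the function name and the activation counter are assumptions made for illustration.

```python
def multiply_with_row_bypass(multiplicand, multiplier, width=8):
    """Shift-and-add multiplication that skips all-zero partial product rows.

    Returns the product and the number of rows for which the adder
    was actually used (hypothetical activity measure).
    """
    acc = 0
    adder_activations = 0
    for i in range(width):
        if (multiplier >> i) & 1 == 0:
            # The whole partial product row is zero: the adder is frozen.
            continue
        acc += multiplicand << i
        adder_activations += 1
    return acc, adder_activations


print(multiply_with_row_bypass(65, 34))
```

In this model the number of adder activations equals the number of 1's in the multiplier, so sparse multipliers benefit most from row bypassing.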

## 4. RESULTS AND DISCUSSIONS

The low power multiplier circuit, which is a part of the MAC unit, was simulated using the TANNER 12.6 EDA schematic editor. The output of the EDA editor is shown in Fig 4. The simulated adder is shown in an enlarged version for clarity.


Figure 4. Architecture of proposed adder

The power consumption of the proposed multiplier-adder circuit has been estimated with an example. For multiplying $65(41 \mathrm{H})$, which is a pixel value in the MAC, with another pixel value $34(22 \mathrm{H})$, the procedure shown in Fig 5 may be adopted.


Figure 5. Example for Hybrid encoded multiplication scheme

The above multiplication needs 8 partial products with normal multiplication and 4 partial products with Booth recoding, but only one partial product is enough with the proposed hybrid encoding method. Moreover, the proposed technique needs neither the 2's complement process nor the virtual 0 that is placed as the first bit in Booth recoding. Table 2 shows the power and delay analysis of different multipliers for the above multiplication. The simulation results have been taken for supply voltages ranging from 0.8 V to 2.4 V.
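As a worked check of this example, the arithmetic can be written out under the assumption that 34 (00100010) is taken as the multiplier and 65 as the multiplicand M; which operand plays the role of the multiplier is an assumption made here for illustration. The multiplier then has 1's at the 2nd and 6th bit positions, which corresponds to category D of Table 1 with $i=2$ and $j=4$:

$$
65 \times 34=\left(\left(65 \times 2^{4}\right)+65\right) \times 2^{1}=(1040+65) \times 2=2210
$$

Taking 65 (01000001) as the multiplier instead would give category C with $i=7$, that is $\left(34 \times 2^{6}\right)+34=2176+34=2210$.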

Table 2. Power and delay analysis of different multipliers

| Multiplier type | Parameter | VDD (volts) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | 0.8 | 1.2 | 1.6 | 2.0 | 2.4 |
| Conventional multiplier (8PP) | Power (mW) | 0.03 | 0.12 | 0.25 | 0.53 | 0.66 |
|  | Delay (ns) | 11.2 | 4.17 | 2.77 | 2.29 | 1.93 |
| Booth multiplier (4PP) | Power (mW) | 0.01 | 0.05 | 0.11 | 0.26 | 0.28 |
|  | Delay (ns) | 4.80 | 1.79 | 1.19 | 0.98 | 0.83 |
| Proposed multiplier (1PP) | Power (mW) | 0.01 | 0.02 | 0.04 | 0.08 | 0.09 |
|  | Delay (ns) | 1.60 | 0.59 | 0.39 | 0.33 | 0.28 |

The power consumption of the proposed multiplier is reduced by $87 \%$ and $26 \%$ compared with the conventional and Booth multipliers, respectively. The full adder cell, which is the important sub module of the proposed multiplier architecture, is designed according to the following equations, where $A$ and $B$ are the operand bits, $C_{in}$ is the carry input, $C_{out}$ is the carry output and $S$ is the sum.

$$
\begin{align*}
& C_{out}=A B+B C_{in}+C_{in} A  \tag{1}\\
& S=A^{\prime} B^{\prime} C_{in}+A^{\prime} B C_{in}^{\prime}+A B^{\prime} C_{in}^{\prime}+A B C_{in} \tag{2}
\end{align*}
$$
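Equations (1) and (2) can be checked exhaustively with a short Python sketch that compares the sum-of-products expressions against ordinary binary addition for all eight input combinations. The function name `full_adder` is hypothetical, and `c` denotes the carry input of the cell.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of a one-bit full adder from Equations (1) and (2)."""
    na, nb, nc = 1 - a, 1 - b, 1 - c  # inverted inputs
    carry = (a & b) | (b & c) | (c & a)
    s = (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)
    return s, carry


# Compare against ordinary addition for every input combination.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert 2 * carry + s == a + b + c
```

The check confirms that the two expressions together encode the two-bit result of adding three one-bit inputs.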

In the full adder, four inverters can be used to provide inverted inputs, and the sum and carry circuits are joined together. A pull-down nMOS transistor is connected near the carry output to provide an undistorted output. The output waveforms of the full adder without and with the pull-down transistor are shown in Fig 6 and Fig 7 respectively. Here $0.13 \mu \mathrm{~m}$ TSMC technology files were used for simulation in the TSPICE TANNER 12.6 EDA tool.


Figure 6. Output waveform of the full adder without pull down mechanism


Figure 7. Output waveform of the full adder with pull down mechanism

The various adder circuits have been simulated using the TSPICE TANNER 12.6 EDA tool for supply voltages ranging from 0.8 V to 2.4 V. The operating frequency is set at 100 MHz. The variation of power consumption with supply voltage is shown in Fig 8.


Figure 8. Power consumption comparison of different adders

It is seen from the figure that the 14T design consumes more power at supply voltages above 0.8 V. All other designs, namely C-CMOS, TGA, TFA, CPL, hybrid and the proposed adder, perform better over the supply voltage range from 0.8 V to 2.4 V. Even though TGA and TFA require fewer transistors, they need additional buffers at the output. These additional buffers increase the short circuit power and also the switching power because of the lower driving capability of these adders. The CPL adder consumes more power than the hybrid and C-CMOS adders due to its dual-rail structure and large number of internal nodes. It is found that the hybrid adder may exhibit a smaller power-delay product than C-CMOS, except at 0.8 V, because it is faster. Even though the transistor count of the proposed adder is higher than that of the 10T and 14T designs, the proposed adder cell consumes less power than the other designs, as shown in the comparison.

The MAC operation of the window without and with repeated pixel values consideration is shown in Fig 9 and Fig 10 respectively.

| 65 | 66 | 70 |
| :---: | :---: | :---: |
| 66 | 34 | 68 |
| 00 | 64 | 64 |

Figure 9. $3 * 3$ window without repeated pixel value consideration

| 65 | 66 | 70 |
| :---: | :---: | :---: |
| 66 | 34 | 68 |
| $\mathbf{00}$ | 64 | 64 |

Figure 10. $3 * 3$ window with repeated pixel value consideration

The MAC operation of the $3 * 3$ window without repeated pixel value consideration needs eight multiplications and eight additions. In the above window, however, the pixel values 66 and 64 are repeated and one pixel value is zero. With repeated pixel value consideration, the MAC operation on this window needs only six multiplications and seven additions, which saves two multiplications and one addition. The switching activity of the MAC with repeated pixel value consideration is reduced by $19 \%$ compared to the MAC without it. Tables 3 and 4 show the power consumption and delay of the MAC unit without and with repeated pixel value consideration, respectively.
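One counting rule that reproduces the six multiplications and seven additions quoted above is sketched below. The rule itself is an assumption made for illustration: a multiplication is counted once per distinct non-zero pixel value (repeated values reuse an earlier product), and an addition is counted for every non-zero term after the first (a zero pixel bypasses the adder). The function name `count_mac_operations` is hypothetical.

```python
def count_mac_operations(window):
    """Count multiplications and additions under the assumed reuse rule."""
    pixels = [p for row in window for p in row]
    nonzero = [p for p in pixels if p != 0]
    # One multiplication per distinct non-zero value.
    multiplications = len(set(nonzero))
    # One addition per non-zero term after the first.
    additions = max(len(nonzero) - 1, 0)
    return multiplications, additions


window = [
    [65, 66, 70],
    [66, 34, 68],
    [0, 64, 64],
]
print(count_mac_operations(window))
```

For the window above, the distinct non-zero values are 65, 66, 70, 34, 68 and 64, and there are eight non-zero terms to accumulate.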

Table 3. Power and delay analysis of MAC with different types of multiplier (without repeated pixel values consideration)

| Multiplier type | Parameter | VDD (volts) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | 0.8 | 1.2 | 1.6 | 2.0 | 2.4 |
| Conventional multiplier (8PP) | Power (mW) | 0.47 | 1.87 | 3.05 | 5.79 | 7.28 |
|  | Delay (ns) | 110 | 39.5 | 27.2 | 22.4 | 18.9 |
| Booth multiplier (4PP) | Power (mW) | 0.32 | 1.31 | 1.92 | 3.39 | 4.25 |
|  | Delay (ns) | 58.9 | 20.4 | 14.6 | 11.9 | 10.1 |
| Proposed multiplier (1PP) | Power (mW) | 0.25 | 1.03 | 1.35 | 2.19 | 2.74 |
|  | Delay (ns) | 33.3 | 10.9 | 8.26 | 6.70 | 5.67 |

Table 4. Power and delay analysis of MAC with different types of multiplier (with repeated pixel values consideration)

| Multiplier type | Parameter | VDD (volts) |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | 0.8 | 1.2 | 1.6 | 2.0 | 2.4 |
| Conventional multiplier (8PP) | Power (mW) | 0.35 | 1.39 |  |  |  |
|  | Delay (ns) | 73.9 | 26.2 | 18.3 | 15.02 | 5.05 |
| Booth multiplier (4PP) | Power (mW) | 0.26 | 1.04 | 1.47 | 2.52 | 3.15 |
|  | Delay (ns) | 41.9 | 14.3 | 10.4 | 8.49 | 7.16 |
| Proposed multiplier (1PP) | Power (mW) | 0.21 | 0.87 | 1.11 | 1.39 | 2.74 |
|  | Delay (ns) | 18.0 | 8.33 | 6.45 | 5.21 | 4.40 |

The simulation results in Table 3 show the power consumption and delay of the MAC unit with different multipliers when repeated pixel values are not considered. It can be observed that the MAC with the proposed multiplier consumes less power than the MAC with the other multipliers. Table 4 shows the power consumption and delay of the MAC operations when repeated pixel values are considered.

## 5. CONCLUSION

The performance of the hybrid encoded low power multiplier has been estimated and compared with existing multipliers. The developed unit has been tested for image processing systems by exploiting insignificant bits in pixel values and the similarity of neighbouring pixels in video streams. The switching activity and power consumption of the adder and multiplier units have been reduced. With repeated pixel value consideration, the switching activity and power consumption of the MAC are reduced by $19 \%$ and $46 \%$ respectively, compared to the MAC without repeated pixel value consideration.

## 6. REFERENCES

[1] Unsal, O., and Koren, I. 2003. System-level power-aware design techniques in real-time systems. In Proceedings of the IEEE, vol. 91, no. 7, pp. 1-15.
[2] Gandhi, K., and Mahapatra, N. 2005. Dynamically exploiting frequent operand values for energy efficiency in integer functional units. In Proceedings of 18th International conference on VLSI Design, pp. 570-575.
[3] Choi, J., Jeon, J., and Choi, K. 2000. Power minimization of functional units by partially guarded computation. In Proceeding of IEEE International Symposium Low Power Electron Devices, pp. 131-136.
[4] Chen, O., Sheen, R., and Wang, S. 2002. A low power adder operating on effective dynamic data ranges. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 10, no. 4, pp. 435-453.
[5] Chen, O., Wang, S., and Wu, Y.W. 2003. Minimization of switching activities of partial products for designing low-power multipliers. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 11, no. 3, pp. 418-433.
[6] Benini, L., Micheli, G.D., Macii, A., Macii, E., Poncino, M., and Scarsi, R. 2000. Glitching power minimization by selective gate freezing. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 8, no. 3, pp. 287-297.
[7] Henzler, S., Georgakos, G., Berthold, J., and Schmitt-Landsiedel, D. 2004. Fast power-efficient circuit-block switch off scheme. Electronics Letter, vol. 40, no. 2, pp. 103-104.
[8] Huang, Z., and Ercegovac, M.D. 2001. On signal-gating schemes for low power adders. Proceeding of 35th Asilomar Conference on Signal, Systems \& Computer, pp. 867-871.
[9] Huang, Z., and Ercegovac, M.D. 2005. High performance low power left-to-right array multiplier design. IEEE Transaction on Computer, vol. 54, no. 3, pp. 272-283.
[10] Wen, M.C., Wang, S.J., and Lin, Y.N. 2005. Low-power parallel multiplier with column bypassing. Electronics Letter, vol. 41, no. 12, pp. 581-583.
[11] Chen, K.H., and Chu, Y.S. 2007. A low power multiplier with spurious power suppression technique. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 15, no. 7, pp. 846-850.
[12] Ko, U., Balsara, P., and Lee, W. 1995. Low-power design techniques for high-performance CMOS adders. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 3, no. 2, pp. 327-333.
[13] Song, P.J., and De Micheli, G. 1991. Circuit and architecture trade-offs for high-speed multiplication. IEEE Journal on Solid-State Circuits, vol. 26, no. 9, pp. 1184-1198.
[14] Shams, A., Darwish, T., and Bayoumi, M. 2002. Performance analysis of low power 1-bit CMOS full adder cells. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 10, no. 1, pp. 20-29.
[15] Chandrakasan, A. P., and Brodersen, R.W. 1995. Low Power Digital CMOS Design. Norwell, MA: Kluwer.
[16] Zhuang, N., and Hu, H. 1992. A new design of the CMOS full adder. IEEE Journal on Solid-State Circuits, vol. 27, no. 5, pp. 840-844.
[17] Weste, N., and Eshraghian, K. 1993. Principles of CMOS VLSI Design, A System Perspective. Reading, MA: Addison-Wesley.
[18] Vesterbacka, M. 1999. A 14-transistor CMOS full adder with full voltage swing nodes. Proceeding of IEEE workshop on Signal Processing Systems, pp. 713-722.
[19] Bui, H.T., Wang, Y., and Jiang, Y. 2002. Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates. IEEE Transaction on Circuits \& System II, Analog Digital Signal Processing, vol. 49, no. 1, pp. 25-30.
[20] Chang, C.H., Gu, J., and Zhang, M. 2005. A review of $0.18-\mu \mathrm{m}$ full adder performances for tree structured arithmetic circuits. IEEE Transaction on Very Large Scale Integration (VLSI) System, vol. 13, no. 6, pp. 686-695.

