Design and Synthesis of High Performance Vedic DSP Processor

Anuradha Savadi
Pooja Doddappa Appa College of Engineering, Kalaburgi, Karnataka, India

Raju Yanamshetti
Pooja Doddappa Appa College of Engineering, Kalaburgi, Karnataka, India

Jyoti Godihal
Appa Institute of Engineering and Technology, Kalaburgi, Karnataka, India

ABSTRACT
To meet the demand for high-speed signal processing, the design of high-performance DSP processors has become important. This paper presents a novel design and FPGA-based implementation of a 64-bit DSP processor. The proposed design employs a multistage pipeline architecture and Vedic algorithms to improve speed. The processor provides multiple application-specific instructions (ASIP). The design is described in Verilog HDL and validated through extensive simulation. Synthesis results and a performance analysis of each system component confirm a significant performance improvement of the proposed DSP processor over existing ones.

General Terms
Urdhava Tiryagbhyam multiplier, Nikhilam sutra, Paravarty sutra, Kogge-Stone adder, et al.

Keywords
DSP Processor, Pipelining, Vedic mathematics, ASIP.

1. INTRODUCTION
With the spread of PCs, smartphones, gaming, and other interactive media devices, the demand for DSP processors is steadily increasing. The world of intelligent devices is moving from analog to DSP-based systems to support fast processing. This paper describes the design, synthesis, and implementation of a 64-bit fixed/floating-point DSP processor with Harvard architecture and a reduced instruction set, as shown in Figure 1. Multistage pipelining increases speed by reducing propagation delay and improving the CPI (cycles per instruction). Vedic algorithms are applied in the major blocks of the processor, such as the ALU, MAC, and filters, to improve processing speed. The operation of the Urdhava Tiryagbhyam multiplier is shown in Figures 2 and 3. Several existing DSP processor designs are discussed below.

A 16-bit fixed-point DSP processor has been designed with RISC and Harvard architecture, including a three-stage pipelined program counter [1]. A CSA multiplier is used to boost speed at the cost of increased area, and all instructions execute in a single clock cycle. The work in [4] describes a DSP processor with a parallel multiplier for single-cycle MAC and a maximum frequency of 106 MHz; application algorithms such as ADPCM and MP3 decoding are simulated. A further advance is the 32-bit RISC-core DSP processor with single-cycle MAC in [2]. A radix-4 multiplier algorithm is used for signed and unsigned numbers, and Booth's algorithm is used for the selected encoding and decoding section to increase performance. A 32-bit RISC DSP processor with a two-stage pipeline and a dedicated single-cycle MAC is designed in [5] to improve performance with hazard-free data processing. To increase speed, a parallel adder and a carry look-ahead adder are used, and an FSM is designed for hazard handling.
2. PROPOSED WORK

The proposed DSP processor has several features that allow it to carry out many signal processing algorithms in real time. The processor is designed with Harvard and RISC architecture and multiple parallel I/O buses. The highest priority of the proposed design is speed, without compromising area or power consumption. The architecture supports both fixed-point and floating-point arithmetic according to the IEEE 754 standard.

2.1 ALU

The ALU is an essential component of the DSP processor. It performs arithmetic and logical operations and sets output flags such as overflow, parity, zero, and sign. A parallel adder/subtractor is designed to improve performance, as shown in the figure; for addition, a Kogge-Stone adder is used, which has a very small propagation delay. Urdhava Tiryagbhyam is used for multiplication and is compatible even with floating-point numbers. For division, the Vedic algorithms known as the Nikhilam sutra and the Paravarty sutra are used, which improve processing speed considerably, as shown in Figure 4. DSP operations such as convolution and deconvolution can be built using this Vedic divider. The Paravarty sutra theorem states that if the polynomial

\[(ax^n + bx^{n-1} + cx^{n-2} + dx^{n-3})\]

is divided by \((x-y)\), then the remainder is:

\[R = ay^n + by^{n-1} + cy^{n-2} + dy^{n-3} \tag{1}\]
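
The remainder property above can be checked numerically. The following is a minimal Python sketch, not the authors' Verilog divider: it divides a polynomial by \((x-y)\) with synthetic division and compares the remainder against direct evaluation at \(x=y\). The function names `divide_by_linear` and `evaluate`, and the example coefficients, are illustrative assumptions.

```python
def divide_by_linear(coeffs, y):
    """Divide a polynomial by (x - y) using synthetic division.

    coeffs lists the coefficients from the highest power down to the constant.
    Returns (quotient_coeffs, remainder).
    """
    acc = 0
    partials = []
    for c in coeffs:
        acc = acc * y + c          # carry the running value into the next column
        partials.append(acc)
    return partials[:-1], partials[-1]


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc


# Hypothetical example: p(x) = 2x^3 + 3x^2 - 5x + 7 divided by (x - 2)
quotient, remainder = divide_by_linear([2, 3, -5, 7], 2)
print(quotient, remainder)            # [2, 7, 9] 25
print(evaluate([2, 3, -5, 7], 2))     # 25, equal to the remainder
```

The remainder equals the polynomial evaluated at \(y\), which is what lets a divider of this kind obtain the remainder without a full long division.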

![Figure 4: flow diagram of binary division operation using nikhilam sutra](image)
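
The short propagation delay of the Kogge-Stone adder used in this section comes from computing all carries with a parallel-prefix network of about \(\log_2 N\) levels. The following Python sketch is a behavioral model of that idea under our own assumptions (unsigned operands, a hypothetical function name `kogge_stone_add`, and a default width of 64 bits); it is not taken from the paper's HDL.

```python
def kogge_stone_add(a, b, width=64):
    """Add two unsigned integers with a Kogge-Stone parallel-prefix carry network.

    Returns (sum modulo 2**width, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # generate: this bit position creates a carry
    p = a ^ b            # propagate: this bit position passes a carry along
    half_sum = p         # sum bits before carries are applied
    dist = 1
    while dist < width:
        # Merge every position with the group that ends 'dist' bits below it.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1
    carries = (g << 1) & mask            # carry into bit i comes from bits 0..i-1
    carry_out = (g >> (width - 1)) & 1
    return half_sum ^ carries, carry_out


print(kogge_stone_add(200, 100, width=8))   # (44, 1), since 300 = 256 + 44
```

Each pass of the loop corresponds to one level of prefix cells, so a 64-bit addition needs six levels rather than a 64-bit ripple chain.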

2.2 Shifter

The proposed processor includes a barrel shifter, which takes an N-bit input value and outputs the N-bit value shifted left or right by P bits, as shown in the figure, which enhances processing speed. It is used in pre- and/or post-scaling, normalization and de-normalization, and block floating-point operations.
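
A barrel shifter is commonly built from \(\log_2 N\) multiplexer stages, where stage \(i\) shifts by \(2^i\) bits when bit \(i\) of the shift amount is set. The sketch below models that structure in Python as an illustration only; the function name `barrel_shift`, the logical (zero-fill) shift behavior, and the wrapping of the shift amount to the range 0 to N-1 are assumptions, not details from the paper.

```python
def barrel_shift(value, amount, width=64, left=True):
    """Logically shift an unsigned width-bit value using log2(width) mux stages."""
    mask = (1 << width) - 1
    value &= mask
    amount %= width                    # assumption: shift amount wraps to 0..width-1
    stage = 0
    while (1 << stage) < width:
        if (amount >> stage) & 1:      # select bit for this stage
            step = 1 << stage
            value = (value << step) & mask if left else value >> step
        stage += 1
    return value


print(barrel_shift(0b1011, 3, width=8))            # 88, i.e. 0b01011000
print(barrel_shift(0x80, 7, width=8, left=False))  # 1
```

Because every stage is a fixed-distance shift selected by one control bit, the result is available after a fixed number of stages regardless of the shift amount P.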

2.3 MAC

The MAC (multiply, add, and accumulate) unit is one of the important blocks of the DSP processor. It consists of a multiplier, an adder, and an accumulator, as shown in Figure 5. To increase processing speed, a carry-skip adder and the Urdhava Tiryagbhyam multiplier are used to implement the MAC. The MAC is useful in filters, Fourier analyzers, and other signal processing tasks.

![Figure 5 Vedic MAC](image)
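
To illustrate how a Vedic MAC operates, the Python sketch below models Urdhava Tiryagbhyam ("vertically and crosswise") multiplication at the bit level and uses it inside a multiply-accumulate loop. This is a behavioral approximation under our own assumptions: operands are unsigned, the names `urdhva_multiply` and `vedic_mac` are hypothetical, and the accumulation uses Python's built-in addition rather than the carry-skip adder of the proposed design.

```python
def urdhva_multiply(a, b, width=8):
    """Multiply two unsigned width-bit integers column by column."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result = 0
    carry = 0
    for col in range(2 * width - 1):
        # Crosswise sum: every bit pair whose positions add up to this column.
        column_sum = carry
        for i in range(max(0, col - width + 1), min(col, width - 1) + 1):
            column_sum += a_bits[i] & b_bits[col - i]
        result |= (column_sum & 1) << col   # keep one bit in this column
        carry = column_sum >> 1             # pass the rest to the next column
    result |= carry << (2 * width - 1)      # final carry is the top product bit
    return result


def vedic_mac(samples, coeffs, width=8):
    """Accumulate the products of paired samples and coefficients."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += urdhva_multiply(x, h, width)  # plain addition stands in for the adder
    return acc


print(urdhva_multiply(13, 11))            # 143
print(vedic_mac([1, 2, 3], [4, 5, 6]))    # 32
```

All partial products of a column depend only on the input bits, so in hardware they can be generated in parallel, which is the usual argument for the speed of this multiplier.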

2.4 Memory

The proposed processor uses a Harvard architecture, which has separate data and instruction memories. A tri-state buffer is used at the data memory output to prevent data interruption. 64-bit data and address lines are defined to support the four-stage pipeline architecture. The high-speed memory requirement is met by pipelining the data memory, which is dual-ported, allowing independent and concurrent access by the DSP and the I/O controller. A high-speed program memory is also designed, because it limits the system clock period, which is the sum of the sequencing delay and the memory access time.

2.5 Filters

Urdhava Tiryagbhyam is used to design FIR and IIR filters, namely a band-pass FIR filter, Chebyshev I and II filters, Butterworth filters, and an elliptic filter. The FIR filter output sequence is calculated by the linear convolution of the input sequence \(x(n)\) and the filter coefficients \(h(n)\), i.e.

\[y(n) = \sum_{k=0}^{m-1} h(k)\,x(n-k) \tag{2}\]

where \(m\) is the filter length and \(k\) is the summation index.

If the lengths of \(x(n)\) and \(h(n)\) do not match, they are made equal by zero padding. An FIR filter based on the Vedic method consumes less area and computation time than the conventional one [18].
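
The FIR computation described above, linear convolution of \(x(n)\) and \(h(n)\) with zero padding to equal length, can be sketched as follows. This is a plain Python illustration with a hypothetical function name `fir_filter`; it uses ordinary multiplication in place of the Vedic multiplier and is not the paper's implementation.

```python
def fir_filter(x, h):
    """Linear convolution of x and h after zero-padding both to the same length."""
    m = max(len(x), len(h))
    x = list(x) + [0] * (m - len(x))     # zero-pad the shorter sequence
    h = list(h) + [0] * (m - len(h))
    y = []
    for n in range(2 * m - 1):
        acc = 0
        for k in range(m):
            if 0 <= n - k < m:           # skip taps that fall outside the input
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


print(fir_filter([1, 2, 3], [1, 1]))     # [1, 3, 5, 3, 0]
```

The trailing zero in the output is a consequence of padding \(h(n)\) to the length of \(x(n)\); each inner-loop step is one multiply-accumulate, which is where a MAC unit is used.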

The same Vedic method is used to implement IIR filters, with parallel computation to increase speed. The IIR filter output is given by

\[y(n) = \sum_{k=0}^{N} b(k)\,x(n-k) - \sum_{k=1}^{N} a(k)\,y(n-k) \tag{3}\]

where the output \(y(n)\) is a weighted sum of the present input \(x(n)\), the past inputs \(x(n-k)\) for \(k=1,2,\ldots,N\), and the past outputs \(y(n-k)\) for \(k=1,2,\ldots,N\). The delay blocks represent a form of storage and delay [16,17]. The number of delay blocks is easily seen to be \(N+N\) for this particular case.
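
The IIR difference equation can be evaluated sample by sample, keeping the past inputs and outputs as the stored delay elements. The Python sketch below is a direct, unoptimized illustration; the function name `iir_filter`, the convention that `a[0]` is an unused placeholder, and the example coefficients are assumptions and do not describe the parallel hardware structure of the proposed design.

```python
def iir_filter(b, a, x):
    """Evaluate y[n] = sum_{k=0..N} b[k] x[n-k] - sum_{k=1..N} a[k] y[n-k].

    a[0] is a placeholder so that list indices match the equation.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):           # feed-forward part: present and past inputs
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for k in range(1, len(a)):        # feedback part: past outputs only
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# Hypothetical first-order example: impulse response of y[n] = x[n] + 0.5 y[n-1]
print(iir_filter([1.0], [1.0, -0.5], [1, 0, 0, 0]))   # [1.0, 0.5, 0.25, 0.125]
```

Samples before \(n=0\) are treated as zero, which corresponds to the delay blocks starting in a cleared state.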

2.6 Pipelining and parallelism
Pipelining enables efficient use of on-chip resources and allows multiple operations to overlap at a faster clock rate, which is useful for DSPs, where the input data is a continuous signal. Hence the proposed processor is implemented with a four-stage pipelined architecture. To meet the requirement of high-speed processing, parallelism is introduced, which removes slow start-ups and terminations.
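
The benefit of a four-stage pipeline can be illustrated with the standard ideal-case cycle count: once the pipeline is full, one instruction completes per cycle. The Python sketch below assumes a hazard-free pipeline in which every stage takes one clock cycle; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def pipelined_cycles(num_instructions, stages=4):
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return stages + num_instructions - 1


def unpipelined_cycles(num_instructions, stages=4):
    """Cycles when every instruction passes through all stages before the next starts."""
    return stages * num_instructions


n = 100
print(pipelined_cycles(n), unpipelined_cycles(n))   # 103 400
```

Under these assumptions the cycles per instruction approach 1 as the instruction count grows; stalls and hazards in a real design raise this figure.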

3. PERFORMANCE ANALYSIS AND DISCUSSIONS
The performance of the proposed DSP processor is demonstrated by comparison with the existing designs [5] and [14]. The throughput of the proposed design is 229.8 MB/s, which is better than that of the existing designs. Higher throughput improves the overall performance of the processor. The four-stage pipeline design reduces the number of cycles needed to complete execution.

Applying low-power techniques such as clock gating and glitch reduction reduces power consumption, as shown in Figures 6 and 7. Area is minimized by using tree structures in the multipliers and adders, as they are the basic building blocks of the processor. Miniaturization is applied at every stage of the architecture. Acceptable results are obtained, as shown in Figure 7.

The timing specification of each building block is obtained from the synthesis report, as shown in Table 1, from which the worst-case delay of each sub-block is found. Introducing a four-stage pipeline and Vedic mathematics for the multipliers, ALU, and MAC improves the performance of the proposed processor appreciably compared with other processors; the filters do not need internal pipelining. The maximum operating frequency of the processor is 212.76 MHz, which is better than that of the existing designs, as shown in Table 2. Simulation and synthesis results of a few sub-modules are shown in Figures 8 to 15.
Figure 10: RTL schematic of FIR Filter

Figure 11: Simulation output of Chebyshev I filter

Figure 12: IIR filter RTL schematic

Figure 13: RTL schematic of Vedic MAC
4. CONCLUSION
In the proposed design, a DSP processor with RISC and a four-stage pipeline architecture is implemented for performance optimization. The highest priority of the design is high-speed processing and better throughput with reduced area and power consumption. The design is described in Verilog, simulated and synthesized in Xilinx 14.7, and then implemented on a 5VF70tf1136-1 device, where it was verified to work properly. The improved performance of this processor is assessed by comparing its throughput with that of [5]. The comparison shows that a better throughput of 229.8 MB/s can be achieved with the proposed design. A maximum delay of 0.47 ns is achieved, which is better than that of the existing design. In future, the proposed processor can be used in applications where speed is the highest priority. The designed DSP processor can also serve as an application-specific DSP processor for applications such as audio and video codecs.

### Table 1: comparison of propagation delay of different components

<table>
<thead>
<tr>
<th>Component</th>
<th>Proposed processor (64-bit)</th>
<th>[5], 32-bit</th>
<th>[14], 32-bit</th>
</tr>
</thead>
<tbody>
<tr>
<td>Program counter</td>
<td>1.1</td>
<td>1.125</td>
<td>1.35</td>
</tr>
</tbody>
</table>

### Table 2: comparison of maximum frequency of the proposed processor with existing designs

<table>
<thead>
<tr>
<th>Processor</th>
<th>Maximum frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>[5]</td>
<td>13 MHz</td>
</tr>
<tr>
<td>[4]</td>
<td>106 MHz</td>
</tr>
<tr>
<td>[1]</td>
<td>100 MHz</td>
</tr>
<tr>
<td>Proposed design</td>
<td>212 MHz</td>
</tr>
</tbody>
</table>

5. ACKNOWLEDGMENTS
We express our sincere gratitude and deep regard to Poojya Dr. Sharnbaswappa Appaji, President, Sharanabasveshwar Vidyadharma Sangha, Kalaburagi, for his immense support and encouragement. We offer special thanks to all the experts of PDA College of Engineering, and our sincere gratitude to Dr. V’D Mytri, Principal, APPA IET, and Dr. Anilkumar Bidve, Dean of Administration, APPA IET, for their invaluable suggestions and support.

6. REFERENCES


