## **Design of Nova Decoder for H.265/HEVC**

Anuradha Savadi, Dept. of ECE, PDA College of Engineering, Kalaburagi, Karnataka, India

Raju Yanamshetti, Dept. of ECE, PDA College of Engineering, Kalaburagi, Karnataka, India

Sherqua Asma, Dept. of ECE, Appa Institute of Engineering and Technology, Kalaburagi, Karnataka, India

#### ABSTRACT

This paper presents the design of the Nova decoder for the latest video coding standard, H.265/HEVC. Power optimization is the main priority of the proposed decoder and is applied at several levels of the system: the algorithm, architecture, circuit and physical levels. The proposed design decodes QCIF video at 30 fps at its maximum operating frequency with an optimized power supply, and achieves an 80% reduction in area in UMC 180 nm technology. Video quality and low power dissipation are the highest priorities for today's portable devices, and the proposed design is intended to meet both requirements.

#### **General Terms**

Low power video codec, bit stream parser, deblocking filter engine, interpretation, intra prediction, etc.

#### **Keywords**

H.265/HEVC, Baseline Decoder, CAVLC / CABAC.

#### **1. INTRODUCTION**

H.264 is widespread and broadly adopted in applications such as video transmission on mobile terminals. Because such devices are battery powered, video must be decoded at low power. Modern VLSI design is carried out at discrete levels of abstraction, from highest to lowest: the algorithm, architecture, circuit and physical levels. Theoretical results indicate that optimization at higher design levels generally achieves larger power savings, while lower-level techniques provide additional reductions that cannot be obtained at the higher levels.

The H.265/HEVC video coding standard performs better than earlier standards such as MPEG-2, H.263 and MPEG-4. H.265 provides nearly the same video quality at about half the encoded bit rate. This improvement comes at the cost of considerably higher computational complexity. The standard defines the latest coding tools, including variable block sizes, a deblocking filter and CABAC/CAVLC entropy coding (context-adaptive binary arithmetic coding / context-adaptive variable-length coding).

A systematic low-power design methodology for video decoding is proposed and applied to the H.265/HEVC baseline decoder. At the algorithm level, the computational complexity is analyzed. At the architecture level, pipelining and parallelism are used extensively to reduce the operating frequency. A hierarchical memory organization moves most data accesses from large memories to smaller ones.

#### 2. SYSTEM OVERVIEW

H.265/HEVC is built on the same basic ideas as earlier standards such as H.264 and MPEG-2, but it adds many incremental improvements, including:

- More flexible partitioning, from large to small partition sizes.
- Greater flexibility in prediction types and in transform block sizes.
- More refined interpolation and deblocking filters.
- Features that support efficient parallel processing.

A compressed video sequence must conform to the format defined in the standard, so that it can be decoded with the process the standard defines. HEVC video sequences may be stored in media files, streamed over the internet, transmitted by broadcast, etc.

CODING TREE UNIT (CTU) AND CODING TREE BLOCK (CTB): Partitioning in the new standard differs from H.264. It follows a tree structure whose root is the coding tree unit (CTU), which consists of a luma CTB, the corresponding chroma CTBs and the associated syntax elements.
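The tree-structured partitioning described above can be illustrated with a minimal sketch. It assumes a 64x64 luma CTB and a minimum block size of 8x8; the `should_split` callback is a hypothetical stand-in for the split decision that a real decoder would read from the bitstream, and the function names are illustrative only.

```python
def split_ctb(x, y, size, should_split, min_size=8):
    """Recursively partition a square block; return leaf blocks as (x, y, size)."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants in z-order: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves += split_ctb(x + dx, y + dy, half, should_split, min_size)
        return leaves
    return [(x, y, size)]


def split_top_left(x, y, size):
    """Hypothetical split rule: keep splitting only the top-left block."""
    return x == 0 and y == 0


blocks = split_ctb(0, 0, 64, split_top_left)
for block in blocks:
    print(block)
```

Whatever split rule is used, the leaf blocks always tile the CTB exactly, which is the property a decoder relies on when it walks the tree.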

PREDICTION UNIT AND PREDICTION BLOCK: The decision whether to code a picture area using inter-picture or intra-picture prediction is made at the coding unit (CU) level. The prediction unit (PU) partitioning structure has its root at the CU level. Depending on the basic prediction type, the luma and chroma coding blocks can be further split in size and predicted from luma and chroma prediction blocks (PBs). HEVC supports PB sizes from $64 \times 64$ down to $4 \times 4$ samples.

TRANSFORM UNIT (TU) AND TRANSFORM BLOCK (TB): The prediction residual is the input to the transform. A luma CB may be used directly as the luma TB or be further split into smaller TBs. Integer transforms based on the DCT (discrete cosine transform) are used, while the residual of $4 \times 4$ luma intra prediction is transformed with a transform derived from the DST (discrete sine transform).
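As an illustration of the two transform types, the following sketch applies the 4-point integer DCT and DST basis matrices used by HEVC to a one-dimensional residual. It is a simplification: the scaling and rounding shifts and the separable two-dimensional application are omitted, and the function name `transform_1d` is an assumption made for this example.

```python
# 4-point integer approximations of the DCT and DST basis functions.
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]
DST4 = [
    [29, 55, 74, 84],
    [74, 74, 0, -74],
    [84, -29, -74, 55],
    [55, -84, 74, -29],
]


def transform_1d(matrix, samples):
    """Multiply a 4x4 basis matrix by a 4-sample residual vector."""
    return [sum(c * s for c, s in zip(row, samples)) for row in matrix]


flat_residual = [1, 1, 1, 1]
print(transform_1d(DCT4, flat_residual))  # energy only in the first coefficient
print(transform_1d(DST4, flat_residual))
```

A flat residual is compacted into a single DCT coefficient, whereas the first DST basis function rises away from the block edge, which matches intra residuals that tend to grow with distance from the reference samples.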

MOTION VECTOR SIGNALING: Advanced motion vector prediction (AMVP) is used, in which several most probable candidates are derived from the data of adjacent PBs and of the reference picture.



Fig [2.1] Typical HEVC/H.265 video codec (decoder elements shaded in light gray)

A merge mode for motion vector coding can also be used, which allows motion vectors to be inherited from spatially or temporally adjacent PBs.

MOTION COMPENSATION: Quarter-sample precision is used for the MVs, and seven-tap or eight-tap filters are used to interpolate fractional-sample positions. One or two motion vectors can be transmitted for each prediction block, resulting in uni-predictive or bi-predictive coding, respectively. As in H.264, a scaling and offset operation, known as weighted prediction, may be applied to the prediction signals.
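The fractional-sample interpolation can be sketched in one dimension as follows. The tap values are the HEVC luma filter coefficients for the half-sample (eight-tap) and quarter-sample (seven-tap) positions; the function name, the border clamping and the direct rounding to 8-bit output are simplifying assumptions, since a real decoder filters in two dimensions with higher intermediate precision.

```python
HALF_PEL_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]  # eight taps, sum = 64
QUARTER_PEL_TAPS = [-1, 4, -10, 58, 17, -5, 1]    # seven taps, sum = 64


def interpolate(samples, pos, taps):
    """Interpolate a fractional sample lying between samples[pos] and samples[pos + 1]."""
    start = pos - 3  # both filters start three samples to the left of pos
    acc = 0
    for k, tap in enumerate(taps):
        # Clamp the index so that border samples are repeated.
        idx = min(max(start + k, 0), len(samples) - 1)
        acc += tap * samples[idx]
    value = (acc + 32) >> 6  # divide by 64 with rounding
    return min(max(value, 0), 255)  # clip to the 8-bit sample range


row = [10 * i for i in range(16)]  # a simple ramp: 0, 10, 20, ...
print(interpolate(row, 7, HALF_PEL_TAPS))
print(interpolate(row, 7, QUARTER_PEL_TAPS))
```

Because both tap sets sum to 64, a constant input row is reproduced unchanged, which is a quick sanity check for any implementation of these filters.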

INTRA AND INTER PICTURE PREDICTION: In regions where inter-picture prediction does not perform well, the decoded boundary samples of adjacent blocks are used as reference data for spatial prediction. Intra-picture prediction supports 33 directional modes (compared with eight directional modes in H.264), plus planar and DC prediction modes. The selected intra-picture prediction mode is coded by deriving most probable modes from previously decoded neighbouring prediction blocks. Inter-picture prediction follows the same principle as in H.264 video coding, and supports more partition shapes than intra prediction. To better reconstruct the original signal, a nonlinear amplitude mapping is applied within the inter-picture prediction loop.

#### 3. RELATED WORK

In [1], a baseline decoder is designed using five techniques: clock gating; parallel luma and chroma processing; a hybrid pipeline architecture at the 4x4 block and 16x16 block levels, with a self-adaptive pipeline for the prediction unit; a hierarchical memory organization that reduces memory accesses; and optimization of the coded block pattern based on its statistical distribution.

In [2], a priority-based, data-driven heading-one detector is designed to reduce the power consumption of CAVLC entropy decoding.

In [3], a parallel CAVLC algorithm is used to increase the throughput of the encoding blocks at high bit rates. A dual-block pipeline architecture is incorporated to increase speed, and a dual-buffer architecture is proposed for memory-efficient operation.

In [4], a CABAC decoder is implemented for the new video coding standard, H.265/HEVC. A flexible pipelined decoder architecture is proposed to absorb fluctuations in processing time. External DRAM accesses are reduced by a cache architecture (UCLS technique). A single clock cycle was found to be sufficient for processing each pixel.

#### 4. PROPOSED ARCHITECTURE FOR NOVA DECODER

#### 4.1 Function Blocks

The architecture of the Nova decoder is divided into two main parts: (1) the bit stream parser and (2) the data reconstruction path.

#### 4.1.1 Bit Stream Parser

The encoded bitstream stored in external memory is passed to a circular bitstream buffer, from which it is fetched by the bit stream parser. Two decoders are used to handle the input code words, named the CAVLC decoder and the variable-length decoder. Power is optimized by using a priority-based heading-one detector. The control FSM is hierarchically decomposed into small sub-FSMs with state merging, which reduces the number of states and hence unwanted switching activity. The bit stream parser uses many LUTs, which are divided separately between the inter and intra prediction units.
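The idea behind a priority-based heading-one detector can be sketched in software. The sketch below is not the circuit of [2]; it assumes a 16-bit input window split into four nibbles that are examined from the most significant downwards, so that lower groups are never inspected once a one has been found. The second function shows how a leading-zero count drives the decoding of an unsigned Exp-Golomb code word, one kind of variable-length code used in H.264 and HEVC bitstreams. Both function names are illustrative.

```python
def leading_zeros_16(word):
    """Count the leading zeros of a 16-bit word, highest-priority nibble first."""
    word &= 0xFFFF
    for group in range(4):  # group 0 is the most significant nibble
        nibble = (word >> (12 - 4 * group)) & 0xF
        if nibble:  # first non-zero group wins; later groups are skipped
            return 4 * group + (4 - nibble.bit_length())
    return 16


def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code from a string of '0'/'1' characters.

    Returns the decoded value and the position of the next unread bit.
    """
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1  # prefix zeros, the marker one, then `zeros` suffix bits
    value = int(bits[pos + zeros:end], 2) - 1
    return value, end


print(leading_zeros_16(0x0100))
print(decode_ue("00101"))
```

In hardware, skipping the lower-priority groups means their logic does not toggle, which is where the power saving of the priority-based scheme comes from.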

International Journal of Computer Applications (0975 – 8887) Volume 179 – No.11, January 2018

#### 4.1.2 Data Reconstruction Path

INTRA PRED ENGINE: Intra prediction uses several low-power techniques, such as a self-adaptive pipeline, a memory hierarchy, MFPF Seed and Two-Level.

INTER PRED ENGINE: The inter prediction engine is designed specifically for low power as well as throughput. Pipelining and parallel processing are used to meet the real-time decoding requirement at low power, and memory access reduction is applied throughout the design of the engine. The self-adaptive inter prediction pipeline is divided into two stages: reference data fetch and interpolation. As in intra prediction, different motion vectors require different amounts of memory access and pixel computation, and the self-adaptive pipeline is able to identify all possible combinations.

DEBLOCKING FILTER: Deblocking filters were also used in earlier video codecs such as MPEG-2/H.263 and H.264/AVC. In comparison, the H.265/HEVC deblocking filter is more complicated because of its more adaptive filtering procedures and its smaller partition sizes. The deblocking filter is designed as a five-stage pipeline. Filtering one 16x16 MB requires only 204 cycles, without using costly two-port or dual-port SRAM. Because the task is distributed among five nearly balanced stages, the time required for a single filtering operation is reduced. Each pipeline stage is individually clock-gated, which reduces clock power.
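The 204-cycle figure can be related to the QCIF 30 fps target with a short calculation. The sketch assumes that a QCIF frame is 176x144 samples, that every 16x16 MB is filtered, and that the filter never stalls; the function name is illustrative.

```python
CYCLES_PER_MB = 204                # cycles to filter one 16x16 MB (from the text)
WIDTH, HEIGHT, FPS = 176, 144, 30  # QCIF at 30 frames per second
MB_SIZE = 16


def deblocking_clock_hz(width, height, fps, cycles_per_mb=CYCLES_PER_MB):
    """Minimum clock rate needed for the deblocking filter alone."""
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    return mbs_per_frame * fps * cycles_per_mb


required_hz = deblocking_clock_hz(WIDTH, HEIGHT, FPS)
print(f"{required_hz / 1e6:.3f} MHz")
```

Under these assumptions the filter needs roughly 0.6 MHz, which fits within the 1.5 MHz operating frequency listed for the proposed design in Table [5.2].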

#### 5. PERFORMANCE ANALYSIS

To obtain the simulation result of the bit stream buffer, reset\_n must be held high and bit stream\_ram\_en must be high at every positive clock edge. An address such as 0003B is selected, as shown in Fig [5.1], and a 16-bit data word such as c721 is sent to that assigned address. Ext\_frame\_ram\_cs\_n and Ext\_frame\_ram\_wr must always be complementary to each other, as shown in Fig [5.1]. The circular bit stream buffer holds 128 bits in total. With the inputs set as described, it selects the addressed data and passes it on to the reconstruction path; this procedure is repeated for all 128 bits.
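The behaviour of the circular buffer can be modelled with a small software sketch. It assumes that the 128 bits are organized as eight 16-bit words, which is consistent with the 16-bit data word mentioned above but is not stated explicitly; the class and method names are hypothetical and do not correspond to signal names in the design.

```python
class CircularBitstreamBuffer:
    """Software model of a 128-bit circular buffer (assumed 8 words x 16 bits)."""

    WORD_BITS = 16
    DEPTH = 8  # 8 x 16 = 128 bits

    def __init__(self):
        self.words = [0] * self.DEPTH
        self.write_idx = 0
        self.read_idx = 0
        self.count = 0

    def write(self, word):
        """Store one 16-bit word; return False if the buffer is full."""
        if self.count == self.DEPTH:
            return False
        self.words[self.write_idx] = word & 0xFFFF
        self.write_idx = (self.write_idx + 1) % self.DEPTH  # wrap around
        self.count += 1
        return True

    def read(self):
        """Return the oldest word, or None if the buffer is empty."""
        if self.count == 0:
            return None
        word = self.words[self.read_idx]
        self.read_idx = (self.read_idx + 1) % self.DEPTH  # wrap around
        self.count -= 1
        return word


buf = CircularBitstreamBuffer()
buf.write(0xC721)
print(hex(buf.read()))
```

The read and write indices wrap around independently, so the parser can consume words while new data from external memory is still being written.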



Fig [5.1] : Simulation result of circular bit stream buffer.

Fig [5.2] shows the analog simulation result of the overall HEVC decoder. To observe the analog output, reset\_n and the clock signal must be high, pin\_disable must be held low, and freq\_ctrl0 and freq\_ctrl1 must be complementary to each other. The result is computed first with the bit stream buffer in state 1 (st1) and then in state 0 (st0). With all parameters set as described, the expected simulation output of the HEVC decoder is obtained.




Fig [5.2]: Analog simulation result of HEVC decoder.

Reducing power consumption was also one of the goals of the H.265 (HEVC) design. As shown in Table [5.1], the dynamic power is 0.089 W and the quiescent power is 0.449 W, giving a total power consumption of 0.538 W.

Table [5.1]: Power consumption of proposed design

| Device      | Value      |
|-------------|------------|
| Family      | Virtex5    |
| Part        | xc5vtx50t  |
| Package     | ff1136     |
| Temp Grade  | Commercial |
| Process     | Typical    |
| Speed Grade | 4          |

| On-Chip | Power (W) | Used  | Available | Utilization (%) |
|---------|-----------|-------|-----------|-----------------|
| Clocks  | 0.089     | 38    | -         | -               |
| Logic   | 0.000     | 25342 | 28800     | 88              |
| Signals | 0.000     | 28581 | -         |                 |
| DSPs    | 0.000     | 4     | 48        | 8               |
| IOs     | 0.000     | 174   | 480       | 36              |
| Leakage | 0.449     |       |           |                 |
| Total   | 0.538     |       |           |                 |

| Supply Source | Voltage (V) | Total Current (A) | Dynamic Current (A) | Quiescent Current (A) |
|---------------|-------------|-------------------|---------------------|-----------------------|
| Vccint        | 1.000       | 0.378             | 0.089               | 0.289                 |
| Vccaux        | 2.500       | 0.062             | 0.000               | 0.062                 |
| Vcco25        | 2.500       | 0.002             | 0.000               | 0.002                 |

| Supply Power (W) | Total | Dynamic | Quiescent |
|------------------|-------|---------|-----------|
|                  | 0.538 | 0.089   | 0.449     |

| Environment      | Value            |
|------------------|------------------|
| Ambient Temp (C) | 50.0             |
| Use custom TJA?  | No               |
| Custom TJA (C/W) | NA               |
| Airflow (LFM)    | 250              |
| Heat Sink        | Medium Profile   |
| Custom TSA (C/W) | NA               |
| Board Selection  | Medium (10"x10") |

| Thermal Properties  | Value |
|---------------------|-------|
| Effective TJA (C/W) | 1.5   |
| Max Ambient (C)     | 84.2  |
| Junction Temp (C)   | 50.8  |
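The totals in Table [5.1] can be cross-checked with a short calculation, shown here as a sketch. It assumes that the power drawn from each supply rail is simply voltage times total current; all numeric values are copied from the table, while the variable names are illustrative.

```python
# (voltage in V, total current in A) for each supply rail, from Table [5.1]
rails = {
    "Vccint": (1.000, 0.378),
    "Vccaux": (2.500, 0.062),
    "Vcco25": (2.500, 0.002),
}

rail_power = {name: volts * amps for name, (volts, amps) in rails.items()}
total_from_rails = sum(rail_power.values())

dynamic_w, quiescent_w = 0.089, 0.449
total_from_split = dynamic_w + quiescent_w

print(f"sum over rails:      {total_from_rails:.3f} W")
print(f"dynamic + quiescent: {total_from_split:.3f} W")
```

Both routes give the same total, so the per-rail currents and the dynamic/quiescent split reported by the tool are mutually consistent.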

Area reduction was also one of the goals of the proposed H.265 (HEVC) design. Table [5.2] compares the gate count of each function block in the previous and proposed designs. For example, the bit stream parser FSM is reduced from 23,721 in the previous design to 101 in the proposed design, and intra pred mode decoding from 2,744 to 84.

Table [5.2] Comparison of gate count of previous design with proposed design

| Function block               | Proposed design | Previous design |
|------------------------------|-----------------|-----------------|
| Bit stream parser FSM        | 101             | 23,721          |
| Intra pred mode decoding     | 84              | 2,744           |
| Inter motion vector decoding | 371             | 15,600          |
| IQIT                         | 815             | 14,310          |
| Intra pred                   | 685             | 17,743          |
| Inter pred                   | 1,580           | 53,462          |
| Deblocking filter            | 1,316           | 33,182          |
| Number of slices             | 2,389           | 169k            |
| Frequency                    | 1.5 MHz         | 53.55 MHz       |

Inter motion vector decoding is reduced from 15,600 in the previous design to 371 in the proposed design, IQIT from 14,310 to 815, intra pred from 17,743 to 685, inter pred from 53,462 to 1,580, and the deblocking filter from 33,182 to 1,316. The number of slices is reduced from 169k to 2,389, and the operating frequency from 53.553 MHz to 1.5 MHz.

Table [5.3] lists the overall video codec utilization parameters. Of the 28,800 slice registers available, only 84 are used (1%), and of the 28,800 slice LUTs available, only 157 are used (1%). Of the 201 LUT-FF pairs used, 117 (58%) have an unused flip-flop and only 40 (19%) are fully used. The overall system design has been simulated and synthesized using Xilinx 14.2 and implemented on an FPGA. RTL schematics are shown in Fig [5.3] to Fig [5.5].


| Slice Logic Utilization                                         | Used | Available | Utilization |
|-----------------------------------------------------------------|------|-----------|-------------|
| Number of Slice Registers                                       | 84   | 28,800    | 1%          |
| Number used as Flip Flops                                       | 84   |           |             |
| Number of Slice LUTs                                            | 157  | 28,800    | 1%          |
| Number used as logic                                            | 157  | 28,800    | 1%          |
| Number using O6 output only                                     | 135  |           |             |
| Number using O5 and O6                                          | 22   |           |             |
| Number of occupied Slices                                       | 85   | 7,200     | 1%          |
| Number of LUT Flip Flop pairs used                              | 201  |           |             |
| Number with an unused Flip Flop                                 | 117  | 201       | 58%         |
| Number with an unused LUT                                       | 44   | 201       | 21%         |
| Number of fully used LUT-FF pairs                               | 40   | 201       | 19%         |
| Number of unique control sets                                   | 21   |           |             |
| Number of slice register sites lost to control set restrictions | 0    | 28,800    | 0%          |
| Number of bonded IOBs                                           | 152  | 480       | 31%         |
| Number of BUFG/BUFGCTRLs                                        | 1    | 32        | 3%          |
| Number used as BUFGs                                            | 1    |           |             |
| Average Fanout of Non-Clock Nets                                | 3.74 |           |             |

Table [5.3] Overall video codec utilization parameters
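The utilization percentages in Table [5.3] can be reproduced from the used and available counts. The sketch below rests on an assumption inferred from the printed figures rather than from tool documentation: percentages are truncated to whole numbers, and any non-zero usage is reported as at least 1%. The function name is illustrative.

```python
def utilization_percent(used, available):
    """Whole-number utilization (assumed: truncated, at least 1% when used > 0)."""
    if used == 0:
        return 0
    return max((100 * used) // available, 1)


rows = {
    "Slice registers": (84, 28800),
    "Slice LUTs": (157, 28800),
    "Occupied slices": (85, 7200),
    "Pairs with an unused flip-flop": (117, 201),
    "Pairs with an unused LUT": (44, 201),
    "Fully used LUT-FF pairs": (40, 201),
    "Bonded IOBs": (152, 480),
}
for name, (used, available) in rows.items():
    print(f"{name}: {utilization_percent(used, available)}%")
```

This explains why 84 of 28,800 slice registers, about 0.3%, is listed as 1% in the table.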



Fig [5.3]: RTL schematic of DECODER

Fig [5.4]: RTL schematic of ENCODER

Fig [5.5]: RTL schematic of H.265/HEVC video codec

#### 6. CONCLUSION

This paper describes a low-power H.265/HEVC baseline decoder. A low-power design methodology for video decoding is proposed and implemented. The design is developed in Xilinx ISE 13.4 and implemented on a Virtex-5 FPGA. H.265 halves the bit rate while giving nearly the same video quality as H.264. The gate counts of the proposed and previous designs have been compared, and the comparison shows that the gate count of the proposed design is far lower than that of the previous design. The deblocking filter in H.265 is more complicated because of its more adaptive filtering process and its smaller partition sizes, so future work will introduce new techniques to simplify the operation of the deblocking filter.

#### 7. ACKNOWLEDGMENTS

Our special gratitude goes to all who supported the development of the Nova decoder.

#### 8. REFERENCES

- [1] Ke Xu, Chiu Sing Choy, "Low power H.264/AVC baseline decoder for portable applications", ISLPED 07, August 27-29, 2007, Portland, Oregon, USA. Copyright 2007 ACM 978-1-59593-709-4/07/008.
- [2] Ke Xu, Chiu Sing Choy, "Priority-based heading one detector in H.264/AVC decoding", EURASIP Journal on Embedded Systems, vol. 2007, Article ID 60834, 2007.
- [3] Waleed Ahmed El-Ghobashy, Mohammed Ebian, et al., "An efficient implementation method of H.264 CAVLC video coding using FPGA", DOI 978-1-5090-0275-7/15, pp. 212-216, IEEE conference, 2015.
- [4] Maleen Abeydeera, Manupa Karunaratne, et al., "4K real time HEVC decoder on FPGA", IEEE Transactions on Circuits and Systems for Video Technology, TCSVT 2015.
- [5] Gary J. Sullivan, "Overview of High Efficiency Video Coding (HEVC) standard", 1051-8215, 2012 IEEE.
- [6] Seongmo Park, Hanjin Cho, Heebum Jung, and Dukdong Lee, "An Implemented of H.264 Video Decoder using Hardware and Software", analyzed implementation.
- [7] D. Wu, W. Gao, M. Z. Hu, Z. Z. Ji, "A VLSI architecture design of CAVLC decoder", The 5<sup>th</sup> International Conference on ASIC, Oct. 2003, pp. 962-965.
- [8] K. Xu, C. S. Choy, C. F. Chan, K. P. Pun, "A low power bitstream controller for H.264/AVC baseline decoding", 32<sup>nd</sup> European Solid-State Circuits Conference, pp. 162-165, Sep. 2006.
- [9] Tsu-Ming Liu, Wen-Ping Lee, Ting-An Lin and Chen-Yi Lee, "A memory-efficient deblocking filter for H.264/AVC video coding", by the National Science Council of Taiwan, R.O.C., under Grant NSC 93-2220-E-009-010.
- [10] Hae-Yong Kang, Kyung-Ah Jeong, Jung-Yang Bae, Young-Su Lee, Seung-Ho Lee, "MPEG4 AVC/H.264 decoder with scalable bus architecture and dual memory controller", 0-7803-8251-X/04/\$17.00, 2004 IEEE.
- [11] Tung-Chien Chen, Chung-Jr Lian, and Liang-Gee Chen, "Hardware architecture design of an H.264/AVC video codec", 0-7803-9451-8/06/\$20.00, 2006 IEEE.