A Novel Way to Design and Implement Statistical Operations based on FPGA

Sarmad F. Ismael
Computer Science Department
Cihan University
Iraq – Duhok

Basil Shukr Mahmood, PhD
Computer Engineering Department
University of Mosul
Mosul – Iraq

ABSTRACT
The architecture design for statistical operations to compute the Mean, Variance, Standard Deviation, RMS (Root Mean Square), Covariance, and MSE (Mean Square Error) values has been implemented on hardware concerning Xilinx Spartan 3E XC3S500E FPGA and worked properly up to maximum frequency of 73.252 MHz. The practical outcomes have been compared with the theoretical values calculated by Matlab with maximum error of 1.425%. New methods of design were concerned for the architecture of each function to reduce the number of slices.

General Terms
Computer Architecture, computational technology, Statistical theory

Keywords
FPGA, VHDL, Statistical Operations, Accumulators, fixed point.

1. INTRODUCTION
The most commonly statistical used operations were mean, variance, standard deviation, RMS, Covariance, and MSE. The statistical operations were used in a large number of applications such as digital signal processing [1], image proceeding [2] and many other applications.

In this paper, a novel way is presented to implement the architecture design to do the statistical operations. There are many researches deal with statistical operations, but most of them don’t refer to their architectures using FPGA.

In 2008, three methods of accumulation have been examined to find the mean and variance to adapt them to be used in FPGA applications by David B. Thomas and Wayne Luk[3]. In 2009, a floating-point accumulator for FPGA-based high performance computing applications is proposed, evaluated and implemented in FPGA-based by Song Sun and Joseph Zambreno [4]. In 2010, a floating-point accumulation was performed on a modern FPGA in Single and Double Precision by Tarek Ould Bachir and Jean-Pierre David [5].

The paper is organized as follows: In section (II), the background and theory of the architectural design is presented. In section (III), the hardware Implementation details are shown. In section (IV), the experimental result is given. Accordingly, the section (V) concludes this paper.

2. BACKGROUND AND THEORY
The statistical operations [6] presented in this paper are as follow:

A. Mean value (or average):

\[ X = \frac{1}{n} \sum_{i=1}^{n} X_i \]  

B. Variance value:

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]  

C. Standard Deviation value:

\[ s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2} \]  

D. Root Mean Square value:

\[ RMS = \sqrt{\frac{1}{n} \sum_{i=1}^{n} X_i^2} \]  

E. Mean Square Error value:

\[ MSE(x, y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - Y_i)^2 \]  

F. Covariance value:

\[ c = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]  

To find the variance, standard deviation and covariance values firstly find the mean and then use it in the above equations. These operations need two iterative steps, therefore their expressions have been analyzed as follows:

\[ s^2 = \frac{1}{n} \left[ \sum_{i=1}^{n} X_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right)^2 \right] \]  

\[ MSE = \frac{1}{n} \left[ \sum_{i=1}^{n} X_i^2 - 2 \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right] \]  

\[ c = \frac{1}{n} \left[ \sum_{i=1}^{n} X_i Y_i - \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right) \left( \sum_{i=1}^{n} Y_i \right) \right] \]

3. HARDWARE IMPLEMENTATION DETAILS
The general design of the proposed architecture to compute the statistical operations is described in Fig. 1. Using the above manipulated equations, all operations are share with the same accumulators, therefore the components to compute the accumulator operations has been implemented as shown in Fig.1.
Fig 1: General Component of Statistical operations Architecture

After computing the values of all accumulators for the all expressions, the arithmetic operations such as Add, Subtract, Shift, and Square Root are needed to produce the final results. The arithmetic operations used in the proposed architecture are:

3.1 Accumulators:
The Accumulators used in the design represent the main operations in the proposed architecture are of two types: the general accumulator (GAC) and the multiplier accumulator (MAC) as shown in Figure 2 [4][5]:

3.2 Addition and Subtraction:
As these operations are ready in VHDL, their codes were copied to the proposed design.

3.3 Multiplier:
Characterized by low size and very high speed, the 18X18 Embedded Multipliers in Spartan 3E XC3S500E FPGA have been used.

3.4 Division:
The division operation cannot be implemented directly by using VHDL, therefore many algorithms are introduced for the implementation. The best one of them is the non-restoring algorithm [7] which is suitable for the large number of accumulator bits. Since the large number of bits in the accumulator usually needs large number of slices in design, the number of input data fixed to 16 inputs and shift to right is used to represent division operation instead of using division algorithm. Shift to right 4 bits to represent division by N and the conjugate multiplication method have been suggested to simplify the denominator N-1 as show in Eq. 10.

\[ \frac{1}{N-1} \frac{N}{N+1} + \frac{N^2}{N+1} + \frac{N^3}{N+1} + \ldots + \frac{1}{N^m} + \frac{1}{N^{m-1}} + \ldots + \frac{1}{N^{n-1}} \]  

The accumulator result is shifted to right four times (4 bits, 8 bits, 12 bits, 16 bits), and then add them to produce the result.

3.5 Square Root:
The modified non-restoring algorithm has been used to implement square root operation which is depending on subtract operation and append 01 and need one clock cycle to produce the result [8].

Fig. 3 shows the design of the proposed architecture.

The proposed architecture consists of two main parts, the first part is to calculate the five accumulator operations and to input data X and Y which work in parallel, therefore they need just 16 clocks to complete all values and then they can be used many times to produce all statistical operations. The second one contains many 2x1 multiplexers (MUX) to select which input X or Y is used as well as many arithmetic operations that have been discussed before.

The reset bit has been used to synchronously reset all accumulator operations when new next input data acquired. Two types of MUX have been used, the first one is 2X1 MUX to select whether the input data is X or Y, depending on the first bit of the selector signal. The second multiplexer 8X1 MUX is used to select which statistical operation has to appear on the output depending on the last three bits of the selector. Table 1 shows which statistical operation is selected according to the input selector signal. The remaining operations are add, sub, shift right and shift left have been used to accomplish the remaining statistical operations.

The fixed point package [9] is used to represent variables in the VHDL code. Table 2 shows the length and the format of each variable used in the design. The output of the statistical operations assumed to have 24 bit lengths, therefore they are padded to be 24 bit length.
Table 1. Statistical operations selector according to input selector signal

<table>
<thead>
<tr>
<th>Selector (3)</th>
<th>Selector (2)</th>
<th>Selector (1)</th>
<th>Selector (0)</th>
<th>Statistical operations</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>Mean (X)</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>Mean (Y)</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>Variance (X)</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>Variance (Y)</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>Standard Deviation (X)</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Standard Deviation (Y)</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>RMS (X)</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>RMS (Y)</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>-</td>
<td>Covariance (X,Y)</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>-</td>
<td>MSE (X,Y)</td>
</tr>
</tbody>
</table>

Table 2. The numbers of bits representation for each element

<table>
<thead>
<tr>
<th>Operations</th>
<th>NO. of real part + sign bit</th>
<th>NO. of fraction part</th>
</tr>
</thead>
<tbody>
<tr>
<td>mean</td>
<td>5</td>
<td>8</td>
</tr>
<tr>
<td>variance</td>
<td>12</td>
<td>12</td>
</tr>
<tr>
<td>Standard Deviation</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<td>RMS value</td>
<td>5</td>
<td>11</td>
</tr>
<tr>
<td>Covariance</td>
<td>12</td>
<td>12</td>
</tr>
<tr>
<td>MSE value</td>
<td>13</td>
<td>11</td>
</tr>
</tbody>
</table>

Table 3. X and Y values used as inputs

<table>
<thead>
<tr>
<th>X Values</th>
<th>Y Values</th>
<th>HEX for X</th>
<th>HEX for Y</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>2.3</td>
<td>032</td>
<td>01E</td>
</tr>
<tr>
<td>-1.6</td>
<td>1.9</td>
<td>1E6</td>
<td>01E</td>
</tr>
<tr>
<td>1.12</td>
<td>-1.1</td>
<td>012</td>
<td>1EE</td>
</tr>
<tr>
<td>-0.91</td>
<td>-0.81</td>
<td>1F1</td>
<td>1F3</td>
</tr>
<tr>
<td>-10.111</td>
<td>0.78</td>
<td>15E</td>
<td>00C</td>
</tr>
<tr>
<td>0.4</td>
<td>0.515</td>
<td>006</td>
<td>008</td>
</tr>
<tr>
<td>2.2</td>
<td>0.629</td>
<td>023</td>
<td>00A</td>
</tr>
<tr>
<td>-5.6</td>
<td>-3.23</td>
<td>1A6</td>
<td>1CC</td>
</tr>
<tr>
<td>-12.7</td>
<td>11.6</td>
<td>135</td>
<td>0BA</td>
</tr>
<tr>
<td>3.8</td>
<td>-15.313</td>
<td>03D</td>
<td>10B</td>
</tr>
<tr>
<td>4.66</td>
<td>-2.9</td>
<td>04B</td>
<td>1D2</td>
</tr>
<tr>
<td>-1.222</td>
<td>0.1</td>
<td>1EC</td>
<td>002</td>
</tr>
<tr>
<td>10.61</td>
<td>-0.09</td>
<td>0AA</td>
<td>1FF</td>
</tr>
<tr>
<td>0.99</td>
<td>0.13</td>
<td>010</td>
<td>002</td>
</tr>
<tr>
<td>-1.31</td>
<td>8.9</td>
<td>1EB</td>
<td>08E</td>
</tr>
<tr>
<td>9.87</td>
<td>4.2</td>
<td>09E</td>
<td>043</td>
</tr>
</tbody>
</table>

4. EXPERIMENTAL RESULTS

The input data for X and Y first saved in the distributed memory array of Spartan 3E as 16 elements each 9 bits (sign bit, 4 bits for real part and 4 bits for fraction part).

Signed fixed numbers have been used to represent X and Y values. Table 3 shows examples of X and Y values with their hex_decimal representations.

After feeding X and Y values inputs, the values of all the statistical operations will appear on the output after 16 clock cycles, the simulation results of all the statistical operations coming out from the ISE 14.7 simulator are shown in Figure 4.

The theoretical values calculated by Matlab for the same input values have been tested and compared with the experimental practical values taken, and the error values between them are shown in table 4.

Error values calculated based on Eq. 11 from the table 4 shows that the maximum error produced is not greater than 1.425%.

Error \[= \frac{\text{Theoretical value} - \text{Practical value}}{\text{Theoretical value}} \times 100\] (11)

The complete architecture design was implemented on Xilinx Spartan 3E XC3S500E FPGA by using ISE14.7 simulator. After synthesizing the design by using ISE14.7 simulator, it is found that the maximum operating frequency becomes 73.252 MHz and the summary of resources utilization from the FPGA chip is shown in table 5.
The maximum operating frequency becomes 73.252MHz. This enhancement in performance is due to the usage of novel way to deal with the statistical equations and the use of shift operations instead of using division operation which needs more time, therefore 16 clock cycles are needed to produce all statistical operations, this means that results of all statistical operations are calculated by less than 220 nsec. The proposed architecture is useful in real time applications as well.

In future works, the proposed architecture can be used to design a statistical processor in order to solve statistical problems in many science fields that require these operations.

5. CONCLUSIONS AND FUTURE WORK

The proposed architecture produces six statistical operations for two inputs X and Y with better performance by using a small area and high frequency of Spartan 3E chip and few clock cycles. The used area of Spartan 3E is 23% and the maximum operating frequency becomes 73.252MHz. This enhancement in performance is due to the usage of novel way to deal with the statistical equations and the use of shift operations instead of using division operation which needs more time, therefore 16 clock cycles are needed to produce all statistical operations, this means that results of all statistical operations are calculated by less than 220 nsec. The proposed architecture is useful in real time applications as well.

In future works, the proposed architecture can be used to design a statistical processor in order to solve statistical problems in many science fields that require these operations.

6. REFERENCES

[9] D. Bishop, 1076-2008 Fixed point package user’s guide. Packages and bodies for the IEEE.