Optimising the New Chinese Remainder Theorem 1 for the Moduli Set

\[ \{2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2\} \]

John Bosco Aristotle K. Ansuura  
ICT Directorate  
University for Development Studies-Ghana

Ismail Rashid Fadulilahi  
Department of computer science  
University for development Studies-Ghana

ABSTRACT
This paper seeks to improve the performance of the New Chinese Remainder Theorem (CRT) using the new moduli set \( \{2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2\} \). This optimization is very important in order to minimize the cost of hardware implementation and to improve the reverse conversion speed. The major factor responsible for this high hardware cost and high reverse conversion time is the presence of multipliers in the hardware implementation of the reverse converters. This paper proposes the moduli set \( \{2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2\} \), which is applicable for applications requiring larger dynamic range. The moduli set must be relatively prime integers. The computation of multiplicative inverses can be eliminated. We employ the proposed moduli set to optimize the New CRT-I. This scheme can result in less memory and adder based reverse converters, which is shown to be better than known existing similar state of the art scheme.

Keywords
Reverse Conversion, Optimization, Algorithm, Co-prime.

1. INTRODUCTION
Recent times have seen vigorous and continuous research into the improvement of computer performance. Researchers are making progress in improving the efficiency of computers with new ideas and technologies. Computing is the major task of a computer that is dealing with numbers all the time hence the number system. Some examples of number systems are binary number systems, decimal number systems, Residue Number System (RNS) and many more. Research has revealed that binary and decimal number systems intrinsically limit the performance of arithmetic units and processors built based on them. This is a limitation of the Weighted Number System (WNS) therefore making RNS more preferred in computing larger numbers in computers. A number in RNS is represented by the residues of all moduli, and arithmetic operations can be performed independently on each modulus. Thus, RNS offers the properties of parallelism, carry-free addition, borrow free subtraction, which are the major challenges of binary and decimal systems [10]. According to [9], the third-century Chinese scholar Sun Tzu invented the Residue Number System (RNS). Sun Tzu posed a mathematical riddle:

\begin{align*}
\text{We have things of which we do not know the number} \\
\text{If we count them by threes, we have two left over} \\
\text{If we count them by fives, we have three left over} \\
\text{If we count them by sevens, we have two left over. Tzu [9] gave a rule, how many things are there?}
\end{align*}

This riddle was later generalized by another Chinese and known as the Chinese remainder theorem which is the bases of RNS [10].

In the 1950s, RNS was rediscovered by some computer scientists who sought to put them to use in the implementation of fast arithmetic and fault-tolerant computing [10]. Their system also offered useful properties for error detection, error correction and fault tolerance in digital systems. These properties increase the efficient in carrying out arithmetic operations. The speed of arithmetic operations relies largely on the size of the numbers involved, smaller numbers result in faster operations. Smaller numbers were considered in their research and therefore known for faster arithmetic operations. This system is applied in the fields of Digital Signal Processing (DSP), Speech Processing, Image Processing, Computer Engineering and Computer Security.

Sun Tzu [9] also proposed a general structure of a typical RNS processor as shown in Figure 1. Data that is represented in RNS is processed in parallel with no dependence or carry propagation between the processing units. The process of encoding the input data into RNS representation is called Forward Conversion. This data when processed is converted back to the conventional representation. Reverse conversion is the process of converting back the output data from RNS to conventional representation.
2. SIGNIFICANCE OF RNS REPRESENTATION

Residue Number System is an integer system which is capable of supporting parallel, carry-free and high speed arithmetic operations [12], [3] and [5]. In addition, RNS offers some useful properties for error detection, error correction and fault tolerance in digital systems. It is very efficient in carrying out arithmetic operations like additions, subtractions and multiplications. The speed of the arithmetic operations relies on the size of the numbers involved.

Omondi A and Premkumar B [10], proposed an algorithm using parallel distributed arithmetic with no dependence between the arithmetic blocks which simplifies the overall design and reduces the complexity of the individual building blocks. [10] Summarized the advantages of RNS representation as follows:

High Speed: In conventional digital processors, the critical path is associated with the propagation of the carry signal to the last bit of the arithmetic unit. RNS encodes large words into small words minimizing the critical path. Also, the carry-free property of RNS between the arithmetic blocks results in high speed processing.

Reduced Power: RNS processors reduce the switching activities in each channel by using small arithmetic units resulting in a reduction of the dynamic power. This is because the dynamic power is directly proportional to switching activities.

Reduced Complexity: The property of RNS that makes it encode large numbers into smaller residues reduce the complexity of the arithmetic units in each modulo channel. This facilitates and simplifies the overall design.

3. DRAWBACKS OF RNS REPRESENTATION

[11] highlights the advantages of RNS architectures especially in areas of speed and power. These make it suitable to implement RNS in different applications. However, RNS processors are not widely use but remain as an interesting theoretical topic. There are two main reasons behind the limited use of RNS in applications;

1. Although RNS representation simplifies and expedites addition and multiplication compared to the conventional binary system, other operations such as division, square-root, sign detection, and comparison are difficult and costly operations in the residue domain. Thus, this makes it difficult to build an RNS based ALU capable of performing the basic arithmetic operations.

2. Conversion circuitry can be complex and can introduce latency that offsets the speed gained by the RNS processor. Hence, the design of efficient conversion circuits is considered the bottleneck of a successful RNS.

4. APPLICATIONS

RNS is suitable for applications in which addition and multiplication are the predominant arithmetic operations because of its carry-free property. RNS has good potential in applications where speed and/or power consumption is very critical. In addition, RNS facilitates error detection and correction due to the isolation between the modulo channels. Examples of these applications are digital signal processing (DSP), digital image processing, RSA algorithms, communication receivers, and fault tolerance. Intensive multiply-and-accumulate (MAC) operations are required in most of these applications.

Sun Tzu [9] proposed the design of digital filters which is an RNS application in DSP. These digital filters have different uses such as interpolation, decimation, equalization, noise reduction, and band splitting.

Later, [5] identified two basic types of digital filters: Finite Impulse Response (FIR) filters and Infinite Impulse Response (IIR) filters. These filters mostly use multiplication and addition operations. These two arithmetic operations in the residue domain increases system speed and lowers the power consumption.

RNS could also be applied in the field of cryptography to secure information [14]. The research implements an efficient algorithm of RNS for RSA cryptography which enhances security and also decrease delay time complexity for encoding and decoding. The area of hardware requirement for implementation was reduced.

Another possible application of RNS in DSP is the Discrete Fourier Transform (DFT), an engineering based application. In this application, multiplication and addition are equally the main operations of the DFT. Hence, faster operations due to the parallelism in the processing. In addition, the carry-free property of the RNS makes it potentially very useful in fault tolerant applications. Recent integrated circuits are very dense, and therefore full testing will no longer be possible. RNS has no weight information this implies an error in one of the residues does not affect the other modulo channels. Consequently, ordering is not important in RNS representation, therefore, faulty residues can be discarded and corrected separately.

In summary, RNS seems to be good for many applications that are important in modern computing algorithms.

5. CHOICE OF MODULI

The choice of the moduli set is the major consideration in the design of RNS systems this is because the efficient choice of the moduli guarantees the most efficient outcome possible from an RNS system in terms of speed, hardware etc. Consequently, for RNS the moduli \( \{m_1, m_2, ..., m_n\} \) should satisfy the following properties:

1. The moduli should be relatively prime i.e. no two moduli should have a greatest common divisor greater than 1.
2. The moduli should be as small as possible, so that the modulo operations require less computational time.
3. The moduli should be in such a form so as to offer simple forward (weighted to RNS) and reverse (RNS to weighted) conversion with simple residue arithmetic.
4. The product of the moduli should be large enough so as to offer the required dynamic range for the particular system.
5. The moduli should create a balanced decomposition of the dynamic range which means the difference between the number of bits of different moduli should be as small as possible for achieving optimal parallel performance.
6. CONVERSIONS AND THEORIES

6.1 Data Conversion
Data conversion is one of the greatest challenges of RNS because the input operands are provided in either standard binary or decimal format and must be converted to RNS before the computation can be performed. Similarly, the final results must be represented in the same way as the input operands, thus RNS to binary/decimal conversion is very essential to a successful RNS design. This implies that RNS based processors make heavy use of data conversions, which are slow processes. For an RNS processor to compete favorably with a conventional processor efficient data converters must be developed so that the RNS speedup will not be nullified by the conversion overhead. Data conversion can be divided into two Categories, namely, forward and reverse conversion. Relatively, the reverse conversion is more complex but the forward conversion is not simple either.

6.2 Forward Conversion
The input operands to the RNS processor are either in the decimal or binary format, and therefore need to be converted into their respective residues before they are used for the computation. This work of converting from decimal to binary to residue is done by the forward conversion.

6.3 Reverse conversion from RNS to binary representation
Reverse conversion algorithms in the literature are all based on either Chinese Remainder Theorem (CRT) or Mixed-Radix Conversion (MRC). The MRC is an inherently sequential approach. On the other hand, the CRT can be implemented in parallel. The main drawback of the CRT based Residue to Binary reverse converter, is the need of a large modulo adder in the last stage. The reverse conversion is one of the most difficult RNS operations and has been a major, if not the major, limiting factor to a wider use of RNS.

6.4 Mixed Radix Conversion (MRC)
According to [9], given a set of pair-wise relatively prime moduli \( \{m_1, m_2, ..., m_n\} \) and a residue representation \( \{x_1, x_2, ..., x_n\} \) in that system of some number \( X \), where \( x_i = [X]_{m_i} \). The number \( X \) can be represented uniquely in mixed-radix form as \( X = (x_1, x_2, ..., x_n) \) where \( X = x_1 + x_2m_1 + x_3m_2m_1 + ... + x_nm_2m_3m_2m_1 \) and \( 0 \leq x_i \leq m_i \)

The Mixed-Radix Conversion (MRC) establishes an association between the unweighted, non- positional RNS and a weighted, positional mixed-radix system. In order to perform a reverse conversion the \( z_i \) values must be obtained. The \( z_i \) values are obtained as follows:

\[
z_1 = x_1, \quad z_2 = \left[m_2^{-1}m_1(x_2 - z_1)\right]_{m_2}, \quad z_3 = \left[m_3^{-1}m_2\left(m_1^{-1}m_2(x_3 - z_2) - z_1\right)\right]_{m_3}, \quad \vdots
\]

\[
z_N = \left[m_N^{-1}m_{N-1}\left(...m_2^{-1}m_1(x_N - z_1) - z_2\right) - z_3\right]_{m_N}
\]

Example 1: Suppose we wish to find the number, \( X \), whose residue representation is {1, 0, 4, 0} relative to the moduli set \( \{2, 3, 5, 7\} \)

From the equations above,

\[
z_1 = 1, \quad z_2 = [2^{-1}][2(0 - 1)] = 2 \times [-1] = 2 \\
z_3 = [3^{-1}][2^{-1}(4 - 1)] = 2 \times [3 \times 3 \times -1] = 5 \\
z_4 = [5^{-1}][3^{-1}][2^{-1}(0 - 1)] = 2[27] = 6
\]

Therefore,

\[
X = (1116) \text{ and for the conventional form, we translate this as } X = 6 \times 2 \times 3 \times 5 \times 1 \times 2 \times 3 \times 1 = 189
\]

6.5 Chinese Remainder Theorem (CRT)
The Chinese Remainder Theorem (CRT) may rightly be viewed as one of the most important fundamental results in the theory of residue number systems. CRT assures us that if the moduli of RNS are chosen appropriately then each number in the dynamic range will have a unique representation in RNS and that from such a representation we can determine the number represented. According to [9], [3] and [6], CRT is useful in reverse conversion as well as several other operations. Given a set of pair-wise relatively prime moduli, \( m_1, m_2, m_3, ..., m_n \), and a residue representation \( (x_1, x_2, ..., x_n) \) in that system of some number \( X \).

That is \( x_i = [X]_{m_i} \), that number and its residues are related by the equation \( X = \sum_{i=1}^{n} x_{i} \left[M^{-1}i\right]_{m_{i}}M \)

The expression above is the Chinese Remainder Theorem

Where \( M = \prod m_i \)

Example 2: Consider the moduli set \( \{3, 5, 7\} \), and suppose we wish to find the \( X \) whose residue representation is \( \{1, 2, 3\} \).

Solution

\[
M = 3 \times 5 \times 7 \quad M_1 = M/m_1 = (3 \times 5 \times 7)/3 \\
M_2 = M/m_2 = (3 \times 5 \times 7)/5 \\
M_3 = (3 \times 5 \times 7)/7
\]

Where

\[
\left| M_1 M_1^{-1} \right| = 1 \quad \left| 35 M_2^{-1} \right| = 1 \quad M_1^{-1} = 2 \\
\left| M_2 M_2^{-1} \right| = 5 \quad \left| 21 M_3^{-1} \right| = 5 \quad M_2^{-1} = 1 \\
\left| M_3 M_3^{-1} \right| = 1 \quad \left| 15 M_3^{-1} \right| = 1 \quad M_3^{-1} = 1
\]

Then by the CRT, we have \( (M = 3 \times 5 \times 7) = 105 \)

\[
X = \sum_{i=1}^{3} x_i |X_i|_{105}
\]

\[
X = 1 \times 35 \times 2 + 2 \times 21 \times 1 + 3 \times 15 \times 1|_{105} = 52
\]

6.6 New Chinese Remainder Theorem I (CRT I)
According to [8], the speed of the arithmetic operations is based on the numbers involved in the operations. The size of the numbers is directly proportional to the delay of the
operations, and therefore smaller numbers imply faster operations. However, the Chinese Remainder Theorem requires a slow large modulo operation while the Mixed Radix Conversion requires finding the mixed radix digits which is a slow process. New Chinese Remainder Theorems were designed to make the computations faster and efficient without any overheads. New Chinese Remainder theorem I (CRT I) is a modified version of the traditional Chinese Remainder Theorem. In this conversion process, the weighted number can be retrieved faster because the operations are done in parallel, without depending on other results.

Some propositions proposed are necessary for this new conversion and are as follows,

**Proposition 1:**
\[ a = 1 \mod (m_1, m_2) \text{ implies } a = 1 \mod m_1 \text{ and } a = 1 \mod m_2 \]

The above proposition is obtained from the corollary,

\[ a = 1 \mod (m_1, m_2 \ldots m_k) \text{ implies } a = 1 \mod m_1, a = 1 \mod m_2, \ldots, a = 1 \mod m_k \]

**Proposition 2:**
\[ [am_1] \mod (m_1, m_2) = [a] \mod m_2 \times m_1 \]

**Proposition 3**
For any \( y \) belonging to \([0, M - 1]\), where \( M = m_1 \times m_2 \times \ldots m_{n-1} \times m_n \), there is unique mixed radix representation as follows, where \( y_i \) satisfies the condition
\[ 0 \leq y_i \leq m_{i+1} \]
\[ y = y_0 + y_1 m_1 + y_2 m_1 m_2 + \ldots + y_{n-1} m_1 m_2 \ldots m_{n-1} \]

Given the residue numbers \((x_1, x_2, x_3, \ldots, x_n)\), the corresponding weighted number \( X \) can be computed using the following equation.
\[ X = [x_1 + k_1 m_1 (x_2 - x_1) + k_2 m_1 m_2 (x_3 - x_2) + \ldots + k_{n-1} m_1 m_2 \ldots m_{n-1} (x_n - x_{n-1})] \mod m_1 m_2 \ldots m_{n-1} m_n \]

Where \( k_1 = (m_1)^{-1} \mod m_2 m_3 \ldots m_n \), \( k_2 = (m_1 m_2)^{-1} \mod m_3 m_4 \ldots m_n \), and similarly for multiplications, thereby reducing area and delay. Further area improvements are possible by exploiting the period of terms to be added. An algorithmic approach is used to obtain full adder-based architectures that are optimized for area and delay.

**7. ADDER**
An adder or summer is a digital circuit that performs addition of numbers. In many computers and other kinds of processors, adders are used not only in the arithmetic logic unit(s), but also in other parts of the processor, where they are used to calculate addresses, table indices, and similar operations.

The figure 3 below is the hardware assembly of \( X \) without optimization.

**8. OPTIMIZED HARDWARE IMPLEMENTATION OF NEW CHINESE REMAINDER THEOREM I**
The new Chinese Remainder theorem one (CRTI) was optimized by removing inverse modulo operations which reduced the overhead of using big size summation terms in the equations. But, the equation carried multiplication terms which requires additional full and half adders, and this is a barrier for hardware implementation. This section further attempt to remove those multiplication terms which means the usage of additional full and half adders can be eliminated. The optimization is carried out by using carry save adders, which reduces the time delay in carrying out the multiplication operations.

Optimized Hardware Implementation of the Base Theorem for New Chinese Remainder Theorem one (CRT1)
Using the moduli set proposed, \( M = \{2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2^{2n} + 1\} \) where \( n = 1, 2, 3, \ldots \) and the optimized equation from the previous section, we get
\[ X = x_1 + m_1^2 (x_2 - x_1) + m_1 m_2 (x_3 - x_2) + m_1 m_2 m_3 (x_4 - x_3) \mod (m_1 m_2 m_3 m_4) \]

Putting the above moduli set values in the equation, we get.
\[ X = x_1 + [Q_1 + Q_2 + Q_3] \mod (m_1 m_2 m_3 m_4) \ldots (1) \]
Where

\[ Q_1 = (2^{2n+4} + 2^{2n+4} + 2^{2n+3} + 2^3 + 1) \times (x_2 - x_1) \]

\[ Q_2 = (2^{4n+3} + 2^{2n+3} + 2^{2n+1} + 2^1 + 1) \times (x_3 - x_2) \]

\[ Q_3 = [(2^{4n+3} + 2^{2n+3} + 2^{2n+1} + 2^1 + 1) \times (2^{2n} + 1)] \times (x_4 - x_3) \]

\[ Q_3 = (2^{2n}a + a) (x_4 - x_3) \]

Where

\[ a = 2^{4n+3} + 2^{2n+3} + 2^{2n+1} + 2^1 + 1 \]

The hardware optimization using carry save adders reduces the number of steps needed for the implementation. This intends helps to reduce the area and delay for implementation.

The figures 3, 4, 5 and 6 below show the optimized implementation of equation using carry save adders and carry propagate adders.

Hardware implementation of \( Q_1, Q_2, Q_3 \) and \( X \) as shown in Figures 3, 4, 5 and 6 respectively.

**ANALYSES OF RESULTS**

**Area and Delay of Implementation**

The tables below show the area and delay of implementation of the optimized equation with the proposed moduli set \( \{2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2\} \).

**Area and Delay of \( Q_1 \)**

In the implementation of \( Q_1 \) as shown in Fig 3, we have 3 carry save adders, 1 carry propagate adder and 1 subtractor. The proposed moduli set is a 6n + 5 moduli set. The delay of a \( CSA = 1 \) and the area is 6n + 5. The table below shows the calculations of the area and delay.

<table>
<thead>
<tr>
<th></th>
<th>Area</th>
<th>Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>( CSA = 3 )</td>
<td>18n + 15</td>
<td>3</td>
</tr>
<tr>
<td>( CPA = 1 )</td>
<td>6n + 5</td>
<td>12n + 10</td>
</tr>
<tr>
<td>Sub-tractor</td>
<td>2n + 3</td>
<td>2n + 3</td>
</tr>
<tr>
<td>Total</td>
<td>26n + 23</td>
<td>14n + 16</td>
</tr>
</tbody>
</table>

**Area and Delay of \( Q_2 \)**
The implementation of Fig 4 shows that, we have 3 carry save adders, 1 carry propagate adder and 1 sub-tractor. This is shown below.

### Table 2 Area and Delay of Q

<table>
<thead>
<tr>
<th></th>
<th>Area</th>
<th>Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>CSA = 3</td>
<td>18n + 15</td>
<td>3</td>
</tr>
<tr>
<td>CPA = 1</td>
<td>6n + 5</td>
<td>12n + 10</td>
</tr>
<tr>
<td>Sub-tractor</td>
<td>2n + 1</td>
<td>2n + 1</td>
</tr>
<tr>
<td>Total</td>
<td>26n + 21</td>
<td>14n + 14</td>
</tr>
</tbody>
</table>

### Area and Delay of Q₃

The implementation of Fig 5 shows that, we have 3 carry save adders, 2 carry propagate adder and 1 sub-tractor. This is shown below.

### Table 3 Area and Delay of Q₃

<table>
<thead>
<tr>
<th></th>
<th>Area</th>
<th>Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>CSA = 3</td>
<td>18n + 15</td>
<td>3</td>
</tr>
<tr>
<td>CPA = 2</td>
<td>12n + 10</td>
<td>24n + 20</td>
</tr>
<tr>
<td>Sub-tractor</td>
<td>2n</td>
<td>2n</td>
</tr>
<tr>
<td>Total</td>
<td>32n + 25</td>
<td>26n + 23</td>
</tr>
</tbody>
</table>

### Area and Delay of X

The implementation of Fig 6 shows that, we have 2 carry save adders and 1 carry propagate adder. This is shown below.

### Table 4 Area and Delay of X

<table>
<thead>
<tr>
<th></th>
<th>Area</th>
<th>Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>CSA = 2</td>
<td>12n + 10</td>
<td>2</td>
</tr>
<tr>
<td>CPA = 1</td>
<td>6n + 5</td>
<td>12n + 10</td>
</tr>
<tr>
<td>Total</td>
<td>18n + 15</td>
<td>12n + 12</td>
</tr>
</tbody>
</table>

The total area = Area Q₁ + Area Q₂ + Area Q₃ + Area X = 102n + 84

The table above shows the comparison between the two moduli set. The proposed one is a 6n + 5 bits moduli set whereas the one being compared with is a 3n + 5 bits moduli set. In terms of area, the proposed moduli set have a larger area than the previous one.

In terms of time delay, the proposed moduli set has a smaller delay. We can therefore conclude that, the proposed moduli set is more efficient in terms of time.

### Illustration

Consider a weighted number X = 100 and the moduli set of the form \( [2^{2n+2} + 3, 2^{2n+1} + 1, 2^{2n} + 1, 2n] \), and taking \( n = 2 \), we get the moduli set \( m = \{m_1, m_2, m_3, m_4\} = \{67, 33, 17, 2\} \).

The dynamic range is \( M = 75174 \) RNS representation of X is shown below:

\[
X = (x_1, x_2, x_3, x_4) = (X \mod m_1, X \mod m_2, X \mod m_3, X \mod m_4) = (33, 11, 15, 0)
\]

\[X_1 = 33, X_2 = 1, X_3 = 15, X_4 = 0\]

From the above implementation, \( Q_1 = (2^{4n+2} + 2^{2n+4} + 2^{2n+3} + 2^3 + 1) * (x_2 - x_1)\)

When \( n = 2 \),

\( Q_2 = (2^{12} + 2^8 + 2^7 + 2^3 + 1) * (1 - 33)\)

\( Q_3 = (2^{12} + 2^8 + 2^7 + 2^3 + 1) * (-33)\)

\( Q_4 = 4489 * 75142\)

\( Q_1 = 337312438\)

\( Q_2 = (2^{4n+3} + 2^{2n+3} + 2^{2n+1} + 2^1 + 1) * (x_3 - x_2)\)

When \( n = 2 \),

\( Q_2 = (2^{11} + 2^7 + 2^5 + 2^1 + 1) * (15 - 1)\)

\( Q_2 = (2^{11} + 2^7 + 2^5 + 2^1 + 1) * (14)\)

\( Q_2 = 2211 * 14\)

\( Q_3 = 30954\)

\( Q_3 = (2^{4n+3} + 2^{2n+3} + 2^{2n+1} + 2^1 + 1) * (2^{2n} + 1) * (x_4 - x_3)\)

When \( n = 2 \),

\( Q_3 = (2^{11} + 2^7 + 2^5 + 2^1 + 1) * (2^4 + 1) * (0 - 15)\)

\( Q_3 = (2^{11} + 2^7 + 2^5 + 2^1 + 1) * (2^4 + 1) * (-15)\)

\( Q_3 = 2211 * 17 * 75159\)

\( Q_3 = 282500133\)

\( X_1 = 33\)

\( X = X_1 + Q_1 + Q_2 + Q_3\)

\( X = (33 + 337312438 + 30954 + 282500133) \mod 75174\)

\( X = 31623444758 \mod 75174 = 100\)

### 10. CONCLUSION

In this paper, we address the problems identified with the New CRT 1 which includes the presence of an inverse modulo and multipliers which makes implementation difficult and expensive in terms of speed, area and cost. Our scheme
optimizes the New Chinese Remainder Theorem one (CRT 1) by eliminating the inverse modulo operators and also reduce the number of multipliers using four moduli set to implement the hardware optimization using carry save adders therefore reduces the delay. Our proposed scheme’s implementation shows clearly that the proposed moduli set is better than the other ones stated in the work with regards to time.

The future direction of this work is to extend the number of moduli set to 5 or more to increase the dynamic range. Further efforts could be made to completely reduce the final modulo and this could improve the operations. Exploring other implementation methods could also be looked at.

11. REFERENCES