## Emerging Aspects of Asynchronous Technology used for Design of Energy Efficient Computing Platforms

Hassan Arif
Department of Software Engineering, FUIEMS, Rawalpindi, Pakistan

M. Aqeel Iqbal
Department of Software Engineering, FUIEMS, Rawalpindi, Pakistan

Muhammad Ali
Department of Telecom Engineering, FUIEMS, Rawalpindi, Pakistan

Fahad Bajwa
Department of Software Engineering, FUIEMS, Rawalpindi, Pakistan

### ABSTRACT

The growing power consumption of computing machines is becoming a worrying factor for emerging embedded systems. Asynchronous computer architectures, built around the concept of clockless systems, reduce the power consumption of such systems. Asynchronous computing systems have demonstrated computational performance sufficient to meet the needs of computation-intensive applications. The design and fabrication of asynchronous processing systems has advanced considerably, and the area is under active research on asynchronous design, programming, protocol implementation, the development of computational algorithms, and the fabrication of such architectures at the silicon wafer level. This paper presents the critical aspects of high design complexity, scalable computational performance, and the energy reduction obtained by using modern asynchronous computing architectures.

### **1. INTRODUCTION**

Muller and Huffman began work on asynchronous circuits independently several decades ago. Huffman's design approach became known as the fundamental mode of circuits, while Muller developed the theory of speed-independent circuits. Unger belonged to the Huffman school, so his textbooks were based on fundamental-mode circuits. In its early stages asynchronous technology was used in many computing systems. The first decade of asynchronous computers consisted of speed-independent circuits, and asynchronous design was also used in early mainframe computers. In the second decade, asynchronous macromodules were developed at Washington University. Asynchronous circuits were also developed according to the "actor model" and "process calculi" models.

Nanoscale VLSI technology will not be able to operate under a global clock, and the era of asynchronous technologies has arrived. The two major concerns in the modern era are power consumption and complexity: integrating millions of transistors in VLSI technology and synchronizing them with a global clock is becoming more and more challenging [11].



Figure-1 Synchronous vs. Asynchronous Sequential Circuits

This problem can be solved by asynchronous computer architecture with its clockless technology. There is no global clock; timing and synchronization between devices are handled through handshaking protocols. Handshaking protocols define the rules for communication between devices when communication is required, so no device uses power while it is idle. In clocked designs, control delays in clocks and signals consume energy and time. In an asynchronous design a bit is read only when its input is asserted (a binary 1 arrives), which increases efficiency and reduces energy consumption.

Problems such as clock skew and synchronization failure make processing difficult and produce errors. Such factors can be dealt with by using asynchronous computer architecture. Clockless technology removes the global clock, so the processor's energy consumption is kept to a minimum. In a synchronous design the clock is switched over the entire chip, so circuits consume energy even when idle, whereas in an asynchronous design handshaking is local and is performed only when needed.



Figure-2a Synchronous vs. Asynchronous Adders

Local handshaking spans only short distances and so saves power [1]. The addition function is shown in Figures 1 and 2 to illustrate the difference in operation between asynchronous and synchronous circuits.



Figure-2b Energy Consumption and Synchronous vs. Asynchronous Circuits

As described for the addition function, a synchronous computer consumes energy even when no instruction is being sent. Asynchronous circuits are self-timed and send a signal only after they have performed a computation.



Figure-3 Sequential Circuit as Self timed Circuits

The asynchronous case is shown in the second block of Figure 4, which illustrates how the clock can be eliminated in asynchronous systems. Another problem with synchronous design concerns clock cycles: if a device sends data every **t** units and the clock is also synchronized to **t** cycles, performance does not deteriorate. If data is sent every 2t or 0.5t, problems begin, because the clock cannot change its synchronization, and performance deteriorates [7].



Figure-4 Clock Elimination in Asynchronous Systems

Partially asynchronous computing (PAC) systems are under development and observation. They consist of:

## 1.1 Globally Asynchronous Locally Synchronous (GALS) systems

A GALS system consists of modules that each run on an independent local clock, while communication between the modules is asynchronous. This system is designed to overcome the disadvantages of both synchronous and asynchronous architectures. The main disadvantage of asynchronous design is hardware effort, since circuit elements such as transistors may have to be doubled, while the disadvantage of synchronous design is that it uses more energy to accomplish a task and so produces more heat. The basic idea of GALS is to partition a system into several independent modules. The GALS idea was first proposed by Chapiro in 1984, who focused mainly on small systems.

## 1.2 Locally Asynchronous Globally Synchronous (LAGS) systems

LAGS systems use a global clock that synchronizes overall communication, while communication inside local modules is carried out through communication protocols. Designs such as LAGS fit best where power efficiency is required. The conversion of synchronous systems to asynchronous ones is in progress, and rather than being done all at once it is being done module by module; algorithmic conversion from synchronous pipelines to asynchronous and mixed systems is under observation and experimentation. The processor designs proposed so far are very simple and have yet to be optimized; these architectures do not yet meet the needs of the established computing industry. The first asynchronous computer was built at Caltech and performed fetch and execute only. The Counterflow Pipeline Processor (CFPP) developed by Sun has two pipelines, for input and output, and is still at an early stage [10].

## 2. FUTURE TECHNOLOGY CONSTRAINTS AND ASYNCHRONOUS DESIGN

Digital systems are typically designed for computation and for the communication of information and data. For these processes the system needs to synchronize the resources of the digital system and then transfer data for computation. [2]

## 2.1 Methods of Communication

Synchronization between devices in synchronous systems is achieved through global clocking, in which operations are carried out in a sequential and predefined order. Another method under consideration allows the resources to communicate only when required. Communication in these systems takes place through handshaking protocols. [9]

An electromagnetic field in vacuum travels at the speed of light, but electrical signals travel 10 to 100 times slower inside chips, depending on the power and area of the circuit. In VLSI technology, where signals are transmitted over system buses at 66 MHz and the chip size is about 30-35 mm, a signal requires approximately 3 ns to travel from one end of the chip to the other. A system working at 2 GHz therefore requires almost 7-8 cycles to propagate a signal, so it is no longer useful to separate logical and physical pipelines as is done today; buses are being transformed into pipelines whose job is to move data among resources. Variation in signal delay adds further problems at the receiving end.
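The arithmetic behind these figures can be checked with a short sketch. The 3 ns traversal time and the 2 GHz clock come from the text; the function name is illustrative only.

```python
def cycles_to_cross_chip(traversal_time_ns: float, clock_ghz: float) -> float:
    """Clock cycles that elapse while a signal crosses the chip."""
    clock_period_ns = 1.0 / clock_ghz  # a 1 GHz clock has a 1 ns period
    return traversal_time_ns / clock_period_ns


# Values quoted in the text: about 3 ns across the chip, 2 GHz clock.
print(cycles_to_cross_chip(traversal_time_ns=3.0, clock_ghz=2.0))
```

With exactly 3 ns the result is 6 cycles, the same order as the 7-8 cycles quoted; a traversal time of 3.5 to 4 ns gives 7 to 8 cycles.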

## 2.2 Clock Distribution and Clock Skew

One factor that limits technological progress is clock distribution. Several cycles are required to propagate one clock edge across the entire chip. If the clock is not distributed properly over the clock drivers and distribution lines, the result may be clock skew. Skew can be balanced only with high-power circuits and powerful drivers, which dissipate power and energy; around 40% of the power in complex VLSI circuits is dissipated this way. Even if one ignores the clock distribution problem, signal propagation delays make it impossible to synchronize a processor under a single clock. Asynchronous designs have been studied for the past 50 years and are acknowledged by experts. Because they eliminate or restrict the global clock, they also eliminate the problems it brings. Instead of relying on a global clock, data moves from one unit to another through local handshaking protocols over asynchronous channels.

## 2.3 Power Consumption

Power consumption is a major concern in the growing industry. RISC designs dissipate heat and then need methods to cool them down. Low-voltage VLSI processors reduce power dissipation, but along with power they also reduce system performance. In synchronous systems, units are charged and discharged by the global clock whether or not they are in use. To handle this problem, design techniques are needed that disconnect unused units from the clock. Applying this to synchronous systems increases design complexity and makes synchronization even more difficult [3]. In asynchronous circuits, components use power only when required, so idle units do not need to be synchronized and energy consumption is reduced.
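The contrast between clock-driven and event-driven activity can be illustrated with a toy model. This is a sketch under strong assumptions: it counts unit activations rather than energy, it ignores the cost of handshaking itself, and the workload pattern is hypothetical.

```python
def clocked_activations(busy_pattern):
    """Synchronous unit: the global clock drives it in every slot, busy or not."""
    return len(busy_pattern)


def event_driven_activations(busy_pattern):
    """Asynchronous unit: it is activated only in slots where work arrives."""
    return sum(1 for busy in busy_pattern if busy)


# Hypothetical workload: the unit has work in 3 of 10 time slots.
workload = [True, False, False, True, False, False, False, True, False, False]
print(clocked_activations(workload))       # 10
print(event_driven_activations(workload))  # 3
```

The gap between the two counts grows with the fraction of idle slots, which is the situation the text describes for idle units.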

## 2.4 Key Benefits of Asynchronous Systems

The areas in which asynchronous technology is better than synchronous technology are clock distribution, power consumption, performance, and technology.

## 2.5 Timing Models for Asynchronous Systems

Timing models describe how the timing of units can be arranged for communication. The following timing models have been proposed:

- 1) Uni synchronous model
- 2) Asynchronous Model
- 3) Multi Synchronous
- 4) Multi Clocked
- 5) Mixed Time System

The two major disciplines of timing models are

#### Delay Insensitive systems

These systems guarantee correct functionality even if delays occur in the circuits. A speed-independent system, which ignores delays, falls into this class.

#### **Bounded Delay Systems**

These systems rely on delays being bounded within some interval of time. They assume that processing and communication times stay within a known limit.

## 3. ASYNCHRONOUS CIRCUIT DESIGN

In an asynchronous circuit the clock signals are replaced by signals that indicate the completion of an operation; in this section such signals are called 'completed' signals. The data and completed signals connect each register to the next one. The combinational circuit takes the data stored in the registers as input, performs its computation, and produces an output. In an asynchronous circuit data flows in a static manner, so for correct operation data must not be lost: no data item may overtake another, and no new data may appear from nowhere. A rule that fulfills these requirements is as follows:

A register may input and store new data from its predecessor only if its successor has input and stored the data that the register was previously holding. On the basis of this rule, data is copied from one register to the next along the path through the circuit. Subsequent registers hold copies of the same data, but the old duplicates are overwritten by new data, and the process continues [12]. Completed signals run between registers and control the flow of data, whereas combinational circuits are totally transparent to these signals. This transparency is not always trivial to achieve; it takes more than a normal combinational circuit, so we use the term 'functional block' to denote a combinational circuit whose inputs and outputs are completed signals or links. [4]
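The register rule can be sketched as a small simulation. This is a minimal model, not a circuit description: it assumes a register whose data has already been stored by its successor is marked `None` (free to accept new data) instead of holding a stale duplicate, and the names `step` and `run` are illustrative.

```python
def step(registers, source):
    """Advance the pipeline by one round of local handshakes.

    registers: list in which None marks a register whose data the successor
               has already stored, so it may accept new data.
    source:    list of pending input items, consumed from the front.
    Returns the item leaving the last register, or None.
    """
    output = registers[-1]  # the environment stores the last item
    registers[-1] = None
    # Walk from the output side so each item moves at most one stage.
    for i in range(len(registers) - 1, 0, -1):
        if registers[i] is None and registers[i - 1] is not None:
            registers[i] = registers[i - 1]  # successor is free: copy forward
            registers[i - 1] = None          # predecessor may now be overwritten
    if registers[0] is None and source:
        registers[0] = source.pop(0)
    return output


def run(items, stages=3):
    """Push all items through the pipeline and collect them at the output."""
    registers = [None] * stages
    source = list(items)
    received = []
    while source or any(r is not None for r in registers):
        out = step(registers, source)
        if out is not None:
            received.append(out)
    return received


print(run(["a", "b", "c"]))  # ['a', 'b', 'c']
```

Because a register accepts data only when its successor is free, the output order matches the input order and no item is lost or duplicated, which is the property the rule is meant to guarantee.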

#### 4. SIGNALLING PROTOCOL

Most asynchronous communication is based on signaling protocols that define a "handshake" procedure between two computation blocks. A signaling protocol defines the events that take place during the communication phases between two elements of an asynchronous system. Most asynchronous systems use a "request" signal to initiate a transaction and an "acknowledgement" signal to complete it. [5]

These handshaking protocols are strictly independent of global system time. Communication between a sender and a receiver uses one of two types of signaling:

- 1) Two phase signaling protocol
- 2) Four phase signaling protocol

#### 4.1 Handshaking Protocols for Asynchronous Computers

An asynchronous computer does not operate on a global clock, so communication between its modules takes place through handshaking protocols.

No clock signal is involved, so data is sent directly, without waiting for a clock edge. Module 1 sends a request to module 2 to send data; if module 2 is ready, it sends an acknowledgement back to module 1; module 1 then sends the data to module 2, and module 2 sends an acknowledgement that it has received the data. This process comes in two variants:

- 1) Four Phase Handshake
- 2) Two Phase Handshake

#### Four Phase Handshaking Protocol

It is called four-phase because it takes four transitions to complete an event. Only one type of transition (usually rising) is used to signal events; the other (falling) is used only when the transaction is about to complete, to return the wires to their initial state. Proponents of the protocol argue that four-cycle circuits are smaller than two-cycle circuits. **Four-phase** signaling is also called RZ (return to zero), because in this protocol the actions of the receiver and the sender end only when both the request and the acknowledge signals have returned to zero. [6]

This handshaking protocol transfers data in four steps, described below:

- 1) The sender issues the data request and drives the request waveform high.
- 2) The receiver takes the data from the sender and drives the acknowledgement waveform high.
- 3) The sender, once the data has been received, drives the request waveform low.
- 4) The receiver then drives the acknowledgement waveform low.
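The four steps above can be sketched as a wire-level trace. This is a simplified model, assuming a single request wire and a single acknowledge wire, and assuming that step 4 lowers the acknowledge wire (the usual return-to-zero reading); the function name is illustrative.

```python
def four_phase_transfer(data):
    """Return the received value and the wire trace of one four-phase transfer."""
    req, ack = 0, 0
    trace = []

    req = 1                      # 1) sender places data and raises request
    trace.append(("req", req))
    received = data              # 2) receiver takes the data ...
    ack = 1                      #    ... and raises acknowledge
    trace.append(("ack", ack))
    req = 0                      # 3) sender lowers request
    trace.append(("req", req))
    ack = 0                      # 4) receiver lowers acknowledge
    trace.append(("ack", ack))

    assert (req, ack) == (0, 0)  # both wires are back in their initial state
    return received, trace


value, trace = four_phase_transfer(0xA5)
print(value, trace)
```

The trace contains four transitions per transfer, and both wires end at zero, which is why the scheme is called return to zero.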

Once these four steps are complete, the modules are ready to send the next data item. The procedure is illustrated in the following figure.





#### Two Phase Signalling:

Two-phase signaling is also called non-return-to-zero (NRZ) signaling, or transition signaling. It recognizes and responds to transitions of the voltage on a wire; a transition is referred to as an event. In this type of protocol rising and falling edges are equivalent and carry the same information, so either a falling or a rising edge indicates a new request. It is argued that two-phase signaling is more efficient with respect to performance and power. The sender initiates the transaction and sends a request event to the receiver by generating a transition on the request wire (first phase); the receiver answers by generating an event on the acknowledge wire (second phase).
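Transition signaling can be sketched as follows. This is a minimal model, assuming one request wire and one acknowledge wire whose levels are toggled on each event; the class and attribute names are illustrative, and a request is treated as pending whenever the two wire levels differ.

```python
class TwoPhaseChannel:
    """Transition signaling: every edge on a wire is an event."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.transitions = 0
        self.data = None

    def send(self, data):
        self.data = data
        self.req ^= 1            # first phase: request event (rising or falling)
        self.transitions += 1

    def receive(self):
        assert self.req != self.ack, "no pending request"
        value = self.data
        self.ack ^= 1            # second phase: acknowledge event
        self.transitions += 1
        return value


channel = TwoPhaseChannel()
channel.send("x")
print(channel.receive(), channel.transitions)  # x 2
```

Each transfer costs two transitions here, compared with four in the four-phase protocol, and the wires are not returned to zero between transfers.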

#### 5. BUNDLED DATA TECHNIQUE

The bundled protocol is a technique that can use either two-cycle or four-cycle signaling. N data bits are passed from the sender to the receiver, and N+2 wires are required. Of the two extra wires, one is for the sender to send the request to the receiver and the other is for the receiver to send the acknowledgement. The propagation time of the data signals must be less than or equal to the propagation time of the control signal. The sender places the data on the data wires and then initiates a communication transaction by issuing a request event to the receiver. Asynchronous circuits are often designed using the bundled data technique because its logic and circuitry are not very complex.
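The timing requirement on bundled data can be expressed as a simple check. This sketch reads the requirement as "the data must be valid when the request arrives", so every data wire delay must not exceed the request delay; the delay values and the function name are hypothetical.

```python
def bundled_data_constraint_holds(data_delays_ns, request_delay_ns):
    """True if every data wire settles no later than the request arrives."""
    return max(data_delays_ns) <= request_delay_ns


# Hypothetical wire delays in nanoseconds for a 3-bit bundle.
print(bundled_data_constraint_holds([0.8, 1.1, 0.9], request_delay_ns=1.2))  # True
print(bundled_data_constraint_holds([0.8, 1.5, 0.9], request_delay_ns=1.2))  # False
```

In the second case the slowest data wire arrives after the request, so the receiver could latch a stale bit.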

#### 5.1 Dual Rail

Other than the bundled data technique, there is another data technique known as dual-rail encoding. Signals are sent along the same wire paths, and each data bit is encoded onto a pair of wires together with the request.

Dual-rail encoding uses four code words:

- 1. input 00 -> no valid data
- 2. input 10 -> valid zero
- 3. input 01 -> valid one
- 4. input 11 -> not valid
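The code words listed above can be sketched as an encoder and decoder. The pair order (first wire, second wire) follows the list, so `10` is a valid zero and `01` a valid one. Treating both `00` and `11` as "no valid bit" is a simplification made here; in common dual-rail usage `00` is the idle state and `11` is illegal. All function names are illustrative.

```python
def encode_bit(bit):
    """Map a data bit to its two-wire code word."""
    return (1, 0) if bit == 0 else (0, 1)


def decode_pair(d0, d1):
    """Return 0 or 1 for a valid code word, None for 00 or 11."""
    if (d0, d1) == (1, 0):
        return 0
    if (d0, d1) == (0, 1):
        return 1
    return None


def encode_word(bits):
    return [encode_bit(b) for b in bits]


def word_is_complete(pairs):
    """A word has arrived when every bit position holds a valid code word."""
    return all(decode_pair(d0, d1) is not None for d0, d1 in pairs)


pairs = encode_word([0, 1, 1])
print(pairs)                               # [(1, 0), (0, 1), (0, 1)]
print([decode_pair(*p) for p in pairs])    # [0, 1, 1]
print(word_is_complete([(1, 0), (0, 0)]))  # False
```

Because validity is carried by the data wires themselves, the receiver can detect the arrival of a complete word without a separate request wire.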



Figure-6 Handshaking Patterns for Asynchronous Systems

For N bits of data there are 3N wires linking sender and receiver: N wires carry valid data and 2N wires carry requests and acknowledgements. This protocol can be improved by combining the acknowledgements and requests onto a single wire, which reduces the wire count to 2N+1, where 2N wires carry the data and one wire carries the acknowledgements. In the four-cycle dual-rail protocol each bit pair must first take a valid 0 or valid 1 state; after the acknowledgement has been received the pair must return to the idle state, and the acknowledgement wire must be reset before the next valid 0 or 1 is sent.

#### 5.2 Two Cycle Dual Rail

In the two-cycle dual-rail protocol, a zero is signalled by a single transition on the left bit of the pair and a valid one by a transition on the right bit. If the left and right bits are the same, the state is illegal. Each bit sent should be followed by its opposite transition, as shown below.

| Data | Invalid | 0 | 1 | No Data |
|------|---------|---|---|---------|
| D0   | 0       | 1 | 0 | 1       |
| D1   | 0       | 0 | 1 | 1       |

#### 5.3 Completion Signals:

A difficulty that asynchronous circuits face is that a completion signal is a necessity: it has to be generated so that it can control the acknowledgement signals. Many methods have been used, but none has been fully satisfactory.

One of the methods is described in the figure, where the circuits are designed as synchronous circuits. When a request arrives, the internal clock of the unit or module starts, and after a number of intervals the circuit has performed its operation and the acknowledgement is sent.



Figure-7 Signalling for Asynchronous Systems

#### 5.4 Data Passing Techniques

The request and acknowledge signals are used to regulate the flow of information between two communicating elements in an asynchronous system. This information is a set of bits, each bit being either "1" (high) or "0" (low). A variety of techniques have been developed to encode the value of each bit transmitted during a communication transaction.

#### The Four-Wire Technique

The four-wire technique uses two pairs of request-acknowledge signals, one for each value of the transmitted bit. An event on one request signal denotes a "1", while an event on the other indicates a "0". The two request signals are mutually exclusive: the value of a bit cannot be both "0" and "1" at the same time. Furthermore, in every transaction there is always an event on one of the two request wires for each bit; thus, the entire data word has reached the receiver when an event has been detected for each bit of the word. A new transaction will not commence until the sender has detected an event on an acknowledge signal for each bit.

#### The Three-Wire Technique

The three-wire technique is similar to the four-wire technique described above, but uses one acknowledge wire, instead of two, per pair of request wires. This scheme has the advantage of using fewer wires per bit of information.

#### The Two-Plus-Wire Technique

Another variation of the two aforementioned techniques is the two-plus-wire scheme, in which two request wires per bit are used, one for encoding each of the two values, but the acknowledgements for all bits are combined into a single event transmitted over a single wire. Thus, for an n-bit data word, 2n+1 wires are required. The two-plus-wire scheme, like both its aforementioned variants, may use transition or four-phase signaling for the communication of the request and acknowledge events; if four-phase signaling is employed, the protocol is referred to as dual-rail encoding.

#### The Bundled Data Technique

The bundled data technique employs a single pair of request and acknowledge signals for the entire data word. This scheme is illustrated in figure 4.5. As with the three techniques mentioned above, transition or four-phase signaling may be used. Figure 4.6 illustrates the two-phase bundled data protocol. The sender places the data to be transmitted on the data wires (grey area in figure 4.6) and then initiates a communication transaction by issuing a request event to the receiver. When the request is detected, the receiver starts the process of receiving the data. After completing this process the receiver sends an acknowledgement to the sender, and the sender clears the data and prepares itself for the next transaction.

Data should be kept stable until the sender has received the acknowledgement from the receiver. Unlike the three delay-insensitive data-passing techniques described above, the bundled data technique is based on the bounded delay model; all transitions on the data wires must be observed at the receiver before the request event, i.e. the delay on the data wires must be less than the delay on the request signal. This requirement is known as the bundled data delay constraint. [8]
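The wire counts implied by the four data-passing techniques can be compared with a short sketch. The per-technique formulas follow the descriptions in this section (two request-acknowledge pairs per bit, two requests plus one acknowledge per bit, 2n+1, and one request-acknowledge pair for the whole word); the function name is illustrative.

```python
def wires_required(n_bits):
    """Number of wires each data-passing technique needs for an n-bit word."""
    return {
        "four-wire": 4 * n_bits,          # two request-acknowledge pairs per bit
        "three-wire": 3 * n_bits,         # two requests + one acknowledge per bit
        "two-plus-wire": 2 * n_bits + 1,  # two requests per bit + one shared acknowledge
        "bundled data": n_bits + 2,       # one data wire per bit + request + acknowledge
    }


for technique, wires in wires_required(8).items():
    print(f"{technique}: {wires}")
```

For an 8-bit word this gives 32, 24, 17 and 10 wires respectively, which shows why bundled data is attractive when the bounded delay assumption is acceptable.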

### 6. CONCLUSION

Embedded systems are characterized by different parameters, including performance, throughput, design complexity, cost, reliability, and power consumption. Optimal power consumption using asynchronous technology has been the subject of focused research for the last few decades. The design of high-performance, scalable, and cost-effective asynchronous systems has become a real challenge for scientists. There are many dimensions of asynchronous technology that should be explored to meet future requirements of low energy consumption with sustainable high computational power. If a certain level of the above-mentioned system parameters becomes achievable, it will be possible to launch commercial products based on this technology.

#### 7. REFERENCES

- [1] Stefan Hirschmann, Asynchronous processors, Seminar Embedded System Design, WS 2007/08, Institute of Computer Science, University of Innsbruck, February 25, 2008.
- [2] Arjan Bink and Richard York. ARM996HS: The First Licensable, Clockless 32-Bit Processor Core. IEEE Micro, 27(2):58-68, 2007.
- [3] Alain J. Martin, Mika Nyström, and Paul I. Pénzes. Et2: A Metric for Time and Energy Efficiency of Computation. *Power-Aware Computing*, R. Melhem and R. Graybill, eds. Kluwer, 2002.
- [4] A. J. Martin, M. Nyström. Asynchronous Techniques for System-on-Chip Design. *Proc. of the IEEE*, Special Issue on Systems-on-Chip, 94, 6, 1089-1120. 2006.
- [5] Sean Keller, Michael Katelman, Alain J. Martin. A Necessary and Sufficient Timing Assumption for Speed-Independent Circuits. Proc. 15<sup>th</sup> IEEE Int. Symp. on Asynchronous Circuits & Systems, 2009.
- [6] VV, Asynchronous Pulse Logic. Boston, MA: Kluwer, 2001.
- [7] S.-Y. Tan and W.-T. Huang, The Design of sharing resources for asynchronous systems, *Proceedings of the 12th WSEAS International Conference on MATHEMATICAL METHODS AND COMPUTATIONAL TECHNIQUES IN ELECTRICAL ENGINEERING (MMACTEE '10)*, Timisoara, Romania, October 21-23, 2010, pp. 171-176.
- [8] C. G. Wong and A. J. Martin, "High-level synthesis of asynchronous systems by data-driven decomposition," in Proc. ACM/IEEE Design Automation Conf., 2003, pp. 508–513.
- [9] F. K. Gürkaynak et al., "GALS at ETH Zurich: Success or failure?" in Proc. Int. Symp. Advanced Research in Asynchronous Circuits and Systems, 2006, pp. 150–159.
- [10] J. Carlsson, K. Palmkvist, and L. Wanhammar, "Synchronous Design Flow for Globally Asynchronous Locally Synchronous Systems", *Proceedings of the 10th WSEAS International Conference on CIRCUITS*, Vouliagmeni, Athens, Greece, July 10-12, 2006, pp. 64-69.
- [11] M. Nyström and A. J. Martin, "Crossing the synchronous-asynchronous divide," presented at the Workshop Complexity Effective Design, Anchorage, AK, 2002.
- [12] A. J. Martin and P. Prakash, "Asynchronous nanoelectronics: Preliminary investigation," in Asynchronous Circuits and Systems, 2008. ASYNC '08. 14th IEEE International Symposium on, 2008.