DOI: 10.1109/TCSVT.2009.2022779

# Low Power H.264 Video Compression Architectures for Mobile Communication

Asral Bahari, Tughrul Arslan and Ahmet T. Erdogan

Abstract—This paper presents a method to reduce the computation and memory access for variable-block-size motion estimation using pixel truncation. Previous work has focused on implementing pixel truncation using a fixed-block-size (16x16 pixels) motion estimation. However, pixel truncation fails to give satisfactory results for smaller block partitions. In this paper, we analyse the effect of truncating pixels for smaller block partitions and propose a method to improve the frame prediction. Our method is able to reduce the total computation and memory access compared to conventional full search method without significantly degrading picture quality. With unique data arrangement, the proposed architectures are able to save up to 53% energy compared to the conventional full search architecture. This makes such architectures attractive for H.264 application in future mobile devices.

Index Terms—Motion estimation (ME), video coding, VLSI architecture, low-power design.

### I. INTRODUCTION

Video compression plays an important role in today's wireless communications. It allows raw video data to be compressed before it is sent through a wireless channel. However, video compression is computation intensive and dissipates a significant amount of power. This is a major limitation in today's portable devices. Existing multimedia devices can only play video applications for a few hours before the battery is depleted.

The latest video compression standard, MPEG-4 AVC/H.264 [1] gives 50% improvement in compression efficiency compared to previous standard. However, the coding gain comes at the expense of increased computational complexity at the encoder. Motion estimation (ME) has been identified as the main source of power consumption in video encoders. It consumes 50% to 90% of the total power used in video compression [2]. The introduction of variable block size partitions and multiple reference frames in the standard result in increased of computational load and memory bandwidth during motion prediction.

Block-based motion estimation has been widely adopted by the industry due to its simplicity and ease of implementation. Each frame is partitioned into 16x16 pixels, known as macroblocks (MB). Full-search motion estimation predicts the current macroblock by finding the candidate that gives the minimum sum of absolute difference (SAD), as follows:

$$SAD(i,j) = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} |C(k,l) - R(i+k,j+l)|$$
(1)

where C(k, l) is the current macroblock, R(i + k, j + l) is the candidate macroblock located in the search window within the previously encoded frame. From Equation 1, the power consumption in motion estimation is affected by the number of candidates and the total computation to calculate the matching cost. Thus, the power can be reduced by minimising these parameters.

Furthermore, to maximise the available battery energy, the computational power should be adapted to the supply power, picture characteristics and available bandwidth. Because these parameters change over time, the ME computation should be adaptable to different scenarios without degrading the picture quality.

Pixel truncation can be used to reduce the computational load by allowing us to disable the hardware that processes the truncated bits. While previous studies focused on fixed-block-size motion estimation (16x16 pixels), very little work has been done to study the effect of pixel truncation for smaller block sizes. The latest MPEG-4 standard, MPEG-4 AVC/H.264, allows variable block size for motion estimation (VBSME). [1] defines 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 block sizes. At smaller block partitions, a better prediction is achieved for objects with complex motion.

Truncating pixels at a 16x16 block size results in acceptable performance as shown in the literature [3]. However, at smaller block sizes, the number of pixels involved during motion prediction is reduced. Due to the truncation error, there is a tendency for smaller blocks to yield matched candidates which could lead to the wrong motion vector. Thus, truncating pixels using smaller blocks results in poor prediction.

In [4] and [5], we have proposed a low-power algorithm and architecture for motion estimation using pixel truncation for smaller block sizes. The search is performed in two steps: (1) truncation mode; and (2) refinement mode. This method reduces the computational cost and memory access without significantly degrading the prediction accuracy.

In this paper, we perform an in depth analysis of this technique and extend the technique to a complete H.264 system. The rest of this paper is organised as follows. The existing techniques of low resolution ME are reviewed in Section II. Section III investigates the effect of pixel truncation on

Copyright (c) 2009 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Manuscript received July 14, 2008; revised November 14, 2008.

Asral Bahari (email: asral@unimap.edu.my) is with University Malaysia Perlis, Malaysia.

Tughrul Arslan (tughrul.arslan@ed.ac.uk) and Ahmet T. Erdogan (ahmet.erdogan@ed.ac.uk) are with the School of Engineering and Electronics, University of Edinburgh, Edinburgh EH9 3JL U.K.

variable block size motion estimation. Section IV outlines the proposed two step search for VBSME. Section V analyses the proposed architecture. Our experimental results are discussed in Section VI. Finally, Section VII concludes this paper.

## II. LOW RESOLUTION ME

In low resolution ME, the bit size and computational cost are normally tackled simultaneously. [6] introduced a onebit transform (1BT) to reduce the computational cost. In this method, the original image is filtered using a band-pass filter. The output image is represented by one bit and the ME is carried out at this frame plane. To improve the frame prediction, [7] proposes a two-bit transform (2BT) where the original image is converted into two bits using the threshold value derived from the local image standard deviation. More than 0.2 dB improvement is achieved with this method as compared to 1BT. [8] proposes a low resolution quantised ME (LRQME) where the pixel is transformed to two bits using an adaptive quantiser. To produce the two bit image, three quantisation thresholds are calculated according to the current MB pixel mean.

In motion estimation, pixel truncation has been used to reduce the computational load for the matching calculation unit [9]. [3] proposes adaptive pixel truncation during ME. The pixel's least significant bits (LSB) are adaptively truncated depending on the quantisation parameter (QP). This direct quantise provides a trade-off between PSNR and power. Truncating the pixel's most significant bits (MSB) was discussed in [10].

In low resolution ME, most methods focus on reducing the power by minimising the computational load. However, memory access is not taken into account. This is because the original pixel needs to be accessed before it is transformed into low resolution. In some cases, where the PSNR is dropped due to truncation error, a second search is done at full resolution. This increases the memory access and thus increases the power consumption. On the other hand, while direct pixel truncation has the potential to reduce memory access, it is often at the expense of a decrease in PSNR in some motion types.

#### III. THE EFFECT OF PIXEL TRUNCATION FOR VBSME

For video applications, data is highly correlated, and the switching activity is distributed non-uniformly [10]. Since the less significant bits (LSB) of a data word experience a higher switching activity, significant power reduction can be achieved by truncating these bits. In general, about 50% switching activity reduction is obtained if we truncate up to 3 LSB. Further reduction can be achieved if the number of truncated bits (NTB) is increased. For example, if NTB is set to 6, the switching activity could be reduced by 80-90%. This makes pixel truncation attractive to minimise power in motion estimation.

Table I shows the Cumulative Distribution Function (CDF) for SAD that is obtained during motion estimation using five Foreman sequences. The SAD is grouped into 5 categories: 0% represents the percentage for SAD = 0, 5% represents the percentage of  $SAD < 5\% SAD_{max}$  and so on. For 16x16

Table I CDF of calculated SAD during motion estimation using Foreman sequence (search range,  $p = \pm 8$ ).

| Block size | NTB | 0    | < 5% | < 10% | < 20% | < 40% |
|------------|-----|------|------|-------|-------|-------|
| 16x16      | 0   | 0%   | 25%  | 60%   | 94%   | 100%  |
| 10210      | 4   | 0.2% | 25%  | 60%   | 94%   | 100%  |
| 8x8        | 0   | 0%   | 35%  | 58%   | 87%   | 98%   |
| 070        | 4   | 5%   | 35%  | 58%   | 87%   | 98%   |
| 4x4        | 0   | 0%   | 58%  | 58%   | 81%   | 99%   |
| +14        | 4   | 12%  | 58%  | 58%   | 81%   | 99%   |

Table II Average full search PSNR for various NTB using SAD as matching criteria (search range,  $p = \pm 8$ )

| NTB |       | Block size |       |       |       |       |  |
|-----|-------|------------|-------|-------|-------|-------|--|
| NID | 16x16 | diff       | 8x8   | diff  | 4x4   | diff  |  |
| 0   | 33.11 | -          | 34.89 | -     | 36.82 | -     |  |
| 2   | 33.12 | 0.01       | 34.85 | -0.03 | 36.75 | -0.07 |  |
| 4   | 33.03 | -0.08      | 34.35 | -0.54 | 34.66 | -2.16 |  |
| 6   | 31.79 | -1.33      | 30.29 | -4.60 | 27.46 | -9.36 |  |

block size with NTB = 4, the percentage of SAD = 0 is close to the untruncated bit (NTB = 0). This shows that for 16x16 block size, the truncated pixel is more likely to have the same matched candidate as in the untruncated pixel. However, for 4x4 block with NTB = 4, the percentage of SAD = 0 is 12% compared to 0% for NTB = 0. This shows that there are more matched candidates using truncated pixel for 4x4 block size which could lead to incorrect motion vectors.

To illustrate the effect of pixel truncation on variable block size motion estimation, we computed the average peak signalto-noise ratio (PSNR) for 50 predicted frames of Foreman sequence (QCIF@30fps) as shown in Table II. The frames are predicted using full search algorithm at different block sizes and NTB. From Table II, for full pixel resolution (NTB = 0), the prediction accuracy improves as the block size decreases. This is reflected by a higher PSNR for predictions using a 4x4 block compared to a 16x16 block.

For NTB = 4, a small PSNR drop is observed for a block size of 16x16 (0.08 dB) compared to untruncated pixels. The PSNR drop for predictions using smaller block sizes is higher with 0.54 dB and 2.16 dB drops for frames with block sizes 8x8 and 4x4, respectively.

As we increase the NTB = 6, the PSNR drop for the smaller blocks increases rapidly. The PSNR drop for the 16x16 block size is 1.33 dB. However, for 8x8 and 4x4 block sizes, the PSNR drop increases to 4.6 dB and 9.36 dB, respectively. This shows that pixel truncation is not suitable for smaller block sizes. In the H.264 standard, substantial improvement in motion prediction is gained by using smaller blocks. Therefore, it is important to improve the PSNR gain especially for smaller block partitions.

## **IV. TWO-STEP ALGORITHM**

In this paper, we propose a method of pixel truncation for variable-block-size motion estimation. This method is based // T=8'b1100\_0000 // Truncating the search window pixel, Y.  $Y_t = BITAND(Y,T)$ // Truncating the current MB pixel, X.  $X_t = BITAND(X,T)$ // Initialise mv and min\_cost  $mv_x = 0, mv_y = 0, cost_{min} = cost_{max}$ // Scanning the search windows and find the best match using block size N =8For  $i_1 = -p_1, p_1$ For  $j_1 = -p_1, p_1$   $cost = \sum_{m=0_1}^{N-1} \sum_{n=0}^{N-1} [MATCH_1(X_t(m, n), Y_t(i_1 + m, j_1 + n))]$  $cost_{min} = cost, mv_x = i_1, mv_y = j_1$ End of j End of i // Refining the search result using full pixel for variable block size  $cost_{min} = cost_{max}$ For  $i_2 = -p_2, p_2$ For  $j_2 = -p_2, p_2$   $cost = \sum_{m=0_1}^{N-1} \sum_{n=0}^{N-1} [MATCH_2(X_t(m,n), Y_t(i_2 + m, j_2 + n))]$ If  $(cost < cost_{min})$  $cost_{min} = cost, mv_x = i_2, mv_y = j_2$ End of j End of i

Figure 1. Pixel truncation algorithm using two step approach

on the following observations:

- 1) Truncating pixels for larger block sizes can result in better motion prediction compared to smaller block sizes.
- 2) At higher pixel resolutions, smaller block sizes can result in better prediction compared to the larger block sizes.

To avoid having large motion vector errors with smaller blocks, we have implemented motion prediction in two steps. In the first search, the prediction is performed using pixels with NTB = 6 at 8x8 block size. Then, the result of the first search is refined using full pixel resolution (8 bit) in a smaller search area. The algorithm is summarised in Fig. 1.

Fig. 2 shows the simulation results using truncated pixels with several matching criteria. Two error-based matching criteria and two boolean-based matching criteria are compared against SAD, namely MinMax [11], mean removed MAD (MRMAD) [12], binary XOR (BXOR) [13] and difference pixel count (DPC) [8], respectively. From the figure, at high NTB, error base matching criteria gives a poor result compared to the boolean-based matching criteria. The combination of NTB = 6 and DPC gives a good trade-off between PSNR and the computational load.

At highly truncated bits, 16x16 block size is more reliable since it has more data compared to the smaller block size. However, for complex motion, the motion vector for a smaller block size, especially a 4x4 block, is not necessarily close to that of a 16x16 block. Since the block with smaller size difference tends to move in a similar direction, the 8x8 block is used in the first search. This allows us to get better predictions for either the smaller block (8x4, 4x8 and 4x4) or the larger block (16x8, 8x16, 16x16) from the 8x8 motion vector.

In the second step, we perform full-pixel resolution to refine the result obtained from the first search. To ensure that the overall computation cost does not exceed the conventional full search computation, the second search is done at a quarter of the size of the first search area. Increasing the refinement area



Figure 2. Average PSNR vs. NTB for different block matching criteria using 8x8 block size (50 Foreman frame sequences, QCIF@30fps, search range, p=[-8,7]).

Table III Memory bandwidth and computational cost comparison ( pREPRESENTS THE SEARCH RANGE, q AND r ARE THE MEMORY ACCESS AND COMPUTATIONAL LOAD PERCANDIDATE, RESPECTIVELY)

| Memory bandwidth |                               |                                         |  |  |  |
|------------------|-------------------------------|-----------------------------------------|--|--|--|
|                  | Conv. Full Search             | Proposed 2-Step                         |  |  |  |
| 1st search       | $(2p)^2 \times q \times 8bit$ | $(2p)^2 \times q \times 2bit$           |  |  |  |
| 2nd search       | -                             | $(2\frac{p}{2})^2 \times q \times 8bit$ |  |  |  |
| Total            | $32p^2q$                      | $16p^2q$                                |  |  |  |

Computational cost

|            | Full Search                | Proposed 2-Step                      |
|------------|----------------------------|--------------------------------------|
| 1st search | $(2p)^2 \times r \times 1$ | $(2p)^2 \times r \times 0.25$        |
| 2nd search | -                          | $(2\frac{p}{2})^2 \times r \times 1$ |
| Total      | $4p^2r$                    | $2p^2r$                              |

will not only increase the total computation, but also increase the memory access, as shown in Table III.

Fig. 3 illustrates the method used to determine the second search area where blocks A, B, C and D are the 8x8 partitions for a macroblock. After the first search, each partition A, B, C and D has its own motion vector,  $mv_A$ ,  $mv_B$ ,  $mv_C$  and  $mv_D$ respectively. Let mvx and mvy represent their horizontal and vertical motion vector components respectively. Thus,  $mvx_A$ and  $mvy_A$  represent the horizontal and vertical components, respectively, of the motion vector for block A;  $mvx_B$  and  $mvy_B$  represent the horizontal and vertical components, respectively, of the motion vector for block B, and so on. The minimum and maximum motion vector of each component is represented by:

$$mvx_{min} = min \{mvx_A, mvx_B, mvx_c, mvx_D\}$$
$$mvx_{max} = max \{mvx_A, mvx_B, mvx_c, mvx_D\}$$
$$mvy_{min} = min \{mvy_A, mvy_B, mvy_c, mvy_D\}$$
$$mvy_{max} = max \{mvy_A, mvy_B, mvy_c, mvy_D\}$$

The second search centre is defined as:

 $\left(\frac{mvx_{min+}mvx_{max}}{2}, \frac{mvy_{min+}mvy_{max}}{2}\right)$ 

with search range,  $p_2 = \frac{1}{2}p_1$ .

γ



Figure 3. Defining the second search area for the proposed two-step algorithm.

## V. HARDWARE IMPLEMENTATION

This section discusses the proposed architectures to implement the two-step algorithm. First, the conventional ME architecture that is used in our analysis is reviewed. Next, we discuss the architectures needed to support the two-step method as proposed in Section IV. The area and power overhead for the computation and memory unit are also investigated. Based on these analyses, we propose three low power ME architectures with different area and power efficiencies.

In this work, we implement the ME architecture based on 2D ME as discussed in [2]. We choose 2D ME because it can cope with the high computational needs of the real-time requirement of H.264 using a lower clock frequency than 1D architecture.

### A. Computation Unit

Fig. 4 shows the functional units in the conventional 2D ME (me\_sad) [2]. The ME consists of search area (SA) memory, a processing array which contains 256 processing elements (PEs), an adder tree, a comparator and a decision unit. The search area memory consists of sixteen memory banks where each bank stores 8-bit pixels in a  $H \times \frac{W}{N}$  total word, where H and W are the search area window's height and width respectively, and N is the macroblock's (MB) width. During motion prediction, 16 pixels are read from the 16 memory banks simultaneously. The data in the memory are stored in a ladder-like manner to avoid delay during the scanning [14].

At each initial search, the current and the first candidate MB are loaded into the processing array's registers. Then, it calculates the matching cost for one candidate per clock cycle. The 256 absolute-different from the PEs are summed by the adder tree, and outputs the sum of absolute-different (SAD) for 41 block partitions. The adder tree reuses the SAD for 4x4 blocks to calculate a larger block partition. In total, the adder tree calculates 41 partitions per clock cycle.

Throughout the scanning process, the comparator updates the minimum SAD and the respective candidate location for each 41-block partition. Once the scanning is complete, the decision unit outputs the best MB partition and its motion vectors. The ME requires 256 clock cycles to scan all candidates.



Figure 4. Me\_sad block diagram



Figure 5. Processing element (PE) to support (a) SAD and (b) DPC

For me\_sad, the input and output for each of the PE are 8bits wide as shown in Fig 5 (a). The input for the adder tree is 8 bits wide, and the SAD output is 12 to 16 bits wide, depending on the partition size. These data are then input into the comparator, together with the current search location information.

Using similar architecture as in me\_sad, DPC based ME (me\_dpc) requires 2 bits for the current and reference pixel inputs as shown in Fig. 5 (b). Furthermore, the matching cost is calculated using boolean logic (XOR and OR) rather than arithmetic operation as in SAD based PE. These make the overall area for the 256 PEs in me\_dpc much smaller than me\_sad. The reduction in output bitwidth in DPC based PE also reduces the bitwidth required for adder tree and comparator unit. The input and output for the adder tree is 1 bit and 5 to 9 bitwidths, respectively. A similar bitwidth is applied to the comparator's input.

Table IV compares the area  $(mm^2)$ , the total equivalent gates (based on 2-input NAND gate) and power consumption (mW) for me\_sad and me\_dpc computational units. The comparisons are based on synthesis results using  $0.13\mu m$ CMOS UMC technology. The table shows that me\_sad's area is dominated by the 256 PE (73%). Thus, with the significantly smaller area for 256 PE, the me\_dpc will require less area than the me\_sad. The overall me\_dpc requires 42% of the me\_sad

Table IV ME\_SAD AND ME\_DPC AREA ( $mm^2$ ), TOTAL EQUIVALENT GATES (BASED ON 2-INPUT NAND GATE) AND POWER (mW)

| Modules         |      | me_sad |       | me_dpc |        |       |
|-----------------|------|--------|-------|--------|--------|-------|
| wodules         | Area | Gates  | Power | Area   | Gates  | Power |
| 256 PE          | 0.90 | 173611 | 28.67 | 0.32   | 61728  | 2.31  |
| Adder_tree      | 0.13 | 25077  | 5.53  | 0.04   | 7716   | 0.99  |
| Comparator Unit | 0.11 | 21219  | 1.25  | 0.09   | 17361  | 0.86  |
| Decision Unit   | 0.10 | 19290  | 0.54  | 0.07   | 13503  | 0.50  |
| Total           | 1.24 | 239198 | 36.00 | 0.52   | 100309 | 4.66  |



Figure 6. Computational unit: (a) me\_split (b) me\_combine

area.

Based on the above analysis, we propose two types of architectures for the ME computation unit that can perform both low resolution and full resolution searches. These are me\_split and me\_combine as shown in Fig. 6.

Me\_split implements both me\_sad and me\_dpc as two separate modules, as shown in Fig. 6 (a). During low resolution search, me\_sad is switched off while the me\_dpc is used to perform the low resolution search. The second step uses the me\_sad, while the me\_dpc is switched off. This architecture allows only the necessary bit size to be used during different search modes. While potential power savings is possible,



Figure 7. SA memory arrangement (a) mem8 (b) mem28, (c) mem8pre

this architecture requires additional area for the adder tree, comparator and decision unit to support the low resolution search.

Due to the functions of the adder tree, the comparator and the decision units are similar for both me\_sad and me\_dpc, me\_combine shares these units during low resolution search and full pixel resolution (Fig. 6 (b)). This architecture results in a much smaller area compared to me\_split. However, higher power consumption is expected during the low resolution search because the adder tree, comparator and decision unit operate at higher bit size than needed.

## B. Memory Architecture

Conventional ME architecture implements the SA memory using single port static random access memory (SRAM) with a pixel (8 bits) per word. To implement the two-step search, we need to access the first two MSBs for each pixel during the first search and 8 bits in the second stage. Thus, the pixels need to be stored to allow two reading modes. For this, three types of memory architecture are proposed. These are (a) 8bit memory (mem8), (b) 2bit and 8bit memory (mem28), and (c) 8bit memory with prearranged data and transposed register (mem8pre) as shown in Fig 7.

Mem8 stores the data in the same way as in the conventional ME. We access 8-bit data during both low resolution and the refinement stage. However, during the low resolution search, the lower 6 bits are not used by the PE. Because the memory is accessed during both low resolution and the refinement stage, it results in higher memory bandwidth than the conventional ME architecture.

To overcome the problem in mem8, mem28 uses two types of memory: 2-bit and 8-bit. The 2-bit memory stores the first two MSBs of each datum, and the 8-bit memory stores the complete full pixel bitwidth. During the low resolution search, the data from the 2-bit memory are accessed. This allows only the required bits to be accessed without wasting any power during low resolution. In the refinement stage, the 8-bit memory is read into the PEs. Although this architecture can



Figure 8. Storing 8bit pixel in 8bit memory (a) conventional arrangement (b) mem8pre

 Table V

 Memory bandwidth for different architectures

|         | Low Resolution    | High Resolution             |
|---------|-------------------|-----------------------------|
| mem8    | $NWH \times 8bit$ | $N\frac{WH}{4} \times 8bit$ |
| mem28   | $NWH \times 2bit$ | $N\frac{WH}{4} \times 8bit$ |
| mem8pre | $NWH \times 2bit$ | $N\frac{WH}{4} \times 8bit$ |

potentially reduce memory bandwidth and power consumption, it needs an additional area for the 2-bit memory.

In mem8pre, the data is prearranged before storing them in 8-bit memory. Four pixels are grouped together, and then transposed according to their bit position, as shown in Fig. 8. During the low resolution search, we read only the memory locations that store the first two MSBs of the original pixels. Thus, the total memory accessed during the low resolution is  $\frac{1}{4}$  of the conventional full pixel access.

In full resolution search, we read four memory locations that contain the first up to eighth bits in four clock cycles. Delay buffers, as shown in Fig. 7 (c), realigns these words to match the original 8-bit pixel. By prearranging the pixels this way, we can use the same memory size as in the conventional full search while retaining the ability to access the first two MSBs, as well as the full bit resolution. The drawback of this approach is it needs additional circuitry to transpose and realign the pixels during the motion prediction. The estimated bandwidth for the above three memory architectures are shown in Table V.

# C. Overall Architecture

From the above discussion, we propose three different architectures that can perform both low resolution and full resolution searches. By combining different computation and memory units, we propose the following architectures:

- 1) me\_split+mem28 (ms\_m8)
- 2) me\_combine+mem8 (mc\_m8)
- 3) me\_combine+mem8pre (mc\_m8p)

In these architectures, both low resolution and full resolution search can be performed. With proper configuration, the conventional full search algorithm can be used during normal conditions to ensure a high quality picture at the output. In condition where energy consumption is the main concern, the two-step method is used. This allows us to reduce the energy consumption without significantly degrading the output picture quality.

Table VI PSNR DROP AGAINST CONVENTIONAL FULL SEARCH SAD QCIF@ 30FPS, p1=[-8,7]

|         | 16x16 |        |         |        |  |  |
|---------|-------|--------|---------|--------|--|--|
|         | Akiyo | Mobile | Foreman | Stefan |  |  |
| fs_p4   | 0     | 0      | 0.12    | 0.22   |  |  |
| 2step16 | 0     | 0.01   | 0.08    | 0.03   |  |  |
| 2step8  | 0     | 0      | 0.07    | 0.03   |  |  |
|         | 8x8   |        |         |        |  |  |
|         | Akiyo | Mobile | Foreman | Stefan |  |  |
| fs_p4   | 0     | 0.02   | 0.3     | 0.41   |  |  |
| 2step16 | 0     | 0.03   | 0.23    | 0.13   |  |  |
| 2step8  | 0     | 0.02   | 0.19    | 0.11   |  |  |
|         |       | 4      | x4      |        |  |  |
|         | Akiyo | Mobile | Foreman | Stefan |  |  |
| fs_p4   | 0.06  | 0.16   | 0.54    | 0.58   |  |  |
| 2step16 | 0.06  | 0.17   | 0.47    | 0.31   |  |  |
| 2step8  | 0.05  | 0.14   | 0.44    | 0.27   |  |  |

Table VII PSNR DROP AGAINST CONVENTIONAL FULL SEARCH USING SAD FOR  $CIF@30FPS, p_1=[-16, 15]$ 

|         |        | 16x16   |        |  |  |  |
|---------|--------|---------|--------|--|--|--|
|         | Mobile | Foreman | Stefan |  |  |  |
| fs_p8   | 0.04   | 0.14    | 0.29   |  |  |  |
| 2step16 | 0.04   | 0.25    | 0.04   |  |  |  |
| 2step8  | 0.11   | 0.12    | 0.04   |  |  |  |
|         | 8x8    |         |        |  |  |  |
|         | Mobile | Foreman | Stefan |  |  |  |
| fs_p8   | 0.12   | 0.24    | 0.44   |  |  |  |
| 2step16 | 0.12   | 0.39    | 0.21   |  |  |  |
| 2step8  | 0.20   | 0.21    | 0.09   |  |  |  |
|         |        | 4x4     |        |  |  |  |
|         | Mobile | Foreman | Stefan |  |  |  |
| fs_p8   | 0.38   | 0.44    | 0.58   |  |  |  |
| 2step16 | 0.37   | 0.59    | 0.44   |  |  |  |
| 2step8  | 0.40   | 0.39    | 0.32   |  |  |  |

# VI. SIMULATION AND IMPLEMENTATION RESULTS

## A. Peformance of The Proposed Two-step Algorithm

Tables VI and VII show the PSNR difference using the proposed method against the conventional full-search ME (FS). The comparison is done for the frames predicted using 16x16, 8x8 and 4x4 partitions. Other block sizes are not included for simplicity. The difference is calculated based on the average PSNR of 85 frames. Different frame sequences that represent various types of motion from low to high are used in this experiment: Akiyo, Mobile, Foreman and Stefan. Both QCIF and CIF frame resolutions are considered, which represent the typical frame size for mobile devices. The search range,  $p_1$ =[-8,7] and  $p_1$ =[-16,15] is defined for QCIF and CIF, respectively.

2step8 represents the proposed 2-step search using the 8x8 block partition. For comparison, we include the result for

the 2-step search where the first search is done using 16x16 partitions (2step16). The result of the first search is used as the centre for the second search. fs\_p4 and fs\_p8 represent the conventional full-search ME with a search range equivalent to  $\frac{1}{2}p1$  for QCIF and CIF, respectively.

From the table, our method is able to achieve a good prediction with a smaller PSNR drop compared to the other method. For a low-motion sequence such as Akiyo, the PSNR drop for QCIF is below 0.05dB. The PSNR drop increases slightly for a high-motion sequence such as Stefan. This is due to the prediction error and search range limitation during the first and second searches, respectively.

The smaller PSNR drop for 2Step8 compared to 2Step16 shows that the first search using 8x8 partition gives a good approximation compared to 16x16 block size. In the 8x8 partitions, we have more information for the macroblock motion which is important when determining the second search range for the high-motion sequence.

In addition, the PSNR drop varies depending on the level of detail of the frame. For a frame sequence with high detail, such as Mobile, the first search with NTB = 6 contains more information for the macroblock feature. Thus, it can obtain a better match, which is reflected by the lower PSNR drop. For a frame with less object detail, the PSNR drop is slightly higher. The same explanation is applicable for the higher PSNR drop for CIF compared to QCIF frame resolution. This is because the macroblock content for the CIF frame is more sparse compared to the QCIF macroblock.

The reduction in the search range in the refinement stage does affect the prediction for the 4x4 block partition. This is expected, since in 2step8, the full pixel resolution is done at  $\frac{1}{4}$  of the conventional full-search area. Thus, the reduction in computation comes at the expense of decreasing the prediction accuracy. However, compared to the direct reduction of the search range, our method gives a higher PSNR as opposed to the fs\_p4 and fs\_p8 for the QCIF and CIF frames, respectively.

In all frame sequences tested, our method gives good PSNR compared to the conventional full search with PSNR drop < 0.5 dB. Furthermore, our method could yield a better uniform motion vector for the 4x4 partitions which is required to reduce the overall bitrates.

Table VIII shows the average PSNR drop for several existing motion estimation techniques discussed in Section II. In 2BT, frames are converted from 8bit to 2bit using the threshold value derived from the local image standard deviation. The motion estimation is performed based on the transformed two-bit image. LRQME transforms the eight-bit pixel to two bits using an adaptive quantiser. The result of the motion estimation obtained using the low-resolution image is refined at higher pixel resolution using 16x16 blok size. As shown in Table VIII, our method shows superior performance compared to other techniques for 16x16 to 4x4 block sizes.

To evaluate the effectiveness of our method within H.264 software toward generating the final bitrates, we modified the existing H.264 reference software (version JM8.6 for baseline profile) to include the proposed algorithm. We replaced the existing full search motion estimation algorithm with the two-step method. The simulations were done with rate control

Table VIII Average PSNR drop (dB) for several motion estimation technioues.

|                   | FS  | 2BT | LRQME | Proposed |
|-------------------|-----|-----|-------|----------|
| Low resolution    | No  | Yes | Yes   | Yes      |
| High resolution   | Yes | No  | Yes   | Yes      |
| Block Size: 16x16 | 0.0 | 0.7 | 0.8   | 0.1      |
| Block Size: 8x8   | 0.0 | 1.3 | -     | 0.2      |
| Block Size: 4x4   | 0.0 | 3.0 | -     | 0.4      |



Figure 9. PSNR vs Bitrate using the proposed 2-Step method for (a) QCIF@30fps, p=[-8,7] (b) CIF@30fps,p=[-16,15].

off,  $QP = \{20, 25, 30, 35, 40\}$ , 85 frames, QCIF and CIF frame format at 30fps. Both the conventional and the proposed method were compared.

Fig. 9 show the PSNR versus bitrates graphs simulated using the modified H.264 reference software. Two video sequences (Foreman and Stefan), which represent medium and high motion sequence, are shown. From the graphs, it can be seen that the proposed method could achieve good performance, close to the conventional method, without significantly degrading the picture quality. For typical application of QCIF@30fps at 146kb/s and CIF@30fps at 736kb/s, only 0.02dB difference is observed compared to the conventional method.

#### B. Peformance of The Proposed Architectures

This section presents the synthesised results of our analysis. First, we present the results for the computation unit. Next, the area and power consumption for the proposed memory architectures are analysed. Finally, we present the results for

#### Table IX

Computational unit: area  $(mm^2)$ , total equivalent gates (based on 2-input NAND gate) and power (mW) comparison

|                 | me_split |        |          |           | me_combine |        |          |           |
|-----------------|----------|--------|----------|-----------|------------|--------|----------|-----------|
| Modules         | Area     | Gates  | Po       | Power     |            | Gates  | Po       | wer       |
|                 | Alca     | Gales  | Low Res. | High Res. | Area       | Gales  | Low Res. | High Res. |
| 256 PE          | 1.17     | 225694 | 3.62     | 24.44     | 1.06       | 204475 | 3.222    | 24.39     |
| Adder_tree      | 0.18     | 34722  | 1.54     | 6.63      | 0.13       | 25077  | 1.871    | 6.48      |
| Comparator Unit | 0.20     | 38580  | 0.80     | 1.26      | 0.11       | 21219  | 1.034    | 1.25      |
| Decision Unit   | 0.16     | 30864  | 0.50     | 0.54      | 0.10       | 19290  | 0.543    | 0.54      |
| Others          | 0.04     | 7716   | 0.01     | 1.29      | 0.07       | 13503  | 0.17     | 1.36      |
| Total           | 1.75     | 337577 | 6.46     | 34.16     | 1.47       | 283565 | 6.84     | 34.02     |

Table X MEMORY UNIT: AREA  $(mm^2)$  AND POWER (mW) COMPARISON

| Modules | Area | Po       | wer       |
|---------|------|----------|-----------|
| Wouldes | Alea | Low Res. | High Res. |
| mem8    | 0.23 | 4.7      | 4.7       |
| mem28   | 0.39 | 3.3      | 4.7       |
| mem8pre | 0.42 | 2.2      | 7.5       |

the overall ME architectures that can provide efficient area and power consumption.

We have synthesised our design using the UMC  $0.13\mu m$  CMOS library. We have used Verilog-XL for functional simulation, and Power Compiler to perform power analysis at 20MHz. Actual video data is used to verify the hardware and to obtain the estimated power consumption.

Table IX compares the synthesis results for the proposed computation units: me\_split and me\_combine. The power consumption during low resolution and full resolution searches are shown in detail. During the motion prediction, the search range [-8,7] and [-4,3] is used during low resolution and full pixel resolution search, respectively.

From the table, the me\_split consumes 6% less power during the low resolution search than the me\_combine. However, this comes at the cost of an additional 41% of area on top of the existing me\_sad. On the other hand, because me\_combine shares some modules, it requires only 19% additional area compared to me\_sad. This shows that, combining the adder tree, comparator and decision unit as in me\_combine is preferred over me\_split.

Table X shows the area and power comparison for the proposed memory unit. The reported power includes the power consumed by the additional circuit needed by respective memory architectures. The SRAM model is generated using a UMC 0.13 memory compiler with estimated area and power provided by the datasheet.

From Table X, it is clear that the mem8pre provided the lowest bandwidth and power compared to the other memory configurations during the low resolution search. This is expected since only  $\frac{1}{4}$  of the memory is accessed during the low resolution search. However, it requires extra area for the additional circuits needed to arrange the pixels.

On the other hand, mem8 gives the minimum area compared to the other configurations. Because the full pixel bitwidth is

Table XII NORMALISE ME ARCHITECTURE AREA AND POWER COMPARISON

|        | Area | Energy | Area*Energy |
|--------|------|--------|-------------|
| me_sad | 1.00 | 1.00   | 1.00        |
| ms_m28 | 1.45 | 0.48   | 0.70        |
| mc_m8  | 1.16 | 0.52   | 0.60        |
| mc_m8p | 1.28 | 0.47   | 0.61        |

accessed in both low and full resolution, it requires higher memory bandwidth and uses more power than the others during the low resolution search.

Table XI shows the total area and power consumption results based on extracted layout for the proposed overall ME architectures: ms\_m28, mc\_m8, mc\_m8p. In order to make a fair comparison between these architectures, we calculate the total energy needed during each motion prediction. For the two-step method, this includes energy consumed during both low and full resolution search. The energy is calculated as the power consumption multiplied by the total time taken to complete the motion prediction. The normalised energy and area is shown in Table XII. From the table, the architecture mc\_m8p consumes the least energy compared to the others, and saves 53% compared to conventional ME (me\_sad). However, because it requires additional area for the DPC based PEs and buffer to arrange the pixels, 28% area overhead is required to implement this method.

Compared to other architectures, mc\_m8 gives the best trade-off in terms of both area and energy efficiency. By using the two-step method, this architecture can save 48% of the energy compared to conventional ME with 16% additional area required to implement a low resolution search.

Table XIII shows the comparison of various motion estimation architectures. Since our architecture requires less than 500 clock cycles to process one macroblock, the architecture can achieve real-time operation for processing QCIF@30fps at 1.4 MHz. At this clock frequency, our method shows that it consumes lower power per macroblock compared to other architectures.

# C. Energy Saving on H.264 System

In order to evaluate the low power techniques, an H.264 system was built, as shown in Fig. 10, which will be referred

Table XI

ME ARCHITECTURE AREA  $(mm^2)$ , TOTAL EQUIVALENT GATES (BASED ON 2-INPUT NAND GATE) AND POWER (mW) COMPARISON

|         | Area | Gates  | Power    |     |       |           |     |       |
|---------|------|--------|----------|-----|-------|-----------|-----|-------|
| Modules |      |        | Low Res. |     |       | Full Res. |     |       |
|         |      |        | Logic    | Mem | Total | Logic     | Mem | Total |
| me_sad  | 1.77 | 341435 | NA       | NA  | NA    | 45.00     | 4.7 | 49.70 |
| ms_m28  | 2.56 | 493827 | 8.32     | 3.3 | 11.62 | 44.02     | 4.7 | 48.72 |
| mc_m8   | 2.05 | 395448 | 8.79     | 4.7 | 13.49 | 43.72     | 4.7 | 48.42 |
| mc_m8p  | 2.26 | 435957 | 9.80     | 1.2 | 10.98 | 45.74     | 4.7 | 50.44 |

 Table XIII

 POWER COMPARISON FOR DIFFERENT MOTION ESTIMATION ARCHITECTURES

|                                       | Miyakosyu04 [15] | Chen07 [14]  | Proposed Method |
|---------------------------------------|------------------|--------------|-----------------|
| Process                               | 0.13             | 0.18         | 0.13            |
| Voltage (V)                           | 1.0              | 1.3          | 1.2             |
| Core Size $(mm^2)$                    | 13.7             | 3.6          | 2.26            |
| Video Spec                            | QCIF 15fps       | CIF 30fps    | QCIF 30fps      |
| Frequency (MHz)                       | 1.7              | 13.5         | 1.4             |
| Block Size                            | 16x16 and 8x8    | 16x16 to 4x4 | 16x16 to 4x4    |
| Power (mW)                            | 0.9              | 16.72        | 1.33            |
| Normalise Power $(1.2V, 0.13\mu m)^*$ | 1.30             | 7.43         | 1.33            |
| Normalise power/(MB persecond)        | 0.00087          | 0.00063      | 0.00045         |

\* Normalised Power [14] =  $Power \times \frac{0.13^2}{Process^2} \times \frac{1.2^2}{Voltage^2}$ 



Figure 10. MPEG system 3 stage pipeline

to as the conventional system in this paper. The architecture of each module in the system has been carefully selected based on existing literature to achieve real time implementation [2], [16], [17], [18], [19].

This system consists of a motion estimator (ME), a motion compensator (MC), a transform coder, a deblocking filter (DFIR) and an entropy coder (EC). The transform coder is comprised of an integer transform (IT), an inverse integer transform (IIT), a quantiser (Q) and an inverse quantiser (IQ). Modules dealing with intra prediction and fractional pel motion estimation were not included due to time constraints. Furthermore, since the proposed low power techniques mainly affect the integer motion estimation and transform coding loop, intra prediction and fractional pel motion estimation are not directly affected by these techniques. The architecture aimed to process QCIF resolution (YUV420) at 30fps with a search range of p = [-8, 7].

The basic processing unit in a video encoder is a macroblock. To encode the macroblock in serial order, starting from predicting the current MB to bitstream generation would result in slow throughput and low hardware utilization. To overcome this problem, macroblock pipeline processing is typically adopted in MPEG hardware architecture [20].

In this work, the system is divided into three pipeline stages. The first stage performs motion prediction. This stage includes loading the search area from external memory into on-chip memory and performing motion estimation and motion compensation. The second stage performs transform coding including integer transform, quantizer, inverse quantizer and inverse integer transform. Since the entropy coder and deblocking filter are operated based on output from the transform coder, these operations are executed in the third pipeline stage. This arrangement allows the hardware of each stage to be ready for processing the next MB once the output is stored into the pipeline buffers.

In this design, a maximum of four reference frames are used which allows the throughput for the pipeline to be set to 2000 clock cycle permacroblock. To achieve real-time operation, the clock has to operate at 6MHz for QCIF@30fps. In total, the conventional architecture requires  $4.85mm^2$  of chip area for the chip core.

The second column of Table XIV shows the energy consumes by the conventional H.264 architecture. Since different modules require different clock cycles to process one MB, the energy consumption is more accurate than the power values in representing the actual amount of work required to process MBs. In total, using one reference frame during motion prediction, the system consumes 695.43 nJ to process one MB within 2000 clock cycles (6MHz). The motion estimation dominates energy consumption, taking 77% of the total power. This is followed by the transform coder, deblocking filter and

# Table XIV

Energy consumption (nJ) comparison between the conventional H.264 architecture (Conv.) and the proposed H.264 low power architecture (LP) at 6MHz for one reference Frame

| Modules      | Energy |        |  |  |
|--------------|--------|--------|--|--|
| Wiodules     | Conv.  | LP     |  |  |
| ME           | 536.27 | 265.74 |  |  |
| MC           | 4.62   | 4.62   |  |  |
| IT/IIT, Q/IQ | 75.34  | 76.41  |  |  |
| DFIR         | 54.02  | 54.02  |  |  |
| EC           | 25.18  | 25.18  |  |  |
| Total        | 695.43 | 425.97 |  |  |

entropy coder at 11%, 8% and 4% respectively.

As discussed in Section VI-B, since the mc\_m8p architecture gives the largest energy saving compared to other architectures, this architecture is used to implement the two-step algorithm. By applying the proposed architecture, additional area is introduced to allow low resolution searches to be performed. As discussed in previous section, this increases the ME area by  $0.49mm^2$ . At the system level, the additional area increases to 10% of the total area.

The introduction of the proposed two-step hardware results in reduced energy consumption in the ME as shown in the third column of Table XIV. The total energy consumed by the ME using the proposed two-step method is 265.74nJ. The proposed architecture has saved 50% of ME power compared to the conventional ME architecture. The implementation of the low resolution which requires a smaller computational load has contributed to this saving.

Due to a slight decrease in the motion prediction accuracy during the two-step search, the residue generated from the two-step technique is slightly higher than the conventional ME. This results in a small increase in energy consumption (<2%) in the transform coder. Since most of the coefficient is quantized to zero, the small increase in residue is masked by the quantizer. Thus, the increase in the residue does not propagate to other modules, so there is no change in energy consumption in the deblocking filter and entropy coder. In total, for one reference frame, the two-step method reduces the total energy consumption for the H.264 system by 40% , as shown in Table XIV.

# VII. CONCLUSION

This paper has presented a method to reduce the computational cost and memory access for variable-block-size motion estimation using pixel truncation. Previous work has shown that pixel truncation provides an acceptable performance for motion prediction using a 16x16 block size. However, for motion prediction using smaller block sizes, pixel truncation reduces the motion prediction accuracy. In this paper, we have proposed a two-step search to improve the frame prediction using pixel truncation. Our method reduces the total computation and memory access compared to the conventional method without significantly degrading the picture quality. The results show that the proposed architectures are able to save up to 53% energy compared to the conventional full search ME architecture, which is equivalent to 40% energy saving over the conventional H.264 system. This makes such architecture attractive for H.264 application in future mobile devices.

#### REFERENCES

- "ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4) AVC "Advanced VideoCoding for Generic Audiovisual Services"," 2005.
- [2] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, Tung-Chienand Wang, and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimationfor h.264/avc," *IEEE Transactions* on Circuits and Systems I: Regular Papers, vol. 53, no. 3, pp. 578 – 593, 2006.
- [3] Z.-L. He, C.-Y. Tsui, K.-K. Chan, and M. L. Liou, "Low-power VLSI design for motion estimation using adaptive pixel truncation," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 10, pp. 669 – 678, 2000.
- [4] A. Bahari, T. Arslan, and A. T. Erdogan, "Low computation and memory access for variable block size motion estimationusing pixel truncation," *IEEE 2007 Workshop on Signal Processing Systems (SiPS 2007)*, 2007.
- [5] A. Bahari, T. Arslan, and A. Erdogan, "Low power hardware architecture for vbsme using pixel truncation," 21st International Conference on VLSI Design, pp. 389 – 94, 2008.
- [6] B. Natarajan, V. Bhaskaran, and K. Konstantinides, "Low-complexity block-based motion estimation via one-bit transforms," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 7, no. 4, pp. 702 – 6, Aug 1997.
- [7] A. Erturk and S. Erturk, "Two-bit transform for binary block motion estimation," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 15, no. 7, pp. 938 – 46, July 2005.
- [8] S. Lee, J.-M. Kim, and S.-I. Chae, "New motion estimation algorithm using adaptively quantized low bit-resolutionimage and its VLSI architecture for MPEG2 video encoding," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 8, no. 6, pp. 734 – 44, Oct 1998.
- [9] Y. Chan and S. Kung, "Multi-level pixel difference classification methods," *IEEE International Conference on Image Processing*, vol. 3, pp. 252 – 255, 1995.
- [10] V. G. Moshnyaga, "Msb truncation scheme for low-power video processors," *Proceedings - IEEE International Symposium on Circuits and Systems*, vol. 4, pp. 291–294 –, 1999.
- [11] M.-J. Chen, L.-G. Chen, T.-D. Chiueh, and Y.-P. Lee, "A new blockmatching criterion for motion estimation and its implementation," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 5, no. 3, pp. 231 – 6, June 1995.
- [12] S. Sharma, P. Mishra, S. Sawant, C. Mammen, and V. Gadre, "Predecision strategy for coded/non-coded mbs in MPEG4," 2004 International Conference on Signal Processing and Communications (SPCOM), pp. 501 – 5, 2004.
- [13] H. Yeo and Y. H. Hu, "A novel architecture and processor-level design based on a new matchingcriterion for video compression," VLSI Signal Processing, IX, pp. 448 – 57, 1996.
- [14] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, and S. Y. C. andLiang Gee Chen, "Fast algorithm and architecture design of low-power integer motion estimationfor h.264/avc," *IEEE Trans. Circuits Syst. Video Technol.* (USA), vol. 17, no. 5, pp. 568 – 77, 2007.
- [15] M. Miyama, J. Miyakoshi, Y. Kuroda, H. Imamura, Kousukeand Hashimoto, and M. Yoshimoto, "A sub-mw MPEG-4 motion estimation processor core for mobile video application," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 9, pp. 1562 – 1570, 2004.
- [16] T.-C. Wang, Y.-W. Huang, H.-C. Fang, and L.-G. Chen, "Parallel 4\*4 2d transform and inverse transform architecture for MPEG-4avc/h.264," *Proceedings of the 2003 IEEE International Symposium on Circuits and Systems*, vol. vol.2, pp. 800 – 3, 2003.
- [17] Y.-W. Huang, T.-W. Chen, B.-Y. Hsieh, T.-C. Wang, Te-HaoChang, and L.-G. Chen, "Architecture design for deblocking filter in h.264/jvt/avc," *Proceedings 2003 International Conference on Multimedia and Expo*, vol. vol. 1, pp. 693 – 6, 2003.
- [18] S.-C. Chang, W.-H. Peng, S.-H. Wang, and T. Chiang, "A platform based bus-interleaved architecture for de-blocking filter inh.264/MPEG-4 avc," *IEEE Transactions on Consumer Electronics*, vol. 51, no. 1, pp. 249 – 55, 2005.
- [19] T.-C. Chen, Y.-W. Huang, C.-Y. Tsai, B.-Y. Hsieh, and L.-G. Chen, "Dual-block-pipelined vlsi architecture of entropy coding for h.264/avc baseline profile," vol. 2005, (Hsinchu, Taiwan), pp. 271 – 274, 2005.

[20] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, "Analysis and architecture design of an hdtv720p 30 frames/s h.264/avc encoder," *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 16, no. 6, pp. 673 – 688, 2006.



Asral Bahari received the B.Eng from the Department of Electronics and Computer Engineering at University of Southampton, United Kingdom in 1998, the M.Sc from University Putra Malaysia, Malaysia in 2002, and Ph.D from University of Edinburgh, United Kingdom in 2008.

He is currently a senior lecturer with the School of Microelectronic Engineering, University Malaysia Perlis, Malaysia. From 1998 to 2004 he was a researcher at Malaysian Institute of Microelectronic Systems. His research interest includes VLSI design,

low power algorithm and architecture, multimedia communication and DSP architecture design.



**Tughrul Arslan** holds the Chair of Integrated Electronic Systems in the School of Engineering and Electronics, University of Edinburgh, Edinburgh, U.K., and is also a cofounder and the Chief Technical Officer of Spiral Gateway Ltd., ETTC, University of Edinburgh. He is a member of the Integrated Micro and Nano Systems (IMNS)Institute and leads the System Level Integration Group (SLIg) in the University. His research interests include low power design, DSP hardware design, system-on-chip (SoC) architectures, evolvable hardware, multi-objective

optimization and the use of genetic algorithms in hardware design issues. Prof. Arslan is an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II, a member of the IEEE CAS Committee on VLSI Systems and Applications, and sits on the editorial board of IEE Proceedings on Computers and Digital Techniques and the technical committees of a number of international conferences.



Ahmet T. Erdogan received the B.Sc. degree in electronics engineering from Dokuz Eylul University, Izmir, Turkey, in 1990, and the M.Sc. and Ph.D. degrees from Cardiff University, Cardiff, U.K., in 1995 and 1999, respectively.

He is currently a Senior Research Fellow with System Level Integration, School of Engineering and Electronics, University of Edinburgh, Edinburgh, U.K. He is also a member of the Institute for Integrated Micro and Nano Systems, Edinburgh, U.K., and the Institute for System Level Integration, Liv-

ingston, U.K. His research interests include VLSI design, circuits and systems for communications and signal processing, design of energy efficient digital circuits and systems, computer arithmetic, application-specific architecture design, and reconfigurable computing. He has published several journal and conference papers in these areas.