Let $mathbb{F}_q$ be a finite field of q elements, $R=mathbb{F}_q+umathbb{F}_q$ (u2=0) and D2n=
Dajiang LIU Shouyi YIN Leibo LIU Shaojun WEI
The coarse-grained reconfigurable architecture (CGRA) is a promising computing platform that provides both high performance and high power-efficiency. The computation-intensive portions of an application (e.g. loop nests) are often mapped onto CGRA for acceleration. However, mapping loop nests onto CGRA efficiently is quite a challenge due to the special characteristics of CGRA. To optimize the mapping of loop nests onto CGRA, this paper makes three contributions: i) Establishing a precise performance model of mapping loop nests onto CGRA, ii) Formulating the loop nests mapping as a nonlinear optimization problem based on polyhedral model, iii) Extracting an efficient heuristic algorithm and building a complete flow of mapping loop nests onto CGRA (PolyMAP). Experiment results on most kernels of the PolyBench and real-life applications show that our proposed approach can improve the performance of the kernels by 27% on average, as compared to the state-of-the-art methods. The runtime complexity of our approach is also acceptable.
Yu PENG Shouyi YIN Leibo LIU Shaojun WEI
Coarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. In an application, loop nests are usually mapped onto CGRA for further acceleration, so optimizing the mapping is an important goal for design of CGRAs. Moreover, obviously almost all of mobile devices are powered by batteries, how to reduce energy consumption also becomes one of primary concerns in using CGRAs. This paper makes three contributions: a) Proposing an energy consumption model for CGRA; b) Formulating loop nests mapping problem to minimize the battery charge loss; c) Extract an efficient heuristic algorithm called BPMap. Experiment results on most kernels of the benchmarks and real-life applications show that our methods can improve the performance of the kernels and lower the energy consumption.
Akihiro SUDA Hideki TAKASE Kazuyoshi TAKAGI Naofumi TAKAGI
We propose a synthesis method of nested loops into parallelized circuits by integrating the polyhedral optimization, which is a state-of-the-art technique in the field of software, into high-level synthesis. Our method constructs circuits equipped with multiple processing elements (PEs), using information generated by the polyhedral optimizing compiler. Since multiple PEs cannot concurrently access the off-chip RAM, a method for constructing on-chip buffers is also proposed. Our buffering method reduces the off-chip RAM access conflicts and further enables burst accesses and data reuses. In our experimental result, the buffered circuits generated by our method are 8.2 times on average and 26.5 times at maximum faster than the sequential non-buffered ones, when each of the parallelized circuits is configured with eight PEs.
Naoya ONIZAWA Akira MOCHIZUKI Hirokatsu SHIRAHAMA Masashi IMAI Tomohiro YONEDA Takahiro HANYU
This paper introduces a partially parallel inter-chip link architecture for asynchronous multi-chip Network-on-Chips (NoCs). The multi-chip NoCs that operate as a large NoC have been recently proposed for very large systems, such as automotive applications. Inter-chip links are key elements to realize high-performance multi-chip NoCs using a limited number of I/Os. The proposed asynchronous link based on level-encoded dual-rail (LEDR) encoding transmits several bits in parallel that are received by detecting the phase information of the LEDR signals at each serial link. It employs a burst-mode data transmission that eliminates a per-bit handshake for a high-speed operation, but the elimination may cause data-transmission errors due to cross-talk and power-supply noises. For triggering data retransmission, errors are detected from the embedded phase information; error-detection codes are not used. The throughput is theoretically modelled and is optimized by considering the bit-error rate (BER) of the link. Using delay parameters estimated for a 0.13 µm CMOS technology, the throughput of 8.82 Gbps is achieved by using 10 I/Os, which is 90.5% higher than that of a link using 9 I/Os without an error-detection method operating under negligible low BER (<10-20).
Yusuke MIZUNO Kazunobu KONDO Takanori NISHINO Norihide KITAOKA Kazuya TAKEDA
Blind source separation is a technique that can separate sound sources without such information as source location, the number of sources, and the utterance content. Multi-channel source separation using many microphones separates signals with high accuracy, even if there are many sources. However, these methods have extremely high computational complexity, which must be reduced. In this paper, we propose a computational complexity reduction method for blind source separation based on frequency domain independent component analysis (FDICA) and examine temporal data that are effective for source separation. A frame with many sound sources is effective for FDICA source separation. We assume that a frame with a low kurtosis has many sound sources and preferentially select such frames. In our proposed method, we used the log power spectrum and the kurtosis of the magnitude distribution of the observed data as selection criteria and conducted source separation experiments using speech signals from twelve speakers. We evaluated the separation performances by the signal-to-interference ratio (SIR) improvement score. From our results, the SIR improvement score was 24.3dB when all the frames were used, and 23.3dB when the 300 frames selected by our criteria were used. These results clarified that our proposed selection criteria based on kurtosis and magnitude is effective. Furthermore, we significantly reduced the computational complexity because it is proportional to the number of selected frames.
Shouyi YIN Dajiang LIU Leibo LIU Shaojun WEI
A coarse-grained reconfigurable architecture (CGRA) is typically hybrid architecture, which is composed of a reconfigurable processing unit (RPU) and a host microprocessor. Many computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important to improve the performance of CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can affect these three performance influencing factors greatly, and would be of much significance when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, where the PE utilization rate, communication cost and reconfiguration cost are under a joint consideration. Our approach could be integrated into compilers for CGRAs to improve the operating performance. Compared with the communication-minimal approach, experimental results show that our scheme can improve 5.8% and 13.6% of execution time on motion estimation (ME) and partial differential equation (PDE) solvers kernels, respectively. Also, run-time complexity is acceptable for the practical cases.
Dajiang LIU Shouyi YIN Chongyong YIN Leibo LIU Shaojun WEI
Reconfigurable computing system is a class of parallel architecture with the ability of computing in hardware to increase performance, while remaining much of flexibility of a software solution. This architecture is particularly suitable for running regular and compute-intensive tasks, nevertheless, most compute-intensive tasks spend most of their running time in nested loops. Polyhedron model is a powerful tool to give a reasonable transformation on such nested loops. In this paper, a number of issues are addressed towards the goal of optimization of affine loop nests for reconfigurable cell array (RCA), such as approach to make the most use of processing elements (PE) while minimizing the communication volume by loop transformation in polyhedron model, determination of tilling form by the intra-statement dependence analysis and determination of tilling size by the tilling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the mapping optimization approaches developed. Compared with DFG-based optimization approach, the execution performances of 1-d jacobi and matrix multiplication are improved by 28% and 48.47%. Lastly, the run-time complexity is acceptable for the practical cases.
Chuzo IWAMOTO Yusuke KITAGAKI Kenichi MORITA
We study the complexity of finding the minimum number of face guards which can observe the whole surface of a polyhedral terrain. Here, a face guard is allowed to be placed on the faces of a terrain, and the guard can walk around on the allocated face. It is shown that finding the minimum number of face guards is NP-hard.
Motoki OGASAWARA Takanori NISHINO Kazuya TAKEDA
The separation and localization of sound source signals are important techniques for many applications, such as highly realistic communication and speech recognition systems. These systems are expected to work without such prior information as the number of sound sources and the environmental conditions. In this paper, we developed a dodecahedral microphone array and proposed a novel separation method with our developed device. This method refers to human sound localization cues and uses acoustical characteristics obtained by the shape of the dodecahedral microphone array. Moreover, this method includes an estimation method of the number of sound sources that can operate without prior information. The sound source separation performances were evaluated under simulated and actual reverberant conditions, and the results were compared with the conventional method. The experimental results showed that our separation performance outperformed the conventional method.
Jeonghun KIM Suki KIM Kwang-Hyun BAEK
This paper presents a low-power System on Chip (SOC) architecture for the v2.0+EDR (Enhanced Data Rate) Bluetooth and its applications. Our design includes a link controller, modem, RF transceiver, Sub-Band Codec (SBC), Expanded Instruction Set Computer (ESIC) processor, and peripherals. To decrease power consumption of the proposed SOC, we reduce data transfer using a dual-port memory, including a power management unit, and a clock gated approach. We also address some of issues and benefits of reusable and unified environment on a centralized data structure and SOC verification platform. This includes flexibility in meeting the final requirements using technology-independent tools wherever possible in various processes and for projects. The other aims of this work are to minimize design efforts by avoiding the same work done twice by different people and to reuse the similar environment and platform for different projects. This chip occupies a die size of 30 mm2 in 0.18 µm CMOS, and the worst-case current of the total chip is 54 mA.
Shota ISHIHARA Yoshiya KOMATSU Masanori HARIYAMA Michitaka KAMEYAMA
This paper presents an asynchronous FPGA that combines 4-phase dual-rail encoding and LEDR (Level-Encoded Dual-Rail) encoding. 4-phase dual-rail encoding is employed to achieve small area and low power for function units, while LEDR encoding is employed to achieve high throughput and low power for the data transfer using programmable interconnection resources. Area-efficient protocol converters and their control circuits are also proposed in transistor-level implementation. The proposed FPGA is designed using the e-Shuttle 65nm CMOS process. Compared to the 4-phase-dual-rail-based FPGA, the throughput is increased by 69% with almost the same transistor count. Compared to the LEDR-based FPGA, the transistor count is reduced by 47% with almost the same throughput. In terms of power consumption, the proposed FPGA achieves the lowest power compared to the 4-phase-dual-rail-based and the LEDR-based FPGAs. Compared to the synchronous FPGA, the proposed FPGA has lower power consumption when the workload is below 35%.
A trihedral corner reflector is often used for SAR polarimetric calibration. However, the scattering property of the reflector used for the calibration may not be correct if the high frequency approximation is not satisfied or if an incident angle deviates from the symmetric axis of the reflector. In order to know the conditions for precise SAR polarimetric calibration, we evaluated the polarimetric response of the reflector by a numerical simulation using the method of moment (MoM). It is found that allowable incident angle deviation is 5 degree to azimuth direction and 4 degree to elevation direction for precise SAR polarimetric calibration when the size of the reflector is 7.5 times larger than the wavelength of an incident wave.
Heeryon CHO Toru ISHIDA Satoshi OYAMA Rieko INABA Toshiyuki TAKASAKI
Since participants at both end of the communication channel must share common pictogram interpretation to communicate, the pictogram selection task must consider both participants' pictogram interpretations. Pictogram interpretation, however, can be ambiguous. To assist the selection of pictograms more likely to be interpreted as intended, we propose a categorical semantic relevance measure which calculates how relevant a pictogram is to a given interpretation in terms of a given category. The proposed measure defines similarity measurement and probability of interpretation words using pictogram interpretations and frequencies gathered from a web survey. Moreover, the proposed measure is applied to categorized pictogram interpretations to enhance pictogram retrieval performance. Five pictogram categories used for categorizing pictogram interpretations are defined based on the five first-level classifications defined in the Concept Dictionary of the EDR Electronic Dictionary. Retrieval performances among not-categorized interpretations, categorized interpretations, and categorized and weighted interpretations using semantic relevance measure were compared, and the categorized semantic relevance approaches showed more stable performances than the not-categorized approach.
Masanori HARIYAMA Shota ISHIHARA Michitaka KAMEYAMA
This paper presents a novel asynchronous architecture of Field-programmable gate arrays (FPGAs) to reduce the power consumption. In the dynamic power consumption of the conventional FPGAs, the power consumed by the switch blocks and clock distribution is dominant since FPGAs have complex switch blocks and the large number of registers for high programmability. To reduce the power consumption of switch blocks and clock distribution, asynchronous bit-serial architecture is proposed. To ensure the correct operation independent of data-path lengths, we use the level-encoded dual-rail encoding and propose its area-efficient implementation. The proposed field-programmable VLSI is implemented in a 90 nm CMOS technology. The delay and the power consumption of the proposed FPVLSI are respectively 61% and 58% of those of 4-phase dual-rail encoding which is the most common encoding in delay insensitive encoding.
Haruaki ONISHI Yuuki TANAKA Yukio SHIBATA
In this paper, we present a new extension of the butterfly digraph, which is known as one of the topologies used for interconnection networks. The butterfly digraph was previously generalized from binary to d-ary. We define a new digraph by adding a signed label to each vertex of the d-ary butterfly digraph. We call this digraph the dihedral butterfly digraph and study its properties. Furthermore, we show that this digraph can be represented as a Cayley graph. It is well known that a butterfly digraph can be represented as a Cayley graph on the wreath product of two cyclic groups [1]. We prove that a dihedral butterfly digraph can be represented as a Cayley graph in two ways.
Hongwei ZHU Ilie I. LUICAN Florin BALASA
In real-time multimedia processing systems a very large part of the power consumption is due to the data storage and data transfer. Moreover, the area cost is often largely dominated by the memory modules. In deriving an optimized (for area and/or power) memory architecture, memory size computation is an important step in the exploration of the possible algorithmic specifications of multimedia applications. This paper presents a novel non-scalar approach for computing exactly the memory size in real-time multimedia algorithms. This methodology uses both algebraic techniques specific to the data-flow analysis used in modern compilers and, also, more recent advances in the theory of polyhedra. In contrast with all the previous works which are only estimation methods, this approach performs exact memory computations even for applications significantly large in terms of the code size, number of scalars, and number of array references.
Kei HAYASHI Ryoichi SATO Yoshio YAMAGUCHI Hiroyoshi YAMADA
This paper examines polarimetric scattering characteristics caused by a dihedral corner reflector of finite size. The dihedral corner reflector is a basic model of double-bounce structure in urban area. The detailed scattering information serves the interpretation of Polarimetric Synthetic Aperture Radar (POLSAR) data analysis. The Finite-Difference Time-Domain (FDTD) method is utilized for the scattering calculation because of its simplicity and flexibility in the target shape modeling. This paper points out that there exists a stable double-bounce squint angle region both for perfect electric conductor (PEC) and dielectric corner reflectors. Beyond this stable squint angular region, the scattering characteristics become completely different from the assumed response. A criterion on the double-bounce scattering is proposed based on the physical optics (PO) approximation. The detailed analyses on the polarimetric index (co-polarization ratio) with respect to squint angle and an experimental result measured in an anechoic chamber are shown.
Atsutada NAKATSUJI Yasuyuki SUGAYA Kenichi KANATANI
In reconstructing 3-D from images based on feature points, one usually defines a triangular mesh that has these feature points as vertices and displays the scene as a polyhedron. If the scene itself is a polyhedron, however, some of the displayed edges may be inconsistent with the true shape. This paper presents a new technique for automatically eliminating such inconsistencies by using a special template. We also present a technique for removing spurious occluding edges. All the procedures do not require any thresholds to be adjusted. Using real images, we demonstrate that our method has high capability to correct inconsistencies.
Masaharu FUJITA Chikage MURAKAMI
Polarimetric calibration of radar is indispensable for using radar data effectively. This paper proposes a polarimetric radar calibration algorithm using polarization-preserving and polarization-selective reflectors as reference targets. The algorithm assumes radar antenna reciprocity but allows different co-polarization transmission characteristics between horizontal and vertical polarization channels. In processing, the second order terms of small cross-talk factors in antenna polarization transfer characteristics are ignored. The major advantage of the present algorithm is that it does not need assumptions on the scattering characteristics of the background natural surface and is independent of external phase calibration. The results of error analysis show that the present algorithm has sufficient tolerance against errors of reference targets. The validity of the present algorithm was evaluated by analyzing the Spaceborne Imaging Radar C (SIR-C) data and the results were satisfactory.