Qian DENG Li GUO Chao DONG Jiaru LIN Xueyan CHEN
In this paper, we propose a low-complexity widely-linear minimum mean square error (WL-MMSE) signal detection based on the Chebyshev polynomials accelerated symmetric successive over relaxation (SSORcheb) algorithm for uplink (UL) over-loaded large-scale multiple-input multiple-output (MIMO) systems. Chebyshev acceleration not only speeds up the convergence rate significantly and maximizes the data throughput, but also reduces the cost. By utilizing random matrix theory, we present good estimates for the Chebyshev acceleration parameters of the proposed signal detection in real large-scale MIMO systems. Simulation results demonstrate that the new WL-SSORcheb-MMSE detection not only outperforms the recently proposed linear iterative detection and the optimal polynomial expansion (PE) WL-MMSE detection, but also achieves a performance close to the exact WL-MMSE detection. Additionally, the proposed detection offers superior sum rate and bit error rate (BER) performance compared to the precision MMSE detection with substantially fewer arithmetic operations in a short coherence time. Therefore, the proposed detection can satisfy the high-density and high-mobility requirements of some emerging wireless networks, such as high-mobility Internet of Things (IoT) networks.
With the advances in computer processing that have yielded an enormous increase in performance, numerical analytical approaches based on electromagnetic theory have recently been applied to mobile radio propagation analysis. One such approach is the ray-tracing method based on geometrical optics and the uniform geometrical theory of diffraction. In this paper, ray-tracing techniques that have been proposed in order to improve computational accuracy and speed are surveyed. First, imaging and ray-launching methods are described and their extended methods are surveyed as novel fundamental ray-tracing techniques. Next, various ray-tracing acceleration techniques are surveyed and categorized into three approaches, i.e., deterministic, heuristic, and brute force. Then, hybrid methods are surveyed such as those employing Physical optics, the Effective Roughness model, and the Finite-Difference Time-Domain method that have been proposed in order to improve analysis accuracy.
In this paper, we present an FPGA hardware implementation of phylogenetic tree reconstruction with a maximum parsimony algorithm. We base our approach on a particular stochastic local search algorithm that uses the Progressive Neighborhood and the Indirect Calculation of Tree Lengths method. This method is widely used to accelerate phylogenetic tree reconstruction in software. In our implementation, we define a tree structure and accelerate the search by parallel and pipeline processing. We show results for eight real-world biological datasets. We compare execution times against our previous hardware approach and against TNT, the fastest available parsimony program, which is also accelerated by the Indirect Calculation of Tree Lengths method. Acceleration rates between 34 and 45 per rearrangement, and between 2 and 6 for the whole search, are obtained against our previous hardware approach. Acceleration rates between 2 and 36 per rearrangement, and between 18 and 112 for the whole search, are obtained against TNT.
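For readers unfamiliar with the quantity being optimized, the following is a minimal sketch of how a parsimony tree length can be computed directly with the classic Fitch algorithm. It is not the Indirect Calculation of Tree Lengths method and not the hardware design described above; the nested-tuple tree encoding and the names `fitch_length` and `tree_length` are assumptions made for illustration.

```python
def fitch_length(tree, leaf_states):
    """Parsimony length of one character on a rooted binary tree (Fitch).

    tree: nested 2-tuples with leaf names (str) at the tips (assumed encoding).
    leaf_states: dict mapping leaf name -> character state.
    """
    def visit(node):
        if isinstance(node, str):
            # A leaf contributes its observed state and no changes.
            return {leaf_states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # Children agree on at least one state: no extra change.
            return common, left_cost + right_cost
        # Children disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum the Fitch length over every column of an alignment.

    alignment: dict mapping leaf name -> sequence string (equal lengths).
    """
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    tree = (("A", "B"), ("C", "D"))
    alignment = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
    print(tree_length(tree, alignment))
```

A local search would call such a scoring routine after every candidate rearrangement, which is why faster per-rearrangement evaluation translates into a faster overall search.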
Shadow is an important effect that makes virtual 3D scenes more realistic. In this paper, we propose a fast and correct soft shadow generation method for area lights of various shapes and colors. To conduct efficient as well as accurate visibility tests, we exploit the complexity of shadow and area light color.
Yoshitaka OTANI Osamu AOKI Tomohiro HIROTA Hiroshi ANDO
The purpose of this study is to provide a fall risk assessment for stroke patients during walking using an accelerometer. We assessed gait parameters, normalized root mean squared acceleration (NRMSA), and Berg Balance Scale (BBS) values. NRMSA reflected walking dynamics, in terms of the risk of falls during walking, better than the BBS did.
Surachai THONGKAEW Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
The Process Virtual Machine (VM) is typical software that runs applications inside operating systems. Its purpose is to provide a platform-independent programming environment that abstracts away details of the underlying hardware and operating system and allows bytecodes (portable code) to be executed in the same way on any platform. Process VMs are implemented using an interpreter that interprets bytecode instead of directly executing host machine code. Thus, bytecode execution is slower than the execution of a compiled programming language. Several techniques, including the "Fetch/Decode Hardware Extension" from our previous paper, have been proposed to speed up the interpretation of Process VMs. In this paper, we propose an additional methodology, the "Hardware Extension with Hybrid Execution", to further enhance the performance of Process VM interpretation, focusing on the register-based model. This new technique provides an additional decoder which classifies bytecodes into either simple or complex instructions. With "Hybrid Execution", a simple instruction is executed directly on the hardware of the native processor, while a complex instruction is emulated by the "extra optimized bytecode software handler" of the native processor. In order to eliminate the overheads of retrieving and storing operands in memory, we utilize the physical registers instead of (low address) virtual registers. Moreover, the combination of three techniques (delay scheduling, mode predictor HW, and branch/goto controller) can eliminate all of the mode-switching overheads between native mode and bytecode mode. The experimental results show improvements in execution speed of up to 16.9x, 16.1x and 3.1x for arithmetic instructions, loop and conditional instructions, and method invocation and return instructions, respectively. The approximate size of the proposed hardware extension is 0.04 mm² (equivalent to 14.81k gates), and it consumes additional power of only 0.24 mW. The stated results are obtained from logic synthesis using the TSMC 90 nm technology at 200 MHz.
Stewart DENHOLM Hiroaki INOUE Takashi TAKENAKA Tobias BECKER Wayne LUK
Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.
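The basic idea of A/B arbitration can be illustrated in software with a small sketch. It assumes that every message carries a monotonically increasing sequence number, which the abstract does not state, and it models only a first-copy-wins policy, not the paper's FPGA architecture or its windowing methods. The function name `arbitrate` and the tuple layout are hypothetical.

```python
def arbitrate(arrivals):
    """First-copy-wins arbitration over two redundant feeds.

    arrivals: iterable of (feed, seq, payload) tuples in arrival order,
              where seq is an assumed per-message sequence number.
    Returns (forwarded, missing): the messages passed downstream, and the
    sequence numbers inside the observed range that neither feed delivered.
    """
    seen = set()
    forwarded = []
    for feed, seq, payload in arrivals:
        if seq in seen:
            # The other feed already delivered this message: drop the copy.
            continue
        seen.add(seq)
        forwarded.append((seq, feed, payload))
    if not seen:
        return forwarded, []
    expected = set(range(min(seen), max(seen) + 1))
    return forwarded, sorted(expected - seen)


if __name__ == "__main__":
    arrivals = [
        ("A", 1, "x"), ("B", 1, "x"),   # duplicate of seq 1 on feed B
        ("B", 2, "y"),                  # feed A lost seq 2, B covers it
        ("A", 4, "w"), ("B", 4, "w"),   # seq 3 lost on both feeds
    ]
    forwarded, missing = arbitrate(arrivals)
    print(forwarded)
    print(missing)
```

Forwarding the first copy immediately corresponds to prioritising latency; a reliability-oriented mode would instead hold messages for some window so that a late copy from the other feed can fill a gap before output.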
Yang XUE Yaoquan HU Lianwen JIN
With the development of personal electronic equipment, the use of a smartphone with a tri-axial accelerometer to detect human physical activity is becoming popular. In this paper, we propose a new feature based on FFT for activity recognition from tri-axial acceleration signals. To improve the classification performance, two fusion methods, minimal distance optimization (MDO) and variance contribution ranking (VCR), are proposed. The new proposed feature achieves a recognition rate of 92.41%, which outperforms six traditional time- or frequency-domain features. Furthermore, the proposed fusion methods effectively improve the recognition rates. In particular, the average accuracy based on class fusion VCR (CFVCR) is 97.01%, which results in an improvement in accuracy of 4.14% compared with the results without any fusion. Experiments confirm the effectiveness of the new proposed feature and fusion methods.
The authors have developed a mechanism that applies real vibration to electrical contacts by hammering oscillation in the vertical direction similar to that in real cases, and they have studied the effects of micro-oscillation on the contacts using the mechanism. It is shown that the performance of the hammering oscillation mechanism (HOM) for measuring acceleration and force is superior to that of other methods in terms of the stability of data. Using the mechanism, much simpler and more practical protocols are proposed for evaluating acceleration, force, and mass using only the measured acceleration. It is also indicated that the relationship between the inertial force generated by the hammering oscillation mechanism and the frictional force in electrical devices attached on a board is related to one of the causes of the degradation of electrical contacts under the effect of external micro-oscillation.
This paper presents a response time acceleration technique in a high-gain capacitive-feedback frontend amplifier (FA) for high output impedance sensors. Using an auxiliary amplifier as a unity-gain buffer, a sample-and-hold capacitor which is used for band-limiting and sampling the FA output is driven at the beginning of the transient response to make the response faster and then it is re-charged directly by the FA output. A condition and parameters for the response time acceleration using this technique while maintaining the noise level unaffected are discussed. Theoretical analysis and simulation results show that the response time can be less than half of the case without the acceleration technique for the specified settling error of less than 0.5%.
An algorithm for the discrimination between human upstairs and downstairs using a tri-axial accelerometer is presented in this paper, which consists of vertical acceleration calibration, extraction of two kinds of features (Interquartile Range and Wavelet Energy), effective feature subset selection with the wrapper approach, and SVM classification. The proposed algorithm can recognize upstairs and downstairs with 95.64% average accuracy for different sensor locations, i.e. located on the subject's waist belt, in the trousers pocket, and in the shirt pocket. Even for the mixed data from all sensor locations, the average recognition accuracy can reach 94.84%. Experimental results have successfully validated the effectiveness of the proposed method.
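As an illustration of one of the two feature types named above, the sketch below computes the interquartile range over fixed-length windows of a vertical acceleration signal. The window length, the non-overlapping segmentation, and the quartile interpolation method are assumptions; the abstract does not specify them.

```python
import statistics


def iqr(samples):
    """Interquartile range (Q3 - Q1) of a sequence of samples."""
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q3 - q1


def window_iqr(signal, size):
    """IQR feature for each non-overlapping window of `size` samples.

    A trailing partial window is ignored (an assumed convention).
    """
    return [
        iqr(signal[start:start + size])
        for start in range(0, len(signal) - size + 1, size)
    ]


if __name__ == "__main__":
    vertical_acc = [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
    print(window_iqr(vertical_acc, 5))
```

Each window's IQR would then form one entry of the feature vector passed to feature selection and the SVM classifier.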
Atsunori OGAWA Satoshi TAKAHASHI Atsushi NAKAMURA
This paper proposes an efficient combination of state likelihood recycling and batch state likelihood calculation for accelerating acoustic likelihood calculation in an HMM-based speech recognizer. Recycling and batch calculation are based on different technical approaches, i.e. the former is a purely algorithmic technique while the latter fully exploits computer architecture. To accelerate the recognition process further by combining them efficiently, we introduce conditional fast processing and acoustic backing-off. Conditional fast processing is based on two criteria. The first, the potential activity criterion, is used to control not only the recycling of state likelihoods at the current frame but also the precalculation of state likelihoods for several succeeding frames. The second, the reliability criterion, is used together with acoustic backing-off to control the choice between recycled and batch-calculated state likelihoods when they are contradictory in the combination, and to prevent word accuracies from degrading. Large vocabulary spontaneous speech recognition experiments using four different CPU machines under two environmental conditions showed that, compared with the baseline recognizer, recycling alone, and batch calculation alone, our combined acceleration technique further reduced both the acoustic likelihood calculation time and the total recognition time. We also performed detailed analyses to reveal each technique's acceleration and environmental dependency mechanisms by classifying types of state likelihoods and counting each of them. The analysis results confirmed the effectiveness of the combined acceleration technique.
Yuma MUNEKAWA Fumihiko INO Kenichi HAGIHARA
This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
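The computation being accelerated can be written down compactly on the CPU. The sketch below is a plain reference implementation of the Smith-Waterman local alignment score, not the CUDA method of the paper; the linear gap penalty and the scoring values are assumptions chosen for brevity. The helper `gcups` shows how the cell-update throughput figure is conventionally derived from query length, database size and run time.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed scoring).

    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap          # gap in the subject
            left = curr[j - 1] + gap    # gap in the query
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_len, db_residues, seconds):
    """Giga cell updates per second: matrix cells computed / time / 1e9."""
    return query_len * db_residues / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
    print(gcups(100, 10**9, 100.0))
```

Every cell depends only on its upper, left and upper-left neighbours, which is the data dependency that makes on-chip memory reuse and pipelining attractive on a GPU.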
Ukrit WATCHAREERUETAI Tetsuya MATSUMOTO Noboru OHNISHI Hiroaki KUDO Yoshinori TAKEUCHI
We propose a learning strategy for acceleration in learning speed of genetic programming (GP), named hierarchical structure GP (HSGP). The HSGP exploits multiple learning nodes (LNs) which are connected in a hierarchical structure, e.g., a binary tree. Each LN runs conventional evolutionary process to evolve its own population, and sends the evolved population into the connected higher-level LN. The lower-level LN evolves the population with a smaller subset of training data. The higher-level LN then integrates the evolved population from the connected lower-level LNs together, and evolves the integrated population further by using a larger subset of training data. In HSGP, evolutionary processes are sequentially executed from the bottom-level LNs to the top-level LN which evolves with the entire training data. In the experiments, we adopt conventional GPs and the HSGPs to evolve image recognition programs for given training images. The results show that the use of hierarchical structure learning can significantly improve learning speed of GPs. To achieve the same performance, the HSGPs need only 30-40% of the computation cost needed by conventional GPs.
Hong Bo CHE Jin Wook KIM Tae Il BAE Young Hwan KIM
A new acceleration scheme that decreases the number of required iterations in relaxation methodology is proposed. The proposed scheme uses dynamic error prediction of an improved approximation to the solution during an iterative computation. The proposed scheme's application to circuit simulations required an average of 67.3% fewer iterations compared to un-accelerated relaxation methods.
Satoshi GOUNAI Tomoaki OHTSUKI
In multiple-input multiple-output (MIMO) wireless systems, the receiver must extract each transmitted signal from received signals. Iterative signal detection with belief propagation (BP) can improve the error rate performance, by increasing the number of detection and decoding iterations in MIMO systems. This number of iterations is, however, limited in actual systems because each additional iteration increases latency, receiver size, and so on. This paper proposes a convergence acceleration technique that can achieve better error rate performance with fewer iterations than the conventional iterative signal detection. Since the Log-Likelihood Ratio (LLR) of one bit propagates to all other bits with BP, improving some LLRs improves overall decoder performance. In our proposal, all the coded bits are divided into groups and only one group is detected in each iterative signal detection whereas in the conventional approach, each iterative signal detection run processes all coded bits, simultaneously. Our proposal increases the frequency of initial LLR update by increasing the number of iterative signal detections and decreasing the number of coded bits that the receiver detects in one iterative signal detection. Computer simulations show that our proposal achieves better error rate performance with fewer detection and decoding iterations than the conventional approach.
Yoichi AOYAMA Hisa NUMA Ryo FUJITA
To evaluate heat and fire phenomena caused by accumulated microslide motion on an imperfectly connected electrical terminal, an acceleration test method using a vibrator was developed. The process from the generation of CuO to that of Cu2O has been reproduced. The influence of current was investigated, and it was found that as current increases, the CuO generation time T1 and the Cu2O generation time T2 decrease for pure copper; however, when the current exceeded 3 A, neither CuO nor Cu2O could be produced. The contact resistances of a Cu terminal and wire, compared with the terminal material, were investigated in terms of the effects of current and ambient temperature.
Mitsuru TANAKA Kazuki YANO Hiroyuki YOSHIDA Atsushi KUSUNOKI
An iterative reconstruction algorithm for accelerating the estimation of the complex relative permittivity of a cylindrical dielectric object based on the multigrid optimization method (MGOM) is presented. A cost functional is defined by the norm of the difference between the measured scattered electric fields and those calculated for an estimated contrast function, which is expressed as a function of the complex relative permittivity of the object. The electromagnetic inverse scattering problem can then be treated as an optimization problem in which the contrast function is determined by minimizing the cost functional. We apply the conjugate gradient method (CGM) and the frequency-hopping technique (FHT) to the minimization of the cost functional, and also employ the multigrid method (MGM) with a V-cycle to accelerate the rate of convergence in obtaining the reconstructed profile. The reconstruction scheme is called the multigrid optimization method. Computer simulations are performed for lossy and inhomogeneous dielectric circular cylinders by using single-frequency or multifrequency scattering data. The numerical results demonstrate that the rate of convergence of the proposed method is much faster than that of the conventional CGM for both noise-free and noisy cases.
Seunglak CHOI Jinwon LEE Su Myeon KIM Junehwa SONG Yoon-Joon LEE
Most commercial Web sites dynamically generate their contents through a three-tier server architecture composed of a Web server, an application server, and a database server. In such an architecture, the database server easily becomes a bottleneck to the overall performance. In this paper, we propose WDBAccel, a high-performance database server accelerator that significantly improves the throughput of database processing. WDBAccel eliminates costly, complex query processing needed to obtain query results by reusing the results from previous queries for subsequent queries. This differentiates WDBAccel from other database cache systems, which employ traditional query processing. WDBAccel further improves its performance by fully utilizing main memory as the primary storage. This paper presents the design and implementation of the WDBAccel as well as the results of performance evaluation with a prototype.
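The general idea of answering a query from previously stored results can be sketched as follows. This is only an exact-match, in-memory result cache; how WDBAccel actually matches subsequent queries to earlier results is not described in the abstract, so the class `ResultCache`, its key normalization, and the lack of invalidation are all assumptions for illustration.

```python
class ResultCache:
    """In-memory cache that reuses results of previously seen queries."""

    def __init__(self, backend):
        # backend: callable taking a query string and returning its result,
        # standing in for the real database server (hypothetical interface).
        self.backend = backend
        self.store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql):
        # Only surrounding whitespace is normalized, so two queries share
        # an entry only when their text is otherwise identical.
        return sql.strip()

    def query(self, sql):
        key = self._key(sql)
        if key in self.store:
            self.hits += 1
            return self.store[key]       # served from main memory
        self.misses += 1
        result = self.backend(sql)       # fall back to the database
        self.store[key] = result
        return result


if __name__ == "__main__":
    calls = []

    def fake_db(sql):
        calls.append(sql)
        return [("row",)]

    cache = ResultCache(fake_db)
    cache.query("SELECT name FROM items WHERE price < 10")
    cache.query("  SELECT name FROM items WHERE price < 10  ")
    print(len(calls), cache.hits, cache.misses)
```

A production accelerator would also need to invalidate or refresh cached results when the underlying tables change, which this sketch deliberately omits.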
Masayuki HASHIMOTO Kenji MATSUO Atsushi KOIKE
This paper proposes an effective JPEG 2000 encoding method for reducing tiling artifacts, which cause one of the biggest problems in JPEG 2000 encoders. Symmetric pixel extension is generally thought to be the main factor in causing artifacts. However this paper shows that differences in quantization accuracy between tiles are a more significant reason for tiling artifacts at middle or low bit rates. This paper also proposes an algorithm that predicts whether tiling artifacts will occur at a tile boundary in the rate control process and that locally improves quantization accuracy by the original post quantization control. This paper further proposes a method for reducing processing time which is yet another serious problem in the JPEG 2000 encoder. The method works by predicting truncation points using the entropy of wavelet transform coefficients prior to the arithmetic coding. These encoding methods require no additional processing in the decoder. The experiments confirmed that tiling artifacts were greatly reduced and that the coding process was considerably accelerated.
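The quantity used for truncation point prediction, the entropy of wavelet transform coefficients, can be illustrated with a short sketch. It computes the first-order Shannon entropy of a list of already quantized coefficients; treating the coefficients as integer symbols is an assumption, and the mapping from entropy to a predicted truncation point is specific to the paper and is not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(coefficients):
    """First-order Shannon entropy, in bits per symbol.

    coefficients: sequence of quantized (hashable) coefficient values.
    Returns 0.0 for an empty sequence.
    """
    total = len(coefficients)
    if total == 0:
        return 0.0
    counts = Counter(coefficients)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    flat_block = [0, 0, 0, 0]          # nothing to encode
    busy_block = [0, 1, 0, 1]          # two equally likely symbols
    print(shannon_entropy(flat_block), shannon_entropy(busy_block))
```

Because the estimate needs only a histogram of the coefficients, it is far cheaper than running the arithmetic coder on every coding pass, which is the motivation for using it ahead of the coding step.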