Tadayoshi ENOMOTO Nobuaki KOBAYASHI
A motion estimation (ME) multimedia processor was developed by employing dynamic voltage and frequency scaling (DVFS) technique to greatly reduce the power dissipation. To make full use of the advantages of DVFS technique, a fast motion estimation (ME) algorithm was also developed. It can adaptively predict the optimum supply voltage and the optimum clock frequency before ME process starts for each macro-block for encoding. Power dissipation of the 90-nm CMOS DVFS controlled multimedia processor, which contained an absolute difference accumulator as well as a small on-chip DC/DC level converter, a minimum value detector and DVFS controller, was reduced to 38.48 µW, which was only 3.261% that of a conventional multimedia processor.
Complex-valued Hopfield associative memory (CHAM) is one of the most promising neural network models to deal with multilevel information. CHAM has an inherent property of rotational invariance. Rotational invariance is a factor that reduces a network's robustness to noise, which is a critical problem. Here, we proposed complex-valued bipartite auto-associative memory (CBAAM) to solve this reduction in noise robustness. CBAAM consists of two layers, a visible complex-valued layer and an invisible real-valued layer. The invisible real-valued layer prevents rotational invariance and the resulting reduction in noise robustness. In addition, CBAAM has high parallelism, unlike CHAM. By computer simulations, we show that CBAAM is superior to CHAM in noise robustness. The noise robustness of CHAM decreased as the resolution factor increased. On the other hand, CBAAM provided high noise robustness independent of the resolution factor.
Several models of feed-forward complex-valued neural networks have been proposed, and those with split and polar-represented activation functions have been mainly studied. Neural networks with split activation functions are relatively easy to analyze, but complex-valued neural networks with polar-represented functions have many applications but are difficult to analyze. In previous research, Nitta proved the uniqueness theorem of complex-valued neural networks with split activation functions. Subsequently, he studied their critical points, which caused plateaus and local minima in their learning processes. Thus, the uniqueness theorem is closely related to the learning process. In the present work, we first define three types of reducibility for feed-forward complex-valued neural networks with polar-represented activation functions and prove that we can easily transform reducible complex-valued neural networks into irreducible ones. We then prove the uniqueness theorem of complex-valued neural networks with polar-represented activation functions.
In recent years, applications of complex-valued neural networks have become wide spread. Quaternions are an extension of complex numbers, and neural networks with quaternions have been proposed. Because quaternion algebra is non-commutative algebra, we can consider two orders of multiplication to calculate weighted input. However, both orders provide almost the same performance. We propose hybrid quaternionic Hopfield neural networks, which have both orders of multiplication. Using computer simulations, we show that these networks outperformed conventional quaternionic Hopfield neural networks in noise tolerance. We discuss why hybrid quaternionic Hopfield neural networks improve noise tolerance from the standpoint of rotational invariance.
Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
As energy consumption of cache memories increases, an energy-efficient cache management mechanism is required. While a dynamic cache resizing mechanism is one promising approach to the energy reduction of microprocessors, one problem is that its effect is limited by the existence of dead-on-fill blocks, which are not used until their evictions from the cache memory. To solve this problem, this paper proposes a cache management policy named FLEXII, which can reduce the number of dead-on-fill blocks and help dynamic cache resizing mechanisms further reduce the energy consumption of the cache memories.
James OKELLO Shin'ichi ARITA Yoshio ITOH Yutaka FUKUI Masaki KOBAYASHI
In this paper we propose a new simplified algorithm for cascaded second order adaptive notch filters implemented using an allpass filter, for elimination of multiple sinusoids. Each of the stages of the notch filter is implemented using direct form second order allpass filter. We also present an analysis which compares the proposed algorithm with the conventional simplified algorithm, and which indicates that the proposed algorithm has a reduced bias in the estimation of the multiple input sinusoids. Simulation results that have been provided confirm this analysis.
Shunsuke TSUKADA Hikaru TAKAYASHIKI Masayuki SATO Kazuhiko KOMATSU Hiroaki KOBAYASHI
A hybrid memory architecture (HMA) that consists of some distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, the HMA needs the metadata for data management since the data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for the metadata reference. However, as the amount of the metadata increases in proportion to the size of the HMA, the memory controller needs to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata and increases the number of metadata references. This results in an increase in the access latency to reach the target data and degrades the performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effect of the metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference that is the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve the instructions per cycle by up to 44% and 9% on average.
Shoichi HIRASAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
Automatic performance tuning of a practical application could be time-consuming and sometimes infeasible, because it often needs to evaluate the performances of a large number of code variants to find the best one. In this paper, hence, a light-weight rollback mechanism is proposed to evaluate each of code variants at a low cost. In the proposed mechanism, once one code variant of a target code block is executed, the execution state is rolled back to the previous state of not yet executing the block so as to repeatedly execute only the block to find the best code variant. It also has a feature of terminating a code variant whose execution time is longer than the shortest execution time so far. As a result, it can prevent executing the whole application many times and thus reduces the timing overhead of an auto-tuning process required for finding the best code variant.
Muhammad ALFIAN AMRIZAL Atsuya UNO Yukinori SATO Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
Kouji SHIBATA Masaki KOBAYASHI
In this study, expressions were compared with reference material using the coaxial feed-type open-ended cut-off circular waveguide reflection method to support simple and instantaneous evaluation of dielectric constants in small amounts of scarce liquids over a broad frequency range. S11 values were determined via electromagnetic analysis for individual jig structure conditions and dielectric property values without actual S11 measurement under the condition that the tip of the measurement jig with open and short-ended conditions and with the test material inserted. Next, information on the relationships linking jig structure, dielectric properties and S11 properties was stored on a database to simplify the procedure and improve accuracy in reference material evaluation. The accuracy of the estimation formula was first theoretically verified for cases in which values indicating the dielectric properties of the reference material and the actual material differed significantly to verify the effectiveness of the proposed method. The results indicated that dielectric property values for various liquids measured at 0.5 and 1.0GHz using the proposed method corresponded closely to those obtained using the method previously proposed by the authors. The effectiveness of the proposed method was evaluated by determining the dielectric properties of certain liquids at octave-range continuous frequencies between 0.5 and 1.0GHz based on interpolation from limited data of several frequencies. The results indicated that the approach enables quicker and easier measurement to establish the complex permittivity of liquids over a broad frequency range than the previous method.
Nobuaki KOBAYASHI Tadayoshi ENOMOTO
To completely utilize the advantages of dynamic voltage and frequency scaling (DVFS) techniques, a quantized decoder (QNT-D) was developed. The QNT-D generates a quantized signal processing quantity (Q) using a predicted signal processing quantity (M). Q is used to produce the optimum frequency (opt.fc) and the optimum supply voltage (opt.VD) that are proportional to Q. To develop a DVFS controlled motion estimation (ME) processor, we used both the QNT-D and a fast ME algorithm called A2BC (Adaptively Assigned Breaking-off Condition) to predict M for each macro-block (MB). A DVFS controlled ME processor was fabricated using 90-nm CMOS technology. The total power dissipation (PT) of the processor was significantly reduced and varied from 38.65 to 99.5 µW, only 3.27 to 8.41 % of PT of a conventional ME processor, depending on the test video picture.
Hazaoud AHMED Etsuro HAYAHARA Masaki KOBAYASHI Yoshio ITOH
This paper describes a digital filter realization method by simulating an LCR filter. Having the node equation of an original LCR filter the frequency variable s is transformed into a z one using the bilinear transformation. The resulting network equation can be digitally realized with the same transfer function as the original LCR filter. Using such a method, the circuit either has a large error in the transfer response near zero frequencies or causes oscillations. A technique to avoid this problem by a simple modification of the multiplier coefficients is shown. A fifth order elliptic filter is presented with illustrative comparison to classical cascade structure.
Kazuki SHIOGAI Naoto SASAOKA Masaki KOBAYASHI Isao NAKANISHI James OKELLO Yoshio ITOH
Conventional adaptive notch filter based on an infinite impulse response (IIR) filter is well known. However, this kind of adaptive notch filter has a problem of stability due to its adaptive IIR filter. In addition, tap coefficients of this notch filter converge to solutions with bias error. In order to solve these problems, an adaptive notch filter using Fourier sine series (ANFF) is proposed. The ANFF is stable because an adaptive IIR filter is not used as an all-pass filter. Further, the proposed adaptive notch filter is robust enough to overcome effects of a disturbance signal, due to a structure of the notch filter based on an exponential filter and line symmetry of auto correlation.
Ye GAO Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
Vector processors have significant advantages for next generation multimedia applications (MMAs). One of the advantages is that vector processors can achieve high data transfer performance by using a high bandwidth memory sub-system, resulting in a high sustained computing performance. However, the high bandwidth memory sub-system usually leads to enormous costs in terms of chip area, power and energy consumption. These costs are too expensive for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems called MVP-cache in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array in a sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache realizes less static power consumption on its tag arrays. MVP-cache can also achieve high efficiency on short vector data transfers because the flexibility of data transfers can be improved by independently controlling the data transfers of each data array.
James OKELLO Yoshio ITOH Yutaka FUKUI Masaki KOBAYASHI
Adaptive infinite impulse response (IIR) digital filter implemented using a cascade of second order direct form allpass filters and a finite impulse response (FIR) filter, has the property of its poles converging to those of the unknown system. In this paper we implement the adaptive allpass-FIR digital filter using a lattice allpass filter with minimum number of multipliers. We then derive a simple adaptive algorithm, which does not increase the overall number of multipliers of the proposed adaptive digital filter (ADF) in comparison to the ADF that uses the direct form allpass filter. The proposed structure and algorithm exhibit a kind of orthogonality, which ensures convergence of the poles of the ADF to those of the unknown system. Simulation results confirm this convergence.
Taira NAKAJIMA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI Tadao NAKAMURA
We propose a learning algorithm for self-organizing neural networks to form a topology preserving map from an input manifold whose topology may dynamically change. Experimental results show that the network using the proposed algorithm can rapidly adjust itself to represent the topology of nonstationary input distributions.
Ken NAKAMURA Yuya OMORI Daisuke KOBAYASHI Koyo NITTA Kimikazu SANO Masayuki SATO Hiroe IWASAKI Hiroaki KOBAYASHI
This paper proposes an efficient reference image sharing method for the image-division parallel video encoding architecture. This method efficiently reduces the amount of data transfer by using pre-transfer with area prediction and on-demand transfer with a transfer management table. Experimental results show that the data transfer can be reduced to 19.8-35.3% of the conventional method on average without major degradation of coding performance. This makes it possible to reduce the required bandwidth of the inter-chip transfer interface by saving the amount of data transfer.
Tadayoshi ENOMOTO Nobuaki KOBAYASHI
We developed a self-controllable voltage level (SVL) circuit and applied this circuit to a single-power-supply, six-transistor complementary metal-oxide-semiconductor static random-access memory (SRAM) to not only improve both write and read performances but also to achieve low standby power and data retention (holding) capability. The SVL circuit comprises only three MOSFETs (i.e., pull-up, pull-down and bypass MOSFETs). The SVL circuit is able to adaptively generate both optimal memory cell voltages and word line voltages depending on which mode of operation (i.e., write, read or hold operation) was used. The write margin (VWM) and read margin (VRM) of the developed (dvlp) SRAM at a supply voltage (VDD) of 1V were 0.470 and 0.1923V, respectively. These values were 1.309 and 2.093 times VWM and VRM of the conventional (conv) SRAM, respectively. At a large threshold voltage (Vt) variability (=+6σ), the minimum power supply voltage (VMin) for the write operation of the conv SRAM was 0.37V, whereas it decreased to 0.22V for the dvlp SRAM. VMin for the read operation of the conv SRAM was 1.05V when the Vt variability (=-6σ) was large, but the dvlp SRAM lowered it to 0.41V. These results show that the SVL circuit expands the operating voltage range for both write and read operations to lower voltages. The dvlp SRAM reduces the standby power consumption (PST) while retaining data. The measured PST of the 2k-bit, 90-nm dvlp SRAM was only 0.957µW at VDD=1.0V, which was 9.46% of PST of the conv SRAM (10.12µW). The Si area overhead of the SVL circuits was only 1.383% of the dvlp SRAM.
Masaaki KOBAYASHI Yasutoshi YAMAMOTO Shinichi AKI Yoshitomi NAGAOKA
To study lower sideband enhancement, three kinds of modulation methods are applied to recording. Compared with results of experiments using a VCR and simulations, it is concluded that the order of magnitude of lower sideband enhancement is AM, Dual-frequency, FM with lower modulation index.
Hitoshi YAMAUCHI Takayuki MAEDA Hiroaki KOBAYASHI Tadao NAKAMURA
The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. Especially, we focus on the performance evaluation of the view-dependent object-space parallel processing on the (Mπ)2 which has been proposed in our previous paper. We also propose two kinds of distributed frame buffer system named cached frame buffer and multistage-interconnected frame buffer. These frame buffer systems can solve the access conflict problem on the frame buffer. The simulation results show that the (Mπ)2 has a scalable performance. For example, the (Mπ)2 with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)2 in the case that a large number of high-performance processing elements are adopted in the system.