Lei GUO Yuhua TANG Yong DOU Yuanwu LEI Meng MA Jie ZHOU
The effective bandwidth of the dynamic random-access memory (DRAM) for the alternate row-wise/column-wise matrix access (AR/CMA) mode, which is a basic characteristic in scientific and engineering applications, is very low. Therefore, we propose the window memory layout scheme (WMLS), which is a matrix layout scheme that does not require transposition, for AR/CMA applications. This scheme maps one row of a logical matrix into a rectangular memory window of the DRAM to balance the bandwidth of the row- and column-wise matrix access and to increase the DRAM IO bandwidth. The optimal window configuration is theoretically analyzed to minimize the total number of no-data-visit operations of the DRAM. Different WMLS implementationsare presented according to the memory structure of field-programmable gata array (FPGA), CPU, and GPU platforms. Experimental results show that the proposed WMLS can significantly improve DRAM bandwidth for AR/CMA applications. achieved speedup factors of 1.6× and 2.0× are achieved for the general-purpose CPU and GPU platforms, respectively. For the FPGA platform, the WMLS DRAM controller is custom. The maximum bandwidth for the AR/CMA mode reaches 5.94 GB/s, which is a 73.6% improvement compared with that of the traditional row-wise access mode. Finally, we apply WMLS scheme for Chirp Scaling SAR application, comparing with the traditional access approach, the maximum speedup factors of 4.73X, 1.33X and 1.56X can be achieved for FPGA, CPU and GPU platform, respectively.
Hui ZHAO Shuqiang YANG Hua FAN Zhikun CHEN Jinghu XU
Scheduling plays a key role in MapReduce systems. In this paper, we explore the efficiency of an MapReduce cluster running lots of independent and continuously arriving MapReduce jobs. Data locality and load balancing are two important factors to improve computation efficiency in MapReduce systems for data-intensive computations. Traditional cluster scheduling technologies are not well suitable for MapReduce environment, there are some in-used schedulers for the popular open-source Hadoop MapReduce implementation, however, they can not well optimize both factors. Our main objective is to minimize total flowtime of all jobs, given it's a strong NP-hard problem, we adopt some effective heuristics to seek satisfied solution. In this paper, we formalize the scheduling problem as job selection problem, a load balance aware job selection algorithm is proposed, in task level we design a strict data locality tasks scheduling algorithm for map tasks on map machines and a load balance aware scheduling algorithm for reduce tasks on reduce machines. Comprehensive experiments have been conducted to compare our scheduling strategy with well-known Hadoop scheduling strategies. The experimental results validate the efficiency of our proposed scheduling strategy.
Ahmadou Dit Adi CISSE Michihiro KOIBUCHI Masato YOSHIMI Hidetsugu IRIE Tsutomu YOSHINAGA
Silicon photonics Network-on-Chips (NoCs) have emerged as an attractive solution to alleviate the high power consumption of traditional electronic interconnects. In this paper, we propose a fully optical ring NoC that combines static and dynamic wavelength allocation communication mechanisms. A different wavelength-channel is statically allocated to each destination node for light weight communication. Contention of simultaneous communication requests from multiple source nodes to the destination is solved by a token based arbitration for the particular wavelength-channel. For heavy load communication, a multiwavelength-channel is available by requesting it in execution time from source node to a special node that manages dynamic allocation of the shared multiwavelength-channel among all nodes. We combine these static and dynamic communication mechanisms in a same network that introduces selection techniques based on message size and congestion information. Using a photonic NoC simulator based on Phoenixsim, we evaluate our architecture under uniform random, neighbor, and hotspot traffic patterns. Simulation results show that our proposed fully optical ring NoC presents a good performance by utilizing adequate static and dynamic channels based on the selection techniques. We also show that our architecture can reduce by more than half, the energy consumption necessary for arbitration compared to hybrid photonic ring and mesh NoCs. A comparison with several previous works in term of architecture hardware cost shows that our architecture can be an attractive cost-performance efficient interconnection infrastructure for future SoCs and CMPs.
We show teachability of a subclass of simple deterministic languages. The subclass we define is called stack uniform simple deterministic languages. Teachability is derived by showing the query learning algorithm for this language class. Our learning algorithm uses membership, equivalence and superset queries. Then, it terminates in polynomial time. It is already known that simple deterministic languages are polynomial time query learnable by context-free grammars. In contrast, our algorithm guesses a hypothesis by a stack uniform simple deterministic grammar, thus our result is strict teachability of the subclass of simple deterministic languages. In addition, we discuss parameters of the polynomial for teachability. The “thickness” is an important parameter for parsing and it should be one of parameters to evaluate the time complexity.
Chen JI Jiang WU Dongming WANG Xiaohu YOU
We analyze a power adaptation method to maximize the achievable rate under the finite block length regime, for MIMO block fading channel with channel state information available at both the transmitter and receiver side. We find a convex approximation to the lower bound of the achievable rate, and it leads to a simple power and rate adaptation method. We show that the method achieves near optimal channel rate under the finite block length regime. Compared to the classical waterfilling method, the proposed method can further improve achievable rate especially for short block lengths.
Takahiro OTA Hiroyoshi MORITA Adriaan J. de Lind van WIJNGAARDEN
This paper presents a real-time and memory-efficient arrhythmia detection system with binary classification that uses antidictionary coding for the analysis and classification of electrocardiograms (ECGs). The measured ECG signals are encoded using a lossless antidictionary encoder, and the system subsequently uses the compression rate to distinguish between normal beats and arrhythmia. An automated training data procedure is used to construct the automatons, which are probabilistic models used to compress the ECG signals, and to determine the threshold value for detecting the arrhythmia. Real-time computer simulations with samples from the MIT-BIH arrhythmia database show that the averages of sensitivity and specificity of the proposed system are 97.8% and 96.4% for premature ventricular contraction detection, respectively. The automatons are constructed using training data and comprise only 11 kilobytes on average. The low complexity and low memory requirements make the system particularly suitable for implementation in portable ECG monitors.
Md. Ezharul ISLAM Nobuo FUNABIKI Toru NAKANISHI Kan WATANABE
Nowadays, with spreads of inexpensive small communication devices, a number of wireless local area networks (WLANs) have been deployed even in the same building for the Internet access services. Their wireless access-points (APs) are often independently installed and managed by different groups such as departments or laboratories in a university or a company. Then, a user host can access to multiple WLANs by detecting signals from their APs, which increases the energy consumption and the operational cost. It may also degrade the communication performance by increasing interferences. In this paper, we present an AP aggregation approach to solve these problems in multiple WLAN environments by aggregating deployed APs of different groups into limited ones using virtual APs. First, we formulate the AP aggregation problem as a combinatorial optimization problem and prove the NP-completeness of its decision problem. Then, we propose its heuristic algorithm composed of five phases. We verify the effectiveness through extensive simulations using the WIMNET simulator.
Yoichi SATO Ichiro FUKUDA Tomonori FUJITA
The use of computing resources on network is becoming active in the Internet and private networks. OpenFlow/Software-Defined Networking (SDN) is drawing attention as a method to control network virtualization for the cloud computing services and other carrier services. This paper introduces examples of OpenFlow/SDN technologies applied to commercial cloud services. Various activities to expand coverage over commercial carrier networks are also mentioned.
Shohei SATO Yoshiaki YAMAGUCHI Ryuji SUGITA
The uniform magnetic field of various strength was applied to the perpendicularly and in-plane demagnetized media, and the change in each magnetic cluster state was investigated as the fundamental investigation of the influence of demagnetization method on noise during signal recording on the stacked perpendicular recording media. The results showed that the in-plane demagnetization can achieve lower noise level if the recording field is not very high. In other words, the in-plane demagnetization is an effective way to achieve lower noise in transition area, near track edge of recorded bit, and in high-density bit. In addition, the simulation clarified that this noise reduction can be explained using the idea of sub-domain structure in the in-plane demagnetized media.
Deposition of inclined anisotropy film for bit-patterned media was studied using an oblique incidence collimated sputtering. Pt underlayer increased the inclination angle of magnetic layer more than 5° on a Ta seed layer. Further increase of the angle was obtained by annealing Pt/Ru underlayer resulting an inclination angle of 9.4° for a Co-Cr15.5 film on the underlayer. The magnetic properties of the Co-Cr15.5 film with an inclined orientation was estimated comparing measured hysteresis loops with simulated ones, which indicated to have inclined magnetic anisotropy with an anisotropy field of about 4.5kOe and a deflection angle of the anisotropy about the same as that of the crystalline orientation.
Autthasith ARRAYANGKOOL Chanon WARISARN Piya KOVINTAVEWAT
To achieve high recording density in a bit-patterned media recording system, the spacing between data bit islands in both the along-track and the across-track directions must be decreased, thus leading to the increase of two-dimensional (2D) interference. One way to reduce the 2D interference is to apply a 2D coding scheme on a data sequence before recording; however, this method usually requires many redundant bits, thus lowering a code rate. Therefore, we propose a novel 2D coding scheme referred to as a recorded-bit patterning (RBP) scheme to mitigate the 2D interference, which requires no redundant bits at the expense of using more buffer memory. Specifically, an input data sequence is first split into three tracks in which will then be rotated to find the best 3-track data pattern based on a look-up table before recording, such that the shifted data tracks yield the least effect of 2D interference in the readback signal. Numerical results indicate that the proposed RBP scheme provides a significant performance improvement if compared to a conventional system (without 2D coding), especially when the recording density is high and/or the position jitter noise is large.
Yichao LU Gang HE Guifen TIAN Satoshi GOTO
Recently, non-binary low-density parity-check (NB-LDPC) codes starts to show their superiority in achieving significant coding gains when moderate codeword lengths are adopted. However, the overwhelming decoding complexity keeps NB-LDPC codes from being widely employed in modern communication devices. This paper proposes a hybrid message-passing decoding algorithm which consumes very low computational complexity. It achieves competitive error performance compared with conventional Min-max algorithm. Simulation result on a (255,174) cyclic code shows that this algorithm obtains at least 0.5dB coding gain over other state-of-the-art low-complexity NB-LDPC decoding algorithms. A partial-parallel NB-LDPC decoder architecture for cyclic NB-LDPC codes is also developed based on this algorithm. Optimization schemes are employed to cut off hard decision symbols in RAMs and also to store only part of the reliability messages. In addition, the variable node units are redesigned especially for the proposed algorithm. Synthesis results demonstrate that about 24.3% gates and 12% memories can be saved over previous works.
Ahmed BOUDISSA Joo Kooi TAN Hyoungseop KIM Takashi SHINOMIYA Seiji ISHIKAWA
This paper introduces a simple algorithm for pedestrian detection on low resolution images. The main objective is to create a successful means for real-time pedestrian detection. While the framework of the system consists of edge orientations combined with the local binary patterns (LBP) feature extractor, a novel way of selecting the threshold is introduced. Using the mean-variance of the background examples this threshold improves significantly the detection rate as well as the processing time. Furthermore, it makes the system robust to uniformly cluttered backgrounds, noise and light variations. The test data is the INRIA pedestrian dataset and for the classification, a support vector machine with a radial basis function (RBF) kernel is used. The system performs at state-of-the-art detection rates while being intuitive as well as very fast which leaves sufficient processing time for further operations such as tracking and danger estimation.
We propose a new face relighting method using an illuminance template generated from a single reference portrait. First, the reference is wrapped according to the shape of the target. Second, we employ a new spatially variant edge-preserving smoothing filter to remove the facial identity and texture details of the wrapped reference, and obtain the illumination template. Finally, we relight the target with the template in CIELAB color space. Experiments show the effectiveness of our method for both grayscale and color faces taken from different databases, and the comparisons with previous works demonstrate a better relighting effect produced by our method.
The “Blind Men and an Elephant” is an old Indian story about a group of blind men who encounter an elephant and do not know what it is. This story describes the difficulties of understanding a large concept or global view based on only local information. Modern technologies enable us to easily obtain and retain local information. However, simply collecting local information does not give us a global view, as evident in this old story. This paper gives a concrete model of this story on the plane to theoretically and mathematically discuss it. It analyzes what information we can obtain from collected local information. For a convex target object modeling the elephant and a convex sensing area, it is proven that the size and perimeter length of the target object are the only parameters that can be observed by randomly deployed sensors modeling the blind men. To increase the number of observable parameters, this paper argues that non-convex sensing areas are important and introduces composite sensor nodes as an approach to implement non-convex sensing areas. The paper also derives a model on the discrete space and analyzes it. The analysis results on the discrete space are applicable to some network related issues such as link quality estimation in a part of a network based on end-to-end probing.
Yuuki ARAGA Nao UEDA Yasumasa TAKAGI Makoto NAGATA
A probing front end circuit (PFE) senses and digitizes voltage noises at in-circuit locations on such as power supply wiring and substrate taps in a chip, with the simplest circuit construction only with a source follower or a unity gain buffer, followed by a latch comparator. The PFE with 2.5V I/O transistors in a 65nm CMOS technology node demonstrates 9.0 ENOB and 60.7dB SFDR in equivalent sampling at 1.0Gs/s, for a sinusoidal waveform of 10MHz with 200mV peak-to-peak amplitude. Behavioral modeling of an entire waveform acquisition system using PFEs includes the statistical variations of reference voltage and sampling timing. The simulation quantitatively explains the measured dynamic properties of on-chip noise monitoring, such as the AC response in SNDR and digitizing throughputs, with the clear dependency on the frequency and amplitude of input waveforms.
In this paper, we present a pure hardware implementation of the advanced encryption standard (AES) with 8-bit data path with both encryption/decryption abilities for applications of wireless network. To achieve the requirements of low area resource and high throughput performance, the 8-bit AES design overlaps the MixColumns (MC) and ShiftRows (SR), Inverse MixColumns (IMC) and Inverse ShiftRows (ISR) operations in order to reduce the required clock cycles and critical path delay of transformations involved. The combinations of SB with ISB, MC with IMC, and SR with ISR can effectively reduce the area cost of the AES realization. We implement the AES processor in an ASIC chip. The design has the area cost of 4.3 k-gates with throughput of 72Mbps which can meet the throughput requirement of IEEE 802.11g wireless network standard. From the experimental results, we observe that our AES design has better performance compared with other previous designs.
There is a well known Steiner tree algorithm called minimum-cost paths heuristic (MPH), which is used for many multicast network operations and is considered a benchmark for other Steiner tree algorithms. MPH's average case time complexity is O(m(l+nlog n)), where m is the number of end nodes, n is the number of nodes, and l is the number of links in the network, because MPH has to run Dijkstra's algorithm as many times as the number of end nodes. The author recently proposed a Steiner tree algorithm called branch-based multi-cast (BBMC), which produces exactly the same multicast tree as MPH in a constant processing time irrespective of the number of multicast end nodes. However, the theoretical result for the average case time complexity of BBMC was expressed as O(log m(l+nlog n)) and could not accurately reflect the above experimental result. This paper proves that the average case time complexity of BBMC can be shortened to O(l+nlog n), which is independent of the number of end nodes, when there is an upper limit of the node degree, which is the number of links connected to a node. In addition, a new parameter β is applied to BBMC, so that the multicast tree created by BBMC has less links on it. Even though the tree costs increase due to this parameter, the tree cost increase rates are much smaller than the link decrease rates.
Takayuki NOZAKI Masaki MAEHARA Kenta KASAI Kohichi SAKANIWA
This paper derives the average symbol and bit weight distributions for the irregular non-binary cluster low-density parity-check (LDPC) code ensembles. Moreover, we give the exponential growth rates of the average weight distributions in the limit of large code length. We show the condition that the typical minimum distances linearly grow with the code length.
Taeko MATSUNAGA Shinji KIMURA Yusuke MATSUNAGA
Multi-operand adders that calculate the summation of more than two operands usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, and carry-propagate adders for the two operands in the ASIC implementation. Compressor trees that consist of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, and carry-chains or dedicated structures have been utilized to produce multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as the building blocks of compressor trees. This paper addresses the problem of synthesizing compressor trees based on GPCs. Based on the observation that characteristics such as the area, power, and delay correlate roughly to the total number and the maximum level of GPCs, the target problem can be regarded as a minimization problem for the total number of GPCs and the maximum levels of the GPCs, for which an ILP-based approach is proposed. The key point of our formulation is not to model the problem based on the structures of compressor trees like the existing approach, but instead the compression process itself is used to reduce the number of variables and constraints in the ILP formulation. The experimental results demonstrate the advantage of our formulation in terms of the quality and runtime.