
Keyword Search Result

[Keyword] PAR (2741 hits)

Showing 781-800 of 2741 hits

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO

    PAPER-Data Compression
    Vol: E96-A No:12, Page(s): 2351-2359

    Despite continuous improvement in the computational power of multi/many-core processors, memory access performance has not improved commensurately, so the overall performance of recent processors is often limited by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) is effective for improving processor performance because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to those of LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC; the synthesized result shows that the decoder can decompress a 32-byte cache line in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
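
    As a minimal illustration of the base-plus-delta idea described above, the sketch below encodes a cache line as one base word plus short XOR differentials; the helper names and the XOR choice are illustrative assumptions, not the paper's actual encoder.

        # Sketch: base + differential encoding of one cache line
        # (hypothetical helper, not the paper's algorithm).
        def encode_line(words):
            base = words[0]                      # base pattern
            deltas = [w ^ base for w in words]   # differential patterns
            width = max(d.bit_length() for d in deltas)  # bits per delta
            return base, width, deltas

        def decode_line(base, deltas):
            # Each word decodes independently, enabling parallel decoding.
            return [base ^ d for d in deltas]

        line = [0x08048AB0, 0x08048AB4, 0x08048AB8, 0x08048ABC]  # periodic data
        base, width, deltas = encode_line(line)
        assert decode_line(base, deltas) == line  # lossless round trip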

  • Synchronization-Aware Virtual Machine Scheduling for Parallel Applications in Xen

    Cheol-Ho HONG, Chuck YOO

    LETTER
    Vol: E96-D No:12, Page(s): 2720-2723

    In this paper, we propose a synchronization-aware VM scheduler for parallel applications in Xen. The proposed scheduler prevents threads from waiting a significant amount of time during synchronization. For this purpose, we propose a scheme that identifies threads that have been waiting for other threads for a long time, built on a detection module that infers the internal status of guest OSs. We also present a scheduling policy that accelerates bottlenecks in concurrent VMs. We implemented our VM scheduler in a recent Xen hypervisor with para-virtualized Linux-based operating systems, and show that our approach can improve the performance of concurrent VMs by up to 43% compared to the credit scheduler.

  • Construction of High Rate Punctured Convolutional Codes by Exhaustive Search and Partial Search

    Sen MORIYA, Hiroshi SASANO

    PAPER-Coding Theory
    Vol: E96-A No:12, Page(s): 2374-2381

    We consider two methods for constructing high-rate punctured convolutional codes. First, we present the best high-rate R=(n-1)/n punctured convolutional codes, for n=5,6,…,16, obtained by exhaustive search. To find the best code, we use a regular convolutional code whose weight spectrum is equivalent to that of each punctured convolutional code, and search these equivalent codes for the best one. Second, we present a method that finds good punctured convolutional codes by partial search, examining only codes derived from the rate-1/2 original codes obtained by the first method. This partial search yields good punctured convolutional codes considerably faster than exhaustively searching for the best codes.
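
    To make the puncturing construction concrete, here is a small sketch of deriving a rate-3/4 code from a rate-1/2 encoder output; the puncturing pattern is a hypothetical example, not one of the paper's best codes.

        # Sketch: puncture a rate-1/2 coded stream to rate 3/4.
        # pattern[i] = keep-flags for the two coded bits of input bit i (cyclic).
        def puncture(coded_bits, pattern):
            out = []
            for i in range(0, len(coded_bits), 2):
                keep0, keep1 = pattern[(i // 2) % len(pattern)]
                if keep0:
                    out.append(coded_bits[i])
                if keep1:
                    out.append(coded_bits[i + 1])
            return out

        # 3 input bits -> 6 coded bits -> 4 transmitted bits (rate 3/4).
        pattern_3_4 = [(1, 1), (1, 0), (0, 1)]
        print(puncture([0, 1, 1, 0, 1, 1], pattern_3_4))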

  • Optimal Parallel Algorithms for Computing the Sum, the Prefix-Sums, and the Summed Area Table on the Memory Machine Models

    Koji NAKANO

    PAPER
    Vol: E96-D No:12, Page(s): 2626-2634

    The main contribution of this paper is to show optimal parallel algorithms that compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. We then show that $\Omega(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of n numbers in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Finally, we show that the summed area table of size $\sqrt{n}\times\sqrt{n}$ can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Since computing the prefix-sums and the summed area table is at least as hard as computing the sum, these parallel algorithms are also optimal.
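
    The relation among the three problems can be seen in a small sequential sketch: the summed area table is the prefix-sums applied first along rows and then along columns (this shows only the definitions; the paper's contribution is the parallel DMM/UMM complexity).

        # Sequential reference: prefix-sums and summed area table.
        def prefix_sums(a):
            out, s = [], 0
            for x in a:
                s += x
                out.append(s)
            return out

        def summed_area_table(grid):
            rows = [prefix_sums(r) for r in grid]  # row-wise pass
            # Transpose, prefix-sum each column, transpose back.
            return [list(c) for c in zip(*map(prefix_sums, zip(*rows)))]

        print(summed_area_table([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]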

  • A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

    Yasuaki ITO, Koji NAKANO

    PAPER
    Vol: E96-D No:12, Page(s): 2596-2603

    This paper presents a GPU (Graphics Processing Units) implementation of dynamic programming for the optimal polygon triangulation. GPUs can now be used for general-purpose parallel computation; users can develop parallel programs running on GPUs using the CUDA (Compute Unified Device Architecture) programming architecture provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem that finds a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved by dynamic programming in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. Our implementation uses two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential CUDA kernel calls and to select the best block size and block count for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access to the GPU's global memory, minimizing memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, whereas a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
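
    For reference, the sequential O(n^3) dynamic program being parallelized looks like the following sketch; the triangle weight function here is an illustrative placeholder.

        import math

        # m[i][j] = minimum weight of triangulating the sub-polygon v_i..v_j.
        def optimal_triangulation(w, n):
            m = [[0.0] * n for _ in range(n)]
            for gap in range(2, n):
                for i in range(n - gap):
                    j = i + gap
                    m[i][j] = min(m[i][k] + m[k][j] + w(i, k, j)
                                  for k in range(i + 1, j))
            return m[0][n - 1]

        # Toy weight: triangle perimeter on a regular hexagon.
        pts = [(math.cos(2 * math.pi * t / 6), math.sin(2 * math.pi * t / 6))
               for t in range(6)]
        def w(i, k, j):
            d = lambda a, b: math.dist(pts[a], pts[b])
            return d(i, k) + d(k, j) + d(i, j)
        print(optimal_triangulation(w, 6))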

  • An Exact Approach for GPC-Based Compressor Tree Synthesis

    Taeko MATSUNAGA, Shinji KIMURA, Yusuke MATSUNAGA

    PAPER-Logic Synthesis, Test and Verification
    Vol: E96-A No:12, Page(s): 2553-2560

    Multi-operand adders, which compute the sum of more than two operands, usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, followed by a carry-propagate adder for the remaining two operands in ASIC implementations. Compressor trees consisting of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, so carry chains or dedicated structures have been utilized to build multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as building blocks. This paper addresses the problem of synthesizing GPC-based compressor trees. Based on the observation that characteristics such as area, power, and delay correlate roughly with the total number of GPCs and the maximum GPC level, the target problem can be regarded as minimizing the total number of GPCs and the maximum level, for which an ILP-based approach is proposed. The key point of our formulation is that, rather than modeling the structure of the compressor tree as in the existing approach, it models the compression process itself, which reduces the number of variables and constraints in the ILP formulation. Experimental results demonstrate the advantage of our formulation in terms of quality and runtime.
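
    The "compression process" view can be illustrated by tracking only the number of bits per column per level. The toy sketch below greedily applies the (3;2) counter (a full adder) until every column holds at most two bits; the paper instead chooses among general GPCs with an exact ILP, so this is a simplification for intuition only.

        # Toy compression process with (3;2) counters only.
        def compress(cols):
            """cols[c] = number of partial-product bits of weight 2^c."""
            level = counters = 0
            while any(n > 2 for n in cols):
                nxt = [0] * (len(cols) + 1)
                for c, n in enumerate(cols):
                    fa = n // 3                  # counters placed in column c
                    counters += fa
                    nxt[c] += (n - 3 * fa) + fa  # leftover bits + sum bits
                    nxt[c + 1] += fa             # carry bits
                cols, level = nxt, level + 1
            return level, counters

        print(compress([9, 9, 9, 9]))  # (levels, counters) for a toy tree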

  • Time Shift Parameter Setting of Temporal Decorrelation Source Separation for Periodic Gaussian Signals

    Takeshi AMISHIMA, Kazufumi HIRATA

    PAPER-Sensing
    Vol: E96-B No:12, Page(s): 3190-3198

    Temporal Decorrelation source SEParation (TDSEP) is a blind separation scheme that utilizes the time structure of the source signals, typically their periodicities. The advantage of TDSEP over non-Gaussianity-based methods is that it can separate Gaussian signals as long as they are periodic. However, its shortcoming is that separation performance (SEP) depends heavily on the values of the time shift parameters (TSPs). This paper proposes a method to automatically and blindly estimate a set of TSPs that achieves optimal SEP for periodic Gaussian signals. It is also shown that selecting as many TSPs as there are source signals is sufficient for optimal SEP; adding more TSPs does not improve SEP but only increases the computational complexity. A simulation example shows that the SEP is approximately 20 dB higher than that of the conventional method, and that the proposed method selects exactly as many TSPs as there are incoming signals.
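
    The quantity TDSEP works with is the set of time-shifted correlation matrices of the observed mixtures, one per TSP, which are then jointly diagonalized. The sketch below computes such a matrix; the joint diagonalization and the paper's automatic TSP selection are omitted.

        import numpy as np

        def shifted_correlation(X, tau):
            """X: (channels, samples) mixtures; returns symmetrized R(tau)."""
            A, B = X[:, tau:], X[:, :X.shape[1] - tau]
            R = A @ B.T / A.shape[1]
            return (R + R.T) / 2

        rng = np.random.default_rng(0)
        t = np.arange(2000)
        S = np.vstack([np.sin(2 * np.pi * t / 50),    # period-50 source
                       np.sin(2 * np.pi * t / 91)])   # period-91 source
        X = rng.standard_normal((2, 2)) @ S           # unknown mixing
        print(shifted_correlation(X, 50))             # TSP near a source period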

  • Study of Multi-Cell Interference in a 2-Hop OFDMA Virtual Cellular Network

    Gerard J. PARAISON, Eisuke KUDOH

    PAPER-Wireless Communication Technologies
    Vol: E96-B No:12, Page(s): 3163-3171

    In the literature, many resource allocation schemes have been proposed for multi-hop networks; however, the analyses focus mainly on the single-cell case. Inter-cell interference severely degrades the performance of a wireless mobile network, so incorporating it into the analysis is required to more fully understand a scheme's performance. The authors have previously proposed a parallel relaying scheme for a 2-hop OFDMA virtual cellular network (VCN). The purpose of this paper is to study a version of that scheme that considers a multi-cell environment and to evaluate the performance of the VCN. The ergodic channel capacity and outage capacity of the VCN in the presence of inter-cell interference are evaluated, and the results are compared to those of the single-hop network (SHN). Furthermore, the effect of the location and number of wireless ports on the channel capacity of the VCN is investigated, and the degree of fairness of the VCN is compared with that of the SHN. Computer simulations show that, in the presence of inter-cell interference, a) the VCN outperforms the SHN even in the interference-dominant transmission power region (with a single cell, the VCN is better than the SHN only in the noise-dominant region), b) the channel capacity of the VCN remains greater than that of the SHN even if the VCN is fully loaded, c) an optimal distance ratio for the location of the wireless ports lies in the interval 0.2∼0.4, d) increasing the number of wireless ports from 3 to 6 increases the channel capacity of the VCN, and e) the VCN achieves better outage capacity than the SHN.

  • Hybrid Message-Passing Algorithm and Architecture for Decoding Cyclic Non-binary LDPC Codes

    Yichao LU, Gang HE, Guifen TIAN, Satoshi GOTO

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E96-A No:12, Page(s): 2652-2659

    Non-binary low-density parity-check (NB-LDPC) codes have begun to show their superiority in achieving significant coding gains at moderate codeword lengths. However, their overwhelming decoding complexity keeps NB-LDPC codes from being widely employed in modern communication devices. This paper proposes a hybrid message-passing decoding algorithm with very low computational complexity that achieves error performance competitive with the conventional Min-max algorithm. Simulation results on a (255,174) cyclic code show that the algorithm obtains at least 0.5 dB of coding gain over other state-of-the-art low-complexity NB-LDPC decoding algorithms. A partial-parallel decoder architecture for cyclic NB-LDPC codes is also developed based on this algorithm. Optimization schemes are employed to remove hard-decision symbols from RAMs and to store only part of the reliability messages. In addition, the variable node units are redesigned specifically for the proposed algorithm. Synthesis results demonstrate that about 24.3% of gates and 12% of memory can be saved over previous works.

  • Multilane Hashing Mode Suitable for Parallel Processing

    Hidenori KUWAKADO, Shoichi HIROSE

    PAPER-Information Security
    Vol: E96-A No:12, Page(s): 2434-2442

    A hash function is an important primitive for cryptographic protocols. Since the algorithms of well-known hash functions are largely serial, it is difficult to take full advantage of recent multi-core processors. This paper proposes a multilane hashing (MLH) mode that achieves both high parallelism and high security. The MLH mode is designed so that its processing speed is almost linear in the number of processors. Since the MLH mode uses an existing hash function as a black box, it is applicable to any hash function. The bound on the indifferentiability of the MLH mode from a random oracle is beyond the birthday bound on the output length of the underlying primitive.
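
    The general structure (though not the MLH specification itself) can be sketched as follows: split the message into interleaved lanes, hash the lanes independently and hence in parallel, then hash the concatenation of the lane digests. SHA-256 and the lane count are placeholder choices.

        import hashlib
        from concurrent.futures import ThreadPoolExecutor

        def multilane_hash(msg: bytes, lanes: int = 4) -> bytes:
            parts = [msg[i::lanes] for i in range(lanes)]  # interleave bytes
            with ThreadPoolExecutor(lanes) as pool:
                digests = pool.map(lambda p: hashlib.sha256(p).digest(), parts)
            # Final call binds the lane count and lane order.
            return hashlib.sha256(bytes([lanes]) + b"".join(digests)).digest()

        print(multilane_hash(b"x" * 1_000_000).hex())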

  • A Study on Signal Processing for Barium Ferrite Particulate Tape Systems

    Atsushi MUSHA, Osamu SHIMIZU

    PAPER
    Vol: E96-C No:12, Page(s): 1474-1478

    The optimum generalized partial response (GPR) target for barium ferrite (BaFe) tape systems was studied. The shift in perpendicular magnetic recording technology in HDDs to systems employing single-pole-type (SPT) recording heads and media with a soft under layer (SUL) has been accompanied by a change in the read channel design, whereas current magnetic tape recording systems utilize a combination of a ring-type recording head with a single magnetic layer structured medium. Therefore, the read channel performance of current oriented BaFe particulate tape systems needs to be studied to best exploit the potential of this medium. Toward this end, DC-free, DC-full, and DC-suppressed targets were compared. The results show that assuming a GPRML detector with 16 or more states, a traditional DC-free target exhibits the best bit error rate performance for both longitudinally and perpendicularly oriented BaFe media, suggesting that the current read channel designed for longitudinally oriented media can also be utilized for BaFe particulate tape systems.

  • A Practical and Optimal Path Planning for Autonomous Parking Using Fast Marching Algorithm and Support Vector Machine

    Quoc Huy DO, Seiichi MITA, Keisuke YONEDA

    PAPER-Artificial Intelligence, Data Mining
    Vol: E96-D No:12, Page(s): 2795-2804

    This paper proposes a novel, practical path planning framework for autonomous parking in cluttered environments with narrow passages. The proposed global path planning method is based on an improved Fast Marching algorithm that generates a path while considering forward and backward maneuvers. In addition, a Support Vector Machine is utilized to provide the maximum clearance from obstacles, taking vehicle dynamics into account, so as to yield a safe and feasible path. The algorithm considers the most critical points in the map, and its complexity is not affected by the shape of the obstacles. We also propose an autonomous parking scheme for different parking situations. The method is implemented on an autonomous vehicle platform and validated in real environments with narrow passages.

  • Region-Based Way-Partitioning on L1 Data Cache for Low Power

    Zhong ZHENG, Zhiying WANG, Li SHEN

    LETTER-Computer System
    Vol: E96-D No:11, Page(s): 2466-2469

    Power consumption has become a critical factor for embedded systems, especially battery-powered ones, and caches in these systems consume a large portion of whole-chip power. Embedded systems usually adopt set-associative caches for better performance, but accessing cache ways in parallel incurs more energy dissipation. This paper proposes a region-based way-partitioning scheme that reduces cache way accesses, and thus cache power consumption, without sacrificing performance. Stack accesses and non-stack accesses are isolated and redirected to different ways of the L1 data cache. Under way-partitioning, cache way accesses are reduced, as is memory reference interference. Experimental results show that the proposed approach saves around 27.5% of L1 data cache energy on average, without significant performance degradation.
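
    The steering decision can be sketched as a simple address check: accesses that fall in the current stack region allocate into a small reserved subset of ways, everything else into the rest. The two-way split and the parameter names are illustrative assumptions, not the paper's exact design.

        # Hypothetical way-selection for a 4-way L1 data cache.
        def select_ways(addr, stack_base, stack_limit, assoc=4, stack_ways=1):
            if stack_limit <= addr < stack_base:   # stack grows downward
                return range(0, stack_ways)        # reserved stack ways
            return range(stack_ways, assoc)        # remaining ways

        sp = 0x7FFF_F000
        print(list(select_ways(sp - 64, 0x8000_0000, sp - 0x10000)))      # stack
        print(list(select_ways(0x0804_8000, 0x8000_0000, sp - 0x10000)))  # other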

  • A Partially Driven Array Antenna Backed by a Reflector with a Reduction in the Number of Driven Elements by Up to 67%

    Tadashi TAKANO, Takehiro IMURA, Midori OKUMURA

    PAPER-Antennas and Propagation
    Vol: E96-B No:11, Page(s): 2883-2890

    This paper describes a novel technique to replace some of the driven elements in an array antenna with parasitic elements. First, the antenna characteristics are studied by simulation for a basic unit array with one driven and two parasitic elements. The entire antenna is backed by a flat reflector to conform to practical applications. The parasitic elements are excited by the neighboring driven elements through electromagnetic coupling. It is shown that, under the optimal coupling condition, the radiation patterns are almost identical to those of an array antenna whose elements are all driven without coupling. The simulation result is confirmed by an experiment at 5.8 GHz (λ = 51.7 mm). Finally, a 12-element array is formed by combining four unit arrays. The simulation results show that the maximum antenna gain is 19.4 dBi, indicating no penalty with respect to the antenna gain of a fully driven 12-element array. Therefore, the array antenna can be considerably simplified by replacing 67% of its elements with parasitic elements.

  • An Inter-Prediction Method Using Sparse Representation for High Efficiency Video Coding

    Koji INOUE, Kohei ISECHI, Hironobu SAITO, Yoshimitsu KUROKI

    LETTER-Image Processing
    Vol: E96-A No:11, Page(s): 2191-2193

    This paper proposes an inter-prediction method for the upcoming video coding standard HEVC (High Efficiency Video Coding). HEVC offers an inter-prediction framework called local intensity compensation, which represents a current block by a linear combination of reference blocks. The proposed method calculates the weight coefficients of the linear combination using sparse representation. Experimental results show that the proposed method increases prediction accuracy compared with other methods.
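
    One standard way to obtain such sparse weights is orthogonal matching pursuit (OMP); the sketch below selects a few reference blocks and least-squares fits their weights. This is a generic OMP illustration, not necessarily the solver used in the paper.

        import numpy as np

        def omp_weights(refs, target, k=2):
            """refs: (pixels, blocks) reference dictionary; returns sparse w."""
            residual, support = target.copy(), []
            w = np.zeros(refs.shape[1])
            for _ in range(k):
                support.append(int(np.argmax(np.abs(refs.T @ residual))))
                coef, *_ = np.linalg.lstsq(refs[:, support], target, rcond=None)
                residual = target - refs[:, support] @ coef
            w[support] = coef
            return w

        rng = np.random.default_rng(1)
        D = rng.standard_normal((16, 8))        # eight candidate 4x4 blocks
        x = 0.7 * D[:, 2] - 0.3 * D[:, 5]       # current block
        print(np.round(omp_weights(D, x), 3))   # weights peak at indices 2, 5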

  • A Novel Fast Mode Decision Algorithm for H.264/AVC Using Particle Swarm Optimization

    Jia-Ching WANG, Yu-Huan SUNG

    PAPER-Image Processing
    Vol: E96-A No:11, Page(s): 2154-2160

    Video coding plays an important role in daily life, especially in communications. H.264/AVC is a prominent video coding standard that has been used in a variety of applications owing to its high coding efficiency, which comes from several new coding techniques; however, its extremely high encoding complexity hinders its use in real-time applications. This paper presents a new encoding algorithm that uses particle swarm optimization (PSO) to train discriminant functions for classification-based fast mode decision. Experimental results show that the proposed algorithm successfully reduces encoding time with negligible quality degradation and bitrate increase.
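
    A minimal PSO loop of the kind used to train such discriminant functions is sketched below; the fitness function and constants are placeholders, not the paper's settings.

        import random

        def pso(fitness, dim, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
            pos = [[random.uniform(-1, 1) for _ in range(dim)]
                   for _ in range(particles)]
            vel = [[0.0] * dim for _ in range(particles)]
            pbest = [p[:] for p in pos]                 # personal bests
            gbest = min(pbest, key=fitness)[:]          # global best
            for _ in range(iters):
                for i, p in enumerate(pos):
                    for d in range(dim):
                        vel[i][d] = (w * vel[i][d]
                                     + c1 * random.random() * (pbest[i][d] - p[d])
                                     + c2 * random.random() * (gbest[d] - p[d]))
                        p[d] += vel[i][d]
                    if fitness(p) < fitness(pbest[i]):
                        pbest[i] = p[:]
                        if fitness(p) < fitness(gbest):
                            gbest = p[:]
            return gbest

        print(pso(lambda v: sum(x * x for x in v), dim=3))  # toy quadratic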

  • Single Parameter Logarithmic Image Processing for Edge Detection

    Fuji REN, Bo LI, Qimei CHEN

    PAPER-Image Processing and Video Processing
    Vol: E96-D No:11, Page(s): 2437-2449

    Considering the non-linear properties of the human visual system, many non-linear operators and models have been developed, particularly the logarithmic image processing (LIP) model proposed by Jourlin and Pinoli, which has been shown to be physically justified by several laws of the human visual system and has been successfully applied in image processing. Several modifications of this logarithmic mathematical framework have since been presented, such as parameterized logarithmic image processing (PLIP), pseudo-logarithmic image processing, and homomorphic logarithmic image processing. In this paper, a new single-parameter logarithmic model for image processing, together with an adaptive parameter-based Sobel edge detection algorithm, is presented. By analyzing the distributive law, the subtractive law, and the isomorphic property of the PLIP model, the five parameters in PLIP are replaced by a single parameter to ensure the completeness of the model and physical consistency with the nature of an image. An adaptive parameter-based Sobel edge detection algorithm is then proposed: an image noise estimation method evaluates the noise level of the image, the adaptive parameter of the single-parameter LIP model is calculated from the noise level and the grayscale value of the corresponding image area, and the single-parameter LIP-based Sobel operation is applied. This overcomes the noise sensitivity of classical LIP-based Sobel edge detection methods, especially in dark areas of an image, while retaining edge sensitivity. Compared with the classical LIP and PLIP models, the proposed single-parameter LIP achieves satisfactory results in noise suppression and edge accuracy.
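
    For reference, the classical LIP operations on gray tones f, g in [0, M), as introduced by Jourlin and Pinoli (the single-parameter model above builds on the PLIP generalization of these laws):

        % Classical LIP addition, scalar multiplication, and subtraction
        f \oplus g = f + g - \frac{fg}{M}, \qquad
        \lambda \otimes f = M - M\left(1 - \frac{f}{M}\right)^{\lambda}, \qquad
        f \ominus g = M\,\frac{f - g}{M - g}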

  • Bitstream Protection in Dynamic Partial Reconfiguration Systems Using Authenticated Encryption

    Yohei HORI, Toshihiro KATASHITA, Hirofumi SAKANE, Kenji TODA, Akashi SATOH

    PAPER-Computer System
    Vol: E96-D No:11, Page(s): 2333-2343

    Protecting the confidentiality and integrity of a configuration bitstream is essential for the dynamic partial reconfiguration (DPR) of field-programmable gate arrays (FPGAs), because erroneous or falsified bitstreams can cause fatal damage to FPGAs. In this paper, we present a high-speed and area-efficient bitstream protection scheme for DPR systems using the Advanced Encryption Standard with Galois/Counter Mode (AES-GCM), an authenticated encryption algorithm. Unlike many previous studies, our bitstream protection scheme also provides a mechanism for error recovery and tamper resistance against configuration block deletion, insertion, and disorder. The implementation and evaluation results show that our DPR scheme achieves higher performance, in terms of speed and area, than previous methods.
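
    As an illustration of the AES-GCM usage pattern (sketched with the Python "cryptography" package; key management and FPGA configuration-port details are simplified away), binding each block's position into the associated data is what lets deletion, insertion, and reordering be detected:

        import os
        from cryptography.hazmat.primitives.ciphers.aead import AESGCM

        key = AESGCM.generate_key(bit_length=128)
        aead = AESGCM(key)

        def protect_block(index: int, block: bytes):
            nonce = os.urandom(12)
            aad = index.to_bytes(4, "big")     # binds the block's position
            return nonce, aead.encrypt(nonce, block, aad)

        def load_block(index: int, nonce: bytes, ct: bytes) -> bytes:
            # Raises InvalidTag on tampering, deletion, or reordering.
            return aead.decrypt(nonce, ct, index.to_bytes(4, "big"))

        n, ct = protect_block(0, b"partial bitstream frame")
        assert load_block(0, n, ct) == b"partial bitstream frame"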

  • Sequential Loss Tomography Using Compressed Sensing

    Kazushi TAKEMOTO, Takahiro MATSUDA, Tetsuya TAKINE

    PAPER
    Vol: E96-B No:11, Page(s): 2756-2765

    Network tomography is a technique for estimating internal network characteristics from end-to-end measurements. In this paper, we focus on loss tomography, the network tomography problem of estimating link loss rates. We study the problem of detecting links with high loss rates in network environments where link loss rates change dynamically, and propose a window-based sequential loss tomography scheme. The problem is formulated as an underdetermined linear inverse problem, which has infinitely many candidate solutions. The proposed scheme uses compressed sensing, which can solve such problems given the prior information that the solution is a sparse vector. Measurement nodes transmit probe packets along measurement paths established between them and calculate the packet loss rates of those paths (path loss rates) from probe packets received within a window. Measurement paths are classified into normal-quality and low-quality states according to their path loss rates, and when a measurement node finds paths in the low-quality state, link loss rates are estimated by compressed sensing. Using simulation scenarios in which a few link states change dynamically from low to high loss rates, we evaluate the performance of the proposed scheme.
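
    In the log domain the inversion is exactly linear: with x_i = -log(1 - l_i) for link loss rate l_i, and y_j the same transform of the path loss rates, y = Ax for the routing matrix A. The sketch below recovers a sparse x with iterative soft thresholding (ISTA); the topology and rates are toy values, and ISTA stands in for whatever compressed sensing solver the paper uses.

        import numpy as np

        def ista(A, y, lam=1e-3, iters=500):
            x = np.zeros(A.shape[1])
            step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant
            for _ in range(iters):
                g = x - step * A.T @ (A @ x - y)     # gradient step
                x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0)
            return x

        A = np.array([[1., 1., 0., 0.],   # path 1 -> links 1,2
                      [0., 1., 1., 0.],   # path 2 -> links 2,3
                      [0., 0., 1., 1.]])  # path 3 -> links 3,4
        link_loss = np.array([0.0, 0.1, 0.0, 0.0])   # one lossy link
        y = A @ -np.log1p(-link_loss)                # observed path terms
        print(np.round(1 - np.exp(-ista(A, y)), 3))  # lossy link stands out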

  • On Minimum Feedback Vertex Sets in Bipartite Graphs and Degree-Constraint Graphs

    Asahi TAKAOKA, Satoshi TAYU, Shuichi UENO

    PAPER-Fundamentals of Information Systems
    Vol: E96-D No:11, Page(s): 2327-2332

    We consider the minimum feedback vertex set problem for some bipartite graphs and degree-constrained graphs. We show that the problem is linear-time solvable for bipartite permutation graphs and NP-hard for grid intersection graphs. We also show that the problem is solvable in $O(n^2\log^6 n)$ time for n-vertex graphs with maximum degree at most three.
