
Keyword Search Result

[Keyword] PAR (2741 hits)

Showing 781-800 of 2741 hits

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO

    PAPER-Data Compression
    Vol: E96-A No:12, Page(s): 2351-2359

    Despite continuous improvement in the computational power of multi/many-core processors, memory access performance has not improved commensurately, so the overall performance of recent processors is often limited by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) is effective for improving processor performance because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to those of LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC; the synthesized result shows that the decoder can decompress a 32-byte cache line in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
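
    As a minimal illustration of the base-plus-delta idea described above, the sketch below encodes a cache line as one base word plus short XOR differentials; the helper names and the XOR choice are illustrative assumptions, not the paper's actual encoder.

        # Sketch: base + differential encoding of one cache line
        # (hypothetical helper, not the paper's algorithm).
        def encode_line(words):
            base = words[0]                      # base pattern
            deltas = [w ^ base for w in words]   # differential patterns
            width = max(d.bit_length() for d in deltas)  # bits per delta
            return base, width, deltas

        def decode_line(base, deltas):
            # Each word decodes independently, enabling parallel decoding.
            return [base ^ d for d in deltas]

        line = [0x08048AB0, 0x08048AB4, 0x08048AB8, 0x08048ABC]  # periodic data
        base, width, deltas = encode_line(line)
        assert decode_line(base, deltas) == line  # lossless round trip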

  • Synchronization-Aware Virtual Machine Scheduling for Parallel Applications in Xen

    Cheol-Ho HONG, Chuck YOO

    LETTER
    Vol: E96-D No:12, Page(s): 2720-2723

    In this paper, we propose a synchronization-aware VM scheduler for parallel applications in Xen. The proposed scheduler prevents threads from waiting a significant amount of time during synchronization. For this purpose, we propose a scheme that identifies threads that have been waiting for other threads for a long time, built on a detection module that infers the internal status of guest OSs. We also present a scheduling policy that accelerates bottlenecks in concurrent VMs. We implemented our VM scheduler in a recent Xen hypervisor with para-virtualized Linux-based operating systems, and show that our approach can improve the performance of concurrent VMs by up to 43% compared to the credit scheduler.

  • Construction of High Rate Punctured Convolutional Codes by Exhaustive Search and Partial Search

    Sen MORIYA, Hiroshi SASANO

    PAPER-Coding Theory
    Vol: E96-A No:12, Page(s): 2374-2381

    We consider two methods for constructing high-rate punctured convolutional codes. First, we present the best high-rate R=(n-1)/n punctured convolutional codes, for n=5,6,…,16, obtained by exhaustive search. To find the best code, we use a regular convolutional code whose weight spectrum is equivalent to that of each punctured convolutional code, and search these equivalent codes for the best one. Second, we present a method that finds good punctured convolutional codes by partial search, examining only codes derived from the rate-1/2 original codes obtained by the first method. This partial search yields good punctured convolutional codes considerably faster than exhaustively searching for the best codes.
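
    To make the puncturing construction concrete, here is a small sketch of deriving a rate-3/4 code from a rate-1/2 encoder output; the puncturing pattern is a hypothetical example, not one of the paper's best codes.

        # Sketch: puncture a rate-1/2 coded stream to rate 3/4.
        # pattern[i] = keep-flags for the two coded bits of input bit i (cyclic).
        def puncture(coded_bits, pattern):
            out = []
            for i in range(0, len(coded_bits), 2):
                keep0, keep1 = pattern[(i // 2) % len(pattern)]
                if keep0:
                    out.append(coded_bits[i])
                if keep1:
                    out.append(coded_bits[i + 1])
            return out

        # 3 input bits -> 6 coded bits -> 4 transmitted bits (rate 3/4).
        pattern_3_4 = [(1, 1), (1, 0), (0, 1)]
        print(puncture([0, 1, 1, 0, 1, 1], pattern_3_4))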

  • Optimal Parallel Algorithms for Computing the Sum, the Prefix-Sums, and the Summed Area Table on the Memory Machine Models

    Koji NAKANO

    PAPER
    Vol: E96-D No:12, Page(s): 2626-2634

    The main contribution of this paper is to show optimal parallel algorithms that compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. We then show that $\Omega(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of n numbers in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Finally, we show that the summed area table of size $\sqrt{n}\times\sqrt{n}$ can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Since computing the prefix-sums and the summed area table is at least as hard as computing the sum, these parallel algorithms are also optimal.
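
    The relation among the three problems can be seen in a small sequential sketch: the summed area table is the prefix-sums applied first along rows and then along columns (this shows only the definitions; the paper's contribution is the parallel DMM/UMM complexity).

        # Sequential reference: prefix-sums and summed area table.
        def prefix_sums(a):
            out, s = [], 0
            for x in a:
                s += x
                out.append(s)
            return out

        def summed_area_table(grid):
            rows = [prefix_sums(r) for r in grid]  # row-wise pass
            # Transpose, prefix-sum each column, transpose back.
            return [list(c) for c in zip(*map(prefix_sums, zip(*rows)))]

        print(summed_area_table([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]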

  • A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

    Yasuaki ITO, Koji NAKANO

    PAPER
    Vol: E96-D No:12, Page(s): 2596-2603

    This paper presents a GPU (Graphics Processing Units) implementation of dynamic programming for the optimal polygon triangulation. GPUs can now be used for general-purpose parallel computation; users can develop parallel programs running on GPUs using the CUDA (Compute Unified Device Architecture) programming architecture provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem that finds a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved by dynamic programming in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. Our implementation uses two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential CUDA kernel calls and to select the best block size and block count for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access to the GPU's global memory, minimizing memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, whereas a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
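
    For reference, the sequential O(n^3) dynamic program being parallelized looks like the following sketch; the triangle weight function here is an illustrative placeholder.

        import math

        # m[i][j] = minimum weight of triangulating the sub-polygon v_i..v_j.
        def optimal_triangulation(w, n):
            m = [[0.0] * n for _ in range(n)]
            for gap in range(2, n):
                for i in range(n - gap):
                    j = i + gap
                    m[i][j] = min(m[i][k] + m[k][j] + w(i, k, j)
                                  for k in range(i + 1, j))
            return m[0][n - 1]

        # Toy weight: triangle perimeter on a regular hexagon.
        pts = [(math.cos(2 * math.pi * t / 6), math.sin(2 * math.pi * t / 6))
               for t in range(6)]
        def w(i, k, j):
            d = lambda a, b: math.dist(pts[a], pts[b])
            return d(i, k) + d(k, j) + d(i, j)
        print(optimal_triangulation(w, 6))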

  • An Exact Approach for GPC-Based Compressor Tree Synthesis

    Taeko MATSUNAGA, Shinji KIMURA, Yusuke MATSUNAGA

    PAPER-Logic Synthesis, Test and Verification
    Vol: E96-A No:12, Page(s): 2553-2560

    Multi-operand adders, which compute the sum of more than two operands, usually consist of compressor trees, which reduce the number of operands to two without any carry propagation, followed by a carry-propagate adder for the remaining two operands in ASIC implementations. Compressor trees consisting of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, so carry chains or dedicated structures have been utilized to build multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as building blocks. This paper addresses the problem of synthesizing GPC-based compressor trees. Based on the observation that characteristics such as area, power, and delay correlate roughly with the total number of GPCs and the maximum GPC level, the target problem can be regarded as minimizing the total number of GPCs and the maximum level, for which an ILP-based approach is proposed. The key point of our formulation is that, rather than modeling the structure of the compressor tree as in the existing approach, it models the compression process itself, which reduces the number of variables and constraints in the ILP formulation. Experimental results demonstrate the advantage of our formulation in terms of quality and runtime.
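
    The "compression process" view can be illustrated by tracking only the number of bits per column per level. The toy sketch below greedily applies the (3;2) counter (a full adder) until every column holds at most two bits; the paper instead chooses among general GPCs with an exact ILP, so this is a simplification for intuition only.

        # Toy compression process with (3;2) counters only.
        def compress(cols):
            """cols[c] = number of partial-product bits of weight 2^c."""
            level = counters = 0
            while any(n > 2 for n in cols):
                nxt = [0] * (len(cols) + 1)
                for c, n in enumerate(cols):
                    fa = n // 3                  # counters placed in column c
                    counters += fa
                    nxt[c] += (n - 3 * fa) + fa  # leftover bits + sum bits
                    nxt[c + 1] += fa             # carry bits
                cols, level = nxt, level + 1
            return level, counters

        print(compress([9, 9, 9, 9]))  # (levels, counters) for a toy tree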

  • Time Shift Parameter Setting of Temporal Decorrelation Source Separation for Periodic Gaussian Signals

    Takeshi AMISHIMA, Kazufumi HIRATA

    PAPER-Sensing
    Vol: E96-B No:12, Page(s): 3190-3198

    Temporal Decorrelation source SEParation (TDSEP) is a blind separation scheme that utilizes the time structure of the source signals, typically their periodicities. The advantage of TDSEP over non-Gaussianity-based methods is that it can separate Gaussian signals as long as they are periodic. However, its shortcoming is that separation performance (SEP) depends heavily on the values of the time shift parameters (TSPs). This paper proposes a method to automatically and blindly estimate a set of TSPs that achieves optimal SEP for periodic Gaussian signals. It is also shown that selecting as many TSPs as there are source signals is sufficient for optimal SEP; adding more TSPs does not improve SEP but only increases the computational complexity. A simulation example shows that the SEP is approximately 20 dB higher than that of the conventional method, and that the proposed method selects exactly as many TSPs as there are incoming signals.
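
    The quantity TDSEP works with is the set of time-shifted correlation matrices of the observed mixtures, one per TSP, which are then jointly diagonalized. The sketch below computes such a matrix; the joint diagonalization and the paper's automatic TSP selection are omitted.

        import numpy as np

        def shifted_correlation(X, tau):
            """X: (channels, samples) mixtures; returns symmetrized R(tau)."""
            A, B = X[:, tau:], X[:, :X.shape[1] - tau]
            R = A @ B.T / A.shape[1]
            return (R + R.T) / 2

        rng = np.random.default_rng(0)
        t = np.arange(2000)
        S = np.vstack([np.sin(2 * np.pi * t / 50),    # period-50 source
                       np.sin(2 * np.pi * t / 91)])   # period-91 source
        X = rng.standard_normal((2, 2)) @ S           # unknown mixing
        print(shifted_correlation(X, 50))             # TSP near a source period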

  • Study of Multi-Cell Interference in a 2-Hop OFDMA Virtual Cellular Network

    Gerard J. PARAISON, Eisuke KUDOH

    PAPER-Wireless Communication Technologies
    Vol: E96-B No:12, Page(s): 3163-3171

    In the literature, many resource allocation schemes have been proposed for multi-hop networks; however, the analyses focus mainly on the single-cell case. Inter-cell interference severely degrades the performance of a wireless mobile network, so incorporating it into the analysis is required to more fully understand a scheme's performance. The authors have previously proposed a parallel relaying scheme for a 2-hop OFDMA virtual cellular network (VCN). The purpose of this paper is to study a version of that scheme that considers a multi-cell environment and to evaluate the performance of the VCN. The ergodic channel capacity and outage capacity of the VCN in the presence of inter-cell interference are evaluated, and the results are compared to those of the single-hop network (SHN). Furthermore, the effect of the location and number of wireless ports on the channel capacity of the VCN is investigated, and the degree of fairness of the VCN is compared with that of the SHN. Computer simulations show that, in the presence of inter-cell interference, a) the VCN outperforms the SHN even in the interference-dominant transmission power region (with a single cell, the VCN is better than the SHN only in the noise-dominant region), b) the channel capacity of the VCN remains greater than that of the SHN even if the VCN is fully loaded, c) an optimal distance ratio for the location of the wireless ports lies in the interval 0.2∼0.4, d) increasing the number of wireless ports from 3 to 6 increases the channel capacity of the VCN, and e) the VCN achieves better outage capacity than the SHN.

  • Hybrid Message-Passing Algorithm and Architecture for Decoding Cyclic Non-binary LDPC Codes

    Yichao LU, Gang HE, Guifen TIAN, Satoshi GOTO

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E96-A No:12, Page(s): 2652-2659

    Non-binary low-density parity-check (NB-LDPC) codes have begun to show their superiority in achieving significant coding gains at moderate codeword lengths. However, their overwhelming decoding complexity keeps NB-LDPC codes from being widely employed in modern communication devices. This paper proposes a hybrid message-passing decoding algorithm with very low computational complexity that achieves error performance competitive with the conventional Min-max algorithm. Simulation results on a (255,174) cyclic code show that the algorithm obtains at least 0.5 dB of coding gain over other state-of-the-art low-complexity NB-LDPC decoding algorithms. A partial-parallel decoder architecture for cyclic NB-LDPC codes is also developed based on this algorithm. Optimization schemes are employed to remove hard-decision symbols from RAMs and to store only part of the reliability messages. In addition, the variable node units are redesigned specifically for the proposed algorithm. Synthesis results demonstrate that about 24.3% of gates and 12% of memory can be saved over previous works.

  • Multilane Hashing Mode Suitable for Parallel Processing

    Hidenori KUWAKADO, Shoichi HIROSE

    PAPER-Information Security
    Vol: E96-A No:12, Page(s): 2434-2442

    A hash function is an important primitive for cryptographic protocols. Since the algorithms of well-known hash functions are largely serial, it is difficult to take full advantage of recent multi-core processors. This paper proposes a multilane hashing (MLH) mode that achieves both high parallelism and high security. The MLH mode is designed so that its processing speed is almost linear in the number of processors. Since the MLH mode uses an existing hash function as a black box, it is applicable to any hash function. The bound on the indifferentiability of the MLH mode from a random oracle is beyond the birthday bound on the output length of the underlying primitive.
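
    The general structure (though not the MLH specification itself) can be sketched as follows: split the message into interleaved lanes, hash the lanes independently and hence in parallel, then hash the concatenation of the lane digests. SHA-256 and the lane count are placeholder choices.

        import hashlib
        from concurrent.futures import ThreadPoolExecutor

        def multilane_hash(msg: bytes, lanes: int = 4) -> bytes:
            parts = [msg[i::lanes] for i in range(lanes)]  # interleave bytes
            with ThreadPoolExecutor(lanes) as pool:
                digests = pool.map(lambda p: hashlib.sha256(p).digest(), parts)
            # Final call binds the lane count and lane order.
            return hashlib.sha256(bytes([lanes]) + b"".join(digests)).digest()

        print(multilane_hash(b"x" * 1_000_000).hex())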

  • A Study on Signal Processing for Barium Ferrite Particulate Tape Systems

    Atsushi MUSHA, Osamu SHIMIZU

    PAPER
    Vol: E96-C No:12, Page(s): 1474-1478

    The optimum generalized partial response (GPR) target for barium ferrite (BaFe) tape systems was studied. The shift in perpendicular magnetic recording technology in HDDs to systems employing single-pole-type (SPT) recording heads and media with a soft under layer (SUL) has been accompanied by a change in the read channel design, whereas current magnetic tape recording systems utilize a combination of a ring-type recording head with a single magnetic layer structured medium. Therefore, the read channel performance of current oriented BaFe particulate tape systems needs to be studied to best exploit the potential of this medium. Toward this end, DC-free, DC-full, and DC-suppressed targets were compared. The results show that assuming a GPRML detector with 16 or more states, a traditional DC-free target exhibits the best bit error rate performance for both longitudinally and perpendicularly oriented BaFe media, suggesting that the current read channel designed for longitudinally oriented media can also be utilized for BaFe particulate tape systems.

  • A Practical and Optimal Path Planning for Autonomous Parking Using Fast Marching Algorithm and Support Vector Machine

    Quoc Huy DO, Seiichi MITA, Keisuke YONEDA

    PAPER-Artificial Intelligence, Data Mining
    Vol: E96-D No:12, Page(s): 2795-2804

    This paper proposes a novel, practical path planning framework for autonomous parking in cluttered environments with narrow passages. The proposed global path planning method is based on an improved Fast Marching algorithm that generates a path while considering forward and backward maneuvers. In addition, a Support Vector Machine is utilized to provide the maximum clearance from obstacles, taking vehicle dynamics into account, so as to yield a safe and feasible path. The algorithm considers the most critical points in the map, and its complexity is not affected by the shape of the obstacles. We also propose an autonomous parking scheme for different parking situations. The method is implemented on an autonomous vehicle platform and validated in real environments with narrow passages.

  • Region-Based Way-Partitioning on L1 Data Cache for Low Power

    Zhong ZHENG, Zhiying WANG, Li SHEN

    LETTER-Computer System
    Vol: E96-D No:11, Page(s): 2466-2469

    Power consumption has become a critical factor for embedded systems, especially battery-powered ones, and caches in these systems consume a large portion of whole-chip power. Embedded systems usually adopt set-associative caches for better performance, but accessing cache ways in parallel incurs more energy dissipation. This paper proposes a region-based way-partitioning scheme that reduces cache way accesses, and thus cache power consumption, without sacrificing performance. Stack accesses and non-stack accesses are isolated and redirected to different ways of the L1 data cache. Under way-partitioning, cache way accesses are reduced, as is memory reference interference. Experimental results show that the proposed approach saves around 27.5% of L1 data cache energy on average, without significant performance degradation.
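
    The steering decision can be sketched as a simple address check: accesses that fall in the current stack region allocate into a small reserved subset of ways, everything else into the rest. The two-way split and the parameter names are illustrative assumptions, not the paper's exact design.

        # Hypothetical way-selection for a 4-way L1 data cache.
        def select_ways(addr, stack_base, stack_limit, assoc=4, stack_ways=1):
            if stack_limit <= addr < stack_base:   # stack grows downward
                return range(0, stack_ways)        # reserved stack ways
            return range(stack_ways, assoc)        # remaining ways

        sp = 0x7FFF_F000
        print(list(select_ways(sp - 64, 0x8000_0000, sp - 0x10000)))      # stack
        print(list(select_ways(0x0804_8000, 0x8000_0000, sp - 0x10000)))  # other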

  • A Partially Driven Array Antenna Backed by a Reflector with a Reduction in the Number of Driven Elements by Up to 67%

    Tadashi TAKANO, Takehiro IMURA, Midori OKUMURA

    PAPER-Antennas and Propagation
    Vol: E96-B No:11, Page(s): 2883-2890

    This paper describes a novel technique to replace some of the driven elements in an array antenna with parasitic elements. First, the antenna characteristics are studied by simulation for a basic unit array with one driven and two parasitic elements. The entire antenna is backed by a flat reflector to conform to practical applications. The parasitic elements are excited by the neighboring driven elements through electromagnetic coupling. It is shown that, under the optimal coupling condition, the radiation patterns are almost identical to those of an array antenna whose elements are all driven without coupling. The simulation result is confirmed by an experiment at 5.8 GHz (λ = 51.7 mm). Finally, a 12-element array is formed by combining four unit arrays. The simulation results show that the maximum antenna gain is 19.4 dBi, indicating no penalty with respect to the antenna gain of a fully driven 12-element array. Therefore, the array antenna can be considerably simplified by replacing 67% of its elements with parasitic elements.

  • An Inter-Prediction Method Using Sparse Representation for High Efficiency Video Coding

    Koji INOUE, Kohei ISECHI, Hironobu SAITO, Yoshimitsu KUROKI

    LETTER-Image Processing
    Vol: E96-A No:11, Page(s): 2191-2193

    This paper proposes an inter-prediction method for the upcoming video coding standard HEVC (High Efficiency Video Coding). HEVC offers an inter-prediction framework called local intensity compensation, which represents a current block by a linear combination of reference blocks. The proposed method calculates the weight coefficients of the linear combination using sparse representation. Experimental results show that the proposed method increases prediction accuracy compared with other methods.
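
    One standard way to obtain such sparse weights is orthogonal matching pursuit (OMP); the sketch below selects a few reference blocks and least-squares fits their weights. This is a generic OMP illustration, not necessarily the solver used in the paper.

        import numpy as np

        def omp_weights(refs, target, k=2):
            """refs: (pixels, blocks) reference dictionary; returns sparse w."""
            residual, support = target.copy(), []
            w = np.zeros(refs.shape[1])
            for _ in range(k):
                support.append(int(np.argmax(np.abs(refs.T @ residual))))
                coef, *_ = np.linalg.lstsq(refs[:, support], target, rcond=None)
                residual = target - refs[:, support] @ coef
            w[support] = coef
            return w

        rng = np.random.default_rng(1)
        D = rng.standard_normal((16, 8))        # eight candidate 4x4 blocks
        x = 0.7 * D[:, 2] - 0.3 * D[:, 5]       # current block
        print(np.round(omp_weights(D, x), 3))   # weights peak at indices 2, 5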

  • A Novel Fast Mode Decision Algorithm for H.264/AVC Using Particle Swarm Optimization

    Jia-Ching WANG, Yu-Huan SUNG

    PAPER-Image Processing
    Vol: E96-A No:11, Page(s): 2154-2160

    Video coding plays an important role in daily life, especially in communications. H.264/AVC is a prominent video coding standard that has been used in a variety of applications owing to its high coding efficiency, which comes from several new coding techniques; however, its extremely high encoding complexity hinders its use in real-time applications. This paper presents a new encoding algorithm that uses particle swarm optimization (PSO) to train discriminant functions for classification-based fast mode decision. Experimental results show that the proposed algorithm successfully reduces encoding time with negligible quality degradation and bitrate increase.
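
    A minimal PSO loop of the kind used to train such discriminant functions is sketched below; the fitness function and constants are placeholders, not the paper's settings.

        import random

        def pso(fitness, dim, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
            pos = [[random.uniform(-1, 1) for _ in range(dim)]
                   for _ in range(particles)]
            vel = [[0.0] * dim for _ in range(particles)]
            pbest = [p[:] for p in pos]                 # personal bests
            gbest = min(pbest, key=fitness)[:]          # global best
            for _ in range(iters):
                for i, p in enumerate(pos):
                    for d in range(dim):
                        vel[i][d] = (w * vel[i][d]
                                     + c1 * random.random() * (pbest[i][d] - p[d])
                                     + c2 * random.random() * (gbest[d] - p[d]))
                        p[d] += vel[i][d]
                    if fitness(p) < fitness(pbest[i]):
                        pbest[i] = p[:]
                        if fitness(p) < fitness(gbest):
                            gbest = p[:]
            return gbest

        print(pso(lambda v: sum(x * x for x in v), dim=3))  # toy quadratic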

  • Single Parameter Logarithmic Image Processing for Edge Detection

    Fuji REN, Bo LI, Qimei CHEN

    PAPER-Image Processing and Video Processing
    Vol: E96-D No:11, Page(s): 2437-2449

    Considering the non-linear properties of the human visual system, many non-linear operators and models have been developed, particularly the logarithmic image processing (LIP) model proposed by Jourlin and Pinoli, which has been shown to be physically justified by several laws of the human visual system and has been successfully applied in image processing. Several modifications of this logarithmic mathematical framework have since been presented, such as parameterized logarithmic image processing (PLIP), pseudo-logarithmic image processing, and homomorphic logarithmic image processing. In this paper, a new single-parameter logarithmic model for image processing, together with an adaptive parameter-based Sobel edge detection algorithm, is presented. By analyzing the distributive law, the subtractive law, and the isomorphic property of the PLIP model, the five parameters in PLIP are replaced by a single parameter to ensure the completeness of the model and physical consistency with the nature of an image. An adaptive parameter-based Sobel edge detection algorithm is then proposed: an image noise estimation method evaluates the noise level of the image, the adaptive parameter of the single-parameter LIP model is calculated from the noise level and the grayscale value of the corresponding image area, and the single-parameter LIP-based Sobel operation is applied. This overcomes the noise sensitivity of classical LIP-based Sobel edge detection methods, especially in dark areas of an image, while retaining edge sensitivity. Compared with the classical LIP and PLIP models, the proposed single-parameter LIP achieves satisfactory results in noise suppression and edge accuracy.
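
    For reference, the classical LIP operations on gray tones f, g in [0, M), as introduced by Jourlin and Pinoli (the single-parameter model above builds on the PLIP generalization of these laws):

        % Classical LIP addition, scalar multiplication, and subtraction
        f \oplus g = f + g - \frac{fg}{M}, \qquad
        \lambda \otimes f = M - M\left(1 - \frac{f}{M}\right)^{\lambda}, \qquad
        f \ominus g = M\,\frac{f - g}{M - g}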

  • Bitstream Protection in Dynamic Partial Reconfiguration Systems Using Authenticated Encryption

    Yohei HORI, Toshihiro KATASHITA, Hirofumi SAKANE, Kenji TODA, Akashi SATOH

    PAPER-Computer System
    Vol: E96-D No:11, Page(s): 2333-2343

    Protecting the confidentiality and integrity of a configuration bitstream is essential for the dynamic partial reconfiguration (DPR) of field-programmable gate arrays (FPGAs), because erroneous or falsified bitstreams can cause fatal damage to FPGAs. In this paper, we present a high-speed and area-efficient bitstream protection scheme for DPR systems using the Advanced Encryption Standard with Galois/Counter Mode (AES-GCM), an authenticated encryption algorithm. Unlike many previous studies, our bitstream protection scheme also provides a mechanism for error recovery and tamper resistance against configuration block deletion, insertion, and disorder. The implementation and evaluation results show that our DPR scheme achieves higher performance, in terms of speed and area, than previous methods.
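
    As an illustration of the AES-GCM usage pattern (sketched with the Python "cryptography" package; key management and FPGA configuration-port details are simplified away), binding each block's position into the associated data is what lets deletion, insertion, and reordering be detected:

        import os
        from cryptography.hazmat.primitives.ciphers.aead import AESGCM

        key = AESGCM.generate_key(bit_length=128)
        aead = AESGCM(key)

        def protect_block(index: int, block: bytes):
            nonce = os.urandom(12)
            aad = index.to_bytes(4, "big")     # binds the block's position
            return nonce, aead.encrypt(nonce, block, aad)

        def load_block(index: int, nonce: bytes, ct: bytes) -> bytes:
            # Raises InvalidTag on tampering, deletion, or reordering.
            return aead.decrypt(nonce, ct, index.to_bytes(4, "big"))

        n, ct = protect_block(0, b"partial bitstream frame")
        assert load_block(0, n, ct) == b"partial bitstream frame"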

  • Sequential Loss Tomography Using Compressed Sensing

    Kazushi TAKEMOTO, Takahiro MATSUDA, Tetsuya TAKINE

    PAPER
    Vol: E96-B No:11, Page(s): 2756-2765

    Network tomography is a technique for estimating internal network characteristics from end-to-end measurements. In this paper, we focus on loss tomography, the network tomography problem of estimating link loss rates. We study the problem of detecting links with high loss rates in network environments where link loss rates change dynamically, and propose a window-based sequential loss tomography scheme. The problem is formulated as an underdetermined linear inverse problem, which has infinitely many candidate solutions. The proposed scheme uses compressed sensing, which can solve such problems given the prior information that the solution is a sparse vector. Measurement nodes transmit probe packets along measurement paths established between them and calculate the packet loss rates of those paths (path loss rates) from probe packets received within a window. Measurement paths are classified into normal-quality and low-quality states according to their path loss rates, and when a measurement node finds paths in the low-quality state, link loss rates are estimated by compressed sensing. Using simulation scenarios in which a few link states change dynamically from low to high loss rates, we evaluate the performance of the proposed scheme.
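
    In the log domain the inversion is exactly linear: with x_i = -log(1 - l_i) for link loss rate l_i, and y_j the same transform of the path loss rates, y = Ax for the routing matrix A. The sketch below recovers a sparse x with iterative soft thresholding (ISTA); the topology and rates are toy values, and ISTA stands in for whatever compressed sensing solver the paper uses.

        import numpy as np

        def ista(A, y, lam=1e-3, iters=500):
            x = np.zeros(A.shape[1])
            step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant
            for _ in range(iters):
                g = x - step * A.T @ (A @ x - y)     # gradient step
                x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0)
            return x

        A = np.array([[1., 1., 0., 0.],   # path 1 -> links 1,2
                      [0., 1., 1., 0.],   # path 2 -> links 2,3
                      [0., 0., 1., 1.]])  # path 3 -> links 3,4
        link_loss = np.array([0.0, 0.1, 0.0, 0.0])   # one lossy link
        y = A @ -np.log1p(-link_loss)                # observed path terms
        print(np.round(1 - np.exp(-ista(A, y)), 3))  # lossy link stands out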

  • On Minimum Feedback Vertex Sets in Bipartite Graphs and Degree-Constraint Graphs

    Asahi TAKAOKA, Satoshi TAYU, Shuichi UENO

    PAPER-Fundamentals of Information Systems
    Vol: E96-D No:11, Page(s): 2327-2332

    We consider the minimum feedback vertex set problem for some bipartite graphs and degree-constrained graphs. We show that the problem is linear-time solvable for bipartite permutation graphs and NP-hard for grid intersection graphs. We also show that the problem is solvable in $O(n^2\log^6 n)$ time for n-vertex graphs with maximum degree at most three.
