The search functionality is under construction.

Keyword Search Result

[Keyword] SIMD(51hit)

1-20hit(51hit)

  • The Weight Distributions of the (256, k) Extended Binary Primitive BCH Codes with k≤71 and k≥187

    Toru FUJIWARA  Takuya KUSAKA  

     
    PAPER-Coding Theory

      Pubricized:
    2021/03/12
      Vol:
    E104-A No:9
      Page(s):
    1321-1328

    Computing the weight distribution of a code is a challenging problem in coding theory. In this paper, the weight distributions of (256, k) extended binary primitive BCH codes with k≤71 and k≥187 are given. The weight distributions of the codes with k≤63 and k≥207 have already been obtained in our previous work. Affine permutation and trellis structure are used to reduce the computing time. Computer programs in C language which use recent CPU instructions, such as SIMD, are developed. These programs can be deployed even on an entry model workstation to obtain the new results in this paper.

  • Implementing 128-Bit Secure MPKC Signatures

    Ming-Shing CHEN  Wen-Ding LI  Bo-Yuan PENG  Bo-Yin YANG  Chen-Mou CHENG  

     
    PAPER-Cryptography and Information Security

      Vol:
    E101-A No:3
      Page(s):
    553-569

    Multivariate Public Key Cryptosystems (MPKCs) are often touted as future-proofing against Quantum Computers. In 2009, it was shown that hardware advances do not favor just “traditional” alternatives such as ECC and RSA, but also makes MPKCs faster and keeps them competitive at 80-bit security when properly implemented. These techniques became outdated due to emergence of new instruction sets and higher requirements on security. In this paper, we review how MPKC signatures changes from 2009 including new parameters (from a newer security level at 128-bit), crypto-safe implementations, and the impact of new AVX2 and AESNI instructions. We also present new techniques on evaluating multivariate polynomials, multiplications of large finite fields by additive Fast Fourier Transforms, and constant time linear solvers.

  • Efficient Parallel Join Processing Exploiting SIMD in Multi-Thread Environments

    Gilseok HONG  Seonghyeon KANG  Chang soo KIM  Jun-Ki MIN  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2017/12/14
      Vol:
    E101-D No:3
      Page(s):
    659-667

    In this paper, we study parallel join processing to improve the performance of the merge phase of sort-merge join by integrating all parallelism provided by mainstream CPUs. Modern CPUs support SIMD instruction sets with wider SIMD registers which allows to process multiple data items per each instruction. Thus, we devise an efficient parallel join algorithm, called Parallel Merge Join with SIMD instructions (PMJS). In our proposed algorithm, we utilize data parallelism by exploiting SIMD instructions. And we also accelerate the performance by avoiding the usage of conditional branch instructions. Furthermore, to take advantage of the multiple cores, our proposed algorithm is threaded in multi-thread environments. In our multi-thread algorithm, to distribute workload evenly to each thread, we devise an efficient workload balancing algorithm based on the kernel density estimator which allows to estimate the workload of each thread accurately.

  • Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

    Wei GAO  Lin HAN  Rongcai ZHAO  Yingying LI  Jian LIU  

     
    PAPER-Software System

      Pubricized:
    2016/09/29
      Vol:
    E100-D No:1
      Page(s):
    91-106

    Single-instruction multiple-data (SIMD) extension provides an energy-efficient platform to scale the performance of media and scientific applications while still retaining post-programmability. However, the major challenge is to translate the parallel resources of the SIMD hardware into real application performance. Currently, all the slots in the vector register are used when compilers exploit SIMD parallelism of programs, which can be called sufficient vectorization. Sufficient vectorization means all the data in the vector register is valid. Because all the slots which vector register provides must be used, the chances of vectorizing programs with low SIMD parallelism are abandoned by sufficient vectorization method. In addition, the speedup obtained by full use of vector register sometimes is not as great as that obtained by partial use. Specifically, the length of vector register provided by SIMD extension becomes longer, sufficient vectorization method cannot exploit the SIMD parallelism of programs completely. Therefore, insufficient vectorization method is proposed, which refer to partial use of vector register. First, the adaptation scene of insufficient vectorization is analyzed. Second, the methods of computing inter-iteration and intra-iteration SIMD parallelism for loops are put forward. Furthermore, according to the relationship between the parallelism and vector factor a method is established to make the choice of vectorization method, in order to vectorize programs as well as possible. Finally, code generation strategy for insufficient vectorization is presented. Benchmark test results show that insufficient vectorization method vectorized more programs than sufficient vectorization method by 107.5% and the performance achieved by insufficient vectorization method is 12.1% higher than that achieved by sufficient vectorization method.

  • Packing Messages and Optimizing Bootstrapping in GSW-FHE

    Ryo HIROMASA  Masayuki ABE  Tatsuaki OKAMOTO  

     
    PAPER

      Vol:
    E99-A No:1
      Page(s):
    73-82

    We construct the first fully homomorphic encryption (FHE) scheme that encrypts matrices and supports homomorphic matrix addition and multiplication. This is a natural extension of packed FHE and thus supports more complicated homomorphic operations. We optimize the bootstrapping procedure of Alperin-Sheriff and Peikert (CRYPTO 2014) by applying our scheme. Our optimization decreases the lattice approximation factor from Õ(n3) to Õ(n2.5). By taking a lattice dimension as a larger polynomial in a security parameter, we can also obtain the same approximation factor as the best known one of standard lattice-based public-key encryption without successive dimension-modulus reduction, which was essential for achieving the best factor in prior works on bootstrapping of standard lattice-based FHE.

  • An FPGA Implementation of the Two-Dimensional FDTD Method and Its Performance Comparison with GPGPU

    Ryota TAKASU  Yoichi TOMIOKA  Yutaro ISHIGAKI  Ning LI  Tsugimichi SHIBATA  Mamoru NAKANISHI  Hitoshi KITAZAWA  

     
    PAPER

      Vol:
    E97-C No:7
      Page(s):
    697-706

    Electromagnetic field analysis is a time-consuming process, and a method involving the use of an FPGA accelerator is one of the attractive ways to accelerate the analysis; the other method involve the use of CPU and GPU. In this paper, we propose an FPGA accelerator dedicated for a two-dimensional finite-difference time-domain (FDTD) method. This accelerator is based on a two-dimensional single instruction multiple data (SIMD) array architecture. Each processing element (PE) is composed of a six-stage pipeline that is optimized for the FDTD method. Moreover, driving signal generation and impedance termination are also implemented in the hardware. We demonstrate that our accelerator is 11 times faster than existing FPGA accelerators and 9 times faster than parallel computing on the NVIDIA Tesla C2075. As an application of the high-speed FDTD accelerator, the design optimization of a waveguide is shown.

  • Efficient Utilization of Vector Registers to Improve FFT Performance on SIMD Microprocessors

    Feng YU  Ruifeng GE  Zeke WANG  

     
    LETTER-Digital Signal Processing

      Vol:
    E96-A No:7
      Page(s):
    1637-1641

    We investigate the utilization of vector registers (VRs) on reducing memory references for single instruction multiple data fast Fourier transform calculation. We propose to group the butterfly computations in several consecutive stages to maximize utilization of the available VRs and take the advantage of the symmetries in twiddle factors. All the butterflies sharing identical twiddle factors are clustered and computed together to further improve performance. The relationship between the number of fused stages and the number of available VRs is then examined. Experimental results on different platforms show that the proposed method is effective.

  • Dual-Core Framework: Eliminating the Bottleneck Effect of Scalar Kernels on SIMD Architectures

    Yaohua WANG  Shuming CHEN  Hu CHEN  Jianghua WAN  Kai ZHANG  Sheng LIU  

     
    LETTER-Computer System

      Vol:
    E96-D No:2
      Page(s):
    365-369

    The efficiency of ubiquitous SIMD (Single Instruction Multiple Data) media processors is seriously limited by the bottleneck effect of the scalar kernels in media applications. To solve this problem, a dual-core framework, composed of a micro control unit and an instruction buffer, is proposed. This framework can dynamically decouple the scalar and vector pipelines of the original single-core SIMD architecture into two free-running cores. Thus, the bottleneck effect can be eliminated by effectively exploiting the parallelism between scalar and vector kernels. The dual-core framework achieves the best attributes of both single-core and dual-core SIMD architectures. Experimental results exhibit an average performance improvement of 33%, at an area overhead of 4.26%. What's more, with the increase of the SIMD width, higher performance gain and lower cost can be expected.

  • A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

    Xun HE  Xin JIN  Minghui WANG  Dajiang ZHOU  Satoshi GOTO  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E94-A No:12
      Page(s):
    2609-2618

    This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.

  • Software-Based Parallel Cryptographic Solution with Massive-Parallel Memory-Embedded SIMD Matrix Architecture for Data-Storage Systems

    Takeshi KUMAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Masaharu TAGAMI  Masakatsu ISHIZAKI  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E94-D No:9
      Page(s):
    1742-1754

    This paper presents a software-based parallel cryptographic solution with a massive-parallel memory-embedded SIMD matrix (MTX) for data-storage systems. MTX can have up to 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. Furthermore, a next-generation SIMD matrix called MX-2 has been developed by expanding processing-element capability of MTX from 2-bit to 4-bit processing. These SIMD matrix architectures are verified to be a better alternative for processing repeated-arithmetic and logical-operations in multimedia applications with low power consumption. Moreover, we have proposed combining Content Addressable Memory (CAM) technology with the massive-parallel memory-embedded SIMD matrix architecture to enable fast pipelined table-lookup coding. Since both arithmetic logical operation and table-lookup coding execute extremely fast on these architectures, efficient execution of encryption and decryption algorithms can be realized. Evaluation results of the CAM-less and CAM-enhanced massive-parallel SIMD matrix processor for the example of the Advanced Encryption Standard (AES), which is a widely-used cryptographic algorithm, show that a throughput of up to 2.19 Gbps becomes possible. This means that several standard data-storage transfer specifications, such as SD, CF (Compact Flash), USB (Universal Serial Bus) and SATA (Serial Advanced Technology Attachment) can be covered. Consequently, the massive-parallel SIMD matrix architecture is very suitable for private information protection in several data-storage media. A further advantage of the software based solution is the flexible update possibility of the implemented-cryptographic algorithm to a safer future algorithm. The massive-parallel memory-embedded SIMD matrix architecture (MTX and MX-2) is therefore a promising solution for integrated realization of real-time cryptographic algorithms with low power dissipation and small Si-area consumption.

  • Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor

    Takeshi KUMAKI  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Yasuto KURODA  Takayuki GYOHTEN  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    PAPER

      Vol:
    E91-C No:9
      Page(s):
    1409-1418

    This paper presents an integration architecture of content addressable memory (CAM) and a massive-parallel memory-embedded SIMD matrix for constructing a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be a better way for processing the repeated arithmetic operation types in multimedia applications. The proposed architecture, reported in this paper, exploits in addition CAM technology and enables therefore fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed novel architecture can realize consequently efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the example of the frequently used JPEG image-compression application show that the necessary clock cycle number can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performances in Mpixel/mm2 are factors 3.3 and 4.4 better than with a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP, respectively.

  • A Partial Access Mechanism on a Register for Low-Cost Embedded Multimedia ASIP

    Ha-young JEONG  Min-young CHO  Won HUR  Yong-surk LEE  

     
    LETTER-Integrated Electronics

      Vol:
    E91-C No:7
      Page(s):
    1171-1174

    In this letter, we propose a partial access mechanism to be used on a register file for low-cost embedded multimedia processor architecture. In the embedded system, supporting the SIMD operations is a burden because of the wide SIMD register file and its execution unit. So a new architecture is proposed to increase the performance of SIMD operations with minimal additional hardware overhead. To evaluate the performance and hardware overhead, this architecture is adopted to an embedded multimedia processor and simulated with five DSP benchmarks. The simulation results indicate that the performance is increased by 38% and the total area is increased by 13.4%. The proposed partial access mechanism may be useful for low-cost embedded multimedia ASIP.

  • Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision Diagram

    Hiroaki TANAKA  Yoshinori TAKEUCHI  Keishi SAKANUSHI  Masaharu IMAI  Hiroki TAGAWA  Yutaka OTA  Nobu MATSUMOTO  

     
    PAPER-System Level Design

      Vol:
    E90-A No:12
      Page(s):
    2800-2809

    SIMD instructions are often implemented in modern multimedia oriented processors. Although SIMD instructions are useful for many digital signal processing applications, most compilers do not exploit SIMD instructions. The difficulty in the utilization of SIMD instructions stems from data parallelism in registers. In assembly code generation, the positions of data in registers must be noted. A technique of generating pack instructions which pack or reorder data in registers is essential for exploitation of SIMD instructions. This paper presents a code generation technique for SIMD instructions with pack instructions. SIMD instructions are generated by finding and grouping the same operations in programs. After the SIMD instruction generation, pack instructions are generated. In the pack instruction generation, Multi-valued Decision Diagram (MDD) is introduced to represent and to manipulate sets of packed data. Experimental results show that the proposed code generation technique can generate assembly code with SIMD and pack instructions performing repacking of 8 packed data in registers for a RISC processor with a dual-issue coprocessor which supports SIMD and pack instructions. The proposed method achieved speedup ratio up to about 8.5 by SIMD instructions and multiple-issue mechanism of the target processor.

  • Acceleration of DCT Processing with Massive-Parallel Memory-Embedded SIMD Matrix Processor

    Takeshi KUMAKI  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Yasuto KURODA  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E90-D No:8
      Page(s):
    1312-1315

    This paper reports an efficient Discrete Cosine Transform (DCT) processing method for images using a massive-parallel memory-embedded SIMD matrix processor. The matrix-processing engine has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. For compatibility with this matrix-processing architecture, the conventional DCT algorithm has been improved in arithmetic order and the vertical/horizontal-space 1 Dimensional (1D)-DCT processing has been further developed. Evaluation results of the matrix-engine-based DCT processing show that the necessary clock cycles per image block can be reduced by 87% in comprison to a conventional DSP architecture. The determined performances in MOPS and MOPS/mm2 are factors 8 and 5.6 better than with a conventional DSP, respectively.

  • Continuous Design Efforts for Ubiquitous Network Era under the Physical Limitation of Advanced CMOS

    Kazutami ARIMOTO  Toshihiro HATTORI  Hidehiro TAKATA  Atsushi HASEGAWA  Toru SHIMIZU  

     
    INVITED PAPER

      Vol:
    E90-C No:4
      Page(s):
    657-665

    Many embedded system application in ubiquitous network strongly require the high performance SoC with overcoming the physical limitations in the advanced CMOS. To develop these SoC, the continuous design efforts have been done. The initial efforts are the primitive level circuit technique and power switching control method for suppressing the standby currents. However, the additional physical limitations and system enhancements becomes main factors, the new design efforts have been proposed. These design efforts are the application-oriented technologies from the system level to device level. This paper introduces the self voltage controlled technique to cancel the PVT (process, voltage, and temperature) variation, power distribution and power management for cellular phone application, parallel algorithm and optimized layout DSP, and massively parallel fine-grained SIMD processor for next multimedia application. The high performance SoC for the embedded are achieved by providing the components of the system level IPs and making the application oriented SoC platform.

  • Scalable FPGA/ASIC Implementation Architecture for Parallel Table-Lookup-Coding Using Multi-Ported Content Addressable Memory

    Takeshi KUMAKI  Yutaka KONO  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E90-D No:1
      Page(s):
    346-354

    This paper presents a scalable FPGA/ASIC implementation architecture for high-speed parallel table-lookup-coding using multi-ported content addressable memory, aiming at facilitating effective table-lookup-coding solutions. The multi-ported CAM adopts a Flexible Multi-ported Content Addressable Memory (FMCAM) technology, which represents an effective parallel processing architecture and was previously reported in [1]. To achieve a high-speed parallel table-lookup-coding solution, FMCAM is improved by additional schemes for a single search mode and counting value setting mode, so that it permits fast parallel table-lookup-coding operations. Evaluation results for Huffman encoding within the JPEG application show that a synthesized semi-custom ASIC implementation of the proposed architecture can already reduce the required clock-cycle number by 93% in comparison to a conventional DSP. Furthermore, the performance per area unit, measured in MOPS/mm2, can be improved by a factor of 3.8 in comparison to parallel operated DSPs. Consequently, the proposed architecture is very suitable for FPGA/ASIC implementation, and is a promising solution for small area integrated realization of real-time table-lookup-coding applications.

  • A Sub-mW H.264 Baseline-Profile Motion Estimation Processor Core with a VLSI-Oriented Block Partitioning Strategy and SIMD/Systolic-Array Architecture

    Junichi MIYAKOSHI  Yuichiro MURACHI  Tetsuro MATSUNO  Masaki HAMAMOTO  Takahiro IINUMA  Tomokazu ISHIHARA  Hiroshi KAWAGUCHI  Masayuki MIYAMA  Masahiko YOSHIMOTO  

     
    PAPER-VLSI Architecture

      Vol:
    E89-A No:12
      Page(s):
    3623-3633

    We propose a sub-mW H.264 baseline-profile motion estimation processor for portable video applications. It features a VLSI-oriented block partitioning strategy and low-power SIMD/systolic-array datapath architecture, where the datapath can be switched between an SIMD and systolic array depending on processing flow. The processor supports all the seven kinds of block modes, and can handle three reference frames for a CIF (352288) 30-fps to QCIF (176144) 15-fps sequences with a quarter-pixel accuracy. It integrates 3.3 million transistors, and occupies 2.83.1 mm2 in a 130-nm CMOS technology. The proposed processor achieves a power of 800 µW in a QCIF 15-fps sequence with one reference picture.

  • Design and Evaluation of a Massively Parallel Processor Based on Matrix Architecture

    Toru SHIMIZU  Masami NAKAJIMA  Masahiro KAINAGA  

     
    INVITED PAPER

      Vol:
    E89-C No:11
      Page(s):
    1512-1518

    This paper describes the design and evaluation of a massively parallel processor base on Matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves 40 GOPS of 16-bit fixed-point additions at 200 MHz clock frequency and 250 mW power dissipation. In addition, 1 M-bit SRAM for data registers and 2,048 2-bit processing elements connected by a flexible switching network are integrated in 3.1 mm2 in 90 nm low-power CMOS technology. The energy-efficient Matrix architecture supports 2,048-way parallel operations and the programmable functions required for multimedia SoCs.

  • A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers

    Xizhen XU  Sotirios G. ZIAVRAS  

     
    PAPER-Parallel/Distributed Algorithms

      Vol:
    E89-D No:2
      Page(s):
    639-646

    FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.

  • An Efficient Matrix-Based 2-D DCT Splitter and Merger for SIMD Instructions

    Yuh-Jue CHUANG  Ja-Ling WU  

     
    PAPER-Image Processing and Multimedia Systems

      Vol:
    E88-D No:7
      Page(s):
    1569-1577

    Recent microprocessors have included SIMD (single instruction multiple data) extensions into their instruction set architecture to improve the performance of multimedia applications. SIMD instructions speed up the execution of programs but pose lots of challenges to software developers. An efficient matrix-based splitter (or merger), which can split an N N 2-D DCT block into four N/2 N/2 or two N N/2 (or N/2 N) 2-D DCT blocks (or merger small size blocks into a large size one), specialized for SIMD architectures is presented in this paper. The programming-level complexity of the proposed methods is lower than that of the direct approach. Furthermore, even without using SIMD instructions, the algorithmic-level complexity of the proposed DCT splitter/merger is still lower than that of the direct one and is the same as that of the most efficient approach existed in the literature. When N = 8, our method can be applied to act as a transcoder between the latest video coding standards AVC/H.264 and the older ones, such as MPEG-1, MPEG-2 and MPEG-4 part 2. We also provide the image quality tests to show the performance of the proposed 2-D DCT splitter and merger.

1-20hit(51hit)