The search functionality is under construction.

Keyword Search Result

[Keyword] parallelism(33hit)

1-20hit(33hit)

  • A Performance Model for Reconfigurable Block Cipher Array Utilizing Amdahl's Law

    Tongzhou QU  Zibin DAI  Yanjiang LIU  Lin CHEN  Xianzhao XIA  

     
    PAPER-Computer System

      Pubricized:
    2022/02/17
      Vol:
    E105-D No:5
      Page(s):
    964-972

    The existing research on Amdahl's law is limited to multi/many-core processors, and cannot be applied to the important parallel processing architecture of coarse-grained reconfigurable arrays. This paper studies the relation between the multi-level parallelism of block cipher algorithms and the architectural characteristics of coarse-grain reconfigurable arrays. We introduce the key variables that affect the performance of reconfigurable arrays, such as communication overhead and configuration overhead, into Amdahl's law. On this basis, we propose a performance model for coarse-grain reconfigurable block cipher array (CGRBA) based on the extended Amdahl's law. In addition, this paper establishes the optimal integer nonlinear programming model, which can provide a parameter reference for the architecture design of CGRBA. The experimental results show that: (1) reducing the communication workload ratio and increasing the number of configuration pages reasonably can significantly improve the algorithm performance on CGRBA; (2) the communication workload ratio has a linear effect on the execution time.

  • Reducing Energy Consumption of Wakeup Logic through Double-Stage Tag Comparison

    Yasutaka MATSUDA  Ryota SHIOYA  Hideki ANDO  

     
    PAPER-Computer System

      Pubricized:
    2021/11/02
      Vol:
    E105-D No:2
      Page(s):
    320-332

    The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, this in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of entries in the IQ, where several oldest instructions that are likely to have an adverse effect on performance reside, to a single stage for tag comparison. Our evaluation results for SPEC2017 benchmark programs show that the double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.

  • An Efficient Method for Training Deep Learning Networks Distributed

    Chenxu WANG  Yutong LU  Zhiguang CHEN  Junnan LI  

     
    PAPER-Fundamentals of Information Systems

      Pubricized:
    2020/09/07
      Vol:
    E103-D No:12
      Page(s):
    2444-2456

    Training deep learning (DL) is a computationally intensive process; as a result, training time can become so long that it impedes the development of DL. High performance computing clusters, especially supercomputers, are equipped with a large amount of computing resources, storage resources, and efficient interconnection ability, which can train DL networks better and faster. In this paper, we propose a method to train DL networks distributed with high efficiency. First, we propose a hierarchical synchronous Stochastic Gradient Descent (SGD) strategy, which can make full use of hardware resources and greatly increase computational efficiency. Second, we present a two-level parameter synchronization scheme which can reduce communication overhead by transmitting parameters of the first layer models in shared memory. Third, we optimize the parallel I/O by making each reader read data as continuously as possible to avoid the high overhead of discontinuous data reading. At last, we integrate the LARS algorithm into our system. The experimental results demonstrate that our approach has tremendous performance advantages relative to unoptimized methods. Compared with the native distributed strategy, our hierarchical synchronous SGD strategy (HSGD) can increase computing efficiency by about 20 times.

  • Exploiting Packet-Level Parallelism of Packet Parsing for FPGA-Based Switches

    Junnan LI  Biao HAN  Zhigang SUN  Tao LI  Xiaoyan WANG  

     
    PAPER-Transmission Systems and Transmission Equipment for Communications

      Pubricized:
    2019/03/18
      Vol:
    E102-B No:9
      Page(s):
    1862-1874

    FPGA-based switches are appealing nowadays due to the balance between hardware performance and software flexibility. Packet parser, as the foundational component of FPGA-based switches, is to identify and extract specific fields used in forwarding decisions, e.g., destination IP address. However, traditional parsers are too rigid to accommodate new protocols. In addition, FPGAs usually have a much lower clock frequency and fewer hardware resources, compared to ASICs. In this paper, we present PLANET, a programmable packet-level parallel parsing architecture for FPGA-based switches, to overcome these two limitations. First, PLANET has flexible programmability of updating parsing algorithms at run-time. Second, PLANET highly exploits parallelism inside packet parsing to compensate FPGA's low clock frequency and reduces resource consumption with one-block recycling design. We implemented PLANET on an FPGA-based switch prototype with well-integrated datacenter protocols. Evaluation results show that our design can parse packets at up to 100 Gbps, as well as maintain a relative low parsing latency and fewer hardware resources than existing proposals.

  • View Priority Based Threads Allocation and Binary Search Oriented Reweight for GPU Accelerated Real-Time 3D Ball Tracking

    Yilin HOU  Ziwei DENG  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Image Recognition, Computer Vision

      Pubricized:
    2018/08/31
      Vol:
    E101-D No:12
      Page(s):
    3190-3198

    In real-time 3D ball tracking of sports analysis in computer vision technology, complex algorithms which assure the accuracy could be time-consuming. Particle filter based algorithm has a large potential to accelerate since the algorithm between particles has the chance to be paralleled in heterogeneous CPU-GPU platform. Still, with the target multi-view 3D ball tracking algorithm, challenges exist: 1) serial flowchart for each step in the algorithm; 2) repeated processing for multiple views' processing; 3) the low degree of parallelism in reweight and resampling steps for sequential processing. On the CPU-GPU platform, this paper proposes the double stream system flow, the view priority based threads allocation, and the binary search oriented reweight. Double stream system flow assigns tasks which there is no data dependency exists into different streams for each frame processing to achieve parallelism in system structure level. View priority based threads allocation manipulates threads in multi-view observation task. Threads number is view number multiplied by particles number, and with view priority assigning, which could help both memory accessing and computing achieving parallelism. Binary search oriented reweight reduces the time complexity by avoiding to generate cumulative distribution function and uses an unordered array to implement a binary search. The experiment is based on videos which record the final game of an official volleyball match (2014 Inter-High School Games of Men's Volleyball held in Tokyo Metropolitan Gymnasium in Aug. 2014) and the test sequences are taken by multiple-view system which is made of 4 cameras locating at the four corners of the court. The success rate achieves 99.23% which is the same as target algorithm while the time consumption has been accelerated from 75.1ms/frame in CPU environment to 3.05ms/frame in the proposed system which is 24.62 times speed up, also, it achieves 2.33 times speedup compared with basic GPU implemented work.

  • An Efficient Parallel Coding Scheme in Erasure-Coded Storage Systems

    Wenrui DONG  Guangming LIU  

     
    PAPER-Computer System

      Pubricized:
    2017/12/12
      Vol:
    E101-D No:3
      Page(s):
    627-643

    Erasure codes have been considered as one of the most promising techniques for data reliability enhancement and storage efficiency in modern distributed storage systems. However, erasure codes often suffer from a time-consuming coding process which makes them nearly impractical. The opportunity to solve this problem probably rely on the parallelization of erasure-code-based application on the modern multi-/many-core processors to fully take advantage of the adequate hardware resources on those platforms. However, the complicated data allocation and limited I/O throughput pose a great challenge on the parallelization. To address this challenge, we propose a general multi-threaded parallel coding approach in this work. The approach consists of a general multi-threaded parallel coding model named as MTPerasure, and two detailed parallel coding algorithms, named as sdaParallel and ddaParallel, respectively, adapting to different I/O circumstances. MTPerasure is a general parallel coding model focusing on the high level data allocation, and it is applicable for all erasure codes and can be implemented without any modifications of the low level coding algorithms. The sdaParallel divides the data into several parts and the data parts are allocated to different threads statically in order to eliminate synchronization latency among multiple threads, which improves the parallel coding performance under the dummy I/O mode. The ddaParallel employs two threads to execute the I/O reading and writing on the basis of small pieces independently, which increases the I/O throughput. Furthermore, the data pieces are assigned to the coding thread dynamically. A special thread scheduling algorithm is also proposed to reduce thread migration latency. To evaluate our proposal, we parallelize the popular open source library jerasure based on our approach. And a detailed performance comparison with the original sequential coding program indicates that the proposed parallel approach outperforms the original sequential program by an extraordinary speedups from 1.4x up to 7x, and achieves better utilization of the computation and I/O resources.

  • Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism

    Wei GAO  Lin HAN  Rongcai ZHAO  Yingying LI  Jian LIU  

     
    PAPER-Software System

      Pubricized:
    2016/09/29
      Vol:
    E100-D No:1
      Page(s):
    91-106

    Single-instruction multiple-data (SIMD) extension provides an energy-efficient platform to scale the performance of media and scientific applications while still retaining post-programmability. However, the major challenge is to translate the parallel resources of the SIMD hardware into real application performance. Currently, all the slots in the vector register are used when compilers exploit SIMD parallelism of programs, which can be called sufficient vectorization. Sufficient vectorization means all the data in the vector register is valid. Because all the slots which vector register provides must be used, the chances of vectorizing programs with low SIMD parallelism are abandoned by sufficient vectorization method. In addition, the speedup obtained by full use of vector register sometimes is not as great as that obtained by partial use. Specifically, the length of vector register provided by SIMD extension becomes longer, sufficient vectorization method cannot exploit the SIMD parallelism of programs completely. Therefore, insufficient vectorization method is proposed, which refer to partial use of vector register. First, the adaptation scene of insufficient vectorization is analyzed. Second, the methods of computing inter-iteration and intra-iteration SIMD parallelism for loops are put forward. Furthermore, according to the relationship between the parallelism and vector factor a method is established to make the choice of vectorization method, in order to vectorize programs as well as possible. Finally, code generation strategy for insufficient vectorization is presented. Benchmark test results show that insufficient vectorization method vectorized more programs than sufficient vectorization method by 107.5% and the performance achieved by insufficient vectorization method is 12.1% higher than that achieved by sufficient vectorization method.

  • Power Consumption Signature: Characterizing an SSD

    Balgeun YOO  Seongjin LEE  Youjip WON  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2016/03/30
      Vol:
    E99-D No:7
      Page(s):
    1796-1809

    SSDs consist of non-mechanical components (host interface, control core, DRAM, flash memory, etc.) whose integrated behavior is not well-known. This makes an SSD seem like a black-box to users. We analyzed power consumption of four SSDs with standard I/O operations. We find the following: (a) the power consumption of SSDs is not significantly lower than that of HDDs, (b) all SSDs we tested had similar power consumption patterns which, we assume, is a result of their internal parallelism. SSDs have a parallel architecture that connects flash memories by channel or by way. This parallel architecture improves performance of SSDs if the information is known to the file system. This paper proposes three SSD characterization algorithms to infer the characteristics of SSD, such as internal parallelism, I/O unit, and page allocation scheme, by measuring its power consumption with various sized workloads. These algorithms are applied to four real SSDs to find: (i) the internal parallelism to decide whether to perform I/Os in a concurrent or an interleaved manner, (ii) the I/O unit size that determines the maximum size that can be assigned to a flash memory, and (iii) a page allocation method to map the logical address of write operations, which are requested from the host to the physical address of flash memory. We developed a data sampling method to provide consistency in collecting power consumption patterns of each SSD. When we applied three algorithms to four real SSDs, we found flash memory configurations, I/O unit sizes, and page allocation schemes. We show that the performance of SSD can be improved by aligning the record size of file system with I/O unit of SSD, which we found by using our algorithm. We found that Q Pro has I/O unit of 32 KB, and by aligning the file system record size to 32 KB, the performance increased by 201% and energy consumption decreased by 85%, which compared to the record size of 4 KB.

  • Performance of Dynamic Instruction Window Resizing for a Given Power Budget under DVFS Control

    Hideki ANDO  Ryota SHIOYA  

     
    PAPER-Computer System

      Pubricized:
    2015/11/12
      Vol:
    E99-D No:2
      Page(s):
    341-350

    Dynamic instruction window resizing (DIWR) is a scheme that effectively exploits both memory-level parallelism and instruction-level parallelism by configuring the instruction window size appropriately for exploiting each parallelism. Although a previous study has shown that the DIWR processor achieves a significant speedup, power consumption has not been explored. The power consumption is increased in DIWR because the instruction window resources are enlarged in memory-intensive phases. If the power consumption exceeds the power budget determined by certain requirements, the DIWR processor must save power and thus, the performance previously presented cannot be achieved. In this paper, we explore to what extent the DIWR processor can achieve improved performance for a given power budget, assuming that dynamic voltage and frequency scaling (DVFS) is introduced as a power saving technique. Evaluation results using the SPEC2006 benchmark programs show that the DIWR processor, even with a constrained power budget, achieves a speedup over the conventional processor over a wide range of given power budgets. At the most important power budget point, i.e., when the power a conventional processor consumes without any power constraint is supplied, DIWR achieves a 16% speedup.

  • A Load-Balanced Deterministic Runtime for Pipeline Parallelism

    Chen CHEN  Kai LU  Xiaoping WANG  Xu ZHOU  Zhendong WU  

     
    LETTER-Software System

      Pubricized:
    2014/10/21
      Vol:
    E98-D No:2
      Page(s):
    433-436

    Most existing deterministic multithreading systems are costly on pipeline parallel programs due to load imbalance. In this letter, we propose a Load-Balanced Deterministic Runtime (LBDR) for pipeline parallelism. LBDR deterministically takes some tokens from non-synchronization-intensive threads to synchronization-intensive threads. Experimental results show that LBDR outperforms the state-of-the-art design by an average of 22.5%.

  • MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

    Yuya KORA  Kyohei YAMAGUCHI  Hideki ANDO  

     
    PAPER-Computer System

      Pubricized:
    2014/09/22
      Vol:
    E97-D No:12
      Page(s):
    3110-3123

    Single-thread performance has not improved much over the past few years, despite an ever increasing transistor budget. One of the reasons for this is that there is a speed gap between the processor and main memory, known as the memory wall. A promising method to overcome this memory wall is aggressive out-of-order execution by extensively enlarging the instruction window resources to exploit memory-level parallelism (MLP). However, simply enlarging the window resources lengthens the clock cycle time. Although pipelining the resources solves this problem, it in turn prevents instruction-level parallelism (ILP) from being exploited because issuing instructions requires multiple clock cycles. This paper proposed a dynamic scheme that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP. Specifically, if the scheme predicts that MLP is available during execution, the instruction window is enlarged and the window resources are pipelined, thereby exploiting MLP. Conversely, if the scheme predicts that less MLP is available, that is, ILP is exploitable for improved performance, the instruction window is shrunk and the window resources are de-pipelined, thereby exploiting ILP. Our evaluation results using the SPEC2006 benchmark programs show that the proposed scheme achieves nearly the best performance possible with fixed-size resources. On average, our scheme realizes a performance improvement of 21% over that of a conventional processor, with additional cost of only 6% of the area of the conventional processor core or 3% of that of the entire processor chip. The evaluation results also show 8% better energy efficiency in terms of 1/EDP (energy-delay product).

  • Tuning GridFTP Pipelining, Concurrency and Parallelism Based on Historical Data

    Jangyoung KIM  

     
    LETTER-Information Network

      Pubricized:
    2014/07/28
      Vol:
    E97-D No:11
      Page(s):
    2963-2966

    This paper presents a prediction model based on historical data to achieve optimal values of pipelining, concurrency and parallelism (PCP) in GridFTP data transfers in Cloud systems. Setting the correct values for these three parameters is crucial in achieving high throughput in end-to-end data movement. However, predicting and setting the optimal values for these parameters is a challenging task, especially in shared and non-predictive network conditions. Several factors can affect the optimal values for these parameters such as the background network traffic, available bandwidth, Round-Trip Time (RTT), TCP buffer size, and file size. Existing models either fail to provide accurate predictions or come with very high prediction overheads. The author shows that new model based on historical data can achieve high accuracy with low overhead.

  • Parallelism Analysis of H.264 Decoder and Realization on a Coarse-Grained Reconfigurable SoC

    Gugang GAO  Peng CAO  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E96-D No:8
      Page(s):
    1654-1666

    One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to efficiently map applications. The key issues for mapping are (1) how to reduce the memory bandwidth, (2) how to exploit parallelism in algorithms and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called ‘Hybrid partitioning’, for mapping a H.264 high definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining good features of data partitioning and task partitioning, our methodology mainly consists of three levels from top to bottom: (1) hybrid task pipeline based on slice and macroblock (MB) level; (2) MB row-level data parallelism; (3) sub-MB level parallelism method. Further, on the sub-MB level, we propose a few mapping strategies such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, 2D-wave for intra 44, parallel processing order for deblocking. With our mapping strategies, we improved the algorithm's performance on REMUS-II. For example, with a luma 1616 MB, the Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than fixed 44 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when the working frequency of REMUS-II is 200 MHz. Compared with typical hardware platforms, we can achieve better performance, area, and flexibility. For example, our performance achieves approximately 175% improvement than that of a commercial CGRA processor XPP-III while only using 70% of its area.

  • A 64 Cycles/MB, Luma-Chroma Parallelized H.264/AVC Deblocking Filter for 4 K2 K Applications

    Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    441-446

    In this paper, a high-throughput debloking filter is presented for H.264/AVC standard, catering video applications with 4 K2 K (40962304) ultra-definition resolution. In order to strengthen the parallelism without simply increasing the area, we propose a luma-chroma parallel method. Meanwhile, this work reduces the number of processing cycles, the amount of external memory traffic and the working frequency, by using triple four-stage pipeline filters and a luma-chroma interlaced sequence. Furthermore, it eliminates most unnecessary off-chip memory bandwidth with a highly reusable memory scheme, and adopts a “slide window” buffer scheme. As a result, our design can support 4 K2 K at 30 fps applications at the working frequency of only 70.8 MHz.

  • Solving SAT and Hamiltonian Cycle Problem Using Asynchronous P Systems

    Hirofumi TAGAWA  Akihiro FUJIWARA  

     
    PAPER

      Vol:
    E95-D No:3
      Page(s):
    746-754

    In the present paper, we consider fully asynchronous parallelism in membrane computing, and propose two asynchronous P systems for the satisfiability (SAT) and Hamiltonian cycle problem. We first propose an asynchronous P system that solves SAT with n variables and m clauses, and show that the proposed P system computes SAT in O(mn2n) sequential steps or O(mn) parallel steps using O(mn) kinds of objects. We next propose an asynchronous P system that solves the Hamiltonian cycle problem with n nodes, and show that the proposed P system computes the problem in O(n!) sequential steps or O(n2) parallel steps using O(n2) kinds of objects.

  • Generic Permutation Network for QC-LDPC Decoder

    Xiao PENG  Xiongxin ZHAO  Zhixiang CHEN  Fumiaki MAEHARA  Satoshi GOTO  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E93-A No:12
      Page(s):
    2551-2559

    Permutation network plays an important role in the reconfigurable QC-LDPC decoder for most modern wireless communication systems with multiple code rates and various code lengths. This paper presents the generic permutation network (GPN) for the reconfigurable QC-LDPC decoder. Compared with conventional permutation networks, this proposal could break through the input number restriction, such as power of 2 and other limited number, and optimize the network for any application in demand. Moreover, the proposed scheme could greatly reduce the latency because of less stages and efficient control signal generating algorithm. In addition, the proposed network processes the nature of high parallelism which could enable several groups of data to be cyclically shifted simultaneously. The synthesis results using the 90 nm technology demonstrate that this architecture can be implemented with the gate count of 18.3k for WiMAX standard at the frequency of 600 MHz and 10.9k for WiFi standard at the frequency of 800 MHz.

  • GridFTP-APT: Automatic Parallelism Tuning Mechanism for GridFTP in Long-Fat Networks

    Takeshi ITO  Hiroyuki OHSAKI  Makoto IMASE  

     
    PAPER-Network

      Vol:
    E91-B No:12
      Page(s):
    3925-3936

    In this paper, we propose an extension to GridFTP that optimizes its performance by dynamically adjusting the number of parallel TCP connections. GridFTP has been used as a data transfer protocol to effectively transfer a large volume of data in Grid computing. GridFTP supports a feature called parallel data transfer that improves throughput by establishing multiple TCP connections in parallel. However, for achieving high GridFTP throughput, the number of TCP connections should be optimized based on the network status. In this paper, we propose an automatic parallelism tuning mechanism called GridFTP-APT (GridFTP with Automatic Parallelism Tuning) that adjusts the number of parallel TCP connections according to information available to the Grid middleware. Through simulations, we demonstrate that GridFTP-APT significantly improves the performance of GridFTP in various network environments.

  • Power Estimation of Partitioned Register Files in a Clustered Architecture with Performance Evaluation

    Yukinori SATO  Ken-ichi SUZUKI  Tadao NAKAMURA  

     
    PAPER-VLSI Systems

      Vol:
    E90-D No:3
      Page(s):
    627-636

    High power consumption and slow access of enlarged and multiported register files make it difficult to design high performance superscalar processors. The clustered architecture, where the conventional monolithic register file is partitioned into several smaller register files, is expect to overcome the register file issues. In the clustered architecture, the more a monolithic register file is partitioned, the lower power and faster access register files can be realized. However, the partitioning causes losses of IPC (instructions per clock cycle) due to communication among register files. Therefore, degree of partitioning has a strong impact on the trade-off between power consumption and performance. In addition, the organization of partitioned register files also affects the trade-off. In this paper, we attempt to investigate appropriate degrees of partitioning and organizations of partitioned register files in a clustered architecture to assess the trade-off. From the results of execute-driven simulation, we find that the organization of register files and the degree of partitioning have a strong impact on the IPC, and the configuration with non-consistent register files can make use of the partitioned resources more effectively. From the results of register file access time and energy modeling, we find that the configurations with the highly partitioned non-consistent register file organization can receive benefit of the partitioning in terms of operating frequency and access energy of register files. Further, we examine relationship between IPS (instructions per second) and the product of IPC and operating frequency of register files. The results suggest that highly partitioned non-consistent configurations tends to gain more advantage in performance and power.

  • Optimal Scheduling for Real-Time Parallel Tasks

    Wan Yeon LEE  Heejo LEE  

     
    LETTER-Algorithm Theory

      Vol:
    E89-D No:6
      Page(s):
    1962-1966

    We propose an optimal algorithm for the real-time scheduling of parallel tasks on multiprocessors, where the tasks have the properties of flexible preemption, linear speedup, bounded parallelism, and arbitrary deadline. The proposed algorithm is optimal in the sense that it always finds out a feasible schedule if one exists. Furthermore, the algorithm delivers the best schedule consuming the fewest processors among feasible schedules. In this letter, we prove the optimality of the proposed algorithm. Also, we show that the time complexity of the algorithm is O(M2N2) in the worst case, where M and N are the number of tasks and the number of processors, respectively.

  • Sub-operation Parallelism Optimization in SIMD Processor Core Synthesis

    Hideki KAWAZU  Jumpei UCHIDA  Yuichiro MIYAOKA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E88-A No:4
      Page(s):
    876-884

    A b-bit SIMD functional unit has n k-bit sub-functional units in itself, where b = k n. It can execute n-parallel k-bit operations. However, all the b-bit functional units in a processor core do not necessarily execute n-parallel operations. Depending on an application program, some of them just execute n/2-parallel operations or even n/4-parallel operations. This means that we can modify a b-bit SIMD functional unit so that it has n/2 k-bit sub-functional units or n/4 k-bit sub-functional units. The number of k-bit sub-functional units in a SIMD functional unit is called sub-operation parallelism. We incorporate a sub-operation parallelism optimization algorithm into SIMD functional unit optimization. Our proposed algorithm gradually reduces sub-operation parallelism of a SIMD functional unit while the timing constraint of execution time satisfied. Thereby, we can finally find a processor core with small area under the given timing constraint. We expect that we can obtain processor core configurations of smaller area in the same timing constraint rather than a conventional system. The promising experimental results are also shown.

1-20hit(33hit)