The search functionality is under construction.

Keyword Search Result

[Keyword] MpSoC(9hit)

1-9hit
  • CoDMA: Buffer Avoided Data Exchange in Distributed Memory Systems

    Ting CHEN  Hengzhu LIU  Botao ZHANG  

     
    PAPER-Integrated Electronics

      Vol:
    E97-C No:4
      Page(s):
    386-391

    Data exchange, in which two blocks of data are swapped between cores in distributed memory systems, necessitates additional memory buffer in a multiprocessor system-on-chip. In this paper, we propose a novel bidirectional inter-core communication mechanism called coherent direct memory access (CoDMA). The CoDMA ensures that the writing address is always less than the reading address in coherent read and write mode, so as to avoid read-after-write (RAW) errors. It features an efficient data exchanging scheme without using data buffer in the memory. A four-core single-instruction multiple-data processor is established for the experiments, based on a multi-bus network-on-chip. Experimental results show that the proposed method consumes no additional memory buffer and achieves 39% and 20% average performance improvement compared with traditional Methods 1 and 2, respectively. And a maximal of 43% reduction in memory usage is achieved, at the cost of only 0.22% more area overhead compared with the entire system.

  • A Low-Cost and Energy-Efficient Multiprocessor System-on-Chip for UWB MAC Layer

    Hao XIAO  Tsuyoshi ISSHIKI  Arif Ullah KHAN  Dongju LI  Hiroaki KUNIEDA  Yuko NAKASE  Sadahiro KIMURA  

     
    PAPER-Computer System

      Vol:
    E95-D No:8
      Page(s):
    2027-2038

    Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises a lot of facilities for high-speed and high-quality wireless communication. However, these benefits in turn involve a large amount of computational load, which challenges the traditional uniprocessor architecture based implementation method to provide the required performance. However, the constrained cost and power budget, on the other hand, makes using commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration and hardware architecture, is presented for the implementation of UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, but takes 15% less area than the uniprocessor implementation.

  • An H.264/AVC High422 Profile and MPEG-2 422 Profile Encoder LSI for HDTV Broadcasting Infrastructures

    Koyo NITTA  Hiroe IWASAKI  Takayuki ONISHI  Takashi SANO  Atsushi SAGATA  Yasuyuki NAKAJIMA  Minoru INAMORI  Ryuichi TANIDA  Atsushi SHIMIZU  Ken NAKAMURA  Mitsuo IKEDA  Jiro NAGANUMA  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    432-440

    An H.264/AVC encoder LSI (named “SARA”) that supports High422 profile, as well as 422 profile of MPEG-2, has been developed for HDTV broadcasting infrastructures. It contains three motion estimation and compensation (ME/MC) engines with wide search ranges of -217.75 to +199.75 horizontally, -109.75 to +145.75 vertically, which can utilize almost all H.264/AVC ME/MC coding tools, such as multiple reference frame, variable block size, quarter-pel prediction, macroblock adaptive field/frame prediction (MBAFF), spatial/temporal direct mode, and weighted prediction. Our evaluations show that it can encode fast moving scenes with 1.2 dB to 1.7 dB higher than the JM. It was successfully fabricated in a 90-nm technology, and integrates 140 million transistors.

  • A Novel Sequential Tree Algorithm Based on Scoreboard for MPI Broadcast Communication

    Won-young CHUNG  Jae-won PARK  Seung-Woo LEE  Won Woo RO  Yong-surk LEE  

     
    LETTER-Computer System

      Vol:
    E94-D No:12
      Page(s):
    2523-2527

    The message passing interface (MPI) broadcast communication commonly causes a severe performance bottleneck in multicore system that uses distributed memory. Thus, in this paper, we propose a novel algorithm and hardware structure for the MPI broadcast communication to reduce the bottleneck situation. The transmission order is set based on the state of each processing node that comprises the multicore system, so the novel algorithm minimizes the performance degradation caused by conflict. The proposed scoreboard MPI unit is evaluated by modeling it with SystemC and implemented using VerilogHDL. The size of the proposed scoreboard MPI unit occupies less than 1.03% of the whole chip, and it yields a highly improved performance up to 75.48% as its maximum with 16 processing nodes. Hence, with respect to low-cost design and scalability, this scoreboard MPI unit is particularly useful towards increasing overall performance of the embedded MPSoC.

  • A Low-Cost Standard Mode MPI Hardware Unit for Embedded MPSoC

    Won-young CHUNG  Ha-young JEONG  Won Woo RO  Yong-surk LEE  

     
    LETTER-Computer System

      Vol:
    E94-D No:7
      Page(s):
    1497-1501

    In this paper, we propose a novel low-cost Message Passing Interface (MPI) unit between processor nodes, which supports message passing in multiprocessor systems using distributed memory architecture. Our MPI unit operates in the standard mode – using the buffered mode for small amounts of data transaction and the synchronous mode for large amounts of data transaction. This results in increased performance by reducing the control message transmission time for the small amount of data. We verified the performance with a simulator designed based on SystemC. Additionally, we designed the MPI unit using VerilogHDL, and we synthesized it with a synopsys design compiler. The proposed standard mode MPI unit shows a high performance even though the size of the MPI unit occupies less than 1% of the whole chip. Thus, with respect to low-cost design and scalability, this MPI hardware unit is useful to increase overall performance of the embedded Multiprocessor System on a Chip (MPSoC).

  • Automatic Communication Synthesis with Hardware Sharing for Multi-Processor SoC Design

    Yuki ANDO  Seiya SHIBATA  Shinya HONDA  Hiroyuki TOMIYAMA  Hiroaki TAKADA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E93-A No:12
      Page(s):
    2509-2516

    We present a hardware sharing method for design space exploration of multi-processor embedded systems. In our prior work, we had developed a system-level design tool named SystemBuilder which automatically synthesizes target implementation of a system from a functional description. In this work, we have extended SystemBuilder so that it can automatically synthesize an area-efficient implementation which shares a hardware module among different applications. With SystemBuilder, designers only need to enable an option in order to share a hardware module. The designers, therefore, can easily explore a design space including hardware sharing in short time. A case study shows the effectiveness of the hardware sharing on design space exploration.

  • Variation-Aware Task and Communication Scheduling in MPSoCs for Power-Yield Maximization

    Mahmoud MOMTAZPOUR  Maziar GOUDARZI  Esmaeil SANAEI  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E93-A No:12
      Page(s):
    2542-2550

    Parameter variations reveal themselves as different frequency and leakage powers per instances of the same MPSoC. By the increasing variation with technology scaling, worst-case-based scheduling algorithms result in either increasingly less optimal schedules or otherwise more lost yield. To address this problem, this paper introduces a variation-aware task and communication scheduling algorithm for multiprocessor system-on-chip (MPSoC). We consider both delay and leakage power variations during the process of finding the best schedule so that leakier processors are less utilized and can be more frequently put in sleep mode to reduce power. Our algorithm takes advantage of event tables to accelerate the statistical timing and power analysis. We use genetic algorithm to find the best schedule that maximizes power-yield under a performance-yield constraint. Experimental results on real world benchmarks show that our proposed algorithm achieves 16.6% power-yield improvement on average over deterministic worst-case-based scheduling.

  • Pipeline-Based Partition Exploration for Heterogeneous Multiprocessor Synthesis

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E92-A No:9
      Page(s):
    2283-2294

    To achieve an automated implementation for the application-specific heterogeneous multiprocessor systems-on-chip (MPSoC), partitioning and mapping the sequential programs onto multiple parallel processors is one of the most difficult challenges. However, the existing traditional parallelizing techniques cannot solve the MPSoC-related problems effectively, so designers are still required to manually extract the concurrency potentials in the program. To solve this bottleneck, an automated application partition technique is needed. However, completely automatic parallelism is ineffective, so it is promising to explore concurrency for certain practical special structures. To settle those issues, this paper proposes a template-based algorithm to automatically partition a special load-compute-store (LCS) loop structure. Since specific-instruction customization for the application specific instruction-set processors (ASIPs) has interactions with task partitioning, the proposed algorithm integrates the dynamic pipelining and ASIP techniques using an iterative improvement strategy: first, an initial pipelining scheme is generated to obtain the maximum parallelism; second, under the primary partition results specific instructions are customized respectively for each subprogram; third, the program is repartitioned via pipelining under the specific instruction configurations. The proposed method has been implemented in the context of a commercial extensible multiprocessor design flow, using the Xtensa-based XTMP platform from Tensilica Inc. Based on a case study of Fast Fourier Transform (FFT), the experimental results indicate that the partitioned programs by the proposed method demonstrate an average speedup of 10 compared to the original sequential programs which have not been partitioned and run on the uniprocessor system.

  • Exploring Partitions Based on Search Space Smoothing for Heterogeneous Multiprocessor System

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-Electronic Circuits and Systems

      Vol:
    E91-A No:9
      Page(s):
    2456-2464

    Programming the multiprocessor system-on-chip (MPSoC) requires partitioning the sequential reference programs onto multiple processors running in parallel. However, designers still need to partition the code manually due to the lack of automated partition techniques. To settle this issue, this paper proposes a partition exploration algorithm based on the search space smoothing techniques, and implements the proposed method using a commercial extensible processor (Xtensa LX2 processor from Tensilica Inc.). We have verified the feasibility of the algorithm by implementing the MPEG2 benchmark on the Xtensa-based two-processor system. The final experimental results indicate a performance improvement of at least 1.6 compared to the single-processor system.