The search functionality is under construction.

Keyword Search Result

[Keyword] memory allocation(11hit)

1-11hit
  • Block Level TLB Coalescing for Buddy Memory Allocator Open Access

    Jae Young HUR  

     
    LETTER-Computer System

      Pubricized:
    2019/07/17
      Vol:
    E102-D No:10
      Page(s):
    2043-2046

    Conventional TLB (Translation Lookaside Buffer) coalescing schemes do not fully exploit the contiguity that a memory allocator provides. The conventional schemes accordingly have certain performance overheads due to page table walks. To address this issue, we propose an efficient scheme, called block contiguity translation (BCT), that accommodates the block size information in a page table considering the Buddy algorithm. By fully exploiting the block-level contiguity, we can reduce the page table walks as certain physical memory is allocated in the contiguous way. Additionally, we present unified per-level page sizes to simplify the design and better utilize the contiguity information. Considering the state-of-the-art schemes as references, the comparative analysis and the performance simulations are conducted. Experiments indicate that the proposed scheme can improve the memory system performance with moderate hardware overheads.

  • Memory-Access-Driven Context Partitioning for Window-Based Image Processing on Heterogeneous Multicore Processors

    Hasitha Muthumala WAIDYASOORIYA  Yosuke OHBAYASHI  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    354-363

    Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. The memory capacities of the local memories are very small. Therefore, the data should be transferred from the global memory to the local memories many times. These data transfers greatly increase the total processing time. Memory allocation technique to increase the data sharing is a good solution to this problem. However, when using reconfigurable cores, the data must be shared among multiple contexts. However, conventional context partitioning methods only consider how to reuse limited hardware resources in different time slots. They do not consider the data sharing. This paper proposes a context partitioning method to share both the hardware resources and the local memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.

  • Memory Allocation for Window-Based Image Processing on Multiple Memory Modules with Simple Addressing Functions

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E94-A No:1
      Page(s):
    342-351

    Accelerator cores in low-power embedded processors have on-chip multiple memory modules to increase the data access speed and to enable parallel data access. When large functional units such as multipliers and dividers are used for addressing, a large power and chip area are consumed. Therefore, recent low-power processors use small functional units such as adders and counters to reduce the power and area. Such small functional units make it difficult to implement complex addressing patterns without duplicating data among multiple memory modules. The data duplication wastes the memory capacity and increases the data transfer time significantly. This paper proposes a method to reduce the memory duplication for window-based image processing, which is widely used in many applications. Evaluations using an accelerator core show that the proposed method reduces the data amount and data transfer time by more than 50%.

  • Energy-Aware Memory Allocation Framework for Embedded Data-Intensive Signal Processing Applications

    Florin BALASA  Ilie I. LUICAN  Hongwei ZHU  Doru V. NASUI  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E92-A No:12
      Page(s):
    3160-3168

    Many signal processing systems, particularly in the multimedia and telecommunication domains, are synthesized to execute data-intensive applications: their cost related aspects -- namely power consumption and chip area -- are heavily influenced, if not dominated, by the data access and storage aspects. This paper presents an energy-aware memory allocation methodology. Starting from the high-level behavioral specification of a given application, this framework performs the assignment of the multidimensional signals to the memory layers -- the on-chip scratch-pad memory and the off-chip main memory -- the goal being the reduction of the dynamic energy consumption in the memory subsystem. Based on the assignment results, the framework subsequently performs the mapping of signals into both memory layers such that the overall amount of data storage be reduced. This software system yields a complete allocation solution: the exact storage amount on each memory layer, the mapping functions that determine the exact locations for any array element (scalar signal) in the specification, and an estimation of the dynamic energy consumption in the memory subsystem.

  • Formal Model for the Reduction of the Dynamic Energy Consumption in Multi-Layer Memory Subsystems

    Hongwei ZHU  Ilie I. LUICAN  Florin BALASA  Dhiraj K. PRADHAN  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E91-A No:12
      Page(s):
    3559-3567

    In real-time data-dominated communication and multimedia processing applications, a multi-layer memory hierarchy is typically used to enhance the system performance and also to reduce the energy consumption. Savings of dynamic energy can be obtained by accessing frequently used data from smaller on-chip memories rather than from large background memories. This paper focuses on the reduction of the dynamic energy consumption in the memory subsystem of multidimensional signal processing systems, starting from the high-level algorithmic specification of the application. The paper presents a formal model which identifies those parts of arrays more intensely accessed, taking also into account the relative lifetimes of the signals. Tested on a two-layer memory hierarchy, this model led to savings of dynamic energy from 40% to over 70% relative to the energy used in the case of flat memory designs.

  • Memory Size Computation for Real-Time Multimedia Applications Based on Polyhedral Decomposition

    Hongwei ZHU  Ilie I. LUICAN  Florin BALASA  

     
    PAPER-System Level Design

      Vol:
    E89-A No:12
      Page(s):
    3378-3386

    In real-time multimedia processing systems a very large part of the power consumption is due to the data storage and data transfer. Moreover, the area cost is often largely dominated by the memory modules. In deriving an optimized (for area and/or power) memory architecture, memory size computation is an important step in the exploration of the possible algorithmic specifications of multimedia applications. This paper presents a novel non-scalar approach for computing exactly the memory size in real-time multimedia algorithms. This methodology uses both algebraic techniques specific to the data-flow analysis used in modern compilers and, also, more recent advances in the theory of polyhedra. In contrast with all the previous works which are only estimation methods, this approach performs exact memory computations even for applications significantly large in terms of the code size, number of scalars, and number of array references.

  • Architecture of a Stereo Matching VLSI Processor Based on Hierarchically Parallel Memory Access

    Masanori HARIYAMA  Haruka SASAKI  Michitaka KAMEYAMA  

     
    PAPER-Digital Circuits and Computer Arithmetic

      Vol:
    E88-D No:7
      Page(s):
    1486-1491

    This paper presents a VLSI processor for high-speed and reliable stereo matching based on adaptive window-size control of SAD(Sum of Absolute Differences) computation. To reduce its computational complexity, SADs are computed using multi-resolution images. Parallel memory access is essential for highly parallel image processing. For parallel memory access, this paper also presents an optimal memory allocation that minimizes the hardware amount under the condition of parallel memory access at specified resolutions.

  • Memory Allocation and Code Optimization Methods for DSPs with Indexed Auto-Modification

    Yuhei KANEKO  Nobuhiko SUGINO  Akinori NISHIHARA  

     
    PAPER

      Vol:
    E88-A No:4
      Page(s):
    846-854

    A memory address allocation method for digital signal processors of indirect addressing with indexed auto-modification is proposed. At first, address auto-modification amounts for a given program are analyzed. And then, address allocation of program variables are moved and shifted so that both indexed and simple auto-modifications are effectively exploited. For further reduction in overhead codes, a memory address allocation method coupled with computational reordering is proposed. The proposed methods are applied to the existing compiler, and generated codes prove their effectiveness.

  • Code Optimization Technique for Indirect Addressing DSPs with Consideration in Local Computational Order and Memory Allocation

    Nobuhiko SUGINO  Akinori NISHIHARA  

     
    PAPER-Implementations of Signal Processing Systems

      Vol:
    E84-A No:8
      Page(s):
    1960-1968

    Digital signal processors (DSPs) usually employ indirect addressing using address registers (ARs) to indicate their memory addresses, which often introduces overhead codes in AR updates for next memory accesses. Reduction of such overhead code is one of the important issues in automatic generation of highly-efficient DSP codes. In this paper, a new automatic address allocation method incorpolated with computational order rearrangement at local commutative parts is proposed. The method formulates a given memory access sequence by a graph representation, where several strategies to handle freedom in memory access orders at the computational commutative parts are introduced and examined. A compiler scheme is also extended such that computational order at the commutative parts is rearranged according to the derived memory allocation. The proposed methods are applied to an existing DSP compiler for µPD77230(NEC), and codes generated for several examples are compared with memory allocations by the conventional methods.

  • Highly-Parallel Stereo Vision VLSI Processor Based on an Optimal Parallel Memory Access Scheme

    Masanori HARIYAMA  Seunghwan LEE  Michitaka KAMEYAMA  

     
    PAPER-Integrated Electronics

      Vol:
    E84-C No:3
      Page(s):
    382-389

    In a real-time vision system, parallel memory access is essential for highly parallel image processing. The use of multiple memory modules is one efficient technique for parallel access. In the technique, data stored in different memory modules can be accessed in parallel. This paper presents an optimal memory allocation methodology to map data to be read in parallel onto different memory modules. Based on the methodology, a high-performance VLSI processor for three-dimensional instrumentation is proposed.

  • An FPGA-Oriented Motion-Stereo Processor with a Simple Interconnection Network for Parallel Memory Access

    Seunghwan LEE  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Image Processing, Image Pattern Recognition

      Vol:
    E83-D No:12
      Page(s):
    2122-2130

    In designing a field-programmable gate array (FPGA)-based processor for motion stereo, a parallel memory system and a simple interconnection network for parallel data transfer are essential for parallel image processing. This paper, firstly, presents an FPGA-oriented hierarchical memory system. To reduce the bandwidth requirement between an on-chip memory in an FPGA and external memories, we propose an efficient scheduling: Once pixels are transferred to the on-chip memory, operations associated with the data are consecutively performed. Secondly, a rectangular memory allocation is proposed which allocates pixels to be accessed in parallel onto different memory modules of the on-chip memory. Consequently, completely parallel access can be achieved. The memory allocation also minimizes the required capacity of the on-chip memory and thus is suitable for FPGA-based implementation. Finally, a functional unit allocation is proposed to minimize the complexity between memory modules and functional units. An experimental result shows that the performance of the processor becomes 96 times higher than that of a 400 MHz Pentium II.