
Keyword Search Result

[Keyword] memory system (9 hits)

1-9 of 9 hits
  • A Lightweight Method to Evaluate Effect of Approximate Memory with Hardware Performance Monitors

    Soramichi AKIYAMA  

     
    PAPER-Computer System

      Publicized:
    2019/09/02
      Vol:
    E102-D No:12
      Page(s):
    2354-2365

    The latency and energy consumption of DRAM are serious concerns because (1) latency has not improved much for decades and (2) recent machines have huge main-memory capacities. Device-level studies reduce both by shortening the wait times of DRAM internal operations so that they finish faster and consume less energy. Applying these techniques aggressively to realize approximate memory is a promising direction for further reducing the overhead, given that many data-center applications today are to some extent robust to bit-flips. To advance research on approximate memory, its effect on applications must be evaluated so that both researchers and potential users can investigate how it affects realistic applications. However, hardware simulators are too slow to run workloads repeatedly with different parameters. To this end, we propose a lightweight method for evaluating the effect of approximate memory. The idea is to count the number of DRAM internal operations that occur on the approximate data of an application and to calculate the probability of bit-flips from that count, instead of using heavyweight simulators. The evaluation shows that our system is three orders of magnitude faster than cycle-accurate simulators, and we also give case studies evaluating the effect of approximate memory on several realistic applications.
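
    As a rough illustration of the counting idea, the Python sketch below converts a count of DRAM internal operations on an approximate data region into a bit-flip probability and injects flips into the data. The per-operation flip rate, the independence assumption, and all names are our own illustration, not the paper's interface.

        import random

        def flip_probability(num_ops, p_flip_per_op):
            # Probability that a given bit flips at least once over num_ops
            # DRAM internal operations (assumed independent per operation).
            return 1.0 - (1.0 - p_flip_per_op) ** num_ops

        def inject_bit_flips(data: bytearray, num_ops, p_flip_per_op, rng=random.Random(0)):
            # Flip each bit of the approximate data region with the derived
            # probability, emulating the effect of relaxed DRAM timings.
            p = flip_probability(num_ops, p_flip_per_op)
            for i in range(len(data)):
                for bit in range(8):
                    if rng.random() < p:
                        data[i] ^= 1 << bit
            return data

        buf = bytearray(b"approximate payload")
        inject_bit_flips(buf, num_ops=10_000, p_flip_per_op=1e-7)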

  • Paging out Multiple Clusters to Improve Virtual Memory System Performance

    Woo Hyun AHN  Joon-Woo CHOI  Jaewon OH  Seung-Ho LIM  Kyungbaek KIM  

     
    LETTER-Software System

      Vol:
    E97-D No:7
      Page(s):
    1905-1909

    Virtual memory systems page out a cluster of contiguous modified pages in virtual memory to a swap disk in one disk I/O, but they cannot find large clusters in applications that mainly modify non-contiguous pages. Our proposal stores multiple small clusters in one disk I/O. This decreases the number of disk writes needed to page out small clusters, thus improving page-out performance.
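
    A minimal sketch of the core step (Python; the function names and per-I/O page budget are hypothetical): group dirty page numbers into contiguous clusters, then pack several small clusters into a single page-out I/O.

        def find_clusters(dirty_pages):
            # Group sorted dirty page numbers into runs of contiguous pages.
            clusters, run = [], []
            for p in sorted(dirty_pages):
                if run and p != run[-1] + 1:
                    clusters.append(run)
                    run = []
                run.append(p)
            if run:
                clusters.append(run)
            return clusters

        def batch_pageouts(clusters, max_pages_per_io):
            # Pack several small clusters into one disk I/O, as the letter
            # proposes, instead of issuing one write per cluster.
            batch, used, ios = [], 0, []
            for c in clusters:
                if used + len(c) > max_pages_per_io and batch:
                    ios.append(batch)
                    batch, used = [], 0
                batch.append(c)
                used += len(c)
            if batch:
                ios.append(batch)
            return ios

        print(batch_pageouts(find_clusters({3, 4, 9, 17, 18, 19, 40}), 8))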

  • Process Scheduling Based Memory Energy Management for Multi-Core Mobile Devices

    Tiefei ZHANG  Tianzhou CHEN  

     
    PAPER-Systems and Control

      Vol:
    E95-A No:10
      Page(s):
    1700-1707

    Energy consumption is a serious problem for battery-powered mobile devices. As the capacity and density of off-chip memory continue to scale, its energy consumption accounts for a considerable portion of whole-system energy. There is therefore strong demand for energy-efficient techniques for the memory system. Unlike previous work, we exploit the power-management modes of off-chip memory through process scheduling on multi-core mobile devices. In particular, we schedule processes based on their memory access characteristics to maximize the number of memory banks that can stay in a low-power mode. We propose a fast approximation algorithm that solves the process scheduling problem for dual-core mobile devices. For devices with more than two cores, we prove that the process scheduling problem is NP-hard and propose two heuristic algorithms. The proposed algorithms are evaluated through a series of experiments, with encouraging results.
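
    The sketch below shows one plausible greedy heuristic in the spirit of the dual-core case (Python; the pairing rule and all names are our assumptions, not the paper's algorithm): co-schedule the two processes whose bank footprints overlap most, so the remaining banks can stay in a low-power mode.

        def schedule_pairs(procs):
            # procs: {name: set of memory banks it touches}. Repeatedly
            # pick the pair of processes with the largest bank overlap,
            # so concurrently running processes keep fewer banks active.
            names = list(procs)
            pairs = []
            while len(names) >= 2:
                best = max(
                    ((a, b) for i, a in enumerate(names) for b in names[i + 1:]),
                    key=lambda ab: len(procs[ab[0]] & procs[ab[1]]),
                )
                pairs.append(best)
                names = [n for n in names if n not in best]
            return pairs + [(n,) for n in names]

        workload = {"p1": {0, 1}, "p2": {0, 1, 2}, "p3": {5}, "p4": {5, 6}}
        print(schedule_pairs(workload))   # [('p1', 'p2'), ('p3', 'p4')]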

  • An MAMS-PP4: Multi-Access Memory System Used to Improve the Processing Speed of Visual Media Applications in a Parallel Processing System

    Hyung LEE  Hyeon-Koo CHO  Dae-Sang YOU  Jong-Won PARK  

     
    PAPER-Concurrent Systems

      Vol:
    E87-A No:11
      Page(s):
    2852-2858

    To meet the computing demands of visual media processing, we have been investigating a parallel processing system that improves the processing speed of visual media applications from the viewpoint of the memory system within a single instruction multiple data (SIMD) computer. In this paper, we introduce MAMS-PP4, which resembles a pipelined SIMD architecture and consists of pq processing elements (PEs) together with a multi-access memory system (MAMS). MAMS supports simultaneous access to pq data elements within a horizontal (1×pq), vertical (pq×1), or block (p×q) subarray with a constant interval at an arbitrary position in an M×N array of data elements, where the number of memory modules, m, is a prime number greater than pq. MAMS reduces the memory access time of an SIMD computer and also reduces the cost and complexity involved in controlling the large volumes of data demanded by visual media applications. Each PE is designed as a two-state machine in order to utilize MAMS efficiently. MAMS-PP4 was fabricated as an ASIC using the TOSHIBA TC240C series library, and a test board was used to measure the ASIC's performance. The test board consists of an MPC860 embedded-PCI board, two ASICs, and an FPGA for the control units. Experiments were run on various computer systems to compare the performance of MAMS-PP4, using morphological operations as the application. MAMS-PP4 shows a respectable and consistent processing speed.
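
    The conflict-free property can be illustrated with a simple skewing function (Python; module(i, j) = (i*q + j) mod m is a textbook-style choice for illustration, not necessarily the exact MAMS mapping): with m prime and m > pq, horizontal, vertical, and p×q block accesses at a constant interval map to pairwise distinct modules.

        from itertools import product

        def module(i, j, q, m):
            # Skewing function: with m prime and m > p*q, rows, columns and
            # p x q blocks accessed at a constant interval d (1 <= d < m)
            # land in pairwise distinct memory modules.
            return (i * q + j) % m

        def distinct(cells, q, m):
            mods = [module(i, j, q, m) for i, j in cells]
            return len(set(mods)) == len(mods)

        p, q, m, d = 2, 3, 7, 2          # m = 7 is prime and > p*q = 6
        row   = [(5, 5 + k * d) for k in range(p * q)]             # 1 x pq
        col   = [(5 + k * d, 5) for k in range(p * q)]             # pq x 1
        block = [(5 + a * d, 5 + b * d) for a, b in product(range(p), range(q))]
        print(all(distinct(c, q, m) for c in (row, col, block)))   # True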

  • New High-Order Associative Memory System Based on Newton's Forward Interpolation

    Hiromitsu HAMA  Chunfeng XING  Zhongkan LIU  

     
    PAPER-Algorithms and Data Structures

      Vol:
    E81-A No:12
      Page(s):
    2688-2693

    A double-layer Associative Memory System (AMS) based on the Cerebellar Model Articulation Controller (CMAC), known as CMAC-AMS, has been used successfully in applications such as real-time intelligent control, signal processing, and pattern recognition, owing to its simple structure, fast search procedure, and strong mapping capability between multidimensional input/output vectors. However, it still suffers from a large memory requirement and relatively low precision. Furthermore, the hash code used in its addressing mechanism to reduce memory size can cause data collisions. In this paper, a new high-order Associative Memory System based on Newton's forward interpolation formula (NFI-AMS) is proposed. The NFI-AMS can approximate multivariable functions with high precision from arbitrarily given sampling data. A learning algorithm and a convergence theorem for the NFI-AMS are presented. The network structure and the learning algorithm show that the NFI-AMS has advantages over conventional CMAC-type AMSs in learning precision and in memory size, requiring much less memory while avoiding the data-collision problem, and advantages over multilayer back-propagation (BP) neural networks in computational effort for learning and in convergence rate. Numerical simulations verify these advantages. The proposed NFI-AMS therefore has potential in many application areas as a new kind of associative memory system.
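
    For reference, the sketch below evaluates Newton's forward-interpolation formula on equally spaced samples, the classical formula underlying the NFI-AMS (Python; a generic implementation for illustration, not the NFI-AMS learning scheme itself).

        def forward_differences(ys):
            # Build the forward-difference table and keep the leading
            # entry of each row, as used by Newton's formula.
            table = [list(ys)]
            while len(table[-1]) > 1:
                prev = table[-1]
                table.append([b - a for a, b in zip(prev, prev[1:])])
            return [row[0] for row in table]

        def newton_forward(x, x0, h, ys):
            # f(x) ~ sum_k C(s, k) * delta^k f(x0), with s = (x - x0) / h,
            # for equally spaced samples ys at x0, x0 + h, x0 + 2h, ...
            s = (x - x0) / h
            result, coeff = 0.0, 1.0
            for k, d in enumerate(forward_differences(ys)):
                result += coeff * d
                coeff *= (s - k) / (k + 1)
            return result

        # Interpolating x^2 from samples at 0, 1, 2, 3 reproduces it exactly.
        print(newton_forward(1.5, 0.0, 1.0, [0, 1, 4, 9]))   # 2.25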

  • Efficient Recovery from Communication Errors in Distributed Shared Memory Systems

    Jenn-Wei LIN  Sy-Yen KUO  

     
    PAPER-Fault Tolerant Computing

      Vol:
    E81-D No:11
      Page(s):
    1213-1223

    This paper investigates the problem of communication errors in distributed shared memory (DSM) systems. Communication errors can introduce two critical problems: damage and loss. The damage problem corrupts the transmitted data and thereby produces incorrect computational results. The loss problem causes the transmitted data to be lost in transit and never received. The loss problem, however, can be resolved easily using acknowledgements, so we focus on handling the damage problem efficiently. In DSM systems, the amount of data transferred between nodes is larger than the amount actually shared: when a processing node receives data, not all the data items in the received block will be used. Based on this property, we present a new technique for resolving the data damage problem in DSM systems. The technique allows a processing node that receives damaged data to continue computation without blocking to wait for the correct data, so the latency of handling the damage can be hidden. The technique relies on an optimistic assumption; when this assumption does not hold, the latency is not hidden. To show the advantage and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
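
    A minimal sketch of the optimistic idea (Python; per-item CRC checksums and all names are our own illustration, not the paper's protocol): on receiving a block, continue immediately if every item the node will actually use is intact, and stall only when a needed item is damaged.

        import zlib

        def receive_block(items, needed):
            # items: {addr: (payload, checksum)} for a whole transferred
            # block. Continue optimistically if every item the node will
            # actually use is intact; damage confined to unused items is
            # hidden and needs no retransmission before computation resumes.
            damaged = {a for a, (p, c) in items.items() if zlib.crc32(p) != c}
            if damaged & needed:
                return "stall", damaged & needed      # must wait for resend
            return "continue", damaged                # latency hidden

        block = {
            0x10: (b"used data", zlib.crc32(b"used data")),
            0x18: (b"corrupted", zlib.crc32(b"corrupted") ^ 1),  # damaged in transit
        }
        print(receive_block(block, needed={0x10}))    # ('continue', {24})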

  • High Bandwidth, Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E81-C No:9
      Page(s):
    1438-1447

    Merged DRAM/logic LSIs can provide high on-chip memory bandwidth by interconnecting the logic portion and DRAM with wide on-chip buses. For merged DRAM/logic LSIs whose memory hierarchy includes a cache, we can exploit this high on-chip memory bandwidth by replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size as we try to improve the attainable memory bandwidth. Larger cache lines, however, can worsen system performance if the programs running on the LSI lack sufficient spatial locality of reference and cache misses occur frequently. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called the variable line-size (VLS) cache, which resolves this dilemma. The VLS cache makes good use of the high on-chip memory bandwidth through large cache lines and, at the same time, alleviates the negative effects of a large line size by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved in a cache replacement can be chosen according to the characteristics of the program. This paper also evaluates the cost/performance improvements attainable with the VLS cache and compares them with those of conventional cache architectures. We observe that a VLS cache reduces the average memory-access time by 16.4% while increasing the hardware cost by only 13%, compared with a conventional direct-mapped cache with fixed 32-byte lines.
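
    The sub-line mechanism can be sketched as follows (Python; the set count, line and sub-line sizes, and the locality flag are illustrative assumptions, not the paper's replacement policy): each sub-line carries its own tag, and a replacement fills either one sub-line or the whole line.

        class VLSCache:
            # Direct-mapped cache whose 32-byte lines are split into four
            # 8-byte sub-lines with independent tags, so a replacement can
            # fill one sub-line or the whole line.
            def __init__(self, num_sets=64, sublines=4, subline_bytes=8):
                self.sets = [[None] * sublines for _ in range(num_sets)]
                self.sublines, self.sub_bytes = sublines, subline_bytes

            def access(self, addr, high_locality):
                line_bytes = self.sublines * self.sub_bytes
                tag, set_idx = divmod(addr // line_bytes, len(self.sets))
                sub = (addr % line_bytes) // self.sub_bytes
                if self.sets[set_idx][sub] == tag:
                    return "hit"
                if high_locality:                 # replace the whole line
                    self.sets[set_idx] = [tag] * self.sublines
                else:                             # replace one sub-line only
                    self.sets[set_idx][sub] = tag
                return "miss"

        c = VLSCache()
        print(c.access(0x1000, high_locality=False))   # miss
        print(c.access(0x1008, high_locality=False))   # miss: neighbor not filled
        c2 = VLSCache()
        c2.access(0x1000, high_locality=True)
        print(c2.access(0x1008, high_locality=True))   # hit: whole line fetched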

  • Data Distribution and Alignment Scheme for Conflict-Free Memory Access in Parallel Image Processing System

    Gil-Yoon KIM  Yunju BAEK  Heung-Kyu LEE  

     
    PAPER-Computer Hardware and Design

      Vol:
    E81-D No:8
      Page(s):
    806-812

    In this paper, we present a solution to the problem of conflict-free access to various slices of data in a parallel processor for image processing. Image processing operations require a memory system that permits parallel, conflict-free access to rows, columns, forward diagonals, backward diagonals, and blocks of a two-dimensional image array at arbitrary locations. Linear skewing schemes can meet these requirements, but they require complex Euclidean division by a prime number. Nonlinear skewing schemes such as XOR-schemes have advantages over linear ones in address generation, but existing schemes allow conflict-free access to some array slices only within a restricted region. We propose a new XOR-scheme that allows conflict-free access to arbitrarily located slices of data for image processing, using twice as many memory modules as processing elements. Further, we propose an efficient data alignment network consisting of a (log N + 2)-stage multistage interconnection network built from Omega networks.
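
    The basic XOR-scheme idea can be demonstrated in a few lines (Python): with module(i, j) = i XOR j, every row and every column of an N×N array maps to pairwise distinct modules. The paper's contribution, covering diagonals and blocks at arbitrary positions by doubling the number of modules, goes beyond this minimal sketch.

        def module(i, j):
            # Basic XOR-scheme: memory module holding element (i, j).
            # For fixed i (a row) or fixed j (a column), i ^ j is a
            # bijection, so rows and columns are conflict-free.
            return i ^ j

        N = 8
        rows_ok = all(len({module(i, j) for j in range(N)}) == N for i in range(N))
        cols_ok = all(len({module(i, j) for i in range(N)}) == N for j in range(N))
        print(rows_ok, cols_ok)        # True True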

  • Integrated Switching Architecture and Its Traffic Handling Capacity in Data Communication Networks

    Noriharu MIYAHO  Akira MIURA  

     
    PAPER-Communication Systems and Transmission Equipment

      Vol:
    E79-B No:12
      Page(s):
    1887-1899

    This paper discusses an integrated switching system architecture in which PS, CS, and ATM switching functions are integrated based on a hierarchical memory system concept. A packet buffering control mechanism and a practical random time-slot assignment mechanism for CS traffic, which consists of data traffic at multiple bearer rates, are then described. The feasibility of the random time-slot assignment mechanism is confirmed with a practical experimental system using VLSI technology, in particular content addressable memory (CAM) technology. The queuing delay required between nodes for the corresponding call set-up procedure is also shown, and its application is clarified. For practical digital networks that provide various types of data communications, including voice, data, and video services, it is highly desirable to evaluate the transmission efficiency of integrating packet-switching (PS) non-real-time traffic and circuit-switching (CS) real-time traffic. Improved transmission-line utilization can be expected when random time-slot assignment and a movable-boundary scheme on a TDM (time division multiplexing) data frame are adopted. The corresponding control procedure, performed via signaling between switching nodes, is also examined.
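
    The movable-boundary idea can be sketched as follows (Python; a generic illustration of the technique, not this paper's control procedure): CS calls take slots up to a boundary that moves with CS load, and PS packets fill whatever slots remain.

        def allocate_frame(num_slots, cs_calls, ps_packets, cs_boundary):
            # Movable-boundary TDM frame: circuit-switched (CS) calls take
            # slots from the front, up to a boundary that tracks CS load;
            # packet-switched (PS) traffic fills every slot left over,
            # improving transmission-line utilization.
            frame = ["idle"] * num_slots
            cs_used = min(len(cs_calls), cs_boundary)
            for k in range(cs_used):
                frame[k] = f"CS:{cs_calls[k]}"
            free = list(range(cs_used, num_slots))
            for slot, pkt in zip(free, ps_packets):
                frame[slot] = f"PS:{pkt}"
            return frame

        print(allocate_frame(8, ["v1", "v2"], ["d1", "d2", "d3"], cs_boundary=4))
        # ['CS:v1', 'CS:v2', 'PS:d1', 'PS:d2', 'PS:d3', 'idle', 'idle', 'idle']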