The search functionality is under construction.

Keyword Search Result

[Keyword] memory architecture(11hit)

1-11hit
  • A Metadata Prefetching Mechanism for Hybrid Memory Architectures Open Access

    Shunsuke TSUKADA  Hikaru TAKAYASHIKI  Masayuki SATO  Kazuhiko KOMATSU  Hiroaki KOBAYASHI  

     
    PAPER

      Pubricized:
    2021/12/03
      Vol:
    E105-C No:6
      Page(s):
    232-243

    A hybrid memory architecture (HMA) that consists of some distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, the HMA needs the metadata for data management since the data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for the metadata reference. However, as the amount of the metadata increases in proportion to the size of the HMA, the memory controller needs to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata and increases the number of metadata references. This results in an increase in the access latency to reach the target data and degrades the performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effect of the metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference that is the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve the instructions per cycle by up to 44% and 9% on average.

  • Packet Processing Architecture with Off-Chip Last Level Cache Using Interleaved 3D-Stacked DRAM Open Access

    Tomohiro KORIKAWA  Akio KAWABATA  Fujun HE  Eiji OKI  

     
    PAPER-Network System

      Pubricized:
    2020/08/06
      Vol:
    E104-B No:2
      Page(s):
    149-157

    The performance of packet processing applications is dependent on the memory access speed of network systems. Table lookup requires fast memory access and is one of the most common processes in various packet processing applications, which can be a dominant performance bottleneck. Therefore, in Network Function Virtualization (NFV)-aware environments, on-chip fast cache memories of a CPU of general-purpose hardware become critical to achieve high performance packet processing speeds of over tens of Gbps. Also, multiple types of applications and complex applications are executed in the same system simultaneously in carrier network systems, which require adequate cache memory capacities as well. In this paper, we propose a packet processing architecture that utilizes interleaved 3 Dimensional (3D)-stacked Dynamic Random Access Memory (DRAM) devices as off-chip Last Level Cache (LLC) in addition to several levels of dedicated cache memories of each CPU core. Entries of a lookup table are distributed in every bank and vault to utilize both bank interleaving and vault-level memory parallelism. Frequently accessed entries in 3D-stacked DRAM are also cached in on-chip dedicated cache memories of each CPU core. The evaluation results show that the proposed architecture reduces the memory access latency by 57%, and increases the throughput by 100% while reducing the blocking probability but about 10% compared to the architecture with shared on-chip LLC. These results indicate that 3D-stacked DRAM can be practical as off-chip LLC in parallel packet processing systems.

  • A Cost-Effective 1T-4MTJ Embedded MRAM Architecture with Voltage Offset Self-Reference Sensing Scheme for IoT Applications

    Masanori HAYASHIKOSHI  Hiroaki TANIZAKI  Yasumitsu MURAI  Takaharu TSUJI  Kiyoshi KAWABATA  Koji NII  Hideyuki NODA  Hiroyuki KONDO  Yoshio MATSUDA  Hideto HIDAKA  

     
    PAPER

      Vol:
    E102-C No:4
      Page(s):
    287-295

    A 1-Transistor 4-Magnetic Tunnel Junction (1T-4MTJ) memory cell has been proposed for field type of Magnetic Random Access Memory (MRAM). Proposed 1T-4MTJ memory cell array is achieved 44% higher density than that of conventional 1T-1MTJ thanks to the common access transistor structure in a 4-bit memory cell. A self-reference sensing scheme which can read out with write-back in four clock cycles has been also proposed. Furthermore, we add to estimate with considering sense amplifier variation and show 1T-4MTJ cell configuration is the best solution in IoT applications. A 1-Mbit MRAM test chip is designed and fabricated successfully using 130-nm CMOS process. By applying 1T-4MTJ high density cell and partially embedded wordline driver peripheral into the cell array, the 1-Mbit macro size is 4.04 mm2 which is 35.7% smaller than the conventional one. Measured data shows that the read access is 55 ns at 1.5 V typical supply voltage and 25C. Combining with conventional high-speed 1T-1MTJ caches and proposed high-density 1T-4MTJ user memories is an effective on-chip hierarchical non-volatile memory solution, being implemented for low-power MCUs and SoCs of IoT applications.

  • Multiple-Valued Fine-Grain Reconfigurable VLSI Using a Global Tree Local X-Net Network

    Xu BAI  Michitaka KAMEYAMA  

     
    PAPER-VLSI Architecture

      Vol:
    E97-D No:9
      Page(s):
    2278-2285

    A global tree local X-net network (GTLX) is introduced to realize high-performance data transfer in a multiple-valued fine-grain reconfigurable VLSI (MVFG-RVLSI). A global pipelined tree network is utilized to realize high-performance long-distance bit-parallel data transfer. Moreover, a logic-in-memory architecture is employed for solving data transfer bottleneck between a block data memory and a cell. A local X-net network is utilized to realize simple interconnections and compact switch blocks for eight-near neighborhood data transfer. Moreover, multiple-valued signaling is utilized to improve the utilization of the X-net network, where two binary data can be transferred from two adjacent cells to one common adjacent cell simultaneously at each “X” intersection. To evaluate the MVFG-RVLSI, a fast Fourier transform (FFT) operation is mapped onto a previous MVFG-RVLSI using only the X-net network and the MVFG-RVLSI using the GTLX. As a result, the computation time, the power consumption and the transistor count of the MVFG-RVLSI using the GTLX are reduced by 25%, 36% and 56%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network.

  • Impact of Floating Body Type DRAM with the Vertical MOSFET

    Yuto NORIFUSA  Tetsuo ENDOH  

     
    PAPER

      Vol:
    E94-C No:5
      Page(s):
    705-711

    Several kinds of capacitor-less DRAM cells based on planar SOI-MOSFET technology have been proposed and researched to overcome the integration limit of the conventional DRAM. In this paper, we propose the Floating Body type DRAM cell array architecture with the Vertical MOSFET and discuss its basic operation using a 3-D device simulator. In contrast to previous planar SOI-MOSFET technology, the Floating Body type DRAM with the Vertical MOSFET achieves a cell area of 4F2 and obtain its floating body cell by isolating the body from the substrate vertically by the bottom-electrode. Therefore, the necessity for a SOI substrate is eliminated. In this paper, the cell array architecture of Floating Body type 1T-DRAM is proposed, and furthermore, the basic memory operations of read, write, and erase for Vertical type 1 transistor (1T) DRAM in the 45 nm technology node are shown. In addition, the retention and disturb characteristics of the Vertical type 1T-DRAM are discussed.

  • Minimizing the Directory Size for Large-Scale Shared-Memory Multiprocessors

    Jinseok KONG  Pen-Chung YEW  Gyungho LEE  

     
    PAPER-Computer Systems

      Vol:
    E88-D No:11
      Page(s):
    2533-2543

    Directory-based cache coherence schemes are commonly used in large-scale shared-memory multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We proposed using physical address mapping on directories to significantly reduce directory size needed. This approach allows the size of directory to grow as O(cn log2 n) as in optimal pointer-based directory schemes [11], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Performance aspects of the proposed scheme are studied in detail using simulation.

  • Architecture of a Stereo Matching VLSI Processor Based on Hierarchically Parallel Memory Access

    Masanori HARIYAMA  Haruka SASAKI  Michitaka KAMEYAMA  

     
    PAPER-Digital Circuits and Computer Arithmetic

      Vol:
    E88-D No:7
      Page(s):
    1486-1491

    This paper presents a VLSI processor for high-speed and reliable stereo matching based on adaptive window-size control of SAD(Sum of Absolute Differences) computation. To reduce its computational complexity, SADs are computed using multi-resolution images. Parallel memory access is essential for highly parallel image processing. For parallel memory access, this paper also presents an optimal memory allocation that minimizes the hardware amount under the condition of parallel memory access at specified resolutions.

  • A 100 MHz 7.84 mm2 31.7 msec 439 mW 512-Point 2-Dimensional FFT Single-Chip Processor

    Naoto MIYAMOTO  Leo KARNAN  Kazuyuki MARUO  Koji KOTANI  Tadahiro OHMI  

     
    PAPER

      Vol:
    E87-C No:4
      Page(s):
    502-509

    A single-chip 512-point FFT processor is presented. This processor is based on the cached-memory architecture (CMA) with the resource-saving multi-datapath radix-23 computation element. The 2-stage CMA, including a pair of single-port SRAMs, is also introduced to speedup the execution time of the 2-dimensional FFTs. Using the above techniques, we have designed an FFT processor core which integrates 552,000 transistors within an area of 2.82.8 mm2 with CMOS 0.35 µm triple-layer-metal process. This processor can execute a 512-point, 36-bit-complex fixed-point data format, 1-dimensonal FFT in 23.2 µsec and a 2-dimensional one in only 23.8 msec at 133 MHz operation. The power consumption of this processor is 439.6 mW at 3.3 V, 100 MHz operation.

  • Vienna Fortran and the Path Towards a Standard Parallel Language

    Barbara M. CHAPMAN  Piyush MEHROTRA  Hans P. ZIMA  

     
    INVITED PAPER

      Vol:
    E80-D No:4
      Page(s):
    409-416

    Highly parallel scalable multiprocessing systems (HMPs) are powerful tools for solving large-scale scientific and engineering problems. However, these machines are difficult to program since algorithms must exploit locality in order to achieve high performance. Vienna Fortran was the first fully specified data-parallel language for HMPs that provided features for the specification of data distribution and alignment at a high level of abstraction. In this paper we outline the major elements of Vienna Fortran and compare it to High Performance Fortran (HPF), a de-facto standard in this area. A significant weakness of HPF is its lack of support for many advanced applications, which require irregular data distributions and dynamic load balancing. We introduce HPF +, an extension of HPF based on Vienna Fortran, that provides the required functionality.

  • Intelligent Memory: An Architecture for Lock-Free Synchronization

    Nakun SEONG  Naihoon JUNG  Byungho KIM  Hyunsoo YOON  

     
    PAPER

      Vol:
    E80-D No:4
      Page(s):
    441-447

    This paper presents intelligent memory, a new memory architecture capable of providing efficient lock-free synchronization. In the intelligent memory, a sequence of operations on a shared object associated with that memory module can be processed without any intervention so that an environment for the synchronization can be provided by executing a critical section itself in that memory module. For this, we present a memory architecture for the intelligent memory having minimal instruction set and develop a progtramming model, called Critical Section Procedure (CSP), which consists of shared data structures and operations on them. Intelligent memory is intended to eliminate waste of processing time such as busy waiting in spin lock and the retry due to process contentions in existing lock-free synchronization schemes. Simulation results show that the intelligent memory provides better throughput compared with the spin lock and the existing lock-free synchronization schemes.

  • Stored/Forward Network Architecture for Multimedia Subscriber--ATM Mini-Bar System and Its Memory Architecture--

    Hideyoshi TOMINAGA  Yasuharu KOSUGE  Norio ITO  Naohisa KOMATSU  Dongwhee KIM  

     
    PAPER-Communication Networks and Service

      Vol:
    E78-B No:4
      Page(s):
    580-590

    In this paper, the ATM Mini-Bar System (AMBS) which is a future information providing service infrastructure is proposed. The purpose of AMBS is to provide a multi-media environment in which a user can (1) select and get quickly any needed information, in low cost, at any time, among very large amount of different media information provided by a variety of providers, (2) be charged only for the information which is selected and used, (3) edit or process informations into users' individually requested style or format before using them. The basic concept and configurations of AMBS are also addressed. This system is basically a center-end oriented one-way information providing system. The information center broadcasts its contents to all user equipments based on a user request forecast, and every user equipment stores the delivered contents in its large storage. A user can select one's needed informations from the storage, and may edit or process them within the user equipment. The charge is only on the read informations from the storage, not on all contents in it. The key points of this system are the following three. (A) Introduction of a broadcast (or multicast) media for economical information delivery (exactly speaking, it is a predelivery which means a delivery before request) to user equipments. (B) Introduction of a 1 to 1 communication network for selective charging and control of each user equipments. (C) Introduction of the user equipment storage for Quick response to user information request in most cases with the broadcast (or multicast) information delivery media described above, Separation of information delivery speed and replay speed to increase system flexibility, Local user information processing or editing. As an example of technical solutions, a memory architecture, which is based on hierarchical architecture, is described. AMBS is expected to give some impacts to information industries because it can integrate many kinds of services into the same platform, but some standerdization items are needed to realize it.