
Keyword Search Result

[Keyword] cache (201 hits)

Showing 181-200 of 201 hits

  • The RDT Router Chip: A Versatile Router for Supporting a Distributed Shared Memory

    Hiroaki NISHI  Ken-ichiro ANJO  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER-Interconnection Networks   Vol: E80-D No:9   Page(s): 854-862

    JUMP-1 is currently under development by seven Japanese universities to establish techniques for building an efficient distributed shared memory on a massively parallel processor. It provides a coherent cache with a reduced hierarchical bit-map directory scheme to achieve cost-effective, high-performance management. Messages for cache coherence are transferred through a fat tree on the RDT (Recursive Diagonal Torus) interconnection network. The RDT router supports versatile functions, including multicast and acknowledgment combining, for the reduced hierarchical bit-map directory scheme. Using 0.5-µm BiCMOS SOG technology, it can transfer all packets synchronized with a single CPU clock (50 MHz). Long coaxial cables (4 m at maximum) are directly driven by the ECL interface of this chip. Using dual-port RAM, the packet buffers allow a flit of a packet to be pushed and pulled simultaneously.
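    A reduced bit-map directory of this kind can be pictured in C: each directory bit covers a group of processors rather than one processor, so the directory stays small but multicasts may reach some non-sharers. The sizes and names below are illustrative assumptions, not the JUMP-1 encoding.

```c
#include <stdint.h>

#define NPROCS     64   /* assumed machine size                      */
#define GROUP_SIZE  8   /* assumed: one directory bit per 8 processors */

/* Record a sharer: set the bit of the group containing proc. */
static inline uint8_t dir_add(uint8_t dir, int proc) {
    return dir | (uint8_t)(1u << (proc / GROUP_SIZE));
}

/* Expand the reduced directory into a multicast target list.
 * Every processor in a marked group is targeted, so some
 * non-sharers may receive (and ignore) coherence messages:
 * the price paid for the compact representation. */
static int dir_targets(uint8_t dir, int targets[NPROCS]) {
    int n = 0;
    for (int p = 0; p < NPROCS; p++)
        if (dir & (1u << (p / GROUP_SIZE)))
            targets[n++] = p;
    return n;
}
```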

  • Adsmith: An Object-Based Distributed Shared Memory System for Networks of Workstations

    Wen-Yew LIANG  Chung-Ta KING  Feipei LAI  

     
    PAPER-Computer Architecture   Vol: E80-D No:9   Page(s): 899-908

    This paper introduces an object-based distributed shared memory (DSM) system called Adsmith. The primary goal of Adsmith is to provide a low-cost, portable, and efficient DSM for networks of workstations (NOW). Adsmith achieves this goal by building on top of PVM, a widely supported communication subsystem, as a user-level library, and by incorporating many traffic-reduction and latency-hiding techniques. Issues involved in the design of Adsmith and our solution strategies are discussed. A preliminary performance evaluation of Adsmith on a network of Pentium computers is presented. The results show that programs developed with Adsmith can achieve performance comparable to that of programs developed directly with PVM.

  • MINC: Multistage Interconnection Network with Cache Control Mechanism

    Toshihiro HANAWA  Takayuki KAMEI  Hideki YASUKAWA  Katsunobu NISHIMURA  Hideharu AMANO  

     
    PAPER-Interconnection Networks   Vol: E80-D No:9   Page(s): 863-870

    A novel approach to the cache-coherent Multistage Interconnection Network (MIN), called the MINC (MIN with Cache control mechanism), is proposed. In the MINC, the directory is located only on the shared memory, using the Reduced Hierarchical Bit-map Directory (RHBD) schemes. In the RHBD, the bit-map directory is reduced and carried in the packet header for quick multicasting without accessing a directory at each level of the hierarchy. To reduce the unnecessary packets caused by compacting the bit map in the RHBD, a small cache called the pruning cache is introduced in the switching element. Simulation reveals that the pruning cache works most effectively when provided in every switching element of the first stage, where it reduces congestion by more than 50% with only 4 entries. The MINC cache control chip with 16 inputs/outputs has been implemented on an LPGA (Laser Programmable Gate Array) and works with a 66-MHz clock.
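    The pruning cache can be pictured as a tiny table inside a switching element that remembers, per memory line, which output ports actually lead to sharers; a multicast implied by the coarse header bit map is then filtered against it. A minimal sketch under assumed sizes (4 entries, up to 8 output ports), not the MINC gate-level design:

```c
#include <stdint.h>

#define PC_ENTRIES 4   /* assumed: matches the 4-entry result above */

struct pruning_entry {
    uint32_t tag;      /* memory line address                        */
    uint8_t  ports;    /* bit per output port that really has sharers */
    int      valid;
};

static struct pruning_entry pcache[PC_ENTRIES];

/* Return the ports a coherence multicast should actually use.
 * On a hit, prune the header's coarse port mask; on a miss,
 * conservatively forward to every port the header names. */
uint8_t prune(uint32_t line, uint8_t header_ports) {
    for (int i = 0; i < PC_ENTRIES; i++)
        if (pcache[i].valid && pcache[i].tag == line)
            return header_ports & pcache[i].ports;
    return header_ports;
}
```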

  • SEWD: A Cache Architecture to Speed up the Misaligned Instruction Prefetch

    Joon-Seo YIM  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design   Vol: E80-D No:7   Page(s): 742-745

    In microprocessors, reducing the cache access delay and the number of pipeline stalls is critical to improving system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by misaligned multi-word data or instruction prefetches that span two cache lines. The SEWD cache makes it possible to perform misaligned as well as aligned prefetches in one clock cycle. This feature is invaluable because branch target addresses are very often misaligned (the misalignment rate is 8 to 13% for 16-byte cache lines). An 8-Kbyte SEWD cache chip was implemented in a 0.8-µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.853 × 0.827 cm².
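    The stall that SEWD removes arises whenever a multi-word fetch crosses a line boundary, and detecting that case is a one-line computation. A sketch with assumed sizes (16-byte lines, 8-byte fetches):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE   16u  /* bytes per cache line (assumed)        */
#define FETCH_WIDTH  8u  /* bytes per instruction fetch (assumed) */

/* A fetch is misaligned (spans two lines) when the bytes it needs
 * run past the end of the line it starts in.  A conventional cache
 * must then make two sequential accesses; SEWD's separated word-line
 * decoding serves both halves in a single cycle. */
bool crosses_line(uint32_t addr) {
    return (addr % LINE_SIZE) + FETCH_WIDTH > LINE_SIZE;
}
```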

  • Deferred Locking with Buffer Validation on Demand for Client-Server Database Consistency: DL

    Hyeokmin KWON  Songchun MOON  

     
    PAPER-Databases   Vol: E80-D No:7   Page(s): 705-716

    In client-server database management systems (DBMSs), inter-transaction caching is an effective technique for improving performance. However, inter-transaction caching requires a cache consistency maintenance (CCM) protocol to ensure that cached copies at clients are kept mutually consistent. Such a protocol can be complex to implement and expensive to run, since several rounds of message exchange may be required. In this paper, we propose a new CCM scheme based on the primary-copy locking algorithm. In the proposed scheme, a number of lock requests and a data-shipping request are combined into a single message packet to reduce client-server interactions, which are known to be critical to the performance of client-server DBMSs. We examine its performance tradeoffs on the basis of a simulation model under a wide range of workloads. The performance results indicate that the proposed scheme improves overall system throughput significantly over the caching two-phase locking and optimistic two-phase locking schemes. Its higher performance results mainly from its lower communication overhead and lower transaction blocking ratio.
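    The message-combining idea can be shown as a single request packet that piggybacks the batch of deferred lock requests on the data-shipping request, so one round trip replaces several. The field names and sizes below are illustrative assumptions, not the paper's wire format:

```c
#include <stdint.h>

#define MAX_LOCKS 16   /* assumed per-message lock batch */

/* One client-to-server packet carrying both the deferred lock
 * requests and the page fetch, instead of one message per lock.
 * The server's single reply would grant the locks and ship the
 * page together. */
struct combined_request {
    uint32_t xact_id;                         /* requesting transaction */
    uint32_t page_id;                         /* page to ship back      */
    uint16_t nlocks;                          /* locks deferred so far  */
    struct { uint32_t oid; uint8_t mode; } locks[MAX_LOCKS];
};
```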

  • A 167-MHz 1-Mbit CMOS Synchronous Cache SRAM

    Hideharu YAHATA  Yoji NISHIO  Kunihiro KOMIYAJI  Hiroshi TOYOSHIMA  Atsushi HIRAISHI  Yoshitaka KINOSHITA  

     
    PAPER   Vol: E80-C No:4   Page(s): 557-565

    A 167-MHz 1-Mbit CMOS synchronous cache SRAM was developed using 0.40-µm process technology. The floor plan was designed so that the address registers are located in the center of the chip, and high-speed circuits such as the quasi-latch (QL) sense amplifier and the one-shot control (OSC) output register were developed. To maintain suitable setup- and hold-time margins, an equivalent margin (EM) design method was developed. 167-MHz operation was measured at a supply voltage of 2.5 V and an ambient temperature of 75°C. Equal setup- and hold-time margins of 1.1 ns were measured against specifications of a 2.0-ns setup time and a 0.5-ns hold time.

  • Performance Evaluation of VEEC: The Virtual Execution Environment Control for a Remote Knowledge Base Access

    Yoshitaka FUJIWARA  Shin-ichiro OKADA  Hiroyuki TAKADOI  Toshiharu MATSUNISHI  Hiroshi OHKAMA  

     
    PAPER-Protocol   Vol: E80-B No:1   Page(s): 81-86

    In a conventional client-server system using satellite communications, the responsiveness of the system to the client user is considerably degraded by the long transmission time between the satellite and the ground terminal, as well as by the relatively low data transmission rate compared with ground transmission lines such as Ethernet. In this paper, a new client-server control, VEEC, is proposed to solve this problem. Experimental performance studies clarify that responsiveness at the client is remarkably improved when the pre-fetching mechanism of VEEC works efficiently.

  • Address Addition and Decoding without Carry Propagation

    Yung-Hei LEE  Seung Ho HWANG  

     
    LETTER-Algorithm and Computational Complexity   Vol: E80-D No:1   Page(s): 98-100

    The response time of adders is mainly determined by the carry propagation delay, and memory latency is one of the most significant performance-limiting factors. This letter deals with a scheme that combines address addition and decoding: although addition is involved in the process, we show that it can be computed without carry propagation. The authors present a new decoder logic named the fused add-decoder (FADEC), which performs address addition and decoding in a single process. FADEC can reduce memory latency by eliminating the separate address-addition cycle.
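    The carry-free property rests on a local identity: base + offset equals a row address k exactly when, at every bit, the carry that equality would require matches the carry the bit below would generate, so each wordline can check its own k with two-bit-wide logic and no carry chain. The following is a behavioral sketch of this sum-addressed decode idea, my reconstruction rather than the letter's gate-level design:

```c
#include <stdbool.h>
#include <stdint.h>

/* True iff a + b == k (mod 2^32), tested without propagating
 * carries.  With p = a^b and g = a&b, the equality forces the
 * carry into bit i to be p_i ^ k_i; the carry out of bit i is
 * then g_i | (p_i & ~k_i).  Each bit checks only its neighbour,
 * so all conditions are evaluated in parallel. */
bool fused_match(uint32_t a, uint32_t b, uint32_t k) {
    uint32_t p    = a ^ b;
    uint32_t g    = a & b;
    uint32_t need = p ^ k;          /* carry-in each bit needs   */
    uint32_t make = g | (p & ~k);   /* carry-out each bit makes  */
    return need == (make << 1);     /* bit 0 needs carry-in 0    */
}
```

    For example, fused_match(3, 1, 4) holds because 3 + 1 = 4, while fused_match(3, 1, 5) fails, and no carry ever ripples across the word.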

  • An 8-mW, 8-kB Cache Memory Using an Automatic-Power-Save Architecture for Low Power RISC Microprocessors

    Yasuhisa SHIMAZAKI  Katsuhiro NORISUE  Koichiro ISHIBASHI  Hideo MAEJIMA  

     
    PAPER   Vol: E79-C No:12   Page(s): 1693-1698

    An embedded cache memory for low-power RISC microprocessors is described. An automatic-power-save architecture (APSA) enables the cache memory to operate at high speed at high frequencies and with low power dissipation at low frequencies. A pulsed word technique (PWT) and an isolated bit-line technique (IBLT) effectively reduce the power dissipation of the cache memory. Using these three techniques, the power dissipation of the cache memory is reduced to almost 60% of that of a conventional cache memory at 60 MHz, and to 20% at a clock frequency of 10 MHz. An 8-KByte test chip using 0.5-µm CMOS technology was fabricated; it achieves 80-MHz operation at a supply voltage of 3.1 V, and 8-mW operation at 10 MHz with a supply voltage of 2.5 V.

  • A Virtual Cache Architecture for Retaining the Process Working Sets in a Multiprogramming Environment

    Dongwook KIM  Joonwon LEE  

     
    PAPER-Computer Hardware and Design   Vol: E79-D No:12   Page(s): 1637-1645

    A direct-mapped cache takes less time to access data than a set-associative cache because no time is needed to select a cache line within a set. The hit ratio of a direct-mapped cache, however, is lower due to the conflict misses caused by mapping multiple addresses to the same cache line. Addressing cache memory by virtual addresses reduces the cache access time by eliminating the time needed for address translation. The synonym problem in a virtual cache necessitates an additional field in the cache tag to denote the process to which a cache line belongs. In this paper, we propose a new virtual cache architecture whose average access time is almost the same as that of a direct-mapped cache while its hit ratio matches that of a set-associative cache. A victim for cache replacement is selected from the lines belonging to the process that is furthest from being scheduled again. The entire cache memory is divided into n banks, and each process is assigned to a bank. Each process then runs on its assigned bank, and the cache behaves like a direct-mapped cache. Trace-driven simulations confirm that the new scheme removes almost as many conflict misses as a set-associative cache does, while its access time remains similar to that of a direct-mapped cache.
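    The bank scheme keeps direct-mapped access time because the bank number assigned to the running process simply becomes the high part of the cache index, so a lookup is still a single indexed read with no way selection. A sketch with assumed sizes (4 banks of 256 lines, 32-byte lines), not the paper's exact organization:

```c
#include <stdint.h>

#define NBANKS         4u    /* assumed number of banks   */
#define LINES_PER_BANK 256u  /* assumed lines per bank    */
#define LINE_BITS      5u    /* 32-byte lines (assumed)   */

/* Each process is assigned one bank; within it the cache is
 * plain direct-mapped, so there is no set search and no
 * way-selection delay on the access path. */
uint32_t cache_index(uint32_t vaddr, uint32_t bank_of_process) {
    uint32_t line = (vaddr >> LINE_BITS) % LINES_PER_BANK;
    return (bank_of_process % NBANKS) * LINES_PER_BANK + line;
}
```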

  • Hiding Data Cache Latency with Load Address Prediction

    Toshinori SATO  Hiroshige FUJII  Seigo SUZUKI  

     
    PAPER-Computer Systems   Vol: E79-D No:11   Page(s): 1523-1532

    A new method for predicting effective addresses is presented. The method works with a buffer named the address prediction buffer and allows the data cache to be accessed speculatively. As a consequence of the trend toward increasing clock frequency, the internal cache is no longer able to fill the speed gap between the processor and external memory, and data cache latency degrades processor performance. The proposed method hides this latency: the load address is predicted, and the data is fetched earlier than the memory access stage. When the prediction is correct, the latency is hidden; even when it is incorrect, performance is not degraded by any miss penalty. We have found that the prediction accuracy is 81.9% on average, and thus performance is improved by 6.6% on average, and by a maximum of 12.1%, for the integer programs.
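    A common way to realize such an address prediction buffer is a PC-indexed table holding each load's last address and stride; the predicted address lets the cache be probed before the effective address is computed. The structure below is a generic stride predictor offered as an assumed illustration, not necessarily the paper's exact buffer:

```c
#include <stdint.h>

#define APB_ENTRIES 256  /* assumed table size */

struct apb_entry { uint32_t last_addr; int32_t stride; };
static struct apb_entry apb[APB_ENTRIES];

/* Predict a load's next effective address from its PC so the
 * cache access can start speculatively; on a wrong guess the
 * normal access simply proceeds, adding no miss penalty. */
uint32_t apb_predict(uint32_t pc) {
    struct apb_entry *e = &apb[(pc >> 2) % APB_ENTRIES];
    return e->last_addr + (uint32_t)e->stride;
}

/* Once the real address is known, train the entry. */
void apb_update(uint32_t pc, uint32_t addr) {
    struct apb_entry *e = &apb[(pc >> 2) % APB_ENTRIES];
    e->stride    = (int32_t)(addr - e->last_addr);
    e->last_addr = addr;
}
```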

  • (Mπ)²: A Hierarchical Parallel Processing System for the Multipass Rendering Method

    Hiroaki KOBAYASHI  Hitoshi YAMAUCHI  Yuichiro TOH  Tadao NAKAMURA  

     
    PAPER-Architectures   Vol: E79-D No:8   Page(s): 1055-1064

    This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method, based on the integration of radiosity and ray tracing, can synthesize photo-realistic images, but it is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)², employs two kinds of parallel processing: for coarse-grain parallelism, object-space parallel processing with multiple processing elements based on object-space subdivision is adopted, and for fine-grain parallelism, each processing element (PE) is equipped with multiple pipelined units. To balance load across the system, static load balancing at the PE level and dynamic load balancing at the pipelined-unit level within each PE are introduced. In particular, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space over a one- or two-dimensional processor configuration of the (Mπ)² system, as sketched below. Simulation experiments show that two-dimensional (Mπ)² systems with skewed-distributed allocation outperform three-dimensional systems with non-skewed distributed allocation. Since lower-dimensional systems can be built at a lower cost than higher-dimensional ones, skewed-distributed allocation is advantageous. Moreover, combining static load balancing by skewed-distributed allocation with dynamic load balancing by dynamic ray allocation within each PE boosts system performance further. We also propose a cached frame buffer system to relieve access collisions on the frame buffer.
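    One plausible reading of a skewed-distributed allocation is to shift each layer of the 3-D subvolume grid before folding it onto the 2-D processor array, so that a ray marching along any axis touches many different PEs instead of loading one row. This is a guess at the general idea, not the paper's exact mapping:

```c
/* Map subvolume (x, y, z) of an N*N*N grid onto a P*P processor
 * array.  The z-dependent shift "skews" successive layers so a
 * ray traversing the volume along one axis spreads its work over
 * many PEs rather than concentrating it on a single row. */
void skewed_pe(int x, int y, int z, int P, int *pe_x, int *pe_y) {
    *pe_x = (x + z) % P;
    *pe_y = (y + z) % P;
}
```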

  • Software Cache Techniques for Memory Nodes in Distributed Memory Parallel Production Systems

    Jun MIYAZAKI   Haruo YOKOTA  

     
    PAPER-Architectures   Vol: E79-D No:8   Page(s): 1046-1054

    Because the match phase in OPS5-type production systems accounts for most of the system's execution time and memory accesses, we previously proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm, for distributed-memory parallel computers (multicomputers), to reduce this bottleneck. CPPS was effective in speeding up the match phase, but still left room for optimization. In this paper, we introduce software cache techniques for the memory nodes in CPPS as one such optimization, and implement them on a multicomputer, the nCUBE2. The benchmark results show that CPPS with the software cache is about 2-fold faster than the original, and more than 7-fold faster than the simple hash method proposed by Acharya et al., for a large-scale problem. The speed-up can be attributed to decreased communication costs.
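    At a memory node, a software cache of this kind amounts to a local table consulted before paying the communication cost of a remote fetch. A minimal sketch under assumed types and sizes, not the nCUBE2 implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define SC_ENTRIES 1024  /* assumed software-cache size */

struct sc_entry { uint64_t key; void *data; int valid; };
static struct sc_entry sc[SC_ENTRIES];

/* Look up a working-memory item locally; only on a miss does the
 * node fall back to inter-node communication. */
void *sc_lookup(uint64_t key) {
    struct sc_entry *e = &sc[key % SC_ENTRIES];
    return (e->valid && e->key == key) ? e->data : NULL;
}

/* Fill the direct-mapped slot after a remote fetch. */
void sc_insert(uint64_t key, void *data) {
    struct sc_entry *e = &sc[key % SC_ENTRIES];
    e->key = key; e->data = data; e->valid = 1;
}
```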

  • Analytic Modeling of Cache Coherence Based Parallel Computers

    Kazuki JOE  Akira FUKUDA  

     
    PAPER-Computer Systems   Vol: E79-D No:7   Page(s): 925-935

    In this paper, we propose an analytic model using a semi-Markov process for parallel computers that provide hardware support for a cache coherence mechanism. The proposed model, the Semi-Markov Memory and Cache coherence Interference model, can be used for performance prediction of cache-coherence-based parallel computers, since it easily describes the waiting states due to network contention or memory interference for both normal data accesses and cache coherence requests. Conventional analytic models that use stochastic processes to describe parallel computers suffer from an explosion in the number of states as the system size increases, even for simple parallel computers without cache coherence mechanisms. The number of states required by our model, however, does not depend on the system size but only on the cache coherence protocol; for example, the number of states for the Synapse cache coherence protocol is only 20, as described in this paper. Using the proposed analytic model, we carry out several comparisons with widely known simulation results. We found only a 7.08% difference between the simulations and our analytic model, while the analytic model can predict the performance of a 1,024-processor system on the order of microseconds.

  • High-Speed CMOS SRAM Technologies for Cache Applications

    Koichiro ISHIBASHI  

     
    INVITED PAPER-Static RAMs   Vol: E79-C No:6   Page(s): 724-734

    This paper describes high-speed CMOS SRAM circuit technologies used in cache memories. In recent years, high-speed SRAM technology has led to higher cycle frequencies, but the rate of increase in SRAM density has slowed. Operating modes of high-speed SRAMs are compared, and the advantage of wave-pipelined SRAMs in terms of cycle frequency is shown. Three types of sense amplifiers used in SRAMs are also compared from the viewpoint of speed and power dissipation: current sense amplifiers provide high-speed operation with low power dissipation, while latch-type sense amplifiers appear most suitable for ultra-low-power SRAMs. Low-voltage operation and size reduction of full CMOS cells are now the most pressing issues in the development of SRAMs for cache memories.

  • A Supplementary Scheme for Reducing Cache Access Time

    Jong-Hong BAE  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design   Vol: E79-D No:4   Page(s): 385-387

    Among the three factors mainly affecting cache access time, i.e., hit access time, miss rate, and miss penalty, previous approaches focused on reducing the hit access time and the miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache), which achieves an additional reduction of the average instruction cache access time by reducing the miss penalty. The MPC scheme predicts cache misses and starts cache miss operations in advance; it is therefore supplementary to previous cache schemes that target the miss rate and/or hit access time. The performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with silicon area estimated using a 0.8-µm CMOS standard cell library.
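    The miss penalty shrinks if the external fetch is launched in parallel with the tag check whenever a miss is predicted. The abstract does not detail MPC's predictor, so the sketch below is a generic miss predictor offered as an assumed illustration: a small table of saturating counters decides whether to start the miss handling early.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPT_ENTRIES 256  /* assumed predictor size */
static uint8_t miss_ctr[MPT_ENTRIES];  /* 2-bit saturating counters */

/* Predict "miss" for this fetch address; if so, the memory request
 * is issued before the tag comparison completes, hiding part of
 * the miss penalty.  A wrong prediction only wastes the early
 * request. */
bool predict_miss(uint32_t addr) {
    return miss_ctr[(addr >> 4) % MPT_ENTRIES] >= 2;
}

/* Train with the actual outcome of the cache lookup. */
void train(uint32_t addr, bool missed) {
    uint8_t *c = &miss_ctr[(addr >> 4) % MPT_ENTRIES];
    if (missed) { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}
```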

  • A Selective Invalidation Strategy for Cache Coherence

    Cosimo Antonio PRETE  Gianpaolo PRINA  Luigi RICCIARDI  

     
    LETTER-Computer Hardware and Design   Vol: E78-D No:10   Page(s): 1316-1320

    The overall performance of a shared-memory, common-bus multiprocessor system can be seriously affected by useless coherence-related actions. This occurs, in particular, when a private data block of a process becomes resident in more than one cache as a consequence of the migration of the owner process. We introduce a hardware solution to eliminate these useless shared copies and show how the technique can be applied to a specific coherence protocol. Two extreme workload conditions are selected to evaluate the performance of a multiprocessor system.

  • Masked Transferring Method of Discontinuous Sectors in Disk Cache System

    Tetsuhiko FUJII  Akira YAMAMOTO  Naoya TAKAHASHI  Minoru YOSHIDA  

     
    PAPER-Computer Systems   Vol: E78-D No:10   Page(s): 1239-1247

    This paper proposes and evaluates a masked data-transferring method for write-back-controlled disk cache systems employing fixed-length recording disk drives, enabling the transfer of discontinuous sectors on the same track between the cache and the disk. In write-back-controlled disk cache systems, random write requests leave dirty data (write-pending data in the cache) in discontinuous areas of the cache, so several sectors on the same track often become dirty. These dirty sectors must be written to the disk according to the cache management scheme. In conventional transfer methods between a disk cache and a disk drive, multiple adjacent sectors can be transferred in a single operation, but discrete sectors must be transferred by individual operations: the transfer unit is given the address of the head sector and the number of sectors to transfer. For example, when two sectors on the same track are located close together but not adjacently, the transfer operation for the second sector must be prepared after the first transfer has completed and before the second sector arrives under the disk head. The time for the head to pass over the intervening sectors is often too short for the software overhead of completing the first transfer and preparing the second, which leads to an unwanted extra rotation of the disk. With the masked transferring method proposed in this paper, the microprogram creates a bit map specifying the target sectors and passes it to the data transfer unit, enabling discontinuous sectors to be transferred without this latency. The method was evaluated using OLTP workloads; the results show an improvement in random I/O throughput of between 8% and 27%. The masked transferring method is adopted in Hitachi's A-6521 disk subsystems, shipped since December 1993.
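    The heart of the method is that the microprogram hands the transfer unit one per-track bit map instead of one (address, count) pair per extent. A sketch of building and consuming such a mask, with an assumed geometry and a hypothetical transfer_sector device operation:

```c
#include <stdint.h>

#define SECTORS_PER_TRACK 64  /* assumed geometry */

extern void transfer_sector(int s);  /* hypothetical device op */

/* Build the mask of dirty (write-pending) sectors on one track. */
uint64_t dirty_mask(const int dirty[SECTORS_PER_TRACK]) {
    uint64_t m = 0;
    for (int s = 0; s < SECTORS_PER_TRACK; s++)
        if (dirty[s]) m |= 1ull << s;
    return m;
}

/* The transfer unit moves every marked sector in one pass of the
 * head, with no per-extent re-arming between sectors, so no extra
 * disk rotation is lost to software overhead. */
void masked_write(uint64_t mask) {
    for (int s = 0; s < SECTORS_PER_TRACK; s++)
        if (mask & (1ull << s))
            transfer_sector(s);  /* as the head passes over it */
}
```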

  • High Speed DRAMs with Innovative Architectures

    Shigeo OHSHIMA  Tohru FURUYAMA  

     
    INVITED PAPER-DRAM   Vol: E77-C No:8   Page(s): 1303-1315

    Newly developed high-speed DRAMs are introduced, and the innovative circuit techniques they use to achieve high data bandwidth are described: the synchronous DRAM, the cache DRAM, and the Rambus DRAM. They are all designed to fill the performance gap between MPUs and the main memory of computer systems, which will continue to widen in the '90s. Although these high-speed DRAMs share the goal of increasing data bandwidth, their approaches to accomplishing it differ, which in turn leads to advantages and disadvantages as well as to different fields of application. The paper is intended not only to provide a technical overview of these DRAMs, but also to guide DRAM users in choosing the one best suited to their systems.

  • Performance Evaluation of a Processing Element for an On-Chip Multiprocessor

    Masafumi TAKAHASHI  Hiroshige FUJII  Emi KANEKO  Takeshi YOSHIDA  Toshinori SATO  Hiroyuki TAKANO  Haruyuki TAGO  Seigo SUZUKI  Nobuyuki GOTO  

     
    PAPER   Vol: E77-C No:7   Page(s): 1092-1100

    A 250-MIPS, 125-MFLOPS peak-performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architectural components: an FPU shared by several IUs to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce clock cycles spent waiting for memory, and an interface to high-speed DRAM such as Rambus DRAM and synchronous DRAM. As a result, a PE model with an FPU shared by four or eight IUs suffers only a 10% performance reduction compared to a model with an unshared FPU, while saving the cost of three FPUs. Furthermore, a PE model with prefetch operates 1.2 to 1.8 times faster than a model without prefetch at a 250-MHz clock rate when Rambus DRAM is connected. It is clear that this PE architecture can deliver high effective performance at over 250 MHz and is cost-effective for the on-chip multiprocessor.
