
Keyword Search Result

[Keyword] cache (201 hits)

Showing 181-200 of 201 hits

  • The RDT Router Chip: A Versatile Router for Supporting a Distributed Shared Memory

    Hiroaki NISHI  Ken-ichiro ANJO  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER-Interconnection Networks   Vol: E80-D No:9   Page(s): 854-862

    JUMP-1 is currently under development by seven Japanese universities to establish techniques for building an efficient distributed shared memory on a massively parallel processor. It provides a coherent cache with a reduced hierarchical bit-map directory scheme to achieve cost-effective, high-performance management. Messages for cache coherence are transferred through a fat tree on the RDT (Recursive Diagonal Torus) interconnection network. The RDT router supports versatile functions, including multicast and acknowledgment combining, for the reduced hierarchical bit-map directory scheme. Using 0.5-µm BiCMOS SOG technology, it can transfer all packets synchronized with a single CPU clock (50 MHz). Long coaxial cables (4 m at maximum) are directly driven by the ECL interface of this chip. Using dual-port RAM, the packet buffers allow a flit of a packet to be pushed and pulled simultaneously.
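    A reduced bit-map directory of this kind can be pictured in C: each directory bit covers a group of processors rather than one processor, so the directory stays small but multicasts may reach some non-sharers. The sizes and names below are illustrative assumptions, not the JUMP-1 encoding.

```c
#include <stdint.h>

#define NPROCS     64   /* assumed machine size                      */
#define GROUP_SIZE  8   /* assumed: one directory bit per 8 processors */

/* Record a sharer: set the bit of the group containing proc. */
static inline uint8_t dir_add(uint8_t dir, int proc) {
    return dir | (uint8_t)(1u << (proc / GROUP_SIZE));
}

/* Expand the reduced directory into a multicast target list.
 * Every processor in a marked group is targeted, so some
 * non-sharers may receive (and ignore) coherence messages:
 * the price paid for the compact representation. */
static int dir_targets(uint8_t dir, int targets[NPROCS]) {
    int n = 0;
    for (int p = 0; p < NPROCS; p++)
        if (dir & (1u << (p / GROUP_SIZE)))
            targets[n++] = p;
    return n;
}
```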

  • Adsmith: An Object-Based Distributed Shared Memory System for Networks of Workstations

    Wen-Yew LIANG  Chung-Ta KING  Feipei LAI  

     
    PAPER-Computer Architecture   Vol: E80-D No:9   Page(s): 899-908

    This paper introduces an object-based distributed shared memory (DSM) system called Adsmith. The primary goal of Adsmith is to provide a low-cost, portable, and efficient DSM for networks of workstations (NOW). Adsmith achieves this goal by building on top of PVM, a widely supported communication subsystem, as a user-level library, and by incorporating many traffic-reduction and latency-hiding techniques. Issues involved in the design of Adsmith and our solution strategies are discussed. A preliminary performance evaluation of Adsmith on a network of Pentium computers is presented. The results show that programs developed with Adsmith can achieve performance comparable to that of programs developed directly with PVM.

  • MINC: Multistage Interconnection Network with Cache Control Mechanism

    Toshihiro HANAWA  Takayuki KAMEI  Hideki YASUKAWA  Katsunobu NISHIMURA  Hideharu AMANO  

     
    PAPER-Interconnection Networks   Vol: E80-D No:9   Page(s): 863-870

    A novel approach to the cache-coherent Multistage Interconnection Network (MIN), called the MINC (MIN with Cache control mechanism), is proposed. In the MINC, the directory is located only on the shared memory, using the Reduced Hierarchical Bit-map Directory (RHBD) schemes. In the RHBD, the bit-map directory is reduced and carried in the packet header for quick multicasting without accessing a directory at each level of the hierarchy. To reduce the unnecessary packets caused by compacting the bit map in the RHBD, a small cache called the pruning cache is introduced in the switching element. Simulation reveals that the pruning cache works most effectively when provided in every switching element of the first stage, where it reduces congestion by more than 50% with only 4 entries. The MINC cache control chip with 16 inputs/outputs has been implemented on an LPGA (Laser Programmable Gate Array) and works with a 66-MHz clock.
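    The pruning cache can be pictured as a tiny table inside a switching element that remembers, per memory line, which output ports actually lead to sharers; a multicast implied by the coarse header bit map is then filtered against it. A minimal sketch under assumed sizes (4 entries, up to 8 output ports), not the MINC gate-level design:

```c
#include <stdint.h>

#define PC_ENTRIES 4   /* assumed: matches the 4-entry result above */

struct pruning_entry {
    uint32_t tag;      /* memory line address                        */
    uint8_t  ports;    /* bit per output port that really has sharers */
    int      valid;
};

static struct pruning_entry pcache[PC_ENTRIES];

/* Return the ports a coherence multicast should actually use.
 * On a hit, prune the header's coarse port mask; on a miss,
 * conservatively forward to every port the header names. */
uint8_t prune(uint32_t line, uint8_t header_ports) {
    for (int i = 0; i < PC_ENTRIES; i++)
        if (pcache[i].valid && pcache[i].tag == line)
            return header_ports & pcache[i].ports;
    return header_ports;
}
```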

  • SEWD: A Cache Architecture to Speed up the Misaligned Instruction Prefetch

    Joon-Seo YIM  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design   Vol: E80-D No:7   Page(s): 742-745

    In microprocessors, reducing the cache access delay and the number of pipeline stalls is critical to improving system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by misaligned multi-word data or instruction prefetches that span two cache lines. The SEWD cache makes it possible to perform misaligned as well as aligned prefetches in one clock cycle. This feature is invaluable because branch target addresses are very often misaligned (the misalignment rate is 8 to 13% for 16-byte cache lines). An 8-Kbyte SEWD cache chip was implemented in a 0.8-µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.853 × 0.827 cm².
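    The stall that SEWD removes arises whenever a multi-word fetch crosses a line boundary, and detecting that case is a one-line computation. A sketch with assumed sizes (16-byte lines, 8-byte fetches):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE   16u  /* bytes per cache line (assumed)        */
#define FETCH_WIDTH  8u  /* bytes per instruction fetch (assumed) */

/* A fetch is misaligned (spans two lines) when the bytes it needs
 * run past the end of the line it starts in.  A conventional cache
 * must then make two sequential accesses; SEWD's separated word-line
 * decoding serves both halves in a single cycle. */
bool crosses_line(uint32_t addr) {
    return (addr % LINE_SIZE) + FETCH_WIDTH > LINE_SIZE;
}
```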

  • Deferred Locking with Buffer Validation on Demand for Client-Server Database Consistency: DL

    Hyeokmin KWON  Songchun MOON  

     
    PAPER-Databases   Vol: E80-D No:7   Page(s): 705-716

    In client-server database management systems (DBMSs), inter-transaction caching is an effective technique for improving performance. However, inter-transaction caching requires a cache consistency maintenance (CCM) protocol to ensure that cached copies at clients are kept mutually consistent. Such a protocol can be complex to implement and expensive to run, since several rounds of message exchange may be required. In this paper, we propose a new CCM scheme based on the primary-copy locking algorithm. In the proposed scheme, a number of lock requests and a data-shipping request are combined into a single message packet to reduce client-server interactions, which are known to be critical to the performance of client-server DBMSs. We examine its performance tradeoffs on the basis of a simulation model under a wide range of workloads. The performance results indicate that the proposed scheme improves overall system throughput significantly over the caching two-phase locking and optimistic two-phase locking schemes. Its higher performance results mainly from its lower communication overhead and lower transaction blocking ratio.
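    The message-combining idea can be shown as a single request packet that piggybacks the batch of deferred lock requests on the data-shipping request, so one round trip replaces several. The field names and sizes below are illustrative assumptions, not the paper's wire format:

```c
#include <stdint.h>

#define MAX_LOCKS 16   /* assumed per-message lock batch */

/* One client-to-server packet carrying both the deferred lock
 * requests and the page fetch, instead of one message per lock.
 * The server's single reply would grant the locks and ship the
 * page together. */
struct combined_request {
    uint32_t xact_id;                         /* requesting transaction */
    uint32_t page_id;                         /* page to ship back      */
    uint16_t nlocks;                          /* locks deferred so far  */
    struct { uint32_t oid; uint8_t mode; } locks[MAX_LOCKS];
};
```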

  • A 167-MHz 1-Mbit CMOS Synchronous Cache SRAM

    Hideharu YAHATA  Yoji NISHIO  Kunihiro KOMIYAJI  Hiroshi TOYOSHIMA  Atsushi HIRAISHI  Yoshitaka KINOSHITA  

     
    PAPER   Vol: E80-C No:4   Page(s): 557-565

    A 167-MHz 1-Mbit CMOS synchronous cache SRAM was developed using 0.40-µm process technology. The floor plan was designed so that the address registers are located in the center of the chip, and high-speed circuits such as the quasi-latch (QL) sense amplifier and the one-shot control (OSC) output register were developed. To maintain suitable setup- and hold-time margins, an equivalent margin (EM) design method was developed. 167-MHz operation was measured at a supply voltage of 2.5 V and an ambient temperature of 75°C. Equal setup- and hold-time margins of 1.1 ns were measured against specifications of a 2.0-ns setup time and a 0.5-ns hold time.

  • Performance Evaluation of VEEC: The Virtual Execution Environment Control for a Remote Knowledge Base Access

    Yoshitaka FUJIWARA  Shin-ichiro OKADA  Hiroyuki TAKADOI  Toshiharu MATSUNISHI  Hiroshi OHKAMA  

     
    PAPER-Protocol   Vol: E80-B No:1   Page(s): 81-86

    In a conventional client-server system using satellite communications, the responsiveness of the system to the client user is considerably degraded by the long transmission time between the satellite and the ground terminal, as well as by the relatively low data transmission rate compared with ground transmission lines such as Ethernet. In this paper, a new client-server control, VEEC, is proposed to solve this problem. Experimental performance studies clarify that responsiveness at the client is remarkably improved when the pre-fetching mechanism of VEEC works efficiently.

  • Address Addition and Decoding without Carry Propagation

    Yung-Hei LEE  Seung Ho HWANG  

     
    LETTER-Algorithm and Computational Complexity   Vol: E80-D No:1   Page(s): 98-100

    The response time of adders is mainly determined by the carry propagation delay, and memory latency is one of the most significant performance-limiting factors. This letter deals with a scheme that combines address addition and decoding: although addition is involved in the process, we show that it can be computed without carry propagation. The authors present a new decoder logic named the fused add-decoder (FADEC), which performs address addition and decoding in a single process. FADEC can reduce memory latency by eliminating the separate address-addition cycle.
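    The carry-free property rests on a local identity: base + offset equals a row address k exactly when, at every bit, the carry that equality would require matches the carry the bit below would generate, so each wordline can check its own k with two-bit-wide logic and no carry chain. The following is a behavioral sketch of this sum-addressed decode idea, my reconstruction rather than the letter's gate-level design:

```c
#include <stdbool.h>
#include <stdint.h>

/* True iff a + b == k (mod 2^32), tested without propagating
 * carries.  With p = a^b and g = a&b, the equality forces the
 * carry into bit i to be p_i ^ k_i; the carry out of bit i is
 * then g_i | (p_i & ~k_i).  Each bit checks only its neighbour,
 * so all conditions are evaluated in parallel. */
bool fused_match(uint32_t a, uint32_t b, uint32_t k) {
    uint32_t p    = a ^ b;
    uint32_t g    = a & b;
    uint32_t need = p ^ k;          /* carry-in each bit needs   */
    uint32_t make = g | (p & ~k);   /* carry-out each bit makes  */
    return need == (make << 1);     /* bit 0 needs carry-in 0    */
}
```

    For example, fused_match(3, 1, 4) holds because 3 + 1 = 4, while fused_match(3, 1, 5) fails, and no carry ever ripples across the word.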

  • An 8-mW, 8-kB Cache Memory Using an Automatic-Power-Save Architecture for Low Power RISC Microprocessors

    Yasuhisa SHIMAZAKI  Katsuhiro NORISUE  Koichiro ISHIBASHI  Hideo MAEJIMA  

     
    PAPER   Vol: E79-C No:12   Page(s): 1693-1698

    An embedded cache memory for low-power RISC microprocessors is described. An automatic-power-save architecture (APSA) enables the cache memory to operate at high speed at high frequencies and with low power dissipation at low frequencies. A pulsed word technique (PWT) and an isolated bit-line technique (IBLT) effectively reduce the power dissipation of the cache memory. Using these three techniques, the power dissipation of the cache memory is reduced to almost 60% of that of a conventional cache memory at 60 MHz, and to 20% at a clock frequency of 10 MHz. An 8-KByte test chip using 0.5-µm CMOS technology was fabricated; it achieves 80-MHz operation at a supply voltage of 3.1 V, and 8-mW operation at 10 MHz with a supply voltage of 2.5 V.

  • A Virtual Cache Architecture for Retaining the Process Working Sets in a Multiprogramming Environment

    Dongwook KIM  Joonwon LEE  

     
    PAPER-Computer Hardware and Design   Vol: E79-D No:12   Page(s): 1637-1645

    A direct-mapped cache takes less time to access data than a set-associative cache because no time is needed to select a cache line within a set. The hit ratio of a direct-mapped cache, however, is lower due to the conflict misses caused by mapping multiple addresses to the same cache line. Addressing cache memory by virtual addresses reduces the cache access time by eliminating the time needed for address translation. The synonym problem in a virtual cache necessitates an additional field in the cache tag to denote the process to which a cache line belongs. In this paper, we propose a new virtual cache architecture whose average access time is almost the same as that of a direct-mapped cache while its hit ratio matches that of a set-associative cache. A victim for cache replacement is selected from the lines belonging to the process that is furthest from being scheduled again. The entire cache memory is divided into n banks, and each process is assigned to a bank. Each process then runs on its assigned bank, and the cache behaves like a direct-mapped cache. Trace-driven simulations confirm that the new scheme removes almost as many conflict misses as a set-associative cache does, while its access time remains similar to that of a direct-mapped cache.
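    The bank scheme keeps direct-mapped access time because the bank number assigned to the running process simply becomes the high part of the cache index, so a lookup is still a single indexed read with no way selection. A sketch with assumed sizes (4 banks of 256 lines, 32-byte lines), not the paper's exact organization:

```c
#include <stdint.h>

#define NBANKS         4u    /* assumed number of banks   */
#define LINES_PER_BANK 256u  /* assumed lines per bank    */
#define LINE_BITS      5u    /* 32-byte lines (assumed)   */

/* Each process is assigned one bank; within it the cache is
 * plain direct-mapped, so there is no set search and no
 * way-selection delay on the access path. */
uint32_t cache_index(uint32_t vaddr, uint32_t bank_of_process) {
    uint32_t line = (vaddr >> LINE_BITS) % LINES_PER_BANK;
    return (bank_of_process % NBANKS) * LINES_PER_BANK + line;
}
```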

  • Hiding Data Cache Latency with Load Address Prediction

    Toshinori SATO  Hiroshige FUJII  Seigo SUZUKI  

     
    PAPER-Computer Systems   Vol: E79-D No:11   Page(s): 1523-1532

    A new method for predicting effective addresses is presented. The method works with a buffer named the address prediction buffer and allows the data cache to be accessed speculatively. As a consequence of the trend toward increasing clock frequency, the internal cache is no longer able to fill the speed gap between the processor and external memory, and data cache latency degrades processor performance. The proposed method hides this latency: the load address is predicted, and the data is fetched earlier than the memory access stage. When the prediction is correct, the latency is hidden; even when it is incorrect, performance is not degraded by any miss penalty. We have found that the prediction accuracy is 81.9% on average, and thus performance is improved by 6.6% on average, and by a maximum of 12.1%, for the integer programs.
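    A common way to realize such an address prediction buffer is a PC-indexed table holding each load's last address and stride; the predicted address lets the cache be probed before the effective address is computed. The structure below is a generic stride predictor offered as an assumed illustration, not necessarily the paper's exact buffer:

```c
#include <stdint.h>

#define APB_ENTRIES 256  /* assumed table size */

struct apb_entry { uint32_t last_addr; int32_t stride; };
static struct apb_entry apb[APB_ENTRIES];

/* Predict a load's next effective address from its PC so the
 * cache access can start speculatively; on a wrong guess the
 * normal access simply proceeds, adding no miss penalty. */
uint32_t apb_predict(uint32_t pc) {
    struct apb_entry *e = &apb[(pc >> 2) % APB_ENTRIES];
    return e->last_addr + (uint32_t)e->stride;
}

/* Once the real address is known, train the entry. */
void apb_update(uint32_t pc, uint32_t addr) {
    struct apb_entry *e = &apb[(pc >> 2) % APB_ENTRIES];
    e->stride    = (int32_t)(addr - e->last_addr);
    e->last_addr = addr;
}
```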

  • (Mπ)²: A Hierarchical Parallel Processing System for the Multipass Rendering Method

    Hiroaki KOBAYASHI  Hitoshi YAMAUCHI  Yuichiro TOH  Tadao NAKAMURA  

     
    PAPER-Architectures   Vol: E79-D No:8   Page(s): 1055-1064

    This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method, based on the integration of radiosity and ray tracing, can synthesize photo-realistic images, but it is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)², employs two kinds of parallel processing: for coarse-grain parallelism, object-space parallel processing with multiple processing elements based on object-space subdivision is adopted, and for fine-grain parallelism, each processing element (PE) is equipped with multiple pipelined units. To balance load across the system, static load balancing at the PE level and dynamic load balancing at the pipelined-unit level within each PE are introduced. In particular, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space over a one- or two-dimensional processor configuration of the (Mπ)² system, as sketched below. Simulation experiments show that two-dimensional (Mπ)² systems with skewed-distributed allocation outperform three-dimensional systems with non-skewed distributed allocation. Since lower-dimensional systems can be built at a lower cost than higher-dimensional ones, skewed-distributed allocation is advantageous. Moreover, combining static load balancing by skewed-distributed allocation with dynamic load balancing by dynamic ray allocation within each PE boosts system performance further. We also propose a cached frame buffer system to relieve access collisions on the frame buffer.
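    One plausible reading of a skewed-distributed allocation is to shift each layer of the 3-D subvolume grid before folding it onto the 2-D processor array, so that a ray marching along any axis touches many different PEs instead of loading one row. This is a guess at the general idea, not the paper's exact mapping:

```c
/* Map subvolume (x, y, z) of an N*N*N grid onto a P*P processor
 * array.  The z-dependent shift "skews" successive layers so a
 * ray traversing the volume along one axis spreads its work over
 * many PEs rather than concentrating it on a single row. */
void skewed_pe(int x, int y, int z, int P, int *pe_x, int *pe_y) {
    *pe_x = (x + z) % P;
    *pe_y = (y + z) % P;
}
```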

  • Software Cache Techniques for Memory Nodes in Distributed Memory Parallel Production Systems

    Jun MIYAZAKI   Haruo YOKOTA  

     
    PAPER-Architectures   Vol: E79-D No:8   Page(s): 1046-1054

    Because the match phase in OPS5-type production systems accounts for most of the system's execution time and memory accesses, we previously proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm, for distributed-memory parallel computers (multicomputers), to reduce this bottleneck. CPPS was effective in speeding up the match phase, but still left room for optimization. In this paper, we introduce software cache techniques for the memory nodes in CPPS as one such optimization, and implement them on a multicomputer, the nCUBE2. The benchmark results show that CPPS with the software cache is about 2-fold faster than the original, and more than 7-fold faster than the simple hash method proposed by Acharya et al., for a large-scale problem. The speed-up can be attributed to decreased communication costs.
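    At a memory node, a software cache of this kind amounts to a local table consulted before paying the communication cost of a remote fetch. A minimal sketch under assumed types and sizes, not the nCUBE2 implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define SC_ENTRIES 1024  /* assumed software-cache size */

struct sc_entry { uint64_t key; void *data; int valid; };
static struct sc_entry sc[SC_ENTRIES];

/* Look up a working-memory item locally; only on a miss does the
 * node fall back to inter-node communication. */
void *sc_lookup(uint64_t key) {
    struct sc_entry *e = &sc[key % SC_ENTRIES];
    return (e->valid && e->key == key) ? e->data : NULL;
}

/* Fill the direct-mapped slot after a remote fetch. */
void sc_insert(uint64_t key, void *data) {
    struct sc_entry *e = &sc[key % SC_ENTRIES];
    e->key = key; e->data = data; e->valid = 1;
}
```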

  • Analytic Modeling of Cache Coherence Based Parallel Computers

    Kazuki JOE  Akira FUKUDA  

     
    PAPER-Computer Systems   Vol: E79-D No:7   Page(s): 925-935

    In this paper, we propose an analytic model using a semi-Markov process for parallel computers that provide hardware support for a cache coherence mechanism. The proposed model, the Semi-Markov Memory and Cache coherence Interference model, can be used for performance prediction of cache-coherence-based parallel computers, since it easily describes the waiting states due to network contention or memory interference for both normal data accesses and cache coherence requests. Conventional analytic models that use stochastic processes to describe parallel computers suffer from an explosion in the number of states as the system size increases, even for simple parallel computers without cache coherence mechanisms. The number of states required by our model, however, does not depend on the system size but only on the cache coherence protocol; for example, the number of states for the Synapse cache coherence protocol is only 20, as described in this paper. Using the proposed analytic model, we carry out several comparisons with widely known simulation results. We found only a 7.08% difference between the simulations and our analytic model, while the analytic model can predict the performance of a 1,024-processor system on the order of microseconds.

  • High-Speed CMOS SRAM Technologies for Cache Applications

    Koichiro ISHIBASHI  

     
    INVITED PAPER-Static RAMs   Vol: E79-C No:6   Page(s): 724-734

    This paper describes high-speed CMOS SRAM circuit technologies used in cache memories. In recent years, high-speed SRAM technology has led to higher cycle frequencies, but the rate of increase in SRAM density has slowed. Operating modes of high-speed SRAMs are compared, and the advantage of wave-pipelined SRAMs in terms of cycle frequency is shown. Three types of sense amplifiers used in SRAMs are also compared from the viewpoint of speed and power dissipation: current sense amplifiers provide high-speed operation with low power dissipation, while latch-type sense amplifiers appear most suitable for ultra-low-power SRAMs. Low-voltage operation and size reduction of full CMOS cells are now the most pressing issues in the development of SRAMs for cache memories.

  • A Supplementary Scheme for Reducing Cache Access Time

    Jong-Hong BAE  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design   Vol: E79-D No:4   Page(s): 385-387

    Among the three factors mainly affecting cache access time, i.e., hit access time, miss rate, and miss penalty, previous approaches focused on reducing the hit access time and the miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache), which achieves an additional reduction of the average instruction cache access time by reducing the miss penalty. The MPC scheme predicts cache misses and starts cache miss operations in advance; it is therefore supplementary to previous cache schemes that target the miss rate and/or hit access time. The performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with silicon area estimated using a 0.8-µm CMOS standard cell library.
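    The miss penalty shrinks if the external fetch is launched in parallel with the tag check whenever a miss is predicted. The abstract does not detail MPC's predictor, so the sketch below is a generic miss predictor offered as an assumed illustration: a small table of saturating counters decides whether to start the miss handling early.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPT_ENTRIES 256  /* assumed predictor size */
static uint8_t miss_ctr[MPT_ENTRIES];  /* 2-bit saturating counters */

/* Predict "miss" for this fetch address; if so, the memory request
 * is issued before the tag comparison completes, hiding part of
 * the miss penalty.  A wrong prediction only wastes the early
 * request. */
bool predict_miss(uint32_t addr) {
    return miss_ctr[(addr >> 4) % MPT_ENTRIES] >= 2;
}

/* Train with the actual outcome of the cache lookup. */
void train(uint32_t addr, bool missed) {
    uint8_t *c = &miss_ctr[(addr >> 4) % MPT_ENTRIES];
    if (missed) { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}
```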

  • A Selective Invalidation Strategy for Cache Coherence

    Cosimo Antonio PRETE  Gianpaolo PRINA  Luigi RICCIARDI  

     
    LETTER-Computer Hardware and Design   Vol: E78-D No:10   Page(s): 1316-1320

    The overall performance of a shared-memory, common-bus multiprocessor system can be seriously affected by useless coherence-related actions. This occurs, in particular, when a private data block of a process becomes resident in more than one cache as a consequence of the migration of the owner process. We introduce a hardware solution to eliminate these useless shared copies and show how the technique can be applied to a specific coherence protocol. Two extreme workload conditions are selected to evaluate the performance of a multiprocessor system.

  • Masked Transferring Method of Discontinuous Sectors in Disk Cache System

    Tetsuhiko FUJII  Akira YAMAMOTO  Naoya TAKAHASHI  Minoru YOSHIDA  

     
    PAPER-Computer Systems   Vol: E78-D No:10   Page(s): 1239-1247

    This paper proposes and evaluates a masked data-transferring method for write-back-controlled disk cache systems employing fixed-length recording disk drives, enabling the transfer of discontinuous sectors on the same track between the cache and the disk. In write-back-controlled disk cache systems, random write requests leave dirty data (write-pending data in the cache) in discontinuous areas of the cache, so several sectors on the same track often become dirty. These dirty sectors must be written to the disk according to the cache management scheme. In conventional transfer methods between a disk cache and a disk drive, multiple adjacent sectors can be transferred in a single operation, but discrete sectors must be transferred by individual operations: the transfer unit is given the address of the head sector and the number of sectors to transfer. For example, when two sectors on the same track are located close together but not adjacently, the transfer operation for the second sector must be prepared after the first transfer has completed and before the second sector arrives under the disk head. The time for the head to pass over the intervening sectors is often too short for the software overhead of completing the first transfer and preparing the second, which leads to an unwanted extra rotation of the disk. With the masked transferring method proposed in this paper, the microprogram creates a bit map specifying the target sectors and passes it to the data transfer unit, enabling discontinuous sectors to be transferred without this latency. The method was evaluated using OLTP workloads; the results show an improvement in random I/O throughput of between 8% and 27%. The masked transferring method is adopted in Hitachi's A-6521 disk subsystems, shipped since December 1993.
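    The heart of the method is that the microprogram hands the transfer unit one per-track bit map instead of one (address, count) pair per extent. A sketch of building and consuming such a mask, with an assumed geometry and a hypothetical transfer_sector device operation:

```c
#include <stdint.h>

#define SECTORS_PER_TRACK 64  /* assumed geometry */

extern void transfer_sector(int s);  /* hypothetical device op */

/* Build the mask of dirty (write-pending) sectors on one track. */
uint64_t dirty_mask(const int dirty[SECTORS_PER_TRACK]) {
    uint64_t m = 0;
    for (int s = 0; s < SECTORS_PER_TRACK; s++)
        if (dirty[s]) m |= 1ull << s;
    return m;
}

/* The transfer unit moves every marked sector in one pass of the
 * head, with no per-extent re-arming between sectors, so no extra
 * disk rotation is lost to software overhead. */
void masked_write(uint64_t mask) {
    for (int s = 0; s < SECTORS_PER_TRACK; s++)
        if (mask & (1ull << s))
            transfer_sector(s);  /* as the head passes over it */
}
```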

  • High Speed DRAMs with Innovative Architectures

    Shigeo OHSHIMA  Tohru FURUYAMA  

     
    INVITED PAPER-DRAM   Vol: E77-C No:8   Page(s): 1303-1315

    Newly developed high-speed DRAMs are introduced, and the innovative circuit techniques they use to achieve high data bandwidth are described: the synchronous DRAM, the cache DRAM, and the Rambus DRAM. They are all designed to fill the performance gap between MPUs and the main memory of computer systems, which will continue to widen in the '90s. Although these high-speed DRAMs share the goal of increasing data bandwidth, their approaches to accomplishing it differ, which in turn leads to advantages and disadvantages as well as to different fields of application. The paper is intended not only to provide a technical overview of these DRAMs, but also to guide DRAM users in choosing the one best suited to their systems.

  • Performance Evaluation of a Processing Element for an On-Chip Multiprocessor

    Masafumi TAKAHASHI  Hiroshige FUJII  Emi KANEKO  Takeshi YOSHIDA  Toshinori SATO  Hiroyuki TAKANO  Haruyuki TAGO  Seigo SUZUKI  Nobuyuki GOTO  

     
    PAPER   Vol: E77-C No:7   Page(s): 1092-1100

    A 250-MIPS, 125-MFLOPS peak-performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architectural components: an FPU shared by several IUs to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce clock cycles spent waiting for memory, and an interface to high-speed DRAM such as Rambus DRAM and synchronous DRAM. As a result, a PE model with an FPU shared by four or eight IUs suffers only a 10% performance reduction compared to a model with an unshared FPU, while saving the cost of three FPUs. Furthermore, a PE model with prefetch operates 1.2 to 1.8 times faster than a model without prefetch at a 250-MHz clock rate when Rambus DRAM is connected. It is clear that this PE architecture can deliver high effective performance at over 250 MHz and is cost-effective for the on-chip multiprocessor.
