
Keyword Search Result

[Keyword] cache (201 hits)

Showing results 161-180 of 201

  • An Optimistic Cache Consistency Protocol Using Preemptive Approach

    SungHo CHO  Jeong-Hyon HWANG  Kyoung Yul BAE  Chong-Sun HWANG

    PAPER-Databases
    Vol: E83-D No:9, Page(s): 1772-1780

    In Optimistic Two-Phase Locking (O2PL), when a transaction requests a commit, it cannot be committed until all requested locks are obtained. For this reason, O2PL incurs unnecessary waits and operations even though it adopts an optimistic approach. This paper suggests an efficient optimistic cache consistency protocol that provides serializability of committed transactions. Our scheme, called PCP (Preemptive Cache Protocol), decides whether to commit or abort without waiting when a transaction requests a commit. In PCP, some transactions that read stale data items can avoid being aborted, because the protocol adopts a re-ordering scheme to enhance performance. Moreover, PCP needs to store only one version of each data item for re-ordering. This paper presents a simulation-based analysis comparing the performance of PCP with other protocols such as O2PL, Optimistic Concurrency Control, and Caching Two-Phase Locking. The simulation experiments show that PCP performs as well as or better than the other schemes, with low overhead.
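
    As a rough illustration of the commit-time decision, here is a minimal sketch under assumptions; Txn, decide, and the set-based conflict test are hypothetical, not the paper's actual algorithm:

        # Sketch: PCP-style commit decision made immediately, without
        # waiting for locks. A transaction whose reads are stale but whose
        # writes do not conflict is serialized *before* the later writers.
        from dataclasses import dataclass

        @dataclass
        class Txn:
            read_set: set
            write_set: set

        def decide(txn: Txn, committed_writes: set) -> str:
            if not (txn.read_set & committed_writes):
                return "commit"          # all reads were current
            if not (txn.write_set & committed_writes):
                return "reorder-commit"  # stale reads only: re-order before writers
            return "abort"               # genuine read-write conflict

        # Example: read x (now stale) but wrote only y -> re-ordered commit.
        print(decide(Txn({"x"}, {"y"}), committed_writes={"x"}))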

  • Evaluation of Compulsory Miss Ratio for Address Cache and Replacement Policies for Restoring Packet Reachability

    Masaki AIDA  Noriyuki TAKAHASHI  Michiyo MATSUDA

    PAPER-Fiber-Optic Transmission
    Vol: E83-B No:7, Page(s): 1400-1408

    In high-speed data networks, it is important to execute high-speed address resolution for packets at a router. To accomplish this, an address cache is effective. For HTTP accesses, it has been shown that the Dual Zipfian Model can describe the distribution of destination IP addresses, which enabled us to derive the cache miss ratio in the steady state, i.e., the cache miss ratio when the cache is full. However, when systems are initialized or the network topology changes, the address cache has no address information or invalid address information. This paper derives the compulsory miss ratio, i.e., the cache miss ratio when the cache has no address entries. In addition, we discuss replacement policies for cache entries that allow fast recovery of packet reachability when the cache holds information on unreachable addresses.
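
    As a hedged illustration of the compulsory-miss setting (the LRU cache, the Zipf-like stream, and all sizes below are assumptions, not the paper's model):

        # Sketch: miss ratio of an address cache that starts with *no*
        # entries, driven by a Zipf-like destination-address stream.
        import random
        from collections import OrderedDict

        def cold_start_miss_ratio(accesses=10000, capacity=256, population=4096):
            weights = [1.0 / r for r in range(1, population + 1)]
            stream = random.choices(range(population), weights=weights, k=accesses)
            cache, misses = OrderedDict(), 0
            for addr in stream:
                if addr in cache:
                    cache.move_to_end(addr)        # refresh LRU position
                else:
                    misses += 1                    # compulsory or capacity miss
                    cache[addr] = True
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
            return misses / accesses

        print(cold_start_miss_ratio())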

  • Dynamically Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI

    PAPER-Computer System Element
    Vol: E83-D No:5, Page(s): 1048-1057

    This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the "dynamically variable line-size cache" (D-VLS cache). The D-VLS cache can optimize its line size according to the characteristics of running programs, and attempts to improve performance by appropriately exploiting the high on-chip memory bandwidth of merged DRAM/logic LSIs. In our evaluation, the average memory-access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This performance improvement is better than that of a conventional direct-mapped cache of double the size.

  • Architecture of a VOD System with Proxy Servers

    Kyung-Ah AHN  Hoon CHOI  Won-Ok KIM

    PAPER-Multimedia Systems
    Vol: E83-B No:4, Page(s): 850-857

    We present an architecture for a VOD system employing proxy servers. The proposed system provides efficient and reliable VOD services and solves the problems of traditional VOD systems with centralized, hierarchical, or distributed architectures. The proxy servers are placed between video servers and user systems. A proxy server is a small video server that has not only a caching function but also intelligence, such as VCR-like video stream control and navigation of other proxy/video servers to search for a selected video program. A VOD system of the proposed architecture can serve more users because it reduces the workload of the video servers and the network traffic. We provide a performance model of the system and also analyze service availability. The proposed architecture shows better performance and availability than traditional VOD architectures.

  • An Efficient Method of Eliminating Inclusion Overhead in Snoop-Based CC-NUMA Systems

    Hyo-Joong SUH  Seung Wha YOO  Chu Shik JHON

    PAPER-Computer Systems
    Vol: E83-D No:2, Page(s): 159-167

    In a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system, memory transactions can be classified into two types: inter-node transactions and intra-node transactions. Because the latency of inter-node transactions is usually hundreds of times larger than that of intra-node transactions, it is important to reduce it. Even though the remote cache in CC-NUMA systems improves the latency of inter-node transactions by caching remote memory lines, the remote and processor caches of snoop-based CC-NUMA systems have to maintain the multi-level cache inclusion property to simplify snooping. The inclusion property degrades cache performance for the following reasons. First, all the remote memory lines in a processor cache must be preserved in the remote cache of the same node. Second, a line replacement in the remote cache replaces the same address line in the processor caches, which does not comply with the replacement policy of the processor caches. In this paper, we propose the Access-list, which renders the inclusion property unnecessary, and evaluate the performance of the proposed system by program-driven simulation. The simulation results show that the miss rates of the caches are reduced and that the efficiency of snoop filtering is similar to that of a system with the inclusion property. The performance of the proposed system is improved by up to 1.28 times.
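
    A minimal sketch of the idea of replacing inclusion with an explicit list (the AccessList class and its interface are hypothetical; the paper's Access-list is a hardware structure):

        # Sketch: track which remote lines are present in this node's
        # processor caches, so snoops can be filtered without forcing the
        # remote cache to include every processor-cached remote line.
        class AccessList:
            def __init__(self):
                self.lines = set()

            def on_processor_fill(self, addr):    # remote line enters a processor cache
                self.lines.add(addr)

            def on_processor_evict(self, addr):   # last copy leaves the processor caches
                self.lines.discard(addr)

            def snoop_needed(self, addr) -> bool:
                return addr in self.lines         # forward the snoop only if present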

  • A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

    Koji INOUE  Tohru ISHIHARA  Kazuaki MURAKAMI

    PAPER
    Vol: E83-C No:2, Page(s): 186-194

    This paper proposes a new approach to achieving high performance and low energy consumption in set-associative caches. The cache, called a way-predicting set-associative cache, speculatively selects a single way that is likely to contain the data desired by the processor from the set designated by a memory address, before starting a normal cache access. By accessing only the predicted way, instead of all the ways in a set, energy consumption is reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
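
    A minimal runnable sketch of MRU-based way prediction (the class and return strings are assumptions for illustration, not the paper's hardware):

        # Sketch: each set remembers its most-recently-used way; that way
        # is probed first, and the other ways only on a misprediction.
        class WayPredictingCache:
            def __init__(self, num_sets, num_ways):
                self.tags = [[None] * num_ways for _ in range(num_sets)]
                self.mru = [0] * num_sets              # predicted way per set

            def access(self, set_idx, tag):
                ways = self.tags[set_idx]
                pred = self.mru[set_idx]
                if ways[pred] == tag:
                    return "hit, 1 way probed"         # fast, low-energy case
                for w, t in enumerate(ways):           # slow case: probe all ways
                    if w != pred and t == tag:
                        self.mru[set_idx] = w          # correct the prediction
                        return "hit, all ways probed"
                return "miss"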

  • A 2-ns-Access, 285-MHz, Two-Port Cache Macro Using Double Global Bit-Line Pairs

    Kenichi OSADA  Hisayuki HIGUCHI  Koichiro ISHIBASHI  Naotaka HASHIMOTO  Kenji SHIOZAWA

    PAPER-Electronic Circuits
    Vol: E83-C No:1, Page(s): 109-114

    We fabricated a 16-kB cache macro using 0.35-µm quadruple-metal CMOS technology. This is a 285-MHz, two-port 16-kB (512 × 256 b) cache macro with a 2-ns access time. This high-speed performance is enabled by a hierarchical bit-line architecture that uses double global bit-line pairs (WGBs) and a high-speed, timing-insensitive sense amplifier (ISA) that shortens the access time.

  • Fast Instruction Cache Simulation for Hardware/Software Co-Design

    Marcello LAJOLO  Luciano LAVAGNO  Alberto SANGIOVANNI-VINCENTELLI

    PAPER
    Vol: E82-A No:11, Page(s): 2475-2484

    Cache memories are one of the main factors that affect software performance, and their use is becoming increasingly common even in embedded systems. Efficient analysis of the effects of parameter variations (cache size, degree of associativity, replacement policy, line size, etc.) is both an essential and a very time-consuming aspect of embedded system design, whose complexity increases when multi-tasking and real-time aspects must be considered. We propose a new simulation-based methodology, based on an approximate model of the cache and of the multi-tasking reactive software, that allows one to trade off smoothly between accuracy and simulation speed. In particular, we propose to consider intra-task conflicts accurately, but to approximate inter-task conflicts by considering only a finite number of previous task executions. The rationale for this choice is a common pattern in embedded systems, where a "normal" data flow results in a regular intra-task flow, interrupted from time to time by some urgent event that can pessimistically be considered to disrupt the cache behavior. The approach is conservative, because re-execution of a task after a long time is always treated as not in cache, and the simulation speed-up is considerable.
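
    As a hedged sketch of the approximation (the bound K and the bookkeeping below are hypothetical; the paper's model is a cache simulator, not this Python):

        # Sketch: remember, per cache line, when it was last touched (in
        # units of task executions). A line older than K executions is
        # conservatively treated as evicted, approximating inter-task
        # conflicts while recent behavior stays exact.
        K = 2   # number of previous task executions considered (assumed)

        class ApproxICache:
            def __init__(self):
                self.last_touch = {}   # line -> execution count at last access
                self.execs = 0

            def new_task_execution(self):
                self.execs += 1

            def access(self, line) -> str:
                seen = self.last_touch.get(line)
                self.last_touch[line] = self.execs
                if seen is None:
                    return "miss"                        # first touch
                if self.execs - seen > K:
                    return "miss (assumed evicted)"      # conservative
                return "hit"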

  • Path-Classified Trace Cache for Improving Hit Ratio in Wide-Issue Processors

    Jin-Hyuk YANG  In-Cheol PARK  Chong-Min KYUNG

    PAPER-Computer Hardware and Design
    Vol: E82-D No:10, Page(s): 1338-1343

    In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing path conflicts, and basic blocks are joined to reduce the hardware cost needed to implement the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.
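
    A small sketch of the path-classification idea (the index function and sizes are assumptions, not the paper's indexing scheme):

        # Sketch: index the trace cache by the start PC *and* a path
        # signature (taken/not-taken branch bits), so distinct paths that
        # begin at the same address no longer conflict with each other.
        NUM_ENTRIES = 1024
        trace_cache = {}

        def trace_index(start_pc: int, branch_bits: int) -> int:
            return (start_pc ^ (branch_bits << 4)) % NUM_ENTRIES

        def lookup(start_pc, branch_bits):
            return trace_cache.get(trace_index(start_pc, branch_bits))

        def fill(start_pc, branch_bits, trace):
            trace_cache[trace_index(start_pc, branch_bits)] = trace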

  • A Single Chip Multiprocessor Integrated with High Density DRAM

    Tadaaki YAMAUCHI  Lance HAMMOND  Oyekunle A. OLUKOTUN  Kazutami ARIMOTO

    PAPER-Electronic Circuits
    Vol: E82-C No:8, Page(s): 1567-1577

    A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing memory latency and improving memory bandwidth. In this paper, we evaluate the performance of a single-chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip that has only an SRAM-based on-chip cache. The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating-point applications that are effectively parallelized and have large working sets. This performance difference is significantly larger than is possible with a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor of the single-chip multiprocessor, the large bandwidth of on-chip DRAM handles inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by on-chip DRAM than a uniprocessor does.

  • Cache Coherency and Concurrency Control in a Multisystem Data Sharing Environment

    Haengrae CHO

    PAPER-Databases
    Vol: E82-D No:6, Page(s): 1042-1050

    In a multisystem data sharing environment (MDSE), the computing nodes are locally coupled via a high-speed network and share a common database at the disk level. To reduce the amount of expensive and slow disk I/O, each node caches database pages in its main-memory buffer. This paper focuses on an MDSE that uses record-level locking for concurrency control. While record-level locking can guarantee higher concurrency than page-level locking, it may result in heavy message traffic. In this paper, we first propose a cache coherency scheme that reduces the message traffic of standard locking. The scheme is then extended to the context where lock caching and lock de-escalation are adopted. Using a distributed database simulation model, we evaluate the performance of the proposed schemes under a wide variety of database workloads.

  • Hash-Based Query Caching Method for Distributed Web Caching in Wide Area Networks

    Takuya ASAKA  Hiroyoshi MIWA  Yoshiaki TANAKA

    PAPER
    Vol: E82-B No:6, Page(s): 907-914

    Distributed Web caching allows multiple clients to quickly access a pool of popular Web pages. Conventional distributed Web caching schemes, e.g., the Internet cache protocol and hash routing, require many query messages among cache servers and/or impose a large load on the cache servers when they are widely dispersed. To overcome these problems, we propose a hash-based query caching method that uses both a hash function and query caching. This method can find cached objects among several cache servers using only one query message, enabling the construction of an efficient large-scale distributed Web cache. Compared to conventional methods, it reduces cache-server overhead and object-retrieval latency.
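
    A minimal sketch of the single-query idea (home_server and the MD5-based mapping are assumptions; the paper does not specify this hash function):

        # Sketch: hash the URL to a single "home" server, so a requester
        # checks its own cache and then sends exactly one query message
        # instead of flooding all peers.
        import hashlib

        def home_server(url: str, servers: list) -> str:
            digest = hashlib.md5(url.encode()).digest()
            return servers[int.from_bytes(digest[:4], "big") % len(servers)]

        servers = ["cache-a", "cache-b", "cache-c"]
        print(home_server("http://example.com/page.html", servers))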

  • Instruction Scheduling to Reduce Switching Activity of Off-Chip Buses for Low-Power Systems with Caches

    Hiroyuki TOMIYAMA  Tohru ISHIHARA  Akihiko INOUE  Hiroto YASUURA

    PAPER-Compiler
    Vol: E81-A No:12, Page(s): 2621-2629

    In many embedded systems, a significant amount of power is consumed by off-chip driving, because off-chip capacitances are much larger than on-chip capacitances. This paper proposes instruction scheduling techniques that reduce the power consumed by off-chip driving. The techniques minimize the switching activity of the data bus between an on-chip cache and main memory when instruction cache misses occur. The scheduling problem is formulated and two scheduling algorithms are presented. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.
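
    As a hedged sketch of the objective being minimized (the greedy reordering below ignores real scheduling constraints and is purely illustrative, not one of the paper's two algorithms):

        # Sketch: bus switching activity is the summed Hamming distance
        # between consecutively transferred instruction words.
        def switching(words):
            return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

        # Greedy reordering: always emit next the word closest (in bits)
        # to the previously emitted one.
        def greedy_order(words):
            order, remaining = [words[0]], list(words[1:])
            while remaining:
                nxt = min(remaining, key=lambda w: bin(order[-1] ^ w).count("1"))
                remaining.remove(nxt)
                order.append(nxt)
            return order

        prog = [0b1010, 0b0101, 0b1000, 0b0111]
        print(switching(prog), switching(greedy_order(prog)))  # 11 vs. 5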

  • Planning and Design of Contents-Delivery Systems Using Satellite and Terrestrial Networks

    Kenichi MASE  Takuya ASAKA  Yoshiaki TANAKA  Hideyoshi TOMINAGA

    PAPER-Satellite and Wireless Networks
    Vol: E81-B No:11, Page(s): 2041-2047

    An architecture is presented for efficient and reliable delivery of multimedia contents from a primary center (PC) to secondary centers (SCs). Requested contents are delivered from the PC to the SCs through a satellite broadcast channel, or from one SC to another through a terrestrial channel. Cycling methods are presented that enable the contents directory of each SC to be shared. Several fundamental models and algorithms are introduced for consideration during the planning and design of a contents-delivery system. Simulation has shown that using both satellite broadcast and terrestrial channels for contents delivery is superior, in terms of cost, to the conventional use of a satellite network alone.

  • Selective Write-Update: A Method to Relax Execution Constraints in a Critical Section

    Jae Bum LEE  Chu Shik JHON

    PAPER-Computer Systems
    Vol: E81-D No:11, Page(s): 1186-1194

    In a shared-memory multiprocessor, shared data are usually accessed within a critical section that is protected by a lock variable. Therefore, the order of accesses by multiple processors to the shared data corresponds to the order in which they acquire ownership of the lock variable. This paper presents a selective write-update protocol, in which data modified in a critical section are stored in a write cache and, at a synchronization point, transferred only to the processor that will execute the critical section after the current processor. By using QOLB synchronization primitives, the next processor can be determined at execution time. We prove that the selective write-update protocol ensures the data coherency of parallel programs that comply with release consistency, and evaluate the performance of the protocol by analytical modeling and program-driven simulation. The simulation results show that our protocol can reduce the number of coherence misses in a critical section while avoiding the multicast of write-update requests on the interconnection network. In addition, we observe that synchronization latency can be decreased by reducing both the execution time of a critical section and the number of write-update requests. The simulation results show that our protocol provides better performance than both a write-invalidate protocol and a write-update protocol as the number of processors increases.
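
    A rough sketch of the targeted update at lock release (the classes and queue handling are hypothetical; QOLB is a hardware queue-on-lock primitive, modeled here as a plain list):

        # Sketch: modifications made inside the critical section are
        # buffered in a write cache and, at release, pushed only to the
        # next lock holder instead of being multicast to all sharers.
        class SelectiveWriteUpdate:
            def __init__(self):
                self.write_cache = {}          # addr -> value written in the CS

            def store(self, addr, value):
                self.write_cache[addr] = value

            def release(self, lock_queue, node_caches):
                if lock_queue:                 # head of the queue acquires next
                    nxt = lock_queue[0]
                    node_caches[nxt].update(self.write_cache)  # one targeted update
                self.write_cache.clear()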

  • Query Caching Method for Distributed Web Caching

    Takuya ASAKA  Hiroyoshi MIWA

    LETTER-Communication Networks and Services
    Vol: E81-B No:10, Page(s): 1931-1935

    Distributed Web caching reduces the retrieval latency of World Wide Web (WWW) objects such as text and graphics. Conventional distributed Web caching methods, however, require many query messages among cache servers, which limits their scalability and reliability. To overcome these problems, we propose a query caching method in which each cache server caches not only WWW objects but also a query history. This way of finding cached objects reduces the number of query messages among cache servers, making it possible to construct a large-scale distributed Web cache. We also propose an algorithm for constructing efficient query relationships among cache servers.

  • High Bandwidth, Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI

    PAPER
    Vol: E81-C No:9, Page(s): 1438-1447

    Merged DRAM/logic LSIs can provide high on-chip memory bandwidth by interconnecting their logic portions and DRAM with wide on-chip buses. For merged DRAM/logic LSIs whose memory hierarchy includes cache memory, this high on-chip bandwidth can be exploited by replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen system performance if the programs running on the LSIs do not have enough spatial locality of reference and cache misses take place frequently. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called the variable line-size cache (VLS cache), for resolving this dilemma. The VLS cache makes good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviates the negative effects of the larger line size by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved when a cache replacement occurs can be determined according to the characteristics of the programs. This paper also evaluates the cost/performance improvement attainable by the VLS cache and compares it with that of conventional cache architectures. It is observed that the VLS cache reduces the average memory-access time by 16.4% while increasing the hardware cost by only 13%, compared to a conventional direct-mapped cache with fixed 32-byte lines.
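
    A brief sketch of the sub-line mechanism (the SUBLINE constant and alignment rule are assumptions; the real selection is done in hardware, per program):

        # Sketch: a miss fetches 1, 2, or 4 adjacent 32-byte sub-lines,
        # aligned to the fetch group, over the wide on-chip DRAM bus.
        SUBLINE = 32   # bytes per sub-line (matches the 32-byte baseline above)

        def sublines_to_fetch(miss_addr: int, fetch_size: int):
            group = SUBLINE * fetch_size                 # effective line size
            base = (miss_addr // group) * group
            return [base + i * SUBLINE for i in range(fetch_size)]

        # fetch_size=4 behaves like a 128-byte line (high spatial locality);
        # fetch_size=1 degenerates to independent 32-byte lines.
        print(sublines_to_fetch(0x1234, 4))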

  • A Proposal of Dual Zipfian Model for Describing HTTP Access Trends and Its Application to Address Cache Design

    Masaki AIDA  Noriyuki TAKAHASHI  Tetsuya ABE

    PAPER-Communication Software
    Vol: E81-B No:7, Page(s): 1475-1485

    This paper proposes the Dual Zipfian Model for describing HTTP access trends in large-scale data communication networks, and discusses how to dimension the address cache tables in an edge router of such networks. We show that the destination addresses of packets can be characterized by two types of Zipf's law. The fundamental concept of the Dual Zipfian Model is the complementary use of these laws, from which we can derive the relationship between the number of accesses and the number of destination addresses. Experimental results show that this relation gives a good approximation. Applying it, we derive the cache hit probabilities of an address cache table that provides high-speed address resolution. Using these probabilities, design issues including the capacity of the cache tables and aging algorithms for cache entries are also discussed.
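
    As an illustrative calculation only (a single pure-Zipf popularity law and the sizes below are assumptions; the paper's model combines two Zipf laws, which this sketch does not reproduce):

        # Sketch: if destination addresses follow Zipf's law p(r) ~ 1/r,
        # a steady-state cache holding the C most popular addresses hits
        # with the summed popularity of those C entries.
        def zipf_hit_ratio(cache_entries: int, population: int) -> float:
            weights = [1.0 / r for r in range(1, population + 1)]
            return sum(weights[:cache_entries]) / sum(weights)

        print(zipf_hit_ratio(1000, 100000))   # about 0.62 for these sizes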

  • Unified Tag Memory Architecture with Snoop Support

    Yonghwan LEE  Wookyeong JEONG  Yongsurk LEE

    LETTER-Systems and Control
    Vol: E81-A No:6, Page(s): 1172-1175

    A unified tag memory by which both TLBs and caches can be accessed is presented. This architecture reduces the chip area of conventional cache tags and also improves the speed of the cache system. In addition, it has been extended to support snoop accesses in multiprocessor environments. To validate the proposed architecture, we measured its area and speed in VLSI circuits.

  • Analytic Modeling of Updating Based Cache Coherent Parallel Computers

    Kazuki JOE  Akira FUKUDA

    PAPER-Computer Systems
    Vol: E81-D No:6, Page(s): 504-512

    In this paper, we apply the Semi-Markov Memory and Cache coherence Interference (SMCI) model, which we had previously proposed for invalidation-based cache-coherent parallel computers, to an update-based protocol. The model proposed here, the SMCI/Dragon model, can predict the performance of cache-coherent parallel computers using the Dragon protocol, just as the original SMCI model does for the Synapse protocol. Conventional analytic models that describe parallel computers by stochastic processes suffer a numerical explosion in the number of required states as the system size increases. We have already shown that the SMCI model achieves both a small number of states for describing parallel computers with the Synapse protocol and an inexpensive computation cost for predicting their performance. In this paper, we demonstrate the generality of the SMCI model by applying it to another cache coherence protocol, Dragon, which has characteristics opposite to those of Synapse. We show that the number of states required to construct the SMCI/Dragon model is only 21, as small as for SMCI/Synapse, and that the computation cost is likewise on the order of microseconds. Using the SMCI/Dragon model, we carried out several comparative experiments against widely known simulation results and found only a 5.4% difference between the simulation and the SMCI/Dragon model.
