Keyword Search Result

[Keyword] CAC(297hit)

221-240hit(297hit)

  • Cost Sensitive Space Management in Shared Web Caches

    Hyokyung BAHN  Yong Hyeon SHIN  Kern KOH  

     
    LETTER-Internet

    Vol: E84-B No:7  Page(s): 1965-1969

    A new Web cache sharing scheme is presented. Our method reduces duplicated copies of the same objects across globally shared Web caches, while still allowing the hot working set of each local cache to be duplicated. Experimental results show that the proposed scheme outperforms existing sharing schemes.
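
    The abstract gives no algorithm details, so the following is only a minimal sketch of cost-sensitive eviction in a shared Web cache; the benefit-per-byte weighting and the duplicate discount are assumptions for illustration, not the paper's formula.

    ```python
    # Cost-sensitive eviction sketch for a shared Web cache group.
    # The value function and the duplicate discount are illustrative.

    def eviction_value(obj, copies_elsewhere):
        """Lower value => better eviction candidate."""
        benefit = obj["fetch_cost"] * obj["ref_count"] / obj["size"]
        # Objects duplicated in other caches of the group are cheaper
        # to evict: another copy can still serve hits.
        return benefit / (1 + copies_elsewhere)

    def pick_victim(cache, copy_counts):
        return min(cache, key=lambda o: eviction_value(o, copy_counts[o["url"]]))

    cache = [
        {"url": "/a", "size": 10, "ref_count": 4, "fetch_cost": 20},
        {"url": "/b", "size": 5, "ref_count": 1, "fetch_cost": 80},
    ]
    print(pick_victim(cache, {"/a": 3, "/b": 0})["url"])  # "/a"
    ```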

  • Analytical Models and Performance Analyses of Instruction Fetch on Superscalar Processors

    Sun-Mo KIM  Jung-Woo LEE  Soo-Haeng LEE  Sang-Bang CHOI  

     
    PAPER

    Vol: E84-A No:6  Page(s): 1442-1453

    Cache memories are small, fast memories used to temporarily hold the contents of main memory that are likely to be referenced by processors, so as to reduce instruction and data access time. Most previous studies of cache performance have employed simulation-based methods. However, such studies cannot precisely explain the results they obtain. Moreover, when a new processor is designed, huge simulations must be rerun with several different parameters. This research classifies cache structures for superscalar processors into four types and then presents an analytical model of the instruction fetch process for each cache type, considering architectural parameters such as the frequency of branch instructions in the program, the cache miss rate, the cache miss penalty, the branch misprediction frequency, and the branch misprediction penalty. To validate the proposed models, we performed extensive simulations and compared the results with the analytical models. The simulations showed that the proposed model can estimate the expected instruction fetch rate accurately, within 10% error in most cases. This paper shows that an increase in cache misses reduces the instruction fetch rate more severely than an increase in branch mispredictions does. The model also provides the exact relationship between cache misses and branch mispredictions for instruction fetch analysis. The proposed model can explain causes of performance degradation that cannot be uncovered by simulation alone.
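
    As a rough illustration of how such a model works (this is a generic stall model, not the paper's four cache-type models; every parameter value below is invented), an expected fetch rate can be estimated from miss rates and penalties:

    ```python
    # Back-of-the-envelope instruction fetch rate estimate.

    def fetch_rate(fetch_width, miss_rate, miss_penalty,
                   mispred_rate, mispred_penalty):
        """Expected instructions fetched per cycle."""
        # Average stall cycles incurred per fetched instruction.
        stall = miss_rate * miss_penalty + mispred_rate * mispred_penalty
        return fetch_width / (1 + fetch_width * stall)

    # With equal event rates, the longer cache-miss penalty dominates,
    # matching the paper's qualitative conclusion.
    print(fetch_rate(4, miss_rate=0.02, miss_penalty=20,
                     mispred_rate=0.02, mispred_penalty=5))
    ```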

  • Exploiting Metadata of Absent Objects for Proxy Cache Consistency

    Jooyong KIM  Hyokyung BAHN  Kern KOH  

     
    PAPER-Network

    Vol: E84-B No:5  Page(s): 1406-1412

    Caching at the Web proxy server plays an important role in reducing the response time, the network traffic, and the load on Web servers. Many recent studies have proposed and examined replacement and consistency policies for the proxy cache, which play a central role in the performance of caching components. For better performance, these policies exploit various metadata of Web objects, such as reference counts, reference times, and modification times from past behavior, to estimate the re-reference likelihood and freshness of objects. However, all such policies known to the authors use the metadata only when the actual object is in the cache. We observed from various proxy traces that about 20-30% of clients' requests incur only validity checks on cached objects, without transferring the actual objects from the proxy server. In this case, only the metadata are needed at the proxy server. This paper proposes a proxy cache consistency policy that uses metadata even for absent objects, including the time information of objects evicted from the cache and that carried in header-only replies from Web servers. Trace-driven simulations with public proxy cache traces show that our policy significantly reduces the response time and the number of connections to Web servers.
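
    A minimal sketch of the core idea, keeping freshness metadata after the object body is gone; the TTL heuristic and field names are assumptions, not the paper's policy:

    ```python
    import time

    meta = {}   # url -> last-modified time, kept even after eviction
    cache = {}  # url -> object body

    def evict(url):
        cache.pop(url, None)   # drop the body ...
        # ... but deliberately keep meta[url] for future validity checks.

    def is_fresh(url, ttl=60):
        """Answer a client's validity check from metadata alone."""
        lm = meta.get(url)
        return lm is not None and (time.time() - lm) < ttl

    meta["/img.png"] = time.time()
    evict("/img.png")
    print(is_fresh("/img.png"))  # True, with no contact to the origin server
    ```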

  • Tree-Caching for Multicast Connections with End-to-End Delay Constraint

    David Chee Kheong SIEW  Gang FENG  

     
    PAPER-Network

    Vol: E84-B No:4  Page(s): 1030-1040

    The problem of finding a minimum-cost multicast tree (Steiner tree) is known to be NP-complete. Heuristic-based algorithms that achieve good performance on this problem are usually time-consuming. In this paper, we propose a new strategy called tree-caching for efficient multicast connection setup in connection-oriented networks. In this scheme, tree topologies that have been computed are cached in a database at the source nodes. This reduces the connection establishment time for subsequent connection requests that share some multicast members, by efficiently reusing cached trees without re-running a multicast routing algorithm for the whole group. The method eliminates, whenever possible, the expensive tree computation that would otherwise have to be performed when setting up a multicast connection. We first formulate the tree-caching problem and then propose a tree-caching algorithm that reduces the complexity of tree computations when a new connection is to be established. Through simulations, we find that the proposed tree-caching strategy performs very well and can significantly reduce the computational complexity of setting up multicast connections.
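
    A minimal sketch of the caching side of the idea; matching a new request to the cached tree with the largest member overlap is an illustrative reuse rule, not the paper's algorithm:

    ```python
    tree_cache = []  # list of (members, tree) pairs kept at the source node

    def lookup(members):
        """Return the cached tree sharing the most members, if any."""
        best, best_overlap = None, 0
        for cached_members, tree in tree_cache:
            overlap = len(cached_members & members)
            if overlap > best_overlap:
                best, best_overlap = tree, overlap
        return best  # the caller grafts the remaining members onto this tree

    tree_cache.append((frozenset({"a", "b", "c"}), "tree1"))
    print(lookup(frozenset({"b", "c", "d"})))  # "tree1"
    ```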

  • A Call Admission Control Scheme for Multi-Rate Traffic Based on Total Received Power

    Kuenyoung KIM  Youngnam HAN  

     
    PAPER

    Vol: E84-B No:3  Page(s): 457-463

    This paper proposes CAC (Call Admission Control) algorithms whose decisions are based on the total received power on the reverse link in CDMA systems with multiple types of services. The proposed CAC schemes have the advantages of dynamically controlling the cell load and guaranteeing the required SIR (signal-to-interference ratio), that is, QoS (quality of service). The optimal ratio of required transmission powers that achieves the maximum capacity is derived for evaluating the cell load and the SIR as reflected in the total received power at base stations. Different thresholds can be allocated to different services, which can be regarded as a priority-based scheme. Two types of services, voice and data, are considered, and the maximum throughput is obtained when the threshold values are the same. In addition, a transmitter-removal scheme is adopted that drops calls when the system cannot meet the required SIR. Simulation results are provided to evaluate the performance.
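
    A minimal sketch of a power-based admission test; the per-service power increments and thresholds are invented numbers, not the values derived in the paper:

    ```python
    THRESHOLD = {"voice": 0.75, "data": 0.70}  # fraction of max rx power
    POWER_INC = {"voice": 0.02, "data": 0.05}  # estimated load per new call

    def admit(call_type, total_rx_power):
        """Admit a call only if the predicted load stays under threshold."""
        return total_rx_power + POWER_INC[call_type] <= THRESHOLD[call_type]

    print(admit("voice", 0.70))  # True
    print(admit("data", 0.70))   # False: the data service has less headroom
    ```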

  • Effective Caching for NetNews Servers

    Junichi FUNASAKA  Keizo SAISHO  Akira FUKUDA  

     
    PAPER-Databases

    Vol: E84-D No:3  Page(s): 348-354

    Since the traffic of NetNews is increasing, keeping all articles becomes a serious problem in terms of wasted network bandwidth and disk usage. In addition, users do not read all incoming articles. We have proposed several caching algorithms to overcome this problem and have shown that a selective prefetch scheme gives the best system performance among them. However, because the selective prefetch scheme employed a simple selection policy, it gave a low hit ratio in some cases. This paper therefore aims to improve the selective prefetch scheme with respect to disk usage as well as hit ratio. We divide the scheme into three factors: the reference span, the criterion, and the threshold within the criterion. Through simulation experiments using actual NetNews logs, we investigate the influence of the reference span and the threshold on system performance. The results show that the reference span is a more significant factor than the threshold, and that the selective prefetch scheme with a reference span of about seven days maintains a high hit ratio while reducing disk usage.
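
    A minimal sketch of the reference-span test; the per-group data layout is an assumption for illustration:

    ```python
    from datetime import datetime, timedelta

    REFERENCE_SPAN = timedelta(days=7)  # the value the paper found effective

    def should_prefetch(group_last_read, now):
        """Prefetch new articles only for recently read newsgroups."""
        return (now - group_last_read) <= REFERENCE_SPAN

    now = datetime(2001, 3, 1)
    print(should_prefetch(datetime(2001, 2, 26), now))  # True
    print(should_prefetch(datetime(2001, 1, 10), now))  # False
    ```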

  • Area-Efficient Multi-Port SRAMs for On-Chip Data-Storage with High Random-Access Bandwidth and Large Storage Capacity

    Hans Jurgen MATTAUSCH  Koji KISHI  Takayuki GYOHTEN  

     
    PAPER-Integrated Electronics

    Vol: E84-C No:3  Page(s): 410-417

    The recent trend towards highly parallel on-chip data processing, e.g. in single-chip processors capable of executing multiple instructions in parallel, leads to the requirement of on-chip data storage with high random-access bandwidth, parallel access capability, and large capacity. The first two requirements call for multi-ported memories. However, the conventional architecture, based on a multi-port storage cell for each bit, cannot efficiently realize large storage capacity, because cell area explodes with a quadratic increase in port number (N). A promising way to obtain area efficiency is to increase the size of the smallest unit with N-port capability, e.g. by introducing N-port capability at the level of blocks of 1-port cells rather than for each cell. We report a quantitative analysis of this method for the SRAM case, based on design data in a 0.5 µm, 2-metal CMOS technology. Achievable area reductions over the conventional architecture are found to be enormous and to accelerate as a function of N. Reduction factors to areas < 1/2, < 1/5, < 1/14, and < 1/30 are estimated for 4, 8, 16, and 32 ports, respectively. Since the demerit of the proposed approach is an increased access-rejection probability, a trade-off between area reduction and allowable access-rejection probability is always necessary in practical applications. This is discussed for the application of multi-port cache memories.
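
    A toy area model of the contrast; all constants are invented and not calibrated to the paper's 0.5-µm design data:

    ```python
    # Per-bit area: a multi-port cell grows roughly quadratically with the
    # port count N (wordlines and bitlines both multiply), while an N-port
    # block of 1-port cells pays only an amortized access overhead.

    def area_per_bit_conventional(n_ports):
        return float(n_ports ** 2)

    def area_per_bit_blocked(n_ports, block_bits=256, overhead_per_port=16.0):
        return 1.0 + overhead_per_port * n_ports / block_bits

    for n in (4, 8, 16, 32):
        print(n, area_per_bit_conventional(n), round(area_per_bit_blocked(n), 2))
    ```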

  • A High-Performance/Low-Power On-Chip Memory-Path Architecture with Variable Cache-Line Size

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER

    Vol: E83-C No:11  Page(s): 1716-1723

    This paper proposes an on-chip memory-path architecture employing a dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to the replaced cache-line size produces a significant energy reduction. In our simulation, the proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (energy-delay) product by more than 75% over a conventional memory-path model.
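
    The ED-product arithmetic behind such a claim, with assumed relative energy and delay values chosen only to show how a more-than-75% improvement can arise:

    ```python
    def ed_product(energy, delay):
        return energy * delay

    conventional = ed_product(energy=1.0, delay=1.0)
    d_vls = ed_product(energy=0.4, delay=0.6)  # assumed relative values
    print(f"ED improvement: {1 - d_vls / conventional:.0%}")  # 76%
    ```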

  • An Optimistic Cache Consistency Protocol Using Preemptive Approach

    SungHo CHO  Jeong-Hyon HWANG  Kyoung Yul BAE  Chong-Sun HWANG  

     
    PAPER-Databases

    Vol: E83-D No:9  Page(s): 1772-1780

    In Optimistic Two-Phase Locking (O2PL), when a transaction requests a commit, it cannot be committed until all requested locks are obtained. For this reason, O2PL leads to unnecessary waits and operations even though it adopts an optimistic approach. This paper suggests an efficient optimistic cache consistency protocol that provides serializability of committed transactions. Our cache consistency scheme, called PCP (Preemptive Cache Protocol), decides whether to commit or abort without waiting when a transaction requests a commit. In PCP, some transactions that read stale data items need not be aborted, because the protocol adopts a re-ordering scheme to enhance performance. In addition, PCP stores only one version of each data item for re-ordering. This paper presents a simulation-based analysis comparing the performance of PCP with other protocols such as O2PL, Optimistic Concurrency Control, and Caching Two-Phase Locking. The simulation experiments show that PCP performs as well as or better than the other schemes, with low overhead.
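
    A minimal sketch of commit-time re-ordering for stale readers; the conflict test below is a simplification for illustration, not PCP's actual validation rule:

    ```python
    def try_commit(txn, committed_writes):
        """committed_writes: items overwritten since txn began."""
        stale_reads = txn["read_set"] & committed_writes
        if not stale_reads:
            return "commit"               # nothing stale: commit as-is
        if stale_reads & txn["write_set"]:
            return "abort"                # read-modify-write conflict
        return "commit (re-ordered)"      # serialize before the writers

    txn = {"read_set": {"x"}, "write_set": {"y"}}
    print(try_commit(txn, committed_writes={"x"}))  # commit (re-ordered)
    ```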

  • A Scheduling Policy for Blocked Programs in Multiprogrammed Shared-Memory Multiprocessors

    Inbum JUNG  Jongwoong HYUN  Joonwon LEE  

     
    PAPER-Software Systems

    Vol: E83-D No:9  Page(s): 1762-1771

    Shared-memory multiprocessors are frequently used as compute servers with multiple parallel programs executing at the same time. In such environments, the operating system switches contexts among multiple processes. When it does so, in addition to the cost of saving the context of the process being swapped out and bringing in the context of the new process to be run, the cache performance of processors can also be affected. The blocked algorithm improves cache performance by increasing the locality of memory references. In a blocked program, performance can be significantly affected by the reuse of a block loaded into cache memory. If frequent context switching replaces the block before it is completely reused, the cache locality of a blocked program cannot be successfully exploited. To address this problem, we propose a preemption-safe policy that exploits the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused within a program, but compensates for the monopolized processor time in the processor scheduling mechanism. Our simulation results show that when blocked programs run on multiprogrammed shared-memory multiprocessors, the proposed policy improves their performance due to a decrease in cache misses. It also has a beneficial impact on overall system performance due to enhanced processor utilization.
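
    A minimal sketch of the two halves of the policy, deferring preemption during block reuse and repaying the monopolized time; the bookkeeping is an illustrative assumption:

    ```python
    def on_timer_interrupt(proc, quantum=10):
        if proc["in_block"]:
            proc["overrun"] = proc.get("overrun", 0) + 1  # monopolized time
            return "keep_running"     # defer the switch until reuse ends
        # Compensate: shorten the next quantum by the overrun.
        proc["next_quantum"] = max(1, quantum - proc.pop("overrun", 0))
        return "switch"

    proc = {"in_block": True}
    print(on_timer_interrupt(proc))                        # keep_running
    proc["in_block"] = False
    print(on_timer_interrupt(proc), proc["next_quantum"])  # switch 9
    ```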

  • Evaluation of Compulsory Miss Ratio for Address Cache and Replacement Policies for Restoring Packet Reachability

    Masaki AIDA  Noriyuki TAKAHASHI  Michiyo MATSUDA  

     
    PAPER-Fiber-Optic Transmission

    Vol: E83-B No:7  Page(s): 1400-1408

    In high-speed data networks, it is important for a router to execute high-speed address resolution for packets, and an address cache is effective for this purpose. For HTTP accesses, it has been shown that the Dual Zipfian Model can describe the distribution of destination IP addresses, which enabled us to derive the cache miss ratio in the steady state, i.e., the cache miss ratio when the cache is full. However, when a system is initialized or the network topology changes, the address cache has no address information or holds invalid address information. This paper derives the compulsory miss ratio, the cache miss ratio when the cache has no address entries. In addition, we discuss replacement policies for cache entries that quickly restore packet reachability when the cache holds unreachable-address information.
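
    A minimal sketch of an address cache with LRU replacement; flushing on a topology change forces exactly the compulsory misses the paper analyzes. The structure is illustrative, not a router's implementation:

    ```python
    from collections import OrderedDict

    class AddressCache:
        def __init__(self, capacity):
            self.capacity, self.entries = capacity, OrderedDict()

        def lookup(self, dst_ip):
            if dst_ip in self.entries:
                self.entries.move_to_end(dst_ip)  # refresh LRU position
                return self.entries[dst_ip]
            return None                           # miss: full resolution

        def insert(self, dst_ip, next_hop):
            self.entries[dst_ip] = next_hop
            self.entries.move_to_end(dst_ip)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the LRU entry

        def flush(self):
            self.entries.clear()  # topology change: all misses compulsory

    c = AddressCache(2)
    c.insert("10.0.0.1", "if0")
    print(c.lookup("10.0.0.1"), c.lookup("10.0.0.2"))  # if0 None
    ```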

  • Duplicated Hash Routing: A Robust Algorithm for a Distributed WWW Cache System

    Eiji KAWAI  Kadohito OSUGA  Ken-ichi CHINEN  Suguru YAMAGUCHI  

     
    PAPER

    Vol: E83-D No:5  Page(s): 1039-1047

    Hash routing is an algorithm for distributed WWW caching systems that achieves a high hit rate by preventing objects from being duplicated across caches. However, one of its drawbacks is a lack of robustness against failure. As the WWW has become a vital service on the Internet, the fault tolerance of the systems that provide it has become important. In this paper, we propose a duplicated hash routing algorithm, an extension of hash routing. Our algorithm introduces minimal redundancy to maintain system performance when some caching nodes crash. In addition, we optionally allow each node to cache objects requested by its local clients (local caching), which may waste system cache capacity but can cut down the network traffic between caching nodes. We evaluate various aspects of system performance, such as hit rates, error rates, and network traffic, through simulation and compare them with those of other algorithms. The results show that our algorithm achieves both high fault tolerance and high performance with low system overhead.
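
    A minimal sketch of the duplication idea: each URL maps to a primary node and a backup node, so a single crash loses neither routing nor the only copy. The hash-plus-one backup rule is an assumption here:

    ```python
    import hashlib

    NODES = ["cache0", "cache1", "cache2", "cache3"]

    def assigned_nodes(url):
        h = int(hashlib.md5(url.encode()).hexdigest(), 16)
        primary = h % len(NODES)
        backup = (primary + 1) % len(NODES)  # assumed backup placement
        return NODES[primary], NODES[backup]

    def route(url, alive):
        primary, backup = assigned_nodes(url)
        if primary in alive:
            return primary
        return backup if backup in alive else None

    print(route("http://example.com/x", alive={"cache1", "cache2", "cache3"}))
    ```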

  • Dynamically Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER-Computer System Element

    Vol: E83-D No:5  Page(s): 1048-1057

    This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the "dynamically variable line-size cache" (D-VLS cache). The D-VLS cache optimizes its line size according to the characteristics of the running program, and attempts to improve performance by appropriately exploiting the high on-chip memory bandwidth of merged DRAM/logic LSIs. In our evaluation, the average memory-access-time improvement achieved by a direct-mapped D-VLS cache is about 20% compared with a conventional direct-mapped cache with fixed 32-byte lines. This improvement is better than that of a conventional direct-mapped cache of twice the size.
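
    A minimal sketch of line-size adaptation driven by observed spatial locality; the counters and thresholds are illustrative assumptions, not the D-VLS hardware algorithm:

    ```python
    LINE_SIZES = [32, 64, 128]  # bytes; a whole line is replaced per cycle

    def adjust_line_size(idx, used_fraction):
        """Grow the line when most of it is used before eviction;
        shrink it when spatial locality is poor."""
        if used_fraction > 0.75 and idx < len(LINE_SIZES) - 1:
            return idx + 1
        if used_fraction < 0.25 and idx > 0:
            return idx - 1
        return idx

    idx = 1                                  # start with 64-byte lines
    idx = adjust_line_size(idx, used_fraction=0.9)
    print(LINE_SIZES[idx])                   # 128: good spatial locality
    ```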

  • Architecture of a VOD System with Proxy Servers

    Kyung-Ah AHN  Hoon CHOI  Won-Ok KIM  

     
    PAPER-Multimedia Systems

    Vol: E83-B No:4  Page(s): 850-857

    We present the architecture of a VOD system employing proxy servers. The proposed VOD system provides efficient and reliable VOD services and solves the problems of traditional centralized, hierarchical, or distributed VOD architectures. The proxy servers are placed between video servers and user systems. Each proxy server is a small video server that provides not only caching but also intelligence, such as VCR-like video stream control and navigation of other proxy/video servers to search for a selected video program. A VOD system with the proposed architecture can serve more users because it reduces the workload of the video servers and the network traffic. We provide a performance model of the system and analyze its service availability. The proposed architecture shows better performance and availability than the traditional VOD architectures.
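
    A minimal sketch of the navigation behavior described above: serve from the local cache, otherwise ask peer proxies, otherwise fall back to a video server. All names are illustrative:

    ```python
    def locate(video_id, local_cache, peers, origin="video-server"):
        if video_id in local_cache:
            return "local"
        for peer, peer_cache in peers.items():
            if video_id in peer_cache:
                return peer        # navigate to another proxy server
        return origin              # last resort: the central video server

    peers = {"proxy-B": {"movie-42"}}
    print(locate("movie-42", local_cache=set(), peers=peers))  # proxy-B
    ```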

  • A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

    Koji INOUE  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER

    Vol: E83-C No:2  Page(s): 186-194

    This paper proposes a new approach to achieving high performance and low energy consumption in set-associative caches. The cache, called the way-predicting set-associative cache, speculatively selects a single way that is likely to contain the data desired by the processor from the set designated by the memory address, before starting a normal cache access. By accessing only the single predicted way, instead of all the ways in the set, energy consumption can be reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared with a conventional set-associative cache.
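
    A minimal sketch of MRU-based way prediction for a 4-way cache, with one predicted way per set; the structure illustrates the idea rather than the paper's circuit:

    ```python
    NUM_SETS, NUM_WAYS = 4, 4
    mru_way = [0] * NUM_SETS                  # predicted way per set
    tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def access(set_idx, tag):
        way = mru_way[set_idx]
        if tags[set_idx][way] == tag:
            return "fast hit"                 # one way probed: low energy
        for w in range(NUM_WAYS):             # fallback: probe other ways
            if tags[set_idx][w] == tag:
                mru_way[set_idx] = w          # train on the hitting way
                return "slow hit"
        return "miss"

    tags[0][2] = 0xABC
    print(access(0, 0xABC), access(0, 0xABC))  # slow hit, then fast hit
    ```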

  • An Efficient Method of Eliminating Inclusion Overhead in Snoop-Based CC-NUMA Systems

    Hyo-Joong SUH  Seung Wha YOO  Chu Shik JHON  

     
    PAPER-Computer Systems

    Vol: E83-D No:2  Page(s): 159-167

    In a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system, memory transactions can be classified into two types: inter-node transactions and intra-node transactions. Because the latency of inter-node transactions is usually hundreds of times larger than that of intra-node transactions, it is important to reduce it. Although the remote cache in CC-NUMA systems improves inter-node latency by caching remote memory lines, the remote and processor caches of snoop-based CC-NUMA systems must maintain the multi-level cache inclusion property to simplify snooping. The inclusion property degrades cache performance for the following reasons. First, all remote memory lines in a processor cache must also be preserved in the remote cache of the same node. Second, a line replacement in the remote cache forces replacement of the same-address line in the processor caches, which does not comply with the processor caches' own replacement policy. In this paper, we propose an Access-list that renders the inclusion property unnecessary, and we evaluate the proposed system by program-driven simulation. The simulation results show that cache miss rates are reduced while the efficiency of snoop filtering remains similar to that of a system with the inclusion property. The performance of the proposed system improves by up to 1.28 times.
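
    A minimal sketch of snoop filtering with a per-node record of remote lines that the processor caches may hold, in place of the inclusion property; the details are assumptions, not the paper's Access-list design:

    ```python
    access_list = set()  # remote line addresses possibly cached locally

    def on_processor_fills_remote_line(addr):
        access_list.add(addr)

    def on_snoop(addr):
        if addr not in access_list:
            return "filtered"        # cannot be in any processor cache
        return "forward to processor caches"

    on_processor_fills_remote_line(0x100)
    print(on_snoop(0x100), on_snoop(0x200))  # forwarded, then filtered
    ```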

  • A 2-ns-Access, 285-MHz, Two-Port Cache Macro Using Double Global Bit-Line Pairs

    Kenichi OSADA  Hisayuki HIGUCHI  Koichiro ISHIBASHI  Naotaka HASHIMOTO  Kenji SHIOZAWA  

     
    PAPER-Electronic Circuits

    Vol: E83-C No:1  Page(s): 109-114

    We fabricated a 16-kB cache macro in a 0.35-µm quadruple-metal CMOS technology. The macro is a 285-MHz, two-port, 16-kB (512 × 256 b) cache with a 2-ns access time. This high speed is enabled by a hierarchical bit-line architecture that uses double global bit-line pairs (WGBs) and by a high-speed, timing-insensitive sense amplifier (ISA) that shortens the access time.

  • Fast Instruction Cache Simulation for Hardware/Software Co-Design

    Marcello LAJOLO  Luciano LAVAGNO  Alberto SANGIOVANNI-VINCENTELLI  

     
    PAPER

    Vol: E82-A No:11  Page(s): 2475-2484

    Cache memories are one of the main factors affecting software performance, and their use is becoming increasingly common even in embedded systems. Efficient analysis of the effects of parameter variations (cache size, degree of associativity, replacement policy, line size, etc.) is both an essential and a very time-consuming aspect of embedded system design, and its complexity increases when multi-tasking and real-time aspects must be considered. We propose a new simulation-based methodology, built on an approximate model of the cache and of the multi-tasking reactive software, that allows one to trade off smoothly between accuracy and simulation speed. In particular, we propose to model intra-task conflicts accurately but to approximate inter-task conflicts by considering only a finite number of previous task executions. The rationale for this choice is a common pattern in embedded systems, where a "normal" data flow results in regular intra-task behavior, interrupted from time to time by some urgent event that can pessimistically be assumed to disrupt the cache behavior. The approach is conservative, because re-execution of a task after a long time will always be treated as not in cache, and the simulation speed-up is considerable.
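
    A minimal sketch of the finite-history approximation for inter-task conflicts: a task is treated as warm in the cache only if it ran within the last K task executions. The window size K is an assumption:

    ```python
    from collections import deque

    K = 3                       # previous task executions to remember
    history = deque(maxlen=K)   # most recent task IDs, oldest dropped

    def run_task(task_id):
        warm = task_id in history  # else conservatively assume a cold cache
        history.append(task_id)
        return "warm" if warm else "cold"

    for t in ["A", "B", "A", "C", "D", "E", "A"]:
        print(t, run_task(t))
    # The final "A" is cold: too many other tasks ran in between.
    ```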

  • Path-Classified Trace Cache for Improving Hit Ratio in Wide-Issue Processors

    Jin-Hyuk YANG  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-Computer Hardware and Design

    Vol: E82-D No:10  Page(s): 1338-1343

    In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing path conflicts, and basic blocks are joined to reduce the hardware cost of implementing the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.
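
    A minimal sketch of path classification: folding branch outcomes into the trace-cache index so that different paths from the same start PC stop conflicting. The hash is an assumption, not the paper's scheme:

    ```python
    def trace_index(start_pc, branch_outcomes, index_bits=10):
        """branch_outcomes: taken/not-taken bits along the path."""
        path_bits = 0
        for taken in branch_outcomes:
            path_bits = (path_bits << 1) | taken
        return (start_pc ^ path_bits) & ((1 << index_bits) - 1)

    # Two different paths from the same start PC map to different entries:
    print(trace_index(0x4000, (1, 0, 1)), trace_index(0x4000, (0, 1, 1)))
    ```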

  • A Single Chip Multiprocessor Integrated with High Density DRAM

    Tadaaki YAMAUCHI  Lance HAMMOND  Oyekunle A. OLUKOTUN  Kazutami ARIMOTO  

     
    PAPER-Electronic Circuits

    Vol: E82-C No:8  Page(s): 1567-1577

    A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing memory latency and increasing memory bandwidth. In this paper we evaluate the performance of a single-chip multiprocessor integrated with DRAM, with the DRAM organized as on-chip main memory and as on-chip cache. We compare this architecture with a more conventional chip that has only an SRAM-based on-chip cache. The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating-point applications that are effectively parallelized and have large working sets. This performance difference is significantly larger than is possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which an independent process is assigned to every processor of the single-chip multiprocessor, the large bandwidth of on-chip DRAM handles inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by on-chip DRAM than a uniprocessor does.
