Hyokyung BAHN Yong Hyeon SHIN Kern KOH
A new Web cache sharing scheme is presented. Our method reduces the number of duplicate copies of the same objects across the globally shared Web caches, while still allowing the hot working set of each local cache to be duplicated. Experimental results show that the proposed scheme outperforms existing sharing schemes.
Sun-Mo KIM Jung-Woo LEE Soo-Haeng LEE Sang-Bang CHOI
Cache memories are small, fast memories used to temporarily hold the contents of main memory that are likely to be referenced by processors, so as to reduce instruction and data access time. In studies of cache performance, most previous work has employed simulation-based methods. However, such approaches cannot precisely explain the obtained results. Moreover, when a new processor is designed, huge simulations must be repeated with several different parameters. This research classifies cache structures for superscalar processors into four types and presents an analytical model of the instruction fetch process for each cache type, considering various architectural parameters such as the frequency of branch instructions in a program, cache miss rate, cache miss penalty, branch misprediction frequency, and branch misprediction penalty. To prove the correctness of the proposed models, we performed extensive simulations and compared the results with the analytical models. Simulation results showed that the proposed model can estimate the expected instruction fetch rate accurately, within 10% error in most cases. This paper shows that an increase in cache misses reduces the instruction fetch rate more severely than an increase in branch mispredictions does. The model is also able to provide the exact relationship between cache misses and branch mispredictions for the instruction fetch analysis. The proposed model can explain causes of performance degradation that cannot be uncovered by the simulation method alone.
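For illustration only, a first-order analytical estimate of this kind (not the authors' model; the parameter names and the simple additive-penalty form below are our assumptions) could be written as:

    # Rough first-order estimate of the expected instruction fetch rate (IPC)
    # for a superscalar front end; illustrative only, not the paper's model.
    def expected_fetch_rate(fetch_width, miss_rate, miss_penalty,
                            branch_freq, mispredict_rate, mispredict_penalty):
        # miss_rate: I-cache misses per fetched instruction
        # branch_freq: branches per instruction
        # mispredict_rate: mispredictions per branch
        cycles_per_insn = (1.0 / fetch_width
                           + miss_rate * miss_penalty
                           + branch_freq * mispredict_rate * mispredict_penalty)
        return 1.0 / cycles_per_insn

    # Example: 4-wide fetch, 2% miss rate with a 20-cycle miss penalty,
    # 20% branches with a 5% misprediction rate and a 10-cycle penalty.
    print(expected_fetch_rate(4, 0.02, 20, 0.20, 0.05, 10))   # about 1.33 IPC

Even in this simplified form, the cache-miss term dominates the misprediction term for typical parameter values, which is consistent with the paper's observation about the relative impact of the two.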
Jooyong KIM Hyokyung BAHN Kern KOH
Caching at the Web proxy server plays an important role in reducing the response time, the network traffic, and the load on Web servers. Many recent studies have proposed and examined replacement and consistency policies for the proxy cache, which play a central role in the performance of caching components. For better performance, they exploit various metadata of Web objects, such as reference count, reference time, and modification time from past behavior, to estimate the re-reference likelihood and freshness of the objects. However, all such policies known to the authors use the metadata only while the actual object is in the cache. We observed from various proxy traces that about 20-30% of clients' requests incurred only validity checks of cached objects, without transferring the actual objects from the proxy server. In this case, only the metadata are needed at the proxy server. This paper proposes a proxy cache consistency policy that uses the metadata even for absent objects, including the time information of objects evicted from the cache and of objects described by header-only replies from Web servers. Trace-driven simulations with public proxy cache traces show that our policy significantly reduces the response time and the number of connections to Web servers.
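A minimal sketch of the idea of keeping metadata for absent objects (the fields, freshness rule, and function names below are our own illustration, not the exact policy of the paper):

    import time

    # Metadata kept per URL even after the object body has been evicted
    # (illustrative fields; the paper's policy may track different ones).
    metadata = {}   # url -> {'last_modified': float, 'validated_at': float}
    bodies   = {}   # url -> bytes, only for objects whose body is cached

    FRESH_FOR = 300.0   # assumed freshness lifetime in seconds

    def handle_conditional_get(url, if_modified_since):
        """Answer a client revalidation request from metadata when possible."""
        meta = metadata.get(url)
        now = time.time()
        if meta and now - meta['validated_at'] < FRESH_FOR:
            # Metadata alone is enough: no connection to the Web server is
            # needed, even if the object body is no longer in the cache.
            if meta['last_modified'] <= if_modified_since:
                return '304 Not Modified'
        # Otherwise revalidate with the origin server (header-only reply);
        # its result refreshes the metadata as well.
        return 'revalidate with origin server'

    def evict(url):
        """Evict the body to reclaim space but keep the metadata."""
        bodies.pop(url, None)     # metadata[url] is intentionally kept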
David Chee Kheong SIEW Gang FENG
The problem of finding a minimum-cost multicast tree (Steiner tree) is known to be NP-complete, and heuristic-based algorithms that achieve good performance on this problem are usually time-consuming. In this paper, we propose a new strategy called tree-caching for efficient multicast connection setup in connection-oriented networks. In this scheme, tree topologies that have been computed are cached in a database at the source nodes. This can reduce the connection establishment time for subsequent connection requests that share some multicast members, by efficiently reusing cached trees without having to re-run a multicast routing algorithm for the whole group. The method provides an efficient way to eliminate, whenever possible, the expensive tree computation that has to be performed when setting up a multicast connection. We first formulate the tree-caching problem and then propose a tree-caching algorithm that reduces the complexity of tree computation when a new connection is to be established. Through simulations, we find that the proposed tree-caching strategy performs very well and can significantly reduce the computational complexity of setting up multicast connections.
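To give the flavor of such a scheme (the matching and grafting rules below are simplified assumptions for illustration, not the authors' algorithm), a cached tree with the largest member overlap can be reused and the remaining members grafted on by shortest paths instead of re-running the Steiner heuristic:

    from collections import deque

    # graph: node -> set of neighbors (undirected, unweighted for brevity)
    # tree_cache: list of (frozenset_of_members, set_of_tree_edges)

    def bfs_path(graph, sources, target):
        """Shortest path from any node in `sources` to `target`."""
        parent = {s: None for s in sources}
        queue = deque(sources)
        while queue:
            u = queue.popleft()
            if u == target:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path
            for v in graph[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None

    def build_tree(graph, members, tree_cache, steiner_heuristic):
        """Reuse the best-matching cached tree; fall back to the heuristic."""
        best = max(tree_cache, key=lambda t: len(t[0] & members), default=None)
        if best is None or not (best[0] & members):
            return steiner_heuristic(graph, members)   # no useful cached tree
        cached_members, cached_edges = best
        edges = set(cached_edges)
        tree_nodes = {n for e in edges for n in e} | set(cached_members)
        for m in members - tree_nodes:                 # graft missing members
            path = bfs_path(graph, tree_nodes, m)
            if not path:
                continue                               # unreachable member
            for a, b in zip(path, path[1:]):
                edges.add((a, b))
                tree_nodes.update((a, b))
        return edges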
In this paper, CAC (Call Admission Control) algorithms whose decisions are based on the total received power on the reverse link in CDMA systems with multiple types of services are proposed. The proposed CAC schemes have the advantage of controlling the cell load dynamically and guaranteeing the required SIR (signal-to-interference ratio), that is, the QoS (quality of service). The optimal ratio of required transmission powers that achieves the maximum capacity is derived for evaluating the cell load and the SIR reflected in the total received power at base stations. Different thresholds can be allocated to different services, which can be regarded as a priority-based scheme. Two types of services, voice and data, are considered, and the maximum throughput is obtained when the threshold values are the same. In addition, a transmitter removal scheme is adopted that eliminates calls when the system does not meet the required SIR. Simulation results are provided to evaluate the performance.
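For illustration only, a highly simplified power-based admission check of this kind (the textbook-style load/noise-rise relation and all numbers below are our assumptions, not the paper's derivation) could look like:

    # Simplified power-based call admission for a CDMA reverse link, using the
    # standard single-cell approximation P_total = P_noise / (1 - load).
    P_NOISE = 1.0                                   # background noise (normalized)
    LOAD_PER_CALL = {'voice': 0.01, 'data': 0.04}   # assumed load increments
    POWER_THRESHOLD = {'voice': 10.0 * P_NOISE,     # per-class admission
                       'data':   8.0 * P_NOISE}     # thresholds (assumed)

    def admit(service, current_load):
        """Admit only if the projected total received power stays below the
        threshold assigned to this service class."""
        new_load = current_load + LOAD_PER_CALL[service]
        if new_load >= 1.0:
            return False                            # pole capacity exceeded
        projected_power = P_NOISE / (1.0 - new_load)
        return projected_power <= POWER_THRESHOLD[service]

    print(admit('data', 0.80))    # admission decision for a data call at 80% load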
Junichi FUNASAKA Keizo SAISHO Akira FUKUDA
Since the traffic of NetNews is increasing, keeping all articles becomes a serious problem in terms of wasted network bandwidth and disk usage. In addition, users do not read all incoming articles. We have proposed several caching algorithms to overcome this problem and shown that a selective prefetch scheme gives the best system performance among them. However, since the selective prefetch scheme employed a simple selection policy, it gave a low hit ratio in some cases. This paper therefore aims to improve the selective prefetch scheme in terms of disk usage as well as hit ratio. We divide the scheme into three factors: reference span, criterion, and threshold within the criterion. Through simulation experiments using actual NetNews logs, we investigate the influence of the reference span and the threshold on system performance. As a result, it is shown that the reference span is a more significant factor than the threshold, and that the selective prefetch scheme with a reference span of about seven days keeps a high hit ratio and reduces disk usage.
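A minimal sketch of such a selective prefetch decision (the reference-count criterion, the threshold value, and the per-newsgroup bookkeeping are illustrative assumptions drawn from the abstract, not the exact scheme):

    import time

    SECONDS_PER_DAY = 86400
    REFERENCE_SPAN = 7 * SECONDS_PER_DAY    # look back about seven days
    THRESHOLD = 5                           # minimum references within the span

    # history: newsgroup -> list of read timestamps observed at this server
    def should_prefetch(newsgroup, history, now=None):
        """Prefetch new articles of a group only if it was read often enough
        within the reference span (a simple reference-count criterion)."""
        now = now or time.time()
        recent = [t for t in history.get(newsgroup, [])
                  if now - t <= REFERENCE_SPAN]
        return len(recent) >= THRESHOLD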
Hans Jurgen MATTAUSCH Koji KISHI Takayuki GYOHTEN
The recent trend towards highly parallel on-chip data processing, e.g. in single-chip processors capable of executing multiple instructions in parallel, leads to a requirement for on-chip data storage with high random-access bandwidth, parallel access capability, and large capacity. The first two requirements call for the application of multi-ported memories. However, the conventional architecture, based on multi-port storage cells for each bit, cannot efficiently realize a large storage capacity, because the cell area explodes due to a quadratic increase with the port number (N). A promising method for obtaining area efficiency is to increase the size of the smallest unit with N-port capability, e.g. by introducing N-port capability at the level of blocks of 1-port cells rather than for each cell. We report a quantitative analysis of this method for the SRAM case, based on design data in a 0.5 µm, 2-metal CMOS technology. Achievable area reductions in comparison to the conventional architecture are found to be enormous and to accelerate as a function of N. Reductions to areas of < 1/2, < 1/5, < 1/14, and < 1/30 are estimated for 4, 8, 16, and 32 ports, respectively. Since the demerit of the proposed approach is an increased access-rejection probability, a trade-off between area reduction and allowable access-rejection probability is always necessary in practical applications. This is discussed for the application of multi-port cache memories.
Koji INOUE Koji KAI Kazuaki MURAKAMI
This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to the replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that the proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (energy-delay) product by more than 75% over a conventional memory-path model.
SungHo CHO Jeong-Hyon HWANG Kyoung Yul BAE Chong-Sun HWANG
In Optimistic Two-Phase Locking (O2PL), when a transaction requests a commit, it cannot be committed until all requested locks are obtained. For this reason, O2PL leads to unnecessary waits and operations even though it adopts an optimistic approach. This paper suggests an efficient optimistic cache consistency protocol that provides serializability of committed transactions. Our cache consistency scheme, called PCP (Preemptive Cache Protocol), decides whether to commit or abort without waiting when a transaction requests a commit. In PCP, some transactions that have read stale data items need not be aborted, because PCP adopts a re-ordering scheme to enhance performance. In addition, for re-ordering, PCP stores only one version of each data item. This paper presents a simulation-based analysis of the performance of PCP against other protocols such as O2PL, Optimistic Concurrency Control, and Caching Two-Phase Locking. The simulation experiments show that PCP performs as well as or better than the other schemes, with low overhead.
Inbum JUNG Jongwoong HYUN Joonwon LEE
Shared-memory multiprocessors are frequently used as compute servers with multiple parallel programs executing at the same time. In such environments, the operating system switches the contexts of multiple processes. When it does so, in addition to the cost of saving the context of the process being swapped out and bringing in the context of the new process to be run, the cache performance of the processors can also be affected. Blocked algorithms improve cache performance by increasing the locality of memory references. In a blocked program, performance can be significantly affected by the reuse of a block loaded into the cache memory. If frequent context switching replaces the block before it is completely reused, the cache locality of a blocked program cannot be successfully exploited. To address this problem, we propose a preemption-safe policy to exploit the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused within a program, but also compensates for the monopolized processor time in the processor scheduling mechanism. Our simulation results show that when blocked programs run on multiprogrammed shared-memory multiprocessors, the proposed policy improves their performance due to a decrease in cache misses. In such situations, it also has a beneficial impact on overall system performance due to enhanced processor utilization.
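Blocked (tiled) matrix multiplication is the canonical example of such a program; the sketch below (block size chosen arbitrarily) shows the block reuse that a preemption in the middle of a tile would destroy:

    # Blocked matrix multiplication: each B-by-B tile of the operands is loaded
    # into the cache once and reused many times before being replaced.  A
    # context switch in the middle of a tile can evict it before reuse completes.
    def blocked_matmul(A, B_mat, n, B=32):
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, B):
            for jj in range(0, n, B):
                for kk in range(0, n, B):
                    # Work entirely within one tile so the cached data is reused.
                    for i in range(ii, min(ii + B, n)):
                        for k in range(kk, min(kk + B, n)):
                            a = A[i][k]
                            for j in range(jj, min(jj + B, n)):
                                C[i][j] += a * B_mat[k][j]
        return C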
Masaki AIDA Noriyuki TAKAHASHI Michiyo MATSUDA
In high-speed data networks, it is important to perform high-speed address resolution for packets at a router, and an address cache is effective for this purpose. For HTTP accesses, it has been discussed in previous work that the Dual Zipfian Model can describe the distribution of destination IP addresses, which enabled us to derive the cache miss ratio in the steady state, i.e., the cache miss ratio when the cache has full entries. However, when a system is initialized or the network topology is changed, the address cache has no address information or holds invalid address information. This paper derives the compulsory miss ratio, which is the cache miss ratio when the cache has no address entries. In addition, we discuss replacement policies for cache entries, aimed at fast recovery of packet reachability when the cache holds information on unreachable addresses.
Eiji KAWAI Kadohito OSUGA Ken-ichi CHINEN Suguru YAMAGUCHI
Hash routing is an algorithm for distributed WWW caching systems that achieves a high hit rate by preventing objects from being duplicated across caches. However, one of its drawbacks is a lack of robustness against failure. Because the WWW has become a vital service on the Internet, the fault tolerance of systems providing WWW service has become important. In this paper, we propose a duplicated hash routing algorithm, an extension of hash routing. Our algorithm introduces minimal redundancy to maintain system performance when some caching nodes crash. In addition, we optionally allow each node to cache objects requested by its local clients (local caching), which may waste some of the system's cache capacity but can cut down the network traffic between caching nodes. We evaluate various aspects of system performance, such as hit rates, error rates, and network traffic, by simulation and compare them with those of other algorithms. The results show that our algorithm achieves both high fault tolerance and high performance with low system overhead.
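A minimal sketch of the routing decision (using highest-random-weight hashing and a single backup replica; these details are our assumptions and not necessarily those of the proposed algorithm):

    import hashlib

    def score(node, url):
        """Deterministic per-(node, url) weight for highest-random-weight hashing."""
        return hashlib.md5((node + url).encode()).hexdigest()

    def pick_nodes(url, nodes, copies=2):
        """Rank caching nodes for this URL; the top `copies` act as replicas."""
        return sorted(nodes, key=lambda n: score(n, url), reverse=True)[:copies]

    def route(url, nodes, alive):
        """Send the request to the highest-ranked replica that is still alive."""
        for node in pick_nodes(url, nodes, copies=2):
            if node in alive:
                return node
        return None   # all replicas down: fetch directly from the origin server

    nodes = ['cache-a', 'cache-b', 'cache-c', 'cache-d']
    print(route('http://example.com/index.html', nodes,
                alive=set(nodes) - {'cache-a'}))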
Koji INOUE Koji KAI Kazuaki MURAKAMI
This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, called the "dynamically variable line-size cache" (D-VLS cache). The D-VLS cache can optimize its line size according to the characteristics of programs, and attempts to improve performance by appropriately exploiting the high on-chip memory bandwidth of merged DRAM/logic LSIs. In our evaluation, it is observed that the average memory-access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This improvement is better than that of a conventional direct-mapped cache of double the size.
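A rough sketch of the line-size selection idea (the subline bookkeeping and the grow/shrink rule below are simplified assumptions for illustration, not the mechanism described in the paper):

    # Per cache frame, the large DRAM line is divided into sublines.  Reference
    # flags record which sublines were actually used; on replacement, the next
    # line size for this frame grows if the whole line was used and shrinks if
    # only one subline was.
    LINE_SIZES = [32, 64, 128]          # selectable line sizes in bytes (assumed)
    SUBLINE = 32                        # smallest transfer unit

    class Frame:
        def __init__(self):
            self.size_idx = 0           # start with the smallest line size
            self.referenced = set()     # sublines touched since the last refill

        def touch(self, subline_index):
            self.referenced.add(subline_index)

        def on_replace(self):
            used = len(self.referenced)
            sublines_in_line = LINE_SIZES[self.size_idx] // SUBLINE
            if used == sublines_in_line and self.size_idx < len(LINE_SIZES) - 1:
                self.size_idx += 1      # good spatial locality: use a larger line
            elif used <= 1 and self.size_idx > 0:
                self.size_idx -= 1      # poor spatial locality: use a smaller line
            self.referenced.clear()
            return LINE_SIZES[self.size_idx]   # size to fetch from on-chip DRAM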
Kyung-Ah AHN Hoon CHOI Won-Ok KIM
We present an architecture for a VOD system employing proxy servers. The proposed VOD system provides efficient and reliable VOD services and solves the problems of traditional VOD systems with centralized, hierarchical, or distributed architectures. The proxy servers are placed between video servers and user systems. A proxy server is a small video server that has not only a caching function but also intelligence, such as VCR-like video stream control and the ability to navigate other proxy/video servers to search for a selected video program. Using a VOD system with the proposed architecture, VOD services can be provided to more users because it reduces the workload of the video servers and the network traffic. We provide a performance model of the system, and service availability is also analyzed. The proposed architecture shows better performance and availability than the traditional VOD architectures.
Koji INOUE Tohru ISHIHARA Kazuaki MURAKAMI
This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called the way-predicting set-associative cache, speculatively selects a single way that is likely to contain the data desired by the processor from the set designated by a memory address, before it starts a normal cache access. By accessing only the single predicted way, instead of accessing all the ways in a set, energy consumption can be reduced. For the way-predicting cache to perform well, the accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
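A small behavioral sketch of the access sequence (structure and bookkeeping simplified; the per-set MRU table and the "ways activated" energy proxy are assumptions consistent with an MRU scheme, not the paper's design):

    # Way-predicting set-associative cache: probe only the MRU way first and
    # fall back to the remaining ways on a prediction miss.
    class WayPredictingCache:
        def __init__(self, num_sets, ways):
            self.ways = ways
            self.tags = [[None] * ways for _ in range(num_sets)]
            self.mru  = [0] * num_sets          # predicted way per set
            self.ways_activated = 0             # rough energy proxy
            self.probe_rounds = 0               # rough latency proxy

        def access(self, set_index, tag):
            predicted = self.mru[set_index]
            self.ways_activated += 1
            self.probe_rounds += 1
            if self.tags[set_index][predicted] == tag:
                return 'hit (first probe)'
            # Prediction missed: activate the remaining ways in a second probe.
            self.ways_activated += self.ways - 1
            self.probe_rounds += 1
            for way in range(self.ways):
                if way != predicted and self.tags[set_index][way] == tag:
                    self.mru[set_index] = way   # update the MRU prediction
                    return 'hit (second probe)'
            victim = predicted                  # simplistic replacement choice
            self.tags[set_index][victim] = tag
            self.mru[set_index] = victim
            return 'miss'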
Hyo-Joong SUH Seung Wha YOO Chu Shik JHON
In a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system, memory transactions can be classified into two types: inter-node transactions and intra-node transactions. Because the latency of inter-node transactions is usually hundreds of times larger than that of intra-node transactions, it is important to reduce it. Even though the remote cache in CC-NUMA systems improves the latency of inter-node transactions by caching remote memory lines, the remote and processor caches of snoop-based CC-NUMA systems have to maintain the multi-level cache inclusion property to simplify snooping. The inclusion property degrades cache performance for the following reasons. First, all the remote memory lines in a processor cache must be preserved in the remote cache of the same node. Second, a line replacement in the remote cache replaces the line with the same address in the processor caches, which does not comply with the replacement policy of the processor caches. In this paper, we propose an Access-list that renders the inclusion property unnecessary, and we evaluate the performance of the proposed system by program-driven simulation. The simulation results show that the cache miss rates are reduced and that the efficiency of snoop filtering is similar to that of a system with the inclusion property. The performance of the proposed system is improved by up to 1.28 times.
Kenichi OSADA Hisayuki HIGUCHI Koichiro ISHIBASHI Naotaka HASHIMOTO Kenji SHIOZAWA
We fabricated a 16-kB cache macro using 0.35-µm quadruple-metal CMOS technology. This is a 285-MHz, two-port 16-kB (512 × 256 b) cache macro that has a 2-ns access time. This high-speed performance is enabled by a hierarchical bit-line architecture that uses double global bit-line pairs (WGBs), and a high-speed timing-insensitive sense amplifier (ISA) that shortens the access time.
Marcello LAJOLO Luciano LAVAGNO Alberto SANGIOVANNI-VINCENTELLI
Cache memories are one of the main factors that affect software performance, and their use is becoming increasingly common even in embedded systems. Efficient analysis of the effects of parameter variations (cache size, degree of associativity, replacement policy, line size, ...) is both an essential and a very time-consuming aspect of embedded system design, whose complexity increases when multi-tasking and real-time aspects must be considered. We propose a new simulation-based methodology, based on an approximate model of the cache and of the multi-tasking reactive software, that allows one to trade off smoothly between accuracy and simulation speed. In particular, we propose to consider intra-task conflicts accurately, but to approximate inter-task conflicts by considering only a finite number of previous task executions. The rationale for this choice lies in a common pattern in embedded systems, where a "normal" data flow results in a regular intra-task flow, interrupted from time to time by some urgent event that can pessimistically be considered to disrupt the cache behavior. The approach is conservative, because re-execution of a task after a long time is always treated as not in the cache, and the simulation speed-up is considerable.
Jin-Hyuk YANG In-Cheol PARK Chong-Min KYUNG
In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing path conflicts, and basic blocks are joined to reduce the hardware cost needed to implement the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.
Tadaaki YAMAUCHI Lance HAMMOND Oyekunle A. OLUKOTUN Kazutami ARIMOTO
A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing memory latency and improving memory bandwidth. In this paper, we evaluate the performance of a single-chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip that has only an SRAM-based on-chip cache. The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating-point applications that are effectively parallelized and have large working sets. This performance difference is significantly better than that possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor in a single-chip multiprocessor, the large bandwidth of on-chip DRAM handles inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by on-chip DRAM than a uniprocessor.