Hans VANDIERENDONCK Koen De BOSSCHERE
Embedded processors are used in numerous devices executing dedicated applications. This setting makes it worthwhile to optimize the processor for the application it executes, in order to increase its power efficiency. This paper proposes to enhance direct-mapped data caches with automatically tuned randomized set index functions to achieve that goal. We show how randomization functions can be automatically generated and compare them to traditional set-associative caches in terms of performance and energy consumption. A 16 kB randomized direct-mapped cache consumes 22% less energy than a 2-way set-associative cache, while it is less than 3% slower. When the randomization function is made configurable (i.e., it can be adapted to the program), the additional reduction of conflicts outweighs the added complexity of the hardware, provided there is a sufficient number of conflict misses.
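As an illustration of the kind of hardware this first paper proposes, here is a minimal sketch of an XOR-based set index function, a common family of cache randomization functions. The bit masks below are illustrative placeholders; the paper's functions are generated and tuned automatically per application.

```python
# Minimal sketch of an XOR-based randomized set index function: each set
# index bit is the parity (XOR) of a subset of block-address bits selected
# by a mask. The masks here are hypothetical, not the paper's tuned ones.

def xor_index(block_addr: int, masks: list) -> int:
    """Map a block address to a cache set by XOR-folding masked bits."""
    index = 0
    for bit, mask in enumerate(masks):
        parity = bin(block_addr & mask).count("1") & 1
        index |= parity << bit
    return index

# Example: an 8-set cache (3 index bits) over a 12-bit block address.
masks = [0x249, 0x492, 0x924]      # one hypothetical mask per index bit
print(xor_index(0x1A2, masks))     # set index of block address 0x1A2
```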
Masaaki KONDO Hiroshi NAKAMURA
In recent computer systems, a large portion of the energy is consumed by on-chip cache accesses and by data movement between the cache and off-chip main memory. Reducing this memory system energy is indispensable for future microprocessors, because power and thermal issues are certain to become a key factor limiting processor performance. In this paper, we discuss and evaluate how our architecture, called SCIMA, contributes to energy saving. SCIMA integrates software-controllable memory (SCM) into the processor chip. SCIMA can save total memory system energy by using the SCM with compiler support. The evaluation results reveal that SCIMA can reduce memory system energy by 5-50% while still being faster than a conventional cache-based architecture.
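To make the SCM idea concrete, the following back-of-envelope sketch contrasts serving a heavily reused data block from compiler-managed on-chip memory against repeatedly accessing it through a cache. The per-access energy constants and the cost model are illustrative assumptions, not SCIMA's actual parameters.

```python
# Hedged sketch of compiler-controlled on-chip memory (SCM): data with
# high reuse is explicitly copied on-chip once, then served from the
# cheap on-chip store. All energy constants below are hypothetical.

E_SCM, E_CACHE, E_DRAM = 1.0, 2.0, 50.0  # assumed energy units per access

def energy_with_scm(reuses: int, block_words: int) -> float:
    copy_in = block_words * (E_DRAM + E_SCM)   # one explicit transfer
    accesses = reuses * block_words * E_SCM    # all later accesses on-chip
    return copy_in + accesses

def energy_cache_only(reuses: int, block_words: int, miss_rate: float) -> float:
    hits = (1 - miss_rate) * reuses * block_words * E_CACHE
    misses = miss_rate * reuses * block_words * (E_CACHE + E_DRAM)
    return hits + misses

# Example: a 256-word block reused 100 times, 5% cache miss rate.
print(energy_with_scm(100, 256), energy_cache_only(100, 256, 0.05))
```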
Seiichiro TANI Toshiaki MIYAZAKI
Caching web files reduces user response time as well as network traffic. When implementing caches, the file caching problem must be addressed: the problem is how to determine which files should be evicted from a cache when there is insufficient space for storing a new file, so that the sum of the mis-hit (fault) file costs is minimized. Greedy-Dual-Size (GDS) is the best online algorithm in terms of competitiveness, i.e., k/(k-h+1)-competitive, where k and h are the storage space of, respectively, GDS and an optimal offline algorithm. GDS also performs very well in trace-driven simulations. The worst-case time taken to service a request is another important measure for online file caching algorithms, since slow response times render caching meaningless from the client's viewpoint. This paper proposes a fast randomized k/(k-h+1)-competitive algorithm that takes O(2 log* k) time per file eviction or insertion, whereas GDS takes O(log k) time; 2 log* k is a much more slowly increasing function than log k. To confirm its practicality, we conduct trace-driven simulations. Experimental results show that our algorithm attains only slightly worse byte hit rates and a sufficiently large latency reduction in comparison with GDS, making it a good candidate for caches requiring high-speed processing, such as second-level caches in large networks.
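For reference, the baseline GDS policy that this paper speeds up can be sketched as follows. GDS assigns each cached file f the priority H(f) = L + cost(f)/size(f) and evicts the file with the smallest H, setting the inflation value L to that priority. The sketch uses a plain binary heap, i.e., O(log k) per operation; the paper's randomized structure that brings this down to O(2 log* k) is not reproduced here.

```python
import heapq

# Minimal sketch of Greedy-Dual-Size (GDS). Assumes each file fits in the
# cache on its own; stale heap entries are skipped lazily on eviction.

class GDSCache:
    def __init__(self, capacity: int):
        self.capacity, self.used, self.L = capacity, 0, 0.0
        self.files = {}   # name -> (priority, size)
        self.heap = []    # (priority, name); may contain stale entries

    def access(self, name: str, size: int, cost: float) -> None:
        if name in self.files:                       # hit: re-insert below
            _, old_size = self.files.pop(name)
            self.used -= old_size
        while self.used + size > self.capacity and self.heap:
            prio, victim = heapq.heappop(self.heap)  # lowest-priority entry
            if victim in self.files and self.files[victim][0] == prio:
                self.L = prio                        # inflate L on eviction
                self.used -= self.files[victim][1]
                del self.files[victim]
        prio = self.L + cost / size                  # H(f) = L + cost/size
        self.files[name] = (prio, size)
        self.used += size
        heapq.heappush(self.heap, (prio, name))
```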
Buffer caching is an integral part of the operating system. In this paper, we propose a scheme that integrates buffer cache management and prefetching via cache partitioning. The scheme, which we call SA-W2R, is simple to implement, making it a feasible solution in real systems. In its basic form, it uses the LRU policy for buffer replacement. However, its modular design allows any replacement policy to be incorporated into the scheme. For prefetching, it uses the LRU-One Block Lookahead (LRU-OBL) approach, eliminating the extra burden that is generally necessary in other prefetching approaches. Implementation studies based on the GNU/Linux kernel version 2.2.14 show that SA-W2R performs better than the scheme currently used, with a maximum increase of 23% for the workloads considered.
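A minimal sketch of the LRU-OBL building block mentioned above: on a demand miss for block b, block b+1 is also brought in. SA-W2R's actual partitioning of the buffer cache between demand-fetched and prefetched blocks is more involved; the single LRU list below is a simplification for illustration only.

```python
from collections import OrderedDict

# Hedged sketch of LRU with One Block Lookahead (OBL) prefetching.

class LRUWithOBL:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()   # block number -> None, in LRU order

    def _insert(self, block: int) -> None:
        self.cache[block] = None
        self.cache.move_to_end(block)            # mark most recently used
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict LRU block

    def read(self, block: int) -> bool:
        hit = block in self.cache
        self._insert(block)
        if not hit:
            self._insert(block + 1)              # one-block lookahead
        return hit
```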
Caching of frequently accessed data has been shown to be a useful technique for reducing congestion on the narrow bandwidth of wireless channels. However, traditional client/server strategies for supporting transactional cache consistency, which require extensive communication between a client and a server, are not appropriate in a wireless mobile database. This paper proposes two simple but effective transactional cache consistency protocols for mobile read-only transactions, built on broadcast-based solutions to the cache invalidation problem. The novelty of our approach is that the consistency check on accessed data and the commitment protocol are implemented in a truly distributed fashion, as an integral part of the cache invalidation process. The applicability of the proposed techniques is also examined by an analytical study.
SungHun NAM IlYoung CHUNG SungHo CHO ChongSun HWANG
Stateless cache invalidation schemes for wireless environments can be categorized as either asynchronous or synchronous, according to how invalidation reports are broadcast. However, if an asynchronous cache invalidation scheme attempts to support local processing of read-only transactions, a critical problem may occur: the asynchronous invalidation reports provide no guarantee on the waiting time for mobile transactions requesting commit. To solve this problem, the server in our approach broadcasts two kinds of messages: asynchronous invalidation reports to reduce transaction latency, and periodic guide messages to remove the uncertainty in the waiting time for the next invalidation report. This paper presents a simulation-based analysis of the performance of the suggested algorithms. The simulation experiments show that the local processing algorithms for read-only transactions based on the asynchronous cache invalidation scheme achieve better response times than the algorithm based on the synchronous scheme.
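A rough sketch of the two message kinds, seen from the mobile client. The message formats, the commit rule, and the guide-message contents are all illustrative assumptions made to show how periodic guide messages bound the client's wait; they are not the paper's exact protocol.

```python
import time

# Hedged client-side sketch: invalidation reports arrive asynchronously on
# updates; guide messages arrive every GUIDE_PERIOD seconds (assumed) so
# the wait for a commit decision is bounded even when no updates occur.

GUIDE_PERIOD = 10.0  # seconds between guide messages (assumed)

def wait_for_commit_decision(read_set, listen):
    """listen() is assumed to block and return the next broadcast message:
    ('invalidate', {updated ids}) or ('guide', next_report_time)."""
    deadline = time.time() + 2 * GUIDE_PERIOD    # initial bound (assumed)
    while time.time() < deadline:
        kind, payload = listen()
        if kind == 'invalidate':
            if payload & read_set:
                return 'abort'                   # a read item was updated
            return 'commit'                      # read set survived the report
        if kind == 'guide':
            deadline = payload + GUIDE_PERIOD    # refresh the wait bound
    return 'abort'                               # channel silent too long
```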
A fast, low-power 16-bit adder, a 32-word register file, and a 512-bit cache SRAM have been developed using 0.25-µm GaAs HEMT technology for future multi-GHz processors. The 16-bit adder, which uses a negative-logic binary look-ahead carry structure based on NOR gates, operates at a maximum clock frequency of 1.67 GHz and consumes 134.4 mW at a supply voltage of 0.6 V. The active area is 1.6 mm2 and there are about 1,230 FETs. A new DC/DC level converter has been developed for use in high-speed, low-power storage circuits such as SRAMs and register files. The level converter can increase the DC voltage supplied to an active-load circuit on request, or supply a minimal DC voltage to a load circuit in stand-by mode. The power dissipation (P) of the 32-word register file with on-chip DC/DC level converters is 459 mW, a reduction to 25.2% of that of an equivalent conventional register file, while the operating frequency (fc) is 5.17 GHz, 74.8% of fc for the conventional register file. P for the 512-bit cache SRAM with the new DC/DC level converters is 34.3 mW, 89.7% of the value for an equivalent conventional cache SRAM, with a read-access time of 455 ps, only 1.1% longer than that of the conventional cache SRAM.
Koji INOUE Vasily G. MOSHNYAGA Kazuaki MURAKAMI
One uncompromising requirement of portable computing is energy efficiency, because it directly affects battery life. At the same time, portable computing will target more demanding applications, such as moving pictures, so higher performance is still required. Cache memories are among the most important components of computer systems. In this paper, we briefly survey architectural techniques for high-performance, low-power cache memories.
Koji INOUE Vasily G. MOSHNYAGA Kazuaki MURAKAMI
In this paper, we propose a novel architecture for low-power direct-mapped instruction caches, called the "history-based tag-comparison (HBTC) cache." The cache attempts to reuse tag-comparison results in order to avoid unnecessary tag checks. Execution footprints are recorded in an extended BTB (Branch Target Buffer). In our evaluation, we observe that the energy for tag comparison can be reduced by more than 90% in many applications.
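The intuition can be sketched as follows: if control flow re-enters a code region whose instructions are known, from recorded footprints, to still reside in the cache, the tag comparison can be skipped and only the data array read. The one-bit footprint per entry below is a simplification for illustration, not the paper's exact mechanism.

```python
# Hedged behavioral sketch of the history-based tag-comparison idea.

class HBTCModel:
    def __init__(self):
        self.footprint = {}   # branch target -> True if region still cached
        self.tag_checks = 0

    def fetch_region(self, target: int, lines_replaced_since: bool) -> None:
        if lines_replaced_since:
            self.footprint.clear()          # a replacement may invalidate all
        if self.footprint.get(target):      # history says: still cached
            pass                            # skip tag check, read data only
        else:
            self.tag_checks += 1            # normal access with tag check
            self.footprint[target] = True   # record execution footprint
```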
Hiroyoshi MIWA Kazunori KUMAGAI Shinya NOGAMI Takeo ABE Hisao YAMAMOTO
The explosive growth of World Wide Web usage is causing a number of performance problems, including slow response times, network congestion, and denial of service. A Web site that receives a huge number of accesses and requires a high quality of service, such as a site offering hosting or content delivery services, usually uses a cache server to reduce the load on the original server offering the original content. To increase the throughput of the caching process and to improve service availability, multiple cache servers are often positioned in front of the original server. This requires a switch to direct incoming requests to one of the multiple cache servers. In this paper, we propose a routing algorithm for such a switch in front of clustered multiple cache servers and evaluate its performance by simulation. The results show that our routing algorithm is effective when content has request locality and a short period of validity, for example news, map data, road traffic data, or weather information. We also identify points to consider when the proposed algorithm is applied to a real system.
Vasily G. MOSHNYAGA Hiroshi TSUJI
Due to their large capacitance and enormous access rate, caches dissipate about a third of the total energy consumed by today's processors. In this paper we present a new architectural technique to reduce energy consumption in caches. Unlike previous approaches, which have focused on lowering cache capacitance and the number of accesses, our method exploits a new degree of freedom in cache design, namely the voltage per access. Since in modern block-buffered caches the load capacitance switched on a block hit is much smaller than the capacitance switched on a miss, the given clock cycle time is inefficiently utilized during a hit. We propose to trade off this unused time against the supply voltage, lowering the voltage level on a hit and raising it on a miss. Experiments show that the approach can halve the cache energy dissipation without large performance and area overheads.
Jeong-Joon LEE Kyu-Young WHANG Yang-Sae MOON Eui-Kyung HONG
Web caching has become an important problem when addressing performance issues in Web applications. The expiration time of a Web data item is a useful piece of information for performance enhancement in Web caching. In this paper, we introduce the notion of the effective reference probability, which incorporates the effect of expiration time into Web caching. For a formal approach, we propose the continuous independent reference model, extending the existing independent reference model. Based on this model, we formally define the effective reference probability and derive it theoretically. By simply replacing the reference probability in existing cache replacement algorithms with the effective reference probability, we can take the effect of expiration time into account. The results of performance experiments show that replacement algorithms using the effective reference probability always outperform existing ones. In particular, when the cache fraction is 0.05 and data updates are comparatively frequent (i.e., the update frequency is more than 1/10 of the reference frequency), performance is enhanced by more than 30% in LRU-2 and 13% in Aggarwal's method. The results show that the effective reference probability significantly enhances the performance of Web caching when the expiration time is given.
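To illustrate the flavor of such a quantity (not the paper's actual derivation): if references and updates to an object are modeled as independent Poisson processes with rates ref_rate and upd_rate, the next reference arrives before the next update, i.e., the cached copy is still fresh when used, with probability ref_rate / (ref_rate + upd_rate). Weighting the plain reference probability by this factor demotes frequently updated objects, which is the effect the effective reference probability is designed to capture.

```python
# Hedged sketch: fold expiration into a replacement priority under an
# assumed Poisson model. This illustrates the idea only; the paper derives
# its effective reference probability from its own continuous model.

def effective_reference_probability(ref_rate: float, upd_rate: float) -> float:
    p_fresh_at_next_ref = ref_rate / (ref_rate + upd_rate)
    return ref_rate * p_fresh_at_next_ref   # demote often-updated objects

class Obj:
    def __init__(self, name, ref, upd):
        self.name, self.ref, self.upd = name, ref, upd

# "b" is referenced less and updated often, so it is evicted first.
objs = [Obj("a", 1.0, 0.1), Obj("b", 0.8, 2.0)]
victim = min(objs, key=lambda o: effective_reference_probability(o.ref, o.upd))
print(victim.name)  # -> "b"
```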
Seiichiro TANI Toshiaki MIYAZAKI
Network caches reduce network traffic as well as user response time. When implementing network caches, the object replacement problem is one of the core problems: it is to determine which objects should be evicted from a cache when there is insufficient space. This paper first formalizes the problem and gives a simple but sufficient condition for deterministic online algorithms to be competitive. Based on this condition, a general framework for making a non-competitive algorithm competitive is constructed. As an application of the framework, an online algorithm called Competitive_SIZE is proposed. Both event-driven and trace-driven simulations show that Competitive_SIZE is better than previously proposed algorithms such as LRU (Least Recently Used).
Hyokyung BAHN Yong Hyeon SHIN Kern KOH
A new Web cache sharing scheme is presented. Our method reduces duplicated copies of the same objects in globally shared Web caches, while still allowing the hot working set of each local cache to be duplicated. Experimental results show that the proposed scheme outperforms existing sharing schemes.
Sun-Mo KIM Jung-Woo LEE Soo-Haeng LEE Sang-Bang CHOI
Cache memories are small, fast memories used to temporarily hold the contents of main memory that are likely to be referenced by processors, so as to reduce instruction and data access time. Most previous studies of cache performance have employed simulation-based methods. However, such studies cannot precisely explain the obtained results. Moreover, when a new processor is designed, huge simulations must be performed again with several different parameters. This research classifies cache structures for superscalar processors into four types and then presents an analytical model of the instruction fetch process for each cache type, considering architectural parameters such as the frequency of branch instructions in the program, cache miss rate, cache miss penalty, branch misprediction frequency, and branch misprediction penalty. To validate the proposed models, we performed extensive simulations and compared the results with the analytical models. The simulation results showed that the proposed model can estimate the expected instruction fetch rate accurately, within 10% error in most cases. This paper shows that an increase in cache misses reduces the instruction fetch rate more severely than an increase in branch mispredictions does. The model also provides the exact relationship between cache misses and branch mispredictions for instruction fetch analysis, and can explain causes of performance degradation that cannot be uncovered by simulation alone.
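The general shape of such an analytical model can be sketched with the standard stall-cycle decomposition below. The paper's actual equations differ per cache type; this sketch only makes the listed parameters concrete, and all variable names are illustrative.

```python
# Hedged sketch: expected instruction fetch rate for a W-wide front end,
# using the textbook decomposition into a base cycle plus expected stalls.

def fetch_rate(width, miss_rate, miss_penalty, mispred_rate, mispred_penalty):
    stall = miss_rate * miss_penalty + mispred_rate * mispred_penalty
    return width / (1.0 + stall)     # instructions fetched per cycle

# Example: 4-wide fetch, 5% miss rate with a 10-cycle penalty, and
# 2% mispredictions with a 6-cycle penalty.
print(fetch_rate(4, 0.05, 10, 0.02, 6))  # ~2.47 instructions per cycle
```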
Jooyong KIM Hyokyung BAHN Kern KOH
Caching at the Web proxy server plays an important role in reducing the response time, the network traffic, and the load of Web servers. Many recent studies have proposed and examined replacement and consistency policies for the proxy cache, which play a central role in the performance of caching components. For better performance, they exploit various metadata of Web objects, such as reference count, reference time, and modification time information from past behavior, to estimate the re-reference likelihood and freshness of the objects. However, all such policies known to the authors use the metadata only when the actual object is in the cache. We observed from various proxy traces that about 20-30% of clients' requests incurred only validity checks of cached objects, without transferring the actual objects from the proxy server. In this case, only the metadata are necessary at the proxy server. This paper proposes a proxy cache consistency policy that uses the metadata even for absent objects: the time information of objects evicted from the cache and of objects known only through header-only replies from Web servers. Trace-driven simulations with public proxy cache traces show that our policy significantly reduces the response time and the number of connections to Web servers.
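A minimal sketch of the idea of keeping validation metadata for absent objects: when an object is evicted, its modification-time metadata is retained so that a later If-Modified-Since request can still be answered without contacting the origin server. The class structure and field names are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: a proxy that retains Last-Modified metadata for objects
# no longer in the cache, so validity checks can be served from metadata.

class MetadataAwareProxy:
    def __init__(self):
        self.objects = {}    # url -> (body, last_modified)
        self.metadata = {}   # url -> last_modified, kept after eviction

    def evict(self, url: str) -> None:
        obj = self.objects.pop(url, None)
        if obj is not None:
            self.metadata[url] = obj[1]    # keep metadata for absent object

    def validate(self, url: str, client_lm: float) -> str:
        if url in self.objects:
            lm = self.objects[url][1]
        else:
            lm = self.metadata.get(url)    # works even for absent objects
        if lm is not None and client_lm >= lm:
            return "304 Not Modified"      # answered from metadata alone
        return "forward to origin server"  # need a fresh copy or check
```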
Hans Jurgen MATTAUSCH Koji KISHI Takayuki GYOHTEN
The recent trend towards highly parallel on-chip data processing, as, e.g., in single-chip processors capable of executing multiple instructions in parallel, leads to a requirement for on-chip data storage with high random-access bandwidth, parallel access capability, and large capacity. The first two requirements call for the application of multi-ported memories. However, the conventional architecture, based on multi-port storage cells for each bit, cannot efficiently realize a large storage capacity, because the cell area explodes due to a quadratic increase with port number (N). A promising method for obtaining area efficiency is to increase the size of the smallest unit with N-port capability, e.g. by introducing N-port capability at the level of blocks of 1-port cells rather than for each cell. We report a quantitative analysis of this method for the SRAM case, based on design data in a 0.5 µm, 2-metal CMOS technology. Achievable area reductions in comparison to the conventional architecture are found to be enormous and to accelerate as a function of N. Reduction factors to areas < 1/2, < 1/5, < 1/14 and < 1/30 are estimated for 4, 8, 16 and 32 ports, respectively. Since the demerit of the proposed approach is an increased access-rejection probability, a trade-off between area reduction and allowable access-rejection probability is always necessary in practical applications. This is discussed for the application of multi-port cache memories.
Junichi FUNASAKA Keizo SAISHO Akira FUKUDA
Since NetNews traffic is increasing, keeping all articles is becoming a serious problem in terms of wasted network bandwidth and disk usage. In addition, users do not read all incoming articles. We have proposed several caching algorithms to overcome this problem and have shown that a selective prefetch scheme gives the best system performance among them. However, since the selective prefetch scheme employed a simple selection policy, it gave a low hit ratio in some cases. This paper therefore aims to improve the selective prefetch scheme in terms of disk usage as well as hit ratio. We divide the scheme into three factors: the reference span, the criterion, and the threshold used in the criterion. Through simulation experiments using actual NetNews logs, we investigate the influence of the reference span and the threshold on system performance. The results show that the reference span is a more significant factor than the threshold, and that the selective prefetch scheme with a reference span of around seven days keeps the hit ratio high while reducing disk usage.
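A minimal sketch of the selective prefetch decision these three factors parameterize: prefetch a newsgroup's articles only if a criterion computed over the recent reference span exceeds a threshold. The concrete criterion used below (read count per group) is an illustrative assumption; the paper studies the factors themselves.

```python
from collections import defaultdict

# Hedged sketch of threshold-based selective prefetching over a reference
# span. The criterion and threshold value are illustrative assumptions.

REFERENCE_SPAN_DAYS = 7   # the span the paper finds works well
THRESHOLD = 5             # assumed threshold on the criterion

def groups_to_prefetch(read_log, now):
    """read_log: iterable of (timestamp_in_days, newsgroup) read events."""
    counts = defaultdict(int)
    for t, group in read_log:
        if now - t <= REFERENCE_SPAN_DAYS:   # only recent references count
            counts[group] += 1
    return {g for g, c in counts.items() if c >= THRESHOLD}
```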
Koji INOUE Koji KAI Kazuaki MURAKAMI
This paper proposes an on-chip memory-path architecture employing a dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only the on-chip DRAM subarrays corresponding to the replaced cache-line size produces a significant energy reduction. In our simulation, the proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (energy-delay) product by more than 75% over a conventional memory-path model.
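The core decision in a variable line-size cache, choosing the next replacement line size from observed spatial locality, can be sketched as follows. The locality estimator (counting how many sublines of the victim line were actually used) is an illustrative stand-in for the paper's determination logic, and the size set is an assumption.

```python
# Hedged sketch of dynamically variable line-size selection: grow the line
# when spatial locality is good, shrink it when it is poor.

LINE_SIZES = (32, 64, 128)   # bytes; a whole line moves in one DRAM cycle

def next_line_size(current: int, sublines_used: int, sublines_total: int) -> int:
    i = LINE_SIZES.index(current)
    if sublines_used == sublines_total and i + 1 < len(LINE_SIZES):
        return LINE_SIZES[i + 1]   # good spatial locality: grow the line
    if sublines_used <= 1 and i > 0:
        return LINE_SIZES[i - 1]   # poor locality: shrink to cut evictions
    return current                 # otherwise keep the current size
```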
SungHo CHO Jeong-Hyon HWANG Kyoung Yul BAE Chong-Sun HWANG
In Optimistic Two-Phase Locking (O2PL), when a transaction requests a commit, the transaction cannot be committed until all requested locks are obtained. For this reason, O2PL leads to unnecessary waits and operations even though it adopts an optimistic approach. This paper suggests an efficient optimistic cache consistency protocol that provides serializability of committed transactions. Our cache consistency scheme, called PCP (Preemptive Cache Protocol), decides whether to commit or abort without waiting when transactions request commits. In PCP, some transactions that read stale data items need not be aborted, because the protocol adopts a re-ordering scheme to enhance performance. In addition, for re-ordering, PCP stores only one version of each data item. This paper presents a simulation-based analysis of the performance of PCP against other protocols such as O2PL, Optimistic Concurrency Control, and Caching Two-Phase Locking. The simulation experiments show that PCP performs as well as or better than the other schemes, with low overhead.