Koh JOHGUCHI Hans Jurgen MATTAUSCH Tetsushi KOIDE Tetsuo HIRONAKA
The presented unified data/instruction cache design uses multiple banks and features 4 ports, a distributed crossbar, different word lengths for data and instruction ports, interleaved cache-line words, and synchronous access with hidden precharge. A 20.5 KByte storage capacity is integrated in a 5-metal-layer CMOS logic technology with 200 nm minimum gate length, and a 3.4 ns access-cycle time is achieved. The access bandwidth corresponds to 10 ports with standard word length, while the cost in increased Si area is only 25% in comparison to a 1-port cache.
Hakjoo LEE Jonghyun SUH Sungwon JUNG
In mobile computing environments, cache invalidation techniques are widely used. However, these techniques require a large invalidation report and show low cache utilization under a high server update rate. In this paper, we propose a new cache-level cache invalidation technique called TTCI (Timestamp Tree-based Cache Invalidation technique) to overcome these two problems. TTCI also supports selective tuning for cache-level cache invalidation. Our experiments show that the technique requires a much smaller cache invalidation report and improves cache utilization.
Zhou SU Masato OGURO Jiro KATTO Yasuhiko YASUDA
A content delivery network improves end-user performance by replicating Web contents on a group of geographically distributed sites interconnected over the Internet. However, as content distribution systems become able to manage dynamically changing files, an important issue to be resolved is consistency management, which means that the cached replicas on different sites must be updated if the originals change. In this paper, based on an analytical formulation of object freshness, web access distribution and network topology, we derive a novel algorithm as follows: (1) For a given content which has been changed on its original server, only a limited number of its replicas, instead of all replicas, are updated. (2) After a replica has been selected for update, the latest version is sent from an algorithm-decided site instead of from its original server. Simulation results verify that the proposed algorithm provides better consistency management than conventional methods, with a reduced old hit ratio and less network traffic.
Mariko SAKAMOTO Akira KATSUNO Go SUGIZAKI Toshio YOSHIDA Aiichiro INOUE Koji INOUE Kazuaki MURAKAMI
Broadcast and synchronization techniques are used for cache coherence control in conventional larger-scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs gives us an opportunity to achieve higher performance. Because the penalty of synchronization is incurred whenever a cache miss occurs, the coherence method must be improved to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits DSM enterprise server systems. Originally, the directory-based method was adopted in high performance computing systems because of its much greater scalability in comparison with the snoop-based method. Although directory capacity misses and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers, together with advanced LSI technology, helps us solve these problems. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.
K.L. LAM K.F. TSANG Y.T. SUN H.Y. TUNG K.T. KO L.T. LEE
An adaptive tri-threshold dynamic call admission control scheme for wideband mobile cellular networks is proposed. The relationships of the Channel Utilization and the Weighted Handover Dropping Probability to the traffic load are investigated. This scheme supports voice, data and multimedia services with differentiated QoS.
Jeong-Ho KIM Joo-Young YANG MinYoung CHUNG
A downlink call admission control (CAC) scheme, which we call the combined CAC scheme, is proposed to reduce call blocking in the third-generation (3G) W-CDMA system. The blocking probability of attempted calls increases if a large number of users try to place calls in a hot spot area. In this situation, the proposed scheme can mitigate the degradation of the GoS (grade of service) of multiple services by using the available radio channels of the neighboring cells, and it enhances the channel capacity by about 20% at a blocking probability of 1% under the given example condition of voice service.
Yun TANG Lifeng SUN Jianguang LUO Shiqiang YANG Yuzhuo ZHONG
In recent years, the inherent effectiveness of Peer-to-Peer (P2P) networks has been advocated to address scalability issues in large-scale Internet-based on-demand streaming services. Most existing works adopt the Cache-and-Relay (CR) scheme to exploit a cooperative paradigm among peers. In this paper, we present a practical evaluation study of the scalability of the CR scheme, taking into account more than 20,000,000 collected real traces. Based on trace-driven simulations, we conclude that the CR scheme is not as effective as previously reported in terms of saving server bandwidth.
Gi-Ho PARK Kil-Whan LEE Tack-Don HAN Shin-Dug KIM
This paper presents a dual data cache system structure, called a cooperative cache system, that is designed as a low power cache structure for embedded processors. The cooperative cache system consists of two caches, i.e., a direct-mapped temporal oriented cache (TOC) and a four-way set-associative spatial oriented cache (SOC). The cooperative cache system achieves improvement in performance and reduction in power consumption by virtue of the structural characteristics of the two caches designed inherently to help each other. An evaluation chip of an embedded processor having the cooperative cache system is manufactured by Samsung Electronics Co. with 0.25 µm 4-metal process technology.
Junhee KIM Sung-Soo LIM Jihong KIM
Cache performance optimization is an important design consideration in building high-performance embedded processors. Unlike general-purpose microprocessors, embedded processors can take advantage of application-specific information in optimizing the cache performance. One such example is to use modified cache index bits (instead of the conventional index bits) based on memory access traces from key target embedded applications so that the number of conflict misses can be reduced. In this paper, we present a novel fine-grained cache reconfiguration technique which allows an intra-program reconfiguration of cache index bits, thus better reflecting the changing characteristics of a program execution. The proposed technique, called dynamic reconfiguration of index bits (DRIB), dynamically changes cache index bits at the function level. This compiler-directed and fine-grained approach allows each function to be executed using its own optimal index bits with no additional hardware support. In order to avoid potential performance degradation caused by frequent cache invalidations from reconfiguring cache index bits, we describe an efficient algorithm for selecting target functions whose cache index bits are reconfigured. Our algorithm ensures that the number of cache misses removed by DRIB exceeds the number of cache misses added by cache invalidations. We also propose a new cache architecture, the Two-Level Indexing (TLI) cache, which further reduces the number of conflict misses by intelligently dividing the indexing step into two stages. Our experimental results show that the DRIB approach combined with the TLI cache reduces the number of cache misses by 35% over the conventional cache indexing technique.
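The effect of choosing different index bits on conflict misses can be illustrated with a minimal sketch. It assumes a direct-mapped cache and a trace of block addresses; the function name count_misses and its parameters are hypothetical and are not taken from the DRIB paper, and the sketch covers neither the function-selection algorithm nor the TLI cache.

```python
def count_misses(addresses, index_bits, offset_bits=0):
    """Count misses in a direct-mapped cache whose set index is built
    from the given address bit positions (an illustrative model only)."""
    lines = {}   # set index -> block address currently stored there
    misses = 0
    for addr in addresses:
        block = addr >> offset_bits
        index = 0
        # Gather the selected bits of the block address into the index.
        for pos, bit in enumerate(index_bits):
            index |= ((block >> bit) & 1) << pos
        if lines.get(index) != block:
            misses += 1
            lines[index] = block
    return misses


if __name__ == "__main__":
    trace = [0, 4] * 4                      # two blocks accessed alternately
    print(count_misses(trace, [0, 1]))      # low-order index bits
    print(count_misses(trace, [2, 3]))      # alternative index bits
```

With the low-order bits, blocks 0 and 4 map to the same set and evict each other on every access (8 misses); with bits 2 and 3 they map to different sets and only the 2 cold misses remain.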
In this paper, we propose CRadix sort, a new string sorting algorithm based on MSD radix sort. CRadix sort causes fewer cache misses than MSD radix sort by uniquely associating a small block of main memory, called the key buffer, with each key and temporarily storing a portion of each key in its corresponding key buffer. Experimental running-time comparisons with other string sorting algorithms are provided to show the effectiveness of CRadix sort.
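For orientation, the baseline that CRadix sort builds on can be sketched as follows. This is a plain MSD radix sort for strings, not CRadix sort itself: the key buffer mechanism is not modeled, and the function name msd_radix_sort is a hypothetical choice for illustration.

```python
def msd_radix_sort(keys, depth=0):
    """Sort strings by distributing them on the character at `depth`,
    then recursing into each bucket (most significant digit first)."""
    if len(keys) <= 1:
        return list(keys)
    exhausted = []   # keys that end exactly at this depth (all equal)
    buckets = {}     # character at `depth` -> keys sharing it
    for key in keys:
        if len(key) == depth:
            exhausted.append(key)
        else:
            buckets.setdefault(key[depth], []).append(key)
    result = exhausted
    for ch in sorted(buckets):
        result.extend(msd_radix_sort(buckets[ch], depth + 1))
    return result


if __name__ == "__main__":
    print(msd_radix_sort(["banana", "apple", "app", "band", ""]))
```

Each recursion level reads one more character of every key, which is where scattered key accesses can cause cache misses in the baseline algorithm.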
While web proxy caching is a widely deployed technique, the performance of a proxy cache is limited by the local storage. Some studies have addressed this limitation by using the residual resources of clients via a p2p method and have achieved a very high hit rate. However, these approaches treat web objects as homogeneous objects and there is no consideration of various web characteristics. Consequently, the byte hit rate of the system is limited, external bandwidth is wasted, and perceived user latency is increased. The present paper suggests an efficient p2p based web caching technique that manages objects with different policies so as to exploit the characteristics of web objects, such as size and temporal locality. Small objects are stored alone whereas large objects are stored by dividing them into numerous small blocks, which are distributed in clients. On a proxy cache, header blocks of large objects take the place of objects themselves and smaller objects are cached. This technique increases the hit rate. Unlike a web cache, which evicts large objects as soon as possible in the case where clients fulfill the role of backup storage, large objects are given higher priority than small objects in the proposed approach. This maximizes the effect of hits for large objects and thereby increases the byte hit rate. Furthermore, we construct simple latency models for various p2p based web caching systems and analyze the effects of the proposed policies on these systems. We then examine the performances of the efficient policies via a trace driven simulation. The results demonstrate that the proposed techniques effectively enhance web cache performance, including hit rate, byte hit rate, and response time.
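The size-based placement policy described above can be sketched roughly as follows. The threshold, the block size, the round-robin assignment of blocks to clients, and the name place_object are all assumptions made for illustration; the abstract does not specify them.

```python
def place_object(size, small_limit=8192, block_size=4096,
                 clients=("c0", "c1", "c2")):
    """Decide where the pieces of an object of `size` bytes are stored.

    Small objects stay whole on the proxy. Large objects keep only a
    header block on the proxy; the remaining blocks go to clients
    (round-robin here, which is an assumption)."""
    if size <= small_limit:
        return {"proxy": [("whole", size)], "clients": {}}
    n_blocks = -(-size // block_size)   # ceiling division
    placement = {"proxy": [("header", block_size)], "clients": {}}
    for i in range(1, n_blocks):
        client = clients[(i - 1) % len(clients)]
        length = min(block_size, size - i * block_size)
        placement["clients"].setdefault(client, []).append((i, length))
    return placement


if __name__ == "__main__":
    print(place_object(1000))
    print(place_object(10000))
```

Keeping only the header block of a large object on the proxy leaves room for more small objects there, which is the mechanism the abstract credits for the higher hit rate.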
Providing data availability in a high performance computing environment is very important, especially in this data-intensive world. Most clusters are either equipped with RAID (Redundant Array of Independent Disks) devices or use redundant nodes to protect data from loss. However, neither of these can really solve the reliability problem incurred in a striped file system. Striping provides an efficient way to increase I/O throughput in both the distributed and parallel paradigms, but it also reduces the overall reliability of a disk system N-fold, where N is the number of independent disks in the system. Parallel Virtual File System (PVFS) is an open source parallel file system which has been widely used in the Linux environment. Its striping structure is good for performance but provides no fault tolerance. We implement Reliable Parallel File System (RPFS), which is based on PVFS but adds reliability support. Our quantitative analysis shows that the MTTF (Mean Time To Failure) of RPFS is better than that of PVFS. In addition, we propose a parity cache table (PCT) to alleviate the penalty of parity updating. The evaluation of RPFS shows that its read performance is almost the same as that of PVFS (2% to 13% degradation). As for write performance, a 28% to 45% improvement can be achieved depending on the behavior of the operations.
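How parity protects a stripe can be shown with a minimal sketch. It assumes RAID-style XOR parity over equally sized blocks; the abstract does not state which parity code RPFS uses, and the helper name xor_blocks is hypothetical. The parity cache table is not modeled.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (assumed XOR parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    stripe = [b"abcd", b"efgh", b"ijkl"]   # data blocks on three disks
    parity = xor_blocks(stripe)            # stored on a fourth disk

    # If the disk holding stripe[1] fails, XOR-ing the surviving data
    # blocks with the parity block reconstructs the lost block.
    recovered = xor_blocks([stripe[0], stripe[2], parity])
    print(recovered == stripe[1])
```

Every write to a data block also changes the parity block, which is the update penalty that a parity cache is meant to reduce.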
The present paper proposes a novel cache architecture, called SCache, to detect buffer overflow attacks at run time. In addition, we evaluate the energy-security efficiency of the proposed architecture. On a return-address store, SCache generates one or more copies of the return address value and saves them as read only in the cache area. The number of copies generated strongly affects both energy consumption and vulnerability. When the return address is loaded (or popped), the cache compares the value loaded from the memory stack with the corresponding copy existing in the cache. If they are not the same, then return-address corruption has occurred. In the present study, the proposed approach is shown to protect more than 99.5% of return-address loads from the threat of buffer overflow attacks, while increasing the total cache-energy consumption by, at worst, approximately 23%, compared to a well-known low-power cache. Furthermore, we explore the tradeoff between energy consumption and security, and our experimental results show that an energy-aware SCache model provides relatively higher security with only a 10% increase in energy consumption.
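The store-and-compare check described above can be modeled in software with a minimal sketch. The class name ReturnAddressGuard, its methods, and the choice to report an unchecked load as None are assumptions for illustration; SCache itself is a hardware cache, and replica eviction and energy behavior are not modeled here.

```python
class ReturnAddressGuard:
    """Software model of keeping read-only replicas of return addresses
    and comparing them when the address is loaded back."""

    def __init__(self, copies=1):
        self.copies = copies   # number of replicas per return address
        self.replicas = {}     # stack slot -> list of replica values

    def store(self, stack_slot, return_address):
        # On a return-address store, keep one or more replicas.
        self.replicas[stack_slot] = [return_address] * self.copies

    def load(self, stack_slot, value_from_stack):
        """Return True if the loaded value matches every replica,
        False if corruption is detected, None if no replica exists."""
        saved = self.replicas.get(stack_slot)
        if saved is None:
            return None
        return all(value == value_from_stack for value in saved)


if __name__ == "__main__":
    guard = ReturnAddressGuard(copies=2)
    guard.store(0x7FFC, 0x401000)
    print(guard.load(0x7FFC, 0x401000))     # value unchanged on the stack
    print(guard.load(0x7FFC, 0x41414141))   # value overwritten on the stack
```

A mismatch between the value read from the stack and its replica signals that the return address was overwritten after it was stored.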
Michitaka OKUNO Shinji NISHIMURA Shin-ichi ISHIDA Hiroaki NISHI
A novel cache-based network processor (NP) architecture that can keep up with next-generation 100-Gbps packet-processing throughput by exploiting the nature of network traffic is proposed, and a prototype is evaluated with real network traffic traces. This architecture consists of several small processing units (PUs) and bit-stream manipulation hardware called a burst-stream path (BSP), which has a special cache mechanism called a process-learning cache (PLC) and a cache-miss handler (CMH). The PLC memorizes a packet-processing method together with all table-lookup results, and applies it to subsequent packets that have the same information in their headers. To avoid blocking of packet processing, the CMH handles cache-miss packets while registration processing is performed at the PLC. The combination of the PLC and CMH enables most packets to skip execution at the PUs, which dissipate a large amount of power in conventional NPs. We evaluated an FPGA-based prototype with real core network traffic traces of a WIDE backbone router. In the experimental results, we observed a special case where packets of minimum size appeared in large quantities, and the cache-based NP was able to achieve 100% throughput with only 10%-throughput PUs owing to the very high temporal locality of network traffic. Overall, the results indicate that the cache-based NP would be able to achieve 100-Gbps throughput by using 10- to 40-Gbps throughput PUs. The power consumption of the cache-based NP, which consists of 40-Gbps throughput PUs, is estimated to be only 44.7% of that of a conventional NP.
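The idea of memorizing a processing result per header can be sketched in software as follows. The class name ProcessLearningCache, the slow_path callback standing in for a processing unit, and the use of a tuple as the header key are assumptions for illustration; the cache-miss handler and the hardware pipeline are not modeled.

```python
class ProcessLearningCache:
    """Software model: remember the processing result for a header and
    reuse it for later packets that carry the same header fields."""

    def __init__(self, slow_path):
        self.slow_path = slow_path   # full processing, e.g. table lookups
        self.learned = {}            # header -> memorized result
        self.hits = 0
        self.misses = 0

    def process(self, header, payload):
        if header in self.learned:
            self.hits += 1           # packet skips the slow path
        else:
            self.misses += 1
            self.learned[header] = self.slow_path(header)
        return self.learned[header], payload


if __name__ == "__main__":
    def lookup_port(header):
        # Hypothetical stand-in for routing-table lookups on a PU.
        return header[1] % 4

    plc = ProcessLearningCache(lookup_port)
    for hdr in [(1, 6), (1, 6), (2, 7), (1, 6)]:
        plc.process(hdr, b"data")
    print(plc.hits, plc.misses)
```

The higher the temporal locality of headers in the traffic, the larger the share of packets that bypass the slow path.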
Seunglak CHOI Jinwon LEE Su Myeon KIM Junehwa SONG Yoon-Joon LEE
Most commercial Web sites dynamically generate their contents through a three-tier server architecture composed of a Web server, an application server, and a database server. In such an architecture, the database server easily becomes a bottleneck to the overall performance. In this paper, we propose WDBAccel, a high-performance database server accelerator that significantly improves the throughput of database processing. WDBAccel eliminates costly, complex query processing needed to obtain query results by reusing the results from previous queries for subsequent queries. This differentiates WDBAccel from other database cache systems, which employ traditional query processing. WDBAccel further improves its performance by fully utilizing main memory as the primary storage. This paper presents the design and implementation of the WDBAccel as well as the results of performance evaluation with a prototype.
Thepparit BANDITWATTANAWONG Soichiro HIDAKA Hironori WASHIZAKI Katsumi MARUYAMA
Object caching is a common feature in scalable distributed object systems. Fine-grained replication optimizes the performance and resource utilization in object caching by enabling a remote object-oriented application to be replicated partially, incrementally and on demand in units of clusters. Despite these benefits, the lack of a common and simple implementation framework has kept the fine-grained replication scheme from being widely used. This paper proposes novel frameworks for dynamic, transparent, partial and automatically incremental replication of distributed Java objects based on three techniques: lazy-object creation, proxy and hook. One framework enables the fine-grained replication of a server-side stateful in-memory application, and the other enables the fine-grained replication of a server-side stateless in-memory application, a client-side program, or a standalone application. The experimental evaluation demonstrates that the efficiency of both frameworks, in terms of response time, is practical relative to that of a local method invocation.
Shiquan PIAO Jaewon PARK Yongwan PARK
In this letter, a more exact analysis scheme for outage probability is proposed for the uplink of direct sequence code division multiple access (DS-CDMA) systems. In previous works, the effect of call admission control (CAC) on the signal to interference ratio (SIR) is considered when evaluating the outage probability of CDMA systems; however, the effect of CAC on system states is not accurately considered. In this letter, we first analyze the system states more exactly by taking the effect of CAC on CDMA system states into account. Then, the exact outage probability is derived according to the exact system states. The system state probabilities and the outage probability of the proposed scheme are compared with the results of the traditional analysis schemes and of computer simulation. Compared with those of the traditional analysis schemes, the numerical results of the proposed analysis scheme are closer to the computer simulation results.
CheolHong KIM SungWoo CHUNG ChuShik JHON
Energy efficiency of cache memories is crucial in designing embedded processors. Reducing energy consumption in the instruction cache is especially important, since the instruction cache consumes a significant portion of total processor energy. This paper proposes a new instruction cache architecture, named Partitioned Instruction Cache (PI-Cache), for reducing dynamic energy consumption in the instruction cache by partitioning it into smaller (less power-consuming) sub-caches. When the proposed PI-Cache is accessed, only one sub-cache is accessed by utilizing the temporal/spatial locality of applications. The other sub-caches are not accessed, leading to dynamic energy reduction. The PI-Cache also reduces dynamic energy consumption by eliminating the energy consumed in tag lookup and comparison. Moreover, the performance gap between the conventional instruction cache and the proposed PI-Cache becomes small when the physical cache access time is considered. We evaluated the energy efficiency by running a cycle-accurate simulator, SimpleScalar, with power parameters obtained from CACTI. Simulation results show that the PI-Cache improves the energy-delay product by 20%-54% compared to the conventional direct-mapped instruction cache.
Shigeaki TAGASHIRA Syuhei SHIRAKAWA Satoshi FUJITA
Content-Addressable Network (CAN) provides a mechanism for retrieving objects in a P2P network by maintaining indices to those objects in a fully decentralized manner. In the CAN system, index caching is a useful technique for reducing the response time of retrieving objects. The key points of effective caching techniques are to improve the cache hit ratio by actively sharing caches distributed over the P2P network with every node and to reduce the maintenance and/or routing overhead for locating the cache of a requested index. In this paper, we propose a new caching technique based on the notion of proxy-type caching techniques, which have been widely used in WWW systems. It can achieve active cache sharing by incorporating the concept of proxy caching into the index access mechanism and can locate a closer proxy cache of a requested index with little routing overhead. From the simulation results, we conclude that it can improve the response time of retrieving indices by 30% compared with conventional caching techniques.
Krishna KANT Amit SAHOO Nrupal JANI
Given the availability of high-speed Ethernet and hardware-based protocol offload, clustered systems using a commodity network fabric (e.g., TCP/IP over Ethernet) are expected to become more attractive for a range of e-business and data center applications. In this paper, we describe a comprehensive simulation to study the performance of clustered database systems using such a fabric. The simulation model currently supports both TCP and SCTP as the transport protocol and models an Oracle 9i-like clustered DBMS running a TPC-C-like workload. The model can be used to study a wide variety of issues regarding the performance of clustered DBMS systems, including the impact of enhancements to network layers (transport, IP, MAC), QoS mechanisms or latency improvements, and cluster-wide power control issues.