
Keyword Search Result

[Keyword] cache (201 hits)

Results 121-140 of 201 hits

  • Load Balancing Routing Algorithm for Reverse Proxy Servers

    Satosi KATO  Hidetosi OKAMOTO  Toyofumi TAKENAKA  

     
    PAPER-Internet

    Vol: E88-B No:9  Page(s): 3693-3700

We propose a novel routing algorithm for reverse proxy servers, called load balancing content address hashing (LB-CAH), and evaluate its performance against the content address hashing (CAH) and hash and slide (HAS) routing algorithms. The LB-CAH algorithm estimates page popularity in the load balancer using an LFU caching technique and periodically builds a popularity list. Using this list, the algorithm selects a reverse proxy server as follows: when the requested page appears in the popularity list, the request is routed by the round-robin method; otherwise, it is routed by the content address hashing method. We evaluate and compare the LB-CAH, CAH, and HAS routing algorithms through simulation experiments from the viewpoints of load balancing, consumed cache space, and cache hit rate. The simulations show that LB-CAH achieves almost the same degree of load balancing as the HAS algorithm and the same cache hit rate as the CAH algorithm in a variety of web site environments.
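
    A minimal Python sketch of the routing decision described in the abstract above, assuming MD5 content hashing and a pre-built popularity list; the class name, key choice, and bookkeeping are illustrative, not taken from the paper.

    ```python
    import hashlib

    class LBCAHRouter:
        """Illustrative sketch of load balancing content address hashing."""

        def __init__(self, servers, popularity_list):
            self.servers = servers
            self.popular = set(popularity_list)  # rebuilt periodically from LFU stats
            self.rr_index = 0

        def route(self, url):
            if url in self.popular:
                # Popular page: round robin spreads the hot traffic evenly.
                server = self.servers[self.rr_index % len(self.servers)]
                self.rr_index += 1
            else:
                # Unpopular page: content address hashing preserves cache affinity.
                digest = int(hashlib.md5(url.encode()).hexdigest(), 16)
                server = self.servers[digest % len(self.servers)]
            return server
    ```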

  • A Cache Optimized Multidimensional Index in Disk-Based Environments

    Myungsun PARK  Sukho LEE  

     
    PAPER-Database

    Vol: E88-D No:8  Page(s): 1932-1939

R-trees have traditionally been optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed cache-conscious variations of R-trees optimized for CPU cache performance in main-memory environments, where the node size is several cache lines wide and more entries are packed into a node by compressing MBR keys. However, because the node sizes of the two types of R-trees differ greatly, disk-optimized R-trees show poor cache performance while cache-optimized R-trees exhibit poor disk performance. In this paper, we propose a cache- and disk-optimized R-tree called the PR-tree (Prefetching R-tree). For cache performance, the node size of the PR-tree is wider than a cache line, and the prefetch instruction is used to reduce the number of cache misses. For I/O performance, the nodes of the PR-tree are fitted into one disk page. We present a detailed analysis of cache misses for range queries and enumerate all the reasonable in-page leaf and nonleaf node sizes and in-page tree heights to determine the tree parameters that give the best cache and I/O performance. The PR-tree achieves better cache performance than the disk-optimized R-tree: improvement factors of 3.5-15.1 for one-by-one insertions, 6.5-15.1 for deletions, 1.3-1.9 for range queries, and 2.7-9.7 for k-nearest-neighbor queries. None of the experimental results shows a notable decline in I/O performance.

  • Exploiting Versions for Transactional Cache Consistency

    Heum-Geun KANG  

     
    PAPER-Database

    Vol: E88-D No:6  Page(s): 1191-1198

    The efficiency of algorithms managing data caches has a major impact on the performance of systems that utilize client-side data caching. In these systems, two versions of data can be maintained without additional overhead by exploiting the replication of data in the server's buffer and clients' caches. In this paper, we present a new cache consistency algorithm employing versions: Two Versions-Callback Locking (2V-CBL). Our experimental results indicate that 2V-CBL provides good performance, and in particular outperforms a leading cache consistency algorithm, Asynchronous Avoidance-based Cache Consistency, when some clients run only read-only transactions.

  • Low-Power Network-Packet-Processing Architecture Using Process-Learning Cache for High-End Backbone Router

    Michitaka OKUNO  Shin-ichi ISHIDA  Hiroaki NISHI  

     
    PAPER-Digital

    Vol: E88-C No:4  Page(s): 536-543

A novel cache-based packet-processing-engine (PPE) architecture that achieves low power consumption and high packet-processing throughput by exploiting the nature of network traffic is proposed. The architecture consists of a processing-unit array and a bit-stream manipulation path called a burst stream path (BSP) that has a special cache mechanism called a process-learning cache (PLC). Network packets that carry the same header information appear repeatedly over a short time. Exploiting this property, the PLC memorizes the packet-processing method together with all of its results (i.e., table lookups) and applies it to subsequent packets. The PLC lets most packets skip execution on the processing-unit array, which consumes considerable power. As a practical implementation of the cache-based PPE architecture, P-Gear was designed and compared with a conventional PPE in terms of silicon die size and power consumption. With current 0.13-µm CMOS process technology, P-Gear can achieve 100-Gbps (gigabits per second) packet-processing throughput with only 36.5% of the die size and 32.8% of the power consumption required by the conventional PPE. Configurations of both architectures for the 1- to 100-Gbps throughput range were also analyzed. At throughputs of 10 Gbps or more, P-Gear achieves the target throughput in a smaller die size than the conventional PPE, and over the whole throughput range it does so at lower power.
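
    A rough Python sketch of the process-learning idea described above, assuming a flow is identified by a (src, dst, proto) header key; the key choice and the slow-path callback are illustrative assumptions.

    ```python
    class ProcessLearningCache:
        """Illustrative sketch: memoize per-flow packet-processing results."""

        def __init__(self):
            self.learned = {}

        def process(self, packet, slow_path):
            # Assumed key: the header fields that determine the processing outcome.
            key = (packet["src"], packet["dst"], packet["proto"])
            if key in self.learned:
                return self.learned[key]   # fast path: skip the processing-unit array
            result = slow_path(packet)     # full table lookups on the PU array
            self.learned[key] = result     # learn the method + results for the flow
            return result
    ```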

  • Quantitative Evaluation of State-Preserving Leakage Reduction Algorithm for L1 Data Caches

    Reiko KOMIYA  Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER

    Vol: E88-A No:4  Page(s): 862-868

As transistor feature sizes and threshold voltages shrink, leakage energy consumption has become an unavoidable issue in high-performance microprocessor design. Since on-chip caches are major contributors to leakage, a number of researchers have proposed efficient leakage reduction techniques. However, it is still not clear 1) what kinds of algorithms can be considered and 2) how much impact they have on energy and performance. To answer these questions, we explore run-time cache management algorithms and evaluate the energy-performance efficiency of several alternatives.

  • Optimal Replication Algorithm for Scalable Streaming Media in Content Delivery Networks

    Zhou SU  Jiro KATTO  Yasuhiko YASUDA  

     
    PAPER-Internet Systems

    Vol: E87-D No:12  Page(s): 2723-2732

A CDN (Content Delivery Network) improves end-user performance by replicating web content on a group of geographically distributed servers. However, keeping a complete replica of the original objects on many content servers consumes too much server capacity. This problem becomes more serious for large objects such as streaming media, e.g., high-quality video. In this paper, we therefore propose an efficient replication method for layered video streams in CDNs that reduces user response delays and storage costs simultaneously. Based on an analytical formulation of the cooperative replication of the layers and segments of each video stream, we derive a replication algorithm that answers the following three questions quantitatively. (1) How many servers should be selected to replicate a given video stream? (2) For a single video stream, how many layers and segments should be stored on a given server? (3) After selecting a group of servers for each video stream, how do we allocate the replication priority (i.e., order) to each server? Simulation results verify that the proposed algorithm efficiently resolves these problems and provides much better performance than conventional methods.

  • Density-Based Spam Detector

    Kenichi YOSHIDA  Fuminori ADACHI  Takashi WASHIO  Hiroshi MOTODA  Teruaki HOMMA  Akihiro NAKASHIMA  Hiromitsu FUJIKAWA  Katsuyuki YAMAZAKI  

     
    PAPER-Internet Systems

    Vol: E87-D No:12  Page(s): 2678-2688

The volume of mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat not only to the Internet but also to society. This paper proposes a new spam detection method that uses document-space density information. Although the proposed method requires extensive e-mail traffic to acquire the necessary information, it can achieve perfect detection (i.e., both recall and precision are 100%) under practical conditions. A direct-mapped cache method contributes to the handling of over 13,000 e-mail messages per second. Experimental results obtained from over 50 million actual e-mail messages are also reported.
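
    A minimal sketch of a direct-mapped cache of the kind the abstract credits for the high message rate, assuming overwrite-on-conflict semantics; the slot count and key type are illustrative.

    ```python
    class DirectMappedCache:
        """Illustrative direct-mapped cache: one (tag, value) pair per slot."""

        def __init__(self, num_slots=1 << 16):
            self.slots = [None] * num_slots

        def lookup(self, key):
            tag, value = self.slots[hash(key) % len(self.slots)] or (None, None)
            return value if tag == key else None

        def insert(self, key, value):
            # A conflicting entry is simply overwritten, keeping insertion O(1).
            self.slots[hash(key) % len(self.slots)] = (key, value)
    ```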

  • Caching Policy and Cache Placement for Active Reliable Multicast

    Gang FENG  Chee Kheong SIEW  Kek Wee LOK  Kwan Lawrence YEUNG  

     
    PAPER-Network

    Vol: E87-B No:11  Page(s): 3230-3241

Active Reliable Multicast (ARM) is a novel loss recovery scheme for large-scale reliable multicast that employs active routers to protect the sender and network bandwidth from unnecessary feedback and repair traffic. Active routers suppress NACKs, cache multicast data for local loss recovery, and use scoped retransmission to avoid exposure. The limited active resources at routers need to be optimized to achieve low loss-recovery latency and/or high network throughput. In this paper, we study cache placement strategies and caching policies for ARM. Several cache allocation heuristics are proposed, namely uniform allocation, proportional allocation, max-min fair share, and weighted allocation. To further improve loss-recovery performance, caching policies can be employed in conjunction with the cache allocation strategies; we propose complete caching, random caching, and deterministic caching. Extensive simulation experiments are conducted to evaluate and compare the performance of the proposed strategies and policies. Numerical results reveal that significant performance gains can be achieved when a proper cache placement strategy and caching policy are used for a given cache resource. Another interesting finding is that the contributions of the cache placement scheme and the caching policy to recovery latency are roughly independent. The insights obtained in this study provide design guidelines for optimal active resource allocation and caching policies for reliable multicast communications.
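
    A sketch of one of the named heuristics, proportional allocation, assuming demand is measured per active router; the demand metric is an illustrative assumption.

    ```python
    def proportional_allocation(total_cache, demands):
        """Split a router cache budget in proportion to observed repair demand.

        demands: dict mapping router id -> loss-recovery demand (illustrative metric).
        """
        total_demand = sum(demands.values())
        return {router: total_cache * demand / total_demand
                for router, demand in demands.items()}
    ```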

  • TLB Update-Hint: A Scalable TLB Consistency Algorithm for Cache-Coherent Non-uniform Memory Access Multiprocessors

    Byeonghag SEONG  Donggook KIM  Yangwoo ROH  Kyuho PARK  Daeyeon PARK  

     
    PAPER-Networking and System Architectures

    Vol: E87-D No:7  Page(s): 1682-1692

Shared-memory multiprocessors in which each processor has its own TLB must manage consistency between the TLBs and the page table. As large-scale CC-NUMA (cache-coherent non-uniform memory access) shared-memory multiprocessors become popular, it is important for TLB consistency management algorithms to be highly scalable. In this paper, we propose the TLB update-hint algorithm as a scalable TLB consistency solution for CC-NUMA multiprocessors. By using a lazy TLB invalidation approach, we reduce the number of unnecessary processor interruptions and the idle-waiting time, and achieve a high level of scalability. Using a shared-memory simulator, we evaluated the TLB update-hint algorithm against the TLB shootdown algorithm, one of the most popular TLB consistency algorithms. The simulations demonstrate that the update-hint algorithm scales well in systems with a large number of processors; in 64-node systems, it shows 47-87% better performance than TLB shootdown.
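
    A very rough sketch of lazy invalidation with update hints, assuming a per-translation version counter serves as the hint; the actual mechanism in the paper may differ.

    ```python
    class UpdateHintTLB:
        """Illustrative lazy-invalidation sketch: hints instead of shootdown IPIs."""

        def __init__(self, page_table, hints):
            self.page_table = page_table  # shared vpn -> pte mapping
            self.hints = hints            # shared vpn -> version ("update hint")
            self.tlb = {}                 # this processor's vpn -> (pte, version)

        def translate(self, vpn):
            version = self.hints.get(vpn, 0)
            cached = self.tlb.get(vpn)
            if cached is None or cached[1] != version:
                # Stale or missing: reload lazily, with no cross-processor interrupt.
                cached = (self.page_table[vpn], version)
                self.tlb[vpn] = cached
            return cached[0]

    def update_mapping(page_table, hints, vpn, new_pte):
        page_table[vpn] = new_pte
        hints[vpn] = hints.get(vpn, 0) + 1  # leave a hint; interrupt no one
    ```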

  • Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set

    Sung Woo CHUNG  Hyong-Shik KIM  Chu Shik JHON  

     
    PAPER-Networking and System Architectures

    Vol: E87-D No:7  Page(s): 1617-1624

In CC-NUMA multiprocessor systems, it is important to reduce remote memory access time. Based on the observation that enlarging an LRU second-level (L2) cache beyond a certain size does not significantly reduce the cache miss rate, we propose two split L2 cache designs that exploit this surplus capacity. A split L2 cache combines a traditional LRU cache with a second cache that reduces remote memory access time; the two work together to reduce total L2 miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For the second cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27% and outperform a traditional single LRU cache of double the size.
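
    A sketch of the split-L2 idea in the L2-RVC flavor, assuming victims are kept only when their home memory is remote; the sizes and the is_remote predicate are illustrative assumptions.

    ```python
    from collections import OrderedDict

    class SplitL2:
        """Illustrative LRU L2 paired with a remote victim cache (L2-RVC)."""

        def __init__(self, lru_size, victim_size, is_remote):
            self.lru = OrderedDict()
            self.victim = OrderedDict()
            self.lru_size, self.victim_size = lru_size, victim_size
            self.is_remote = is_remote

        def access(self, addr, fetch):
            if addr in self.lru:
                self.lru.move_to_end(addr)       # LRU hit
                return self.lru[addr]
            if addr in self.victim:
                data = self.victim.pop(addr)     # victim hit: promote, no refetch
            else:
                data = fetch(addr)               # miss: fetch (possibly remote)
            self.lru[addr] = data
            if len(self.lru) > self.lru_size:
                old_addr, old_data = self.lru.popitem(last=False)
                if self.is_remote(old_addr):     # keep only costly remote victims
                    self.victim[old_addr] = old_data
                    if len(self.victim) > self.victim_size:
                        self.victim.popitem(last=False)
            return data
    ```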

  • Enhancing ICP with P2P Technology: Cost, Availability, and Reconfiguration

    Ping-Jer YEH  Yu-Chen CHUANG  Shyan-Ming YUAN  

     
    PAPER-Networking and System Architectures

    Vol: E87-D No:7  Page(s): 1641-1648

Traditional Web cache servers based on the HTTP and ICP infrastructure tend to have high hardware and management costs, suffer from limited availability and difficulty with automatic, dynamic reconfiguration, and may have slow links to some users. We find that peer-to-peer technology can help solve these problems. The peer cache service (PCS) proposed here leverages each peer's local cache, similar access patterns, fully distributed coordination, and fast communication channels to improve response time, the range of cacheable objects, and availability. Moreover, design goals and strategies such as keeping the protocol lightweight and compatible with the existing cache infrastructure, supporting mobile devices, undertaking dynamic three-level caching, and exchanging cache meta-information further improve the effectiveness and differentiate our work from other similar-at-first-glance P2P Web cache systems.

  • Dynamic Code Repositioning for Java

    Shinji TANAKA  Tetsuyasu YAMADA  Satoshi SHIRAISHI  

     
    PAPER-Software Support and Optimization Techniques

    Vol: E87-D No:7  Page(s): 1737-1742

The sizes of recent Java-based server-side applications, such as J2EE containers, have been increasing continuously. Past techniques for improving the performance of Java applications have targeted relatively small applications whose methods, when invoked, are not usually spread over the entire memory space; consequently, these techniques cannot be applied efficiently to today's large applications. We propose a dynamic code repositioning approach to improve the hit rates of instruction caches and translation look-aside buffers. Profiles of method invocations are collected while the application runs under its heaviest processor load, and the code is repositioned based on these profiles. We also discuss a method-splitting technique that significantly reduces method sizes. Our evaluation of a prototype implementing these techniques showed a 5% improvement in application throughput.

  • A Proposal of Effective Cooperative Caching System Based on Random Access Assumption

    Mitsuru ISHII  Shimmi HATTORI  

     
    LETTER-Network

    Vol: E87-B No:6  Page(s): 1741-1745

In this letter, we propose an effective cooperative caching system under the assumption that each web object is accessed randomly. Under this assumption, the number of accesses per unit time follows a Poisson distribution, and the probability that a web object will be accessed in the future can be derived. Based on this probability distribution, one obtains a criterion for allocating the objects with the most expected accesses to the cache servers closest to clients. It is also shown that there is a tradeoff between the precision of object allocation and the efficiency of caching.
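
    As a sketch of the probability model the random access assumption implies, with λ standing for an object's mean access rate (the symbol choice is ours, not the letter's):

    ```latex
    P(k \text{ accesses in } [0,t]) = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t},
    \qquad
    P(\text{at least one access in } [0,t]) = 1 - e^{-\lambda t}.
    ```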

  • ILP-Based Program Path Analysis for Bounding Worst-Case Inter-Task Cache Conflicts

    Hiroyuki TOMIYAMA  Nikil DUTT  

     
    LETTER-System Programs

    Vol: E87-D No:6  Page(s): 1582-1587

The unpredictable behavior of cache memory makes it difficult to statically analyze the worst-case performance of real-time systems. This problem is further exacerbated in preemptive multitask systems because of inter-task cache interference, called Cache-Related Preemption Delay (CRPD). This paper proposes an approach to deriving a tight upper bound on the CRPD that a task may impose on lower-priority tasks. Our method finds the program execution path that requires the maximum number of cache blocks, using an integer linear programming technique. Experimental results show that our approach provides up to 69% tighter bounds on CRPD than a conservative approach.
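
    One plausible shape for such an ILP, not taken from the paper: binary variables $x_v$ select the basic blocks of one execution path through the control-flow graph, $f_e$ are edge flows enforcing path structure, and $y_b$ marks each cache block $b$ touched along that path:

    ```latex
    \max \sum_{b \in B} y_b
    \quad \text{s.t.} \quad
    y_b \le \sum_{v:\, b \in \mathrm{blocks}(v)} x_v \quad \forall b \in B,
    \qquad
    \sum_{e \in \mathrm{in}(v)} f_e = \sum_{e \in \mathrm{out}(v)} f_e = x_v \quad \forall v,
    \qquad
    x_v, y_b \in \{0,1\}.
    ```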

  • A 100 MHz 7.84 mm² 31.7 msec 439 mW 512-Point 2-Dimensional FFT Single-Chip Processor

    Naoto MIYAMOTO  Leo KARNAN  Kazuyuki MARUO  Koji KOTANI  Tadahiro OHMI  

     
    PAPER

    Vol: E87-C No:4  Page(s): 502-509

A single-chip 512-point FFT processor is presented. The processor is based on a cached-memory architecture (CMA) with a resource-saving multi-datapath radix-2³ computation element. A 2-stage CMA, including a pair of single-port SRAMs, is also introduced to speed up the execution of 2-dimensional FFTs. Using these techniques, we designed an FFT processor core that integrates 552,000 transistors within an area of 2.8 × 2.8 mm² in a 0.35-µm triple-layer-metal CMOS process. The processor can execute a 512-point 1-dimensional FFT on 36-bit complex fixed-point data in 23.2 µsec, and a 2-dimensional one in only 23.8 msec, at 133 MHz operation. The power consumption is 439.6 mW at 3.3 V, 100 MHz operation.

  • A Novel Static Prediction Scheme for Filter Cache Structures

    Kugan VIVEKANANDARAJAH  Thambipillai SRIKANTHAN  Christopher T. CLARKE  Saurav BHATTACHARYYA  

     
    PAPER

    Vol: E87-C No:4  Page(s): 543-548

Energy dissipation in cache memories is becoming a major design issue for embedded microprocessors. A predictive filter-cache-based instruction cache hierarchy has been shown to reduce the energy-delay product effectively. In this paper, a simplified pattern prediction algorithm is proposed for the filter cache hierarchy. The prediction scheme relies on the static nature of the hit/miss patterns of instruction access streams. The static patterns are maintained in a small 32 × 1-bit Static Pattern Table (SPT). Our investigations show that the proposed prediction algorithm is superior to one based on a Next Fetch Prediction Table (NFPT) for all the benchmarks simulated, yielding an energy-delay product reduction of up to 6.79% compared with the NFPT. Moreover, since the prediction scheme is based on the static assignment of patterns, it lends itself to a more area- and power-efficient implementation than dynamic pattern prediction, although it is marginally inferior (by 0.69%) in terms of energy-delay product.
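
    A minimal sketch of prediction with a 32 × 1-bit table as described above, assuming the SPT is indexed by low-order fetch-address bits; the indexing and update rule are illustrative assumptions.

    ```python
    class StaticPatternPredictor:
        """Illustrative sketch of a 32 x 1-bit Static Pattern Table (SPT)."""

        def __init__(self):
            self.spt = [1] * 32  # 1 = predict the next fetch hits the filter cache

        def index(self, fetch_addr):
            return (fetch_addr >> 2) & 31    # assumed: low-order word-address bits

        def predict_hit(self, fetch_addr):
            # A predicted miss bypasses the filter cache and fetches from L1.
            return self.spt[self.index(fetch_addr)] == 1

        def record(self, fetch_addr, was_hit):
            # Patterns are largely static, so each bit quickly settles.
            self.spt[self.index(fetch_addr)] = 1 if was_hit else 0
    ```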

  • Selective-Sets Resizable Cache Memory Design for High-Performance and Low-Power CPU Core

    Takashi KURAFUJI  Yasunobu NAKASE  Hidehiro TAKATA  Yukinaga IMAMURA  Rei AKIYAMA  Tadao YAMANAKA  Atsushi IWABU  Shutarou YASUDA  Toshitsugu MIWA  Yasuhiro NUNOMURA  Niichi ITOH  Tetsuya KAGEMOTO  Nobuharu YOSHIOKA  Takeshi SHIBAGAKI  Hiroyuki KONDO  Masayuki KOYAMA  Takahiko ARAKAWA  Shuhei IWADE  

     
    PAPER

    Vol: E87-C No:4  Page(s): 535-542

We apply a selective-sets resizable cache and a complete-hierarchy SRAM to a high-performance, low-power RISC CPU core. The selective-sets resizable cache changes the cache memory size by varying the number of cache sets. It reduces the leakage current by 23% with only a slight degradation of the worst-case operating speed, from 213 MHz to 210 MHz. The complete-hierarchy SRAM enables partial-swing operation not only in the bit lines but also in the global signal lines. It reduces the current consumption of the memory by 4.6% and attains a high-speed access time of 1.4 ns in the typical case.

  • Design and Evaluation of a High Speed Routing Lookup Architecture

    Jun ZHANG  JeoungChill SHIM  Hiroyuki KURINO  Mitsumasa KOYANAGI  

     
    PAPER-Implementation and Operation

    Vol: E87-B No:3  Page(s): 406-412

The IP routing lookup problem is equivalent to finding the longest prefix of a packet's destination address in a routing table. Designing a high-performance IP routing lookup architecture is challenging because of increasing traffic, higher link speeds, frequent updates, and growing routing table sizes. First, increasing traffic and higher link speeds require that IP routing lookups be executed at wire speed. Second, frequent routing table updates require insertion and deletion operations that are simple and low-delay. Finally, growing routing tables call for less memory in order to reduce cost. Although many schemes achieve fast lookup, little attention has been paid to the latter two factors. This paper proposes a novel pipelined IP routing lookup architecture using selective binary search on hash tables organized by prefix length. The evaluation results show that it can perform IP lookups at a maximum rate of one lookup per cycle, the hash-operation ratio per lookup can be reduced to about 1%, fewer than two hash operations are needed per table update, and only 512 Kbytes of SRAM are needed for a routing table with about 43,000 prefixes. It proves to have higher performance than the existing schemes.
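
    A sketch of binary search over per-prefix-length hash tables in the Waldvogel style, assuming marker entries guide the search toward longer matching prefixes; the data layout is an illustrative assumption, not the paper's.

    ```python
    def lookup_longest_prefix(dst, tables, lengths):
        """Binary search over hash tables organized by prefix length.

        dst:     32-bit destination address as an int.
        tables:  {length: {prefix_bits: {"route": ..., "marker": bool}}}
        lengths: sorted list of prefix lengths present in the table.
        """
        lo, hi = 0, len(lengths) - 1
        best = None
        while lo <= hi:
            mid = (lo + hi) // 2
            length = lengths[mid]
            entry = tables[length].get(dst >> (32 - length))
            if entry is None:
                hi = mid - 1                  # nothing here: try shorter prefixes
                continue
            if entry.get("route") is not None:
                best = entry["route"]         # best match found so far
            if entry.get("marker"):
                lo = mid + 1                  # a longer matching prefix may exist
            else:
                break                         # no marker: no longer match exists
        return best
    ```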

  • A Cache Replacement Policy for Transcoding Proxy Servers

    Kai-Hau YEUNG  Chun-Cheong WONG  Kin-Yeung WONG  Suk-Yu HUI  

     
    LETTER-Multimedia Systems

    Vol: E87-B No:1  Page(s): 209-211

A cache replacement policy for the emerging transcoding proxy servers is proposed that takes transcoding time into account when making replacement decisions. Simulation results show that the proposed policy outperforms conventional LRU in both cache hit rate and average object transcoding time.
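
    A sketch of a replacement decision that weighs transcoding time, assuming a simple profit ratio of transcoding cost to idle time; the letter does not specify the exact function, so this is only one plausible instantiation.

    ```python
    import time

    def evict_candidate(cache):
        """Pick the object whose loss costs least: cheap to re-transcode and idle.

        cache: dict mapping object id -> {"transcode_time": s, "last_used": epoch s}.
        """
        now = time.time()

        def profit(oid):
            meta = cache[oid]
            return meta["transcode_time"] / (now - meta["last_used"] + 1.0)

        return min(cache, key=profit)
    ```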

  • A Self-Adjusting Destage Algorithm with High-Low Water Mark in Cached RAID5

    Young Jin NAM  Chanik PARK  

     
    PAPER-Dependable Systems

    Vol: E86-D No:12  Page(s): 2527-2535

The High-Low Water Mark destage (HLWM) algorithm is widely used to let a cached RAID5 flush dirty data from its write cache to disks, owing to the simplicity of its operations. It starts and stops the destaging process based on two thresholds that are configured at initialization time with the best available knowledge of the underlying storage performance and the workload pattern, which includes traffic intensity, access patterns, etc. However, each time the workload drifts from the original, the thresholds must be re-configured to match. This paper proposes an efficient destage algorithm, called adaptive thresholding, which automatically re-configures its thresholds according to the changing traffic intensity and access patterns. The core of adaptive thresholding is to define the two thresholds as the product of the observed rates at which the write-cache occupancy rises and falls and the time required to fill or empty the write cache. We implemented the proposed algorithm on an actual RAID system and verified its auto-reconfiguration capability with synthetic workloads of differing traffic intensity and access patterns. Performance evaluations under well-known traced workloads reveal that the proposed algorithm reduces disk I/O traffic by about 12% with a 6% increase in the overwrite ratio compared with the HLWM algorithm.
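
    A sketch of the adaptive-thresholding rule as the abstract states it (threshold = occupancy-change rate × fill/empty time); the exact pairing of rates with times and the clamping are our assumptions.

    ```python
    def adaptive_thresholds(fill_rate, drain_rate, cache_size):
        """Thresholds as rate x fill/empty time, per the stated definition.

        fill_rate / drain_rate: observed write-cache occupancy change (blocks/s).
        cache_size: write-cache capacity in blocks.
        """
        time_to_fill = cache_size / max(fill_rate, 1e-9)
        time_to_empty = cache_size / max(drain_rate, 1e-9)
        high_mark = min(cache_size, fill_rate * time_to_empty)       # start destaging
        low_mark = max(0.0, cache_size - drain_rate * time_to_fill)  # stop destaging
        return high_mark, low_mark
    ```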
