
Author Search Result

[Author] Chu Shik JHON (12 hits)

  • Write Avoidance Cache Coherence Protocol for Non-volatile Memory as Last-Level Cache in Chip-Multiprocessor

    Ju Hee CHOI, Jong Wook KWAK, Chu Shik JHON
    LETTER-Computer System
    Vol: E97-D No:8, Page(s): 2166-2169

    Non-Volatile Memories (NVMs) are considered promising memory technologies for the Last-Level Cache (LLC) due to their low leakage and high density. However, NVMs have drawbacks such as high dynamic energy for modifying NVM cells, long write latency, and limited write endurance. A number of approaches have been proposed to overcome these drawbacks, but very little attention has been paid to the cache coherence issue. In this letter, we suggest a new cache coherence protocol that reduces the write operations of the LLC. In our protocol, a block in the LLC is updated only when the block is written back from a private cache, which avoids useless write operations in the LLC. The simulation results show that our protocol provides up to 27.1% energy savings and 26.3% lifetime improvement for STT-RAM.
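
    A minimal sketch of the write-avoidance rule described above, assuming a simple two-level hierarchy: the NVM LLC array is written only when a dirty block is written back from a private cache, while clean evictions skip the costly NVM update. The class and method names (WriteAvoidanceLLC, on_private_eviction) are illustrative assumptions, not the authors' implementation.

      class LLCBlock:
          def __init__(self, tag, data=None):
              self.tag, self.data, self.valid = tag, data, True

      class WriteAvoidanceLLC:
          def __init__(self):
              self.blocks = {}       # tag -> LLCBlock
              self.nvm_writes = 0    # count of costly NVM cell updates

          def fill_from_memory(self, tag, data):
              # Installing a block fetched from memory is an unavoidable NVM write.
              self._nvm_write(tag, data)

          def on_private_eviction(self, tag, data, dirty):
              # Only a dirty write-back from a private cache touches the NVM array;
              # a clean eviction leaves the already up-to-date LLC copy untouched.
              if dirty:
                  self._nvm_write(tag, data)

          def _nvm_write(self, tag, data):
              self.nvm_writes += 1
              self.blocks.setdefault(tag, LLCBlock(tag)).data = data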

  • Throttling Capacity Sharing Using Life Time and Reuse Time Prediction in Private L2 Caches of Chip Multiprocessors

    Young-Sik EOM, Jong Wook KWAK, Seong Tae JHANG, Chu Shik JHON
    LETTER-Computer System
    Vol: E95-D No:6, Page(s): 1676-1679

    Private L2 caches offer potential benefits in future Chip Multi-Processors (CMPs), e.g., small access latency, performance isolation, a tile-friendly architecture, and a simple low-bandwidth on-chip interconnect. However, their major weakness is a higher cache miss rate caused by the small capacity of each private cache. To deal with this problem, private caches can share capacity by spilling replaced blocks to other private caches. However, indiscriminate spilling can make the capacity problem worse and affect performance negatively. This letter proposes throttling capacity sharing (TCS) for effective capacity sharing in private L2 caches. TCS determines whether to spill a replaced block by predicting its reuse possibility based on life time and reuse time. In our performance evaluation, TCS improves weighted speedup by 48.79%, 6.37% and 5.44% compared to non-spilling, Cooperative Caching with best spill probability (CC) and Dynamic Spill-Receive (DSR), respectively.
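
    The throttling decision can be pictured with a short sketch; the inputs and the threshold below are assumptions for illustration, not the exact TCS predictor.

      def should_spill(life_time_cycles, predicted_reuse_cycles, reuse_threshold=1.0):
          """Spill a replaced block only if it is likely to be reused soon enough."""
          if predicted_reuse_cycles is None:
              return False     # no reuse history: do not pollute a peer cache
          # Spill only when the predicted reuse distance is short relative to the
          # time the block is expected to survive in a peer private L2.
          return predicted_reuse_cycles <= reuse_threshold * life_time_cycles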

  • Selective Write-Update: A Method to Relax Execution Constraints in a Critical Section

    Jae Bum LEE, Chu Shik JHON
    PAPER-Computer Systems
    Vol: E81-D No:11, Page(s): 1186-1194

    In a shared-memory multiprocessor, shared data are usually accessed in a critical section that is protected by a lock variable. Therefore, the order of accesses by multiple processors to the shared data corresponds to the order of acquiring the ownership of the lock variable. This paper presents a selective write-update protocol, in which data modified in a critical section are stored in a write cache and, at a synchronization point, transferred only to the processor that will execute the critical section after the current processor. By using QOLB synchronization primitives, the next processor can be determined at run time. We prove that the selective write-update protocol ensures data coherence for parallel programs that comply with release consistency, and we evaluate the performance of the protocol by analytical modeling and program-driven simulation. The simulation results show that our protocol can reduce the number of coherence misses in a critical section while avoiding the multicast of write-update requests on the interconnection network. In addition, we observe that synchronization latency can be decreased by reducing both the execution time of a critical section and the number of write-update requests. The simulation results also show that our protocol provides better performance than a write-invalidate protocol and a write-update protocol as the number of processors increases.
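
    A hedged sketch of the protocol's key step: writes performed inside the critical section are buffered in a write cache and, at the release point, forwarded only to the next lock requester instead of being multicast to all sharers. The software queue below merely stands in for the QOLB hardware queue, and send_updates is a hypothetical helper.

      from collections import deque

      class SelectiveWriteUpdateLock:
          def __init__(self):
              self.waiters = deque()          # processors queued on the lock

          def acquire(self, proc_id):
              self.waiters.append(proc_id)    # queue up, as QOLB would in hardware

          def release(self, proc_id, write_cache):
              # At the release point, buffered updates go only to the next waiter.
              assert self.waiters and self.waiters[0] == proc_id
              self.waiters.popleft()
              if self.waiters:
                  send_updates(self.waiters[0], dict(write_cache))  # point-to-point
              write_cache.clear()

      def send_updates(proc_id, updates):
          print(f"update processor {proc_id} with {len(updates)} modified lines")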

  • Torus Ring: Improving Interconnection Network Performance by Modifying Hierarchical Ring

    Jong Wook KWAK, Hyong Jin BAN, Chu Shik JHON
    LETTER-Computer Systems
    Vol: E88-D No:5, Page(s): 1067-1071

    In this letter, we propose the "Torus Ring", a modified version of the 2-level hierarchical ring. The Torus Ring has the same complexity as the hierarchical ring, since the only difference is how it connects the local rings. It has an advantage over the hierarchical ring when the destination of a packet is an adjacent local ring, especially in the backward direction. Even when the destination of a network packet is assumed to be uniformly distributed across the processing nodes, the average number of hops in the Torus Ring is equal to that of the hierarchical ring; the performance gain of the Torus Ring is expected to increase further due to the spatial locality of application programs in real parallel programming environments. In the simulation results, the latencies of the interconnection network are reduced by up to 19% at moderate ring utilization ratios.
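
    As a generic aid for reasoning about the adjacent-ring case mentioned above (this is not the paper's model), the hop count between two positions on a bidirectional ring is simply the shorter of the forward and backward paths:

      def ring_hops(src, dst, n):
          """Minimum hops between src and dst on a bidirectional n-node ring."""
          forward = (dst - src) % n
          backward = (src - dst) % n
          return min(forward, backward)

      # Example: with 8 local rings, the backward-adjacent ring is 1 hop away on a
      # bidirectional connection, versus 7 hops if only forward traversal existed.
      assert ring_hops(0, 7, 8) == 1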

  • An Efficient Method of Eliminating Inclusion Overhead in Snoop-Based CC-NUMA Systems

    Hyo-Joong SUH, Seung Wha YOO, Chu Shik JHON
    PAPER-Computer Systems
    Vol: E83-D No:2, Page(s): 159-167

    In a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system, memory transactions can be classified into two types: inter-node transactions and intra-node transactions. Because the latency of inter-node transactions is usually hundreds of times larger than that of intra-node transactions, it is important to reduce the latency of inter-node transactions. Even though the remote cache in CC-NUMA systems improves the latency of inter-node transactions by caching remote memory lines, the remote and processor caches of snoop-based CC-NUMA systems have to maintain the multi-level cache inclusion property to simplify snooping. The inclusion property degrades cache performance for the following reasons. First, all the remote memory lines in a processor cache must also be preserved in the remote cache of the same node. Second, a line replacement in the remote cache forces replacement of the line with the same address in the processor caches, which conflicts with the replacement policy of the processor caches. In this paper, we propose the Access-list, which renders the inclusion property unnecessary, and evaluate the performance of the proposed system by program-driven simulation. The simulation results show that cache miss rates are reduced while the efficiency of snoop filtering remains similar to that of a system with the inclusion property. The performance of the proposed system is improved by a factor of up to 1.28.
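
    The following sketch illustrates snoop filtering without inclusion in the spirit of the Access-list; the exact structure and its update rules are assumptions made for illustration only.

      class AccessList:
          def __init__(self):
              self.remote_lines = set()   # remote line addresses cached on this node

          def on_processor_fill(self, addr):
              self.remote_lines.add(addr)

          def on_processor_eviction(self, addr):
              self.remote_lines.discard(addr)

          def must_forward_snoop(self, addr):
              # Forward an incoming snoop to the processor caches only when the
              # address is actually cached there, without forcing the remote cache
              # to keep a duplicate copy (i.e., without the inclusion property).
              return addr in self.remote_lines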

  • An Efficient FPGA Technology Mapping Tightly Coupled with Logic Minimization

    Kang YI, Seong Yong OHM, Chu Shik JHON
    PAPER
    Vol: E80-A No:10, Page(s): 1807-1812

    FPGA logic synthesis consists of a logic minimization step and a technology mapping step. These two steps are usually performed separately to reduce the complexity of the problem. Conventional logic minimization methods try to minimize the number of literals of a given Boolean network, while FPGA technology mapping techniques attempt to minimize the number of basic blocks. However, minimizing the number of literals, a target-architecture-independent measure, does not always minimize the basic-block count, an FPGA-architecture-specific measure. Therefore, most existing technology mapping systems reorganize their input circuits to obtain better mapping results. Such a loosely coupled logic synthesis paradigm may make it difficult to find the optimal solution. In this paper, we propose a new logic synthesis approach in which the logic minimization and technology mapping steps are tightly coupled. Our system takes FPGA-specific features into account in the logic minimization step, so the technology mapping step does not need to resynthesize the Boolean network. We formulate the technology mapping problem as a graph covering problem. This formulation provides a more global view of optimality and supports versatile cost functions. In addition, a fast and exact library management technique is devised for efficient FPGA cell matching, one of the most frequently used operations in FPGA logic synthesis.
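
    To make the covering formulation concrete, here is a generic bottom-up covering sketch with a pluggable cost function. It only illustrates the kind of formulation described above; it is not the authors' algorithm, and it assumes a tree-shaped network (shared fanout is not handled).

      def best_cover_cost(node, matches, cost_of_cell, memo=None):
          # matches: dict node -> list of (cell_name, input_node_ids) candidates;
          # cost_of_cell: pluggable cost function, e.g. lambda cell: 1 counts blocks.
          if memo is None:
              memo = {}
          if node in memo:
              return memo[node]
          candidates = matches.get(node, [])
          if not candidates:                  # primary input: nothing to cover
              memo[node] = 0
              return 0
          memo[node] = min(
              cost_of_cell(cell) + sum(
                  best_cover_cost(i, matches, cost_of_cell, memo) for i in inputs)
              for cell, inputs in candidates)
          return memo[node]

      # Covering f with one 4-input cell beats covering it with three 2-input cells.
      matches = {
          "f": [("LUT4", ["a", "b", "c", "d"]), ("LUT2", ["g", "h"])],
          "g": [("LUT2", ["a", "b"])],
          "h": [("LUT2", ["c", "d"])],
      }
      assert best_cover_cost("f", matches, lambda cell: 1) == 1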

  • Data Filter Cache with Partial Tag Matching for Low Power Embedded Processor

    Ju Hee CHOI, Jong Wook KWAK, Seong Tae JHANG, Chu Shik JHON
    LETTER-Computer System
    Vol: E97-D No:4, Page(s): 972-975

    Filter caches have been studied as an energy-efficient solution. They achieve energy savings via selective access to the L1 cache, but severely degrade system performance. Therefore, a filter cache system should adopt components that balance execution delay against energy savings. In this letter, we analyze the legacy filter cache system and propose the Data Filter Cache with Partial Tag Cache (DFPC) as a new solution. The proposed DFPC scheme reduces the energy consumption of the L1 data cache without impairing system performance at all. Simulation results show that DFPC provides 46.36% energy savings without any performance loss.
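
    The general idea of partial tag matching can be sketched as below; whether DFPC uses exactly this organization is not stated in the abstract, so the field width and the lookup flow are assumptions.

      PARTIAL_BITS = 4

      def partial_tag(tag):
          return tag & ((1 << PARTIAL_BITS) - 1)

      def filter_lookup(stored_tag, requested_tag):
          """Return 'miss' or 'hit' for one filter-cache entry."""
          if partial_tag(stored_tag) != partial_tag(requested_tag):
              return "miss"    # cheap early-out: a partial mismatch can never hit
          # A partial match must still be confirmed against the full tag.
          return "hit" if stored_tag == requested_tag else "miss"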

  • The Impact of Branch Direction History Combined with Global Branch History in Branch Prediction

    Jong Wook KWAK, Ju-Hwan KIM, Chu Shik JHON
    LETTER-Computer Systems
    Vol: E88-D No:7, Page(s): 1754-1758

    Most branch predictors use the PC of the branch instruction and its dynamic Global Branch History (GBH). In this letter, we suggest the Branch Direction History (BDH) as a third component of branch prediction and analyze its impact on prediction accuracy. Additionally, we propose a new branch predictor, the direction-gshare predictor, which utilizes the BDH combined with the GBH. First, we model a neural network with (PC, GBH, BDH) inputs and analyze their actual impact on branch prediction accuracy; we then simulate our new predictor, the direction-gshare predictor. The simulation results show that aliasing in the Pattern History Table (PHT) is significantly reduced by the additional use of BDH information. The direction-gshare predictor outperforms the bimodal, two-level adaptive, and gshare predictors by up to 15.32%, 5.41% and 5.74%, respectively, without additional hardware cost.
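
    A gshare-style sketch showing where a third history component could be folded into the PHT index. The exact definition of the BDH bits and the way the letter combines them with the GBH are not reproduced here, so this hashing scheme is an assumption for illustration only.

      PHT_BITS = 12
      PHT_SIZE = 1 << PHT_BITS
      pht = [2] * PHT_SIZE                 # 2-bit saturating counters, weakly taken

      def pht_index(pc, gbh, bdh):
          # Fold the PC, the global branch history and the direction history
          # into a single PHT index (gshare-style XOR hashing).
          return (pc ^ gbh ^ bdh) & (PHT_SIZE - 1)

      def predict(pc, gbh, bdh):
          return pht[pht_index(pc, gbh, bdh)] >= 2        # predict taken?

      def update(pc, gbh, bdh, taken):
          i = pht_index(pc, gbh, bdh)
          pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)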

  • Adopting the Drowsy Technique for Instruction Caches: A Soft Error Perspective

    Soong Hyun SHIN, Sung Woo CHUNG, Eui-Young CHUNG, Chu Shik JHON
    PAPER-VLSI Design Technology and CAD
    Vol: E91-A No:7, Page(s): 1772-1779

    As technology scales down, leakage energy accounts for a greater proportion of total energy. Applying the drowsy technique to a cache is regarded as one of the most efficient techniques for reducing leakage energy. However, it increases the Soft Error Rate (SER); thus, many researchers doubt the reliability of the drowsy technique. In this paper, we show several reasons why the instruction cache can adopt the drowsy technique without reliability problems. First, an instruction cache always stores read-only data, so a soft error can be recovered by re-fetching the instructions from lower-level memory. Second, the effect of re-fetching caused by soft errors on performance is negligible. Additionally, a considerable percentage of soft errors occur without harming performance. Lastly, unrecoverable soft errors can be controlled by the scrubbing method. The simulation results show that the drowsy instruction cache rarely increases the rate of unrecoverable errors and degrades performance only negligibly.
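
    The recovery argument above can be sketched in a few lines: because an instruction cache holds read-only copies, a detected soft error is repaired by re-fetching the line from the next memory level. The parity check and the helper names below are assumptions used purely for illustration.

      def parity_of(data):
          return bin(data).count("1") & 1

      class ICacheLine:
          def __init__(self, data):
              self.data, self.parity, self.valid = data, parity_of(data), True

      def read_instruction(line, fetch_from_lower_level):
          # A detected error in a read-only line is repaired by a simple re-fetch,
          # since the lower memory level always holds a clean copy.
          if not line.valid or parity_of(line.data) != line.parity:
              line.data = fetch_from_lower_level()
              line.parity, line.valid = parity_of(line.data), True
          return line.data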

  • Utilization of the On-Chip L2 Cache Area in CC-NUMA Multiprocessors for Applications with a Small Working Set

    Sung Woo CHUNG, Hyong-Shik KIM, Chu Shik JHON
    PAPER-Networking and System Architectures
    Vol: E87-D No:7, Page(s): 1617-1624

    In CC-NUMA multiprocessor systems, it is important to reduce the remote memory access time. Based on the observation that increasing the size of an LRU second-level (L2) cache beyond a certain point does not reduce the cache miss rate significantly, in this paper we propose two split L2 caches to utilize the surplus L2 cache capacity. The split L2 caches are composed of a traditional LRU cache and another cache that reduces the remote memory access time. Both work together to reduce the total L2 cache miss time by keeping remote (or long-distance) blocks as well as recently used blocks. For the second cache, we propose two alternatives: an L2-RVC (Level 2 - Remote Victim Cache) and an L2-DAVC (Level 2 - Distance-Aware Victim Cache). The proposed split L2 caches reduce total execution time by up to 27%. They also outperform a traditional single LRU cache of double the size.
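
    A hedged sketch of the victim-cache half of the split L2 (an illustration of the L2-RVC idea rather than its exact design): only blocks whose home is a remote node are kept after eviction, so a later miss on such a block can be served locally instead of paying the remote latency.

      from collections import OrderedDict

      class RemoteVictimCache:
          def __init__(self, capacity):
              self.capacity = capacity
              self.lines = OrderedDict()          # addr -> data, kept in LRU order

          def on_l2_eviction(self, addr, data, is_remote):
              if not is_remote:
                  return                          # local blocks are cheap to re-fetch
              self.lines[addr] = data
              self.lines.move_to_end(addr)
              if len(self.lines) > self.capacity:
                  self.lines.popitem(last=False)  # drop the least recently kept victim

          def lookup(self, addr):
              # A hit here avoids a remote memory access on an L2 miss.
              if addr in self.lines:
                  self.lines.move_to_end(addr)
                  return self.lines[addr]
              return None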

  • An Optimal Scheduling Approach Using Lower Bound in High-Level Synthesis

    Seong Yong OHM, Fadi J. KURDAHI, Chu Shik JHON
    PAPER-High-Level Synthesis
    Vol: E78-D No:3, Page(s): 231-236

    This paper describes an optimal scheduling approach that finds the schedule with the minimum functional-unit cost under a given timing constraint. In this method, a well-defined search space is constructed incrementally and traversed in a branch-and-bound manner. During the traversal, tighter lower bounds are estimated and, coupled with the upper bound on the optimal solution, used to prune the search space effectively. The method is extended to support multi-cycle operations, operation chaining, pipelined functional units, and pipelined data paths. Experimental results on several benchmarks show the efficiency of the proposed approach.
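
    The pruning rule can be made concrete with a generic branch-and-bound skeleton; the paper's search-space construction and its lower-bound estimator are not reproduced here, and children, lower_bound, and cost_of_complete are hypothetical problem-specific hooks.

      import math

      def branch_and_bound(root, children, lower_bound, cost_of_complete):
          best_state, best_cost = None, math.inf
          stack = [root]
          while stack:
              state = stack.pop()
              if lower_bound(state) >= best_cost:
                  continue                        # prune: cannot beat the incumbent
              cost = cost_of_complete(state)      # None if the schedule is partial
              if cost is not None:
                  if cost < best_cost:            # tighter upper bound found
                      best_state, best_cost = state, cost
                  continue
              stack.extend(children(state))       # expand the partial schedule
          return best_state, best_cost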

  • Robust Delay Control for Audio Streaming over Wireless Link

    Hyo Jin CHOI, Jinhwan JEON, Taehyoun KIM, Hyo-Joong SUH, Chu Shik JHON
    LETTER-Networks
    Vol: E89-D No:8, Page(s): 2448-2451

    Audio delay is becoming an important factor in audio streaming over short-range wireless networks. In this study, we propose an efficient two-level delay control method, consisting of frame sequence adaptation and audio sampling frequency compensation, for achieving a stable audio delay with small variation. To prove the effectiveness of our scheme, we implemented and evaluated it on a Bluetooth network. Experimental results show that our scheme can control audio delay robustly and also remove the phase-shift problem in multi-channel stereophonic audio broadcasting.
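
    A hedged sketch of a two-level controller in the spirit described above: a coarse level adapts the frame sequence (skipping or repeating a frame) when the buffered delay strays far from the target, while a fine level nudges the playback sampling frequency for small deviations. The thresholds and the gain are illustrative assumptions, not the letter's parameters.

      def control_delay(measured_delay_ms, target_delay_ms,
                        coarse_threshold_ms=40.0, fine_gain=0.001):
          error = measured_delay_ms - target_delay_ms
          if abs(error) > coarse_threshold_ms:
              # Coarse level: frame sequence adaptation.
              return ("skip_frame", None) if error > 0 else ("repeat_frame", None)
          # Fine level: sampling-frequency compensation proportional to the error
          # (play slightly faster when the delay is too large, slower when too small).
          freq_scale = 1.0 + fine_gain * (error / target_delay_ms)
          return ("adjust_sampling_frequency", freq_scale)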