IEICE global.ieice.org Site

Author Search Result

[Author] Chong-Min KYUNG(18hit)


A Clustering Based Linear Ordering Algorithm for Netlist Partitioning
Kwang-Su SEONG Chong-Min KYUNG

LETTER-VLSI Design Technology and CAD

Vol: E79-A, No: 12, Page(s): 2185-2191
In this paper, we propose a clustering-based linear ordering algorithm which consists of global ordering and local ordering. In the global ordering, the algorithm forms clusters from n given vertices and orders the clusters. In the local ordering, the elements in each cluster are linearly ordered. The linear order thus produced is used to obtain an optimal κ-way partitioning based on the scaled cost objective function. When the number of clusters is one, the proposed algorithm is exactly the same as MELO [2]. With clustering, however, the proposed algorithm has more global partitioning information than MELO. Experiments with 11 benchmark circuits for κ-way (2 ≤ κ ≤ 10) partitioning show that the proposed algorithm yields an average improvement of 10.6% over MELO [2] for κ-way scaled cost partitioning.
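The step from a linear order to a κ-way partitioning can be illustrated with a small dynamic program. The sketch below is not the authors' implementation. It assumes the commonly used definition of scaled cost, (1 / (n(κ-1))) Σ |E_i| / |P_i|, where E_i is the set of nets cut by block P_i, and it splits a given order into κ contiguous blocks that minimize this value. The names `external_nets` and `best_split` are hypothetical.

```python
from functools import lru_cache

def external_nets(nets, block):
    """Count nets with at least one pin inside `block` and at least one outside."""
    block = set(block)
    return sum(1 for net in nets if 0 < len(block & set(net)) < len(set(net)))

def best_split(order, nets, k):
    """Split `order` into k contiguous blocks minimising the (assumed) scaled cost.

    Requires 2 <= k <= len(order). Returns (scaled_cost, blocks).
    """
    n = len(order)

    @lru_cache(maxsize=None)
    def seg(i, j):
        # Contribution |E_i| / |P_i| of the block order[i:j].
        return external_nets(nets, order[i:j]) / (j - i)

    @lru_cache(maxsize=None)
    def solve(i, parts):
        # Cheapest way to split order[i:] into `parts` non-empty blocks.
        if parts == 1:
            return seg(i, n), (n,)
        best = (float("inf"), ())
        for j in range(i + 1, n - parts + 2):
            rest_cost, rest_ends = solve(j, parts - 1)
            total = seg(i, j) + rest_cost
            if total < best[0]:
                best = (total, (j,) + rest_ends)
        return best

    total, ends = solve(0, k)
    blocks, start = [], 0
    for end in ends:
        blocks.append(order[start:end])
        start = end
    return total / (n * (k - 1)), blocks

# Two triangles joined by a single net (c, d); the order is assumed given.
order = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "b"), ("b", "c"), ("a", "c"),
        ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
cost, blocks = best_split(order, nets, 2)
```

For this toy netlist the split falls between the two triangles, which is the only place where a single net is cut.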
SAPICE: A Design Tool of CMOS Operational Amplifiers
Sang-Dae YU Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E80-A, No: 9, Page(s): 1667-1675
Based on a new search strategy using circuit simulation and simulated annealing with local search, a design tool is proposed to automate the design or tuning process for CMOS operational amplifiers. A special-purpose circuit simulator and some heuristics are used to accomplish the design within a reasonable time. For arbitrary circuit topologies and specifications, the discrete optimization of the cost function is performed by global and local search. Through a comparison of design results and the design of a low-power high-speed CMOS operational amplifier usable in 10-b 25-MHz pipelined A/D converters, it has been demonstrated that this tool can be used for designing high-performance operational amplifiers with less design knowledge and effort.
Hardware Accelerator for Outline Font Generation
Gyu-Cheol HWANG In-Cheol PARK Yun-Tae LEE Tae-Hyung LEE Jong-Hong BAE Chong-Min KYUNG

PAPER-VLSI Design Technology

Vol: E74-A, No: 10, Page(s): 3078-3082
Translation of scalable outline font data, represented by a set of control points of cubic Bezier curves, etc., into bitmap data for desk-top publishing (DTP) applications requires a significant amount of computation. In this paper, we propose a special-purpose chip called KAFOG for the high-speed generation of bitmap fonts from a Hangul PostScript file for screen display as well as LBP (Laser Beam Printer) output. The KAFOG chip was implemented in a 1.5 µm CMOS gate array using 17 K gates. The computation throughput of the KAFOG chip is 250 K cubic Bezier curve segments (each curve segment is composed of four control points) per second at a clock frequency of 40 MHz.
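To make the per-segment workload concrete, the following sketch evaluates one cubic Bezier segment from its four control points using de Casteljau's algorithm. It is a software illustration only and says nothing about how the KAFOG hardware computes the curve. The function names and the `steps` parameter are assumptions.

```python
def lerp(p, q, t):
    """Linear interpolation between 2-D points p and q."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier segment at t in [0, 1] (de Casteljau)."""
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)

def flatten(p0, p1, p2, p3, steps=8):
    """Approximate the segment by `steps` straight pieces (steps + 1 points)."""
    return [cubic_bezier(p0, p1, p2, p3, i / steps) for i in range(steps + 1)]

# One segment with four control points, as in an outline font description.
points = flatten((0, 0), (0, 1), (1, 1), (1, 0))
```

The resulting polyline is what a rasterizer would then scan-convert into bitmap pixels.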
An Efficient Routing Algorithm for Symmetrical FPGAs Using Reliable Cost Metrics
Nak-Woong EUM Inhag PARK Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E84-A, No: 3, Page(s): 829-838
This paper presents a new performance and routability-driven routing algorithm for symmetrical array-based field-programmable gate arrays (FPGAs). The contribution of our work is to overcome one of the most critical limitations of the previous routing algorithms: inaccurate estimations of routing density which were too general for symmetrical FPGA. To this end, we devised new routing density measures that are directly linked to the structure (switch block) of symmetrical FPGA, and utilize them consistently in global and detailed routings. With the use of the proposed accurate routing metrics, we developed a new routing algorithm called a reliable net decomposition-based routing which is very fast, and yet produces excellent routing results in terms of net/path delays and routability. An extensive experiment was carried out to show the effectiveness of our algorithm based on the proposed cost metrics. In summary, when compared to the best known results in the literature (TRACER-fpga_PR and SEGA), our algorithm has shown 31.9% shorter longest path delay and 23.0% shorter longest net delay even with about 9 times faster execution time.
Fast Precise Interrupt Handling without Associative Searching in Multiple Out-Of-Order Issue Processors
Sang-Joon NAM In-Cheol PARK Chong-Min KYUNG

PAPER-Computer Hardware and Design

Vol: E82-D, No: 3, Page(s): 645-653
This paper presents a new approach to the precise interrupt handling problem in modern processors with multiple out-of-order issues. It is difficult to implement a precise interrupt scheme in the processors because later instructions may change the process states before their preceding instructions have completed. We propose a fast precise interrupt handling scheme which can recover the precise state in one cycle if an interrupt occurs. In addition, the scheme removes all the associative searching operations which are inevitable in the previous approaches. To deal with the renaming of destination registers, we present a new bank-based register file which is indexed by bank index tables containing the bank identifiers of renamed register entries. Simulation results based on the superscalar MIPS architecture show that the register file with 3 banks is a good trade-off between high performance and low complexity.
Improving Dictionary-Based Code Compression in VLIW Architectures
Sang-Joon NAM In-Cheol PARK Chong-Min KYUNG

PAPER

Vol: E82-A, No: 11, Page(s): 2318-2324
Reducing code size is crucial in embedded systems as well as in high-performance systems to overcome the communication bottleneck between memory and CPU, especially with VLIW (Very Long Instruction Word) processors that require high-bandwidth instruction prefetching. This paper presents a new approach for dictionary-based code compression in VLIW processor-based systems using isomorphism among instruction words. After we divide instruction words into two groups, an opcode group and an operand group, the proposed compression algorithm is applied to each group for maximal code compression. Frequently used instruction words are extracted from the original code and mapped into two dictionaries, an opcode dictionary and an operand dictionary. According to the SPEC95 benchmarks, the proposed technique has achieved an average code compression ratio of 63%, 69%, and 71% in a 4-issue, 8-issue, and 12-issue VLIW architecture, respectively.
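The two-dictionary idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's algorithm: it ignores the isomorphism analysis, treats each instruction as an (opcode, operand) pair, and keeps the most frequent values of each field in a fixed-size dictionary. All function names and the toy program are hypothetical.

```python
from collections import Counter

def build_dictionary(fields, size):
    """Keep the `size` most frequent field values; list index = dictionary code."""
    return [value for value, _ in Counter(fields).most_common(size)]

def encode(fields, dictionary):
    """Replace dictionary hits by ("D", index); keep misses as ("R", raw value)."""
    index = {value: i for i, value in enumerate(dictionary)}
    return [("D", index[f]) if f in index else ("R", f) for f in fields]

def decode(codes, dictionary):
    """Inverse of encode()."""
    return [dictionary[x] if tag == "D" else x for tag, x in codes]

# Toy program: each instruction is an (opcode, operand) pair.
program = [("add", "r1,r2"), ("add", "r3,r4"), ("ld", "r1,r2"), ("add", "r1,r2")]
opcodes = [op for op, _ in program]
operands = [arg for _, arg in program]

opcode_dict = build_dictionary(opcodes, 1)    # separate dictionary per field
operand_dict = build_dictionary(operands, 1)
packed_ops = encode(opcodes, opcode_dict)
packed_args = encode(operands, operand_dict)
restored = list(zip(decode(packed_ops, opcode_dict),
                    decode(packed_args, operand_dict)))
```

Splitting the fields lets a frequent opcode share one dictionary entry even when its operands differ, which is the intuition behind compressing the two groups separately.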
Path-Classified Trace Cache for Improving Hit Ratio in Wide-Issue Processors
Jin-Hyuk YANG In-Cheol PARK Chong-Min KYUNG

PAPER-Computer Hardware and Design

Vol: E82-D, No: 10, Page(s): 1338-1343
In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing the path conflict and basic blocks are joined to reduce the hardware cost needed to implement the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.
Synthesis of Application-Specific Coprocessor for Core-Based ASIC Design
Dae-Hyun LEE In-Cheol PARK Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E84-A, No: 2, Page(s): 604-613
This paper presents an efficient approach to a hardware/software partitioning problem: synthesis of an application-specific coprocessor which accelerates embedded software running on a main processor. Given a set of data flow graphs (DFGs), most previous hardware/software partitioning approaches have focused on mapping DFGs to hardware or software. Their common weaknesses are that 1) they ignore various implementation alternatives in realizing DFGs as hardware, based on the assumption that only a single hardware implementation exists for a DFG, and that 2) they don't consider the effect of merging on hardware area when synthesizing a coprocessor by merging DFGs. To deal with the first issue, we formulate both the mapping of DFGs to hardware or software and the selection of the appropriate hardware implementation for each DFG as a single integer programming problem, and then apply an iterative algorithm based on Kernighan and Lin's heuristic to solve the problem. To reduce the CPU time, we have devised data structures that quickly calculate the costs of hardware implementations. To deal with the second issue, our method links DFGs with dummy nodes to produce a single large DFG, and then synthesizes a target coprocessor by globally scheduling the DFG and allocating its datapath. Experimental results demonstrate that our approach outperforms the previous approach based on a genetic algorithm (GA) in both the coprocessor area and the CPU time.
SEWD: A Cache Architecture to Speed up the Misaligned Instruction Prefetch
Joon-Seo YIM In-Cheol PARK Chong-Min KYUNG

LETTER-Computer Hardware and Design

Vol: E80-D, No: 7, Page(s): 742-745
In microprocessors, reducing the cache access delay and the number of pipeline stalls is critical to improving system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by misaligned multi-word data or instruction prefetches which are placed over two cache lines. The SEWD cache makes it possible to perform a misaligned prefetch as well as an aligned prefetch in one clock cycle. This feature is invaluable because branch target addresses are very often misaligned (the percentage of misalignment in the cache is 8 to 13% for 16-byte caches). An 8-Kbyte SEWD cache chip was implemented in a 0.8 µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.853 × 0.827 cm².
CLASSIC: An O(n²)-Heuristic Algorithm for Microcode Bit Optimization Based on Incompleteness Relations
Young-doo CHOI In-Cheol PARK Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E83-A, No: 5, Page(s): 901-908
This paper presents a heuristic algorithm called CLASSIC for the minimization of the control memory width in microprogrammed processors or the instruction memory width of application-specific VLIW (Very Long Instruction Word) processors. CLASSIC results in nearly optimal solutions with the time complexity of O(n²), where n denotes the number of microoperations. In this paper, we also propose the so-called incompleteness relations which are exploited for the minimization of the control memory width. Experiments using various examples have shown that CLASSIC always achieves smaller microprogram widths compared to the earlier techniques based on the maximal compatibility class or the minimal AND/OR set. The results show that CLASSIC can reduce the control memory width by 34.2% on average compared with a heuristic compatibility class algorithm.
A Hierarchical Circuit Clustering Algorithm with Stable Performance
Seung-June KYOUNG Kwang-Su SEONG In-Cheol PARK Chong-Min KYUNG

LETTER-VLSI Design Technology and CAD

Vol: E82-A, No: 9, Page(s): 1987-1993
Clustering is almost essential in improving the performance of iterative partitioning algorithms. In this paper, we present a clustering algorithm based on the following observation: if a group of cells is assigned to the same partition in numerous local optimum solutions, it is desirable to merge the group into a cluster. The proposed algorithm finds such a group of cells from randomly generated local optimum solutions and merges it into a cluster. We implemented a multilevel bipartitioning algorithm (MBP) based on the proposed clustering algorithm. For MCNC benchmark netlists, MBP improves the total average cut size by 9% and the total best cut size by 3-4%, compared with the previous state-of-the-art partitioners.
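The observation behind the clustering step can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: the abstract speaks of cells that share a partition in "numerous" local optima, whereas the sketch assumes the stricter rule that cells must agree in every supplied solution. Each solution is assumed to be a list of 0/1 side labels, one per cell, and the name `stable_clusters` is hypothetical.

```python
from collections import defaultdict

def stable_clusters(solutions):
    """Group cells that land on the same side in every bipartition solution.

    solutions: list of lists of 0/1 labels, all of the same length.
    Side labels are interchangeable, so each solution is normalised
    relative to the side of cell 0 before comparing.
    """
    n = len(solutions[0])
    groups = defaultdict(list)
    for cell in range(n):
        signature = tuple(sol[cell] ^ sol[0] for sol in solutions)
        groups[signature].append(cell)
    return list(groups.values())

# Three hypothetical local optima for a 4-cell netlist.
solutions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same cut as the first, with side labels swapped
    [0, 0, 0, 1],
]
clusters = stable_clusters(solutions)
```

Here cells 0 and 1 always stay together and would be merged, while cells 2 and 3 are separated by the third solution and remain singletons.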
A New Single-Clock Flip-Flop for Half-Swing Clocking
Young-Su KWON In-Cheol PARK Chong-Min KYUNG

PAPER

Vol: E82-A, No: 11, Page(s): 2521-2526
A new flip-flop configuration for half-swing clocking is proposed to save total clocking power. In the proposed scheme, only NMOS transistors are clocked with the half-swing clock so that the flip-flop operates without the level converters or additional logic used in earlier half-swing clocking schemes. Vcc is supplied to the random logic circuits and flip-flops, while Vcc/2 is supplied to the clock network and some parts of the flip-flop to reduce the power consumed in the clock network. Compared to the conventional scheme, the proposed flip-flop configuration can save the clocking power by 40%.
Issues on the Interface Synthesis between Intellectual Properties Operating at Different Clock Frequencies
Bong-Il PARK Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E85-A, No: 8, Page(s): 1937-1945
In SoC (system-on-a-chip) design, interfacing among IP (Intellectual Property) blocks is one of the most important issues. Since most IP's are provided by different vendors, they generally have different interface schemes and different operating frequencies. In this paper, we propose a new interface synthesis method with two features: 1) generation of the interface between IP's with different operating frequencies, and 2) minimization of the hardware resource required for the interface. We have demonstrated the proposed algorithm through its application to an MP3 decoder design example, where the IIS (Inter-IC Sound)-to-PCI (Peripheral Component Interconnect) protocol converter was successfully implemented using the proposed method.
A Supplementary Scheme for Reducing Cache Access Time
Jong-Hong BAE Chong-Min KYUNG

LETTER-Computer Hardware and Design

Vol: E79-D, No: 4, Page(s): 385-387
Among the three factors that mainly affect the cache access time, i.e., hit access time, miss rate and miss penalty, previous approaches focused on reducing the hit access time and the miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache) which achieves an additional reduction of the average instruction cache access time by reducing the miss penalty. The MPC scheme, which predicts cache misses and starts cache miss operations in advance, is therefore supplementary to previous cache schemes targeted at reducing the miss rate and/or the hit access time. The performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with the silicon area estimated using a 0.8 µm CMOS standard cell library.
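The three factors named above combine in the standard textbook expression for average access time: hit time plus miss rate times miss penalty. The sketch below uses that general formula, not the paper's evaluation model, and all numbers are hypothetical. It shows why lowering only the miss penalty still lowers the average.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical figures: 1-cycle hit, 5% miss rate.
baseline = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=20)
# Starting the miss operation early is modelled here as a smaller penalty.
with_prediction = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=12)
saving = baseline - with_prediction
```

Because the penalty term is independent of the other two, a scheme that shortens it can be stacked on top of techniques that target hit time or miss rate.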
Fast Image Generation Method for Animation
Jin-Han KIM Chong-Min KYUNG

PAPER-Combinational/Numerical/Graphic Algorithms

Vol: E75-A, No: 6, Page(s): 691-700
A fast scan-line algorithm for a raster-scan graphics display is proposed based on the observation that a sequence of successive image frames in animation mostly consists of still objects with relatively few moving objects. In the proposed algorithm, successive images are generated using a background image composed of still objects only and a moving image composed of moving objects only. The color of each pixel in the successive images is then taken from whichever of two candidate pixels is nearer to the eye, where one candidate is from the background image and the other is from the moving image. The background image is generated once in the whole process, while the moving image is generated for each time frame using an interpolation of two images generated at the start and end times of the given time interval. For the purpose of fast shadow generation, we classify the shadows into three groups, i.e., still shadows generated by still objects on still objects, moving shadows generated by moving objects on still objects, and composite shadows generated by both still objects and moving objects on moving objects. These shadows can be generated very quickly by utilizing the frame coherence. According to the experimental results, a speed-up factor of 3.2 to 12.8, depending on the percentage of moving objects among all objects, was obtained using our algorithm, compared to the conventional scheme not utilizing the frame-to-frame image coherence.
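The per-pixel selection between the background image and the moving image amounts to a depth comparison. The sketch below is a minimal illustration under assumed data structures: each image is a list of rows, each pixel is a (color, depth) pair with a smaller depth meaning nearer to the eye, and a moving-image pixel is `None` where no moving object covers it. The function name `composite` is hypothetical, and interpolation and shadows are left out.

```python
def composite(background, moving):
    """Pick, per pixel, the color of whichever candidate is nearer to the eye."""
    frame = []
    for bg_row, mv_row in zip(background, moving):
        row = []
        for (bg_color, bg_depth), mv in zip(bg_row, mv_row):
            if mv is not None and mv[1] < bg_depth:
                row.append(mv[0])      # moving object is in front
            else:
                row.append(bg_color)   # still object (or empty moving pixel)
        frame.append(row)
    return frame

# The background is rendered once; only `moving` changes from frame to frame.
background = [[("sky", 9.0), ("wall", 2.0), ("sky", 9.0)]]
moving = [[("ball", 5.0), ("ball", 5.0), None]]
frame = composite(background, moving)
```

Only the moving image has to be regenerated for each frame, which is where the reported speed-up comes from when few objects move.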
Spread Omega Network for High Speed Packet Switching
H. C. LEE Chong-Min KYUNG

LETTER-Switching and Communication Processing

Vol: E80-B, No: 1, Page(s): 192-195
A network with input and output buffers is proposed. It consists of several switching stages composed of 3×3 basic switching elements which are connected with perfect shuffle and horizontal connections. The proposed network reduces the required number of stages, and increases the fault tolerance due to its highly regular connection scheme. Its performance was evaluated by computer simulation under a bursty traffic environment. For a 128×128 switch with 11 switching stages, a packet loss ratio of 10⁻⁶ was obtained when the input load is 0.8 and the burstiness is 10.
An Automatic Interface Insertion Scheme for In-System Verification of Algorithm Models in C
Chang-Jae PARK Ando KI In-Cheol PARK Chong-Min KYUNG

PAPER-High Level Synthesis

Vol: E85-A, No: 12, Page(s): 2645-2654
This paper describes an automatic interface insertion scheme for in-system verification of algorithm models. To insert the interface, an algorithm model described in C is translated into another source code that includes the communication with hardware components in the target system to be validated with the algorithm model. The communication between the algorithm model and hardware components is achieved using transactors that perform transformation between access operations and bus cycle transactions. An I/O terminal is introduced as an interface model to relate the transactions to access operations during the execution of the algorithm model, i.e., accesses to I/O terminals invoke bus cycle transactions in hardware and vice versa. An automatic interface insertion tool is developed using source-to-source translation to identify the I/O terminals and insert interface function calls in the source code. The proposed automatic interface insertion scheme is validated by emulating several multimedia algorithms written in C on real target systems.
Power Minimization for Dual- and Triple-Supply Digital Circuits via Integer Linear Programming
Ki-Yong AHN Chong-Min KYUNG

PAPER-VLSI Design Technology and CAD

Vol: E92-A, No: 9, Page(s): 2318-2325
This paper proposes an Integer Linear Programming (ILP)-based power minimization method that partitions a circuit into regions, first with three different VDD's (PM3V) and second with two different VDD's (PM2V). To reduce the solving time of the triple-VDD case (PM3V), we also propose a partitioned ILP method (p-PM3V). The proposed method provides 29% power saving on average in the triple-VDD case compared to the single-VDD case. The power reduction of PM3V compared to Clustered Voltage Scaling (CVS) was about 18%. Compared to the unpartitioned ILP formulation (PM3V), the partitioned ILP method (p-PM3V) reduced the total solution time by 46% at the cost of additional power consumption within 1.3%.
