Author Search Result

[Author] Chong-Min KYUNG(18hit)

1-18hit
  • A Clustering Based Linear Ordering Algorithm for Netlist Partitioning

    Kwang-Su SEONG  Chong-Min KYUNG  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E79-A No:12
      Page(s):
    2185-2191

    In this paper, we propose a clustering based linear ordering algorithm which consists of global ordering and local ordering. In the global ordering, the algorithm forms clusters from n given vertices and orders the clusters. In the local ordering, the elements in each cluster are linearly ordered. The linear order thus produced is used to obtain optimal κ-way partitioning based on the scaled cost objective function. When the number of clusters is one, the proposed algorithm is exactly the same as MELO [2]. With more than one cluster, however, the proposed algorithm has more global partitioning information than MELO. Experiments with 11 benchmark circuits for κ-way (2 ≤ κ ≤ 10) partitioning show that the proposed algorithm yields an average of 10.6% improvement over MELO [2] for the κ-way scaled cost partitioning.
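
How a linear order can be turned into a κ-way partition is easiest to see in code. The following is a minimal sketch, not the authors' implementation. It assumes the scaled cost has the commonly used form (1 / (n(κ-1))) · Σ |E_h| / |C_h|, where E_h is the set of nets cut by block C_h; the abstract does not spell this out. It also assumes dynamic programming over contiguous splits of the ordering. The names `cut_nets` and `best_split` and the toy netlist are hypothetical.

```python
def cut_nets(nets, block):
    """Count nets that have pins both inside and outside `block`."""
    block = set(block)
    return sum(1 for net in nets if 0 < len(block & set(net)) < len(net))


def best_split(order, nets, k):
    """Split `order` into k contiguous blocks minimizing the assumed scaled cost.

    Requires k >= 2 (the scaling factor divides by k - 1).
    """
    n = len(order)
    inf = float("inf")

    def block_cost(i, j):
        # Contribution |E_h| / |C_h| of the block order[i:j].
        return cut_nets(nets, order[i:j]) / (j - i)

    # best[p][j]: minimum summed block cost for order[:j] split into p blocks.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                if best[p - 1][i] == inf:
                    continue
                cost = best[p - 1][i] + block_cost(i, j)
                if cost < best[p][j]:
                    best[p][j], back[p][j] = cost, i

    # Walk the back-pointers to recover the blocks.
    blocks, j = [], n
    for p in range(k, 0, -1):
        i = back[p][j]
        blocks.append(order[i:j])
        j = i
    blocks.reverse()
    return best[k][n] / (n * (k - 1)), blocks


# Toy netlist: two triangles joined by the single net (2, 3).
nets = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
cost, blocks = best_split([0, 1, 2, 3, 4, 5], nets, k=2)
```

On this toy input the split falls between the two triangles, since that is the only place where each block cuts a single net.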

  • SAPICE: A Design Tool of CMOS Operational Amplifiers

    Sang-Dae YU  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E80-A No:9
      Page(s):
    1667-1675

    Based on a new search strategy using circuit simulation and simulated annealing with local search, a design tool is proposed to automate the design or tuning process for CMOS operational amplifiers. A special-purpose circuit simulator and some heuristics are used to accomplish the design within reasonable time. For arbitrary circuit topology and specifications, the discrete optimization of the cost function is performed by global and local search. Through the comparison of design results and the design of a low-power high-speed CMOS operational amplifier usable in 10-b 25-MHz pipelined A/D converters, it has been demonstrated that this tool can be used for designing high-performance operational amplifiers with less design knowledge and effort.

  • Hardware Accelerator for Outline Font Generation

    Gyu-Cheol HWANG  In-Cheol PARK  Yun-Tae LEE  Tae-Hyung LEE  Jong-Hong BAE  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology

      Vol:
    E74-A No:10
      Page(s):
    3078-3082

    Translation of the scalable outline font data as represented by a set of control points of the cubic Bezier curve, etc. into the bitmap data for desk-top publishing (DTP) applications requires a significant amount of computation. In this paper, we propose a special purpose chip called KAFOG for the high-speed generation of bitmap font from the Hangul PostScript file for screen display as well as LBP (Laser Beam Printer) output. KAFOG chip was implemented in 1.5 µm CMOS gate array using 17 K gates. The computation throughput of the KAFOG chip is 250 K cubic Bezier curve segments (each curve segment is composed of four control points) per second at the clock frequency of 40 MHz.

  • An Efficient Routing Algorithm for Symmetrical FPGAs Using Reliable Cost Metrics

    Nak-Woong EUM  Inhag PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E84-A No:3
      Page(s):
    829-838

    This paper presents a new performance and routability-driven routing algorithm for symmetrical array-based field-programmable gate arrays (FPGAs). The contribution of our work is to overcome one of the most critical limitations of the previous routing algorithms: inaccurate estimations of routing density which were too general for symmetrical FPGA. To this end, we devised new routing density measures that are directly linked to the structure (switch block) of symmetrical FPGA, and utilize them consistently in global and detailed routings. With the use of the proposed accurate routing metrics, we developed a new routing algorithm called a reliable net decomposition-based routing which is very fast, and yet produces excellent routing results in terms of net/path delays and routability. An extensive experiment was carried out to show the effectiveness of our algorithm based on the proposed cost metrics. In summary, when compared to the best known results in the literature (TRACER-fpga_PR and SEGA), our algorithm has shown 31.9% shorter longest path delay and 23.0% shorter longest net delay even with about 9 times faster execution time.

  • Fast Precise Interrupt Handling without Associative Searching in Multiple Out-Of-Order Issue Processors

    Sang-Joon NAM  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-Computer Hardware and Design

      Vol:
    E82-D No:3
      Page(s):
    645-653

    This paper presents a new approach to the precise interrupt handling problem in modern processors with multiple out-of-order issues. It is difficult to implement a precise interrupt scheme in the processors because later instructions may change the process states before their preceding instructions have completed. We propose a fast precise interrupt handling scheme which can recover the precise state in one cycle if an interrupt occurs. In addition, the scheme removes all the associative searching operations which are inevitable in the previous approaches. To deal with the renaming of destination registers, we present a new bank-based register file which is indexed by bank index tables containing the bank identifiers of renamed register entries. Simulation results based on the superscalar MIPS architecture show that the register file with 3 banks is a good trade-off between high performance and low complexity.

  • Improving Dictionary-Based Code Compression in VLIW Architectures

    Sang-Joon NAM  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER

      Vol:
    E82-A No:11
      Page(s):
    2318-2324

    Reducing code size is crucial in embedded systems as well as in high-performance systems to overcome the communication bottleneck between memory and CPU, especially with VLIW (Very Long Instruction Word) processors that require a high-bandwidth instruction prefetching. This paper presents a new approach for dictionary-based code compression in VLIW processor-based systems using isomorphism among instruction words. After we divide instruction words into two groups, one for opcode group and the other for operand group, the proposed compression algorithm is applied to each group for maximal code compression. Frequently-used instruction words are extracted from the original code to be mapped into two dictionaries, an opcode dictionary and an operand dictionary. According to the SPEC95 benchmarks, the proposed technique has achieved an average code compression ratio of 63%, 69%, and 71% in a 4-issue, 8-issue, and 12-issue VLIW architecture, respectively.
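
The two-dictionary idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it ignores the isomorphism among instruction words that the paper exploits, and it assumes each instruction word is already split into an (opcode, operand) pair. All function names, the `dict_size` parameter, and the toy program are hypothetical.

```python
from collections import Counter


def build_dictionary(values, size):
    """Keep the `size` most frequent values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(size)]


def encode_field(value, dictionary):
    # Frequent values become a short dictionary index; others stay raw.
    if value in dictionary:
        return ("idx", dictionary.index(value))
    return ("raw", value)


def decode_field(token, dictionary):
    kind, payload = token
    return dictionary[payload] if kind == "idx" else payload


def compress(program, dict_size):
    """Compress opcode and operand fields with separate dictionaries."""
    op_dict = build_dictionary([op for op, _ in program], dict_size)
    arg_dict = build_dictionary([arg for _, arg in program], dict_size)
    encoded = [
        (encode_field(op, op_dict), encode_field(arg, arg_dict))
        for op, arg in program
    ]
    return encoded, op_dict, arg_dict


def decompress(encoded, op_dict, arg_dict):
    return [
        (decode_field(op, op_dict), decode_field(arg, arg_dict))
        for op, arg in encoded
    ]


program = [
    ("add", "r1,r2"),
    ("add", "r3,r4"),
    ("mul", "r1,r2"),
    ("add", "r1,r2"),
    ("ld", "r5,0"),
]
encoded, op_dict, arg_dict = compress(program, dict_size=2)
restored = decompress(encoded, op_dict, arg_dict)
```

Keeping the two fields in separate dictionaries lets a frequent opcode share one entry across many different operand patterns, and vice versa.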

  • Path-Classified Trace Cache for Improving Hit Ratio in Wide-Issue Processors

    Jin-Hyuk YANG  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-Computer Hardware and Design

      Vol:
    E82-D No:10
      Page(s):
    1338-1343

    In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing the path conflict and basic blocks are joined to reduce the hardware cost needed to implement the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.

  • Synthesis of Application-Specific Coprocessor for Core-Based ASIC Design

    Dae-Hyun LEE  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E84-A No:2
      Page(s):
    604-613

    This paper presents an efficient approach for a hardware/software partitioning problem: synthesis of an application-specific coprocessor which accelerates embedded software running on a main processor. Given a set of data flow graphs (DFGs), most previous hardware/software partitioning approaches have focused on mapping DFGs to hardware or software. Their common weaknesses are that 1) they ignore various implementation alternatives in realizing DFGs as hardware based on the assumption that only a single hardware implementation exists for a DFG, and that 2) they do not consider the effect of merging on hardware area when synthesizing a coprocessor by merging DFGs. To deal with the first issue, we formulate both the mapping of DFGs to hardware or software and the selection of the appropriate hardware implementation for each DFG as a single integer programming problem, and then apply an iterative algorithm based on Kernighan and Lin's heuristic to solve the problem. To reduce the CPU time, we have devised data structures that quickly calculate costs of hardware implementations. To deal with the second issue, our method links DFGs with dummy nodes to produce a single large DFG, and then synthesizes a target coprocessor by globally scheduling the DFG and allocating its datapath. Experimental results demonstrate that our approach outperforms the previous approach based on a genetic algorithm (GA) in both the coprocessor area and the CPU time.

  • SEWD: A Cache Architecture to Speed up the Misaligned Instruction Prefetch

    Joon-Seo YIM  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design

      Vol:
    E80-D No:7
      Page(s):
    742-745

    In microprocessors, reducing the cache access delay and the number of pipeline stalls is critical to improving the system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by misaligned multi-word data or instruction prefetches which are placed over two cache lines. The SEWD cache makes it possible to perform misaligned prefetch as well as aligned prefetch in one clock cycle. This feature is invaluable because the branch target addresses are very often misaligned (the percentage of misalignment in the cache is 8 to 13% for 16-byte caches). An 8-Kbyte SEWD cache chip was implemented in a 0.8 µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.853 × 0.827 cm².
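
The notion of a misaligned prefetch, one that is placed over two cache lines, comes down to simple address arithmetic. The sketch below models only that arithmetic and says nothing about the SEWD hardware. The 16-byte line size follows the figure quoted in the abstract, while the 16-byte fetch width and the function names are assumptions.

```python
def lines_touched(address, fetch_bytes, line_bytes=16):
    """Return the indices of the cache lines covered by one fetch."""
    first = address // line_bytes
    last = (address + fetch_bytes - 1) // line_bytes
    return list(range(first, last + 1))


def is_misaligned(address, fetch_bytes, line_bytes=16):
    """A fetch is misaligned here if it spans more than one cache line."""
    return len(lines_touched(address, fetch_bytes, line_bytes)) > 1


aligned = lines_touched(0x10, 16)     # starts on a line boundary
straddling = lines_touched(0x18, 16)  # starts in the middle of a line
```

A conventional cache would need two accesses for the second fetch, which is the stall the SEWD cache is designed to avoid.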

  • CLASSIC: An O(n²)-Heuristic Algorithm for Microcode Bit Optimization Based on Incompleteness Relations

    Young-doo CHOI  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E83-A No:5
      Page(s):
    901-908

    This paper presents a heuristic algorithm called CLASSIC for the minimization of the control memory width in microprogrammed processors or the instruction memory width of application-specific VLIW (Very Long Instruction Word) processors. CLASSIC results in nearly optimal solutions with the time complexity of O(n²), where n denotes the number of microoperations. In this paper, we also propose the so-called incompleteness relations which are exploited for the minimization of the control memory width. Experiments using various examples have shown that CLASSIC always achieves smaller microprogram widths compared to the earlier techniques based on the maximal compatibility class or the minimal AND/OR set. The results show that CLASSIC can reduce the control memory width by 34.2% on average compared with a heuristic compatibility class algorithm.

  • A Hierarchical Circuit Clustering Algorithm with Stable Performance

    Seung-June KYOUNG  Kwang-Su SEONG  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E82-A No:9
      Page(s):
    1987-1993

    Clustering is almost essential in improving the performance of iterative partitioning algorithms. In this paper, we present a clustering algorithm based on the following observation: if a group of cells is assigned to the same partition in numerous local optimum solutions, it is desirable to merge the group into a cluster. The proposed algorithm finds such a group of cells from randomly generated local optimum solutions and merges it into a cluster. We implemented a multilevel bipartitioning algorithm (MBP) based on the proposed clustering algorithm. For MCNC benchmark netlists, MBP improves the total average cut size by 9% and the total best cut size by 3-4%, compared with the previous state-of-the-art partitioners.
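
The observation behind the clustering step can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' procedure: it merges only cells that share a partition in every sampled solution (the abstract says "numerous"), it takes the local optimum bipartitions as given rather than generating them, and the names `normalize` and `stable_groups` are hypothetical.

```python
from collections import defaultdict


def normalize(solution):
    """Flip side labels so that cell 0 is always on side 0.

    A bipartition and its complement describe the same cut, so they
    should not be treated as different assignments.
    """
    if solution[0] == 0:
        return list(solution)
    return [1 - side for side in solution]


def stable_groups(solutions):
    """Group cells that land on the same side in every given solution."""
    normalized = [normalize(s) for s in solutions]
    groups = defaultdict(list)
    for cell in range(len(normalized[0])):
        signature = tuple(s[cell] for s in normalized)
        groups[signature].append(cell)
    # Only groups with at least two cells are worth merging into a cluster.
    return [cells for cells in groups.values() if len(cells) > 1]


# Three hypothetical local optimum bipartitions of six cells.
solutions = [
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
clusters = stable_groups(solutions)
```

Cell 4 changes sides between solutions, so it stays unclustered, while the cells that always travel together are merged.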

  • A New Single-Clock Flip-Flop for Half-Swing Clocking

    Young-Su KWON  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER

      Vol:
    E82-A No:11
      Page(s):
    2521-2526

    A new flip-flop configuration for half-swing clocking is proposed to save total clocking power. In the proposed scheme, only NMOS's are clocked with the half-swing clock in order to make it operate without level converters or any additional logics which were used in the earlier half-swing clocking schemes. Vcc is supplied to the random logic circuits and flip-flops while Vcc/2 is supplied to the clock network and some parts of the flip-flop to reduce the power consumed in the clock network. Compared to the conventional scheme, the proposed flip-flop configuration can save the clocking power by 40%.

  • Issues on the Interface Synthesis between Intellectual Properties Operating at Different Clock Frequencies

    Bong-Il PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E85-A No:8
      Page(s):
    1937-1945

    In SoC (system-on-a-chip) design, interfacing among IP (Intellectual Property) blocks is one of the most important issues. Since most IP's are provided by different vendors, they generally have different interface schemes and different operating frequencies. In this paper, we propose a new interface synthesis method with two features: 1) generation of the interface between IP's with different operating frequencies, and 2) minimization of the hardware resource required for the interface. We have demonstrated the proposed algorithm through its application to an MP3 decoder design example, where the IIS (Inter-IC Sound)-to-PCI (Peripheral Component Interconnect) protocol converter was successfully implemented using the proposed method.

  • A Supplementary Scheme for Reducing Cache Access Time

    Jong-Hong BAE  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design

      Vol:
    E79-D No:4
      Page(s):
    385-387

    Among the three factors mainly affecting the cache access time, i.e., hit access time, miss rate and miss penalty, previous approaches were focused on reducing the hit access time and miss rate. In this paper, we propose a scheme called MPC (Miss-Predicting Cache) which achieves additional reduction of the average instruction cache access time through reducing the miss penalty. The MPC scheme, which predicts cache misses and starts cache miss operations in advance, is therefore supplementary to previous cache schemes targeted at reducing the miss rate and/or hit access time. Performance of the MPC scheme was evaluated using dinero, a trace-driven cache simulator, with the estimation of silicon area using a 0.8 µm CMOS standard cell library.
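
The relationship among the three factors is the standard average access time formula: hit access time plus miss rate times miss penalty. The short sketch below shows why reducing the miss penalty complements the other two levers. The cycle counts and miss rate are illustrative numbers only, not results from the paper.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Illustrative values in clock cycles; not taken from the paper.
baseline = average_access_time(hit_time=1.0, miss_rate=0.05, miss_penalty=20.0)
reduced_penalty = average_access_time(hit_time=1.0, miss_rate=0.05, miss_penalty=12.0)
```

With the hit time and miss rate held fixed, any cycles removed from the miss penalty reduce the average in proportion to the miss rate.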

  • Fast Image Generation Method for Animation

    Jin-Han KIM  Chong-Min KYUNG  

     
    PAPER-Combinational/Numerical/Graphic Algorithms

      Vol:
    E75-A No:6
      Page(s):
    691-700

    A fast scan-line algorithm for a raster-scan graphics display is proposed based on the observation that a sequence of successive image frames in animation mostly consists of still objects with relatively few moving objects. In the proposed algorithm, successive images are generated using the background image, composed only of still objects, and the moving image, composed only of moving objects. The color of each pixel in the successive images is then determined by whichever of the two candidate pixels is nearer to the eye, where one candidate is from the background image and the other is from the moving image. The background image is generated once in the whole process, while the moving image is generated for each time frame using an interpolation of two images generated at the start and end time of the given time interval. For the purpose of fast shadow generation, we classify the shadows into three groups, i.e., still shadows generated by still objects on still objects, moving shadows generated by moving objects on still objects, and composite shadows generated by both still objects and moving objects on moving objects. These shadows can be generated very quickly by utilizing the frame coherence. According to the experimental results, a speed-up factor of 3.2 to 12.8, depending on the percentage of the moving objects among all objects, was obtained using our algorithm, compared to the conventional scheme not utilizing the frame-to-frame image coherence.
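
The per-pixel choice between the background image and the moving image can be sketched as a depth comparison. This is a minimal illustration, not the authors' scan-line implementation. It assumes each pixel is stored as a (color, depth) pair with smaller depth meaning nearer to the eye, and that the moving image holds None where no moving object covers the pixel. The function name `composite` and the sample data are hypothetical.

```python
def composite(background, moving):
    """Pick, per pixel, the color of the candidate nearer to the eye."""
    frame = []
    for bg_row, mv_row in zip(background, moving):
        row = []
        for bg_pixel, mv_pixel in zip(bg_row, mv_row):
            # Use the moving pixel only if one exists and it is in front.
            if mv_pixel is not None and mv_pixel[1] < bg_pixel[1]:
                row.append(mv_pixel[0])
            else:
                row.append(bg_pixel[0])
        frame.append(row)
    return frame


# One scan line of three pixels; the background is rendered only once.
background = [[("sky", 100.0), ("wall", 10.0), ("wall", 10.0)]]
# Moving image for one time frame: a ball in front of the wall,
# then a ball fragment hidden behind it.
moving = [[None, ("ball", 5.0), ("ball", 15.0)]]
frame = composite(background, moving)
```

Only the moving image changes from frame to frame, so the background pixels and their depths can be reused across the whole sequence.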

  • Spread Omega Network for High Speed Packet Switching

    H. C. LEE  Chong-Min KYUNG  

     
    LETTER-Switching and Communication Processing

      Vol:
    E80-B No:1
      Page(s):
    192-195

    A network with input and output buffers is proposed. It consists of several switching stages composed of 3×3 basic switching elements which are connected with perfect shuffle and horizontal connections. The proposed network reduces the required number of stages, and increases the fault tolerance due to its highly regular connection scheme. Its performance was evaluated with computer simulation under a bursty traffic environment. For a 128×128 switch with 11 switching stages, a packet loss ratio of 10⁻⁶ was obtained when the input load is 0.8 and the burstiness is 10.

  • An Automatic Interface Insertion Scheme for In-System Verification of Algorithm Models in C

    Chang-Jae PARK  Ando KI  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-High Level Synthesis

      Vol:
    E85-A No:12
      Page(s):
    2645-2654

    This paper describes an automatic interface insertion scheme for in-system verification of algorithm models. To insert the interface, an algorithm model described in C is translated into another source code that includes the communication with hardware components in the target system to be validated with the algorithm model. The communication between the algorithm model and hardware components is achieved using transactors that perform transformation between access operations and bus cycle transactions. I/O terminal is introduced as an interface model to relate the transactions to access operations during the execution of the algorithm model, i.e., accesses to I/O terminals invoke bus cycle transactions in hardware and vice versa. An automatic interface insertion tool is developed using the source-to-source translation to identify the I/O terminals and insert interface function calls in the source code. The proposed automatic interface insertion scheme is validated by emulating several multimedia algorithms written in C on real target systems.

  • Power Minimization for Dual- and Triple-Supply Digital Circuits via Integer Linear Programming

    Ki-Yong AHN  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E92-A No:9
      Page(s):
    2318-2325

    This paper proposes an Integer Linear Programming (ILP)-based power minimization method that partitions a circuit into regions, first with three different VDDs (PM3V) and second with two different VDDs (PM2V). To reduce the solving time of the triple-VDD case (PM3V), we also propose a partitioned ILP method (p-PM3V). The proposed method provides 29% power saving on average in the triple-VDD case compared to the single-VDD case. The power reduction of PM3V compared to Clustered Voltage Scaling (CVS) was about 18%. Compared to the unpartitioned ILP formulation (PM3V), the partitioned ILP method (p-PM3V) reduced the total solution time by 46% at the cost of additional power consumption within 1.3%.