The search functionality is under construction.

Author Search Result

[Author] In-Cheol PARK(21hit)

1-20hit(21hit)

  • Fast Precise Interrupt Handling without Associative Searching in Multiple Out-Of-Order Issue Processors

    Sang-Joon NAM  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-Computer Hardware and Design

      Vol:
    E82-D No:3
      Page(s):
    645-653

    This paper presents a new approach to the precise interrupt handling problem in modern processors with multiple out-of-order issues. It is difficult to implement a precise interrupt scheme in the processors because later instructions may change the process states before their preceding instructions have completed. We propose a fast precise interrupt handling scheme which can recover the precise state in one cycle if an interrupt occurs. In addition, the scheme removes all the associative searching operations which are inevitable in the previous approaches. To deal with the renaming of destination registers, we present a new bank-based register file which is indexed by bank index tables containing the bank identifiers of renamed register entries. Simulation results based on the superscalar MIPS architecture show that the register file with 3 banks is a good trade-off between high performance and low complexity.

  • Improving Dictionary-Based Code Compression in VLIW Architectures

    Sang-Joon NAM  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER

      Vol:
    E82-A No:11
      Page(s):
    2318-2324

    Reducing code size is crucial in embedded systems as well as in high-performance systems to overcome the communication bottleneck between memory and CPU, especially with VLIW (Very Long Instruction Word) processors that require a high-bandwidth instruction prefetching. This paper presents a new approach for dictionary-based code compression in VLIW processor-based systems using isomorphism among instruction words. After we divide instruction words into two groups, one for opcode group and the other for operand group, the proposed compression algorithm is applied to each group for maximal code compression. Frequently-used instruction words are extracted from the original code to be mapped into two dictionaries, an opcode dictionary and an operand dictionary. According to the SPEC95 benchmarks, the proposed technique has achieved an average code compression ratio of 63%, 69%, and 71% in a 4-issue, 8-issue, and 12-issue VLIW architecture, respectively.

  • Path-Classified Trace Cache for Improving Hit Ratio in Wide-Issue Processors

    Jin-Hyuk YANG  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-Computer Hardware and Design

      Vol:
    E82-D No:10
      Page(s):
    1338-1343

    In this paper, an instruction-cache scheme called Multi-Path Tracing is proposed to enhance the trace cache. Paths are classified to improve the trace cache hit ratio by reducing the path conflict and basic blocks are joined to reduce the hardware cost needed to implement the trace cache. Simulation results for various SPEC integer benchmarks show that the proposed scheme increases the hit ratio by more than 25% and the effective fetch size by 10%.

  • Synthesis of Application-Specific Coprocessor for Core-Based ASIC Design

    Dae-Hyun LEE  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E84-A No:2
      Page(s):
    604-613

    This paper presents an efficient approach for a hardware/software partitioning problem: synthesis of an application-specific coprocessor which accelerates an embedded software running on a main processor. Given a set of data flow graphs (DFGs), most of previous hardware/software partitioning approaches have focused on mapping DFGs to hardware or software. Their common weaknesses are that 1) they ignore various implementation alternatives in realizing DFGs as hardware based on the assumption that only a single hardware implementation exists for a DFG, and that 2) they don't consider the effect of merging on hardware area when synthesizing a coprocessor by merging DFGs. To deal with the first issue, we formulate both the mapping of DFGs to hardware or software and the selection of the appropriate hardware implementation for each DFG as a single integer programming problem, and then apply an iterative algorithm based on the Kernighan and Lin's heuristic to solve the problem. To reduce the CPU time, we have devised data structures that quickly calculate costs of hardware implementations. To deal with the second issue, our method links DFGs with dummy nodes to produce a single large DFG, and then synthesizes a target coprocessor by globally scheduling the DFG and allocating its datapath. Experimental results demonstrate that our approach outperforms the previous approach based on genetic algorithm (GA) in both the coprocessor area and the CPU time.

  • SEWD: A Cache Architecture to Speed up the Misaligned Instruction Prefetch

    Joon-Seo YIM  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-Computer Hardware and Design

      Vol:
    E80-D No:7
      Page(s):
    742-745

    In microprocessors, reducing the cache access delay and the number of pipeline stall is critical to improve the system performance. In this paper, we propose a Separated Word-line Decoding (SEWD) cache to overcome the pipeline stall caused by the misaligned multi-words data or instruction prefetches which are placed over two cache lines. SEWD cache makes it possible to perform misaligned prefetch as well as aligned prefetch in one clock cycle. This feature is invaluable because the branch target addresses are very often misaligned (Percentage of misalignment in the cache is 8 to 13% for 16-byte caches). 8Kbyte SEWD cache chip was implemented in 0.8µm DLM CMOS process. It consists of 489,000 transistors on a die size of 0.8530.827cm2.

  • Low-Latency Low-Cost Architecture for Square and Cube Roots

    Jihyuck JO  In-Cheol PARK  

     
    PAPER-Digital Signal Processing

      Vol:
    E100-A No:9
      Page(s):
    1951-1955

    This paper presents a low-latency, low-cost architecture for computing square and cube roots in the fixed-point format. The proposed architecture is designed based on a non-iterative root calculation scheme to achieve fast computations. While previous non-iterative root calculators are restricted to a square-root operation due to the limitation of their mathematical property, the root computation is generalized in this paper to apply an approximation method to the non-iterative scheme. On top of that, a recurrent method is proposed to select parameters, which enables us to reduce the table size while keeping the maximum relative error value low. Consequently, the proposed root calculator can support both square and cube roots at the expense of small delay and low area overheads. This extension can be generalized to compute the nth roots, where n is a positive integer.

  • A Low-Complexity Stopping Criterion for Iterative Turbo Decoding

    Dong-Soo LEE  In-Cheol PARK  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E88-B No:1
      Page(s):
    399-401

    This letter proposes an efficient and simple stopping criterion for turbo decoding, which is derived by observing the behavior of log-likelihood ratio (LLR) values. Based on the behavior, the proposed criterion counts the number of absolute LLR values less than a threshold and the number of hard decision 1's in order to complete the iterative decoding procedure. Simulation results show that the proposed approach achieves a reduced number of iterations while maintaining similar BER/FER performance to the previous criteria.

  • CLASSIC: An O(n2)-Heuristic Algorithm for Microcode Bit Optimization Based on Incompleteness Relations

    Young-doo CHOI  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E83-A No:5
      Page(s):
    901-908

    This paper presents a heuristic algorithm called CLASSIC for the minimization of the control memory width in microprogrammed processors or the instruction memory width of application-specific VLIW (Very Long Instruction Word) processors. CLASSIC results in nearly optimal solutions with the time complexity of O(n2), where n denotes the number of microoperations. In this paper, we also propose the so-called incompleteness relations which are exploited for the minimization of the control memory width. Experiments using various examples have shown that CLASSIC always achieves smaller microprogram widths compared to the earlier techniques based on the maximal compatibility class or the minimal AND/OR set. The results show that CLASSIC can reduce the control memory width by 34.2% on average compared with a heuristic compatibility class algorithm.

  • Multiplier-less and Table-less Linear Approximation for Square-Related Functions

    In-Cheol PARK  Tae-Hwan KIM  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E93-D No:11
      Page(s):
    2979-2988

    Square-related functions such as square, inverse square, square-root and inverse square-root operations are widely used in digital signal processing and digital communication algorithms, and their efficient realizations are commonly required to reduce the hardware complexity. In the implementation point of view, approximate realizations are often desired if they do not degrade performance significantly. In this paper, we propose new linear approximations for the square-related functions. The traditional linear approximations need multipliers to calculate slope offsets and tables to store initial offset values and slope values, whereas the proposed approximations exploit the inherent properties of square-related functions to linearly interpolate with only simple operations, such as shift, concatenation and addition, which are usually supported in modern VLSI systems. Regardless of the bit-width of the number system, more importantly, the maximum relative errors of the proposed approximations are bounded to 6.25% and 3.13% for square and square-root functions, respectively. For inverse square and inverse square-root functions, the maximum relative errors are bounded to 12.5% and 6.25% if the input operands are represented in 20 bits, respectively.

  • A Hierarchical Circuit Clustering Algorithm with Stable Performance

    Seung-June KYOUNG  Kwang-Su SEONG  In-Cheol PARK  Chong-Min KYUNG  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E82-A No:9
      Page(s):
    1987-1993

    Clustering is almost essential in improving the performance of iterative partitioning algorithms. In this paper, we present a clustering algorithm based on the following observation: if a group of cells is assigned to the same partition in numerous local optimum solutions, it is desirable to merge the group into a cluster. The proposed algorithm finds such a group of cells from randomly generated local optimum solutions and merges it into a cluster. We implemented a multilevel bipartitioning algorithm (MBP) based on the proposed clustering algorithm. For MCNC benchmark netlists, MBP improves the total average cut size by 9% and the total best cut size by 3-4%, compared with the previous state-of-the-art partitioners.

  • Area-Efficient QC-LDPC Decoder Architecture Based on Stride Scheduling and Memory Bank Division

    Bongjin KIM  In-Cheol PARK  

     
    PAPER-Fundamental Theories for Communications

      Vol:
    E96-B No:7
      Page(s):
    1772-1779

    In this paper, an area-efficient decoder architecture is proposed for the quasi-cyclic low-density parity check (QC-LDPC) codes specified in the IEEE 802.16e WiMAX standard. The decoder supports all the code rates and codeword lengths defined in the standard. In order to achieve low area and maximize hardware utilization, the decoder utilizes 4 decoding function units, which is the greatest common divisor of the expansion factors. In addition, the decoder adopts a novel scheduling scheme named stride scheduling, which stores the extrinsic messages in non-sequential order to replace the conventional complex flexible permutation network with simple small-sized cyclic shifters and also minimize the number of memory accesses. To further minimize the complexity, the number of extrinsic memory instances for 24 block columns is reduced to 5 banks by identifying independent sets. All the memory instances used in the decoder are single-port memories which cost less area and price compared to dual-port ones. Finally, the decoding function units have partially parallel structure to make the decoding throughput sufficiently over the requirement of the WiMAX standard. The proposed decoder is synthesized with 49 K equivalent gates and 54,144 bits of memory, and the implementation occupies 0.40 mm2 in a 65 nm CMOS technology.

  • Parallel Decoding of Context-Based Adaptive Binary Arithmetic Codes Based on Most Probable Symbol Prediction

    Chung-Hyo KIM  In-Cheol PARK  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E90-D No:2
      Page(s):
    609-612

    Context-based adaptive binary arithmetic coding (CABAC) is the major entropy-coding algorithm employed in H.264/AVC. Although the performance gain of H.264/AVC is mainly due to CABAC, it is difficult to achieve a fast decoder because the decoding algorithm is basically sequential and computationally intensive. In this letter, a prediction scheme is proposed that enhances overall decoding performance by decoding two binary symbols at a time. A CABAC decoder based on the proposed prediction scheme improves the decoding performance by 24% compared to conventional decoders.

  • A New Single-Clock Flip-Flop for Half-Swing Clocking

    Young-Su KWON  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER

      Vol:
    E82-A No:11
      Page(s):
    2521-2526

    A new flip-flop configuration for half-swing clocking is proposed to save total clocking power. In the proposed scheme, only NMOS's are clocked with the half-swing clock in order to make it operate without level converters or any additional logics which were used in the earlier half-swing clocking schemes. Vcc is supplied to the random logic circuits and flip-flops while Vcc/2 is supplied to the clock network and some parts of the flip-flop to reduce the power consumed in the clock network. Compared to the conventional scheme, the proposed flip-flop configuration can save the clocking power by 40%.

  • Long-Point FFT Processing Based on Twiddle Factor Table Reduction

    Ji-Hoon KIM  In-Cheol PARK  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E90-A No:11
      Page(s):
    2526-2532

    In this paper, we present a new fast Fourier transform (FFT) algorithm to reduce the table size of twiddle factors required in pipelined FFT processing. The table size is large enough to occupy significant area and power consumption in long-point FFT processing. The proposed algorithm can reduce the table size to half, compared to the radix-22 algorithm, while retaining the simple structure. To verify the proposed algorithm, a 2048-point pipelined FFT processor is designed using a 0.18 µm CMOS process. By combining the proposed algorithm and the radix-22 algorithm, the table size is reduced to 34% and 51% compared to the radix-2 and radix-22 algorithms, respectively. The FFT processor occupies 1.28 mm2 and achieves a signal-to-quantization-noise ratio (SQNR) of more than 50 dB.

  • Energy-Scalable 4KB LDPC Decoding Architecture for NAND-Flash-Based Storage Systems

    Youngjoo LEE  Jaehwan JUNG  In-Cheol PARK  

     
    PAPER-Electronic Circuits

      Vol:
    E99-C No:2
      Page(s):
    293-301

    This paper presents a novel low-power decoder architecture for the (36420, 32778) binary LDPC code targeting energy-efficient NAND-flash-based mobile devices. The proposed energy-scalable decoding algorithm reduces the operating bit-width of decoding function units at the early-use stage where the channel condition is good enough to lower the precision of computation. Based on a flexible adder structure, the decoding energy of the proposed LDPC decoder can be reduced by freezing the unnecessary parts of hardware resources. A prototype 4KB LDPC decoder is designed in a 65nm CMOS technology, which achieves an average decoding throughput of 8.13Gb/s with 1.2M equivalent gates. The power consumption of the decoder ranges from 397mW to 563mW depending on operating conditions.

  • An Area-Efficient and Fully Synthesizable Bluetooth Baseband Module for Wireless Communication

    Ik-Jae CHUN  Bo-Gwan KIM  In-Cheol PARK  

     
    PAPER-Integrated Electronics

      Vol:
    E87-C No:1
      Page(s):
    94-100

    In this paper, we describe the implementation and the test results of a Bluetooth baseband module we have developed. For small chip size, we eliminate FIFOs for data buffering between hardware functional units and data buffers for bit streaming among channel coding blocks. Furthermore, we carefully consider hardware and software partitioning. We implement complex control tasks of the Bluetooth baseband layer protocols in software running on an embedded microcontroller. Hardware-efficient functions, such as low-level bitstream link control; host controller interfaces (HCIs), such as universal asynchronous receiver transmitter (UART) and universal serial bus (USB) interfaces; and audio CODEC are performed by dedicated hardware blocks. In addition, the bitstream data path block of the link controller constructing the baseband module has been designed by considering low power. The design of the baseband module is done using fully synthesizable Verilog HDL to enhance the portability between process technologies. A field programmable gate array (FPGA) implementation of the module was tested for functional verification and real time operation of file and bitstream transfer between PCs. The module was also fabricated in a 0.25 µm CMOS technology, the core size of which is only 2.792.80 mm2.

  • An Automatic Interface Insertion Scheme for In-System Verification of Algorithm Models in C

    Chang-Jae PARK  Ando KI  In-Cheol PARK  Chong-Min KYUNG  

     
    PAPER-High Level Synthesis

      Vol:
    E85-A No:12
      Page(s):
    2645-2654

    This paper describes an automatic interface insertion scheme for in-system verification of algorithm models. To insert the interface, an algorithm model described in C is translated into another source code that includes the communication with hardware components in the target system to be validated with the algorithm model. The communication between the algorithm model and hardware components is achieved using transactors that perform transformation between access operations and bus cycle transactions. I/O terminal is introduced as an interface model to relate the transactions to access operations during the execution of the algorithm model, i.e., accesses to I/O terminals invoke bus cycle transactions in hardware and vice versa. An automatic interface insertion tool is developed using the source-to-source translation to identify the I/O terminals and insert interface function calls in the source code. The proposed automatic interface insertion scheme is validated by emulating several multimedia algorithms written in C on real target systems.

  • Efficient Pruning for Infinity-Norm Sphere Decoding Based on Schnorr-Euchner Enumeration

    Tae-Hwan KIM  In-Cheol PARK  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E94-B No:9
      Page(s):
    2677-2680

    An efficient pruning method is proposed for the infinity-norm sphere decoding based on Schnorr-Euchner enumeration in multiple-input multiple-output spatial multiplexing systems. The proposed method is based on the characteristics of the infinity norm, and utilizes the information of the layer at which the infinity-norm value is selected in order to decide unnecessary sub-trees that can be pruned without affecting error-rate performance. Compared to conventional pruning, the proposed pruning decreases the average number of tree-visits by up to 37.16% in 44 16-QAM systems and 33.75% in 66 64-QAM systems.

  • Hardware Accelerator for Outline Font Generation

    Gyu-Cheol HWANG  In-Cheol PARK  Yun-Tae LEE  Tae-Hyung LEE  Jong-Hong BAE  Chong-Min KYUNG  

     
    PAPER-VLSI Design Technology

      Vol:
    E74-A No:10
      Page(s):
    3078-3082

    Translation of the scalable outline font data as represented by a set of control points of the cubic Bezier curve, etc. into the bitmap data for desk-top publishing (DTP) applications requires a significant amount of computation. In this paper, we propose a special purpose chip called KAFOG for the high-speed generation of bitmap font from the Hangul PostScript file for screen display as well as LBP (Laser Beam Printer) output. KAFOG chip was implemented in 1.5 µm CMOS gate array using 17 K gates. The computation throughput of the KAFOG chip is 250 K cubic Bezier curve segments (each curve segment is composed of four control points) per second at the clock frequency of 40 MHz.

  • Low-Power Hybrid Turbo Decoding Based on Reverse Calculation

    Hye-Mi CHOI  Ji-Hoon KIM  In-Cheol PARK  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E89-A No:3
      Page(s):
    782-789

    As turbo decoding is a highly memory-intensive algorithm consuming large power, a major issue to be solved in practical implementation is to reduce power consumption. This paper presents an efficient reverse calculation method to lower the power consumption by reducing the number of memory accesses required in turbo decoding. The reverse calculation method is proposed for the Max-log-MAP algorithm, and it is combined with a scaling technique to achieve a new decoding algorithm, called hybrid log-MAP, that results in a similar BER performance to the log-MAP algorithm. For the W-CDMA standard, experimental results show that 80% of memory accesses are reduced through the proposed reverse calculation method. A hybrid log-MAP turbo decoder based on the proposed reverse calculation reduces power consumption and memory size by 34.4% and 39.2%, respectively.

1-20hit(21hit)