IEICE global.ieice.org Site

Author Search Result

[Author] Takahiro WATANABE (17 hits)

Low Power Placement and Routing for the Coarse-Grained Power Gating FPGA Architecture
Ce LI Yiping DONG Takahiro WATANABE

PAPER-Physical Level Design

Vol: E94-A, No: 12, Page(s): 2519-2527
Because an FPGA consumes more power than an ASIC that performs the same function at the same scaling, the use of FPGAs is limited, especially in portable electronic devices. In this paper, we propose a novel low-power FPGA architecture based on coarse-grained power gating to reduce power consumption. A new placement algorithm and a routing resource graph for sleep regions are also presented. After enhancing the CAD framework, we discuss in detail the different region sizes supported by the new FPGA architecture. As a result, the proposed FPGA architecture combined with the new placement and routing algorithm reduces total power consumption by 19.4% compared with a traditional FPGA. With the proposed method, FPGAs are promising candidates for wide use in portable devices.
High Performance Virtual Channel Based Fully Adaptive 3D NoC Routing for Congestion and Thermal Problem
Xin JIANG Xiangyang LEI Lian ZENG Takahiro WATANABE

PAPER-VLSI Design Technology and CAD

Vol: E100-A, No: 11, Page(s): 2379-2391
Recent Network-on-Chip (NoC) designs must take thermal issues into consideration because of their great impact on network performance and reliability, especially for 3D NoCs. In this work, we design a virtual-channel-based fully adaptive routing algorithm for runtime thermal-aware management of 3D NoCs. To improve network throughput and latency, we use two virtual channels for each horizontal direction and design a routing function that not only avoids deadlock and livelock but also ensures high adaptivity and routability in the throttled network. For path selection, we design a strategy that gives priority to distance but also considers path diversity and traffic state. For throttling information collection, instead of transmitting the topology information of the whole network, we use a 12-bit register to hold the state of the routers one hop away, which greatly reduces hardware cost and decreases network latency. In the experiments, we test the proposed routing algorithm in different states and with different sizes, and it shows better network latency and throughput with low power compared with traditional algorithms.
A Global Router for Analog Function Blocks Based on the Branch-and-Bound Algorithm
Tadanao TSUBOTA Masahiro KAWAKITA Takahiro WATANABE

PAPER-VLSI Design Technology and CAD

Vol: E78-A, No: 3, Page(s): 345-352
The main aim of device-level global routing is to obtain high-performance detailed routing under various layout constraints. This paper deals with global routing for analog function blocks. For analog LSIs, especially those operating at high frequency, various layout constraints are specified prior to routing. Those constraints must be completely satisfied to achieve the required circuit performance. However, they are sometimes too hard to be solved by any heuristic method, even if the problem is small. Thus, we propose a method based on the branch-and-bound algorithm, which can generate all possible solutions to find the best one. Unfortunately, such a method tends to take a large amount of processing time. To overcome this drawback and accelerate the process, constraints are classified into two groups: constraints on single nets and constraints between two nets. Accordingly, our method consists of two parts: in the first part only constraints on single nets are processed, and in the second part only constraints between two nets are processed. The method is efficient because many possible routes that violate layout constraints are rejected immediately in each part. This makes it possible to construct a smaller search tree and to reduce processing time. In addition to this idea, all nets processed in the second part are sorted in a proper order to reduce the number of edges in the search tree. This also saves much processing time. Experimental results show that our method can find a good global route under hard layout constraints in practical processing time, and that it is superior to the well-known simulated annealing method in both solution quality and processing time.
A Fast MER Enumeration Algorithm for Online Task Placement on Reconfigurable FPGAs
Tieyuan PAN Lian ZENG Yasuhiro TAKASHIMA Takahiro WATANABE

PAPER

Vol: E99-A, No: 12, Page(s): 2412-2424
In this paper, we propose a fast Maximal Empty Rectangle (MER) enumeration algorithm for online task placement on reconfigurable Field-Programmable Gate Arrays (FPGAs). On the assumption that each task utilizes rectangle-shaped resources, the proposed algorithm manages the free space on an FPGA with an MER list. When a task is assigned or removed, a series of MERs is selected and cut into segments according to the task and its assignment location. By processing these segments, the MER list can be updated quickly with low memory consumption. Based on a proof of the upper limit on the number of MERs on the FPGA, we analyze both the time and space complexity of the proposed algorithm. The efficiency of the proposed algorithm is verified by experiments.
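To make the notion of an MER list concrete, the following is a minimal brute-force sketch, not the fast algorithm the abstract describes. It assumes the FPGA is modelled as a small grid in which 0 marks a free cell and 1 an occupied cell; the names `is_empty` and `enumerate_mers` are hypothetical. A rectangle is reported when it is empty and cannot be grown by one row or column in any direction.

```python
from itertools import product


def is_empty(grid, r1, c1, r2, c2):
    """Return True if every cell of the inclusive rectangle is free (0)."""
    return all(grid[r][c] == 0
               for r in range(r1, r2 + 1)
               for c in range(c1, c2 + 1))


def enumerate_mers(grid):
    """Brute-force list of maximal empty rectangles as (r1, c1, r2, c2).

    Illustrative baseline only: it tests every rectangle, so it is far
    slower than an incremental MER-list update.
    """
    rows, cols = len(grid), len(grid[0])
    mers = []
    for r1, c1 in product(range(rows), range(cols)):
        for r2, c2 in product(range(r1, rows), range(c1, cols)):
            if not is_empty(grid, r1, c1, r2, c2):
                continue
            # An empty rectangle is maximal when no one-step extension is empty.
            grow_up = r1 > 0 and is_empty(grid, r1 - 1, c1, r1 - 1, c2)
            grow_down = r2 < rows - 1 and is_empty(grid, r2 + 1, c1, r2 + 1, c2)
            grow_left = c1 > 0 and is_empty(grid, r1, c1 - 1, r2, c1 - 1)
            grow_right = c2 < cols - 1 and is_empty(grid, r1, c2 + 1, r2, c2 + 1)
            if not (grow_up or grow_down or grow_left or grow_right):
                mers.append((r1, c1, r2, c2))
    return mers
```

For a 3x3 grid with only the centre cell occupied, this sketch reports four MERs: the top row, the bottom row, the left column and the right column.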
Analysis before Starting an Access: A New Power-Efficient Instruction Fetch Mechanism
Jiongyao YE Yingtao HU Hongfeng DING Takahiro WATANABE

PAPER-Computer System

Vol: E94-D, No: 7, Page(s): 1398-1408
Power consumption has become an increasing concern in high-performance microprocessor design. In particular, the Instruction Cache (I-Cache) contributes a large portion of the total power consumption of a microprocessor, since it is a complex unit and is accessed very frequently. Several low-power techniques have been presented for power-efficient cache design. However, these techniques usually suffer from the restrictions of traditional Instruction Fetch Unit (IFU) architectures, where the fetch address must be sent to the I-Cache as soon as it is available. Therefore, the work that can be done to reduce power consumption after address generation and before starting an access is limited. In this paper, we present a new power-aware IFU architecture, named Analysis Before Starting an Access (ABSA), which aims at maximizing the power efficiency of low-power designs by eliminating the restrictions that the traditional IFU places on them. To achieve this goal, ABSA reorganizes the IFU pipeline and carefully assigns tasks to each stage so that sufficient time and information are available for the low-power techniques to maximize power efficiency before an access starts. The proposed design is fully scalable and its cost is low. Compared with a conventional IFU design, simulation results show that ABSA saves about 30.3% of fetch power consumption on average. The I-Cache employed by ABSA reduces static and dynamic power consumption by about 85.63% and 66.92%, respectively. Meanwhile, the performance degradation is only about 0.97%.
An Adaptive Various-Width Data Cache for Low Power Design
Jiongyao YE Yu WAN Takahiro WATANABE

PAPER-Computer System

Vol: E94-D, No: 8, Page(s): 1539-1546
Modern microprocessors employ caches to bridge the large speed gap between main memory and the central processing unit, but these caches account for a larger and larger proportion of total power consumption. In fact, many values in a processor rarely need the full-bit dynamic range supported by a cache. Narrow-width values occupy a large portion of cache accesses and storage. In view of these observations, this paper proposes an Adaptive Various-width Data Cache (AVDC) to reduce the power consumption of a cache by exploiting the prevalence of narrow-width values stored in it. In AVDC, the data storage unit consists of three sub-arrays that store data of different widths. When the high sub-arrays are not used, they are shut off to save their dynamic and static power consumption through a modified high-bit SRAM cell. The main advantages of AVDC are: 1) Both dynamic and static power consumption can be reduced. 2) Low power consumption is achieved by modifying the data storage unit, with little additional hardware modification. 3) We exploit the redundancy of narrow-width values instead of compressed values, so cache access latency does not increase. Experimental results using SPEC 2000 benchmarks show that the proposed AVDC reduces dynamic power by 34.83% and static power by 42.87% on average, compared with a cache without AVDC.
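As a rough illustration of what "narrow-width value" means here, the sketch below classifies a 32-bit word by the smallest width from which it can be recovered by sign extension. The word size, the candidate widths (8, 16, 32) and the function names are assumptions made for this example; the abstract does not state the widths of the three sub-arrays.

```python
WORD_BITS = 32
SUB_ARRAY_WIDTHS = (8, 16, 32)  # assumed widths, not taken from the paper


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    value &= mask
    return (value ^ sign_bit) - sign_bit


def required_width(word):
    """Smallest assumed width whose sign extension reproduces the 32-bit word."""
    signed = sign_extend(word, WORD_BITS)
    for width in SUB_ARRAY_WIDTHS:
        if sign_extend(signed, width) == signed:
            return width
    raise ValueError("word does not fit in WORD_BITS")
```

Under these assumptions, a value such as 127 or 0xFFFFFF80 (that is, -128) needs only the narrowest width, so the bits above it carry no information and the storage holding them is a candidate for being switched off.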
An Efficient Highly Adaptive and Deadlock-Free Routing Algorithm for 3D Network-on-Chip
Lian ZENG Tieyuan PAN Xin JIANG Takahiro WATANABE

PAPER

Vol: E99-A, No: 7, Page(s): 1334-1344
As semiconductor technology continues to develop, hundreds of cores will be deployed on a single die in future Chip-Multiprocessor (CMP) designs. Three-Dimensional Networks-on-Chip (3D NoCs) have become an attractive solution that can provide impressively high performance. An efficient and deadlock-free routing algorithm is critical to achieving high network-on-chip performance. Traditional methods based on deterministic routing and the turn model are deadlock-free, but they are unable to distribute traffic loads over the network. In this paper, we propose an efficient, adaptive and deadlock-free routing algorithm (EAR) based on a novel routing selection strategy for 3D NoCs, which can distribute traffic loads not only within layers but also between layers according to congestion information and path diversity. Simulation results show that the proposed method achieves a significant performance improvement compared with other methods.
Region Oriented Routing FPGA Architecture for Dynamic Power Gating
Ce LI Yiping DONG Takahiro WATANABE

PAPER-Physical Level Design

Vol: E95-A, No: 12, Page(s): 2199-2207
Dynamic power gating applied to FPGAs can reduce power consumption effectively. In this paper, we propose a sophisticated routing architecture for a region-oriented FPGA that supports dynamic power gating. This is the first routing solution for dynamic power gating in coarse-grained FPGAs. This paper makes two main contributions. First, it improves the routing resource graph and routing architecture to support special routing for a region-oriented FPGA. Second, some routing channels are made wider to avoid congestion. Experimental results show that routing area can be reduced by 7.7% compared with the symmetric Wilton switch box in the region. In addition, the proposed FPGA architecture with sophisticated P&R can reduce the power consumption of a system implemented in the FPGA.
A Framework for Feature Extraction of Images by Energy Minimization
Satoshi NAKAGAWA Takahiro WATANABE Yuji KUNO

PAPER

Vol: E77-D, No: 11, Page(s): 1213-1218
This paper describes a new feature extraction model (Active Model) which is extended from the active contour model (Snakes). Active Model can be applied to various energy minimizing models since it integrates most of the energy terms ever proposed into one model and also provides the new terms for multiple images such as motion and stereo images. The computational order of energy minimization process is estimated in comparison with a dynamic programming method and a greedy algorithm, and it is shown that the energy minimization process in Active Model is faster than the others. Some experimental results are also shown.
A Fine Grain Cooled Logic Architecture for Low-Power Processors
Hiroyuki MATSUBARA Takahiro WATANABE Tadao NAKAMURA

PAPER

Vol: E84-A, No: 3, Page(s): 735-740
In this paper, we propose a fine-grain Cooled Logic architecture for low-power-oriented processors. Using a novel hardware method with dual-rail logic, Cooled Logic detects the functional blocks that are to be active and stops the clock to each functional block in order to make it inactive during certain periods. To confirm the effectiveness of our approach, we design 4-bit and 16-bit event-driven array multipliers and analyze their power consumption with the HSPICE simulator. The results show that Cooled Logic tends to reduce power consumption in both the functional blocks and the clock drivers of the multipliers.
Region-Oriented Placement Algorithm for Coarse-Grained Power-Gating FPGA Architecture
Ce LI Yiping DONG Takahiro WATANABE

PAPER-Design Methodology

Vol: E95-D, No: 2, Page(s): 314-323
FPGAs play an essential role in industrial products because they are fast, stable and flexible. However, the power consumption of FPGAs used in portable devices is a critical issue. The top-down hierarchical design method is commonly used in both ASIC and FPGA design. However, when multiple modules are integrated in an FPGA and some of them may be in sleep mode, current FPGA architectures cannot be fully effective. In this paper, a coarse-grained power-gating FPGA architecture is proposed in which the whole area of an FPGA is partitioned into several regions and the power supply is controlled for each region, so that modules in sleep mode can be effectively powered off. We also propose a region-oriented FPGA placement algorithm, based on VPR [1], that fits the user's hierarchical design. Simulation results show that the proposed method can reduce FPGA power consumption by 38% on average by putting unused modules or regions into sleep mode.
Score Sequence Pair Problems of (r₁₁, r₁₂, r₂₂)-Tournaments--Determination of Realizability--
Masaya TAKAHASHI Takahiro WATANABE Takeshi YOSHIMURA

PAPER-Graph Algorithms

Vol: E90-D, No: 2, Page(s): 440-448
Let G be any graph with property P (for example, general graph, directed graph, etc.) and S be nonnegative and non-decreasing integer sequence(s). The prescribed degree sequence problem is the problem of determining whether there is a graph G having S as the prescribed sequence(s) of degrees or outdegrees of the vertices. Since the 1950s, this problem has attracted wide attention, and many extensions of it have been considered. Let P be the property satisfying the following (1) and (2): (1) G is a directed graph with two disjoint vertex sets A and B. (2) There are r11 (r22, respectively) directed edges between every pair of vertices in A (B), and r12 directed edges between every pair consisting of a vertex in A and a vertex in B. Then G is called an (r11, r12, r22)-tournament ("tournament", for short). The problem is called the score sequence pair problem of a "tournament" (realizable, for short). S is called a score sequence pair of a "tournament" if the answer to the problem is "yes." In this paper, we propose characterizations of a score sequence pair of a "tournament" and an algorithm for determining in linear time whether a pair of two integer sequences is realizable or not.
A New Recovery Mechanism in Superscalar Microprocessors by Recovering Critical Misprediction
Jiongyao YE Yu WAN Takahiro WATANABE

PAPER-High-Level Synthesis and System-Level Design

Vol: E94-A, No: 12, Page(s): 2639-2648
Current trends in modern out-of-order processors involve implementing deeper pipelines and a large instruction window to achieve high performance, which makes the penalty of branch misprediction recovery a critical factor in overall processor performance. Multi-path execution has been proposed to reduce this penalty by executing both paths following a branch simultaneously. However, this mechanism has some drawbacks, such as the design complexity caused by processing both paths after a branch and performance degradation due to hardware resource competition between the two paths. In this paper, we propose a new recovery mechanism, called Recovery Critical Misprediction (RCM), to reduce the penalty of branch misprediction recovery. The mechanism uses a small trace cache to save the decoded instructions from the alternative path following a branch. Then, during subsequent predictions, the trace cache is accessed. If there is a hit, the processor forks the second path of this branch at the rename stage, so that the design complexity in the fetch and decode stages is alleviated. The main contribution of this paper is that the proposed mechanism employs critical path prediction to identify the branches that will be most harmful if mispredicted. Only a critical branch can save its alternative path into the trace cache, which not only increases the usefulness of a limited-size trace cache but also avoids the performance degradation caused by forking non-critical branches. Experimental results using the SPECint 2000 benchmarks show that a processor with the proposed RCM improves IPC by 10.05% compared with a conventional processor.
Circuit Design Optimization Using Genetic Algorithm with Parameterized Uniform Crossover
Zhiguo BAO Takahiro WATANABE

PAPER-Nonlinear Problems

Vol: E93-A, No: 1, Page(s): 281-290
Evolvable hardware (EHW) is a new research field concerning the use of Evolutionary Algorithms (EAs) to construct electronic systems. In a narrow sense, EHW refers to the use of evolutionary mechanisms as the algorithmic drivers for system design; in a general sense, it refers to the capability of a hardware system to develop and improve itself. The Genetic Algorithm (GA) is a typical EA. We propose optimal circuit design using a GA with parameterized uniform crossover (GApuc) and a fitness function composed of circuit complexity, power, and signal delay. Parameterized uniform crossover is much more likely to distribute its disruptive trials in an unbiased manner over larger portions of the search space, so it has more exploratory power than one-point and two-point crossover and offers more chances of finding better solutions. Its effectiveness is shown by experiments. The results show that the best elite fitness, the average fitness of the correct circuits, and the number of correct circuits obtained by GApuc are better than those obtained by a GA with one-point or two-point crossover. The best optimal circuit generated by GApuc is 10.18% and 6.08% better in evaluation value than those generated by a GA with one-point crossover and two-point crossover, respectively.
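For readers unfamiliar with the operator, a generic parameterized uniform crossover can be sketched as follows. This is the textbook form of the operator, not necessarily the exact variant used in GApuc; the function name, the list-based chromosome encoding and the `swap_prob` parameter are assumptions for illustration. Each gene position is exchanged between the two parents independently with probability `swap_prob`.

```python
import random


def parameterized_uniform_crossover(parent_a, parent_b, swap_prob, rng=random):
    """Return two children; each gene is swapped with probability swap_prob.

    rng may be the random module or a random.Random instance.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        # random() returns a float in [0, 1), so 0.0 never swaps, 1.0 always swaps.
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

With `swap_prob = 0.5` this reduces to ordinary uniform crossover; smaller values keep children closer to their parents, which is the knob the word "parameterized" refers to.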
An Online Task Placement Algorithm Based on MER Enumeration for Partially Reconfigurable Device
Tieyuan PAN Li ZHU Lian ZENG Takahiro WATANABE Yasuhiro TAKASHIMA

PAPER

Vol: E99-A, No: 7, Page(s): 1345-1354
Recently, owing to the development of design and manufacturing technologies for VLSI systems, embedded systems have become more and more complex. Consequently, not only chip performance but also the flexibility and dynamic adaptation of the implemented systems are required. A partially reconfigurable device is a promising way to meet these requirements. In this paper, we propose an efficient data structure to manage the reconfigurable units. Then, on the assumption that each task utilizes rectangle-shaped resources, we propose a very simple MER enumeration algorithm based on this data structure. By utilizing the result of MER enumeration, the free space on the reconfigurable device can be fully used. We analyze the complexity of the proposed algorithm and confirm its efficiency by experiments.
A Clocking Scheme for Lowering Peak-Current in Dynamic Logic Circuits
Hiroyuki MATSUBARA Takahiro WATANABE Tadao NAKAMURA

PAPER

Vol: E83-C, No: 11, Page(s): 1733-1738
This paper deals with a new low-power clocking scheme for dynamic logic circuits to reduce power dissipation. Although conventional clocking schemes for dynamic logic circuits are mainly used for high-speed applications such as domino circuits, their peak current is very large because precharging and discharging are concentrated in a short period. It is hard for these schemes to achieve both reduced power dissipation and high performance at the same time. In the field of power engineering, leveling power means decreasing the peak-to-peak variation of power while keeping its total amount. We therefore propose a sophisticated clocking scheme that levels the power dissipation of processing elements and mainly reduces the power dissipation of clock drivers. The proposed clocking scheme uses an overlapped clock with fine-grain power control, so that the peak current becomes lower and power dissipation over a short period is leveled without a speed penalty. The proposed scheme is applied to a 4-bit array multiplier, and the reductions in power dissipation of both the multiplier and the clock driver are measured with the HSPICE simulator based on 0.5 µm CMOS technology. It is shown that the power dissipation of the clock drivers, the 4-bit array multiplier, and the total are reduced by about 13.2 percent, 2.6 percent and 7.0 percent, respectively. As a result, our clocking scheme is effective in reducing the power dissipation of clock drivers.
Analog Layout Compaction with a Clean-up Function
Masahiro KAWAKITA Takahiro WATANABE

PAPER

Vol: E71-E, No: 12, Page(s): 1243-1252
Reducing design time and cost has been a major concern not only in digital LSI layout but also in analog LSI layout, owing to increasing LSI packing density and circuit complexity. Semicustom approaches are insufficient for designing analog LSIs, which require higher-density chips and have many kinds of design specifications. Among custom approaches, a symbolic layout method is widely used, in which automatic compaction shrinks the chip size after placement and routing. However, most analog LSIs are fabricated with bipolar process technology, which has many kinds of devices with variously shaped patterns. In addition, there are many layout specifications that are peculiar to analog LSIs and directly affect circuit performance. It is therefore necessary to take the layout specifications into account not only in placement and routing but also in compaction. This paper describes an approach to analog compaction. Given a placed and routed layout pattern that satisfies the layout specifications, various techniques for taking such specifications into account in a compaction method are discussed. This paper also proposes a clean-up function after compaction, which reduces detoured wire patterns and removes unnecessary vias. Compaction with the clean-up function yields a final layout pattern of refined quality.