The search functionality is under construction.

Author Search Result

[Author] Tadao NAKAMURA(14hit)

1-14hit
  • A Fine Grain Cooled Logic Architecture for Low-Power Processors

    Hiroyuki MATSUBARA  Takahiro WATANABE  Tadao NAKAMURA  

     
    PAPER

      Vol:
    E84-A No:3
      Page(s):
    735-740

    In this paper, we propose a fine grain Cooled Logic architecture for low-power oriented processors. Cooled Logic uses a novel hardware method based on dual-rail logic to detect which functional blocks need to be active, and stops the clock to each functional block so that it is inactive during certain periods. To confirm the effectiveness of our approach, we design 4-bit and 16-bit event-driven array multipliers and analyze their power consumption with the HSPICE simulator. The results show that Cooled Logic tends to reduce power consumption in both the functional blocks and the clock drivers of the multipliers.

  • A Clocking Scheme for Lowering Peak-Current in Dynamic Logic Circuits

    Hiroyuki MATSUBARA  Takahiro WATANABE  Tadao NAKAMURA  

     
    PAPER

      Vol:
    E83-C No:11
      Page(s):
    1733-1738

    This paper deals with a new low-power clocking scheme for dynamic logic circuits to reduce power dissipation. Although conventional clocking schemes for dynamic logic circuits are mainly used for high-speed applications like domino circuits, their peak currents are very large due to the concentration of precharging and discharging in a short period. It is hard for these schemes to achieve both reduced power dissipation and high performance at the same time. In the field of power engineering, leveling power means decreasing the peak-to-peak variation of power while keeping its total amount. We therefore propose a clocking scheme that levels the power dissipation of processing elements and mainly reduces the power dissipation of clock drivers. Our proposed clocking scheme uses an overlapped clock with fine-grain power control, so that peak current becomes lower and power dissipation over a short period is leveled without any penalty in speed performance. The proposed scheme is applied to a 4-bit array multiplier, and the reductions in power dissipation of both the multiplier and the clock driver are measured with the HSPICE simulator based on 0.5 µm CMOS technology. It is shown that the power dissipation of the clock drivers, the 4-bit array multiplier, and the total are reduced by about 13.2 percent, 2.6 percent and 7.0 percent, respectively. As a result, our clocking scheme is effective in reducing the power dissipation of clock drivers.

  • A Topology Preserving Neural Network for Nonstationary Distributions

    Taira NAKAJIMA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    LETTER-Bio-Cybernetics and Neurocomputing

      Vol:
    E82-D No:7
      Page(s):
    1131-1135

    We propose a learning algorithm for self-organizing neural networks to form a topology preserving map from an input manifold whose topology may dynamically change. Experimental results show that the network using the proposed algorithm can rapidly adjust itself to represent the topology of nonstationary input distributions.

  • The Object-Space Parallel Processing of the Multipass Rendering Method on the (Mπ)2 with a Distributed-Frame Buffer System

    Hitoshi YAMAUCHI  Takayuki MAEDA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Computer Architecture

      Vol:
    E80-D No:9
      Page(s):
    909-918

    The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. Especially, we focus on the performance evaluation of the view-dependent object-space parallel processing on the (Mπ)2 which has been proposed in our previous paper. We also propose two kinds of distributed frame buffer system named cached frame buffer and multistage-interconnected frame buffer. These frame buffer systems can solve the access conflict problem on the frame buffer. The simulation results show that the (Mπ)2 has a scalable performance. For example, the (Mπ)2 with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)2 in the case that a large number of high-performance processing elements are adopted in the system.

  • An Active Learning Algorithm Based on Existing Training Data

    Hiroyuki TAKIZAWA  Taira NAKAJIMA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Biocybernetics, Neurocomputing

      Vol:
    E83-D No:1
      Page(s):
    90-99

    A multilayer perceptron is usually considered a passive learner that only receives given training data. However, if a multilayer perceptron actively gathers training data that resolve its uncertainty about a problem being learnt, sufficiently accurate classification is attained with fewer training data. Recently, such active learning has been receiving an increasing interest. In this paper, we propose a novel active learning strategy. The strategy attempts to produce only useful training data for multilayer perceptrons to achieve accurate classification, and avoids generating redundant training data. Furthermore, the strategy attempts to avoid generating temporarily useful training data that will become redundant in the future. As a result, the strategy can allow multilayer perceptrons to achieve accurate classification with fewer training data. To demonstrate the performance of the strategy in comparison with other active learning strategies, we also propose an empirical active learning algorithm as an implementation of the strategy, which does not require expensive computations. Experimental results show that the proposed algorithm improves the classification accuracy of a multilayer perceptron with fewer training data than that for a conventional random selection algorithm that constructs a training data set without explicit strategies. Moreover, the algorithm outperforms typical active learning algorithms in the experiments. Those results show that the algorithm can construct an appropriate training data set at lower computational cost, because training data generation is usually costly. Accordingly, the algorithm proves the effectiveness of the strategy through the experiments. We also discuss some drawbacks of the algorithm.

  • A Hierarchical Computation Scheme

    A. K. CHAKRAVARTY  Tadao NAKAMURA  Yoshiharu SHIGEI  

     
    PAPER-Computation Scheme

      Vol:
    E68-E No:7
      Page(s):
    484-491

    In recent years a lot of attention has been focused on writing error-free programs in an easily readable and understandable manner. It is also recognized that the so-called "von Neumann" or "Imperative" languages may not be the right medium to work in this direction, as these have a complex body lacking the solid foundation of computational mathematics. With the announcement of yet another such language, there is a further addition of various new language constructs, thereby only helping to build up the confusion. On the other hand, functional languages are based upon a solid foundation and produce programs which are semantically very clear. These languages, however, have not found favor with the computing community, primarily because they are not history sensitive, apart from being inefficient to run on presently available computers. An appealing alternative has been proposed by Backus in terms of an applicative language independent of the lambda calculus and possessing history sensitivity by means of a loose coupling between computation and the state (of the store). In this paper, we pick up his ideas and work up a computation scheme which introduces an amount of abstraction in the representation of variables. Specifically, we do not bind a variable to a particular value to the declared type, but rather we assign limits to the values of the variable. These limits are changeable and depend upon the available semantic information. It is observed that such a scheme can exploit the potential of working at higher levels, notable among which, in this particular scheme, is the possibility of a considerable increase in the speed of computation.

  • Vector Quantization Codebook Design Using the Law-of-the-Jungle Algorithm

    Hiroyuki TAKIZAWA  Taira NAKAJIMA  Kentaro SANO  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Image Processing, Image Pattern Recognition

      Vol:
    E86-D No:6
      Page(s):
    1068-1077

    The equidistortion principle[1] has recently been proposed as a basic principle for design of an optimal vector quantization (VQ) codebook. The equidistortion principle adjusts all codebook vectors such that they have the same contribution to quantization error. This paper introduces a novel VQ codebook design algorithm based on the equidistortion principle. The proposed algorithm is a variant of the law-of-the-jungle algorithm (LOJ), which duplicates useful codebook vectors and removes useless vectors. Due to the LOJ mechanism, the proposed algorithm can establish the equidistortion condition without wasting learning steps. This is significantly effective in preventing performance degradation caused when initial states of codebook vectors are improper to find an optimal codebook. Therefore, even in the case of improper initialization, the proposed algorithm can achieve minimization of quantization error based on the equidistortion principle. Performance of the proposed algorithm is discussed through experimental results.
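The equidistortion condition described in this abstract can be illustrated with a minimal sketch. It only measures how much each codebook vector contributes to the total squared quantization error; it is not the authors' LOJ-based design algorithm. The function name `partial_distortions` and the use of squared Euclidean error are assumptions made for this illustration.

```python
def partial_distortions(data, codebook):
    """Return, per codebook vector, the summed squared error of the samples it wins.

    Hypothetical helper: under the equidistortion condition these
    per-vector totals would be (approximately) equal.
    """
    totals = [0.0] * len(codebook)
    for x in data:
        # Squared Euclidean distance from the sample to every codebook vector.
        errs = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
        # The nearest codebook vector quantizes this sample.
        k = min(range(len(codebook)), key=errs.__getitem__)
        totals[k] += errs[k]
    return totals


samples = [(0.0,), (1.0,), (9.0,), (10.0,)]
balanced = partial_distortions(samples, [(0.5,), (9.5,)])    # equal contributions
unbalanced = partial_distortions(samples, [(0.0,), (5.0,)])  # unequal contributions
```

Comparing the two results shows the idea: a well-placed codebook spreads the error evenly across its vectors, while a poorly initialized one concentrates it in a few.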

  • Kohonen Learning with a Mechanism, the Law of the Jungle, Capable of Dealing with Nonstationary Probability Distribution Functions

    Taira NAKAJIMA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Bio-Cybernetics and Neurocomputing

      Vol:
    E81-D No:6
      Page(s):
    584-591

    We present a mechanism, named the law of the jungle (LOJ), to improve the Kohonen learning. The LOJ is used as an adaptive vector quantizer for approximating nonstationary probability distribution functions. In the LOJ mechanism, the probability that each node wins in a competition is dynamically estimated during the learning. By using the estimated win probability, "strong" nodes are increased through creating new nodes near the nodes, and "weak" nodes are decreased through deleting themselves. A pair of creation and deletion is treated as an atomic operation. Therefore, the nodes which cannot win the competition are transferred directly from the region where inputs almost never occur to the region where inputs often occur. This direct "jump" of weak nodes provides rapid convergence. Moreover, the LOJ requires neither time-decaying parameters nor a special periodic adaptation. For these reasons, the LOJ is suitable for quick approximation of nonstationary probability distribution functions. In comparison with some other Kohonen learning networks through experiments, only the LOJ can follow nonstationary probability distributions, except in high-noise environments.
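The create-and-delete "jump" described in this abstract can be sketched roughly as follows. This is a one-dimensional illustration under stated assumptions, not the paper's algorithm: the exponential moving average used for the win-probability estimate, the threshold `low`, the `jitter` offset, and the name `loj_step` are all hypothetical choices made here.

```python
import random


def loj_step(nodes, wins, x, lr=0.1, decay=0.99, low=0.01, jitter=0.01, rng=random):
    """One competitive update followed by an optional law-of-the-jungle jump.

    nodes: list of scalar node positions (modified in place)
    wins:  list of estimated win probabilities (modified in place)
    """
    # Competition: the node nearest to the input wins and moves toward it.
    winner = min(range(len(nodes)), key=lambda i: (nodes[i] - x) ** 2)
    nodes[winner] += lr * (x - nodes[winner])

    # Assumed estimator: exponential moving average of each node's win rate.
    for i in range(len(wins)):
        wins[i] = decay * wins[i] + (1.0 - decay) * (1.0 if i == winner else 0.0)

    weakest = min(range(len(nodes)), key=wins.__getitem__)
    strongest = max(range(len(nodes)), key=wins.__getitem__)
    if weakest != strongest and wins[weakest] < low:
        # Deletion of the weak node and creation near the strong node happen
        # together, so the number of nodes never changes.
        nodes[weakest] = nodes[strongest] + rng.uniform(-jitter, jitter)
        shared = (wins[strongest] + wins[weakest]) / 2.0
        wins[strongest] = shared
        wins[weakest] = shared
    return winner


# Usage: nodes start near 0, but the inputs arrive around 10.
rng = random.Random(0)
nodes = [0.0, 0.1, 0.2, 0.3]
wins = [0.25] * 4
for _ in range(2000):
    loj_step(nodes, wins, rng.uniform(9.0, 11.0), rng=rng)
```

In this toy run the nodes that never win are moved directly next to the winning node instead of drifting there slowly, which is the behavior the abstract attributes to the "jump" of weak nodes.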

  • Identifying Program Loop Nesting Structures during Execution of Machine Code

    Yukinori SATO  Yasushi INOGUCHI  Tadao NAKAMURA  

     
    PAPER-Computer System

      Vol:
    E97-D No:9
      Page(s):
    2371-2385

    This paper presents a mechanism for detecting dynamic loop and procedure nesting during the actual program execution on-the-fly. This mechanism aims primarily at making better strategies for performance tuning or parallelization. Using a pre-compiled application executable machine code as an input, our mechanism statically generates simple but precise markers that indicate loop entries and loop exits, and dynamically monitors loop nesting that appears during the actual execution together with call context tree. To keep precise loop structures all the time, we monitor the indirect jumps that enter the loop regions and the setjmp/longjmp functions that cause irregular function call transfers. We also present a novel representation called Loop-Call Context Graph that can keep track of inter-procedural loop nests. We implement our mechanism and evaluate it using SPEC CPU2006 benchmark suite. The results confirm that our mechanism can successfully reveal the precise inter-procedural loop nest structures from all of SPEC CPU2006 benchmark executions without any particular compiler support. The results also show that it can reduce runtime loop detection overheads compared with the existing loop profiling method.

  • Power Estimation of Partitioned Register Files in a Clustered Architecture with Performance Evaluation

    Yukinori SATO  Ken-ichi SUZUKI  Tadao NAKAMURA  

     
    PAPER-VLSI Systems

      Vol:
    E90-D No:3
      Page(s):
    627-636

    High power consumption and slow access of enlarged and multiported register files make it difficult to design high performance superscalar processors. The clustered architecture, where the conventional monolithic register file is partitioned into several smaller register files, is expected to overcome the register file issues. In the clustered architecture, the more a monolithic register file is partitioned, the lower the power consumption and the faster the access of the register files. However, the partitioning causes losses of IPC (instructions per clock cycle) due to communication among register files. Therefore, the degree of partitioning has a strong impact on the trade-off between power consumption and performance. In addition, the organization of partitioned register files also affects the trade-off. In this paper, we investigate appropriate degrees of partitioning and organizations of partitioned register files in a clustered architecture to assess the trade-off. From the results of execution-driven simulation, we find that the organization of register files and the degree of partitioning have a strong impact on the IPC, and that the configuration with non-consistent register files can make use of the partitioned resources more effectively. From the results of register file access time and energy modeling, we find that the configurations with the highly partitioned non-consistent register file organization can benefit from the partitioning in terms of operating frequency and access energy of register files. Further, we examine the relationship between IPS (instructions per second) and the product of IPC and operating frequency of register files. The results suggest that highly partitioned non-consistent configurations tend to gain more advantage in performance and power.

  • Load Balancing Based on Load Coherence between Continuous Images for an Object-Space Parallel Ray-Tracing System

    Hiroaki KOBAYASHI  Hideyuki KUBOTA  Susumu HORIGUCHI  Tadao NAKAMURA  

     
    PAPER-Computer Systems

      Vol:
    E76-D No:12
      Page(s):
    1490-1499

    The ray-tracing algorithm can synthesize very realistic images. However, ray tracing is very time consuming. To solve this problem, a load balancing strategy using temporal coherence between images in an animation is presented for balancing computational loads among processing elements of a parallel processing system. Our parallel processing model is based on a space subdivision method for the ray-tracing algorithm. A subdivided object space is distributed among processing elements of the parallel system. To clarify the effectiveness of the load balancing strategy, we examine the system performance by computer simulation.

  • (Mπ)2: A Hierarchical Parallel Processing System for the Multipass Rendering Method

    Hiroaki KOBAYASHI  Hitoshi YAMAUCHI  Yuichiro TOH  Tadao NAKAMURA  

     
    PAPER-Architectures

      Vol:
    E79-D No:8
      Page(s):
    1055-1064

    This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method based on the integration of radiosity and ray-tracing can synthesize photo-realistic images. However, the method is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)2, employs two kinds of parallel processing schemes. As a coarse-grain parallel processing, object-space parallel processing with multiple processing elements based on the object-space subdivision is adapted, and each processing element (PE) is equipped with multiple pipelined units for a fine-grain parallel processing. To balance load among the system, static load balancing at the PE level and dynamic load balancing at the pipelined unit level within the PE are introduced. Especially, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space to one- or two-dimensional processor configuration of the (Mπ)2 system. Simulation experiments show that the two-dimensional (Mπ)2 systems with the skewed-distributed allocation outperform the three-dimensional systems with the non-skewed distributed allocation. Since lower dimensional systems can be built at a lower cost than higher dimensional systems, the skewed-distributed allocation will be meritorious. Besides, by the combination of static load balancing by the skewed-distributed allocation and the dynamic load balancing by dynamic ray allocation within each PE, the system performance can be further boosted. We also propose a cached frame buffer system to relieve access collision on a frame buffer.

  • Data-Parallel Volume Rendering with Adaptive Volume Subdivision

    Kentaro SANO  Hiroyuki KITAJIMA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Computer Graphics

      Vol:
    E83-D No:1
      Page(s):
    80-89

    A data-parallel processing approach is promising for real-time volume rendering because of the massive parallelism in volume rendering. In data-parallel volume rendering, the local results that processing elements (PEs) generate from their allocated subvolumes are integrated to form a final image. Generally, the integration causes an overhead that is unavoidable in data-parallel volume rendering due to communications among PEs. This paper proposes a data-parallel shear-warp volume rendering algorithm combined with an adaptive volume subdivision method to reduce the communication overhead and improve processing efficiency. We implement the parallel algorithm on a message-passing multiprocessor system for performance evaluation. The experimental results show that the adaptive volume subdivision method can reduce the overhead and achieve higher efficiency compared with a conventional slab subdivision method.

  • Acceleration Techniques for the Network Inversion Algorithm

    Hiroyuki TAKIZAWA  Taira NAKAJIMA  Masaaki NISHI  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    LETTER-Bio-Cybernetics and Neurocomputing

      Vol:
    E82-D No:2
      Page(s):
    508-511

    We apply two acceleration techniques for the backpropagation algorithm to an iterative gradient descent algorithm called the network inversion algorithm. Experimental results show that these techniques are also quite effective in decreasing the number of iterations required for the detection of input vectors on the classification boundary of a multilayer perceptron.