The search functionality is under construction.

Author Search Result

[Author] Hiroaki KOBAYASHI(21hit)

1-20hit(21hit)

  • A Machine Learning-Based Approach for Selecting SpMV Kernels and Matrix Storage Formats

    Hang CUI  Shoichi HIRASAWA  Hiroaki KOBAYASHI  Hiroyuki TAKIZAWA  

     
    PAPER-Artificial Intelligence, Data Mining

      Pubricized:
    2018/06/13
      Vol:
    E101-D No:9
      Page(s):
    2307-2314

    Sparse Matrix-Vector multiplication (SpMV) is a computational kernel widely used in many applications. Because of the importance, many different implementations have been proposed to accelerate this computational kernel. The performance characteristics of those SpMV implementations are quite different, and it is basically difficult to select the implementation that has the best performance for a given sparse matrix without performance profiling. One existing approach to the SpMV best-code selection problem is by using manually-predefined features and a machine learning model for the selection. However, it is generally hard to manually define features that can perfectly express the characteristics of the original sparse matrix necessary for the code selection. Besides, some information loss would happen by using this approach. This paper hence presents an effective deep learning mechanism for SpMV code selection best suited for a given sparse matrix. Instead of using manually-predefined features of a sparse matrix, a feature image and a deep learning network are used to map each sparse matrix to the implementation, which is expected to have the best performance, in advance of the execution. The benefits of using the proposed mechanism are discussed by calculating the prediction accuracy and the performance. According to the evaluation, the proposed mechanism can select an optimal or suboptimal implementation for an unseen sparse matrix in the test data set in most cases. These results demonstrate that, by using deep learning, a whole sparse matrix can be used to do the best implementation prediction, and the prediction accuracy achieved by the proposed mechanism is higher than that of using predefined features.

  • Kohonen Learning with a Mechanism, the Law of the Jungle, Capable of Dealing with Nonstationary Probability Distribution Functions

    Taira NAKAJIMA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Bio-Cybernetics and Neurocomputing

      Vol:
    E81-D No:6
      Page(s):
    584-591

    We present a mechanism, named the law of the jungle (LOJ), to improve the Kohonen learning. The LOJ is used to be an adaptive vector quantizer for approximating nonstationary probability distribution functions. In the LOJ mechanism, the probability that each node wins in a competition is dynamically estimated during the learning. By using the estimated win probability, "strong" nodes are increased through creating new nodes near the nodes, and "weak" nodes are decreased through deleting themselves. A pair of creation and deletion is treated as an atomic operation. Therefore, the nodes which cannot win the competition are transferred directly from the region where inputs almost never occur to the region where inputs often occur. This direct "jump" of weak nodes provides rapid convergence. Moreover, the LOJ requires neither time-decaying parameters nor a special periodic adaptation. From the above reasons, the LOJ is suitable for quick approximation of nonstationary probability distribution functions. In comparison with some other Kohonen learning networks through experiments, only the LOJ can follow nonstationary probability distributions except for under high-noise environments.

  • Preparation and Evaluation of Aligned Naphthacene Thin Films Using Surface Plasmon Excitation

    Tohru SHIMAOKA  Hiroaki KOBAYASHI  Kazuki YAMASHITA  Yasuo OHDAIRA  Kazunari SHINBO  Keizo KATO  Futao KANEKO  

     
    LETTER-Evaluation of Organic Materials

      Vol:
    E89-C No:12
      Page(s):
    1758-1759

    Molecular aligned naphthacene thins films were fabricated using vacuum evaporation and the rubbing method. The attenuated total reflection (ATR) and emission light properties from surface plasmon (SP) excitation due to molecular luminescence were investigated for these films. The long axis of the rod-like molecule was estimated to align perpendicular to the rubbing direction. The ATR and emission light properties depended on the molecular orientation.

  • Load Balancing Based on Load Coherence between Continuous Images for an Object-Space Parallel Ray-Tracing System

    Hiroaki KOBAYASHI  Hideyuki KUBOTA  Susumu HORIGUCHI  Tadao NAKAMURA  

     
    PAPER-Computer Systems

      Vol:
    E76-D No:12
      Page(s):
    1490-1499

    The ray-tracing algorithm can synthesize very realistic images. However, the ray tracing is very time consuming. To solve this problem, a load balancing strategy using temporal coherence between images in an animation is presented for balancing computational loads among processing elements of a parallel processng system. Our parallel processing model is based on a space subdivision method for the ray-tracing algorithm. A subdivided object space is distributed among processing elements of the parallel system. To clarify the effectiveness of the load balancing strategy, we examine the system performance by computer simulation.

  • (Mπ)2: A Hierarchical Parallel Processing System for the Multipass Rendering Method

    Hiroaki KOBAYASHI  Hitoshi YAMAUCHI  Yuichiro TOH  Tadao NAKAMURA  

     
    PAPER-Architectures

      Vol:
    E79-D No:8
      Page(s):
    1055-1064

    This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method based on the integration of radiosity and ray-tracing can synthesize photo-realistic images. However, the method is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)2, employs two kinds of parallel processing schemes. As a coarse-grain parallel processing, object-space parallel processing with multiple processing elements based on the object-space subdivision is adapted, and each processing element (PE) is equipped with multiple pipelined units for a fine-grain parallel processing. To balance load among the system, static load balancing at the PE level and dynamic load balancing at the pipelined unit level within the PE are introduced. Especially, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space to one- or two-dimensional processor configuration of the (Mπ)2 system. Simulation experiments show that the two-dimensional (Mπ)2 systems with the skewed-distributed allocation outperform the three-dimensional systems with the non-skewed distributed allocation. Since lower dimensional systems can be built at a lower cost than higher dimensional systems, the skewed-distributed allocation will be meritorious. Besides, by the combination of static load balancing by the skewed-distributed allocation and the dynamic load balancing by dynamic ray allocation within each PE, the system performance can be further boosted. We also propose a cached frame buffer system to relieve access collision on a frame buffer.

  • Data-Parallel Volume Rendering with Adaptive Volume Subdivision

    Kentaro SANO  Hiroyuki KITAJIMA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Computer Graphics

      Vol:
    E83-D No:1
      Page(s):
    80-89

    A data-parallel processing approach is promising for real-time volume rendering because of the massive parallelism in volume rendering. In data-parallel volume rendering, local results processing elements(PEs) generate from allocated subvolumes are integrated to form a final image. Generally, the integration causes an overhead unavoidable in data-parallel volume rendering due to communications among PEs. This paper proposes a data-parallel shear-warp volume rendering algorithm combined with an adaptive volume subdivision method to reduce the communication overhead and improve processing efficiency. We implement the parallel algorithm on a message-passing multiprocessor system for performance evaluation. The experimental results show that the adaptive volume subdivision method can reduce the overhead and achieve higher efficiency compared with a conventional slab subdivision method.

  • Slot-Array Receiving Antennas Fed by Coplanar Waveguide for 700 GHz Submillimeter-Wave Radiation

    Hiroaki KOBAYASHI  Yasuhiko ABE  Yoshizumi YASUOKA  

     
    PAPER-Phased Arrays and Antennas

      Vol:
    E82-C No:7
      Page(s):
    1248-1252

    Thin-film slot-array receiving antennas fed by coplanar waveguide (CPW) were fabricated on fused quartz substrates, and the antenna properties were investigated at 700 GHz. It was confirmed that the transmission efficiency of CPW was 0.83/λm, and the rate of radiated power from a slot antenna was 0.5 at 700 GHz. The fabricated antennas worked as expected from the theory based on the transmission line model, and the two-dimensional 83 slot-array antenna fed by CPW increased the power gain by 11 dB over a single-slot antenna. The power gain of the antenna was 13 dBi and the aperture efficiency was 40% when the 700 GHz-submillimeter wave was irradiated through the substrate.

  • Acceleration Techniques for the Network Inversion Algorithm

    Hiroyuki TAKIZAWA  Taira NAKAJIMA  Masaaki NISHI  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    LETTER-Bio-Cybernetics and Neurocomputing

      Vol:
    E82-D No:2
      Page(s):
    508-511

    We apply two acceleration techniques for the backpropagation algorithm to an iterative gradient descent algorithm called the network inversion algorithm. Experimental results show that these techniques are also quite effective to decrease the number of iterations required for the detection of input vectors on the classification boundary of a multilayer perceptron.

  • A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Vol:
    E96-D No:9
      Page(s):
    2047-2054

    Chip multiprocessors (CMPs) improve performance by simultaneously executing multiple threads using integrated multiple cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement by multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In shared caches of CMPs, cached data fetched by one thread are frequently evicted by another thread. Such an eviction, called inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage that occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete to use the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are decreased by cache partitioning and thread scheduling, respectively. The proposed scheduling method estimates the capacity demand of each thread with an estimation method used in the cache partitioning mechanism. Based on the estimation used for cache partitioning, the thread scheduler decides thread combinations sharing one cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable to avoid both ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce the inter-thread cache conflicts and hence improve performance.

  • A Fast Ray-Tracing Using Bounding Spheres and Frustum Rays for Dynamic Scene Rendering

    Ken-ichi SUZUKI  Yoshiyuki KAERIYAMA  Kazuhiko KOMATSU  Ryusuke EGAWA  Nobuyuki OHBA  Hiroaki KOBAYASHI  

     
    PAPER-Computer Graphics

      Vol:
    E93-D No:4
      Page(s):
    891-902

    Ray tracing is one of the most popular techniques for generating photo-realistic images. Extensive research and development work has made interactive static scene rendering realistic. This paper deals with interactive dynamic scene rendering in which not only the eye point but also the objects in the scene change their 3D locations every frame. In order to realize interactive dynamic scene rendering, RTRPS (Ray Tracing based on Ray Plane and Bounding Sphere), which utilizes the coherency in rays, objects, and grouped-rays, is introduced. RTRPS uses bounding spheres as the spatial data structure which utilizes the coherency in objects. By using bounding spheres, RTRPS can ignore the rotation of moving objects within a sphere, and shorten the update time between frames. RTRPS utilizes the coherency in rays by merging rays into a ray-plane, assuming that the secondary rays and shadow rays are shot through an aligned grid. Since a pair of ray-planes shares an original ray, the intersection for the ray can be completed using the coherency in the ray-planes. Because of the three kinds of coherency, RTRPS can significantly reduce the number of intersection tests for ray tracing. Further acceleration techniques for ray-plane-sphere and ray-triangle intersection are also presented. A parallel projection technique converts a 3D vector inner product operation into a 2D operation and reduces the number of floating point operations. Techniques based on frustum culling and binary-tree structured ray-planes optimize the order of intersection tests between ray-planes and a sphere, resulting in 50% to 90% reduction of intersection tests. Two ray-triangle intersection techniques are also introduced, which are effective when a large number of rays are packed into a ray-plane. Our performance evaluations indicate that RTRPS gives 13 to 392 times speed up in comparison with a ray tracing algorithm without organized rays and spheres. We found out that RTRPS also provides competitive performance even if only primary rays are used.

  • FLEXII: A Flexible Insertion Policy for Dynamic Cache Resizing Mechanisms

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER

      Vol:
    E98-C No:7
      Page(s):
    550-558

    As energy consumption of cache memories increases, an energy-efficient cache management mechanism is required. While a dynamic cache resizing mechanism is one promising approach to the energy reduction of microprocessors, one problem is that its effect is limited by the existence of dead-on-fill blocks, which are not used until their evictions from the cache memory. To solve this problem, this paper proposes a cache management policy named FLEXII, which can reduce the number of dead-on-fill blocks and help dynamic cache resizing mechanisms further reduce the energy consumption of the cache memories.

  • A Metadata Prefetching Mechanism for Hybrid Memory Architectures Open Access

    Shunsuke TSUKADA  Hikaru TAKAYASHIKI  Masayuki SATO  Kazuhiko KOMATSU  Hiroaki KOBAYASHI  

     
    PAPER

      Pubricized:
    2021/12/03
      Vol:
    E105-C No:6
      Page(s):
    232-243

    A hybrid memory architecture (HMA) that consists of some distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, the HMA needs the metadata for data management since the data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for the metadata reference. However, as the amount of the metadata increases in proportion to the size of the HMA, the memory controller needs to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata and increases the number of metadata references. This results in an increase in the access latency to reach the target data and degrades the performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effect of the metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference that is the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve the instructions per cycle by up to 44% and 9% on average.

  • A Light-Weight Rollback Mechanism for Testing Kernel Variants in Auto-Tuning

    Shoichi HIRASAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Software

      Pubricized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2178-2186

    Automatic performance tuning of a practical application could be time-consuming and sometimes infeasible, because it often needs to evaluate the performances of a large number of code variants to find the best one. In this paper, hence, a light-weight rollback mechanism is proposed to evaluate each of code variants at a low cost. In the proposed mechanism, once one code variant of a target code block is executed, the execution state is rolled back to the previous state of not yet executing the block so as to repeatedly execute only the block to find the best code variant. It also has a feature of terminating a code variant whose execution time is longer than the shortest execution time so far. As a result, it can prevent executing the whole application many times and thus reduces the timing overhead of an auto-tuning process required for finding the best code variant.

  • Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems

    Muhammad ALFIAN AMRIZAL  Atsuya UNO  Yukinori SATO  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-High performance computing

      Pubricized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2749-2760

    Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.

  • MVP-Cache: A Multi-Banked Cache Memory for Energy-Efficient Vector Processing of Multimedia Applications

    Ye GAO  Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Pubricized:
    2014/08/22
      Vol:
    E97-D No:11
      Page(s):
    2835-2843

    Vector processors have significant advantages for next generation multimedia applications (MMAs). One of the advantages is that vector processors can achieve high data transfer performance by using a high bandwidth memory sub-system, resulting in a high sustained computing performance. However, the high bandwidth memory sub-system usually leads to enormous costs in terms of chip area, power and energy consumption. These costs are too expensive for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems called MVP-cache in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array in a sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache realizes less static power consumption on its tag arrays. MVP-cache can also achieve high efficiency on short vector data transfers because the flexibility of data transfers can be improved by independently controlling the data transfers of each data array.

  • A Topology Preserving Neural Network for Nonstationary Distributions

    Taira NAKAJIMA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    LETTER-Bio-Cybernetics and Neurocomputing

      Vol:
    E82-D No:7
      Page(s):
    1131-1135

    We propose a learning algorithm for self-organizing neural networks to form a topology preserving map from an input manifold whose topology may dynamically change. Experimental results show that the network using the proposed algorithm can rapidly adjust itself to represent the topology of nonstationary input distributions.

  • An Efficient Reference Image Sharing Method for the Image-Division Parallel Video Encoding Architecture

    Ken NAKAMURA  Yuya OMORI  Daisuke KOBAYASHI  Koyo NITTA  Kimikazu SANO  Masayuki SATO  Hiroe IWASAKI  Hiroaki KOBAYASHI  

     
    PAPER

      Pubricized:
    2022/11/29
      Vol:
    E106-C No:6
      Page(s):
    312-320

    This paper proposes an efficient reference image sharing method for the image-division parallel video encoding architecture. This method efficiently reduces the amount of data transfer by using pre-transfer with area prediction and on-demand transfer with a transfer management table. Experimental results show that the data transfer can be reduced to 19.8-35.3% of the conventional method on average without major degradation of coding performance. This makes it possible to reduce the required bandwidth of the inter-chip transfer interface by saving the amount of data transfer.

  • The Object-Space Parallel Processing of the Multipass Rendering Method on the (Mπ)2 with a Distributed-Frame Buffer System

    Hitoshi YAMAUCHI  Takayuki MAEDA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Computer Architecture

      Vol:
    E80-D No:9
      Page(s):
    909-918

    The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. Especially, we focus on the performance evaluation of the view-dependent object-space parallel processing on the (Mπ)2 which has been proposed in our previous paper. We also propose two kinds of distributed frame buffer system named cached frame buffer and multistage-interconnected frame buffer. These frame buffer systems can solve the access conflict problem on the frame buffer. The simulation results show that the (Mπ)2 has a scalable performance. For example, the (Mπ)2 with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)2 in the case that a large number of high-performance processing elements are adopted in the system.

  • A Network Clustering Algorithm for Sybil-Attack Resisting

    Ling XU  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER

      Vol:
    E94-D No:12
      Page(s):
    2345-2352

    The social network model has been regarded as a promising mechanism to defend against Sybil attack. This model assumes that honest peers and Sybil peers are connected by only a small number of attack edges. Detection of the attack edges plays a key role in restraining the power of Sybil peers. In this paper, an attack-resisting, distributed algorithm, named Random walk and Social network model-based clustering (RSC), is proposed to detect the attack edges. In RSC, peers disseminate random walk packets to each other. For each edge, the number of times that the packets pass this edge reflects the betweenness of this edge. RSC observes that the betweennesses of attack edges are higher than those of the non-attack edges. In this way, the attack edges can be identified. To show the effectiveness of RSC, RSC is integrated into an existing social network model-based algorithm called SOHL. The results of simulations with real world social network datasets show that RSC remarkably improves the performance of SOHL.

  • An Active Learning Algorithm Based on Existing Training Data

    Hiroyuki TAKIZAWA  Taira NAKAJIMA  Hiroaki KOBAYASHI  Tadao NAKAMURA  

     
    PAPER-Biocybernetics, Neurocomputing

      Vol:
    E83-D No:1
      Page(s):
    90-99

    A multilayer perceptron is usually considered a passive learner that only receives given training data. However, if a multilayer perceptron actively gathers training data that resolve its uncertainty about a problem being learnt, sufficiently accurate classification is attained with fewer training data. Recently, such active learning has been receiving an increasing interest. In this paper, we propose a novel active learning strategy. The strategy attempts to produce only useful training data for multilayer perceptrons to achieve accurate classification, and avoids generating redundant training data. Furthermore, the strategy attempts to avoid generating temporarily useful training data that will become redundant in the future. As a result, the strategy can allow multilayer perceptrons to achieve accurate classification with fewer training data. To demonstrate the performance of the strategy in comparison with other active learning strategies, we also propose an empirical active learning algorithm as an implementation of the strategy, which does not require expensive computations. Experimental results show that the proposed algorithm improves the classification accuracy of a multilayer perceptron with fewer training data than that for a conventional random selection algorithm that constructs a training data set without explicit strategies. Moreover, the algorithm outperforms typical active learning algorithms in the experiments. Those results show that the algorithm can construct an appropriate training data set at lower computational cost, because training data generation is usually costly. Accordingly, the algorithm proves the effectiveness of the strategy through the experiments. We also discuss some drawbacks of the algorithm.

1-20hit(21hit)