IEICE global.ieice.org Site

Author Search Result

[Author] Hiroaki KOBAYASHI(21hit)

1-20hit(21hit)

A Machine Learning-Based Approach for Selecting SpMV Kernels and Matrix Storage Formats
Hang CUI Shoichi HIRASAWA Hiroaki KOBAYASHI Hiroyuki TAKIZAWA

PAPER-Artificial Intelligence, Data Mining

Publicized:
2018/06/13
Vol:
E101-D No:9
Page(s):
2307-2314
Sparse Matrix-Vector multiplication (SpMV) is a computational kernel widely used in many applications. Because of its importance, many different implementations have been proposed to accelerate this computational kernel. The performance characteristics of those SpMV implementations are quite different, and it is generally difficult to select the implementation that has the best performance for a given sparse matrix without performance profiling. One existing approach to the SpMV best-code selection problem uses manually predefined features and a machine learning model for the selection. However, it is generally hard to manually define features that can perfectly express the characteristics of the original sparse matrix necessary for the code selection. In addition, this approach loses some information. This paper hence presents an effective deep learning mechanism for selecting the SpMV code best suited for a given sparse matrix. Instead of using manually predefined features of a sparse matrix, a feature image and a deep learning network are used to map each sparse matrix to the implementation expected to have the best performance, in advance of the execution. The benefits of using the proposed mechanism are discussed by calculating the prediction accuracy and the performance. According to the evaluation, the proposed mechanism can select an optimal or suboptimal implementation for an unseen sparse matrix in the test data set in most cases. These results demonstrate that, by using deep learning, a whole sparse matrix can be used to predict the best implementation, and the prediction accuracy achieved by the proposed mechanism is higher than that of using predefined features.
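The abstract does not say how the feature image is constructed. As a minimal sketch of the general idea only, the Python function below bins the nonzero coordinates of a sparse matrix into a small fixed-size density grid that could serve as network input. The function name `feature_image`, the grid size, and the peak normalization are all assumptions made for illustration, not the authors' method.

```python
def feature_image(rows, cols, n_rows, n_cols, size=4):
    """Bin the nonzero coordinates of a sparse matrix into a size x size grid.

    rows/cols are the coordinate (COO) indices of the nonzeros. The binning and
    the peak normalization are illustrative assumptions, not the paper's method.
    """
    image = [[0.0] * size for _ in range(size)]
    for r, c in zip(rows, cols):
        # Map each matrix coordinate to the grid cell that covers it.
        image[r * size // n_rows][c * size // n_cols] += 1.0
    peak = max(max(row) for row in image)
    if peak > 0:
        # Scale so the densest cell has intensity 1.0.
        image = [[v / peak for v in row] for row in image]
    return image


# An 8 x 8 diagonal matrix: all nonzeros fall into the diagonal cells.
diag = list(range(8))
img = feature_image(diag, diag, n_rows=8, n_cols=8, size=4)
```

A fixed-size image keeps the input shape independent of the matrix dimensions, which is one plausible reason to prefer it over hand-picked scalar features.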
Kohonen Learning with a Mechanism, the Law of the Jungle, Capable of Dealing with Nonstationary Probability Distribution Functions
Taira NAKAJIMA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI Tadao NAKAMURA

PAPER-Bio-Cybernetics and Neurocomputing

Vol:
E81-D No:6
Page(s):
584-591
We present a mechanism, named the law of the jungle (LOJ), to improve Kohonen learning. The LOJ serves as an adaptive vector quantizer for approximating nonstationary probability distribution functions. In the LOJ mechanism, the probability that each node wins in a competition is dynamically estimated during the learning. By using the estimated win probability, "strong" nodes are increased through creating new nodes near the nodes, and "weak" nodes are decreased through deleting themselves. A pair of creation and deletion is treated as an atomic operation. Therefore, the nodes which cannot win the competition are transferred directly from the region where inputs almost never occur to the region where inputs often occur. This direct "jump" of weak nodes provides rapid convergence. Moreover, the LOJ requires neither time-decaying parameters nor a special periodic adaptation. For these reasons, the LOJ is suitable for quick approximation of nonstationary probability distribution functions. In experiments comparing it with some other Kohonen learning networks, only the LOJ can follow nonstationary probability distributions, except in high-noise environments.
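The creation/deletion pairing described above can be illustrated with a loose one-dimensional sketch in Python. Everything specific here is an assumption made for illustration: the function names `present` and `jump`, the exponential running estimate of the win probability, the learning rate, and the rule of moving the weakest node next to the strongest one. The abstract does not give the actual update rules.

```python
def present(nodes, win_prob, x, lr=0.1, decay=0.9):
    """Winner-take-all update plus a running estimate of each node's win probability."""
    winner = min(range(len(nodes)), key=lambda i: abs(nodes[i] - x))
    nodes[winner] += lr * (x - nodes[winner])
    for i in range(len(nodes)):
        hit = 1.0 if i == winner else 0.0
        win_prob[i] = decay * win_prob[i] + (1.0 - decay) * hit
    return winner


def jump(nodes, win_prob, offset=0.01):
    """Delete the weakest node and create one next to the strongest, as a single step."""
    weak = min(range(len(nodes)), key=win_prob.__getitem__)
    strong = max(range(len(nodes)), key=win_prob.__getitem__)
    if weak == strong:
        return False  # all estimates are equal: nothing to move
    nodes[weak] = nodes[strong] + offset
    # Split the combined win probability between the two neighbouring nodes.
    shared = (win_prob[weak] + win_prob[strong]) / 2.0
    win_prob[weak] = win_prob[strong] = shared
    return True


# Node 1 sits at 10.0, far from the inputs around 1.0, so it never wins.
nodes = [0.0, 10.0]
win_prob = [0.5, 0.5]
for _ in range(5):
    present(nodes, win_prob, 1.0)
moved = jump(nodes, win_prob)
```

After the jump, the formerly idle node sits right beside the active one instead of drifting there slowly, which is the "direct jump" behaviour the abstract credits for rapid convergence.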
Preparation and Evaluation of Aligned Naphthacene Thin Films Using Surface Plasmon Excitation
Tohru SHIMAOKA Hiroaki KOBAYASHI Kazuki YAMASHITA Yasuo OHDAIRA Kazunari SHINBO Keizo KATO Futao KANEKO

LETTER-Evaluation of Organic Materials

Vol:
E89-C No:12
Page(s):
1758-1759
Molecularly aligned naphthacene thin films were fabricated using vacuum evaporation and the rubbing method. The attenuated total reflection (ATR) and emission light properties from surface plasmon (SP) excitation due to molecular luminescence were investigated for these films. The long axis of the rod-like molecule was estimated to align perpendicular to the rubbing direction. The ATR and emission light properties depended on the molecular orientation.
Load Balancing Based on Load Coherence between Continuous Images for an Object-Space Parallel Ray-Tracing System
Hiroaki KOBAYASHI Hideyuki KUBOTA Susumu HORIGUCHI Tadao NAKAMURA

PAPER-Computer Systems

Vol:
E76-D No:12
Page(s):
1490-1499
The ray-tracing algorithm can synthesize very realistic images. However, ray tracing is very time consuming. To solve this problem, a load balancing strategy using temporal coherence between images in an animation is presented for balancing computational loads among processing elements of a parallel processing system. Our parallel processing model is based on a space subdivision method for the ray-tracing algorithm. A subdivided object space is distributed among processing elements of the parallel system. To clarify the effectiveness of the load balancing strategy, we examine the system performance by computer simulation.
(Mπ)²: A Hierarchical Parallel Processing System for the Multipass Rendering Method
Hiroaki KOBAYASHI Hitoshi YAMAUCHI Yuichiro TOH Tadao NAKAMURA

PAPER-Architectures

Vol:
E79-D No:8
Page(s):
1055-1064
This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method based on the integration of radiosity and ray-tracing can synthesize photo-realistic images. However, the method is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)², employs two kinds of parallel processing schemes. As a coarse-grain parallel processing, object-space parallel processing with multiple processing elements based on the object-space subdivision is adapted, and each processing element (PE) is equipped with multiple pipelined units for a fine-grain parallel processing. To balance load across the system, static load balancing at the PE level and dynamic load balancing at the pipelined unit level within the PE are introduced. In particular, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space to a one- or two-dimensional processor configuration of the (Mπ)² system. Simulation experiments show that the two-dimensional (Mπ)² systems with the skewed-distributed allocation outperform the three-dimensional systems with the non-skewed distributed allocation. Since lower dimensional systems can be built at a lower cost than higher dimensional systems, the skewed-distributed allocation is advantageous. In addition, by combining static load balancing by the skewed-distributed allocation with dynamic load balancing by dynamic ray allocation within each PE, the system performance can be further boosted. We also propose a cached frame buffer system to relieve access collision on a frame buffer.
Data-Parallel Volume Rendering with Adaptive Volume Subdivision
Kentaro SANO Hiroyuki KITAJIMA Hiroaki KOBAYASHI Tadao NAKAMURA

PAPER-Computer Graphics

Vol:
E83-D No:1
Page(s):
80-89
A data-parallel processing approach is promising for real-time volume rendering because of the massive parallelism in volume rendering. In data-parallel volume rendering, the local results that processing elements (PEs) generate from their allocated subvolumes are integrated to form a final image. Generally, the integration causes an overhead that is unavoidable in data-parallel volume rendering due to communications among PEs. This paper proposes a data-parallel shear-warp volume rendering algorithm combined with an adaptive volume subdivision method to reduce the communication overhead and improve processing efficiency. We implement the parallel algorithm on a message-passing multiprocessor system for performance evaluation. The experimental results show that the adaptive volume subdivision method can reduce the overhead and achieve higher efficiency compared with a conventional slab subdivision method.
Slot-Array Receiving Antennas Fed by Coplanar Waveguide for 700 GHz Submillimeter-Wave Radiation
Hiroaki KOBAYASHI Yasuhiko ABE Yoshizumi YASUOKA

PAPER-Phased Arrays and Antennas

Vol:
E82-C No:7
Page(s):
1248-1252
Thin-film slot-array receiving antennas fed by coplanar waveguide (CPW) were fabricated on fused quartz substrates, and the antenna properties were investigated at 700 GHz. It was confirmed that the transmission efficiency of CPW was 0.83/λm, and the rate of radiated power from a slot antenna was 0.5 at 700 GHz. The fabricated antennas worked as expected from the theory based on the transmission line model, and the two-dimensional 83 slot-array antenna fed by CPW increased the power gain by 11 dB over a single-slot antenna. The power gain of the antenna was 13 dBi and the aperture efficiency was 40% when the 700 GHz-submillimeter wave was irradiated through the substrate.
Acceleration Techniques for the Network Inversion Algorithm
Hiroyuki TAKIZAWA Taira NAKAJIMA Masaaki NISHI Hiroaki KOBAYASHI Tadao NAKAMURA

LETTER-Bio-Cybernetics and Neurocomputing

Vol:
E82-D No:2
Page(s):
508-511
We apply two acceleration techniques for the backpropagation algorithm to an iterative gradient descent algorithm called the network inversion algorithm. Experimental results show that these techniques are also quite effective to decrease the number of iterations required for the detection of input vectors on the classification boundary of a multilayer perceptron.
A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts
Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER-Computer System

Vol:
E96-D No:9
Page(s):
2047-2054
Chip multiprocessors (CMPs) improve performance by simultaneously executing multiple threads using integrated multiple cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement by multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In shared caches of CMPs, cached data fetched by one thread are frequently evicted by another thread. Such an eviction, called inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage that occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete to use the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are decreased by cache partitioning and thread scheduling, respectively. The proposed scheduling method estimates the capacity demand of each thread with an estimation method used in the cache partitioning mechanism. Based on the estimation used for cache partitioning, the thread scheduler decides thread combinations sharing one cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable to avoid both ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce the inter-thread cache conflicts and hence improve performance.
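To make the scheduling idea concrete, here is a minimal Python sketch that decides which threads share a cache from per-thread capacity estimates. The greedy rule (pair the largest remaining demand with the smallest), the two-threads-per-cache setup, the function names, and the demand numbers are all assumptions chosen for illustration. The abstract does not describe the scheduler's actual policy, and the cache partitioning side is not modelled.

```python
def pair_threads(demands):
    """Pair threads so that large and small capacity demands share a cache.

    demands maps a thread name to its estimated capacity demand (for example,
    in cache ways). The greedy largest-with-smallest rule is an assumption.
    """
    order = sorted(demands, key=demands.get)
    pairs = []
    while len(order) >= 2:
        small = order.pop(0)
        large = order.pop()
        pairs.append((large, small))
    if order:
        pairs.append((order[0],))  # odd thread out gets a cache to itself
    return pairs


def shortage(pairs, demands, capacity):
    """Return the groups whose total demand exceeds the shared cache capacity."""
    return [p for p in pairs if sum(demands[t] for t in p) > capacity]


demands = {"A": 12, "B": 3, "C": 10, "D": 5}   # hypothetical estimates
pairs = pair_threads(demands)
overloaded = shortage(pairs, demands, capacity=16)
naive = shortage([("A", "C"), ("B", "D")], demands, capacity=16)
```

In this example the greedy pairing keeps both caches within capacity, whereas placing the two most demanding threads together would overload one cache.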
A Fast Ray-Tracing Using Bounding Spheres and Frustum Rays for Dynamic Scene Rendering
Ken-ichi SUZUKI Yoshiyuki KAERIYAMA Kazuhiko KOMATSU Ryusuke EGAWA Nobuyuki OHBA Hiroaki KOBAYASHI

PAPER-Computer Graphics

Vol:
E93-D No:4
Page(s):
891-902
Ray tracing is one of the most popular techniques for generating photo-realistic images. Extensive research and development work has made interactive static scene rendering realistic. This paper deals with interactive dynamic scene rendering in which not only the eye point but also the objects in the scene change their 3D locations every frame. In order to realize interactive dynamic scene rendering, RTRPS (Ray Tracing based on Ray Plane and Bounding Sphere), which utilizes the coherency in rays, objects, and grouped-rays, is introduced. RTRPS uses bounding spheres as the spatial data structure which utilizes the coherency in objects. By using bounding spheres, RTRPS can ignore the rotation of moving objects within a sphere, and shorten the update time between frames. RTRPS utilizes the coherency in rays by merging rays into a ray-plane, assuming that the secondary rays and shadow rays are shot through an aligned grid. Since a pair of ray-planes shares an original ray, the intersection for the ray can be completed using the coherency in the ray-planes. Because of the three kinds of coherency, RTRPS can significantly reduce the number of intersection tests for ray tracing. Further acceleration techniques for ray-plane-sphere and ray-triangle intersection are also presented. A parallel projection technique converts a 3D vector inner product operation into a 2D operation and reduces the number of floating point operations. Techniques based on frustum culling and binary-tree structured ray-planes optimize the order of intersection tests between ray-planes and a sphere, resulting in 50% to 90% reduction of intersection tests. Two ray-triangle intersection techniques are also introduced, which are effective when a large number of rays are packed into a ray-plane. Our performance evaluations indicate that RTRPS gives 13 to 392 times speed up in comparison with a ray tracing algorithm without organized rays and spheres. We found out that RTRPS also provides competitive performance even if only primary rays are used.
FLEXII: A Flexible Insertion Policy for Dynamic Cache Resizing Mechanisms
Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER

Vol:
E98-C No:7
Page(s):
550-558
As energy consumption of cache memories increases, an energy-efficient cache management mechanism is required. While a dynamic cache resizing mechanism is one promising approach to the energy reduction of microprocessors, one problem is that its effect is limited by the existence of dead-on-fill blocks, which are not used until their evictions from the cache memory. To solve this problem, this paper proposes a cache management policy named FLEXII, which can reduce the number of dead-on-fill blocks and help dynamic cache resizing mechanisms further reduce the energy consumption of the cache memories.
A Metadata Prefetching Mechanism for Hybrid Memory Architectures (Open Access)
Shunsuke TSUKADA Hikaru TAKAYASHIKI Masayuki SATO Kazuhiko KOMATSU Hiroaki KOBAYASHI

PAPER

Publicized:
2021/12/03
Vol:
E105-C No:6
Page(s):
232-243
A hybrid memory architecture (HMA) that consists of some distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, the HMA needs the metadata for data management since the data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for the metadata reference. However, as the amount of the metadata increases in proportion to the size of the HMA, the memory controller needs to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata, and the number of metadata references increases. This results in an increase in the access latency to reach the target data and degrades the performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effect of the metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference, that is, the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve the instructions per cycle by up to 44%, and by 9% on average.
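The address-difference idea can be sketched as a very small predictor in Python. This is only an illustration under assumptions: the class name `DeltaPrefetcher`, the rule "next address = current address + last difference", and the example addresses are hypothetical, and the abstract does not describe the prediction structure the proposed mechanism actually uses.

```python
class DeltaPrefetcher:
    """Predict the next access address from the last address difference.

    A deliberately simple stand-in: it remembers only the previous address and
    assumes the most recent difference repeats once more.
    """

    def __init__(self):
        self.last_addr = None

    def access(self, addr):
        """Record an access and return the address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if delta != 0:
                # Assume the next access continues with the same stride.
                prediction = addr + delta
        self.last_addr = addr
        return prediction


prefetcher = DeltaPrefetcher()
predictions = [prefetcher.access(a) for a in (100, 164, 228, 228)]
```

The first access has no history and the repeated address gives a zero difference, so neither produces a prefetch candidate; the two strided accesses in between do.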
A Light-Weight Rollback Mechanism for Testing Kernel Variants in Auto-Tuning
Shoichi HIRASAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER-Software

Publicized:
2015/09/15
Vol:
E98-D No:12
Page(s):
2178-2186
Automatic performance tuning of a practical application could be time-consuming and sometimes infeasible, because it often needs to evaluate the performances of a large number of code variants to find the best one. This paper therefore proposes a light-weight rollback mechanism to evaluate each code variant at a low cost. In the proposed mechanism, once one code variant of a target code block is executed, the execution state is rolled back to the state before the block was executed, so that only the block is executed repeatedly to find the best code variant. It also has a feature of terminating a code variant whose execution time is longer than the shortest execution time so far. As a result, it can prevent executing the whole application many times and thus reduces the timing overhead of an auto-tuning process required for finding the best code variant.
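The two features described above, rolling back to the pre-block state and cutting off variants that are already slower than the best one, can be mimicked in a few lines of Python. This is a toy model under assumptions: a deep copy of the state stands in for the rollback mechanism, each variant is a hypothetical generator that reports a simulated cost per step instead of being timed with a real clock, and the names `tune`, `slow`, `fast`, and `medium` are invented for the example.

```python
import copy


def tune(state, variants):
    """Try each variant from the same starting state and keep the cheapest.

    variants maps a name to a generator function that mutates its argument and
    yields the simulated cost of each step.
    """
    best_name, best_cost = None, float("inf")
    log = {}
    for name, variant in variants.items():
        trial = copy.deepcopy(state)  # rollback point: the original state is untouched
        cost, finished = 0, True
        for step_cost in variant(trial):
            cost += step_cost
            if cost > best_cost:
                finished = False      # already slower than the best so far: stop early
                break
        log[name] = (cost, finished)
        if finished and cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, log


def make_variant(step_cost):
    def variant(data):
        for i in range(len(data)):
            data[i] *= 2
            yield step_cost
    return variant


slow, fast, medium = make_variant(5), make_variant(1), make_variant(3)
state = [1, 2, 3, 4]
best, log = tune(state, {"slow": slow, "fast": fast, "medium": medium})
```

In this run the third variant is abandoned after two steps because its accumulated cost already exceeds the best total, and `state` is unchanged because every trial worked on a copy.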
Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems
Muhammad ALFIAN AMRIZAL Atsuya UNO Yukinori SATO Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER-High performance computing

Publicized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2749-2760
Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to an 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.
MVP-Cache: A Multi-Banked Cache Memory for Energy-Efficient Vector Processing of Multimedia Applications
Ye GAO Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER-Computer System

Publicized:
2014/08/22
Vol:
E97-D No:11
Page(s):
2835-2843
Vector processors have significant advantages for next generation multimedia applications (MMAs). One of the advantages is that vector processors can achieve high data transfer performance by using a high bandwidth memory sub-system, resulting in a high sustained computing performance. However, the high bandwidth memory sub-system usually leads to enormous costs in terms of chip area, power and energy consumption. These costs are too expensive for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems called MVP-cache in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array in a sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache realizes less static power consumption on its tag arrays. MVP-cache can also achieve high efficiency on short vector data transfers because the flexibility of data transfers can be improved by independently controlling the data transfers of each data array.
A Topology Preserving Neural Network for Nonstationary Distributions
Taira NAKAJIMA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI Tadao NAKAMURA

LETTER-Bio-Cybernetics and Neurocomputing

Vol:
E82-D No:7
Page(s):
1131-1135
We propose a learning algorithm for self-organizing neural networks to form a topology preserving map from an input manifold whose topology may dynamically change. Experimental results show that the network using the proposed algorithm can rapidly adjust itself to represent the topology of nonstationary input distributions.
An Efficient Reference Image Sharing Method for the Image-Division Parallel Video Encoding Architecture
Ken NAKAMURA Yuya OMORI Daisuke KOBAYASHI Koyo NITTA Kimikazu SANO Masayuki SATO Hiroe IWASAKI Hiroaki KOBAYASHI

PAPER

Publicized:
2022/11/29
Vol:
E106-C No:6
Page(s):
312-320
This paper proposes an efficient reference image sharing method for the image-division parallel video encoding architecture. This method efficiently reduces the amount of data transfer by using pre-transfer with area prediction and on-demand transfer with a transfer management table. Experimental results show that the data transfer can be reduced to 19.8-35.3% of the conventional method on average without major degradation of coding performance. This makes it possible to reduce the required bandwidth of the inter-chip transfer interface by saving the amount of data transfer.
The Object-Space Parallel Processing of the Multipass Rendering Method on the (Mπ)² with a Distributed-Frame Buffer System
Hitoshi YAMAUCHI Takayuki MAEDA Hiroaki KOBAYASHI Tadao NAKAMURA

PAPER-Computer Architecture

Vol:
E80-D No:9
Page(s):
909-918
The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. In particular, we focus on the performance evaluation of the view-dependent object-space parallel processing on the (Mπ)², which has been proposed in our previous paper. We also propose two kinds of distributed frame buffer system named cached frame buffer and multistage-interconnected frame buffer. These frame buffer systems can solve the access conflict problem on the frame buffer. The simulation results show that the (Mπ)² has a scalable performance. For example, the (Mπ)² with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)² in the case that a large number of high-performance processing elements are adopted in the system.
A Network Clustering Algorithm for Sybil-Attack Resisting
Ling XU Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI

PAPER

Vol:
E94-D No:12
Page(s):
2345-2352
The social network model has been regarded as a promising mechanism to defend against Sybil attack. This model assumes that honest peers and Sybil peers are connected by only a small number of attack edges. Detection of the attack edges plays a key role in restraining the power of Sybil peers. In this paper, an attack-resisting, distributed algorithm, named Random walk and Social network model-based clustering (RSC), is proposed to detect the attack edges. In RSC, peers disseminate random walk packets to each other. For each edge, the number of times that the packets pass this edge reflects the betweenness of this edge. RSC observes that the betweennesses of attack edges are higher than those of the non-attack edges. In this way, the attack edges can be identified. To show the effectiveness of RSC, RSC is integrated into an existing social network model-based algorithm called SOHL. The results of simulations with real world social network datasets show that RSC remarkably improves the performance of SOHL.
An Active Learning Algorithm Based on Existing Training Data
Hiroyuki TAKIZAWA Taira NAKAJIMA Hiroaki KOBAYASHI Tadao NAKAMURA

PAPER-Biocybernetics, Neurocomputing

Vol:
E83-D No:1
Page(s):
90-99
A multilayer perceptron is usually considered a passive learner that only receives given training data. However, if a multilayer perceptron actively gathers training data that resolve its uncertainty about a problem being learnt, sufficiently accurate classification is attained with fewer training data. Recently, such active learning has been receiving increasing interest. In this paper, we propose a novel active learning strategy. The strategy attempts to produce only useful training data for multilayer perceptrons to achieve accurate classification, and avoids generating redundant training data. Furthermore, the strategy attempts to avoid generating temporarily useful training data that will become redundant in the future. As a result, the strategy can allow multilayer perceptrons to achieve accurate classification with fewer training data. To demonstrate the performance of the strategy in comparison with other active learning strategies, we also propose an empirical active learning algorithm as an implementation of the strategy, which does not require expensive computations. Experimental results show that the proposed algorithm improves the classification accuracy of a multilayer perceptron with fewer training data than that for a conventional random selection algorithm that constructs a training data set without explicit strategies. Moreover, the algorithm outperforms typical active learning algorithms in the experiments. Those results show that the algorithm can construct an appropriate training data set at lower computational cost, because training data generation is usually costly. Accordingly, the algorithm proves the effectiveness of the strategy through the experiments. We also discuss some drawbacks of the algorithm.
