Hiroaki AKUTSU Ko ARAI
Lanxi LIU Pengpeng YANG Suwen DU Sani M. ABDULLAHI
Xiaoguang TU Zhi HE Gui FU Jianhua LIU Mian ZHONG Chao ZHOU Xia LEI Juhang YIN Yi HUANG Yu WANG
Yingying LU Cheng LU Yuan ZONG Feng ZHOU Chuangao TANG
Jialong LI Takuto YAMAUCHI Takanori HIRANO Jinyu CAI Kenji TEI
Wei LEI Yue ZHANG Hanfeng XIE Zebin CHEN Zengping CHEN Weixing LI
David CLARINO Naoya ASADA Atsushi MATSUO Shigeru YAMASHITA
Takashi YOKOTA Kanemitsu OOTSU
Xiaokang Jin Benben Huang Hao Sheng Yao Wu
Tomoki MIYAMOTO
Ken WATANABE Katsuhide FUJITA
Masashi UNOKI Kai LI Anuwat CHAIWONGYEN Quoc-Huy NGUYEN Khalid ZAMAN
Takaharu TSUBOYAMA Ryota TAKAHASHI Motoi IWATA Koichi KISE
Chi ZHANG Li TAO Toshihiko YAMASAKI
Ann Jelyn TIEMPO Yong-Jin JEONG
Haruhisa KATO Yoshitaka KIDANI Kei KAWAMURA
Jiakun LI Jiajian LI Yanjun SHI Hui LIAN Haifan WU
Gyuyeong KIM
Hyun KWON Jun LEE
Fan LI Enze YANG Chao LI Shuoyan LIU Haodong WANG
Guangjin Ouyang Yong Guo Yu Lu Fang He
Yuyao LIU Qingyong LI Shi BAO Wen WANG
Cong PANG Ye NI Jia Ming CHENG Lin ZHOU Li ZHAO
Nikolay FEDOROV Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI
Yukasa MURAKAMI Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI
Kazuya KAKIZAKI Kazuto FUKUCHI Jun SAKUMA
Yitong WANG Htoo Htoo Sandi KYAW Kunihiro FUJIYOSHI Keiichi KANEKO
Waqas NAWAZ Muhammad UZAIR Kifayat ULLAH KHAN Iram FATIMA
Haeyoung Lee
Ji XI Pengxu JIANG Yue XIE Wei JIANG Hao DING
Weiwei JING Zhonghua LI
Sena LEE Chaeyoung KIM Hoorin PARK
Akira ITO Yoshiaki TAKAHASHI
Rindo NAKANISHI Yoshiaki TAKATA Hiroyuki SEKI
Chuzo IWAMOTO Ryo TAKAISHI
Chih-Ping Wang Duen-Ren Liu
Yuya TAKADA Rikuto MOCHIDA Miya NAKAJIMA Syun-suke KADOYA Daisuke SANO Tsuyoshi KATO
Yi Huo Yun Ge
Rikuto MOCHIDA Miya NAKAJIMA Haruki ONO Takahiro ANDO Tsuyoshi KATO
Koichi FUJII Tomomi MATSUI
Yaotong SONG Zhipeng LIU Zhiming ZHANG Jun TANG Zhenyu LEI Shangce GAO
Souhei TAKAGI Takuya KOJIMA Hideharu AMANO Morihiro KUGA Masahiro IIDA
Jun ZHOU Masaaki KONDO
Tetsuya MANABE Wataru UNUMA
Kazuyuki AMANO
Takumi SHIOTA Tonan KAMATA Ryuhei UEHARA
Hitoshi MURAKAMI Yutaro YAMAGUCHI
Jingjing Liu Chuanyang Liu Yiquan Wu Zuo Sun
Zhenglong YANG Weihao DENG Guozhong WANG Tao FAN Yixi LUO
Yoshiaki TAKATA Akira ONISHI Ryoma SENDA Hiroyuki SEKI
Dinesh DAULTANI Masayuki TANAKA Masatoshi OKUTOMI Kazuki ENDO
Kento KIMURA Tomohiro HARAMIISHI Kazuyuki AMANO Shin-ichi NAKANO
Ryotaro MITSUBOSHI Kohei HATANO Eiji TAKIMOTO
Genta INOUE Daiki OKONOGI Satoru JIMBO Thiem Van CHU Masato MOTOMURA Kazushi KAWAMURA
Hikaru USAMI Yusuke KAMEDA
Yinan YANG
Takumi INABA Takatsugu ONO Koji INOUE Satoshi KAWAKAMI
Fengshan ZHAO Qin LIU Takeshi IKENAGA
Naohito MATSUMOTO Kazuhiro KURITA Masashi KIYOMI
Tomohiro KOBAYASHI Tomomi MATSUI
Shin-ichi NAKANO
Ming PAN
Efficient parallel algorithms for several problems on proper circular arc graphs are presented in this paper. These problems include finding a maximum matching, partitioning into a minimum number of induced subgraphs each of which has a Hamiltonian cycle (path), partitioning into induced subgraphs each of which has a Hamiltonian cycle (path) with at least k vertices for a given k, and adding a minimum number of edges to make the graph contain a Hamiltonian cycle (path). It is shown here that the above problems can all be solved in logarithmic time with a linear number of EREW PRAM processors, or in constant time with a linear number of BSR processors. A more important part of this work is perhaps the extension of basic BSR to allow simultaneous multiple BROADCAST instructions.
2D (two-dimensional) convolution is a basic operation in image processing and requires intensive computation. Although the SIMD model is considered suitable for 2D convolution, previous 2D convolution algorithms on the SIMD model assume that an unbounded number of PEs (Processing Elements) is available, which we call the unbounded case. The unbounded case cannot be satisfied on real computers. In this paper, time-optimal data-parallel 2D convolution is studied on mesh-connected SIMD computers with a bounded number of PEs. Because the optimal computation complexity is not difficult to achieve, the main concern of this paper is how to achieve optimal communication complexity. First, the lower bound on computation complexity is analyzed. Then the lower bounds on communication complexity are analyzed under two typical data-distribution strategies: block-mapping and cyclic-mapping. Based on the analysis, an optimal algorithm is presented under block-mapping. The algorithm achieves the lower bound in both computation and communication.
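The difference between the two data-distribution strategies can be illustrated with a small sketch. It is not the paper's algorithm or analysis: it only counts how many window neighbours of each pixel live on a different PE, as a rough proxy for communication. The function names, the square p x p mesh, and the assumption that p divides n are all illustrative choices.

```python
def block_owner(i, j, n, p):
    """PE holding pixel (i, j) under block-mapping (assumes p divides n)."""
    side = n // p  # each PE holds one contiguous side x side block
    return (i // side, j // side)


def cyclic_owner(i, j, n, p):
    """PE holding pixel (i, j) under cyclic-mapping."""
    return (i % p, j % p)


def remote_neighbours(owner, n, p, k):
    """Count (pixel, neighbour) pairs inside a k x k window (k odd)
    whose two pixels are held by different PEs."""
    r = k // 2
    count = 0
    for i in range(n):
        for j in range(n):
            me = owner(i, j, n, p)
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < n and 0 <= b < n and owner(a, b, n, p) != me:
                        count += 1
    return count


if __name__ == "__main__":
    n, p, k = 8, 2, 3
    print("block :", remote_neighbours(block_owner, n, p, k))
    print("cyclic:", remote_neighbours(cyclic_owner, n, p, k))
```

For an 8 x 8 image on a 2 x 2 mesh with a 3 x 3 window, block-mapping keeps most neighbours local, while cyclic-mapping places every neighbour on another PE.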
This paper presents a method for mechanically transforming a parallel algorithm on an original network so that the algorithm can work on a target network. It is assumed that the networks are of cube type, such as the shuffle-exchange network, omega network, and hypercube. If those networks were isomorphic to each other, the algorithm transformation would be an easy task. The proposed transformation method is based on a novel graph-embedding scheme <φ: δ, κ, π, ψ>. In addition to the dilating operation δ of the usual embedding scheme <φ: δ>, the novel scheme uses three primitive graph-transformation operations: κ (= δ⁻¹) for contracting a path into a node, π for pipelining a graph, and ψ (= π⁻¹) for folding a pipelined graph. By applying the primitive operations, the cube-type networks can be transformed so as to be isomorphic to each other. Relationships between the networks are represented by the composition of the applied operations. With the isomorphic mapping φ, an algorithm in a node of the original network can be simulated in the corresponding node(s) of the target network. Thus the algorithm transformation is reduced to routine work.
Akihiro FUJIWARA Michiko INOUE Toshimitsu MASUZAWA Hideo FUJIWARA
The medial axis transform (MAT) is an image representation scheme. For a binary image, the MAT is defined as a set of upright maximal squares that consist entirely of pixels of value 1. The MAT plays an important role in image understanding. This paper presents a parallel algorithm for computing the MAT of an n
Because the match phase in OPS5-type production systems accounts for most of the system's execution time and memory accesses, we previously proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm for distributed-memory parallel computers, or multicomputers, to reduce this bottleneck. CPPS was effective in speeding up the match phase, but still left room for optimization. In this paper, we introduce software cache techniques to the memory nodes in CPPS as one such optimization, and implement it on a multicomputer, the nCUBE2. The benchmark results show that CPPS with the software cache is about 2-fold faster than the original, and more than 7-fold faster than the simple hash method proposed by Acharya et al. for a large-scale problem. The speed-up can be attributed to decreased communication costs.
Hiroaki KOBAYASHI Hitoshi YAMAUCHI Yuichiro TOH Tadao NAKAMURA
This paper proposes a hierarchical parallel processing system for the multipass rendering method. The multipass rendering method, based on the integration of radiosity and ray tracing, can synthesize photo-realistic images. However, the method is also computationally expensive. To accelerate the multipass rendering method, the system, called (Mπ)², employs two kinds of parallel processing schemes. For coarse-grain parallel processing, object-space parallel processing with multiple processing elements based on object-space subdivision is adopted, and each processing element (PE) is equipped with multiple pipelined units for fine-grain parallel processing. To balance the load across the system, static load balancing at the PE level and dynamic load balancing at the pipelined unit level within each PE are introduced. In particular, we propose a novel static load allocation scheme, skewed-distributed allocation, which can effectively distribute a three-dimensional object space onto a one- or two-dimensional processor configuration of the (Mπ)² system. Simulation experiments show that the two-dimensional (Mπ)² systems with the skewed-distributed allocation outperform the three-dimensional systems with the non-skewed distributed allocation. Since lower-dimensional systems can be built at a lower cost than higher-dimensional systems, the skewed-distributed allocation is advantageous. Moreover, by combining static load balancing through the skewed-distributed allocation with dynamic load balancing through dynamic ray allocation within each PE, the system performance can be further boosted. We also propose a cached frame buffer system to relieve access collisions on the frame buffer.
Yuetsu KODAMA Hirohumi SAKANE Mitsuhisa SATO Hayato YAMANA Shuichi SAKAI Yoshinori YAMAGUCHI
Communication latency is central to multiprocessor design. This study presents the design principles of the EM-X distributed-memory multiprocessor for tolerating communication latency. The EM-X overlaps computation with communication by multithreading. In particular, we present two types of hardware support for remote memory access: (1) priority-based packet scheduling for thread invocation, and (2) direct remote memory access. The priority-based scheduling policy extends a FIFO-ordered thread invocation policy to adapt to different computational needs. The direct remote memory access is designed to overlap remote memory operations with thread execution. The 80-processor prototype of the EM-X has been developed and has been operational since December 1995. We execute several programs on the machine and evaluate how effectively the EM-X overlaps computation with communication to tolerate communication latency in high-performance parallel computing.
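The idea of extending a FIFO-ordered policy with priorities can be sketched in software as follows. This is only an illustration of the general technique, not the EM-X hardware design: the class name `PacketQueue`, the integer priority levels, and the rule that a lower number is served first are assumptions made for the example.

```python
import heapq
from itertools import count


class PacketQueue:
    """Priority queue that stays FIFO among packets of equal priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter used as a FIFO tie-breaker

    def push(self, packet, priority=1):
        # Lower priority value is served first (an assumption of this sketch).
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def pop(self):
        # Returns the oldest packet among those with the best priority.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    q = PacketQueue()
    q.push("a")
    q.push("b")
    q.push("urgent", priority=0)
    q.push("c")
    print([q.pop() for _ in range(len(q))])
```

With a single priority level the queue degenerates to plain FIFO, which is the sense in which the priority policy extends the FIFO one.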
There is a growing demand for high reliability beyond what current RAID can provide and there are various levels of user demand for data reliability. An efficient data placement scheme called RM2 has been proposed in [10], which makes a disk array system resistant to double disk failures. In this paper, we consider how to choose an optimal striping unit for RM2 particularly when no workload information is available except read/write ratio. For experimental purposes, we develop a disk array simulator incorporating RM2 as one of the data placement schemes including other schemes of RAID levels. In the case of disk read operations, it is shown that RM2 has an optimal striping unit of 4/3T for large requests and 8/3T for small requests, where T represents the size of a single track. We have also shown that, if any disk write operations are involved, an optimal striping unit becomes 1/3T for large requests and 8/3T for small requests.
Kazuhiko MOGI Masaru KITSUREGAWA
RAID5 disk arrays provide high performance and high reliability at reasonable cost. However, RAID5 suffers a performance penalty during block updates. In this paper, we propose a method, named Virtual Striping, to improve the small-write performance of RAID5 disk arrays. Instead of updating each block independently, this method buffers a number of updates, generates a new stripe composed of the newly updated blocks, then writes the full stripe back to disk. To make free space for write operations, a new garbage collection strategy is employed, in which the linkage of blocks in a parity stripe is changed. The LFS (log-structured file system) based storage management scheme also writes new blocks onto a large free area, but uses copying garbage collection. In this paper, we compare the performance of both methods through simulation. Although the write cost of Virtual Striping is higher than that of the LFS-based method, Virtual Striping has better performance than the LFS-based method. This is due to the high efficiency of garbage collection in Virtual Striping.
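The small-write penalty comes from how RAID5 parity is maintained. The sketch below shows standard XOR parity arithmetic only; it does not model Virtual Striping's buffering or garbage collection, and the function names are illustrative. Updating one block in place needs the old data and old parity (two reads, two writes), whereas a full-stripe write computes parity from the new blocks alone.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_parity(blocks):
    """Parity of a whole stripe: XOR of all its data blocks, no reads needed."""
    return reduce(xor_blocks, blocks)


def small_write_parity(old_parity, old_data, new_data):
    """Parity after updating one block in place.
    Requires reading the old data and old parity first."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
    parity = full_stripe_parity(stripe)
    new_block = b"\xaa\x55"
    updated = small_write_parity(parity, stripe[1], new_block)
    # Both routes give the same parity for the updated stripe.
    print(updated == full_stripe_parity([stripe[0], new_block, stripe[2]]))
```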
Kazuhiko IWASAKI Akinori FURUTA
A mesh spiral network (MSnet) and a mesh random network (MRnet) are proposed. The MSnet consists of the 2-D torus and bypass links that keep the degree at six. The MRnet consists of the 2-D torus and random bypass links that keep the degree at six. The diameter and the average distance are calculated by using a computer program. The cost of the MSnet is slightly higher than that of the de Bruijn graph, and is about the same as that of the star graph. The cost of the MRnet is better than that of the de Bruijn graph. The MSnet is proven to be maximally fault-tolerant. The upper bound of the MRnet size is also discussed.
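Computing the diameter and average distance by program amounts to an all-pairs breadth-first search. The sketch below does this for the plain k x k 2-D torus only; the bypass links of the MSnet and MRnet are not specified in the abstract, so they are not modelled. A different `neighbours` function (a hypothetical hook of this sketch) could be passed in to add them.

```python
from collections import deque


def torus_neighbours(node, k):
    """Four neighbours of a node in a k x k 2-D torus."""
    x, y = node
    return [((x + 1) % k, y), ((x - 1) % k, y),
            (x, (y + 1) % k), (x, (y - 1) % k)]


def bfs_distances(src, k, neighbours=torus_neighbours):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in neighbours(u, k):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average(k, neighbours=torus_neighbours):
    """Diameter and average distance over all ordered pairs of distinct nodes."""
    diameter, total, pairs = 0, 0, 0
    for x in range(k):
        for y in range(k):
            dist = bfs_distances((x, y), k, neighbours)
            diameter = max(diameter, max(dist.values()))
            total += sum(dist.values())
            pairs += len(dist) - 1
    return diameter, total / pairs


if __name__ == "__main__":
    print(diameter_and_average(4))
```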
Takashi YOKOTA Hiroshi MATSUOKA Kazuaki OKAMOTO Hideo HIRONO Shuichi SAKAI
This paper discusses a massively parallel interconnection scheme for multithreaded architecture and introduces a new class of direct interconnection networks called the hierarchical Multidimensional Directed Cycles Ensemble (hMDCE). Its suitability for massively parallel systems is discussed. The network is derived from the Multidimensional Directed Cycles Ensemble (MDCE) network, with each node replaced by a lower-level sub-network. The new network addresses some serious problems caused by the increasing scale of parallel systems, such as longer latency, limited throughput and high implementation cost. This paper first introduces the MDCE network and then presents and examines in detail the hierarchical MDCE network. The bisection bandwidth of hMDCE is considerably reduced from that of its ancestor MDCE, and the network achieves significantly higher throughput and lower latency under some practical implementation constraints. The gate count and delay time of the compiled circuit for the routing function are insignificant. These results reveal that the hMDCE network is an important candidate for massively parallel systems interconnection.
The mesh-connected computers with hyperbus broadcasting are an extension of the mesh-connected computers with multiple broadcasting. Instead of using local buses, we use global buses to connect processors. Such a strategy efficiently reduces the time complexity of the semigroup problem from O(N) to O(log N). Also, the matrix multiplication and the transitive closure problems are solved in O(log N) and O(log² N) time, respectively. Then, based on these operations, several interesting problems such as the connected recognition problem, the articulation problem, the dominator problem, the bridge problem, the sorting problem, the minimum spanning tree problem and the bipartite graph recognition problem can be solved in the order of polylogarithmic time.
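One common way to build transitive closure on top of matrix multiplication is repeated Boolean squaring, which needs only a logarithmic number of multiplications. The sequential sketch below illustrates that general technique under this assumption; it is not the hyperbus algorithm itself, and the abstract does not state that the paper uses this construction.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is true if some k links i to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure by repeated squaring.
    Returns the reachability matrix and the number of squarings used."""
    n = len(adj)
    # Adding the diagonal makes the matrix cover paths of length <= 1.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    covered, squarings = 1, 0
    while covered < n - 1:  # simple paths have at most n - 1 edges
        reach = bool_matmul(reach, reach)  # doubles the covered path length
        covered *= 2
        squarings += 1
    return reach, squarings


if __name__ == "__main__":
    # Directed path 0 -> 1 -> 2 -> 3 -> 4
    n = 5
    adj = [[j == i + 1 for j in range(n)] for i in range(n)]
    reach, squarings = transitive_closure(adj)
    print(reach[0][4], reach[4][0], squarings)
```

If each multiplication takes logarithmic time, a logarithmic number of squarings gives a log-squared total, which is consistent with the two bounds quoted above.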
Atsushi MURATA Taisuke BOKU Hideharu AMANO
Recent advances in semiconductor technology make it possible to produce medium-sized crossbars at reasonable cost. To make the best use of the high bandwidth of such crossbars, indirect networks including the base-m n-cube and HyperCross have been proposed and studied. In these networks, a node is connected to other nodes through crossbars in multiple dimensions. Although these networks are used in commercial machines, the class of networks that includes them has received almost no discussion. In this paper, a network class called Multi-Dimensional X'bar (MDX), which includes the above two networks, is defined. Several new networks in this class are proposed, and the relationship between these networks and direct networks/multistage interconnection networks is discussed. Finally, routing methods for these new networks are proposed and the average distance is evaluated. The discussion and evaluation show that the MDX supports higher bandwidth than the corresponding multistage interconnection network, with less hardware than the corresponding direct network.
Michael JURCZYK Thomas SCHWEDERSKI
Nonuniform traffic can substantially degrade the overall performance of multistage interconnection networks. In this paper, this performance degradation is traced back to blocking effects that are not present within a network under uniform traffic patterns. This blocking phenomenon has not been mentioned in the literature and is termed higher order Head-of-Line-blocking (HOLk-blocking) in this paper. Methods to determine the HOL-blocking order of multistage networks, in order to classify the networks, are presented. The performance of networks under hot-spot traffic as a function of their HOL-blocking characteristics is studied by simulation. It is shown that network bandwidth and packet delay improve under nonuniform traffic as the HOL-blocking order of a network increases.
Govindan RAVINDRAN Michael STUMM
This paper presents the results of a simulation study of blocking and non-blocking switching for hierarchical ring networks. The switching techniques include wormhole, virtual cut-through, and slotted ring. We conclude that slotted ring networks perform better than the more popular wormhole and virtual cut-through networks. We also show that the size of the node buffers is an important parameter and that choosing buffers that are too large can hurt performance in some cases. Slotted rings have the advantage that the choice of buffer size is easier: larger-than-necessary buffers do not hurt performance, and hence a single choice of buffer size performs well for all system configurations. In contrast, the optimal buffer size for virtual cut-through and wormhole switching nodes varies depending on the system configuration and the level in the hierarchy at which the switching node lies.
Noritaka SHIGEI Hiromi MIYAJIMA Takayuki ISHIZAKA Sadayuki MURASHIMA
To enhance fabrication yield for processor arrays, many reconfiguration schemes for replacing faulty processing elements (PEs) with spare PEs have been proposed. An array grid model based on single tracks is one such model. For this model, some algorithms for reconfiguring processor arrays have been proposed. However, an algorithm that can reconfigure the array whenever the array is reconfigurable has not yet been proposed. This paper presents two types of methods for reconfiguration of processor arrays. Both types use indirect replacements for reconfiguring arrays. For an indirect replacement of a faulty non-spare PE, the first type uses a fixed direction, while the second has at most four directions, among which one is chosen. For the former, we consider several distributions of spare PEs, and computer simulations show a trend in how the results differ among the distributions. The latter algorithms consist of two phases. In the first phase, rows and columns of spare PEs are decided in accordance with a rule. Several rules for deciding spare PEs are considered in this paper. In the second phase, faulty non-spare PEs are replaced with healthy spare PEs. The performance of the algorithms is evaluated by simulation, and a trend is shown in how the results differ with the disposition of spare PEs.
Toshinori YAMADA Koji YAMAMOTO Shuichi UENO
Motivated by the design of fault-tolerant multiprocessor interconnection networks, this paper considers the following problem: Given a positive integer t and a graph H, construct a graph G from H by adding a minimum number Δ(t, H) of edges such that even after deleting any t edges from G the remaining graph contains H as a subgraph. We estimate Δ(t, H) for the hypercube and torus, which are well-known as important interconnection networks for multiprocessor systems. If we denote the hypercube and the square torus on N vertices by QN and DN respectively, we show, among others, that Δ(t, QN) = O(tN log(log N/t + log 2e)) for any t and N (t
In this paper, we study the following node-to-node and node-to-set routing problems in r-dimensional torus Trn with r
Tadayoshi HORITA Itsuo TAKANAMI
Various reconfiguration schemes against faults of mesh-connected processor arrays have been proposed. As one of them, the mesh-connected processor arrays model based on single-track switches was proposed in [1]. The model has an advantage of its inherent simplicity of the routing hardware. Furthermore, the 2
Fabrizio LOMBARDI Nohpill PARK Susumu HORIGUCHI
This paper proposes new algorithms for diagnosing (detection, identification and location) baseline multistage interconnection networks (MIN) as one of the basic units in a massively parallel system. This is accomplished in the presence of single and multiple faults under a new fault model. This model, referred to as the geometric fault model, considers defective crossing connections which are located between adjacent stages, internal to the MIN (therefore, a fault corresponds to a physical bridge fault between two connections). It is shown that this type of fault affects the correct geometry of the network, thus requiring a different testing approach from previous methods. Initially, an algorithm which detects the presence of bridge faults (in both the single and multiple fault cases) is presented. For a single bridge fault, the proposed algorithm locates the fault except in a unique pathological case in which it is logically impossible to differentiate between two equivalent locations of the fault (however, the switching element affected by this fault is uniquely located). The proposed algorithm requires log₂ N test vectors to diagnose the MIN as fault free (where N is the number of input lines to the MIN). For fully diagnosing a single bridge fault, this algorithm requires at most 2
Akira FUNAHASHI Toshihiro HANAWA Hideharu AMANO
Multistage Interconnection Networks (MIN) with multiple outlets are networks which can support higher bandwidth than nonblocking networks by passing multiple packets to the same destination. Fault recovery mechanisms are proposed for two such networks (TBSF/PBSF), making the best use of their inherent fault-tolerant capability. With these mechanisms, on-the-fly fault recovery is possible for multiple faults on switching elements. For link faults, the networks are reconfigured after fault diagnosis, and the network remains available with some performance degradation. The bandwidth degradation under multiple faults on links and elements is analyzed with both theoretical models and simulation. Through the analysis, F-PBSF shows high fault tolerance under high traffic load and low reliability by using 3 or more banyan networks.
Tomoo INOUE Hironori MAEDA Hideo FUJIWARA
The order in which faults are targeted for test-pattern generation affects both the processing time for test generation and the number of generated test patterns. This order is referred to as a test generation schedule. In this paper, we consider the effect of scheduling in test generation. We formulate the test generation scheduling problem, which minimizes the cost of testing. We propose schedulings based on test-pattern generation time, dominating probability and dominated probability, and analyze the effect of these schedulings. In the analysis, we show that the total test-pattern generation time and the total number of test patterns can be reduced by scheduling in descending order of dominating probability, followed by ascending order of test-pattern generation time. This is confirmed by experiments using the ISCAS'85 benchmark circuits. Further, in the experiments, we consider eight schedulings, and show that scheduling in ascending order of dominated probability is the most effective of them.
Hani C. YEHIA Kazuya TAKEDA Fumitada ITAKURA
The objective of this paper is to find a parametric representation for the vocal-tract log-area function that is directly and simply related to basic acoustic characteristics of the human vocal-tract. The importance of this representation is associated with the solution of the articulatory-to-acoustic inverse problem, where a simple mapping from the articulatory space onto the acoustic space can be very useful. The method is as follows: Firstly, given a corpus of log-area functions, a parametric model is derived following a factor analysis technique. After that, the articulatory space, defined by the parametric model, is filled with approximately uniformly distributed points, and the corresponding first three formant frequencies are calculated. These formants define an acoustic space onto which the articulatory space maps. In the next step, an independent component analysis technique is used to determine acoustic and articulatory coordinate systems whose components are as independent as possible. Finally, using singular value decomposition, acoustic and articulatory coordinate systems are rotated so that each of the first three components of the articulatory space has major influence on one, and only one, component of the acoustic space. An example showing how the proposed model can be applied to the solution of the articulatory-to-acoustic inverse problem is given at the end of the paper.
Masahiro AGU Kazuo YAMANAKA Hiroki TAKAHASHI
Stable
Kazushi MURAKOSHI Tadashi KURATA
We develop a simulation environment for designing and examining a neural network model at the network level. The aim of our research is to enable researchers investigating neural network connective models to save time by being equipped with a graphical user interface and database of the network models. This environment consists of three parts: (1) the kernel of the simulation system, (2) NNDBMS (Neural Networks DataBase Management System), and (3) a system for displaying simulation results in various ways.