Susumu HORIGUCHI Takeo NAKADA Yoshiharu SHIGEI
A parallel sorting is actually executed on the multiprocessor system to estimate the real performance. The sorting algorithm consists of merging and exchanging data between the nearest neighbors. The system performance of parallel sorting is measured experimentally. It is proved that the parallel sorting algorithm is really effective for a multiprocessor system with the limited number of processors.
A parallel FFT (Fast Fourier Transform) is actually executed on a clustered multiprocessor system with 32 processors to study real performance of three communication schemes, common memory, linear array and ring array. The speed-up of parallel FFT to a serial computer really depends on the number of data to be transformed and the number of processors. It is also proved that the performance really depends on the communication scheme of multiprocessor system.
Hong SHEN Yi PAN John SUM Susumu HORIGUCHI
This paper provides an overview on efficient algorithms for multicasting in optical networks supported by Wavelength Division Multiplexing (WDM) with limited wavelength conversion. We classify the multicast problems according to off-line and on-line in both reliable and unreliable networks. In each problem class, we present efficient algorithms for multicast and multiple multicast and show their performance. We also present efficient schemes for dynamic multicast group membership updating. We conclude the paper by showing possible extension of the presented algorithms for QoS provision.
Vijay K. JAIN Tadasse GHIRMAI Susumu HORIGUCHI
Advanced scientific and engineering problems require massively parallel computing. Critical to the designand ultimately the performanceof such computing systems is the interconnection network binding the computing elements, just as is the cardiovascular network to the human body. This paper develops a new interconnection network, "Tori connected mESHes (TESH)," consisting of k-ary n-cube connection of supernodes that comprise meshes of lower level nodes. Its key features are the following: it is hierarchical, thus allowing exploitation of computation locality as well as easy expansion (up to a million processors), and it appears to be well suited for 3-D VLSI implementation, for it requires far fewer number of vertical wires than almost all known multi-computer networks. Presented in the paper are the architecture of the new network, node addressing and message routing, 3-D VLSI/ULSI considerations, and application of the network to massively parallel computing. Specifically, we discuss the mapping on to the network of stack filtering, a hardware oriented technique for order statistic image filtering.
Susumu HORIGUCHI Yoshiharu SHIGEI
A new parallel sorting algorithm is proposed, assuming a linearly connected multiprocessor system with a limited number of processors. The algorithm based on merging and exchanging data between adjacent processors. The upper bound is analytically investigated, and the throughput is proportional to the number of processors. The time complexity is O(N) for N data with O(log N) processors. Average number of iterations for a multiprocessor with two processing elements is also analytically investigated. Execution time for real situation is numerically simulated on a conventional computer.
Minh LE NGUYEN Masaru FUKUSHI Susumu HORIGUCHI
This paper describes a new probabilistic sentence reduction method using maximum entropy model. In contrast to previous methods, the proposed method has the ability to produce multiple best results for a given sentence, which is useful in text summarization applications. Experimental results show that the proposed method improves on earlier methods in both accuracy and computation time.
M.M. Hafizur RAHMAN Yasushi INOGUCHI Susumu HORIGUCHI
Three-dimensional (3D) wafer stacked implementation (WSI) has been proposed as a promising technology for massively parallel computers. A hierarchical 3D-torus (H3DT) network, which is a 3D-torus network of multiple basic modules in which the basic modules are 3D-mesh networks, has been proposed for efficient 3D-WSI. However, the restricted use of physical links between basic modules in the higher level networks reduces the dynamic communication performance of this network. A torus network has better dynamic communication performance than a mesh network. Therefore, we have modified the H3DT network by replacing the 3D-mesh modules by 3D-tori, calling it a Modified H3DT (MH3DT) network. This paper addresses the architectural details of the MH3DT network and explores aspects such as degree, diameter, cost, average distance, arc connectivity, bisection width, and wiring complexity. We also present a deadlock-free routing algorithm for the MH3DT network using two virtual channels and evaluate the network's dynamic communication performance under the uniform traffic pattern, using the proposed routing algorithm. It is shown that the MH3DT network possesses several attractive features including small diameter, small cost, small average distance, better bisection width, and better dynamic communication performance.
Xiaohong JIANG Hong SHEN Md. Mamun-ur-Rashid KHANDKER Susumu HORIGUCHI
Crosstalk in optical switch is an intrinsic drawback of optical networks, and avoiding crosstalk is important for making optical network work properly. Horizontal expansion and vertical stacking are two basic techniques for creating nonblocking multistage interconnection networks (MINs). Rearrangeable (nonblocking) optical MINs are feasible since they have lower complexity than their strictly nonblocking counterparts. In this paper, we study the crosstalk-free permutations in rearrangeable optical MINs built on a combination of horizontal expansion and vertical stacking of banyan networks, and provide a scheme for realizing crosstalk-free permutations in this kind of optical MINs. The basic idea of this scheme is to first decompose a permutation into multiple partial permutations by using Euler Split technique, then route and realize each of these partial permutations crosstalk-free in one plane (stacked copy) of a MIN based on both the Euler Split technique and self-routing property of a banyan network. The tradeoff between the overall time complexity and hardware cost of this class of MINs is also explored in this paper.
Susumu HORIGUCHI Takayuki ISHIZONE Yasuto MUSHIAKE
Effects of the element radiation pattern on characteristics of the spherical array antenna composed of turnstile elements is discussed. It is found by numerical calculations that scanning characteristics of the directivity and the axial ratio do not so much depend on element radiation patterns.
Xuan-Hieu PHAN Le-Minh NGUYEN Yasushi INOGUCHI Susumu HORIGUCHI
Conditional random fields (CRFs) have been successfully applied to various applications of predicting and labeling structured data, such as natural language tagging & parsing, image segmentation & object recognition, and protein secondary structure prediction. The key advantages of CRFs are the ability to encode a variety of overlapping, non-independent features from empirical data as well as the capability of reaching the global normalization and optimization. However, estimating parameters for CRFs is very time-consuming due to an intensive forward-backward computation needed to estimate the likelihood function and its gradient during training. This paper presents a high-performance training of CRFs on massively parallel processing systems that allows us to handle huge datasets with hundreds of thousand data sequences and millions of features. We performed the experiments on an important natural language processing task (text chunking) on large-scale corpora and achieved significant results in terms of both the reduction of computational time and the improvement of prediction accuracy.
Yusuke FUKUSHIMA Xiaohong JIANG Susumu HORIGUCHI
The multi-log2N network architecture is attractive for constructing optical switches, and the related routing algorithms are critical for the operation and efficiency of such switches. Although several routing algorithms have been proposed for multi-log2N networks, a full performance comparison among them has not been published up to now. Thus, we rectify this omission by providing such a comparison in terms of blocking probability, time complexity, hardware cost and load balancing capability. Notice that the load balance is important for reducing the peak power requirement of a switch, so we also propose in this paper a new routing algorithm for optical multi-log2N networks to achieve a better load balance.
Ayong YE Jianfeng MA Xiaohong JIANG Susumu HORIGUCHI
Secure sensor localization is a prerequisite for many sensor networks to retrieve trustworthy data. However, most of existing node positioning systems were studied in trust environment and are therefore vulnerable to malicious attacks. In this work, we develop a robust node positioning mechanism(ROPM) to protect localization techniques from position attacks. Instead of introducing countermeasures for every possible internal or external attack, our approach aims at making node positioning system attack-tolerant by removing malicious beacons. We defeat internal attackers and external attackers by applying different strategies, which not only achieves robustness to attacks but also dramatically reduces the computation overhead. Finally, we provide detailed theoretical analysis and simulations to evaluate the proposed technique.
M.M. Hafizur RAHMAN Susumu HORIGUCHI
Interconnection networks usually suffer from Little's Law: low cost implies low performance and high performance is obtained high cost. However, hierarchical interconnection networks provide high performance at low cost by exploring the locality that exists in communication patterns of massively parallel computers. In this paper, we propose a new hierarchical interconnection network, called Hierarchical Torus Network (HTN). This network reduces the number of vertical links in 3D stacked implementation while maintaining good network features. This paper addresses the architectural details of the HTN, and explores aspects such as the network diameter, average distance, bisection width, peak number of vertical links, and VLSI layout area of the HTN as well as for several commonly used networks for parallel computers. It is shown that the HTN possesses several attractive features including small diameter, small average distance, small number of wires, a particularly small number of vertical links, and economic layout area.
Yusuke FUKUSHIMA Xiaohong JIANG Achille PATTAVINA Susumu HORIGUCHI
Arrayed waveguide grating (AWG) is a promising technology for constructing high-speed large-capacity WDM switches, because it can switch fast, is scalable to large size and consumes little power. To take the full advantage of high-speed AWG, the routing control of a massive AWG-based switch should be as simple as possible. In this paper, we focus on the self-routing design of AWG-based switches with O(1) constant routing complexity and propose a novel construction of self-routing AWG switches that can guarantee the attractive nonblocking property for both the wavelength-to-wavelength and wavelength-to-fiber request models. We also fully analyze the proposed design in terms of its blocking property, hardware cost and crosstalk performance and compare it against traditional designs. It is expected that the proposed construction will be useful for the design and all-optical implementation of future ultra high-speed optical packet/burst switches.
Hiroaki KOBAYASHI Hideyuki KUBOTA Susumu HORIGUCHI Tadao NAKAMURA
The ray-tracing algorithm can synthesize very realistic images. However, the ray tracing is very time consuming. To solve this problem, a load balancing strategy using temporal coherence between images in an animation is presented for balancing computational loads among processing elements of a parallel processng system. Our parallel processing model is based on a space subdivision method for the ray-tracing algorithm. A subdivided object space is distributed among processing elements of the parallel system. To clarify the effectiveness of the load balancing strategy, we examine the system performance by computer simulation.
Xiaohong JIANG Md. Mamun-ur-Rashid KHANDKER Hong SHEN Susumu HORIGUCHI
Vertical stacking is a novel technique for building switching networks, and packing multiple compatible connections together is an effective strategy to reduce network hardware cost. In this paper, we study the crosstalk-free permutation capability of an optical switching network built on the vertical stacking of optical banyan networks to which packing strategy is applied. We first look into the nonblocking condition of this optical switching network. We then study the crosstalk-free permutation in this network by decomposing a permutation evenly into multiple crosstalk-free partial permutations (CFPPs) and realizing each CFPP in a stacked plane of the network such that a crosstalk-free permutation can be performed in a single pass. We present a rigorous proof of CFPP decomposability of a permutation and also a complete algorithm for CFPP decomposition. The possibility of a tradeoff between the number of passes and the number of planes required for realizing a crosstalk-free permutation in this network is also explored in this paper.
Norman SCAIFE Ryoko HAYASHI Susumu HORIGUCHI
We have constructed a parallelizing compiler for Standard ML (SML) based upon algorithmic skeletons. We present an implementation of a Parallel Molecular Dynamics (PMD) simulation in order to compare our functional approach with a traditional imperative approach. Although we present performance data, the principal benefits from our approach are in the modularity of the code and the ease of programming. Extant FORTRAN90 code for an O(N 2) algorithm is translated, firstly into imperative SML and then into purely functional SML which is then parallelized. The ease of programming and the performance of the FORTRAN90 and SML code are compared. Modest parallel performance is obtained from the parallel SML but with a much slower sequential execution time compared to the FORTRAN90. We then improve the implementation with a ring topology implementation which gives much closer performance to the FORTRAN90 implementation.
Xu ZHANG Xiaohong JIANG Susumu HORIGUCHI
The evolution of VLSI chips towards larger die size, smaller feature size and faster clock speed makes the clock distribution an increasingly important issue. In this paper, we propose a new clock distribution network (CDN), namely Variant X-Tree, based on the idea of X-Architecture proposed recently for efficient wiring within VLSI chips. The Variant X-Tree CDN keeps the nice properties of equal-clock-path and symmetric structure of the typical H-Tree CDN, but results in both a lower maximal clock delay and a lower clock skew than its H-Tree counterpart, as verified by an extensive simulation study that incorporates simultaneously the effects of process variations and on-chip inductance. We also propose a closed-form statistical models for evaluating the skew and delay of the Variant X-Tree CDN. The comparison between the theoretical results and the simulation results indicates that the proposed statistical models can be used to efficiently and rapidly evaluate the performance of the variant X-Tree CDNs.
Fabrizio LOMBARDI Nohpill PARK Susumu HORIGUCHI
This paper proposes new algorithms for diagnosing (detection, identification and location) baseline multistage interconnection networks (MIN) as one of the basic units in a massively parallel system. This is accomplished in the presence of single and multiple faults under a new fault model. This model referred to as the geometric fault model, considers defective crossing connections which are located between adjacent stages, internally to the MIN (therefore, a fault corresponds to a physical bridge fault between two connections). It is shown that this type of fault affects the correct geometry of the network, thus requiring a different testing approach than previous methods. Initially, an algorithm which detects the presence of bridge faults (both in the single and multiple fault cases), is presented. For a single bridge fault, the proposed algorithm locates the fault except in an unique pathological case under which it is logically impossible to differentiate between two equivalent locations of the fault (however, the switching element affected by this fault is uniquely located). The proposed algorithm requires log2 N test vectors to diagnose the MIN as fault free (where N is the number of input lines to the MIN). For fully diagnosing a single bridge fault, this algorithm requires at most 2 log2 N tests and terminates when multiple bridge faults are detected. Subsequently, an algorithm which locates all bridge faults is given. The number of required test vectors is O(N). Fault location of each bridge fault is accomplished in terms of the two lines in the bridge and the numbers of the stages between which it occurs. Illustrative examples are given.