Ryuta KAWANO Ryota YASUDO Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
Recently proposed irregular networks can reduce the latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlocks in routing packets when using a naive minimal path routing algorithm. To solve this problem, we focus attention on a lately proposed theory that generalizes the turn model to maintain the network performance with deadlock-freedom. The theorems remain a challenge of applying themselves to arbitrary topologies including fully irregular networks. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138 % compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3 % compared with the conventional up*/down* routing.
Taku YAMAZAKI Ryo YAMAMOTO Genki HOSOKAWA Tadahide KUNITACHI Yoshiaki TANAKA
In wireless multi-hop networks such as ad hoc networks and sensor networks, backoff-based opportunistic routing protocols, which make a forwarding decision based on backoff time, have been proposed. In the protocols, each potential forwarder calculates the backoff time based on the product of a weight and global scaling factor. The weight prioritizes potential forwarders and is calculated based on hop counts to the destination of a sender and receiver. The global scaling factor is a predetermined value to map the weight to the actual backoff time. However, there are three common issues derived from the global scaling factor. First, it is necessary to share the predetermined global scaling factor with a centralized manner among all terminals properly for the backoff time calculation. Second, it is almost impossible to change the global scaling factor during the networks are being used. Third, it is difficult to set the global scaling factor to an appropriate value since the value differs among each local surrounding of forwarders. To address the aforementioned issues, this paper proposes a novel decentralized local scaling factor control without relying on a predetermined global scaling factor. The proposed method consists of the following three mechanisms: (1) sender-centric local scaling factor setting mechanism in a decentralized manner instead of the global scaling factor, (2) adaptive scaling factor control mechanism which adapts the local scaling factor to each local surrounding of forwarders, and (3) mitigation mechanism for excessive local scaling factor increases for the local scaling factor convergence. Finally, this paper evaluates the backoff-based opportunistic routing protocol with and without the proposed method using computer simulations.
Yi GUO Heming SUN Ping LEI Shinji KIMURA
Approximate computing has emerged as a promising approach for error-tolerant applications to improve hardware performance at the cost of some loss of accuracy. Multiplication is a key arithmetic operation in these applications. In this paper, we propose a low-cost approximate multiplier design by employing new probability-driven inexact compressors. This compressor design is introduced to reduce the height of partial product matrix into two rows, based on the probability distribution of the sum result of partial products. To compensate the accuracy loss of the multiplier, a grouped error recovery scheme is proposed and achieves different levels of accuracy. In terms of mean relative error distance (MRED), the accuracy losses of the proposed multipliers are from 1.07% to 7.86%. Compared with the Wallace multiplier using 40nm process, the most accurate variant of the proposed multipliers can reduce power by 59.75% and area by 42.47%. The critical path delay reduction is larger than 12.78%. The proposed multiplier design has a better accuracy-performance trade-off than other designs with comparable accuracy. In addition, the efficiency of the proposed multiplier design is assessed in an image processing application.
Marcus WALLDEN Stefano MARKIDIS Masao OKITA Fumihiko INO
We propose a novel compositing pipeline and a dynamic load balancing technique for volume rendering which utilizes a two-layered group structure to achieve effective and scalable load balancing. The technique enables each process to render data from non-contiguous regions of the volume with minimal impact on the total render time. We demonstrate the effectiveness of the proposed technique by performing a set of experiments on a modern GPU cluster. The experiments show that using the technique results in up to a 35.7% lower worst-case memory usage as compared to a dynamic k-d tree load balancing technique, whilst simultaneously achieving similar or higher render performance. The proposed technique was also able to lower the amount of transferred data during the load balancing stage by up to 72.2%. The technique has the potential to be used in many scenarios where other dynamic load balancing techniques have proved to be inadequate, such as during large-scale visualization.
Toyohiko SAEKI Takayuki NOZAKI
This paper constructs non-binary codes correcting a single b-burst of insertions or deletions with large cardinalities. This paper also provides insertion and deletion correcting algorithms of the constructed codes and evaluates a lower bound of the cardinalities of the constructed codes. Moreover, we evaluate a non-asymptotic upper bound on the cardinalities of arbitrary codes which correct a single b-burst of insertions or deletions.
This paper constructs packet-oriented erasure correcting codes and their systematic forms for the distributed storage systems. The proposed codes are encoded by exclusive OR and bit-level shift operation. By the shift operation, the encoded packets are slightly longer than the source packets. This paper evaluates the extra length of the encoded packets, called overhead, and shows that the proposed codes have smaller overheads than the zigzag decodable codes, which are existing codes using bit-level shift operation and exclusive OR.
Kazuyoshi SHOGEN Thong PHAM VIET
Two frequency sharing criteria for BSS (Broadcasting-Satellite Service) are enacted in Sect.1 of Annex 1 to Appendix 30 to Radio Regulations. These two criteria are pfd (power flux-density) and EPM (Equivalent Protection Margin) values. In this paper, the two criteria are compared and studied from the view point of applicability to the sharing cases between BSS and BSS. In particular, it is shown that in some cases, the EPM criterion contributes to alleviate the problem of “sensitive satellite network”, i.e., one that has relatively low transmission power and is very weak against interference and blocks the new satellite to enter. Disclaimer The views and positions expressed by the authors are strictly personal and do not constitute, nor can be interpreted as, the position of the International Telecommunication Union on the topics addressed in this paper.
Hongjie XU Jun SHIOMI Tohru ISHIHARA Hidetoshi ONODERA
This paper focuses on power-area trade-off axis to memory systems. Compared with the power-performance-area trade-off application on the traditional high performance cache, this paper focuses on the edge processing environment which is becoming more and more important in the Internet of Things (IoT) era. A new power-oriented trade-off is proposed for on-chip cache architecture. As a case study, this paper exploits a good energy efficiency of Standard-Cell Memory (SCM) operating in a near-threshold voltage region and a good area efficiency of Static Random Access Memory (SRAM). A hybrid 2-level on-chip cache structure is first introduced as a replacement of 6T-SRAM cache as L0 cache to save the energy consumption. This paper proposes a method for finding the best capacity combination for SCM and SRAM, which minimizes the energy consumption of the hybrid cache under a specific cache area constraint. The simulation result using a 65-nm process technology shows that up to 80% energy consumption is reduced without increasing the die area by replacing the conventional SRAM instruction cache with the hybrid 2-level cache. The result shows that energy consumption can be reduced if the area constraint for the proposed hybrid cache system is less than the area which is equivalent to a 8kB SRAM. If the target operating frequency is less than 100MHz, energy reduction can be achieved, which implies that the proposed cache system is suitable for low-power systems where a moderate processing speed is required.
Songlin DU Yuhao XU Tingting HU Takeshi IKENAGA
High frame rate and ultra-low delay matching system plays an important role in various human-machine interactive applications, which demands better performance in matching deformable and out-of-plane rotating objects. Although many algorithms have been proposed for deformation tracking and matching, few of them are suitable for hardware implementation due to complicated operations and large time consumption. This paper proposes a hardware-oriented template update and recovery method for high frame rate and ultra-low delay deformation matching system. In the proposed method, the new template is generated in real time by partially updating the template descriptor and adding new keypoints simultaneously with the matching process in pixels (proposal #1), which avoids the large inter-frame delay. The size and shape of region of interest (ROI) are made flexible and the Hamming threshold used for brute-force matching is adjusted according to pixel position and the flexible ROI (proposal #2), which solves the problem of template drift. The template is recovered by the previous one with a relative center-shifting vector when it is judged as lost via region-wise difference check (proposal #3). Evaluation results indicate that the proposed method successfully achieves the real-time processing of 784fps at the resolution of 640×480 on field-programmable gate array (FPGA), with a delay of 0.808ms/frame, as well as achieves satisfactory deformation matching results in comparison with other general methods.
Ryohei BANNO Jingyu SUN Susumu TAKEUCHI Kazuyuki SHUDO
MQTT is one of the promising protocols for various data exchange in IoT environments. Typically, those environments have a characteristic called “edge-heavy”, which means that things at the network edge generate a massive volume of data with high locality. For handling such edge-heavy data, an architecture of placing multiple MQTT brokers at the network edges and making them cooperate with each other is quite effective. It can provide higher throughput and lower latency, as well as reducing consumption of cloud resources. However, under this kind of architecture, heterogeneity could be a vital issue. Namely, an appropriate product of MQTT broker could vary according to the different environment of each network edge, even though different products are hard to cooperate due to the MQTT specification providing no interoperability between brokers. In this paper, we propose Interworking Layer of Distributed MQTT brokers (ILDM), which enables arbitrary kinds of MQTT brokers to cooperate with each other. ILDM, designed as a generic mechanism independent of any specific cooperation algorithm, provides APIs to facilitate development of a variety of algorithms. By using the APIs, we also present two basic cooperation algorithms. To evaluate the usefulness of ILDM, we introduce a benchmark system which can be used for both a single broker and multiple brokers. Experimental results show that the throughput of five brokers running together by ILDM is improved 4.3 times at maximum than that of a single broker.
Miao TANG Juxiang WANG Minjia SHI Jing LIANG
Linear complexity and the k-error linear complexity of periodic sequences are the important security indices of stream cipher systems. This paper focuses on the distribution of p-error linear complexity of p-ary sequences with period pn. For p-ary sequences of period pn with linear complexity pn-p+1, n≥1, we present all possible values of the p-error linear complexity, and derive the exact formulas to count the number of the sequences with any given p-error linear complexity.
This paper addresses the problem of developing an efficient fault-tolerant routing method for 2D mesh Network-on-Chips (NoCs) to realize dependable and high performance many core systems. Existing fault-tolerant routing methods have two critical problems of high communication latency and low node utilization. Unlike almost all existing methods where packets always detour faulty nodes, we propose a novel and unique approach that packets can pass through faulty nodes. For this approach, we enhance the common NoC architecture by adding switches and links around each node and propose a fault-tolerant routing method with no virtual channels based on the well-known simple XY routing method. Simulation results show that the proposed method reduces average communication latency by about 97.1% compared with the existing method, without sacrificing fault-free nodes.
Krittin INTHARAWIJITR Katsuyoshi IIDA Hiroyuki KOGA Katsunori YAMAOKA
Most of latency-sensitive mobile applications depend on computational resources provided by a cloud computing service. The problem of relying on cloud computing is that, sometimes, the physical locations of cloud servers are distant from mobile users and the communication latency is long. As a result, the concept of distributed cloud service, called mobile edge computing (MEC), is being introduced in the 5G network. However, MEC can reduce only the communication latency. The computing latency in MEC must also be considered to satisfy the required total latency of services. In this research, we study the impact of both latencies in MEC architecture with regard to latency-sensitive services. We also consider a centralized model, in which we use a controller to manage flows between users and mobile edge resources to analyze MEC in a practical architecture. Simulations show that the interval and controller latency trigger some blocking and error in the system. However, the permissive system which relaxes latency constraints and chooses an edge server by the lowest total latency can improve the system performance impressively.
Richard Hsin-Hsyong YANG Chia-Kun LEE Shiunn-Jang CHERN
Continuous phase modulation (CPM) is a very attractive digital modulation scheme, with constant envelope feature and high efficiency in meeting the power and bandwidth requirements. CPM signals with pairs of input sequences that differ in an infinite number of positions and map into pairs of transmitted signals with finite Euclidean distance (ED) are called catastrophic. In the CPM scheme, data sequences that have the catastrophic property are called the catastrophic sequences; they are periodic difference data patterns. The catastrophic sequences are usually with shorter length of the merger. The corresponding minimum normalized squared ED (MNSED) is smaller and below the distance bound. Two important CPM schemes, viz., LREC and LRC schemes, are known to be catastrophic for most cases; they have poor overall power and bandwidth performance. In the literatures, it has been shown that the probability of generating such catastrophic sequences are negligible, therefore, the asymptotic error performance (AEP) of those well-known catastrophic CPM schemes evaluated with the corresponding MNSED, over AWGN channels, might be too negative or pessimistic. To deal with this problem in AWGN channel, this paper presents a new split-merged MNSED and provide criteria to explore which conventional catastrophic CPM scheme could increase the length of mergers with split-merged non-periodic events, effectively. For comparison, we investigate the exact power and bandwidth performance for LREC and LRC CPM for the same bandwidth occupancy. Computer simulation results verify that the AEP evaluating with the split-merged MNSED could achieve up to 3dB gain over the conventional approach.
Yuan ZHOU Yuichi GOTO Jingde CHENG
Many kinds of questionnaires, testing, and voting are performed in some completely electronic ways to do questions and answers on the Internet as Web applications, i.e. e-questionnaire systems, e-testing systems, and e-voting systems. Because there is no unified communication tool among the stakeholders of e-questionnaire, e-testing, and e-voting systems, until now, all the e-questionnaire, e-testing, and e-voting systems are designed, developed, used, and maintained in various ad hoc ways. As a result, the stakeholders are difficult to communicate to implement the systems, because there is neither an exhaustive requirement list to have a grasp of the overall e-questionnaire, e-testing, and e-voting systems nor a standardized terminology for these systems to avoid ambiguity. A general-purpose specification language to provide a unified description way for specifying various e-questionnaire, e-testing, and e-voting systems can solve the problems such that the stakeholders can refer to and use the complete requirements and standardized terminology for better communications, and can easily and unambiguously specify all the requirements of systems and services of e-questionnaire, e-testing, and e-voting, even can implement the systems. In this paper, we propose the first specification language, named “QSL,” with a standardized, consistent, and exhaustive list of requirements for specifying various e-questionnaire, e-testing, and e-voting systems such that the specifications can be used as the precondition of automatically generating e-questionnaire, e-testing, and e-voting systems. The paper presents our design addressing that QSL can specify all the requirements of various e-questionnaire, e-testing, and e-voting systems in a structured way, evaluates its effectiveness, performs real applications using QSL in case of e-questionnaire, e-testing, and e-voting systems, and shows various QSL applications for providing convenient QSL services to stakeholders.
Since first introduced in 2008 with the 1.0 specification, OpenCL has steadily evolved over the decade to increase its support for heterogeneous parallel systems. In this paper, we accelerate stochastic simulation of biochemical reaction networks on modern GPUs (graphics processing units) by means of the OpenCL programming language. In implementing the OpenCL version of the stochastic simulation algorithm, we carefully apply its data-parallel execution model to optimize the performance provided by the underlying hardware parallelism of the modern GPUs. To evaluate our OpenCL implementation of the stochastic simulation algorithm, we perform a comparative analysis in terms of the performance using the CPU-based cluster implementation and the NVidia CUDA implementation. In addition to the initial report on the performance of OpenCL on GPUs, we also discuss applicability and programmability of OpenCL in the context of GPU-based scientific computing.
In this letter, we investigate the separating redundancy of binary linear codes. Using analytical techniques, we provide a general lower bound on the first separating redundancy of binary linear codes and show the bound is tight for a particular family of binary linear codes, i.e., cycle codes. In other words, the first separating redundancy of cycle codes can be determined. We also derive a deterministic and constructive upper bound on the second separating redundancy of cycle codes, which is shown to be better than the general deterministic and constructive upper bounds for the codes.
To the best of our knowledge, there are a few researches on air-handwriting character-level writer identification only employing acceleration and angular velocity data. In this paper, we propose a deep learning approach to writer identification only using inertial sensor data of air-handwriting. In particular, we separate different representations of degree of freedom (DoF) of air-handwriting to extract local dependency and interrelationship in different CNNs separately. Experiments on a public dataset achieve an average good performance without any extra hand-designed feature extractions.
We design a new oblivious routing algorithm for two-dimensional mesh-based Networks-on-Chip (NoCs) called LEF (Long Edge First) which offers high throughput with low design complexity. LEF's basic idea comes from conventional wisdom in choosing the appropriate dimension-order routing (DOR) algorithm for supercomputers with asymmetric mesh or torus interconnects: routing longest dimensions first provides better performance than other strategies. In LEF, we combine the XY DOR and the YX DOR. When routing a packet, which DOR algorithm is chosen depends on the relative position between the source node and the destination node. Decisions of selecting the appropriate DOR algorithm are not fixed to the network shape but instead made on a per-packet basis. We also propose an efficient deadlock avoidance method for LEF in which the use of virtual channels is more flexible than in the conventional method. We evaluate LEF against O1TURN, another effective oblivious routing algorithm, and a minimal adaptive routing algorithm based on the odd-even turn model. The evaluation results show that LEF is particularly effective when the communication is within an asymmetric mesh. In a 16×8 NoC, LEF even outperforms the adaptive routing algorithm in some cases and delivers from around 4% up to around 64.5% higher throughput than O1TURN. Our results also show that the proposed deadlock avoidance method helps to improve LEF's performance significantly and can be used to improve O1TURN's performance. We also examine LEF in large-scale NoCs with thousands of nodes. Our results show that, as the NoC size increases, the performance of the routing algorithms becomes more strongly influenced by the resource allocation policy in the network and the effect is different for each algorithm. This is evident in that results of middle-scale NoCs with around 100 nodes cannot be applied directly to large-scale NoCs.
Masafumi NAGASAKA Masaaki KOJIMA Takuma TORII Hiromitsu UTSUMI Koji YAMANAKA Shintaro SHINJO Mitsuhiro SHIMOZAWA Hisashi SUJIKAI
Satellite broadcasting of 4K/8K ultra-high definition television (UHDTV) was launched in Japan in December 2018. Because this system uses the amplitude and phase shift keying (APSK) modulation scheme, there is a need to improve the non-linear characteristics of the satellite transponders. To meet this requirement, we have been developing a 120-W-class Ku-band solid state power amplifier (SSPA) as a replacement for the currently used traveling wave tube amplifier (TWTA). In this study, we developed a gallium-nitride (GaN) SSPA and linearizer (LNZ). The SSPA achieved an output power of 120W while maintaining a power added efficiency (PAE) of 31%. We evaluated the transmission performance of 16APSK in this SSPA channel in comparison with that in the TWTA channel.