Tsung-Han LIN Yuki KINEBUCHI Tatsuo NAKAJIMA
In this paper, we propose a virtualization architecture for multi-core embedded systems that provides greater system reliability and security while maintaining performance, without introducing additional special hardware support or implementing a complex protection mechanism in the virtualization layer. Virtualization is not a new technique; embedded systems, especially consumer electronics, have often used it, and it serves both GPOSes (General Purpose Operating Systems) and RTOSes (Real Time Operating Systems). The rise of multi-core platforms in embedded systems also helps consolidate virtualized systems for better performance and lower power consumption. Embedded virtualization designs usually take one of two approaches. The first uses a traditional VMM, but this is too complicated for the embedded environment without additional special hardware support. The other uses a microkernel, which imposes a modular design. In this approach, however, the guest systems suffer considerable modifications, because the microkernel runs guest systems in user space. For RTOSes and applications originally written to run in kernel space, this second approach is more difficult to apply because such code uses many privileged instructions. To achieve better reliability while keeping the virtualization layer lightweight, this work exploits a hardware component commonly adopted in multi-core embedded processors: in most embedded platforms, vendors provide additional on-chip local memory for each physical core, private to that core. By taking advantage of this memory architecture, we can mitigate the above-mentioned problems at once. We re-map the program of our virtualization layer, called SPUMONE, onto this local memory; SPUMONE runs all guest systems in kernel space.
This provides additional reliability and security for the entire system: on a multi-core platform, each SPUMONE instance is installed on a separate processor core and, unlike in traditional virtualization layer designs, the contents of each instance are inaccessible to the others. We achieve this goal without adding overhead to overall performance.
Kyong Hoon KIM Guy Martin TCHAMGOUE Yong-Kee JUN Wan Yeon LEE
In large-scale collaborative computing, users and resource providers organize various Virtual Organizations (VOs) to share resources and services. A VO may organize sub-VOs in pursuit of its goal, which forms hierarchical VO environments. VO participants agree upon certain policies, such as the amount of resources shared or user access rights. In this letter, we provide an optimal resource sharing mechanism for hierarchical VO environments under resource sharing agreements. The proposed algorithm enhances resource utilization and reduces the mean response time of each user.
Bei HUANG Kaidi YOU Yun CHEN Zhiyi YU Xiaoyang ZENG
Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Unlike conventional VLSI approaches, this paper presents a high-throughput, fully programmable Reed-Solomon decoder on a multi-core processor. The platform is a two-dimensional mesh array of Single Instruction Multiple Data (SIMD) cores and is well suited to digital communication applications. By fully extracting the parallelizable operations of the RS decoding process, we propose multiple optimization techniques to improve system throughput, including task-level parallelism across cores, data-level parallelism on each SIMD core, minimized memory access, and route-length-minimized task mapping. For RS(255, 239, 8), experimental results show that our 12-core implementation achieves a throughput of 4.35 Gbps, which is much better than several other published implementations. The results also indicate that, with our approach, throughput scales linearly with the number of cores.
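The data-level parallelism noted above can be illustrated by syndrome computation, the first step of RS decoding: each syndrome is an independent polynomial evaluation over GF(2^8), so the outer loop maps naturally onto SIMD lanes or cores. The scalar sketch below only illustrates this structure; it is not the paper's implementation, and the primitive polynomial and root range are assumed common choices.

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a common GF(2^8) choice

# Build log/antilog tables for fast field multiplication.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for i in range(255, 512):       # duplicate so exponent sums need no modulo
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    """Multiply in GF(2^8) via log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(received, num_syn=16):
    """S_j = r(alpha^j) for j = 1..num_syn. Each j is independent of the
    others, which is the parallelism a SIMD mapping can exploit."""
    out = []
    for j in range(1, num_syn + 1):
        s = 0
        for c in received:          # Horner evaluation at alpha^j
            s = gf_mul(s, EXP[j]) ^ c
        out.append(s)
    return out
```

An error-free all-zero word yields all-zero syndromes, while any nonzero symbol perturbs every syndrome, which is what the later decoding steps exploit.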
Dajiang LIU Shouyi YIN Chongyong YIN Leibo LIU Shaojun WEI
Reconfigurable computing systems are a class of parallel architecture that computes in hardware to increase performance while retaining much of the flexibility of a software solution. This architecture is particularly suitable for regular, compute-intensive tasks, and most compute-intensive tasks spend the bulk of their running time in nested loops. The polyhedral model is a powerful tool for transforming such nested loops. In this paper, we address several issues in optimizing affine loop nests for a reconfigurable cell array (RCA): loop transformations in the polyhedral model that maximize the use of processing elements (PEs) while minimizing communication volume, determination of the tiling form by intra-statement dependence analysis, and determination of the tiling size from the tiling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the proposed mapping optimizations. Compared with a DFG-based optimization approach, the execution performance of 1-D Jacobi and matrix multiplication is improved by 28% and 48.47%, respectively. Finally, the run-time complexity is acceptable for practical cases.
Multi-core processor architectures have become ubiquitous in today's computing platforms, especially in parallel computing installations, thanks to their power and cost advantages. As the technology trend continues toward hundreds of cores on a chip in the foreseeable future, an urgent question for system designers and application users is whether applications receive sufficient support from today's operating systems to scale to many cores. To this end, one needs to understand the strengths and weaknesses of each system's scalability support and to identify the major bottlenecks limiting scalability, if any. As open-source operating systems are of particular interest to the research and industry communities, in this paper we choose three operating systems (Linux, Solaris, and FreeBSD) and systematically evaluate and compare their scalability using a set of highly focused microbenchmarks on an AMD 32-core system, for a broad and detailed understanding. We use system profiling tools and analyze kernel source code to find the root cause of each observed scalability bottleneck. Our results reveal that no single operating system among the three stands out on all system aspects, though each can prevail on some. For example, Linux significantly outperforms Solaris and FreeBSD for file-descriptor- and process-intensive operations. For applications with intensive socket creation and deletion, Solaris leads FreeBSD, which in turn scales better than Linux. With the help of performance tools and source code instrumentation and analysis, we find that the synchronization primitives protecting shared data structures in the kernels are the major bottleneck limiting system scalability.
Yasumichi TAKAI Masanori HASHIMOTO Takao ONOYE
This paper investigates power gating implementations that mitigate power supply noise. We focus on the body connection of power-gated circuits and examine the amount of power supply noise induced by power-on rush current, as well as the contribution of a power-gated circuit as a decoupling capacitance during sleep mode. To identify the best implementation, we designed and fabricated a test chip in a 65-nm process. Experimental results from measurement and simulation reveal that a power-gated circuit with a body-tied structure in a triple well is the best implementation on the following three points: power supply noise due to rush current, the contribution of decoupling capacitance during sleep mode, and the leakage reduction thanks to power gating.
Mamoru OHARA Takashi YAMAGUCHI
In numerical simulations using massively parallel computers such as GPUs (GPGPU, General-Purpose computing on Graphics Processing Units), computational results must often be transferred from external devices such as GPUs to the main memory or secondary storage of the host machine. Since the computation results are sometimes unacceptably large, it is desirable to compress the data before storing it. Moreover, considering the overhead of transferring data between device and host memories, it is preferable to compress the data as part of the parallel computation performed on the devices. Traditional compression methods for floating-point numbers do not always parallelize well. In this paper, we propose a new compression method for massively parallel simulations running on GPUs, in which we combine a few successive floating-point numbers and interleave them to improve compression efficiency. We also present numerical examples of the compression ratio and throughput obtained from experimental implementations of the proposed method running on CPUs and GPUs.
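As a rough illustration of the interleaving idea — assumed here to be a byte-plane transposition of IEEE-754 doubles, not necessarily the authors' exact scheme — grouping the corresponding bytes of successive values places the slowly varying sign/exponent bytes together, which typically helps a general-purpose back-end compressor:

```python
import struct
import zlib

def compress_interleaved(values):
    """Pack doubles, transpose their bytes into 8 planes, then compress.
    Byte k of every value becomes one contiguous plane, so similar
    high-order bytes of smooth data end up adjacent."""
    raw = struct.pack(f'<{len(values)}d', *values)   # 8 bytes per double
    planes = bytes(raw[i] for k in range(8) for i in range(k, len(raw), 8))
    return zlib.compress(planes)

def decompress_interleaved(blob, n):
    """Invert the transposition: raw[8*j + k] = planes[k*n + j]."""
    planes = zlib.decompress(blob)
    raw = bytes(planes[k * n + j] for j in range(n) for k in range(8))
    return list(struct.unpack(f'<{n}d', raw))
```

On a GPU, each plane (and each compressed chunk) can be produced by an independent group of threads; this sketch only shows the data layout, serially.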
Arne KUTZNER Pok-Son KIM Won-Kwang PARK
We propose a family of algorithms for efficient merging on contemporary GPUs, such that each algorithm requires O(m log(n/m + 1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. By the lower bounds for merging, all proposed algorithms are therefore asymptotically optimal in the number of comparisons. First we introduce a parallel algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for an arbitrary i with 0 ≤ i ≤ l. For i = l this algorithm is itself a merger, but a rather inefficient one. Efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that for symmetrically sized inputs our approach delivers runtimes up to 4 times faster than the thrust::merge function of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
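The divide step can be sketched sequentially: for each evenly spaced global output rank, a binary search determines how many elements come from each input, yielding independent subproblems whose merges concatenate to the full merge. This is an illustrative "merge path"/co-rank style partition, not the authors' GPU kernel:

```python
def split_merge(a, b, parts):
    """Partition merge(a, b) into `parts` independent (a-slice, b-slice)
    subproblems whose concatenated merges equal the full merge."""
    total = len(a) + len(b)
    cuts = []
    for p in range(parts + 1):
        k = p * total // parts          # global rank of this cut
        # Binary search for i (elements taken from a) with j = k - i.
        lo, hi = max(0, k - len(b)), min(k, len(a))
        while lo < hi:
            i = (lo + hi) // 2
            if a[i] < b[k - i - 1]:     # cut too far left in a
                lo = i + 1
            else:
                hi = i
        cuts.append((lo, k - lo))
    return [(a[i0:i1], b[j0:j1])
            for (i0, j0), (i1, j1) in zip(cuts, cuts[1:])]

def merge(a, b):
    """Plain sequential two-way merge (stand-in for a block-merger)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

Each binary search costs O(log) comparisons and the subproblems are fully independent, which is what lets the second stage run block-mergers in parallel.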
Geographic routing uses the geographical location information provided by nodes to make routing decisions. However, nodes cannot obtain accurate location information due to measurement error. A new routing strategy, the maximum expected distance and angle (MEDA) algorithm, is proposed to improve performance and raise the successful transmission rate. We first introduce the expected distance and angle, and then employ principal component analysis to construct the objective function for selecting the next-hop node. We compare the proposed algorithm with the maximum expectation within transmission range (MER) and greedy routing scheme (GRS) algorithms. Simulation results show that the proposed MEDA algorithm outperforms the MER and GRS algorithms with a higher successful transmission rate.
Shunsuke ONO Takamichi MIYATA Isao YAMADA Katsunori YAMAOKA
Solving image recovery problems requires efficient regularizations based on a priori information about the unknown original image. Naturally, an image can be modeled as the sum of smooth, edge, and texture components, and obtaining a high-quality recovered image requires an appropriate regularization for each component. In this paper, we propose a novel image recovery technique that performs decomposition and recovery simultaneously. We formulate image recovery as a nonsmooth convex optimization problem and design an iterative scheme based on the alternating direction method of multipliers (ADMM) to approximate its global minimizer efficiently. Experimental results reveal that the proposed technique outperforms a state-of-the-art method.
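The ADMM splitting can be illustrated on a much simpler nonsmooth convex problem than the paper's image model: the l1-regularized least-squares (lasso) problem min_x 0.5||Ax-b||^2 + λ||x||_1. The smooth data term is handled by a quadratic solve and the nonsmooth regularizer by its closed-form proximal (soft-threshold) step; all names and parameters below are illustrative:

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """ADMM for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    Atb = A.T @ b
    # Cache the inverse for the repeated quadratic x-update.
    L = np.linalg.inv(A.T @ A + rho * np.eye(n))
    for _ in range(iters):
        x = L @ (Atb + rho * (z - u))          # smooth quadratic subproblem
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # prox of l1
        u = u + x - z                          # scaled dual (multiplier) update
    return z
```

In the paper's setting each component (smooth, edge, texture) would get its own proximal step, but the alternation between subproblems and a multiplier update is the same pattern.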
Liqiang ZHANG Chao LI Haoliang SUN Changwen ZHENG Pin LV
Due to the complicated composition of clouds and their disordered transformations, current methods do not render clouds faithfully. Based on the physical characteristics of clouds, a physical cellular automaton model of dynamic clouds is designed according to their intrinsic factors; it describes the rules of hydro-movement, deposition, accumulation, and diffusion. A parallel computing architecture is then designed to compute the large-scale data set required for rendering dynamic clouds, and a GPU-based ray-casting algorithm is implemented to render the cloud volume data. Experiments show that the cloud rendering method based on the physical cellular automaton model is very efficient and adequately exhibits the details of clouds.
Ce LI Yiping DONG Takahiro WATANABE
Dynamic power gating applied to FPGAs can reduce power consumption effectively. In this paper, we propose a sophisticated routing architecture for a region-oriented FPGA that supports dynamic power gating; it is the first routing solution for dynamic power gating on a coarse-grained FPGA. This paper makes two main contributions. First, it improves the routing resource graph and routing architecture to support special routing for a region-oriented FPGA. Second, some routing channels are made wider to avoid congestion. Experimental results show that routing area can be reduced by 7.7% compared with the symmetric Wilton switch box in the region. Our proposed FPGA architecture with the sophisticated P&R also reduces the power consumption of systems implemented on the FPGA.
This paper presents an algorithmic approach to acquiring the influencing relationships among users by discovering implicit influencing group structure from smartphone usage. The method assumes that a time series of users' application downloads and activations can be represented by individual inter-personal influence factors. To achieve better predictive performance and to avoid over-fitting, a latent feature model is employed. The method extracts the latent structures by monitoring cross-validated predictive performance on approximated influence matrices of reduced rank, generated from an initial influence matrix obtained from a training set. Nonnegative Matrix Factorization (NMF) is adopted to reduce the influence matrix dimension and thus extract the latent features. To validate and demonstrate the approach, about 160 university students voluntarily participated in a mobile application usage monitoring experiment. An empirical study on the collected data reveals an influencing structure of six influencing groups with two types of mutual influence, i.e., intra-group and inter-group influence. The results also highlight the importance of sparseness control in NMF for discovering latent influencing groups. The obtained influencing structure provides better predictive performance than state-of-the-art collaborative filtering as well as conventional methods such as user-based collaborative filtering and simple popularity.
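The rank-reduction step can be sketched with classical NMF multiplicative updates (Lee-Seung, Frobenius objective): a nonnegative influence matrix V is approximated by W @ H with a small inner rank, whose columns play the role of latent influencing groups. The matrix and sizes below are illustrative, not the study's data:

```python
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9, seed=0):
    """Approximate nonnegative V (n x m) as W @ H with W: n x rank,
    H: rank x m, using Lee-Seung multiplicative updates. The elementwise
    multiplicative form keeps both factors nonnegative throughout."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Sweeping `rank` and cross-validating the predictive error of W @ H, as the paper describes, is then a loop around this factorization.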
Tatsumi TAKAGI Hiroshi HASEGAWA Ken-ichi SATO Yoshiaki SONE Akira HIRANO Masahiko JINNO
We propose optical path routing and frequency slot assignment algorithms that make the best use of elastic optical paths and the capabilities of distance-adaptive modulation. Because the assignment problem is computationally difficult, we develop algorithms for 1+1 dedicated / 1:1 shared protected ring networks and for unprotected mesh networks that fully utilize the characteristics of these topologies. Numerical experiments elucidate that introducing path elasticity and distance-adaptive modulation significantly reduces the occupied bandwidth.
Jean Marc Kouakou ATTOUNGBLE Kazunori OKADA
In this paper, we present Greedy Routing for Maximum Lifetime (GRMax) [1],[2], which uses the limited energy available to nodes in a Wireless Sensor Network (WSN) to delay the dropping of packets and thus extend the network lifetime. We define network lifetime as the time period until a source node starts to drop packets because it has no more paths to the destination [3]. We introduce the new concept of the Network Connectivity Aiming (NCA) node, whose primary goal is to maintain network connectivity and avoid network partition. To evaluate GRMax, we compare its performance with Geographic and Energy Aware Routing (GEAR) [4], an energy-efficient geographic routing protocol, and Greedy Perimeter Stateless Routing (GPSR) [5], a milestone among geographic routing protocols, using OPNET Modeler version 15. The results show that GRMax outperforms GEAR and GPSR in the number of successfully delivered packets and in the time before nodes begin to drop packets. Moreover, with GRMax, there are fewer dead nodes in the system and less energy is required to deliver packets to the destination node (sink).
A novel cooperative spectrum sensing scheme suitable for wireless cognitive radio systems with imperfect reporting channels is proposed. In the proposed scheme, binary local decision bits are transmitted to the fusion center, where they are combined into a soft-valued decision statistic. To form this decision statistic, a majority-decision-aided weighting rule is proposed. The scheme provides reliable sensing capability even over poor reporting channels.
Tomoaki TAKEUCHI Hiroyuki HAMAZUMI Kazuhiko SHIBUYA
As many digital terrestrial broadcasting stations have been installed and are now on air, poor reception has become a serious problem even where receiving power is high. Although we previously developed an interference canceller for broadcast-wave relay stations, an adaptive array is desirable as a receiver for the service area because it is more robust in low-D/U multipath environments. In this paper, we propose a weighting coefficient optimization algorithm for a post-FFT adaptive array that uses the reciprocals of the weighting coefficients. Numerical examples show the effectiveness of the proposed method.
In this letter, we propose a new 4-dimensional constellation-rotation (CR) modulation method that achieves a diversity gain of 4 in Rayleigh fading channels. The proposed scheme applies two consecutive CR operations to QAM symbols, unlike the conventional 2-dimensional CR method based on a single CR operation. Computer simulation results show that the new method performs much better than the conventional one in terms of code rate and channel erasure ratio.
Sang-Gun LEE Hong-Seok CHOI Chang-Wook HAN Seok-Jong LEE Yoon-Heung TAK Byung-Chul AHN
A numerical model of a multi-layered organic light-emitting diode (OLED) is presented in this paper. The current density-voltage (J-V) model for the OLED was constructed using injection-limited and bulk-limited currents. The mobility equation was based on the field-dependent model, the so-called Poole-Frenkel mobility model. The accuracy of the simulation was demonstrated by comparison with experimental results while varying the EML thickness of the multi-layered OLED device. Two hetero-junction models must be dealt with in the simulation. The Langevin recombination rate of electrons and holes is also calculated through the device simulation.
Lijuan ZHENG Yingxin HU Zhen HAN Fei MA
Previous inter-domain fast authentication schemes realize only authentication of user identity. We propose a trusted inter-domain fast authentication scheme based on the split-mechanism network. The proposed scheme realizes proof of platform identity and integrity verification of the platform, as well as proof of user identity. In our scheme, when a mobile terminal moves to a new domain, the visited domain authenticates it directly using a ticket issued by the home domain, rather than authenticating it through the home domain. We demonstrate that the proposed scheme is highly effective and more secure than contemporary inter-domain fast authentication schemes.