Reliability is an important figure of merit of the system and it must be satisfied in safety-critical applications. This paper considers parallel applications on heterogeneous embedded systems and proposes a two-phase algorithm framework to minimize energy consumption for satisfying applications’ reliability requirement. The first phase is for initial assignment and the second phase is for either satisfying the reliability requirement or improving energy efficiency. Specifically, when the application’s reliability requirement cannot be achieved via the initial assignment, an algorithm for enhancing the reliability of tasks is designed to satisfy the application’s reliability requirement. Considering that the reliability of initial assignment may exceed the application’s reliability requirement, an algorithm for reducing the execution frequency of tasks is designed to improve energy efficiency. The proposed algorithms are compared with existing algorithms by using real parallel applications. Experimental results demonstrate that the proposed algorithms consume less energy while satisfying the application’s reliability requirements.
Lukas NAKAMURA Hiromitsu AWANO
We propose “Temporal Ensemble SSDLite,” a new method for video object detection that boosts accuracy while maintaining detection speed and energy consumption. Object detection for video is becoming increasingly important as a core part of applications in robotics, autonomous driving and many other promising fields. Many of these applications require high accuracy and speed to be viable, but are used in compute and energy restricted environments. Therefore, new methods that increase the overall performance of video object detection i.e., accuracy and speed have to be developed. To increase accuracy we use ensemble, the machine learning method of combining predictions of multiple different models. The drawback of ensemble is the increased computational cost which is proportional to the number of models used. We overcome this deficit by deploying our ensemble temporally, meaning we inference with only a single model at each frame, cycling through our ensemble of models at each frame. Then, we combine the predictions for the last N frames where N is the number of models in our ensemble through non-max-suppression. This is possible because close frames in a video are extremely similar due to temporal correlation. As a result, we increase accuracy through the ensemble while only inferencing a single model at each frame and therefore keeping the detection speed. To evaluate the proposal, we measure the accuracy, detection speed and energy consumption on the Google Edge TPU, a machine learning inference accelerator, with the Imagenet VID dataset. Our results demonstrate an accuracy boost of up to 4.9% while maintaining real-time detection speed and an energy consumption of 181mJ per image.
This letter presents a novel technique to achieve a fast inference of the binarized convolutional neural networks (BCNN). The proposed technique modifies the structure of the constituent blocks of the BCNN model so that the input elements for the max-pooling operation are binary. In this structure, if any of the input elements is +1, the result of the pooling can be produced immediately; the proposed technique eliminates such computations that are involved to obtain the remaining input elements, so as to reduce the inference time effectively. The proposed technique reduces the inference time by up to 34.11%, while maintaining the classification accuracy.
Koichi MITSUNARI Jaehoon YU Takao ONOYE Masanori HASHIMOTO
Visual object detection on embedded systems involves a multi-objective optimization problem in the presence of trade-offs between power consumption, processing performance, and detection accuracy. For a new Pareto solution with high processing performance and low power consumption, this paper proposes a hardware architecture for decision tree ensemble using multiple channels of features. For efficient detection, the proposed architecture utilizes the dimensionality of feature channels in addition to parallelism in image space and adopts task scheduling to attain random memory access without conflict. Evaluation results show that an FPGA implementation of the proposed architecture with an aggregated channel features pedestrian detector can process 229 million samples per second at 100MHz operation frequency while it requires a relatively small amount of resources. Consequently, the proposed architecture achieves 350fps processing performance for 1080P Full HD images and outperforms conventional object detection hardware architectures developed for embedded systems.
Naoki FUJIEDA Kiyohiro SATO Ryodai IWAMOTO Shuichi ICHIKAWA
Instruction set randomization (ISR) is a cost-effective obfuscation technique that modifies or enhances the relationship between instructions and machine languages. An Instruction Register File (IRF), a list of frequently used instructions, can be used for ISR by providing the way of indirect access to them. This study examines the IRF that integrates a positional register, which was proposed as a supplementary unit of the IRF, for the sake of tamper resistance. According to our evaluation, with a new design for the contents of the positional register, the measure of tamper resistance was increased by 8.2% at a maximum, which corresponds to a 32.2% increase in the size of the IRF. The number of logic elements increased by the addition of the positional register was 3.5% of its baseline processor.
Yining XU Ittetsu TANIGUCHI Hiroyuki TOMIYAMA
Task mapping is one of the most important design processes in embedded manycore systems. This paper proposes a static task mapping technique for manycore real-time systems. The technique minimizes the number of cores while satisfying deadline constraints of individual tasks.
Yining XU Yang LIU Junya KAIDA Ittetsu TANIGUCHI Hiroyuki TOMIYAMA
This paper proposes a static application mapping technique, based on integer linear programming, for non-hierarchical manycore embedded systems. Unlike previous work which was designed for hierarchical manycore SoCs, this work allows more flexible application mapping to achieve higher performance. The experimental results show the effectiveness of this work.
Hideki TAKASE Gang ZENG Lovic GAUTHIER Hirotaka KAWASHIMA Noritoshi ATSUMI Tomohiro TATEMATSU Yoshitake KOBAYASHI Takenori KOSHIRO Tohru ISHIHARA Hiroyuki TOMIYAMA Hiroaki TAKADA
This paper presents a framework for reducing the energy consumption of embedded real-time systems. We implemented the presented framework as both an optimization toolchain and an energy-aware real-time operating system. The framework consists of the integration of multiple techniques to optimize the energy consumption. The main idea behind our approach is to utilize trade-offs between the energy consumption and the performance of different processor configurations during task checkpoints, and to maintain memory allocation during task context switches. In our framework, a target application is statically analyzed at both intra-task and inter-task levels. Based on these analyzed results, runtime optimization is performed in response to the behavior of the application. A case study shows that our toolchain and real-time operating systems have achieved energy reduction while satisfying the real-time performance. The toolchain has also been successfully applied to a practical application.
Ittetsu TANIGUCHI Junya KAIDA Takuji HIEDA Yuko HARA-AZUMI Hiroyuki TOMIYAMA
This paper studies mapping techniques of multiple applications on embedded many-core SoCs. The mapping techniques proposed in this paper are static which means the mapping is decided at design time. The mapping techniques take into account both inter-application and intra-application parallelism in order to fully utilize the potential parallelism of the many-core architecture. Additionally, the proposed static mapping supports dynamic application switching, which means the applications mapped onto the same cores are switched to each other at runtime. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. Experimental results show the effectiveness of the proposed techniques.
Chungsoo LIM Soojeong LEE Jae-Hun CHOI Joon-Hyuk CHANG
In this letter, we propose a simple but effective technique that improves statistical model-based voice activity detection (VAD) by both reducing computational complexity and increasing detection accuracy. The improvements are made by applying Taylor series approximations to the exponential and logarithmic functions in the VAD algorithm based on an in-depth analysis of the algorithm. Experiments performed on a smartphone as well as on a desktop computer with various background noises confirm the effectiveness of the proposed technique.
Junya KAIDA Yuko HARA-AZUMI Takuji HIEDA Ittetsu TANIGUCHI Hiroyuki TOMIYAMA Koji INOUE
This paper studies the static mapping of multiple applications on embedded many-core SoCs. The mapping techniques proposed in this paper take into account both inter-application and intra-application parallelism in order to fully utilize the potential parallelism of the many-core architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. Experiments show the effectiveness of the proposed techniques.
Memory accesses are a major cause of energy consumption for embedded systems. This paper presents the implementation of a fully software technique which places stack and static data into a scratch-pad memory (SPM) in order to reduce the energy consumed by the processor while accessing them. Since an SPM is usually too small to include all these data, some of them must be left into the external main memory (MM). Therefore, further energy reduction is achieved by moving some stack data between both memories at run time. The technique employs integer linear programming in order to find at compile time the optimal placement of static data and management of the stack and implements it by inserting stack operations inside the code. Experimental results show that with an SPM of only 1 KB, our technique is able to exploit it for reducing the energy consumption related to the static and stack data accesses by more than 90% for several applications and on an average by 57% compared to the case where these data are fully placed into the main memory.
This paper presents a verification and analysis tool set for embedded systems. Recently, the development scale of embedded systems has been increasing since they are used for mobile systems, automobile platforms, and various consumer systems with rich functionality. This has increased the amount of time and cost needed to develop them. Consequently, it is very important to develop tools to reduce development time and cost. This paper describes a tool set consisting of three tools to enhance the efficiency of embedded system design. The first tool is an integrated tool platform. The second is a remote debugging system. The third is a clock-accurate verification system based on a field-programmable gate array (FPGA) for custom embedded systems. This tool set promises to significantly reduce the time and cost needed to develop embedded systems.
Cheng-Min LIN Shyi-Shiou WU Tse-Yi CHEN
Universal Plug and Play (UPnP) allows devices automatic discovery and control of services available in those devices connected to a Transmission Control Protocol/ Internet Protocol (TCP/IP) network. Although many products are designed using UPnP, little attention has been given to UPnP related to modeling and performance analysis. This paper uses a framework of Generalized Stochastic Petri Net (GSPN) to model and analyze the behavior of UPnP systems. The framework includes modeling UPnP, reachability decomposition, GSPN analysis, and reward assignment. Then, the Platform Independent Petri net Editor 2 (PIPE2) tool is used to model and evaluate the controllers in terms of power consumption, system utilization and network throughput. Through quantitative analysis, the steady states in the operation and notification stage dominate the system performance, and the control point is better than the device in power consumption but the device outperforms the control point in evaluating utilization. The framework and numerical results are useful to improve the quality of services provided in UPnP devices.
Won-young CHUNG Ha-young JEONG Won Woo RO Yong-surk LEE
In this paper, we propose a novel low-cost Message Passing Interface (MPI) unit between processor nodes, which supports message passing in multiprocessor systems using distributed memory architecture. Our MPI unit operates in the standard mode – using the buffered mode for small amounts of data transaction and the synchronous mode for large amounts of data transaction. This results in increased performance by reducing the control message transmission time for the small amount of data. We verified the performance with a simulator designed based on SystemC. Additionally, we designed the MPI unit using VerilogHDL, and we synthesized it with a synopsys design compiler. The proposed standard mode MPI unit shows a high performance even though the size of the MPI unit occupies less than 1% of the whole chip. Thus, with respect to low-cost design and scalability, this MPI hardware unit is useful to increase overall performance of the embedded Multiprocessor System on a Chip (MPSoC).
This paper proposes an energy efficient processor which can be used as a design alternative for the dynamic voltage scaling (DVS) processors in embedded system design. The processor consists of multiple PE (processing element) cores and a selective set-associative cache memory. The PE-cores have the same instruction set architecture but differ in their clock speeds and energy consumptions. Only a single PE-core is activated at a time and the other PE-cores are deactivated using clock gating and signal gating techniques. The major advantage over the DVS processors is a small overhead for changing its performance. The gate-level simulation demonstrates that our processor can change its performance within 1.5 microsecond and dissipates about 10 nano-joule while conventional DVS processors need hundreds of microseconds and dissipate a few micro-joule for the performance transition. This makes it possible to apply our multi-performance processor to many real-time systems and to perform finer grained and more sophisticated dynamic voltage control.
Interrupt service routines are a key technology for embedded systems. In this paper, we introduce the standard approach for using Generalized Stochastic Petri Nets (GSPNs) as a high-level model for generating CTMC Continuous-Time Markov Chains (CTMCs) and then use Markov Reward Models (MRMs) to compute the performance for embedded systems. This framework is employed to analyze two embedded controllers with low cost and high performance, ARM7 and Cortex-M3. Cortex-M3 is designed with a tail-chaining mechanism to improve the performance of ARM7 when a nested interrupt occurs on an embedded controller. The Platform Independent Petri net Editor 2 (PIPE2) tool is used to model and evaluate the controllers in terms of power consumption and interrupt overhead performance. Using numerical results, in spite of the power consumption or interrupt overhead, Cortex-M3 performs better than ARM7.
Hae Young LEE Seung-Min PARK Tae Ho CHO
This paper presents an approach to implementing simulation models for SAM fuzzy controllers without the use of external components. The approach represents a fuzzy controller as a composition of simple simulation models which involve only basic operations.
Bing-Fei WU Hao-Yu HUANG Yen-Lin CHEN Hsin-Yuan PENG Jia-Hsiung HUANG
This study presents several optimization approaches for the MPEG-2/4 Audio Advanced Coding (AAC) Low Complexity (LC) encoding and decoding processes. Considering the power consumption and the peripherals required for consumer electronics, this study adopts the TI OMAP5912 platform for portable devices. An important optimization issue for implementing AAC codec on embedded and mobile devices is to reduce computational complexity and memory consumption. Due to power saving issues, most embedded and mobile systems can only provide very limited computational power and memory resources for the coding process. As a result, modifying and simplifying only one or two blocks is insufficient for optimizing the AAC encoder and enabling it to work well on embedded systems. It is therefore necessary to enhance the computational efficiency of other important modules in the encoding algorithm. This study focuses on optimizing the Temporal Noise Shaping (TNS), Mid/Side (M/S) Stereo, Modified Discrete Cosine Transform (MDCT) and Inverse Quantization (IQ) modules in the encoder and decoder. Furthermore, we also propose an efficient memory reduction approach that provides a satisfactory balance between the reduction of memory usage and the expansion of the encoded files. In the proposed design, both the AAC encoder and decoder are built with fixed-point arithmetic operations and implemented on a DSP processor combined with an ARM-core for peripheral controlling. Experimental results demonstrate that the proposed AAC codec is computationally effective, has low memory consumption, and is suitable for low-cost embedded and mobile applications.
Hiroki SUGANO Hiroyuki OCHI Yukihiro NAKAMURA Ryusuke MIYAMOTO
Recently, many researchers tackle accurate object recognition algorithms and many algorithms are proposed. However, these algorithms have some problems caused by variety of real environments such as a direction change of the object or its shading change. The new tracking algorithm, Cascade Particle Filter, is proposed to fill such demands in real environments by constructing the object model while tracking the objects. We have been investigating to implement accurate object recognition on embedded systems in real-time. In order to apply the Cascade Particle Filter to embedded applications such as surveillance, automotives, and robotics, a hardware accelerator is indispensable because of limitations in power consumption. In this paper we propose a hardware implementation of the Discrete AdaBoost algorithm that is the most computationally intensive part of the Cascade Particle Filter. To implement the proposed hardware, we use PICO Express, a high level synthesis tool provided by Synfora, for rapid prototyping. Implementation result shows that the synthesized hardware has 1,132,038 transistors and the die area is 2,195 µm 1,985 µm under a 0.180 µm library. The simulation result shows that total processing time is about 8.2 milliseconds at 65 MHz operation frequency.