Junqi ZHANG Lina NI Chen XIE Ying TAN Zheng TANG
This paper presents an adaptive magnification transformation based particle swarm optimizer (AMT-PSO) that provides an adaptive search strategy for each particle along the search process. Magnification transformation is a simple but very powerful mechanism, which is inspired by using a convex lens to see things much clearer. The essence of this transformation is to set a magnifier around an area we are interested in, so that we could inspect the area of interest more carefully and precisely. An evolutionary factor, which utilizes the information of population distribution in particle swarm, is used as an index to adaptively tune the magnification scale factor for each particle in each dimension. Furthermore, a perturbation-based elitist learning strategy is utilized to help the swarm's best particle to escape the local optimum and explore the potential better space. The AMT-PSO is evaluated on 15 unimodal and multimodal benchmark functions. The effects of the adaptive magnification transformation mechanism and the elitist learning strategy in AMT-PSO are studied. Results show that the adaptive magnification transformation mechanism provides the main contribution to the proposed AMT-PSO in terms of convergence speed and solution accuracy on four categories of benchmark test functions.
Yuki ANDO Seiya SHIBATA Shinya HONDA Hiroyuki TOMIYAMA Hiroaki TAKADA
We present a hardware sharing method for design space exploration of multi-processor embedded systems. In our prior work, we had developed a system-level design tool named SystemBuilder which automatically synthesizes target implementation of a system from a functional description. In this work, we have extended SystemBuilder so that it can automatically synthesize an area-efficient implementation which shares a hardware module among different applications. With SystemBuilder, designers only need to enable an option in order to share a hardware module. The designers, therefore, can easily explore a design space including hardware sharing in short time. A case study shows the effectiveness of the hardware sharing on design space exploration.
Junqi ZHANG Ying TAN Lina NI Chen XIE Zheng TANG
Particle swarm optimizer (PSO) is a stochastic global optimization technique based on a social interaction metaphor. Because of the complexity, dynamics and randomness involved in PSO, it is hard to theoretically analyze the mechanism on which PSO depends. Statistical results have shown that the probability distribution of PSO is a truncated triangle, with uniform probability across the middle that decreases on the sides. The "truncated triangle" is also called the "Maya pyramid" by Kennedy. However, very little is known regarding the sampling distribution of PSO in itself. In this paper, we theoretically analyze the "Maya pyramid" without any assumption and derive its computational formula, which is actually a hybrid uniform distribution that looks like a trapezoid and conforms with the statistical results. Based on the derived density function of the hybrid uniform distribution, the search strategy of PSO is defined and quantified to characterize the mechanism of the search strategy in PSO. In order to show the significance of these definitions based on the derived hybrid uniform distribution, the comparison between the defined search strategies of the classical linear decreasing weight based PSO and the canonical constricted PSO suggested by Clerc is illustrated and elaborated.
Nobuaki TOJO Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI
Recently, two-level cache, L1 cache and L2 cache, is commonly used in a processor. Particularly in an embedded system whereby a single application or a class of applications is repeatedly executed on a processor, its cache configuration can be customized such that an optimal one is achieved. An optimal two-level cache configuration can be obtained which minimizes overall memory access time or memory energy consumption by varying the three cache parameters: the number of sets, a line size, and an associativity, for L1 cache and L2 cache. In this paper, we first extend the L1 cache simulation algorithm so that we can explore two-level cache configuration. Second, we propose two-level cache design space exploration algorithms: CRCB-T1 and CRCB-T2, each of which is based on applying Cache Inclusion Property to two-level cache configuration. Each of the proposed algorithms realizes exact cache simulation but decreases the number of cache hit/miss judgments by a factor of several thousands. Experimental results show that, by using our approach, the number of cache hit/miss judgments required to optimize a cache configurations is reduced to 1/50-1/5500 compared to the exhaustive approach. As a result, our proposed approach totally runs an average of 1398.25 times faster compared to the exhaustive approach. Our proposed cache simulation approach achieves the world fastest two-level cache design space exploration.
Takuji HIEDA Hiroaki TANAKA Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
Partial forwarding is a design method to place forwarding paths on a part of processor pipeline. Hardware cost of processor can be reduced without performance loss by partial forwarding. However, compiler with the instruction scheduler which considers partial forwarding structure of the target processor is required since conventional scheduling algorithm cannot make the most of partial forwarding structure. In this paper, we propose a heuristic instruction scheduling method for processors with partial forwarding structure. The proposed algorithm uses available distance to schedule instructions which are suitable for the target partial forwarding processor. Experimental results show that the proposed method generates near-optimal solutions in practical time and some of the optimized codes for partial forwarding processor run in the shortest time among the target processors. It also shows that the proposed method is superior to hazard detection unit.
Liang-Bi CHEN Chi-Tsai YEH Hung-Yu CHEN Ing-Jer HUANG
3D graphics application is widely used in consumer electronics which is an inevitable tendency in the future. In general, the higher abstraction level is used to model a complex system like 3D graphics SoC. However, the concerned issue is that how to use efficient methods to traverse design space hierarchically, reduce simulation time, and refine the performance fast. This paper demonstrates a system-level design space exploration model for a tile-based 3D graphics SoC refinement. This model uses UML tools which can assist designers to traverse the whole system and reduces simulation time dramatically by adopting SystemC. As a result, the system performance is improved 198% at geometry function and 69% at rendering function, respectively.
Farhad MEHDIPOUR Hamid NOORI Koji INOUE Kazuaki MURAKAMI
Multitude parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. Quantitative design approach uses the data collected from applications to satisfy design constraints and optimize the design goals while considering the applications' characteristics; however it highly depends on designer observations and analyses. Exploring design space can be considered as an effective technique to find a proper balance among various design parameters. Indeed, this approach would be computationally expensive when the performance evaluation of the design points is accomplished based on the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO**) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide a valuable insight. CAnSO has clear speed advantages and therefore it can be used for easing a cumbersome design space exploration of a reconfigurable RISP processor and quick performance evaluation of slightly modified architectures.
Shuichi MIYAZAKI Naoyuki MORIMOTO Yasuo OKABE
The purpose of the online graph exploration problem is to visit all the nodes of a given graph and come back to the starting node with the minimum total traverse cost. However, unlike the classical Traveling Salesperson Problem, information of the graph is given online. When an online algorithm (called a searcher) visits a node v, then it learns information on nodes and edges adjacent to v. The searcher must decide which node to visit next depending on partial and incomplete information of the graph that it has gained in its searching process. The goodness of the algorithm is evaluated by the competitive analysis. If input graphs to be explored are restricted to trees, the depth-first search always returns an optimal tour. However, if graphs have cycles, the problem is non-trivial. In this paper we consider two simple cases. First, we treat the problem on simple cycles. Recently, Asahiro et al. proved that there is a 1.5-competitive online algorithm, while no online algorithm can be (1.25-ε)-competitive for any positive constant ε. In this paper, we give an optimal online algorithm for this problem; namely, we give a (
Nobuaki TOJO Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI
In an embedded system where a single application or a class of applications is repeatedly executed on a processor, its cache configuration can be customized such that an optimal one is achieved. We can have an optimal cache configuration which minimizes overall memory access time by varying the three cache parameters: the number of sets, a line size, and an associativity. In this paper, we first propose two cache simulation algorithms: CRCB1 and CRCB2, based on Cache Inclusion Property. They realize exact cache simulation but decrease the number of cache hit/miss judgments dramatically. We further propose three more cache design space exploration algorithms: CRMF1, CRMF2, and CRMF3, based on our experimental observations. They can find an almost optimal cache configuration from the viewpoint of access time. By using our approach, the number of cache hit/miss judgments required for optimizing cache configurations is reduced to 1/10-1/50 compared to conventional approaches. As a result, our proposed approach totally runs an average of 3.2 times faster and a maximum of 5.3 times faster compared to the fastest approach proposed so far. Our proposed cache simulation approach achieves the world fastest cache design space exploration when optimizing total memory access time.
Ittetsu TANIGUCHI Praveen RAGHAVAN Murali JAYAPALA Francky CATTHOOR Yoshinori TAKEUCHI Masaharu IMAI
Low energy and high performance embedded processor is crucial in the future nomadic embedded systems design. Improvement of memory accesses, especially improvement of spatial and temporal locality is well known technique to reduce energy and increase performance. However, after transformations that improve locality, address calculation often becomes a bottleneck. In this paper, we propose novel AGU (Address Generation Unit) exploration and mapping technique based on a reconfigurable AGU model. Experimental results show that the proposed techniques help exploring AGU architectures effectively and designers can get trade-offs of real life applications for about 10 hours.
Jinhyun CHO Doowon LEE Sangyong YOON Sanggyu PARK Soo-Ik CHAE
In this paper, we present a high-performance VC-1 main-profile decoder for high-definition (HD) video applications, which can decode HD 720p video streams with 30 fps at 80 MHz. We implemented the decoder with a one-poly eight-metal 0.13 µm CMOS process, which contains about 261,900 logic gates and on-chip memories of 13.9 KB SRAM and 13.1 KB ROM and occupies an area of about 5.1 mm2. In designing the VC-1 decoder, we used a template-based SoC design flow, with which we performed the design space exploration of the decoder by trying various configurations of communication channels. Moreover, we also describe architectures of the computation blocks optimized to satisfy the requirements of VC-1 HD applications.
Wouter CAARLS Pieter JONKER Henk CORPORAAL
Developing embedded parallel image processing applications is usually a very hardware-dependent process, often using the single instruction multiple data (SIMD) paradigm, and requiring deep knowledge of the processors used. Furthermore, the application is tailored to a specific hardware platform, and if the chosen hardware does not meet the requirements, it must be rewritten for a new platform. We have proposed the use of design space exploration [9] to find the most suitable hardware platform for a certain application. This requires a hardware-independent program, and we use algorithmic skeletons [5] to achieve this, while exploiting the data parallelism inherent to low-level image processing. However, since different operations run best on different kinds of processors, we need to exploit task parallelism as well. This paper describes how we exploit task parallelism using an asynchronous remote procedure call (RPC) system, optimized for low-memory and sparsely connected systems such as smart cameras. It uses a futures [16]-like model to present a normal imperative C-interface to the user in which the skeleton calls are implicitly parallelized and pipelined. Simulation provides the task dependency graph and performance numbers for the mapping, which can be done at run time to facilitate data dependent branching. The result is an easy to program, platform independent framework which shields the user from the parallel implementation and mapping of his application, while efficiently utilizing on-chip memory and interconnect bandwidth.
Modern digital systems design requires us to explore a large and complex design space to find a best configuration which satisfies design requirements. Such exploration requires a sound representation of design space from which design candidates are efficiently generated, each of which then is evaluated. This paper proposes a plan-generation-evaluation framework which supports a complete process of such design space exploration. The plan phase constitutes a design space of all possible design alternatives by means of a formally defined representation scheme of attributed AND-OR graph. The generation phase generates a set of candidates by algorithmic pruning of the design space in an attributed AND-OR graph with respect to design requirements as well as architectural constraints. Finally, the evaluation phase measures performance of design candidates in a pruned graph to select a best one. A complete process of cache design is exemplified to show the effectiveness of the proposed framework.
Hajime OTA Tatsuoki NAGAISHI Eiichi ARAI
The Time Domain Electromagnetic Method (TDEM) survey is one of the several geophysical exploration methods. In the conventional TDEM survey, an induction coil is used as the magnetometer. However, the measurement depth is limited to about 500 m. Using high Tc SQUIDs, there are expectations of large bandwidth and high sensitivity for the TDEM. We developed the high Tc SQUID TDEM system. We have reduced the system noise by developing a 20 mm20 mm step-edge type direct coupled SQUID and a low noise direct readout flux locked loop (FLL) circuit. We have also improved the slew rate, optimizing the parameter of the FLL circuit. Consequently, the system noise of less than 0.2 pT/Hz1/2 at 1 kHz was achieved in the earth's magnetic field. The slew rate was 7.3 mT/sec. We conducted field trials and confirmed that the TDEM using high Tc SQUIDs obtains information of deeper region with high precision compared with the TDEM using induction coils.
This paper proposes an efficient method for design space exploration of the global optimum configuration for parameterized ASIPs. The method not only guarantees the optimum configuration, but also provides robust speedup for a wide range of processor architectures such as SoC, ASIC as well as ASIP. The optimization procedure within this method takes a two-steps approach. Firstly, design parameters are partitioned into clusters of inter-dependent parameters using parameter dependency information. Secondly, parameters are optimized for each cluster, the results of which are merged for global optimum. In such optimization, inferior configurations are extensively pruned with a detailed optimality mapping between dependent parameters. Experimental results with mediabench applications show an optimization speedup of 4.1 times faster than the previous work on average, which is significant improvement for practical use.
Ing-Jer HUANG Li-Rong WANG Yu-Min WANG Tai-An LU
This paper presents a case study of synthesis of the industrial embedded microcontroller HT48100 and analysis of performance, cost and software compatibility for its implementation alternatives, using the hardware/software co-design system for microcontrollers/microprocessors PIPER-II. The synthesis tool accepts as input the instruction set architecture (behavioral) specification, and produces as outputs the pipelined RTL designs with their simulators, and the reordering constraints which guide the compiler backend to optimize the code for the synthesized designs. A compiler backend is provided to optimize the application software according to the reordering constraints. The study shows that the co-design approach was able to help the original design team to analyze the architectural properties, identify inefficient architecture features, and explore possible architectural improvements and their impacts in both hardware and software. Feasible future upgrades for the microcontroller family have been identified by the study.
Shinsuke KOBAYASHI Kentaro MITA Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes a compiler generation method for PEAS-III (Practical Environment for ASIP development), which is a configurable processor development environment for application domain specific embedded systems. Using the PEAS-III system, not only the HDL description of a target processor but also its target compiler can be generated. Therefore, execution cycles and dynamic power consumption can be rapidly evaluated. Two processors and their derivatives were designed using the PEAS-III system in the experiment. Experimental results show that the trade-offs among area, performance and power consumption of processors were analyzed in about twelve hours and the optimal processor was selected under the design constraints by using generated compilers and processors.
Peng-Cheng KAO Chih-Kuang HSIEH Ching-Feng SU Allen C.-H. WU
In this paper, we present an RTL design-space exploration method for high-level applications. We formulate the RTL design-space exploration into a performance-driven module selection problem. We devise a dynamic-programming algorithm to solve the problem. We present an exploration flow by integrating commercial synthesis and layout tools with our proposed method. Experimental results have demonstrated that generating AT-curve for all modules is the most time consuming task in the design-space exploration process. Using the proposed 3-point AT projection approach, our method can achieve on an average of 80% speed-up in run time and 90% accuracy in design estimation.
Gang ZHAO Shoji TATSUMI Ruoying SUN
Reinforcement Learning (RL) is an efficient method for solving Markov Decision Processes (MDPs) without a priori knowledge about an environment, and can be classified into the exploitation oriented method and the exploration oriented method. Q-learning is a representative RL and is classified as an exploration oriented method. It is guaranteed to obtain an optimal policy, however, Q-learning needs numerous trials to learn it because there is not action-selecting mechanism in Q-learning. For accelerating the learning rate of the Q-learning and realizing exploitation and exploration at a learning process, the Q-ee learning system has been proposed, which uses pre-action-selector, action-selector and back propagation of Q values to improve the performance of Q-learning. But the Q-ee learning is merely suitable for deterministic MDPs, and its convergent guarantee to derive an optimal policy has not been proved. In this paper, based on discussing different exploration methods, replacing the pre-action-selector in the Q-ee learning, we introduce a method that can be used to implement an active exploration to an environment, the Active Exploration Planning (AEP), into the learning system, which we call the Q-ae learning. With this replacement, the Q-ae learning not only maintains advantages of the Q-ee learning but also is adapted to a stochastic environment. Moreover, under deterministic MDPs, this paper presents the convergent condition and its proof for an agent to obtain the optimal policy by the method of the Q-ae learning. Further, by discussions and experiments, it is shown that by adjusting the relation between the learning factor and the discounted rate, the exploration process to an environment can be controlled on a stochastic environment. And, experimental results about the exploration rate to an environment and the correct rate of learned policies also illustrate the efficiency of the Q-ae learning on the stochastic environment.
Tetsuya YOSHIDA Koichi HORI Shinichi NAKASUKA
This paper proposes a new method to improve cooperation in concurrent systems within the framework of Multi-Agent Systems (MAS) by utilizing reinforcement learning. When subsystems work independently and concurrently, achieving appropriate cooperation among them is important to improve the effectiveness of the overall system. Treating subsystems as agents makes it easy to explicitly deal with the interactions among them since they can be modeled naturally as communication among agents with intended information. In our approach agents try to learn the appropriate balance between exploration and exploitation via reward, which is important in distributed and concurrent problem solving in general. By focusing on how to give reward in reinforcement learning, not the learning equation, two kinds of reward are defined in the context of cooperation between agents, in contrast to reinforcement learning within the framework of single agent. In our approach reward for insistence by individual agent contributes to facilitating exploration and reward for concession to other agents contributes to facilitating exploitation. Our cooperation method was examined through experiments on the design of micro satellites and the result showed that it was effective to some extent to facilitate cooperation among agents by letting agents themselves learn the appropriate balance between insistence and concession. The result also suggested the possibility of utilizing the relative magnitude of these rewards as a new control parameter in MAS to control the overall behavior of MAS.