Tomoki NAKAMURA Naoki HAYASHI Masahiro INUIGUCHI
In this paper, we consider distributed decision-making over directed time-varying multi-agent systems. We study an adversarial bandit problem in which a group of agents selects one of multiple arms to maximize the total reward. In the proposed method, the agents cooperatively search for the optimal arm, i.e., the arm with the highest reward, by a consensus-based distributed Exp3 policy. To this end, each agent exchanges its estimate of each arm's reward and its weight for exploitation with the nearby agents on the network. To unify the explored information about the arms, each agent mixes its own estimates and weights with those of nearby agents through consensus dynamics. Then, each agent updates its probability distribution over the arms by combining the Hedge algorithm with uniform search. We show that a sublinear pseudo-regret can be achieved by appropriately setting the parameters of the distributed Exp3 policy.
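To illustrate the flavor of such an update, the following is a minimal single-agent sketch of an Exp3-style step that mixes the Hedge distribution with uniform search and averages weights with neighbors; the parameter names (gamma, eta), the importance-weighted reward estimate, and the log-domain consensus step are illustrative assumptions rather than the paper's exact update.

```python
import numpy as np

def exp3_distribution(weights, gamma):
    """Mix the Hedge (exponential-weights) distribution with uniform search."""
    hedge = weights / weights.sum()
    return (1.0 - gamma) * hedge + gamma / len(weights)

def local_exp3_update(weights, probs, pulled_arm, reward, eta):
    """Hedge update with an importance-weighted (unbiased) reward estimate."""
    estimate = np.zeros_like(weights)
    estimate[pulled_arm] = reward / probs[pulled_arm]
    return weights * np.exp(eta * estimate)

def consensus_mix(neighbor_weight_rows, mixing_row):
    """Geometric averaging of neighbors' weight vectors (consensus step)."""
    logs = np.log(np.asarray(neighbor_weight_rows))
    return np.exp(mixing_row @ logs)
```

In a distributed run, each agent would alternate `local_exp3_update` on its own pull with `consensus_mix` over the weight vectors received from neighbors.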
Keita TERASHIMA Koichi KOBAYASHI Yuh YAMASHITA
In a multi-agent system, it is important to design cooperative actions that achieve a common goal. In this paper, we propose two novel multi-agent reinforcement learning methods in which the control specification representing the common goal is described by linear temporal logic (LTL) formulas. First, we propose a simple method that directly extends the single-agent case; this method, however, suffers from technical issues caused by the increase in the number of agents. Next, to overcome these issues, we propose a new method that introduces an aggregator. Finally, the two methods are compared by numerical simulations using a surveillance problem as an example.
With the rapid growth of computational demands in the Internet of Things (IoT), resource-limited edge servers usually need to cooperate to perform tasks. Most related studies assume a static cooperation approach, which might not suit the dynamic environment of edge computing. In this paper, we consider a dynamic cooperation approach that guides edge servers to form coalitions dynamically. This raises two issues: 1) how to guide the servers to form coalitions optimally, and 2) how to cope with the dynamics of the problem, where server statuses change as tasks are performed. The coalitional Markov decision process (CMDP) model proposed in our previous work handles these issues well. However, its basic solution method, coalitional Q-learning, cannot handle the large-scale problems that arise in edge computing when the number of tasks is large. In response, we propose a novel algorithm called deep coalitional Q-learning (DCQL). In summary, we first formulate the dynamic cooperation problem of edge servers as a CMDP: each edge server is regarded as an agent, and the dynamic process is modeled as an MDP in which the agents observe the current state and form several coalitions. Each coalition takes an action that affects the environment, which then transitions to the next state, and the process repeats. We then propose DCQL, which incorporates a deep neural network and can therefore cope well with large-scale problems. DCQL guides the edge servers to form coalitions dynamically so as to optimize a given objective. Furthermore, we run experiments that verify the effectiveness of the proposed algorithm in various settings.
Distributed edge cloud computing is an important computation infrastructure for the Internet of Things (IoT), and its task offloading problem has attracted much attention recently. Most existing work on task offloading in distributed edge cloud computing assumes that each self-interested user owns one edge server and chooses whether to execute its tasks locally or to offload them to cloud servers. The goal of each edge server is to maximize its own interest, such as a low delay cost, which corresponds to a non-cooperative setting. However, with the rapid development of smart IoT communities such as smart hospitals and smart factories, all edge and cloud servers can belong to one organization, such as a technology company. This corresponds to a cooperative setting, where the goal of the organization is to maximize the team interest of the overall edge cloud computing system. In this paper, we consider a new problem called cooperative task offloading, in which all edge servers cooperate to make the entire edge cloud computing system achieve good performance, such as low delay and energy costs. This problem is hard to solve due to two issues: 1) each edge server's status changes dynamically and task arrivals are uncertain; 2) each edge server can observe only its own status, which makes it hard to optimize the team interest because global information is unavailable. To address these issues, we formulate the problem as a decentralized partially observable Markov decision process (Dec-POMDP), which can well handle the dynamic features under partial observations. We then apply a multi-agent reinforcement learning algorithm called value decomposition network (VDN) and propose a VDN-based task offloading algorithm (VDN-TO). Specifically, a team value function evaluates the team interest and is decomposed into individual value functions, one for each edge server. Each edge server then updates its individual value function in the direction that maximizes the team interest. Finally, we evaluate our algorithm on part of a real dataset, and the results show its effectiveness in comparison with existing methods.
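As a rough illustration of the value decomposition idea, the following PyTorch sketch sums per-server Q-values into a team Q-value that can be trained end-to-end against a team reward; the network architecture, sizes, and names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-server Q-network mapping a local observation to action values."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return self.net(obs)

def vdn_team_q(agent_nets, observations, actions):
    """VDN decomposition: the team value is the sum of individual values,
    so a TD loss on the team Q-value trains every per-server network."""
    q_values = [net(obs).gather(-1, act.unsqueeze(-1)).squeeze(-1)
                for net, obs, act in zip(agent_nets, observations, actions)]
    return torch.stack(q_values, dim=0).sum(dim=0)   # Q_tot = sum_i Q_i
```

Because the sum is monotone in each summand, every server can act greedily on its own Q-network at execution time while training still optimizes the team objective.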
Takanori HARA Masahiro SASABE Shoji KASAHARA
Traffic congestion in road networks has been studied as a congestion game in game theory. In existing work, the road usage of each agent was assumed to be static over the whole time horizon of the agent's travel, as in the classical congestion game. This assumption, however, should be reconsidered because each agent uses the roads composing its route sequentially. In this paper, we propose a multi-agent distributed route selection scheme for vehicular networks based on a gradient descent method that accounts for the time dependency of the agents' road usage. The proposed scheme first estimates the time-dependent flow on each road by considering the agents' probabilistic occupation under the first-in-first-out (FIFO) policy. It then calculates the optimal route choice probability of each route candidate using the gradient descent method and the estimated time-dependent flow. Each agent finally selects one route according to the optimal route choice probabilities. We first prove that the proposed scheme converges exponentially to the steady state at a rate inversely proportional to the product of the number of agents and the number of individual route candidates. Through simulations on a grid-like network and a real road network, we show that the proposed scheme improves the actual travel time by 5.1% and 2.5%, respectively, compared with the conventional static-flow-based approach. In addition, we demonstrate that the proposed scheme is robust against incomplete information sharing among agents, which could be caused by a low penetration ratio or the limited transmission range of wireless communications.
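Since each agent maintains a probability distribution over its route candidates, one natural form for the per-iteration update is projected gradient descent on the probability simplex. The sketch below (with an assumed gradient of the expected travel time) illustrates this form, not the paper's exact update.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def update_route_probabilities(p, travel_time_gradient, step_size):
    """One gradient-descent step on an agent's route-choice probabilities,
    projected back so the result remains a valid distribution."""
    return project_to_simplex(p - step_size * travel_time_gradient)
```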
This paper proposes a switched pinning control method with a multi-rating mechanism for vehicle platoons. The platoons are expressed as multi-agent systems consisting of mass-damper systems in which pinning agents receive target velocities from external devices (e.g., intelligent traffic signals). We construct a model predictive control (MPC) algorithm that switches pinning agents by solving mixed-integer quadratic programming (MIQP) problems. The optimization rate is determined according to the convergence rate to the target velocities and the inter-vehicular distances. This multi-rating mechanism reduces the computational load caused by iterative calculation. Numerical results demonstrate that our method reduces string instability by selecting the pinning agents so as to minimize the errors between the inter-vehicular distances and the target distances.
Fanying ZHENG Fu GU Yangjian JI Jianfeng GUO Xinjian GU Jin ZHANG
In the context of Web 2.0, interactions between users and resources are increasingly frequent in the process of resource sharing and consumption. However, current research on resource pricing mainly focuses on the attributes of the resource itself and does not weigh the interests of the resource-sharing participants. To address these problems, this paper establishes a pricing mechanism based on resource-user interaction evaluation and multi-agent game theory. Moreover, user similarity, evaluation bias based on link analysis, and punishment of academic group cheating are also included in the model. Based on data on 181 scholars and 509 articles from the Wanfang database, this paper conducts 5,483 pricing experiments over 13 months, and the results show that this model is more effective than other pricing models: the pricing accuracy for resources is 94.2%, and the accuracy of user value evaluation is 96.4%. Besides, the model can intuitively show the relationships among users and among resources. The case study also shows that a user's knowledge level is not positively correlated with his or her authority. Discovering and punishing academic group cheating is conducive to objectively evaluating researchers and resources. The pricing mechanism for scientific and technological resources and their users proposed in this paper is a prerequisite for fair trade of scientific and technological resources.
Makoto YAMASHITA Naoki HAYASHI Takeshi HATANAKA Shigemasa TAKAI
This paper investigates a constrained distributed online optimization problem over strongly connected communication networks, where the local cost function of each agent varies in time due to environmental factors. We propose a distributed online projected subgradient method over unbalanced directed networks. The performance of the proposed method is evaluated by a regret, which is defined as the error between the cumulative cost over time and the cost of the optimal strategy in hindsight. We show that a logarithmic regret bound can be achieved for strongly convex cost functions. We also demonstrate the validity of the proposed method through a numerical example of distributed estimation over a diffusion field.
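For reference, a standard form of the regret of agent $j$ in distributed online optimization is the following (notation assumed: $n$ agents, horizon $T$, local costs $f_t^i$, constraint set $X$, and agent $j$'s decision $x_t^j$; the paper's exact definition may differ in detail):

```latex
R_j(T) = \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^j) \;-\; \min_{x \in X}\,\sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x).
```

A sublinear bound $R_j(T) = o(T)$ means the time-averaged cost approaches that of the best fixed strategy in hindsight; a logarithmic bound is the stronger guarantee obtained here for strongly convex costs.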
Makoto YAMASHITA Naoki HAYASHI Shigemasa TAKAI
This paper considers a distributed subgradient method for online optimization with event-triggered communication over multi-agent networks. At each step, each agent obtains a time-varying private convex cost function. To cooperatively minimize the global cost function, the agents need to communicate with each other. The communication with neighbor agents is conducted by an event-triggered method that reduces the number of communications. We demonstrate that the proposed online algorithm achieves a sublinear regret bound in a dynamic environment with slow dynamics.
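A typical event-triggered rule transmits only when the local state has drifted sufficiently from the last broadcast value; the following minimal sketch (with an assumed decaying threshold schedule) illustrates the idea.

```python
import numpy as np

def should_broadcast(current_state, last_sent_state, step, c=1.0, decay=0.6):
    """Transmit only when the deviation since the last broadcast exceeds a
    threshold; a decaying schedule c / (step + 1)**decay is assumed here."""
    threshold = c / (step + 1) ** decay
    return np.linalg.norm(current_state - last_sent_state) > threshold
```

Between triggers, neighbors keep using the last received state, so the schedule trades communication savings against consensus error.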
Ryohei KAWATA Katsuhide FUJITA
Multi-time negotiation, which repeats negotiations many times under the same conditions, is an important class of automated negotiation. We propose a meta-strategy that selects an agent's individual negotiation strategy for multi-time negotiation. Because the performance of negotiating agents depends on situational parameters, such as the negotiation domain and the opponents, a suitable and effective individual strategy should be selected according to the negotiation situation. However, most existing agents negotiate based on only one negotiation policy: one bidding strategy, one acceptance strategy, and one opponent-modeling method. Although such agents negotiate effectively in most situations, they do not work well in particular situations, and their utilities decrease. The proposed meta-strategy provides an effective negotiation strategy for the situation at the beginning of the negotiation. We model the meta-strategy as a multi-armed bandit problem that regards each individual negotiation strategy as a slot machine and the agent's utility as a reward. We implement the meta-strategy as negotiating agents that use existing effective agents as the individual strategies. The experimental results demonstrate the effectiveness of our meta-strategy under various negotiation conditions. Additionally, the results indicate that the individual utilities of negotiating agents are influenced by the opponents' strategies, the opponents' profiles, and their own profiles.
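As one concrete instantiation of such a meta-strategy, the sketch below runs UCB1 over a portfolio of individual strategies, treating the utility of each completed negotiation session as the bandit reward; UCB1 is an assumed choice of bandit policy, not necessarily the paper's.

```python
import math

class MetaStrategyBandit:
    """UCB1 over individual negotiation strategies: each strategy is an
    arm, and the utility of one negotiation session is the reward."""
    def __init__(self, n_strategies):
        self.counts = [0] * n_strategies
        self.means = [0.0] * n_strategies
        self.total = 0

    def select(self):
        for arm, c in enumerate(self.counts):   # try each strategy once first
            if c == 0:
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.total) / self.counts[a]))

    def update(self, arm, utility):
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (utility - self.means[arm]) / self.counts[arm]
```

Across repeated sessions, the bandit concentrates on the individual strategy that earns the highest utility against the current opponent and domain.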
Yuta HOSOKAWA Katsuhide FUJITA
In recent years, agreement technologies have garnered interest in the field of multi-agent systems. Automated negotiation is one of the agreement technologies, in which agents negotiate with each other to reach an agreement and resolve conflicts between their preferences. Although most agents keep their own preferences private, it is necessary to estimate the opponent's preferences to obtain a better agreement. Therefore, opponent modeling is one of the most important elements of an automated negotiation strategy. A frequency model is widely used for opponent modeling because of its robustness against various types of strategies and its ease of implementation. However, existing frequency models do not consider the opponent's proposal speed or the transition of offers. This study proposes a novel frequency model that considers the opponent's behavior using two main elements: the offer ratio and a weighting function. The offer ratio stabilizes the model against changes in the opponent's offering speed, whereas the weighting function takes the opponent's concession into account. Two experiments show that the proposed model is more accurate than other frequency models. Additionally, we find that an agent with the proposed model achieves a significantly higher utility in negotiations.
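The following toy sketch shows the general shape of a frequency model extended with a time-weighting term and a normalization by the opponent's offer ratio; the paper's exact definitions of the offer ratio and the weighting function differ, so every name and formula here is an assumption for illustration.

```python
from collections import defaultdict

def estimate_value_weights(opponent_offers, n_rounds, decay=0.95):
    """Toy frequency model: issue values appearing often in the opponent's
    offers are assumed preferred. Later offers get smaller weights (the
    opponent is likely conceding), and the total is normalized by the
    offer ratio (offers per round) for robustness to offering speed."""
    counts = defaultdict(float)
    for k, offer in enumerate(opponent_offers):      # offer: {issue: value}
        for issue, value in offer.items():
            counts[(issue, value)] += decay ** k     # weighting function
    offer_ratio = len(opponent_offers) / max(n_rounds, 1)
    return {key: c / offer_ratio for key, c in counts.items()}
```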
Takuma WAKASA Yoshiki NAGATANI Kenji SAWADA Seiichi SHIN
This paper considers a velocity control problem for merging and splitting maneuvers of vehicle platoons. An external device sends velocity commands to some vehicles in the platoon, and the others adjust their velocities autonomously; the former is pinning control, and the latter is consensus control in multi-agent control. We propose a switched pinning control algorithm consisting of three sub-methods. The first is an optimal switching method for pinning agents based on a mixed logical dynamical (MLD) system model and model predictive control (MPC). The second is a representation method for dynamic platoon formation with merging and splitting maneuvers; the platoon formation follows the positional relations between vehicles or formation demands from the external device. The third is a switching-reduction method that adds a cost term penalizing switching of the pinning agents in the steady state. The proposed algorithm improves the consensus speed. Moreover, it can regroup platoons into arbitrary platoons and control the velocities of multiple vehicle platoons to their target values.
Naoki HAYASHI Kazuyuki ISHIKAWA Shigemasa TAKAI
In this paper, we propose a distributed subgradient-based method for constrained convex optimization over quantized and event-triggered communication networks. In the proposed method, each agent sends its quantized state to neighbor agents only at its trigger times through a dynamic encoding and decoding scheme. After the quantized and event-triggered information exchange, each agent locally updates its state by a consensus-based subgradient algorithm. We show a sufficient condition for convergence under summability conditions on a diminishing step-size.
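A dynamic encoding/decoding scheme of this kind typically quantizes the innovation against an estimate kept synchronized between sender and receiver, with a scale that shrinks over time; the sketch below is a generic illustration with assumed names, not the paper's exact scheme.

```python
import numpy as np

def encode(state, estimate, scale, levels):
    """Quantize the innovation (state minus the shared estimate) onto a
    finite alphabet {-levels, ..., levels} at the current scale."""
    return np.clip(np.round((state - estimate) / scale), -levels, levels)

def decode(estimate, symbol, scale):
    """Receiver side: rebuild the synchronized estimate from the symbol."""
    return estimate + scale * symbol
```

Because both sides apply `decode` to the same symbols, their estimates stay identical, and only a few bits per trigger need to cross the network.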
Shun ANDOH Koichi KOBAYASHI Yuh YAMASHITA
Pinning control of multi-agent systems is a method in which an external control input is added to some agents (pinning nodes), e.g., leaders. The external control input achieves consensus to a given target value and faster consensus. In this paper, we propose a new self-triggered predictive pinning control method for the consensus problem. Self-triggered control is a method in which both the control input and the next update time are computed, and it is expected to reduce the communication cost. First, a new finite-time optimal control problem used in self-triggered control is formulated, and its solution method is derived. Next, two online algorithms are proposed: a multi-hop communication-based method and an observer-based method. Finally, numerical examples are presented.
This paper studies the problem of real-time routing in a multi-autonomous-robot enhanced network at an uncertain and vulnerable tactical edge. Recent network protocols, such as opportunistic mobile network routing protocols, incorporate social networks into communication networks and can increase interoperability by using social mobility and opportunistic carry-and-forward routing algorithms. However, in a practical harsh environment such as a battlefield, the uncertainty of social mobility and the complexity of the vulnerable environment, due to unpredictable physical and cyber attacks from the enemy, seriously affect the effectiveness and practicality of these emerging network protocols. This paper presents GT-SaRE-MANET (Game-Theoretic Situation-aware Robot Enhanced Mobile Ad hoc Network), a routing protocol that adopts online reinforcement learning to supervise the mobility of multiple robots and to handle uncertainty and potential physical and cyber attacks at the tactical edge. First, a set of game-theoretic, mission-oriented metrics is introduced to describe the interrelation among network quality, multi-robot mobility, and potential attacking activities. Then, a distributed multi-agent game-theoretic reinforcement learning algorithm is developed. It not only optimizes the GT-SaRE-MANET routing protocol and the mobility of the robots online, but also effectively avoids physical and/or cyber attacks from the enemy by using the game-theoretic mission-oriented metrics. The effectiveness of the proposed design is demonstrated through computer-aided simulations and hardware experiments.
It is known that the policy gradient algorithm cannot guarantee convergence to a Nash equilibrium in mixed policies when applied to matrix games. To overcome this problem, we propose a novel multi-agent reinforcement learning (MARL) algorithm called the policy gradient lagging anchor (PGLA) algorithm. We prove that the agents' policies converge to a Nash equilibrium in mixed policies under the PGLA algorithm in two-player two-action matrix games. Simulations confirm the convergence and also show that the PGLA algorithm converges better than the LR-I lagging anchor algorithm.
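The lagging-anchor idea couples a fast policy update with a slowly moving anchor that pulls the policy back, damping the cycling that plain policy gradient exhibits in matrix games. The sketch below shows this general structure for a two-action policy parameterized by the probability of the first action; the coefficients and their placement are assumptions, not the paper's exact PGLA update.

```python
import numpy as np

def pgla_step(policy, anchor, payoff_gradient, eta=0.01, anchor_pull=0.5):
    """One lagging-anchor step (sketch): the policy ascends its payoff
    gradient plus a pull toward the anchor, while the anchor drifts
    slowly toward the current policy."""
    new_policy = policy + eta * (payoff_gradient
                                 + anchor_pull * (anchor - policy))
    new_anchor = anchor + eta * anchor_pull * (policy - anchor)
    return float(np.clip(new_policy, 0.0, 1.0)), new_anchor
```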
Yuichi KAJIYAMA Naoki HAYASHI Shigemasa TAKAI
This paper proposes a consensus-based subgradient method under a common constraint set with switching undirected graphs. In the proposed method, each agent has a state and an auxiliary variable as estimates of an optimal solution and of the accumulated past gradient information of neighbor agents. We show that the states of all agents asymptotically converge to one of the optimal solutions of the convex optimization problem. The simulation results show that the proposed consensus-based algorithm with accumulated subgradient information converges faster than the standard subgradient algorithm.
Naoki HAYASHI Masaaki NAGAHARA
This paper proposes a novel distributed proximal minimization algorithm for constrained optimization problems over fixed strongly connected networks. At each iteration, each agent updates its own state by evaluating the proximal operator of its objective function under a constraint set and by compensating for the imbalance due to unidirectional communications. We show that the states of all agents asymptotically converge to one of the optimal solutions. Numerical results confirm the validity of the proposed method.
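For reference, the constrained proximal operator each agent evaluates typically takes the following form (notation assumed: $f_i$ is agent $i$'s objective, $X$ the constraint set, $\alpha > 0$ a step size):

```latex
\mathrm{prox}_{\alpha f_i}^{X}(v) = \operatorname*{arg\,min}_{x \in X} \Big( f_i(x) + \tfrac{1}{2\alpha}\,\lVert x - v\rVert^2 \Big).
```

Evaluating this operator at a weighted average of neighbors' states simultaneously pulls the iterate toward the local minimizer and keeps it feasible, which is what allows the consensus and optimization effects to be combined in a single step.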
In this paper, based on the policy of model predictive control, a new predictive pinning control method is proposed for the consensus problem of multi-agent systems. Pinning control is a method in which an external control input is added to some agents (pinning nodes), e.g., leaders. The external control input achieves consensus to a given target value (not the average of the initial states) and faster consensus. In the proposed method, the external control input is calculated by a controller node connected only to the pinning nodes. Since the states of all agents are required to calculate the external control input, communication delays must be considered. The proposed algorithm includes not only the calculation of the external control input but also delay compensation. The effectiveness of the proposed method is demonstrated by a numerical example.
The output feedback consensus problem of nonlinear multi-agent systems under a directed network with a time-varying communication delay is studied. To deal with this problem, a dynamic output feedback controller with an additional low-gain parameter that compensates for the effects of nonlinearity and communication delay is proposed. It is also shown that, under some assumptions, the proposed controller can always solve the output feedback consensus problem even in the presence of an arbitrarily large communication delay.