
Keyword Search Results

[Keyword] MDP (18 hits)

Results 1-18 of 18
  • A POMDP-Based Approach to Assortment Optimization Problem for Vending Machine Open Access

    Gaku NEMOTO  Kunihiko HIRAISHI  

     
    PAPER-Mathematical Systems Science

    Publicized: 2023/09/05 | Vol: E107-A No:6 | Page(s): 909-918

    Assortment optimization is one of the main problems faced by retailers and has been widely studied. In this paper, we focus on vending machines, which present many characteristic issues to be considered. We first formulate an assortment optimization problem for vending machines, next propose a model that represents consumers' decision making, and then present a solution method based on a partially observable Markov decision process (POMDP). The problem involves incomplete state observation, stochastic consumer behavior, and policy decisions that maximize future expected rewards. Using computer simulation, we observe that sales increase compared to those of heuristic methods under the same conditions. Moreover, sales approach the theoretical upper bound.
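The Bayesian belief update at the heart of any POMDP solution method can be sketched generically; the transition and observation matrices below are toy placeholders, not the paper's vending-machine model:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Standard POMDP belief update: b'(s') ∝ Z[a][s'][o] * Σ_s T[a][s][s'] * b(s).

    b: current belief over states, shape (S,)
    T: transition probabilities T[a][s][s'], shape (A, S, S)
    Z: observation probabilities Z[a][s'][o], shape (A, S, O)
    """
    predicted = b @ T[a]                  # predict: Σ_s b(s) * T[a][s][s']
    unnormalized = Z[a][:, o] * predicted # correct by observation likelihood
    return unnormalized / unnormalized.sum()

# Toy 2-state, 1-action, 2-observation example (hypothetical numbers)
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b_new = belief_update(b, a=0, o=0, T=T, Z=Z)
```

After observing o=0, which is more likely in state 0, the belief shifts toward state 0 while remaining a proper probability distribution.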

  • Deep Coalitional Q-Learning for Dynamic Coalition Formation in Edge Computing

    Shiyao DING  Donghui LIN  

     
    PAPER

    Publicized: 2021/12/14 | Vol: E105-D No:5 | Page(s): 864-872

    With the rapid growth of computation requirements in the Internet of Things, resource-limited edge servers usually need to cooperate to perform tasks. Most related studies assume a static cooperation approach, which may not suit the dynamic environment of edge computing. In this paper, we consider a dynamic cooperation approach that guides edge servers to form coalitions dynamically. This raises two issues: 1) how to guide them to form coalitions optimally, and 2) how to cope with the dynamic setting, where server statuses change as tasks are performed. The coalitional Markov decision process (CMDP) model proposed in our previous work can handle these issues well. However, its basic solution, coalitional Q-learning, cannot handle the large-scale problems that arise in edge computing when the number of tasks is large. In response, we propose a novel algorithm called deep coalitional Q-learning (DCQL). To sum up, we first formulate the dynamic cooperation problem of edge servers as a CMDP: each edge server is regarded as an agent, and the dynamic process is modeled as an MDP in which the agents observe the current state to form several coalitions. Each coalition takes an action that affects the environment, which correspondingly transitions to the next state, and the process repeats. We then propose DCQL, which incorporates a deep neural network and can therefore cope well with large-scale problems. DCQL guides the edge servers to form coalitions dynamically so as to optimize a given goal. Furthermore, we run experiments to verify our proposed algorithm's effectiveness in different settings.
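The tabular Q-learning step that DCQL generalizes with a deep network can be sketched as follows; the state and coalition-action names are hypothetical illustrations, not the paper's formulation:

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    In DCQL this table would be replaced by a deep network approximator."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Hypothetical coalition-formation actions for an edge server agent
Q = defaultdict(float)
actions = ["join_coalition_A", "join_coalition_B", "stay_alone"]
q_update(Q, "s0", "join_coalition_A", reward=1.0, next_state="s1", actions=actions)
```

Starting from an all-zero table, a single update with reward 1.0 moves the chosen state-action value to alpha * reward = 0.1.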

  • Improved Metric Function for AlphaSeq Algorithm to Design Ideal Complementary Codes for Multi-Carrier CDMA Systems

    Shucong TIAN  Meng YANG  Jianpeng WANG  Rui WANG  Avik R. ADHIKARY  

     
    LETTER-Communication Theory and Signals

    Publicized: 2021/11/15 | Vol: E105-A No:5 | Page(s): 901-905

    AlphaSeq is a new paradigm for designing sequences with desired properties based on deep reinforcement learning (DRL). In this work, we propose a new metric function and a new reward function to design an improved version of AlphaSeq. We show analytically, and also through numerical simulations, that the proposed algorithm can discover sequence sets with preferable properties faster than the previous algorithm.

  • Improving the Efficiency of a Reaction Attack on the QC-MDPC McEliece

    Thales BANDIERA PAIVA  Routo TERADA  

     
    PAPER-Cryptography and Information Security

    Vol: E101-A No:10 | Page(s): 1676-1686

    The QC-MDPC McEliece scheme was considered one of the most promising public key encryption schemes for efficient post-quantum secure encryption. As a variant of the McEliece scheme, it is based on the syndrome decoding problem, which is a hard problem from Coding Theory. Its key sizes are competitive with those of the widely used RSA cryptosystem, and it came with an apparently strong security reduction. For three years, the scheme did not suffer major threats, until the end of 2016, at Asiacrypt, when Guo, Johansson, and Stankovski presented a reaction attack on the QC-MDPC that exploits one aspect not considered in the security reduction: the probability of a decoding failure is lower when the secret key and the error used for encryption share certain properties. By recording the decoding failures, the attacker obtains information about the secret key and then uses the gathered information to reconstruct the key. Guo et al. presented an algorithm for key reconstruction in which we can point out two weaknesses. The first is that it cannot deal with partial information about the secret key, forcing the attacker to send a large number of decoding challenges. The second is that it does not scale well to higher security levels. To improve the attack, we propose a key reconstruction algorithm that runs faster than Guo et al.'s algorithm, even while using around 20% fewer interactions with the secret key holder than their algorithm, considering parameters suggested for 80 bits of security. It also has a lower asymptotic complexity, which makes it scale much better to higher security parameters. The algorithm can be parallelized straightforwardly, which is not the case for the one by Guo et al.

  • Energy-Aware Download Method in LTE Based Smartphone

    Jie REN  Ling GAO  Hai WANG  QuanLi GAO  ZheWen ZHANG  

     
    PAPER-Information Network

    Publicized: 2016/11/18 | Vol: E100-D No:2 | Page(s): 304-312

    Mobile traffic is experiencing tremendous growth, and this growing wave is undoubtedly increasing the use of the radio component of mobile devices, resulting in shorter battery lifetimes. In this paper, we present an Energy-Aware Download Method (EDM) based on the Markov Decision Process (MDP) to optimize the data download energy for mobile applications. Unlike previous download schemes in the literature, which pursue energy efficiency by simply delaying download requests and often lead to a poor user experience, our MDP model learns offline from a set of training download workloads for different user patterns. The model is then integrated into the mobile application to handle download requests at runtime, taking the current battery level, LTE reference signal received power (RSRP), reference signal signal-to-noise ratio (RSSNR), and task size as inputs of the decision process, and maximizing the reward, which reflects the expected battery life and user experience. We evaluate how the EDM can be used in the context of a real file-downloading application over the LTE network. On average, we obtain 20.3%, 15%, and 45% improvements in energy consumption, latency, and energy-delay trade-off performance, respectively, compared to the Android default download policy (Minimum Delay).
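An MDP policy of this kind is typically computed by dynamic programming. The generic value-iteration sketch below uses two toy states standing loosely for signal conditions and two actions (defer vs. download now); the transition and reward numbers are invented placeholders, not the paper's learned model:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Generic MDP value iteration: V(s) = max_a [ R[a][s] + gamma * Σ_s' T[a][s][s'] V(s') ].

    T: transitions, shape (A, S, S); R: immediate rewards, shape (A, S).
    Returns the value function and the greedy policy."""
    A, S, _ = T.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (T @ V)          # action-values, shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Toy model: state 0 = poor signal, state 1 = good signal.
T = np.array([
    [[0.6, 0.4], [0.3, 0.7]],   # action 0 (defer): signal may drift
    [[1.0, 0.0], [0.0, 1.0]],   # action 1 (download now): state unchanged
])
R = np.array([
    [0.0, 0.0],                 # defer earns nothing
    [0.2, 1.0],                 # downloading in good signal is cheaper, so higher reward
])
V, policy = value_iteration(T, R)
```

With these numbers the greedy policy defers in the poor-signal state and downloads in the good-signal state, which mirrors the energy-aware intuition of delaying transfers until conditions improve.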

  • Energy-Efficient and Throughput Maximization Scheme for Sensor-Aided Cognitive Radio Networks

    Hiep VU-VAN  Insoo KOO  

     
    PAPER

    Vol: E98-B No:10 | Page(s): 1996-2003

    A cognitive radio user (CU) can get assistance from sensor nodes (SNs) to perform spectrum sensing. However, the SNs are often powered by a finite-capacity battery that can sustain their operation for only a short time. Therefore, energy efficiency of the SNs becomes a crucial problem. In this paper, an SN is considered to be a device with an energy harvester that can harvest energy from a non-radio-frequency (non-RF) energy source while performing other actions concurrently. In any one time slot, in order to maintain the required sensing accuracy of the CR network and to conserve energy in the SNs, only a small number of SNs are required to sense the primary user (PU) signal, and the other SNs are kept silent to save energy. For this purpose, we propose an algorithm that divides all SNs into groups, each of which can satisfy the required sensing accuracy of the network. In a time slot, each SN group can be assigned one of two actions: stay silent, or be active and perform sensing. The problem of determining the optimal action for all SN groups so as to maximize the throughput of the CR network is formulated in the framework of a partially observable Markov decision process (POMDP), in which the effect of the current time slot's action on the throughput of future time slots is considered. The solution to the problem, that is, the decision mode of the SN groups (i.e., active or silent), depends on the residual energy and the belief about the absence probability of the PU signal. Simulation results show that the proposed scheme can improve the energy efficiency of CR networks compared with other conventional schemes.

  • Dynamic Linear Bellman Combination of Optimal Policies for Solving New Tasks

    Takamitsu MATSUBARA  Takaya ASAKURA  Kenji SUGIMOTO  

     
    LETTER-Systems and Control

    Vol: E98-A No:10 | Page(s): 2187-2190

    In this letter, we propose Dynamic Linear Bellman Combination, which allows us to compose the optimal policies of Kullback-Leibler control for similar tasks with different terminal and non-terminal costs. Simulation results demonstrate the effectiveness of our method.

  • Optimal Channel-Sensing Scheme for Cognitive Radio Systems Based on Fuzzy Q-Learning

    Fereidoun H. PANAHI  Tomoaki OHTSUKI  

     
    PAPER

    Vol: E97-B No:2 | Page(s): 283-294

    In a cognitive radio (CR) network, the channel-sensing scheme used to detect the existence of a primary user (PU) directly affects the performance of both the CR and the PU. However, in practical systems, the CR is prone to sensing errors due to the inefficiency of the sensing scheme. This may cause primary-user interference and low system performance. In this paper, we present a learning-based scheme for channel sensing in CR networks. Specifically, we formulate the channel-sensing problem as a partially observable Markov decision process (POMDP), where the most likely channel state is derived by a learning process called Fuzzy Q-Learning (FQL). The optimal policy is derived by solving the problem. Simulation results show the effectiveness and efficiency of our proposed scheme.
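A common form of the Fuzzy Q-Learning update distributes the temporal-difference error over the fuzzy rules in proportion to their degree of activation; the sketch below shows that generic form and is not necessarily the paper's exact variant:

```python
import numpy as np

def fql_update(q, memberships, reward, next_value, alpha=0.1, gamma=0.9):
    """Generic Fuzzy Q-Learning step: the state's value is a membership-weighted
    sum of the rules' q-values, and the TD error is credited back to each rule
    in proportion to its activation."""
    value = np.dot(memberships, q)                  # fuzzy-inferred value of the state
    td_error = reward + gamma * next_value - value  # standard TD error
    return q + alpha * td_error * memberships       # membership-weighted credit

q = np.zeros(2)             # q-values of two fuzzy rules
mu = np.array([0.5, 0.5])   # degrees of activation for the current input
q_new = fql_update(q, mu, reward=1.0, next_value=0.0)
```

Starting from zero, a reward of 1.0 with equal activations nudges both rule values up by alpha * td_error * mu = 0.05 each.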

  • A POMDP Based Distributed Adaptive Opportunistic Spectrum Access Strategy for Cognitive Ad Hoc Networks

    Yichen WANG  Pinyi REN  Zhou SU  

     
    LETTER

    Vol: E94-B No:6 | Page(s): 1621-1624

    In this letter, we propose a Partially Observable Markov Decision Process (POMDP) based Distributed Adaptive Opportunistic Spectrum Access (DA-OSA) strategy for Cognitive Ad Hoc Networks (CAHNs). In each slot, the source and destination choose a set of channels to sense and then decide the transmission channels based on the sensing results. In order to maximize the throughput of each link, we use the theories of sequential decision and optimal stopping to determine the optimal sensing channel set. Moreover, we establish the myopic policy and exploit the monotonicity of the reward function, which can be used to reduce the complexity of the sequential decision.

  • Photomask Data Prioritization Based on VLSI Design Intent and Its Utilization for Mask Manufacturing

    Kokoro KATO  Masakazu ENDO  Tadao INOUE  Shigetoshi NAKATAKE  Masaki YAMABE  Sunao ISHIHARA  

     
    PAPER-Device and Circuit Modeling and Analysis

    Vol: E93-A No:12 | Page(s): 2424-2432

    The increase in the time required for data processing, mask drawing, and inspection of photomasks has led to a substantial increase in mask manufacturing cost, which has become one of the major challenges in the semiconductor industry. We have developed a data-flow process for mask manufacturing in which we refer to design-intent information in order to reduce the turnaround time (TAT) of mask manufacturing processes. We convert design-level information, "Design Intent (DI)," into priority information for mask manufacturing data, known as "Mask Data Rank (MDR)," so that we can identify and sort the importance of mask patterns from the viewpoint of the design side. As a result, we can reduce both mask writing time and mask inspection time. Our objective is to build an efficient data-flow conversion system from DI to MDR. In this paper, we introduce the idea of MDR and the software system that we built for DI extraction. We then show experimental results with actual chip data. Lastly, we discuss related issues and their solutions.

  • Policy Gradient Based Semi-Markov Decision Problems: Approximation and Estimation Errors

    Ngo Anh VIEN  SeungGwan LEE  TaeChoong CHUNG  

     
    PAPER

    Vol: E93-D No:2 | Page(s): 271-279

    In our previous work, we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov Decision Process (SMDP). We approximated the gradient of the average reward; a simulation-based algorithm, called GSMDP, was then proposed to estimate this approximate gradient using only a single sample path of the underlying Markov chain, and was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error, the size of the difference between the output of the algorithm and its asymptotic output, arises because the algorithm sees only a finite data sequence.

  • Constructing a Multilayered Boundary to Defend against Intrusive Anomalies

    Zonghua ZHANG  Hong SHEN  

     
    PAPER-Application Information Security

    Vol: E90-D No:2 | Page(s): 490-499

    We propose a model for constructing a multilayered boundary in an information system to defend against intrusive anomalies by correlating a number of parametric anomaly detectors. The model formulation is based on two observations. First, anomaly detectors differ in their detection coverage or blind spots. Second, operating environments of the anomaly detectors reveal different information about system anomalies. The correlation among observation-specific anomaly detectors is first formulated as a Partially Observable Markov Decision Process, and then a policy-gradient reinforcement learning algorithm is developed for an optimal cooperation search, with the practical objectives being broader overall detection coverage and fewer false alerts. A host-based experimental scenario is developed to illustrate the principle of the model and to demonstrate its performance.

  • CHQ: A Multi-Agent Reinforcement Learning Scheme for Partially Observable Markov Decision Processes

    Hiroshi OSADA  Satoshi FUJITA  

     
    PAPER-Artificial Intelligence and Cognitive Science

    Vol: E88-D No:5 | Page(s): 1004-1011

    In this paper, we propose a new reinforcement learning scheme called CHQ that can efficiently acquire appropriate policies under partially observable Markov decision processes (POMDPs) involving probabilistic state transitions, which frequently occur in multi-agent systems where each agent independently takes a probabilistic action based on a partial observation of the underlying environment. A key idea of CHQ is to extend the HQ-learning proposed by Wiering et al. in such a way that it can learn the activation order of the MDP subtasks as well as an appropriate policy under each MDP subtask. The effectiveness of the proposed scheme is experimentally evaluated. The experimental results imply that it can acquire a deterministic policy with a sufficiently high success rate, even if the given task is a POMDP with probabilistic state transitions.

  • Sliding Multiple Phase Differential Detection of Trellis-Coded MDPSK-OFDM

    Chong Il KIM  Zhengyuan XU  Han Jong KIM  

     
    PAPER

    Vol: E86-B No:5 | Page(s): 1591-1600

    In this paper, a Viterbi decoder containing new branch metrics based on the squared Euclidean distance with multiple-order phase differences is introduced in order to improve the bit error rate (BER) in differential detection of trellis-coded MDPSK-OFDM. The proposed Viterbi decoder is conceptually the same as the sliding multiple phase differential detection method, which uses branch metrics with multiple phase differences. We also describe the Viterbi algorithm that uses these branch metrics. Our study shows that such a Viterbi decoder improves BER performance without sacrificing bandwidth or power efficiency. The proposed algorithm can also be used with single-carrier modulation.

  • Labeling Q-Learning in POMDP Environments

    Haeyeon LEE  Hiroyuki KAMAYA  Kenichi ABE  

     
    PAPER-Biocybernetics, Neurocomputing

    Vol: E85-D No:9 | Page(s): 1425-1432

    This paper presents a new Reinforcement Learning (RL) method, called "Labeling Q-learning (LQ-learning)," to solve partially observable Markov decision process (POMDP) problems. Recently, hierarchical RL methods have been widely studied. However, they have the drawback that learning time and memory are consumed merely to maintain the hierarchical structure, even when it is not necessary. Our LQ-learning, in contrast, has no hierarchical structure, but adopts a new type of internal memory mechanism. Namely, in LQ-learning, the agent perceives the current state as a pair of an observation and its label, and can thereby more exactly distinguish states that look the same but are actually different. That is, at each step t, we define a new type of perception of the environment õt=(ot,θt), where ot is the conventional observation and θt is the label attached to ot. The classical RL algorithm is then used as if the pair (ot,θt) were a Markov state. The labeling is carried out by a Boolean variable, called "CHANGE," and a hash-like or mod function, called the Labeling Function (LF). To demonstrate the efficiency of LQ-learning, we apply it to maze problems in Grid-Worlds, which are used in many studies as simulated POMDP environments. Using LQ-learning, we can solve the maze problems without initial knowledge of the environments.
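One plausible reading of the labeling mechanism can be sketched as follows; the CHANGE condition and mod-based labeling function used here are assumptions for illustration, and the paper's exact definitions may differ:

```python
def labeled_perception(obs, prev_obs, prev_label, num_labels=4):
    """Sketch of an LQ-learning-style perception: a CHANGE flag fires when the
    observation differs from the previous one, and the label is then advanced
    by a mod function. The Q-table is keyed by the pair (observation, label),
    so two visits to identical-looking observations can carry different labels
    and thus be distinguished."""
    change = obs != prev_obs                                  # the "CHANGE" Boolean
    label = (prev_label + 1) % num_labels if change else prev_label
    return obs, label                                         # the pair (o_t, theta_t)

# Hypothetical maze trace: the observation changes, so the label advances.
state = labeled_perception("corridor", prev_obs="junction", prev_label=0)
```

A standard tabular Q-learning agent can then index its table by this (observation, label) pair instead of the raw observation alone.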

  • New Results on Selective Diversity Using Order Statistics for Reception of M-Ary DPSK and PSK Signals on Nakagami Fading Channels

    Changhwan KIM  Seyoung CHOI  Youngyearl HAN  

     
    LETTER-Image

    Vol: E84-A No:4 | Page(s): 1085-1089

    The performance of M-ary DPSK (MDPSK) and PSK (MPSK) systems using L-branch selection combining (SC) diversity reception in frequency-nonselective slow Nakagami fading channels is derived theoretically. For integer values of the Nakagami fading parameter m, a general formula for evaluating the symbol error rate (SER) of MDPSK signals in the independent-branch diversity system is obtained through numerical analysis of integral-form expressions. An exact closed-form SER expression for MPSK signals under the effect of SC diversity, evaluated via numerical integration, is also presented.

  • Performance Analysis of an ATM Multiplexer with a Resume Level Loaded with Homogeneous Bursty Sources

    Kwang-Chul LEE  Byung-Cheol SHIN  

     
    PAPER-Communication Systems and Transmission Equipment

    Vol: E81-B No:11 | Page(s): 2147-2156

    This paper investigates an ATM multiplexer with a resume level, which uses a selective cell-discarding strategy as priority control, loaded with burst input traffic modeled as a Markov-modulated deterministic process (MMDP). Assuming that the system is loaded with a superposition of several independent and homogeneous On-Off bursty sources with two priority classes, we obtain the cell loss probability of each priority class for an ATM multiplexer with a resume level. The performance analysis derived here includes, as special cases, a system without priority and one with a threshold level. From the numerical results, we compare the cell loss probability, the mean queue length, the mean queuing delay, the level-crossing rate, and the queue-length distribution at the embedded points for the case of a threshold level with those for the case of a resume level. By selecting an appropriate resume level, we can reduce the sensitive state change around the threshold level.

  • Performance of an ATM Multiplexer with Selective Cell Discarding for On-Off Bursty Traffics

    Sang Won MIN  Hae CHUNG  Chong Kwan UN  

     
    PAPER-Communication Networks and Service

    Vol: E78-B No:9 | Page(s): 1253-1261

    We study an asynchronous transfer mode (ATM) multiplexer with selective cell discarding (SCD), and propose as a new traffic parameter the ratio of high- and low-priority cell streams when a multiplexed cell stream is classified by the cell loss priority (CLP) field in the cell header. By applying loss-priority control, it is possible to increase the multiplexing gain. For the performance analysis, we assume that an on-off bursty traffic stream is described by several traffic parameters and the proposed priority ratio, and is approximately modeled by a Markov-modulated deterministic process (MMDP). Assuming that several independent and homogeneous on-off bursty traffic streams with priority discrimination are multiplexed, we present an analytical procedure for the cell loss probability of each priority level in statistical cell multiplexing with loss-priority control, and use the performance results for connection admission control (CAC). We also consider the effect of the proposed priority ratio. Although loss-priority control increases the statistical multiplexing gain, it is not appropriate for on-off bursty traffic to change the value of the high-priority ratio in order to obtain a larger multiplexing gain, since the admissible load is determined by the loss probability of the low-priority traffic in most cases, and values of the ratio in a certain range affect it only slightly.