In previous work, we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). We approximated the gradient of the average reward, and proposed a simulation-based algorithm (called GSMDP) that estimates this approximate gradient using only a single sample path of the underlying Markov chain. GSMDP was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error, the size of the difference between the output of the algorithm and its asymptotic output, arises because the algorithm sees only a finite data sequence.
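The abstract above describes estimating an approximate gradient of the average reward from a single sample path. The paper's actual GSMDP algorithm is not reproduced here; the following is only a minimal illustrative sketch of that general idea, using a made-up two-state SMDP, a softmax policy, and a discounted eligibility trace (the discount introduces an approximation error, and the finite horizon introduces an estimation error, mirroring the two error sources the paper bounds). All names, dynamics, and rewards below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(theta, s):
    """Softmax action probabilities in state s (illustrative parameterization)."""
    logits = theta[s]                          # theta has shape (n_states, n_actions)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_policy(theta, s, a):
    """Score function: gradient of log pi(a|s) w.r.t. theta for a softmax policy."""
    g = np.zeros_like(theta)
    p = policy(theta, s)
    g[s] = -p
    g[s, a] += 1.0
    return g

def estimate_gradient(theta, T=20000, beta=0.99):
    """Estimate an (approximate) average-reward gradient from one sample path.

    beta < 1 is a discounted eligibility trace (approximation error);
    the finite horizon T is the source of estimation error.
    """
    n_states, n_actions = theta.shape
    s = 0
    z = np.zeros_like(theta)                   # eligibility trace
    grad = np.zeros_like(theta)
    for t in range(T):
        a = rng.choice(n_actions, p=policy(theta, s))
        s_next = (s + a + 1) % n_states        # illustrative transition rule
        tau = rng.exponential(1.0 + 0.5 * a)   # illustrative sojourn time
        r = float(s == 1) * tau                # illustrative reward in state 1
        z = beta * z + grad_log_policy(theta, s, a)
        grad += (r * z - grad) / (t + 1)       # running average of r_t * z_t
        s = s_next
    return grad

theta = np.zeros((2, 2))                       # uniform initial policy
g = estimate_gradient(theta)
print(g.shape)
```

The single-sample-path structure (one simulated trajectory, a running average, and an eligibility trace) is the point of the sketch; the specific transition, sojourn-time, and reward models are placeholders.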

- Publication
- IEICE TRANSACTIONS on Information Vol.E93-D No.2 pp.271-279

- Publication Date
- 2010/02/01

- Online ISSN
- 1745-1361

- DOI
- 10.1587/transinf.E93.D.271

- Type of Manuscript
- Special Section PAPER (Special Section on Foundations of Computer Science)

The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.

Ngo Anh VIEN, SeungGwan LEE, TaeChoong CHUNG, "Policy Gradient Based Semi-Markov Decision Problems: Approximation and Estimation Errors" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 2, pp. 271-279, February 2010, doi: 10.1587/transinf.E93.D.271.

Abstract: In previous work, we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). We approximated the gradient of the average reward, and proposed a simulation-based algorithm (called GSMDP) that estimates this approximate gradient using only a single sample path of the underlying Markov chain. GSMDP was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error, the size of the difference between the output of the algorithm and its asymptotic output, arises because the algorithm sees only a finite data sequence.

URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.271/_p

@ARTICLE{e93-d_2_271,
  author={Ngo Anh VIEN and SeungGwan LEE and TaeChoong CHUNG},
  journal={IEICE TRANSACTIONS on Information},
  title={Policy Gradient Based Semi-Markov Decision Problems: Approximation and Estimation Errors},
  year={2010},
  volume={E93-D},
  number={2},
  pages={271-279},
  abstract={In previous work, we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). We approximated the gradient of the average reward, and proposed a simulation-based algorithm (called GSMDP) that estimates this approximate gradient using only a single sample path of the underlying Markov chain. GSMDP was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error, the size of the difference between the output of the algorithm and its asymptotic output, arises because the algorithm sees only a finite data sequence.},
  keywords={},
  doi={10.1587/transinf.E93.D.271},
  ISSN={1745-1361},
  month={February},
}

TY - JOUR

TI - Policy Gradient Based Semi-Markov Decision Problems: Approximation and Estimation Errors

T2 - IEICE TRANSACTIONS on Information

SP - 271

EP - 279

AU - Ngo Anh VIEN

AU - SeungGwan LEE

AU - TaeChoong CHUNG

PY - 2010

DO - 10.1587/transinf.E93.D.271

JO - IEICE TRANSACTIONS on Information

SN - 1745-1361

VL - E93-D

IS - 2

JA - IEICE TRANSACTIONS on Information

Y1 - 2010/02/01

AB - In previous work, we presented a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). We approximated the gradient of the average reward, and proposed a simulation-based algorithm (called GSMDP) that estimates this approximate gradient using only a single sample path of the underlying Markov chain. GSMDP was proved to converge with probability 1. In this paper, we give bounds on the approximation and estimation errors of the GSMDP algorithm. The approximation error is the size of the difference between the true gradient and the approximate gradient. The estimation error, the size of the difference between the output of the algorithm and its asymptotic output, arises because the algorithm sees only a finite data sequence.

ER -