Two-time-scale online actor-critic paradigm driven by POMDP
Document Type
Conference Proceeding
Date of Original Version
6-9-2010
Abstract
In this paper, we analyze a class of actor-critic algorithms in a partially observable Markov decision process (POMDP) environment. Specifically, we focus on a two-time-scale framework in which the critic uses temporal difference learning with a neural network (NN) as a nonlinear function approximator, and the actor is updated greedily via a stochastic gradient approach. Instead of the common construction of a hidden-state estimator, we develop the idea originating from Singh, Jaakkola and Jordan (1994) into an online, action-dependent actor-critic paradigm. This framework explores the ability of the adaptive dynamic programming (ADP) approach in a POMDP environment without requiring extra architectures such as state estimators. Both the theoretical analysis and simulation studies validate that the framework performs effectively under the assumptions given in this paper. ©2010 IEEE.
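The two-time-scale structure described in the abstract can be illustrated with a minimal sketch: a critic updated by TD(0) with a fast step size, and a memoryless, action-dependent softmax actor updated by a stochastic policy gradient with a slower step size. This is not the paper's implementation; the toy aliased-observation environment, the tabular critic (standing in for the paper's NN approximator), and all names and step sizes below are illustrative assumptions.

```python
import math, random

random.seed(0)

GAMMA = 0.95
ALPHA_CRITIC = 0.10   # fast time scale (critic)
ALPHA_ACTOR = 0.01    # slow time scale (actor)
OBS_NOISE = 0.2       # probability the observation flips (partial observability)

# Critic: tabular Q over (observation, action) -- a stand-in for the NN critic.
Q = [[0.0, 0.0], [0.0, 0.0]]
# Actor: softmax policy conditioned only on the observation (memoryless).
theta = [[0.0, 0.0], [0.0, 0.0]]

def policy(obs):
    """Softmax action probabilities given the current observation."""
    z = [math.exp(t) for t in theta[obs]]
    s = sum(z)
    return [p / s for p in z]

def step():
    """One interaction step: sample, then critic (fast) and actor (slow) updates."""
    state = random.randrange(2)                       # hidden state
    obs = state if random.random() > OBS_NOISE else 1 - state
    probs = policy(obs)
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if action == state else 0.0          # reward depends on hidden state

    # Next observation (hidden states are drawn i.i.d. in this toy environment).
    next_state = random.randrange(2)
    next_obs = next_state if random.random() > OBS_NOISE else 1 - next_state
    next_probs = policy(next_obs)
    v_next = sum(p * q for p, q in zip(next_probs, Q[next_obs]))

    # Critic: TD(0) update on the fast time scale.
    td_error = reward + GAMMA * v_next - Q[obs][action]
    Q[obs][action] += ALPHA_CRITIC * td_error

    # Actor: stochastic policy-gradient step on the slow time scale,
    # using the critic's advantage estimate Q(o, a) - V(o).
    v = sum(p * q for p, q in zip(probs, Q[obs]))
    adv = Q[obs][action] - v
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        theta[obs][a] += ALPHA_ACTOR * adv * grad
    return reward

for _ in range(30000):
    step()
```

After training, the actor prefers the action that matches its (noisy) observation, even though the policy never sees the hidden state directly; separating the step sizes keeps the critic's value estimates approximately converged relative to the slowly drifting policy.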
Publication Title, e.g., Journal
2010 International Conference on Networking, Sensing and Control (ICNSC 2010)
Citation/Publisher Attribution
Liu, Bo, Haibo He, and Daniel W. Repperger. "Two-time-scale online actor-critic paradigm driven by POMDP." 2010 International Conference on Networking, Sensing and Control (ICNSC 2010) (2010). doi: 10.1109/ICNSC.2010.546149.