Online learning control based on projected gradient temporal difference and advanced heuristic dynamic programming
Document Type
Conference Proceeding
Date of Original Version
1-1-2014
Abstract
We present a novel online learning control algorithm (OLCPA) that combines projected gradient temporal difference for the action-value function (PGTDAVF) with advanced heuristic dynamic programming with one-step delay (AHDPOSD). PGTDAVF can guarantee the convergence of temporal difference (TD)-based policy learning with smooth action-value function approximators, such as neural networks. AHDPOSD is a framework designed specifically to embed PGTDAVF for online learning control. It is consistent with the intent of temporal difference learning and also keeps PGTDAVF effective in a nonidentical-policy environment, which makes the method more practical. The proposed algorithm thus achieves stability and practicality together. Finally, a simulation of online learning control on a cart-pole benchmark demonstrates the practical control capability and efficiency of the presented method.
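For readers unfamiliar with gradient-based temporal difference learning, the following is a minimal sketch of one update step from that family, applied to an action-value estimate. It is not the paper's PGTDAVF: it assumes a linear approximator Q(s, a) = theta · phi(s, a) rather than a neural network, and it uses the standard TDC (TD with gradient correction) rule as a stand-in. The function name `tdc_q_update`, the feature vectors, and the step sizes are all hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tdc_q_update(theta, w, phi, phi_next, reward,
                 gamma=0.95, alpha=0.05, beta=0.1):
    """One linear TDC step for Q(s, a) = theta . phi(s, a).

    Assumption: this is the generic TDC rule, not the paper's PGTDAVF.
    theta    -- action-value weights
    w        -- auxiliary weights used for the gradient correction
    phi      -- features of the current state-action pair (s, a)
    phi_next -- features of the next state-action pair (s', a')
    """
    # TD error for the current transition.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Auxiliary estimate of the expected TD error given phi.
    correction = dot(phi, w)
    # Main weights: ordinary TD step minus the gradient-correction term.
    new_theta = [t + alpha * (delta * p - gamma * correction * pn)
                 for t, p, pn in zip(theta, phi, phi_next)]
    # Auxiliary weights: least-mean-squares step toward the TD error.
    new_w = [wi + beta * (delta - correction) * p
             for wi, p in zip(w, phi)]
    return new_theta, new_w, delta
```

In an online control loop, a step like this would run once per observed transition, with the next action a' selected by the current policy. How the paper couples the update to the one-step-delay structure of AHDPOSD is not described in the abstract and is not modeled here.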
Publication Title, e.g., Journal
Proceedings of the International Joint Conference on Neural Networks
Citation/Publisher Attribution
Fu, Jian, Sujuan Wei, Haibo He, and Shengyong Wang. "Online learning control based on projected gradient temporal difference and advanced heuristic dynamic programming." Proceedings of the International Joint Conference on Neural Networks (2014): 3649-3656. doi: 10.1109/IJCNN.2014.6889756.