Online learning control based on projected gradient temporal difference and advanced heuristic dynamic programming
We present a novel online learning control algorithm (OLCPA) that combines projected gradient temporal difference for action-value function (PGTDAVF) with advanced heuristic dynamic programming with one-step delay (AHDPOSD). PGTDAVF guarantees the convergence of temporal difference (TD)-based policy learning with smooth action-value function approximators, such as neural networks. AHDPOSD is a framework designed specifically to embed PGTDAVF for online learning control. It is consistent with the intent of temporal difference learning and also keeps PGTDAVF effective in a nonidentical-policy environment, which makes the method more practical. In this way, the proposed algorithms achieve stability and practicality simultaneously. Finally, simulation of online learning control on a cart-pole benchmark demonstrates the practical control capability and efficiency of the presented method.
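As a rough illustration of the gradient temporal difference family that PGTDAVF belongs to, the following minimal sketch shows one TDC (TD with gradient correction) step for a linear action-value function. It is not the paper's PGTDAVF, which targets smooth nonlinear approximators such as neural networks and is embedded in the AHDPOSD framework. The function name `tdc_q_update`, the feature vectors, and the step sizes are all assumptions made for illustration.

```python
import numpy as np

def tdc_q_update(theta, w, phi, reward, phi_next,
                 gamma=0.95, alpha=0.01, beta=0.05):
    """One linear TDC step for an action-value function (illustrative only).

    Q(s, a) is approximated as theta @ phi(s, a). The auxiliary vector w
    estimates the expected TD error as a linear function of the features.
    phi and phi_next are the (hypothetical) feature vectors of the current
    and next state-action pairs.
    """
    # TD error for the observed transition.
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)
    # Main weights: ordinary TD step plus the gradient-correction term.
    theta_new = theta + alpha * (delta * phi - gamma * phi_next * (w @ phi))
    # Auxiliary weights: least-mean-squares regression of delta onto phi.
    w_new = w + beta * (delta - w @ phi) * phi
    return theta_new, w_new, delta

# Usage on a single made-up transition with two features.
theta0 = np.zeros(2)
w0 = np.zeros(2)
phi = np.array([1.0, 0.0])
phi_next = np.array([0.0, 1.0])
theta1, w1, delta1 = tdc_q_update(theta0, w0, phi, 1.0, phi_next,
                                  gamma=0.5, alpha=0.1, beta=0.2)
theta2, w2, delta2 = tdc_q_update(theta1, w1, phi, 1.0, phi_next,
                                  gamma=0.5, alpha=0.1, beta=0.2)
```

The correction term `gamma * phi_next * (w @ phi)` is what separates this update from plain TD(0); it vanishes while `w` is zero, so the first step above coincides with an ordinary TD step.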
Proceedings of the International Joint Conference on Neural Networks
Fu, Jian, Sujuan Wei, Haibo He, and Shengyong Wang. "Online learning control based on projected gradient temporal difference and advanced heuristic dynamic programming." Proceedings of the International Joint Conference on Neural Networks (2014): 3649-3656. doi:10.1109/IJCNN.2014.6889756.