Experimental studies on data-driven heuristic dynamic programming for POMDP

Adaptive dynamic programming (ADP) has been a popular approach for seeking the optimal control strategy in Markov decision processes (MDPs). Generally, this type of approach requires a complete set of system information/states to achieve online optimal decision-making. In practice, however, the full system states are often unavailable: the measured input/output data may represent only part of the system information, and the internal states cannot be accessed directly. In this chapter, we investigate a data-driven heuristic dynamic programming (HDP) architecture to tackle the partially observable Markov decision process (POMDP). Specifically, we include a state estimator neural network that recovers the full system information for the action network, so that the optimal control policy can still be achieved in a partially observed environment. The weights of the state estimator network are randomly initialized, and the entire architecture is learned online. Both discrete-time and continuous-time system functions are tested. Simulation results and system trajectories demonstrate the control performance of the proposed approach.
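The architecture described above can be sketched in code. The following is a minimal illustrative sketch, not the chapter's implementation: the plant matrices, network sizes, learning rate, the input/output history fed to the estimator, and the choice of using the estimator's hidden layer as the recovered state are all assumptions made for this example. The estimator network is trained online on its output-prediction error, the critic on the temporal-difference error of the cost-to-go, and the action network by descending the critic's cost estimate; the controller never sees the plant's internal state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete-time plant used only for illustration:
# two internal states, but the controller measures only y = x[0].
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([0.0, 1.0])
gamma = 0.95   # discount factor for the cost-to-go (assumed value)
lr = 0.01      # shared learning rate (assumed value)

def init_net(n_in, n_h):
    """One-hidden-layer tanh network with a scalar linear output,
    randomly initialized as in the described approach."""
    return [rng.normal(0.0, 0.3, (n_h, n_in)), rng.normal(0.0, 0.3, (1, n_h))]

def forward(net, z):
    W1, W2 = net
    h = np.tanh(W1 @ z)
    return (W2 @ h)[0], h

def backward(net, z, h, dout):
    """Gradients of a scalar loss w.r.t. both weight matrices, plus the
    gradient w.r.t. the input z (needed for the action-network update)."""
    W1, W2 = net
    dW2 = dout * h[None, :]
    dh = (W2[0] * dout) * (1.0 - h ** 2)
    return np.outer(dh, z), dW2, W1.T @ dh

def sgd(net, dW1, dW2):
    net[0] -= lr * dW1
    net[1] -= lr * dW2

n_h = 8
est = init_net(4, n_h)         # (y_k, y_{k-1}, u_{k-1}, u_{k-2}) -> y_{k+1}
act = init_net(n_h, n_h)       # estimator hidden layer (recovered state) -> u
crit = init_net(n_h + 1, n_h)  # (recovered state, u) -> cost-to-go J

x = np.array([1.0, -0.5])      # true internal state, hidden from the controller
y_hist, u_hist = [x[0], x[0]], [0.0, 0.0]
prev, ys = None, []

for k in range(300):
    z = np.array([y_hist[-1], y_hist[-2], u_hist[-1], u_hist[-2]])

    # estimator update: last step's output prediction vs. the y just measured
    if prev is not None:
        err = prev["y_pred"] - y_hist[-1]
        dW1, dW2, _ = backward(est, prev["z"], prev["h_e"], err)
        sgd(est, dW1, dW2)

    y_pred, h_e = forward(est, z)          # h_e serves as the recovered state
    u_raw, h_a = forward(act, h_e)
    u = float(np.tanh(u_raw))              # bounded control input
    zc = np.append(h_e, u)
    J, h_c = forward(crit, zc)

    # critic update: drive J_{k-1} toward r_{k-1} + gamma * J_k
    if prev is not None:
        e = prev["J"] - (prev["r"] + gamma * J)
        dW1, dW2, _ = backward(crit, prev["zc"], prev["h_c"], e)
        sgd(crit, dW1, dW2)

    # action update: descend dJ/du, chained through the tanh saturation
    _, _, dzc = backward(crit, zc, h_c, 1.0)
    dJu = dzc[-1] * (1.0 - u ** 2)
    dW1, dW2, _ = backward(act, h_e, h_a, dJu)
    sgd(act, dW1, dW2)

    r = y_hist[-1] ** 2 + u ** 2           # quadratic utility on output/input
    prev = dict(z=z, h_e=h_e, y_pred=y_pred, zc=zc, h_c=h_c, J=J, r=r)

    x = A @ x + B * u                      # plant step; controller sees only y
    y_hist.append(x[0])
    u_hist.append(u)
    ys.append(x[0])
```

All three networks start from random weights and are adapted online from input/output data alone, mirroring the data-driven setting of the chapter; a continuous-time plant would be handled analogously by discretizing the dynamics before the same loop.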

Publication Title

Frontiers of Intelligent Control and Information Processing