Chasing Hallucinated Value: A Pitfall of Dyna Style Algorithms with Imperfect Environment Models
Abstract
In Dyna-style algorithms, reinforcement learning (RL) agents use a model of the environment to generate simulated experience. By updating on this simulated experience, Dyna-style algorithms allow agents to potentially learn control policies in fewer environment interactions than model-free RL algorithms require. Dyna is therefore an attractive approach to developing sample-efficient RL agents. In many RL problems, however, it is seldom possible to learn a perfectly accurate model of the environment's dynamics. This thesis explores what happens when Dyna is coupled with an imperfect environment model.
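To make the setting concrete, the following is a minimal sketch of tabular Dyna-Q on a hypothetical one-dimensional chain task (the environment, hyperparameters, and function names are illustrative, not drawn from the thesis): the agent updates Q-values from real transitions, stores those transitions in a model, and then performs extra planning updates by replaying simulated transitions from that model.

```python
import random

N_STATES, ACTIONS = 5, (-1, +1)          # move left / right along the chain

def step(s, a):
    """Deterministic chain: reward 1.0 for reaching the rightmost state."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return r, s2, s2 == N_STATES - 1     # reward, next state, done

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                            # (s, a) -> (r, s2, done), learned
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = (rng.choice(ACTIONS) if rng.random() < epsilon
                 else max(ACTIONS, key=lambda b: Q[(s, b)]))
            r, s2, done = step(s, a)
            # Direct RL: Q-learning update from the real transition.
            Q[(s, a)] += alpha * (r + gamma * (not done) *
                                  max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
            model[(s, a)] = (r, s2, done)
            # Planning: replay simulated transitions from the learned model.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * (not pdone) *
                                        max(Q[(ps2, b)] for b in ACTIONS)
                                        - Q[(ps, pa)])
            s = s2
    return Q
```

Because the planning loop reuses each real transition many times, the agent can converge in far fewer environment interactions than the model-free update alone would need; here the stored model happens to be exact, which is precisely the assumption the thesis relaxes.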
We present the Hallucinated Value Hypothesis. We hypothesise that Dyna-style algorithms coupled with imperfect environment models may fail to learn control policies if they update Q-values of observed states towards values of simulated states. We argue that this occurs because the imperfect model may erroneously generate fictitious states that do not correspond to real, reachable states of the environment. These fictitious states may have arbitrary Q-values, and temporal-difference updates toward them can propagate these misleading values through the value function. Consequently, agents may end up chasing hallucinated value.
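The failure mode can be sketched in a single planning update (the state names and numbers here are illustrative, not from the thesis): an imperfect model predicts a fictitious successor state whose table entry holds an arbitrary, here inflated, value, and a temporal-difference update bootstraps the real state's value toward it.

```python
gamma, alpha = 0.9, 0.5

# Illustrative value table: "fictitious" never occurs in the real
# environment, yet its entry holds an arbitrary (inflated) value.
Q = {"real_s": 0.1, "real_next": 0.2, "fictitious": 5.0}

# Imperfect model: from real_s it predicts the fictitious state, not real_next.
r_model, s2_model = 0.0, "fictitious"

# TD update toward the hallucinated successor inflates Q(real_s):
# 0.1 + 0.5 * (0.0 + 0.9 * 5.0 - 0.1) = 2.3, far above any real value.
Q["real_s"] += alpha * (r_model + gamma * Q[s2_model] - Q["real_s"])
```

Subsequent updates that bootstrap from `real_s` can then carry the error to its predecessors, which is how the misleading value spreads through the value function.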
We present three Dyna-style algorithms that may update real state values toward simulated state values, and one that is designed not to. We evaluate these algorithms on Bordered Gridworld --- a simple setting designed to carefully test the hypothesis. Furthermore, we study whether the hypothesis holds in a range of standard RL benchmarks: Cartpole, Catcher, and Puddleworld.
Experimental evidence supports the Hallucinated Value Hypothesis. The algorithms that update real state values toward simulated state values struggle to improve their control performance. In contrast, n-step predecessor Dyna, our algorithm that does not perform such updates, appears robust to model error on the tested domains. Furthermore, it enjoys speed-ups in learning over its competitors.
