Chasing Hallucinated Value: A Pitfall of Dyna Style Algorithms with Imperfect Environment Models

Institution

University of Alberta (http://id.loc.gov/authorities/names/n79058482)

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Specialization

Statistical Machine Learning

Abstract

In Dyna style algorithms, reinforcement learning (RL) agents use a model of the environment to generate simulated experience. By updating on this simulated experience, Dyna style algorithms allow agents to potentially learn control policies in fewer environment interactions than agents that use model-free RL algorithms. Dyna is therefore an attractive approach to developing sample-efficient RL agents. In many RL problems, however, it is seldom possible to learn a perfectly accurate model of environment dynamics. This thesis explores what happens when Dyna is coupled with an imperfect environment model.
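The planning loop described above can be sketched as classic tabular Dyna-Q (not the thesis's specific algorithms); the `env` interface with `reset()`/`step()` and the function names here are illustrative assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, n_episodes=50, n_planning=10, alpha=0.1, gamma=0.99, eps=0.1):
    """Minimal tabular Dyna-Q sketch: direct RL updates from real experience,
    plus planning updates from simulated experience drawn from a learned
    deterministic model stored as observed (s, a) -> (r, s') pairs."""
    Q = defaultdict(float)            # Q[(state, action)]
    model = {}                        # learned model: (s, a) -> (r, s')
    actions = list(range(env.n_actions))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            r, s2, done = env.step(a)
            # Direct RL: Q-learning update from the real transition.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # Planning: replay simulated transitions from the model.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q
```

Because each real transition is replayed many times during planning, value information spreads through the table faster than with model-free updates alone, which is the source of Dyna's sample efficiency.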

We present the Hallucinated Value Hypothesis. We hypothesise that Dyna style algorithms coupled with imperfect environment models may fail to learn control policies if they update Q-values of observed states toward values of simulated states. We argue this occurs because the imperfect model may erroneously generate fictitious states that do not correspond to real, reachable states of the environment. These fictitious states may have arbitrary Q-values, and temporal difference updates toward them may lead to the propagation of these misleading values through the value function. Consequently, agents may end up incorrectly chasing hallucinated value.
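The propagation mechanism can be illustrated with a toy numeric example (the states and values here are hypothetical, not taken from the thesis): a TD update of a real state toward a fictitious, model-generated state imports that state's arbitrary value into the real value function, and a further update carries it to a predecessor:

```python
def td_update(q, s, target, alpha=0.5, gamma=0.9, r=0.0):
    """One TD(0)-style update of state s toward a bootstrap target."""
    q[s] += alpha * (r + gamma * target - q[s])
    return q[s]

q = {"A": 0.0, "B": 0.0}   # real states; suppose their true values are 0
q_fake = 5.0               # fictitious state hallucinated by the model; it is
                           # never visited, so its value is never corrected

# Simulated transition B -> fake: B's value chases the hallucinated value...
td_update(q, "B", target=q_fake)      # q["B"] becomes 2.25
# ...and a real transition A -> B then propagates it further back to A.
td_update(q, "A", target=q["B"])      # q["A"] becomes 1.0125
```

Both real states now carry strictly positive value that exists nowhere in the true environment, which is exactly the failure mode the hypothesis describes.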

We present three Dyna style algorithms that may update real state values toward simulated state values and one which is designed not to. We evaluate these algorithms on Bordered Gridworld, a simple setting designed to carefully test the hypothesis. Furthermore, we study whether the hypothesis holds in a range of standard RL benchmarks: Cartpole, Catcher, and Puddleworld.

Experimental evidence supports the Hallucinated Value Hypothesis. The algorithms that update real state values toward simulated state values struggle to improve their control performance. On the other hand, n-step predecessor Dyna, our algorithm which does not perform such updates, appears robust to model error on the tested domains. Furthermore, it enjoys speed-ups in learning over its competitors.
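The safe direction of updates can be sketched as follows, assuming a learned *predecessor* model that proposes transitions leading into a real, observed state; this illustrates only the direction of the updates, not the thesis's exact n-step predecessor Dyna algorithm, and the names here are assumptions:

```python
from collections import defaultdict

def predecessor_planning(Q, predecessor_model, s_real, actions, alpha=0.1, gamma=0.99):
    """Planning step that updates simulated predecessors of a real state
    toward that real state's value. The bootstrap target is built only from
    the real state, so even a fictitious predecessor proposed by an imperfect
    model cannot contaminate the values of real states."""
    v_real = max(Q[(s_real, a)] for a in actions)
    for (s_pred, a_pred, r) in predecessor_model(s_real):
        Q[(s_pred, a_pred)] += alpha * (r + gamma * v_real - Q[(s_pred, a_pred)])
```

Contrast this with ordinary Dyna planning, where the real state's Q-value is updated toward a model-generated successor: there, a hallucinated successor directly corrupts real values, whereas here model error can at worst misplace value on fictitious predecessor states that the agent never actually occupies.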

Item Type

Thesis (http://purl.org/coar/resource_type/c_46ec)

Other License Text / Link

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

en
