Sample-Efficient Control with Directed Exploration in Discounted MDPs Under Linear Function Approximation
Abstract
An important goal of online reinforcement learning algorithms is efficient data collection for learning near-optimal behaviour, that is, optimizing the exploration-exploitation trade-off to reduce the sample complexity of learning. To reduce this sample complexity, it is essential that the agent directs its exploratory behaviour towards either visiting unvisited parts of the environment or reducing its uncertainty about the parts it has already visited. In addition to such directed exploration, sample complexity can be improved by using a representation space that is amenable to online reinforcement learning. This thesis presents several algorithms that pursue these avenues for improving the sample complexity of online reinforcement learning, specifically in the setting of discounted MDPs under linear function approximation. A key challenge in directing effective online exploration is learning reliable uncertainty estimates. We address this by deriving high-probability confidence bounds for value uncertainty estimation. We use these confidence bounds to design two algorithms that direct effective online exploration; they differ mainly in how they direct exploration towards unknown regions of the environment. The first algorithm uses a heuristic to do so, whereas the second uses a more principled strategy based on optimistic initialization. The second algorithm is also planning-compatible and can be parallelized, so its sample-efficiency benefits scale with the compute resources afforded to it. To improve sample efficiency by utilizing representations that are amenable to online reinforcement learning, the thesis proposes a simple strategy for learning such representations offline. The representation learning algorithm encodes a property we call locality. Locality reduces interference in the learning targets used by online reinforcement learning algorithms, consequently improving their sample efficiency. The thesis shows that these learned representations also aid effective online exploration. Overall, this thesis proposes algorithms for improving the sample efficiency of online reinforcement learning, motivates their utility, and evaluates their benefits empirically.
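
For context, uncertainty-directed exploration under linear function approximation is commonly formalized with elliptical confidence sets. The sketch below shows the standard form such an exploration bonus takes in the literature (LinUCB/LSVI-UCB style); it illustrates the general idea only and is not the specific bound derived in this thesis. The symbols phi, Lambda, lambda, and beta are generic placeholders.

% Standard elliptical exploration bonus under linear function approximation
% (illustrative of the general idea; not the thesis's derived confidence bounds).
% \phi(s,a): feature vector; \Lambda_t: regularized Gram matrix; \beta_t: confidence radius.
\[
\Lambda_t \;=\; \lambda I \;+\; \sum_{i=1}^{t} \phi(s_i, a_i)\,\phi(s_i, a_i)^\top,
\qquad
\mathrm{bonus}_t(s,a) \;=\; \beta_t \,\sqrt{\phi(s,a)^\top \Lambda_t^{-1}\, \phi(s,a)}.
\]

The bonus is large for state-action pairs whose features point in directions the agent has rarely observed, which is one common way to direct exploration towards under-visited regions.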
