A Unified View of Multi-step Temporal Difference Learning


Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Abstract

Temporal-difference (TD) learning is an important approach for predictive knowledge representation and sequential decision making. Within TD learning exist multi-step methods, which unify one-step TD learning and Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. Multi-step TD methods allow a practitioner to address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sequences of sampled information, which could have large variance. In this dissertation, we investigate an extension of multi-step TD learning aimed at reducing the variance in the estimates, and provide a unified view of the space of multi-step TD algorithms. In Monte Carlo methods, information about the error of a known quantity is sometimes incorporated in an attempt to reduce the error in the estimation of an unknown quantity. This is known as the method of control variates, and it has not been extensively explored in TD learning. We show that control variates can be formulated in multi-step TD learning, and demonstrate the improvements they provide in learning speed and accuracy. We then show how the inclusion of control variates provides a deeper understanding of how n-step TD methods relate to TD(λ) algorithms. We then look at a previously proposed method to unify the space of n-step TD algorithms, the n-step Q(σ) algorithm. We provide empirical results and analyze properties of this algorithm, suggest an improvement based on insight from the control variates, and derive the TD(λ) version of the algorithm. This generalization can recover existing multi-step TD algorithms as special cases, providing an alternative, unified view of them. Lastly, we bring attention to the discount rate in TD learning. The discount rate is typically used to specify the horizon of interest in sequential decision making problems, but we introduce an alternate view of the parameter with insight from digital signal processing.
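To make the multi-step idea concrete, the following is a minimal sketch of tabular n-step TD prediction on a small random-walk chain. The environment (a 5-state chain with reward 1 at the right terminal), step size, and episode count are illustrative assumptions, not details taken from the thesis; the update itself is the standard n-step TD target that interpolates between one-step TD (n = 1) and Monte Carlo (large n).

```python
import random

# Illustrative sketch (assumed setup): tabular n-step TD prediction on a
# random-walk chain. States 1..n_states; 0 and n_states+1 are terminal,
# with reward 1 only on entering the right terminal.
def n_step_td(n=4, alpha=0.1, gamma=1.0, episodes=200, n_states=5, seed=0):
    rng = random.Random(seed)
    v = [0.0] * (n_states + 2)           # value estimates, terminals fixed at 0
    for _ in range(episodes):
        s = (n_states + 1) // 2          # start in the middle of the chain
        states, rewards = [s], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s2 = s + (1 if rng.random() < 0.5 else -1)
                r = 1.0 if s2 == n_states + 1 else 0.0
                states.append(s2)
                rewards.append(r)
                if s2 == 0 or s2 == n_states + 1:
                    T = t + 1            # episode terminates at time T
                s = s2
            tau = t - n + 1              # time whose estimate is updated
            if tau >= 0:
                # n-step return: sampled rewards plus a bootstrapped tail.
                h = min(tau + n, T)
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, h + 1))
                if tau + n < T:
                    G += gamma ** n * v[states[tau + n]]
                v[states[tau]] += alpha * (G - v[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return v[1:-1]
```

With this setup the true values are i/6 for state i, so the learned estimates should increase from left to right along the chain; varying n trades bias from bootstrapping against variance from longer sampled reward sequences.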
By allowing the discount rate to take on complex numbers within the complex unit circle, we can extend the types of knowledge learnable by a TD agent into the frequency domain. This allows for online and incremental estimation of the extent to which particular frequencies exist in a signal, with the standard discounting framework corresponding to the zero-frequency case.
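The complex-discounting idea can be sketched as follows. A discounted return computed with a complex discount rate γ = r·e^(iω) is an exponentially weighted sum that accumulates evidence for frequency ω in an incoming signal, updated online one sample at a time; setting ω = 0 recovers ordinary discounting. The test signal and parameter values below are illustrative assumptions.

```python
import cmath
import math

# Sketch (assumed setup): the discounted-return recursion with a complex
# discount rate inside the unit circle acts as an online estimator of a
# signal's content at a chosen frequency omega.
def frequency_trace(signal, omega, r=0.95):
    gamma = r * cmath.exp(1j * omega)    # complex discount, |gamma| < 1
    v = 0j
    for x in signal:
        v = x + gamma * v                # same recursion as a discounted return
    return abs(v)

# A pure tone should excite the matching trace far more than a mismatched one.
omega0 = 2 * math.pi / 16
signal = [math.cos(omega0 * t) for t in range(400)]
```

Here the magnitude of the trace at ω = omega0 is large, while a trace tuned to an unrelated frequency stays small, mirroring the abstract's claim that complex discounting estimates the extent to which particular frequencies exist in a signal.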

Item Type

http://purl.org/coar/resource_type/c_46ec

Other License Text / Link

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

en
