Unifying n-Step Temporal-Difference Action-Value Methods

Loading...
Thumbnail Image

Institution

http://id.loc.gov/authorities/names/n79058482

Degree Level

Master's

Degree

Master of Science

Department

Department of Computing Science

Specialization

Statistical Machine Learning

Supervisor / Co-Supervisor and Their Department(s)

Citation for Previous Publication

Link to Related Item

Abstract

Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, the TD(λ) algorithm elegantly unifies temporal difference (TD) methods with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. The same type of unification is achievable with n-step algorithms, a simpler version of multi-step TD methods where updates consist of a single backup of length n instead of a geometric average of several backups of different lengths.In this work, we present a new n-step algorithm named Q(σ) that unifies two of the existing n-step algorithms for estimating action-value functions — Sarsa and Tree Backup. The fundamental difference between Sarsa and Tree Backup is that the former samples a single action at every step of the backup, whereas the latter takes an expectation over all the possible actions. We introduce a new parameter, σ ∈ [0, 1], that allows the degree of sampling performed by the algorithm at each step to be continuously varied. This creates a new family of algorithms that span a continuum between Sarsa (full sampling, σ = 1) and Tree Backup (pure expectation, σ = 0). Our results show that our algorithm can perform better when using intermediate values of σ instead of any of the extremes. Moreover, if we decay σ over time from one to zero, we obtain an algorithm that outperforms other variants of Q(σ) with a fixed σ over a variety of tasks.This work has three main contributions. First, we introduce our new algorithm, n-step Q(σ) and provide empirical evaluations of the algorithm in the tabular case. Second, we extend n-step Q(σ) to the linear function approximation case and demonstrate its performance in the environment mountain cliff. Third, we combined n-step Q(σ) with the DQN architecture and tested the performance of our new architecture — named the Q(σ) network — in the mountain car environment.Throughout our empirical evaluations, we found that the parameter σ often serves as a trade-off between initial and final performance. Moreover, we found that the decaying σ algorithm performed better than algorithms with fixed values of σ in terms of initial and final performance. We also found that in some domains n-step Q(σ) with an intermediate value of σ performed better than either of the extreme values corresponding to n-step Tree Backup and Sarsa. Our results represent a compelling argument for using n-step Q(σ) over n-step Sarsa or Tree Backup. n-step Q(σ) offers a flexible framework that can be adapted to the specifics of the learning task in order to improve performance.

Item Type

http://purl.org/coar/resource_type/c_46ec

Alternative

License

Other License Text / Link

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

en

Location

Time Period

Source