Unifying n-Step Temporal-Difference Action-Value Methods
Abstract
Unifying seemingly disparate algorithmic ideas to produce better-performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, the TD(λ) algorithm elegantly unifies temporal-difference (TD) methods with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. The same type of unification is achievable with n-step algorithms, a simpler family of multi-step TD methods whose updates consist of a single backup of length n instead of a geometric average of backups of several different lengths.

In this work, we present a new n-step algorithm, named Q(σ), that unifies two of the existing n-step algorithms for estimating action-value functions: Sarsa and Tree Backup. The fundamental difference between Sarsa and Tree Backup is that the former samples a single action at every step of the backup, whereas the latter takes an expectation over all possible actions. We introduce a new parameter, σ ∈ [0, 1], that allows the degree of sampling performed by the algorithm at each step to be varied continuously. This creates a new family of algorithms that spans a continuum between Sarsa (full sampling, σ = 1) and Tree Backup (pure expectation, σ = 0). Our results show that intermediate values of σ can perform better than either extreme. Moreover, decaying σ over time from one to zero yields an algorithm that outperforms fixed-σ variants of Q(σ) across a variety of tasks.

This work has three main contributions. First, we introduce our new algorithm, n-step Q(σ), and provide empirical evaluations of it in the tabular case. Second, we extend n-step Q(σ) to the linear function approximation case and demonstrate its performance in the Mountain Cliff environment.
Third, we combine n-step Q(σ) with the DQN architecture and test the performance of the resulting architecture, named the Q(σ) network, in the Mountain Car environment.

Throughout our empirical evaluations, we found that the parameter σ often serves as a trade-off between initial and final performance. Moreover, the decaying-σ algorithm performed better than algorithms with fixed values of σ in terms of both initial and final performance. We also found that in some domains n-step Q(σ) with an intermediate value of σ performed better than either of the extremes corresponding to n-step Tree Backup and Sarsa. These results make a compelling argument for using n-step Q(σ) over n-step Sarsa or Tree Backup: n-step Q(σ) offers a flexible framework that can be adapted to the specifics of the learning task in order to improve performance.
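To make the unification concrete, the sketch below computes an on-policy n-step Q(σ) return recursively for the tabular case. It is a minimal illustration, not the thesis's exact notation: the function name, argument layout, and trajectory encoding are assumptions. At each step of the backup, σ weights the sampled continuation (Sarsa-style) against the expected continuation under the target policy (Tree-Backup-style), so σ = 1 recovers the n-step Sarsa return and σ = 0 recovers the n-step Tree Backup return.

```python
import numpy as np

def q_sigma_return(rewards, states, actions, q, pi, sigma, gamma=1.0):
    """Compute an on-policy n-step Q(sigma) return (illustrative sketch).

    rewards[k] is the reward following (states[k], actions[k]); states and
    actions have length n + 1 so states[n], actions[n] is the bootstrap
    pair. q is a (num_states, num_actions) action-value table, pi(s)
    returns the target policy's action probabilities at state s, and
    sigma in [0, 1] interpolates between Tree Backup (0) and Sarsa (1).
    """
    n = len(rewards)
    # Bootstrap from the action value of the last state-action pair.
    g = q[states[n], actions[n]]
    # Work backwards through the segment. At each step, mix the sampled
    # backup through the taken action with the expected backup over all
    # actions, weighted by sigma and (1 - sigma) respectively.
    for k in reversed(range(n)):
        s, a = states[k + 1], actions[k + 1]
        probs = pi(s)
        v_bar = np.dot(probs, q[s])  # expected action value under pi
        w = sigma + (1.0 - sigma) * probs[a]
        g = rewards[k] + gamma * (
            w * (g - q[s, a])                        # correction through A_{k+1}
            + sigma * q[s, a] + (1.0 - sigma) * v_bar  # sigma-weighted bootstrap
        )
    return g
```

Decaying σ between episodes, as studied in this work, then amounts to nothing more than passing a smaller `sigma` value to the return computation over time.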
