Consistent Emphatic Temporal-Difference Learning
Abstract
Off-policy policy evaluation is a critical and challenging problem in reinforcement learning, and Temporal-Difference (TD) learning is one of the most important approaches to it. There has been significant interest in off-policy TD algorithms that find the same solution as would be obtained in the on-policy regime. The defining property of these algorithms is that their expected update has the same fixed point as that of On-policy TD(λ), a property we call consistency. Notably, Full IS TD(λ) is the only existing consistent off-policy TD method under general linear function approximation, but it suffers from high variance and is scarcely practical. This notorious variance issue motivated the introduction of ETD(λ), which tames the variance but has a biased fixed point. Inspired by these two methods, we propose a new consistent algorithm, Average Emphatic TD (AETD(λ)), whose bias is transient, striking a balance between bias and variance. Further, we unify AETD(λ) with existing algorithms to obtain a new family of consistent algorithms, Consistent Emphatic TD (CETD(λ, β, ν)), which controls a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through theoretical analysis and experiments on a didactic example, we establish the consistency of CETD(λ, β, ν) and demonstrate this theoretical advantage empirically. Moreover, we show that CETD(λ, β, ν) converges fastest to the lowest error in a complex, high-variance task.
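For background on the emphatic updates this family builds on, the sketch below is a minimal single-episode implementation of the standard ETD(λ) update with linear function approximation, as introduced by Sutton, Mahmood, and White. It is not the thesis's CETD(λ, β, ν) algorithm; the function name, array layout, and default step sizes are illustrative assumptions.

```python
import numpy as np

def etd_lambda_episode(phi, rewards, rho, alpha=0.01, gamma=0.99,
                       lam=0.0, interest=None):
    """One-episode sketch of ETD(lambda) with linear function approximation.

    phi      : (T+1, d) array of feature vectors for states S_0 .. S_T
    rewards  : (T,) array of rewards R_1 .. R_T
    rho      : (T,) importance ratios pi(A_t|S_t) / mu(A_t|S_t)
    interest : (T,) interest i(S_t); defaults to all ones
    """
    T, d = len(rewards), phi.shape[1]
    if interest is None:
        interest = np.ones(T)
    w = np.zeros(d)    # weight vector
    e = np.zeros(d)    # eligibility trace
    F = 0.0            # followon trace
    rho_prev = 1.0     # so that F_0 = i(S_0) on the first step
    for t in range(T):
        # Followon trace: discounted, importance-weighted accumulation of interest.
        F = rho_prev * gamma * F + interest[t]
        # Emphasis blends instantaneous interest with the followon trace.
        M = lam * interest[t] + (1.0 - lam) * F
        # One-step TD error under the current linear value estimate.
        delta = rewards[t] + gamma * phi[t + 1] @ w - phi[t] @ w
        # Emphatically weighted, importance-corrected eligibility trace.
        e = rho[t] * (gamma * lam * e + M * phi[t])
        w = w + alpha * delta * e
        rho_prev = rho[t]
    return w
```

Note how the followon trace F multiplies importance ratios across time steps; this is the source of the emphasis weighting and, in Full IS TD(λ)-style corrections, of the high variance that the abstract describes the CETD family as taming.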
