The Silver Challenge - Lecture 8

Okay, so I realize this post is coming on Friday, but I did watch Lecture 8 on Thursday. Fortunately, I wrote the rules for this challenge myself, so as the official challenge arbiter I say it’s okay to post a day late.

Anyway, I wanted to briefly discuss the Dyna-2 algorithm (which I think Silver developed himself in this 2008 paper), because I was confused about the difference between Temporal Difference (TD) learning and TD search [1]. Silver explained that Dyna-2 keeps two sets of feature weights (one for each of two function approximators, such as neural networks), which encode the agent’s long-term and short-term memory [1]. The long-term memory is learned from real experience using TD learning [1]. In contrast, the short-term memory is learned from simulated experience using TD search [1]. At first I could not see any significant difference between TD learning and TD search, since both use essentially the same step to update the action-value function [1,2]:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
Equation 1
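To make the Sarsa update concrete for myself, here is a minimal sketch of a single backup in Python. It assumes a tabular action-value function stored in a dictionary, which is a simplification: Dyna-2 as described in the lecture uses feature weights and function approximation. The names `sarsa_update`, `alpha`, and `gamma`, and the states and actions in the usage lines, are my own hypothetical choices, not anything from the lecture.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one Sarsa backup to the table Q in place and return the TD error.

    Q maps (state, action) pairs to estimated action values.
    alpha is the step size and gamma is the discount factor (both assumed values).
    """
    td_target = r + gamma * Q[(s_next, a_next)]  # bootstrap from the next state-action pair
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical usage: every value starts at zero, and one transition earns a reward of 1.
Q = defaultdict(float)
delta = sarsa_update(Q, s="s0", a="right", r=1.0, s_next="s1", a_next="right")
```

Nothing in this function says whether the transition `(s, a, r, s_next, a_next)` came from the real world or from a model, which is exactly why the two algorithms looked identical to me at first.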

I think the difference is where the agent gets its experience from [1]. In TD search, the agent still bootstraps and performs Sarsa updates, as shown in Equation 1, just like in TD learning [1]. The difference is that TD search applies those updates to simulated episodes, which the agent samples from its model of the sub-MDP that starts at the current state [1]. Because simulation is cheap, the search can try many different action sequences from the current state, as in the tree representing the sub-MDP shown in Figure 1 [1]. In contrast, TD learning can only be done over the agent’s actual experience in the real world, so at each time step it only sees the outcome of the single action it actually took (usually the action with the highest Q-value, or occasionally an exploratory one under an ε-greedy policy) [1].
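Here is a rough sketch of that contrast, under some loud assumptions: a made-up four-state chain environment, a model that happens to be a perfect copy of the real dynamics, and two separate Q tables standing in for the long-term and short-term memories (in Dyna-2 these are feature weights, and I am not reproducing how the algorithm combines them). Every name here (`step`, `epsilon_greedy`, `td_search`, `Q_long`, `Q_short`) is hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical chain with states 0..3; reaching state 3 pays reward 1


def step(state, action):
    """Toy dynamics, used both as the 'real world' and as the agent's model (an assumption)."""
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return (1.0 if nxt == GOAL else 0.0), nxt, nxt == GOAL


def epsilon_greedy(Q, state, rng, eps=0.2):
    """Pick a random action with probability eps, else a greedy one (ties broken at random)."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa_update(Q, s, a, r, s2, a2, done, alpha=0.1, gamma=0.9):
    """One Sarsa backup; terminal transitions do not bootstrap."""
    target = r if done else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def td_search(Q_short, start, rng, n_episodes=200, max_steps=50):
    """Same Sarsa update, but every transition is simulated with the model,
    and every simulated episode starts from the current real state."""
    for _ in range(n_episodes):
        s, a = start, epsilon_greedy(Q_short, start, rng)
        for _ in range(max_steps):
            r, s2, done = step(s, a)  # simulated transition, not a real one
            a2 = epsilon_greedy(Q_short, s2, rng)
            sarsa_update(Q_short, s, a, r, s2, a2, done)
            if done:
                break
            s, a = s2, a2


rng = random.Random(0)
Q_long, Q_short = defaultdict(float), defaultdict(float)

# TD learning: one real transition yields exactly one update.
s = 0
a = epsilon_greedy(Q_long, s, rng)
r, s2, done = step(s, a)
a2 = epsilon_greedy(Q_long, s2, rng)
sarsa_update(Q_long, s, a, r, s2, a2, done)

# TD search: many simulated episodes from the same current state.
td_search(Q_short, start=s, rng=rng)
```

After a single real step from state 0 the long-term table has learned nothing yet, because no reward has been seen. The short-term table has meanwhile been updated over many simulated trajectories from the same state, all without the agent moving in the real world.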

Figure 1 - Source: [3]


[1] Silver, D. “RL Course by David Silver - Lecture 8: Integrating Learning and Planning.” YouTube. 13 May 2015. Visited 13 Aug 2020.

[2] Silver, D. “Lecture 5: Model-Free Control.” Visited 14 Aug 2020.

[3] Silver, D. “Lecture 8: Integrating Learning and Planning.” Visited 14 Aug 2020.

Written on August 13, 2020