The Silver Challenge - Lecture 3

Welcome back to Day 3 of the Silver Challenge! Today we learned about dynamic programming. I was really glad to encounter this topic again because although I have heard about dynamic programming in other controls courses, I have never felt as though I had a good grasp of what it was. In fact, this lecture solidified my understanding of several key concepts, which I will review briefly here, including: dynamic programming, the Principle of Optimality, and the different approaches to applying dynamic programming to reinforcement learning. Let’s dig in.

Dynamic Programming

Briefly, dynamic programming refers to methods that seek to optimize a policy for a problem that has some sequential or temporal element to it [1]. The term dynamic refers to the temporal nature of the problem, and the term programming is meant in the mathematical sense, where it refers to a policy, not a coded program that computer scientists would write [1]. In general we need the problem that we are solving with dynamic programming to have two key properties:

We must be able to break the problem into smaller problems [1]. In other words, the Principle of Optimality must apply [1]. (More on that in a moment.)
The subproblems must overlap, that is we should be able to solve this problem using a recursive approach [1].

Dynamic programming can be used in many areas of research, not just reinforcement learning [1].

The Principle of Optimality

I have also heard about the Principle of Optimality before, but I really liked Silver’s definition and so I wanted to record it here. According to Silver, a policy achieves the optimal value from a given state, s if and only if, for any successor state, s’, that is reachable from s, the policy will achieve the optimal value from s’ onwards [1].

I believe the Principle of Optimality is what enables us to solve a dynamic programming problem backwards from the final state. That is, as long as we know all the states that we have seen are optimal, then if our current chosen state is also optimal, then we are satisfying the Principle of Optimality. And if we have satisfied the Principle of Optimality, then we know we are solving the RL problem with this approach.

Summary of Prediction and Control Approaches

Finally, Silver presented a nice summary of the prediction and control approaches that use dynamic programming in a table, which I wanted to recreate here [1]:

Fig 1
Figure 1 - Adapted from [1]

References:

[1] Silver, D. “RL Course by David Silver - Lecture 3: Planning by Dynamic Programming.” YouTube. 13 May 2015. https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3 Visited 08 Aug 2020.

Written on August 8, 2020