<p>Emma Benjaminson, Mechanical Engineering Graduate Student</p>

<h1 id="convex-optimization-and-mpc">Convex Optimization and MPC</h1>
<p>2020-11-08</p>
<p>We have been talking a lot lately about implementing model predictive control (MPC) in discrete time. In this final post on the subject (for now), we are going to look at how we can write MPC as a convex optimization problem. We are going to use fast solvers in our codebase to solve MPC as an optimization problem, so this post explains how to structure the MPC problem so that it is easy to implement in code. We will start by discussing what convex optimization is (at a high level) and then work through the math of casting MPC as a convex optimization problem.</p>
<h2 id="what-is-convex-optimization">What is Convex Optimization?</h2>
<p>Convex optimization is such a deep topic that there are entire graduate-level courses on the subject, so I am going to make some generalizations here to avoid going into too much detail. Let’s start by considering what a <strong>convex function</strong> is: it is a function where, if you draw a line segment between any two points on the curve, the segment always lies on or above the curve, as shown in Figure 1 [1]. A key property that we exploit when we optimize a convex function is that any local minimum is also a global minimum [1].</p>
<p><img src="/images/2020-11-08-ConvexOpt-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [2]</p>
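<p>We can sanity-check this definition numerically: for a convex function, f(lam*x + (1-lam)*y) &lt;= lam*f(x) + (1-lam)*f(y) for any points x, y and any lam in [0, 1]. Here is a quick sketch (my own, not from the references) using the convex function f(x) = x&#178;:</p>

```python
import numpy as np

def f(x):
    return x ** 2  # a simple convex function

rng = np.random.default_rng(0)
x, y = rng.uniform(-10, 10, size=(2, 1000))  # random point pairs
lam = rng.uniform(0, 1, size=1000)           # random interpolation weights

# The chord between (x, f(x)) and (y, f(y)) lies on or above the curve
lhs = f(lam * x + (1 - lam) * y)
rhs = lam * f(x) + (1 - lam) * f(y)
assert np.all(lhs <= rhs + 1e-12)
print("convexity inequality holds for all sampled pairs")
```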
<p>Convex optimization focuses on finding ways to minimize convex functions, and many of the solutions are polynomial-time algorithms (this is good: polynomial-time algorithms tend to be fast, whereas finding the optima of general functions can be NP-hard) [3]. The standard form of a convex optimization problem looks like this [3]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>There may be zero, one, or many solutions to this problem [3]. We can use tools like fmincon in MATLAB to perform convex optimization numerically. In the next section, we will talk about how we formulate MPC as a convex optimization problem as given in Equation 1.</p>
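<p>As a concrete illustration, here is a small convex problem in the standard form of Equation 1, solved numerically with SciPy (the specific objective and constraints are my own toy example, not anything from the references):</p>

```python
import numpy as np
from scipy.optimize import minimize

# Convex objective: squared distance to the point (1, 2)
def f0(x):
    return (x[0] - 1) ** 2 + (x[1] - 2) ** 2

# Inequality constraint in scipy's form g(x) >= 0, encoding x1 + x2 <= 2
cons = [{"type": "ineq", "fun": lambda x: 2 - x[0] - x[1]}]
bounds = [(0, None), (0, None)]  # x >= 0

res = minimize(f0, x0=np.zeros(2), method="SLSQP", bounds=bounds, constraints=cons)
print(res.x)  # the projection of (1, 2) onto the feasible set: about (0.5, 1.5)
```

Because the objective and feasible set are both convex, the solver's answer is the unique global minimizer.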
<h2 id="writing-mpc-as-convex-optimization">Writing MPC as Convex Optimization</h2>
<p>When we <a href="https://sassafras13.github.io/ControlTheoryBasics/">first encountered MPC</a>, we learned that we needed to optimize some cost function, and that we had to constrain our optimization to account for the dynamics of our system. The problem was that our system dynamics were written in continuous time and it was impossible to write them in the form given for equality constraints shown in Equation 1. Now that we have learned about the discrete time formulation of system dynamics, however, it is possible for us to cast MPC as a convex optimization problem. Let’s start by writing out the basic form of the linear MPC problem in discrete time [4]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>The additional terms in the cost function (using matrices M and Q-tilde) penalize the interaction between the state and the control input (the M matrix term) and the terminal state (the Q-tilde term). The additional constraints listed below the cost function allow us to define specific conditions we may need to consider for our particular system, such as the system dynamics. Please note that although we are describing the system dynamics in terms of A and B, these matrices are really phi and gamma, the discrete time equivalents of the continuous time matrices A and B.</p>
<p>We need a condensed form of Equation 2 to optimize, one that does not depend on the state at a particular time interval <em>j</em> [4]. Let’s see how we can aggregate the variables in the cost function and constraints into a different representation that only relies on the initial state [4].</p>
<p>We start by writing the variables in the basic form (Equation 2) as vectors or matrices containing values for every time point we are considering [4]. Recall that in MPC we usually optimize over some horizon, so we will need as many entries as we have time points in our horizon. This looks like [4]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn3.png" alt="Eqn 3" title="Equation 3" /> <br />
Equation 3</p>
<p>This new representation will allow us to rewrite the minimization objective as [4]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn4.png" alt="Eqn 4" title="Equation 4" /> <br />
Equation 4</p>
<p>Notice that we can write the matrices A-bar and B-bar as follows [4]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn5.png" alt="Eqn 5" title="Equation 5" /> <br />
Equation 5</p>
<p>Why does this work? This works because, as we compute the values of successive states, we find that they only depend on the control inputs for each time point and the initial value of the state. Let’s see that in action over a few time steps:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn6.png" alt="Eqn 6" title="Equation 6" /> <br />
Equation 6</p>
<p>Do you see how the pattern of coefficients matches what we have in Equation 5 for the A-bar and B-bar matrices? And do you see how, at every time step, we only need the initial state and the control inputs to compute our current state? This pattern now allows us to rewrite the dynamics as [4]:</p>
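<p>The stacked-matrix pattern above can be verified numerically. Here is a sketch (my own check in NumPy, not code from [4]) that builds A-bar and B-bar for a small example system and confirms that the stacked prediction matches a step-by-step simulation of the dynamics:</p>

```python
import numpy as np

def prediction_matrices(A, B, N):
    """Stack x_1..x_N as x_bar = A_bar @ x0 + B_bar @ u_bar (the Equation 5 pattern)."""
    n, m = B.shape
    A_bar = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
    B_bar = np.zeros((N * n, N * m))
    for i in range(N):           # block row i predicts x_{i+1}
        for j in range(i + 1):   # block column j is the influence of u_j
            B_bar[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, i - j) @ B
    return A_bar, B_bar

A = np.array([[1.0, 0.1], [0.0, 1.0]])  # example discrete-time dynamics
B = np.array([[0.005], [0.1]])
N = 5
x0 = np.array([1.0, 0.0])
u = np.random.default_rng(1).standard_normal((N, 1))

A_bar, B_bar = prediction_matrices(A, B, N)
x_stacked = A_bar @ x0 + B_bar @ u.ravel()

# Step-by-step simulation for comparison
x, traj = x0, []
for k in range(N):
    x = A @ x + B @ u[k]
    traj.append(x)
assert np.allclose(x_stacked, np.concatenate(traj))
```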
<p><img src="/images/2020-11-08-ConvexOpt-eqn7.png" alt="Eqn 7" title="Equation 7" /> <br />
Equation 7</p>
<p>Now we can substitute the system dynamics into the minimization objective and the constraints in order to get the following final expression [4]:</p>
<p><img src="/images/2020-11-08-ConvexOpt-eqn8.png" alt="Eqn 8" title="Equation 8" /> <br />
Equation 8</p>
<p>And there we have it: we have successfully cast the MPC problem as a convex optimization problem in discrete time.</p>
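<p>If we drop the inequality constraints and set M = 0, the condensed problem even has a closed-form minimizer, which makes a nice sanity check. The sketch below is my own simplified illustration under those assumptions, not the formulation from [4] verbatim:</p>

```python
import numpy as np

# Horizon-N prediction matrices (the A_bar / B_bar pattern of Equation 5)
def prediction_matrices(A, B, N):
    n, m = B.shape
    A_bar = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
    B_bar = np.zeros((N * n, N * m))
    for i in range(N):
        for j in range(i + 1):
            B_bar[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, i - j) @ B
    return A_bar, B_bar

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
N, x0 = 10, np.array([1.0, 0.0])
A_bar, B_bar = prediction_matrices(A, B, N)

# Block-diagonal state and input weights (no cross term, i.e. M = 0)
Q_bar = np.eye(2 * N)
R_bar = 0.1 * np.eye(N)

# Unconstrained minimizer of  x_bar' Q_bar x_bar + u' R_bar u  after substituting
# x_bar = A_bar x0 + B_bar u, found by setting the gradient to zero
u_star = -np.linalg.solve(B_bar.T @ Q_bar @ B_bar + R_bar,
                          B_bar.T @ Q_bar @ A_bar @ x0)

cost = lambda u: ((A_bar @ x0 + B_bar @ u) @ Q_bar @ (A_bar @ x0 + B_bar @ u)
                  + u @ R_bar @ u)
assert cost(u_star) <= cost(np.zeros(N))  # beats doing nothing
```

With inequality constraints present, a QP solver would replace the linear solve, but the condensed matrices stay the same.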
<h4 id="references">References:</h4>
<p>[1] “Convex function.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Convex_function">https://en.wikipedia.org/wiki/Convex_function</a> Visited 8 Nov 2020.</p>
<p>[2] By Eli Osherovich - Own work, CC BY-SA 3.0, <a href="https://commons.wikimedia.org/w/index.php?curid=10764763">https://commons.wikimedia.org/w/index.php?curid=10764763</a> Visited 8 Nov 2020.</p>
<p>[3] “Convex optimization.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Convex_optimization">https://en.wikipedia.org/wiki/Convex_optimization</a> Visited 8 Nov 2020.</p>
<p>[4] Wright, S. J. “Efficient Convex Optimization for Linear MPC” in Handbook of Model Predictive Control, pp. 287-303. 2018. <a href="https://link.springer.com/chapter/10.1007/978-3-319-77489-3_13">https://link.springer.com/chapter/10.1007/978-3-319-77489-3_13</a> Visited 8 Nov 2020.</p>

<h1 id="writing-system-dynamics-in-discrete-time">Writing System Dynamics in Discrete Time</h1>
<p>2020-11-08</p>
<p>In my <a href="https://sassafras13.github.io/DiscreteTime/">last post</a>, I introduced the concept of discrete time for digital systems. We discussed difference equations and the z-transform. In this post, we are going to use these ideas to write the standard state space expressions for a system’s dynamics in <strong>discrete time</strong>. That is, we are going to find the matrices phi, gamma and H for the discrete system dynamics as written below [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>This post is going to be a lot of math so I am going to split it up into parts and walk us through all of it. Let’s go!</p>
<h2 id="solving-for-x-in-the-dynamics">Solving for x in the Dynamics</h2>
<p>Since our objective is to write the system dynamics in discrete time, it makes sense to start with the known system dynamics in continuous time. We are trying to solve for x in the following system of equations in continuous time [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>(The w term contains any disturbances we might see in the system.) We can solve for x in two parts by considering two possible cases: (1) we compute x given initial conditions and assuming that there is no external input, and (2) we also compute x if there is an external input given [1]. Let’s look at each case in turn.</p>
<h3 id="1---find-x-from-initial-conditions-u--0">1 - Find x from Initial Conditions, u = 0</h3>
<p>Here we begin by assuming that the solution to the dynamics when u = 0 takes the form of some homogeneous equation like so [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn3.png" alt="Eqn 3" title="Equation 3" /> <br />
Equation 3</p>
<p>We are also going to assume that the homogeneous equation is smooth enough to be approximated with a series expansion that looks like this [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn4.png" alt="Eqn 4" title="Equation 4" /> <br />
Equation 4</p>
<p>If I calculate the derivative of Equation 4, I can substitute it into Equation 3 on the left hand side. This gives me the following [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn5.png" alt="Eqn 5" title="Equation 5" /> <br />
Equation 5</p>
<p>Okay, the next step is a little confusing to me, and if someone can explain why the math works this way, I would really appreciate hearing about it. The authors in [1] state that we differentiate Equation 5 again and end up with this [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn6.png" alt="Eqn 6" title="Equation 6" /> <br />
Equation 6</p>
<p>Again, I am not sure how we get here (one standard route is to match coefficients of like powers of <em>t</em> on both sides, which forces the series coefficients into the exponential form), but it turns out that the sequence inside the brackets is a series approximation for a matrix exponential. In other words [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn7.png" alt="Eqn 7" title="Equation 7" /> <br />
Equation 7</p>
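<p>We can check numerically that the truncated series really does converge to the matrix exponential. The sketch below is my own, using SciPy's expm as the reference implementation:</p>

```python
import numpy as np
from scipy.linalg import expm

def expm_series(F, t, terms=30):
    """Truncated series I + Ft + (Ft)^2/2! + ... (the bracketed sequence)."""
    Ft = F * t
    out, term = np.eye(F.shape[0]), np.eye(F.shape[0])
    for k in range(1, terms):
        term = term @ Ft / k  # next term is (Ft)^k / k!
        out = out + term
    return out

F = np.array([[0.0, 1.0], [-2.0, -3.0]])  # an example system matrix
assert np.allclose(expm_series(F, 0.5), expm(F * 0.5))
```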
<p>Equation 7 is a unique solution to the original dynamics in Equation 2, and it has some interesting properties that we can exploit later when we are solving for x in the presence of an external input. Let’s see how that works.</p>
<h3 id="special-properties-of-the-homogeneous-solution-for-no-external-input">Special Properties of the Homogeneous Solution for No External Input</h3>
<p>Let’s consider what the value of x, as expressed in Equation 7, looks like when <em>t</em> takes on two different values. We find that [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn8.png" alt="Eqn 8" title="Equation 8" /> <br />
Equation 8</p>
<p>Since t0 is arbitrary, I can rewrite x at time t2 as [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn9.png" alt="Eqn 9" title="Equation 9" /> <br />
Equation 9</p>
<p>Notice that I can substitute the expression for x at t1 into the expression for x at t2 [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn10.png" alt="Eqn 10" title="Equation 10" /> <br />
Equation 10</p>
<p>Since the solution for x at t2 is unique, the two expressions, Equations 9 and 10, must be the same [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn11.png" alt="Eqn 11" title="Equation 11" /> <br />
Equation 11</p>
<p>Finally, if I were to set t2 = t0, then I would find that [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn12.png" alt="Eqn 12" title="Equation 12" /> <br />
Equation 12</p>
<p>This means that we can find the inverse of exp(Ft) by changing the sign of <em>t</em> [1]. This might seem random, but it is going to come in useful in the next section, where we solve for x in the presence of an external input.</p>
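<p>This inverse property is easy to confirm numerically (a quick check of my own):</p>

```python
import numpy as np
from scipy.linalg import expm

F = np.array([[0.0, 1.0], [-2.0, -3.0]])  # an example system matrix
t = 0.7

# exp(Ft) @ exp(-Ft) = I, so flipping the sign of t gives the inverse
assert np.allclose(expm(F * t) @ expm(-F * t), np.eye(2))
```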
<h3 id="2---find-x-when-u-is-nonzero">2 - Find x When u is Nonzero</h3>
<p>In this section, we are going to find the second half of the solution to x, assuming that there is a nonzero external input. We will use a technique called <strong>variation of parameters</strong> to find the solution [1]. We begin by assuming that the solution to x is in the form [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn13.png" alt="Eqn 13" title="Equation 13" /> <br />
Equation 13</p>
<p>Where v(t) is a vector of variable parameters [1]. We can substitute Equation 13 into Equation 2 to get the following [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn14.png" alt="Eqn 14" title="Equation 14" /> <br />
Equation 14</p>
<p>We are assuming that there is no control input before time t0, so we can integrate the derivative of v from t0 to time <em>t</em> [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn15.png" alt="Eqn 15" title="Equation 15" /> <br />
Equation 15</p>
<p>I’m using tau as a dummy variable for time here since I’ve already used <em>t</em> in the integral definition and I don’t want to overload notation. Now I can substitute this integral into my solution for x in Equation 13 [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn16.png" alt="Eqn 16" title="Equation 16" /> <br />
Equation 16</p>
<p>And now I’m going to use the interesting property that we laid out in the previous section. I can rewrite the exponents in Equation 16 as one term as follows [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn17.png" alt="Eqn 17" title="Equation 17" /> <br />
Equation 17</p>
<p>Now this gives me a new form for my solution [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn18.png" alt="Eqn 18" title="Equation 18" /> <br />
Equation 18</p>
<p>I now have my second half to my solution for x. In the next section, we are going to pull this all together and find expressions for the matrices phi, gamma and H from Equation 1.</p>
<h3 id="pulling-it-all-together">Pulling It All Together</h3>
<p>My total solution (neglecting disturbances) can now be written as [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn19.png" alt="Eqn 19" title="Equation 19" /> <br />
Equation 19</p>
<p>Remember, the whole purpose of this blog post is to write the dynamics in discrete time. Here is where we are going to make the switch from continuous to discrete time. We are going to rewrite Equation 19 as a difference equation by letting t = kT + T and t0 = kT. Then we can say [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn20.png" alt="Eqn 20" title="Equation 20" /> <br />
Equation 20</p>
<p>Notice that Equation 20 is not dependent on the type of hold we use to sample the output [1]. The control input, u, is specified in terms of the continuous time history over the sample interval, whatever that may be, but it is independent of the type of hold that spans the sample interval [1]. If we assume we are using a zero-order hold (ZOH), with no delay, then we can rewrite u as [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn21.png" alt="Eqn 21" title="Equation 21" /> <br />
Equation 21</p>
<p>Now if we want to solve the integral for the case where we are using a ZOH with no delay, we can use a change of variables from tau to eta [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn22.png" alt="Eqn 22" title="Equation 22" /> <br />
Equation 22</p>
<p>Notice that we were able to pull out the <em>G u(kT)</em> term because it is a constant over the time interval.</p>
<p>We are almost there. If I define phi and gamma carefully, I can now cast Equation 22 in the form we were originally looking for (Equation 1) [1]:</p>
<p><img src="/images/2020-11-08-DiscTimeSS-eqn23.png" alt="Eqn 23" title="Equation 23" /> <br />
Equation 23</p>
<p>And we’re done! We have successfully derived the mathematics that we use to rewrite dynamics in continuous time into dynamics in discrete time.</p>
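<p>As a numerical sanity check (my own, not from [1]), we can compute phi and gamma for a double integrator using a standard augmented-matrix trick, and compare against the well-known closed-form answer for that system:</p>

```python
import numpy as np
from scipy.linalg import expm

# Continuous-time double integrator: x = [position, velocity]
F = np.array([[0.0, 1.0], [0.0, 0.0]])
G = np.array([[0.0], [1.0]])
T = 0.1  # sample period

# Exponentiate the augmented matrix [[F, G], [0, 0]] * T; its top blocks
# are exactly phi = exp(FT) and gamma = (integral of exp(F*eta) d_eta) @ G
M = np.zeros((3, 3))
M[:2, :2], M[:2, 2:] = F, G
E = expm(M * T)
phi, gamma = E[:2, :2], E[:2, 2:]

# Closed form for this system: phi = [[1, T], [0, 1]], gamma = [[T^2/2], [T]]
assert np.allclose(phi, [[1, T], [0, 1]])
assert np.allclose(gamma, [[T**2 / 2], [T]])
```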
<p>Unfortunately (I hate to say this after all of our hard work), in practice we never need to actually do this math ourselves. If you are working with controls libraries like MATLAB’s Control System Toolbox, then you can use functions like c2d() to do this conversion automatically. But I will say that it is always good to understand what is happening under the hood, and now we have a very good idea of what must be done to switch from continuous to discrete time.</p>
<h4 id="references">References:</h4>
<p>[1] Franklin, G., Powell, J., Workman, M. “Digital Control of Dynamic Systems, 3rd Ed.” Ellis-Kagle Press, Half Moon Bay, CA, 1998.</p>

<h1 id="review-of-scene-graph-generation-by-iterative-message-passing">Review of “Scene Graph Generation by Iterative Message Passing” by Xu et al.</h1>
<p>2020-11-08</p>
<p>Recently I have been interested in constructing <strong>scene graphs</strong> directly from raw image data for a research project. Scene graphs are a representation of the objects in an image, as well as their relationships [1]. Typically, a scene graph represents the objects as nodes and joins these objects together using edges and nodes that represent specific types of relationships [1]. It can be useful to construct graphs from images so that we can use the graph information as part of a larger AI solution that operates on graphs. In this post we are going to focus on a particular method for deriving scene graphs from images as proposed by Prof. Fei-Fei Li’s group at Stanford in “Scene Graph Generation by Iterative Message Passing” by Xu et al. [1].</p>
<h2 id="overview">Overview</h2>
<p><img src="/images/2020-11-08-SceneGraphs-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [1]</p>
<p>In this paper, our objective is to take as input a single image, and to return as output a scene graph that completely describes the objects in the image and their relationships [1]. This concept is illustrated in Figure 1. The proposed solution uses several concepts from artificial intelligence, computer vision and Bayesian probability. Xu et al. use a standard <strong>recurrent neural network (RNN)</strong> to infer the structure of the graph, using message passing to improve its predictions of the scene graph over multiple iterations [1]. They also use a <strong>joint inference model</strong> that leverages image context to improve predictions [1]. An overview of the model architecture is shown in Figure 2. The entire solution is an end-to-end model*1 which returns a scene graph with object categories, the bounding boxes of the objects found in the image, and the relationships between pairs of objects [1].</p>
<p><img src="/images/2020-11-08-SceneGraphs-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [1]</p>
<h2 id="introducing-bipartite-graphs-grus-and-message-passing">Introducing Bipartite Graphs, GRUs and Message Passing</h2>
<p>The scene graph that is built during this process can be described as <strong>bipartite</strong>, which means that the nodes in the graph can be divided into two distinct sets, and every edge joins a node from each set [3]. In this context, the scene graph is bipartite because it contains two types of nodes: nodes that represent objects (horse, mountain, face) and nodes that represent relationships (of, riding, wearing) [1].</p>
<p>Xu et al. explain that their approach to performing <strong>dense graph inference</strong> is inspired by work by Zheng et al. [4] which uses RNNs to complete the inference [1]. Let me briefly try to explain what I mean by dense graph inference: first, we are assuming that most of these scene graphs are going to be dense - that is, that there will be many connections, or relationships, between objects in the graph. Secondly, we are trying to infer what the structure and contents of the scene graph should be using probability - we will go into more detail on this point in a subsequent section.</p>
<p>The point that Xu et al. are trying to make here is that while the previous work by Zheng et al. also used RNNs, the prior work used custom RNN layers, while this work by Xu et al. chose to use a generic RNN layer, called a <strong>gated recurrent unit (GRU)</strong>, instead [1]. The advantage, they argue, is that Xu et al.’s architecture is more flexible and general-purpose than the earlier work by Zheng et al. [1].</p>
<p>Let’s briefly review what a RNN and a GRU are. A recurrent neural network is a type of neural network that is typically used to process sequential data [5]. There are different types of RNNs, and one type in particular is the GRU, developed by Cho et al. in 2014 [5]. The GRU is a unit that takes in input at a given time step, and combines it with knowledge derived from prior time steps (called the hidden state) and updates that hidden state at the current time point [5]. A GIF prepared by Raimi Karim is shown below, and he recommends reading <a href="https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21">this post</a> by Michael Phi to learn more about RNNs if you’re interested.</p>
<p><img src="/images/2020-11-08-SceneGraphs-fig3.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Source: [5]</p>
<p>The tricky part to understand here is how the GRUs are used to build the scene graph and perform message passing over the graph. Xu et al. explain that each node and edge in the scene graph store their “internal states” inside a corresponding GRU unit [1]. So there is one GRU unit per node and per edge in the scene graph. We also note that all the node GRUs share the same set of weights, and similarly all the edge GRUs also share the same set of weights (but a different set from the node GRUs) [1]. Information is shared by passing messages between GRUs (more on that later) [1].</p>
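<p>To make the GRU update concrete, here is a minimal NumPy sketch of one common parameterization of the GRU equations (from Cho et al.); the exact parameterization used by Xu et al. may differ, and all of the names and sizes below are my own illustration:</p>

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def gru_cell(x, h, params):
    """One GRU step: mix the new input x into the previous hidden state h."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate: how much to overwrite
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate: how much history to use
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate hidden state
    return (1 - z) * h + z * h_tilde          # interpolate old and candidate states

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_h), (d_h, d_h)] * 3]  # Wz, Uz, Wr, Ur, Wh, Uh

h = np.zeros(d_h)
for t in range(5):  # feed a short input sequence, updating the hidden state
    h = gru_cell(rng.standard_normal(d_in), h, params)
assert h.shape == (d_h,)
```

In Xu et al.'s setup, each node and edge of the scene graph would hold its own hidden state h, with weights shared across all node GRUs and (separately) across all edge GRUs.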
<p>The bipartite graph structure is going to come into play in this message-passing process, too. Since, by definition, our bipartite scene graph will only have edges connecting object nodes to relationship nodes (and vice versa), we know that there is no message passing inside the set of object nodes or the set of relationship nodes [1]. This means that we can think of the subset of object nodes as the <strong>dual graph</strong>*2 of the subset of relationship nodes [1].</p>
<p>In a slightly confusing use of vocabulary, Xu et al. also present the message passing problem as a “primal-dual graph” structure, where the primal graph represents the pathways for messages to move from <strong>edge GRUs</strong> to <strong>node GRUs</strong>, and the dual graph represents the pathways for message passing from <strong>node GRUs</strong> to <strong>edge GRUs</strong> [1]. I’m confused because I think the “primal-dual” terminology is often used to mean other things than what is happening here, but maybe I just don’t have enough of a graph theory background to really grasp this. Either way, this concept is illustrated in more detail in Figure 5.</p>
<p>Now that we have introduced the basic architecture used by Xu et al., let’s dive deeper into the mathematics of the graph inference problem.</p>
<h2 id="the-graph-inference-problem">The Graph Inference Problem</h2>
<p>In this section we are going to see how we take the input image, apply bounding boxes, and generate the object class and relationship type for each object and pair of objects, respectively. To obtain preliminary bounding boxes, the authors use a Region Proposal Network to get a set of proposed bounding boxes as a starting point for the inference model [1].</p>
<p>The authors take in the proposed bounding boxes and infer object classes for each identified object. They also find offsets for the proposed bounding boxes to refine the size and location of each box. Lastly, they identify relationships between each pair of objects. If we define the set of object classes as <em>C</em> and the set of relationship types as <em>R</em>, then we can write [1]:</p>
<p><img src="/images/2020-11-08-SceneGraphs-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>Ultimately, our goal is to find an optimal set of variables, x*, which maximizes this probability [1]:</p>
<p><img src="/images/2020-11-08-SceneGraphs-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>In the next section, we will see how we can approximate this inference using GRUs to perform iterative message passing [1].</p>
<h2 id="approximating-graph-inference-using-grus-and-message-passing">Approximating Graph Inference using GRUs and Message Passing</h2>
<p>Xu et al. use an approach called “mean field” to approximate the graph inference we described in Equation 2 above [1]. I am not sure exactly how “mean field” works here (roughly speaking, a mean field approximation replaces a complicated joint distribution with a product of simpler per-variable distributions), but we can discuss the mechanics of this approach regardless. First, we define the probability of each variable, x, as [1]:</p>
<p><img src="/images/2020-11-08-SceneGraphs-eqn3.png" alt="Eqn 3" title="Equation 3" /> <br />
Equation 3</p>
<p>The nodes and edges each have some hidden states that are computed by the corresponding GRUs [1]. Together all of these computations form an expression for the “mean field distribution” which can be written as [1]:</p>
<p><img src="/images/2020-11-08-SceneGraphs-eqn4.png" alt="Eqn 4" title="Equation 4" /> <br />
Equation 4</p>
<p>Please note that the term <strong>visual feature</strong> refers to the proposed bounding box in the case of individual nodes, and it refers to the union box that fits over the bounding boxes for the nodes i and j in the case of individual edges [1].</p>
<p>Earlier we discussed the concept of a primal-dual graph structure, where the primal graph describes how messages flow from edges to nodes, and the dual graph describes the flow of information from nodes to edges. Xu et al. point out that by identifying this unique structure, we can reduce the number of computations that we have to perform as compared to processing all connections that are present on a generic dense graph [1].</p>
<p><img src="/images/2020-11-08-SceneGraphs-fig5.png" alt="Fig 5" title="Figure 5" /> <br />
Figure 5 - Source: [1]</p>
<p>Since every node GRU could be receiving inputs over multiple edges (see Figure 5), we need to find a way to <strong>pool</strong> or <strong>aggregate</strong> all the incoming messages [1]. Xu et al. use a weighted function so that they can learn weights for each individual incoming message - this helps the model learn which information is more valuable in computing the graph inference [1]. We can write expressions for the messages that are pooled for the nodes and edges as follows [1]:</p>
<p><img src="/images/2020-11-08-SceneGraphs-eqn5.png" alt="Eqn 5" title="Equation 5" /> <br />
Equation 5</p>
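<p>To illustrate the idea of weighted pooling, here is a generic attention-style sketch of my own (not the paper's exact equations): several incoming edge messages are aggregated into one vector for a node, with normalized scalar weights standing in for the learned weighting:</p>

```python
import numpy as np

def pool_messages(messages, scores):
    """Weighted aggregation of incoming messages (softmax over scalar scores)."""
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    w = w / w.sum()                    # normalized, non-negative weights
    return w @ messages                # weighted sum: one pooled vector out

rng = np.random.default_rng(2)
messages = rng.standard_normal((4, 8))  # 4 incoming edge messages, dimension 8
scores = rng.standard_normal(4)         # one relevance score per message
pooled = pool_messages(messages, scores)
assert pooled.shape == (8,)
```

In the paper, the weights are learned so that the model can emphasize the most informative incoming messages.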
<p>This concludes the details of the theory behind scene graph generation as presented in Xu et al. Please do consider reading the paper in its entirety - there is much more information available than I covered here.</p>
<h4 id="footnotes">Footnotes:</h4>
<p>*1 The expression “end-to-end” just means that the model takes in some input and returns the thing we are looking for, without any post processing [2]. In other words, it is the complete solution [2].</p>
<p>*2 A dual graph is a complementary graph that places a node at every face of the graph it is complementing [6]. Consider Figure 4 as an illustration of this concept.</p>
<p><img src="/images/2020-11-08-SceneGraphs-fig4.png" alt="Fig 4" title="Figure 4" /> <br />
Figure 4 - Source: [6]</p>
<h4 id="references">References:</h4>
<p>[1] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene Graph Generation by Iterative Message Passing.” <a href="https://arxiv.org/abs/1701.02426">https://arxiv.org/abs/1701.02426</a> Visited 8 Nov 2020.</p>
<p>[2] “What does end to end mean in deep learning methods?” Quora. <a href="https://www.quora.com/What-does-end-to-end-mean-in-deep-learning-methods">https://www.quora.com/What-does-end-to-end-mean-in-deep-learning-methods</a> Visited 8 Nov 2020.</p>
<p>[3] “Bipartite graphs.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Bipartite_graph">https://en.wikipedia.org/wiki/Bipartite_graph</a> Visited 8 Nov 2020.</p>
<p>[4] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015. <a href="https://arxiv.org/abs/1502.03240">https://arxiv.org/abs/1502.03240</a> Visited 8 Nov 2020.</p>
<p>[5] Karim, R. “Animated RNN, LSTM and GRU.” 14 Dec 2018. Towards Data Science on Medium. <a href="https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45">https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45</a> Visited 8 Nov 2020.</p>
<p>[6] “Dual graph.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Dual_graph">https://en.wikipedia.org/wiki/Dual_graph</a> Visited 8 Nov 2020.</p>

<h1 id="welcome-to-discrete-time">Welcome to Discrete Time</h1>
<p>2020-11-01</p>
<p>Our lab is working on implementing model predictive control (MPC) for an ongoing project, and so I have been learning the finer points of MPC in order to build this implementation. Since MPC is usually implemented with a computer, it is helpful to understand something about discrete time and how it affects our control algorithms. In this post, I am going to review some of the basics to help inform our understanding of MPC. I will start by explaining why we care about discrete time, the fundamentals of sampling in discrete time, and then introduce difference equations and (briefly) the z-transform. This is going to set up a discussion of how we can write a full dynamics model in discrete time in a follow-on post.</p>
<h2 id="what-is-discrete-time-control-and-why-do-i-need-it-for-mpc">What is discrete time control and why do I need it for MPC?</h2>
<p>Usually when we start learning about control theory, we think in terms of continuous time. Continuous time assumes that we can evaluate a variable at any instant in time [1]. By contrast, discrete time assumes that we sample a variable at regular instants, and that between samples the value of the variable is held constant. Consider Figure 1, where continuous time is represented by the smoothly varying gray line, and discrete time is represented by the samples taken at the points indicated by the red arrows.</p>
<p><img src="/images/2020-11-01-DiscreteTime-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [2]</p>
<p>When we study control theory in continuous time, we use analytical expressions to describe our models that we can evaluate at any point in time. We can compute derivatives and integrals analytically, as well. We even learn to implement basic controllers like PID in continuous time, because it is possible to build PID controllers using only passive electronic components (resistors, capacitors, etc). Since these PID controllers are driven only by electric current (and not a computer chip), they will also operate in continuous time and the assumption of continuous time still holds.</p>
<p>But what happens when we need to implement MPC control? Can we continue to assume that we are operating in continuous time? No, we cannot, because MPC control requires a digital controller that can perform convex optimization. This means that we have to rewrite the mathematics behind MPC to accommodate the fact that we are working in discrete time, not continuous time. We have to take into account the fact that the computer is going to be sampling some signal from the plant at regular, discrete intervals, not continuously, and sending out control commands at regular intervals, too.</p>
<p>This difference is also shown in Figures 2 (continuous time control) and 3 (discrete time control). You can see that in continuous time, our error, output, control and reference signals are all in terms of continuous time, <em>t</em>, but in discrete time they are in terms of specific samples, <em>k</em>.</p>
<p><img src="/images/2020-11-01-DiscreteTime-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [3]</p>
<p><img src="/images/2020-11-01-DiscreteTime-fig3.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Source: [3]</p>
<h2 id="basics-of-how-controllers-sample-in-discrete-time">Basics of How Controllers Sample in Discrete Time</h2>
<p>As you can see in Figure 3, we are using Analog-to-Digital (A/D) and Digital-to-Analog (D/A) converters to transition between discrete (digital) and continuous (analog) time [4]. Let’s think about how these converters might work in practice, starting with the A/D converter. This converter is operating on some physical signal, like a voltage taken from a temperature sensor. The A/D converter changes the voltage signal to some binary number that represents the voltage value. Depending on the resolution of the converter, the binary number will have a certain number of bits, often 10-12 bits. Let’s say the converter is a 10-bit converter. That means that it can encode 2^10 = 1024 discrete values. This is also like saying that we have a resolution of about 0.1%, because each discrete value represents roughly 1/1000th (more precisely, 1/1024th) of the full-scale range [4].</p>
<p>The A/D converter, therefore, is converting the voltage signal to a binary number at regular intervals. More specifically, the converter has some sampling period, T, and we can say that the sampled signal is in terms of these sampling periods, i.e. y(kT), where k is the integer number of samples we have taken so far. We can also shorten this to just write y(k), as shown in Figure 3, to indicate that we are working in discrete time now [4].</p>
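<p>To make the resolution arithmetic above concrete, here is a minimal Python sketch of a hypothetical 10-bit A/D conversion (the 0 to 5 V sensor range is an assumption for illustration, not part of the original example):</p>

```python
n_bits = 10
levels = 2 ** n_bits               # 1024 discrete codes
full_scale = 5.0                   # assumed sensor range: 0 to 5 V
resolution = full_scale / levels   # volts per code, roughly 0.1% of full scale

def adc(voltage):
    """Quantize an analog voltage to the nearest 10-bit code."""
    code = round(voltage / full_scale * (levels - 1))
    return max(0, min(levels - 1, code))  # clamp to the valid code range
```

<p>With 10 bits, each code covers about 4.9 mV of the assumed 5 V range, which matches the roughly 0.1% resolution figure above.</p>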
<p>Similarly, the D/A converter is changing the discrete values output by the MPC controller into continuous time signals. Typically, this is done using something called a zero-order hold (more on this later), which just means that at regular intervals, we send a new command, u(kT), to the plant, and hold that value until the next sample time, when we send a different command. The zero-order hold just means that we hold the value of u(kT) constant over the sample period [4].</p>
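<p>The zero-order hold itself is simple to sketch in code: between sample times, the reconstructed signal just repeats the most recent sample. A minimal Python illustration (not any particular library’s API):</p>

```python
def zero_order_hold(samples, T, t):
    """Return the value a D/A converter with a zero-order hold would output
    at continuous time t, given samples u(kT) and sampling period T."""
    k = min(int(t // T), len(samples) - 1)  # index of the most recent sample
    return samples[k]
```

<p>Between kT and (k+1)T the output is constant at samples[k], which produces exactly the staircase behavior described above.</p>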
<p>Now that we’ve briefly seen how we can change between discrete and continuous time, let’s discuss how a computer operates in discrete/digital time using difference equations.</p>
<h2 id="difference-equations">Difference Equations</h2>
<p>As we have <a href="https://sassafras13.github.io/ControlTheoryBasics/">seen before</a>, our state space model for a dynamic system gives the model in terms of the derivative of the state variables, <em>x</em>. We need a way to write this derivative in terms of basic operations that a computer can perform, like addition, subtraction and multiplication, given a discrete set of samples. Once we have found a way to compute derivatives using only sampled, discrete points from a continuous-time plant, we will be one step closer to understanding how we must rewrite our continuous dynamics model into discrete time [4].</p>
<p>Recall that the definition of a derivative is [4]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>We can use Euler’s method to calculate an approximation of the derivative like so [4]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>Note that Equation 2 uses the forward difference method to compute the derivative, but we can also use the backward or central difference methods (central difference is generally recommended in practice, but the idea is the same in this discussion) [4].</p>
<p>Notice that Equation 2 is written solely in terms of values that the digital controller has access to: specifically, sampled points in time and the sampling period, T. This means that any time we encounter a derivative within the MPC implementation, we can now use Equation 2 to compute it. And if our system’s bandwidth is small and our sampling rate is sufficiently fast, we can assume that our approximation errors will be very small [4]. Let’s see how Equation 2 could be used to rewrite the dynamics model in discrete time [4]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn3.png" alt="Eqn 3" title="Equation 3" /> <br />
Equation 3</p>
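<p>As a sketch of this idea, substituting the forward difference into a linear model dx/dt = Ax + Bu gives the update x[k+1] = x[k] + T(Ax[k] + Bu[k]). A minimal Python version (the double-integrator A and B matrices are just an example, not from the original post):</p>

```python
def euler_step(A, B, x, u, T):
    """One forward-Euler update: x[k+1] = x[k] + T * (A x[k] + B u[k])."""
    n = len(x)
    return [x[i]
            + T * (sum(A[i][j] * x[j] for j in range(n))
                   + sum(B[i][j] * u[j] for j in range(len(u))))
            for i in range(n)]

# Example: double integrator (states: position, velocity) with unit force input.
A = [[0.0, 1.0], [0.0, 0.0]]
B = [[0.0], [1.0]]
x_next = euler_step(A, B, [0.0, 0.0], [1.0], 0.1)  # the velocity state responds first
```

<p>Repeatedly calling euler_step with the sampled inputs marches the discrete-time model forward one sampling period at a time.</p>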
<p>In general, sampling introduces a small delay of T/2 into our system. This delay can be approximated by a first-order Pade approximation as [4]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn4.png" alt="Eqn 4" title="Equation 4" /> <br />
Equation 4</p>
<p>Note that the sampling delay adds phase lag, reducing the system’s phase margin. More intuitively, we can say that delays reduce system damping and system stability. In the next section, we will talk about how we can start to introduce the sampling delay into our dynamics models using the z-transform [4].</p>
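<p>Taking the standard first-order Pade form of a T/2 delay, e^(-sT/2) ≈ (1 - sT/4)/(1 + sT/4), we can check numerically in Python that the approximation tracks the exact delay at low frequencies (the particular T and frequency values below are arbitrary):</p>

```python
import cmath

def delay_exact(w, T):
    """Frequency response of the exact T/2 sampling delay, e^(-sT/2), at s = jw."""
    return cmath.exp(-1j * w * T / 2)

def delay_pade(w, T):
    """First-order Pade approximation (1 - sT/4) / (1 + sT/4) at s = jw."""
    s = 1j * w
    return (1 - s * T / 4) / (1 + s * T / 4)
```

<p>Both responses have unit magnitude at every frequency, so all of the approximation error - and all of the delay’s effect on the system - shows up as phase lag.</p>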
<h2 id="the-z-transform">The Z-Transform</h2>
<p>In this last section, we are going to introduce the z-transform. So far we have seen that in continuous time, we consider signals as a function of continuous time, <em>t</em>, such as y(t). Similarly, we have seen that we can consider discrete-time signals as functions of <em>k</em>, such as y(k). We know that in continuous time, we can convert from the time domain (t) to the frequency domain (s) using the Laplace transform. Sometimes we want to analyze a system in the frequency domain because it can be a more intuitive way to think about system stability and controller design. Similarly, the z-transform is a way of converting from <em>k</em> to <em>z</em>, where <em>z</em> is, conceptually, equivalent to <em>s</em> [4].</p>
<p>Recall that the Laplace transform can be written as [5]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn5.png" alt="Eqn 5" title="Equation 5" /> <br />
Equation 5</p>
<p>Similarly, we can write the z-transform as [4]:</p>
<p><img src="/images/2020-11-01-DiscreteTime-eqn6.png" alt="Eqn 6" title="Equation 6" /> <br />
Equation 6</p>
<p>In this case, <em>e</em> represents the error signal into a system, and the bounds on <em>z</em> describe the values of <em>z</em> for which the sum converges [4].</p>
<p>Using this z-transform, we can now convert from the discrete time domain, to the frequency domain in terms of <em>z</em>. This allows us to write transfer functions in terms of <em>z</em>, taking into account the phase shifts caused by sampling delays in discrete time [4].</p>
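<p>As a quick sanity check of the definition, for a geometric signal e(k) = a^k the sum of e(k)z^(-k) converges to the closed form z/(z - a) whenever |z| > |a|. A small Python check (the values of a and z are chosen arbitrarily inside the region of convergence):</p>

```python
def z_transform_partial(e, z, n_terms):
    """Numerically sum E(z) = sum over k of e(k) * z**(-k), truncated to n_terms samples."""
    return sum(e(k) * z ** (-k) for k in range(n_terms))

a, z = 0.5, 2.0                            # |z| > |a|, so the series converges
numeric = z_transform_partial(lambda k: a ** k, z, 200)
closed_form = z / (z - a)                  # geometric series result
```

<p>The partial sum agrees with the closed form to high precision, illustrating the convergence condition on <em>z</em>.</p>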
<p>Now that we have gotten a brief introduction to some of the key concepts in discrete (or digital) time, we will next explore how we can rewrite a dynamics model from continuous time to discrete time. We will then take this model and look at how we can use it in writing the convex optimization formulation for MPC control.</p>
<h4 id="references">References:</h4>
<p>[1] “Discrete time and continuous time.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Discrete_time_and_continuous_time">https://en.wikipedia.org/wiki/Discrete_time_and_continuous_time</a> Visited 01 Nov 2020.</p>
<p>[2] Rbj (author and own work assumed, based on copyright claims). Public Domain. <a href="https://commons.wikimedia.org/w/index.php?curid=870308">https://commons.wikimedia.org/w/index.php?curid=870308</a></p>
<p>[3] “Introduction: Digital Controller Design.” Control Tutorials for MATLAB & Simulink. <a href="https://ctms.engin.umich.edu/CTMS/index.php?example=Introduction&section=ControlDigital">https://ctms.engin.umich.edu/CTMS/index.php?example=Introduction&section=ControlDigital</a> Visited 01 Nov 2020.</p>
<p>[4] Franklin, G., Powell, J., Workman, M. “Digital Control of Dynamic Systems, 3rd Ed.” Ellis-Kagle Press, Half Moon Bay, CA, 1998.</p>
<p>[5] “Laplace transform.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Laplace_transform">https://en.wikipedia.org/wiki/Laplace_transform</a> Visited 01 Nov 2020.</p>Our lab is working on implementing model predictive control (MPC) for an ongoing project, and so I have been learning the finer points of MPC in order to build this implementation. Since MPC is usually implemented with a computer, it is helpful to understand something about discrete time and how it affects our control algorithms. In this post, I am going to review some of the basics to help inform our understanding of MPC. I will start by explaining why we care about discrete time, the fundamentals of sampling in discrete time, and then introduce difference equations and (briefly) the z-transform. This is going to set up a discussion of how we can write a full dynamics model in discrete time in a follow-on post.Docker Best Practices2020-10-20T00:00:00+00:002020-10-20T00:00:00+00:00http://sassafras13.github.io/DockerBestPractices<p>In a <a href="https://sassafras13.github.io/Docker/">previous post</a>, I introduced the basic ideas behind Docker containers and using them on shared workstations. In this post, I want to build on that discussion by talking about some of the best practices recommended by experts when working
with Docker containers. This is going to be a grab bag of topics, so feel free to scan through the headings to find information that is useful to you.</p>
<h2 id="general-advice-from-docker">General Advice from Docker</h2>
<p>The good people at Docker have some best practices that they recommend. First and foremost, they recommend that you do everything you can to keep your images as small as possible [1]. A good way to do this right off the bat is to be careful in what you choose for your base image - for example, don’t choose the official image of all of Ubuntu 18.04 if you just need a base image of Python3 [1].</p>
<p>It is also possible to use multistage builds that draw from multiple bases if needed [1]. I think this discussion can get more complex so I recommend looking at Docker’s documentation if this sounds useful to you.</p>
<p>Docker engineers also recommend minimizing the number of separate RUN commands that you write in your Dockerfile [1]. I believe the reason for this is that every new RUN command adds a <strong>layer</strong> to your image, which makes the images heavier and less efficient [1]. If you combine RUN commands by using && and other bash tricks, then you can reduce the number of layers without loss in functionality [1].</p>
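<p>As a quick sketch (the package here is just an example), combining RUN commands with && collapses several layers into one:</p>

```dockerfile
# Two RUN commands create two separate image layers:
RUN apt-get update
RUN apt-get install -y curl

# One combined RUN command produces a single layer with the same result:
RUN apt-get update && apt-get install -y curl
```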
<p>If you have several Docker images that share some layers in common, consider turning the common layers into a base image that you draw from repeatedly [1]. This also makes your child images lighter and it makes your code modular [1].</p>
<h2 id="specifying-requirements">Specifying Requirements</h2>
<p>Part of the build process for a Docker container involves installing all of the necessary packages for your application. It is best practice to be very specific about which versions you want to install so that as packages get updated, you continue to install the correct versions that are compatible with your application [2]. For example, you don’t want to just install the latest version of Python3 - you should really specify that you want Python 3.6.9 if that’s what you have been using.</p>
<p>A cool trick for building a well-specified list of requirements is to use <a href="https://github.com/jazzband/pip-tools">pip-compile</a>. You can build a general list of the packages you need in a file named “requirements.in” and send this to pip-compile to build a list of all dependent packages and their current versions. The command is [2]:</p>
<p>pip-compile requirements.in > requirements.txt</p>
<h2 id="writing-good-dockerfiles">Writing Good Dockerfiles</h2>
<p>After building the same Docker image a couple of times, you may have noticed that the process of installing all the packages can really slow down the process. Moreover, it doesn’t really make sense to have to re-install the packages every time you rebuild an image if you haven’t changed the dependencies. Docker implemented <strong>caching</strong> to help speed up the build process and save packages from a previous build if they haven’t been changed [3].</p>
<p>Let’s look at how a Dockerfile is processed to understand how caching works. As we mentioned above, every instruction in a Dockerfile is another layer that gets added to the base image. So as the daemon works through the layers of the Docker image, it applies the following rules [3]:</p>
<p><strong>1.</strong> If the text of the command has not changed, then it will use the cached version. <br />
<strong>2.</strong> If the files in a COPY command have not changed, then it will use the cached version. <br />
<strong>3.</strong> If the cache cannot be used for a given layer, then <strong>none</strong> of the subsequent layers will be loaded from the cache.</p>
<p>We can be strategic in how we structure our Dockerfile so that we only introduce new files when they are absolutely necessary for the Dockerfile to continue [3]. Another way to think about it is to introduce files and commands that are likely to change near the <strong>end</strong> of the Dockerfile so that we can draw on the cache for as long as possible. Let’s consider the Dockerfile shown below.</p>
<p><img src="/images/2020-10-20-DockerBestPractices-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [3]</p>
<p>This is not written according to best practice because we are going to copy all the files over very early in our Dockerfile. So if I change something in one of my files, the build rules will force me to stop using the cache after this layer and I will be forced to repeat the installation process for all my dependencies, even if none of my requirements have changed [3]. A better approach is shown in Figure 2.</p>
<p><img src="/images/2020-10-20-DockerBestPractices-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [3]</p>
<p>Here we first copy <em>just</em> the requirements.txt file that tells us what packages we need to install. Once those dependencies have been installed, we then copy the rest of the necessary files over. This is a much better approach because now, if I make changes to “server.py” and rebuild the image, I can use the cache to speed through the build process until I need to recopy “server.py”. If I have a lot of dependencies, this can save me a lot of time [3].</p>
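<p>As a sketch of this pattern (the base image and file names are placeholders mirroring Figure 2), a cache-friendly Dockerfile might look like this:</p>

```dockerfile
FROM python:3.6-slim
WORKDIR /app

# Copy only the dependency list first, so this layer and the install
# step below stay cached as long as requirements.txt is unchanged.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Application code changes often, so copy it last.
COPY server.py .
CMD ["python", "server.py"]
```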
<h2 id="data-storage-with-bind-mounts-and-volumes">Data Storage with Bind-Mounts and Volumes</h2>
<p>Docker has several ways of storing persistent*1 data that can be used by containers during runtime [4]. The preferred tool is to use a <strong>volume</strong> - Docker argues that this approach is easy to back up, migrate and use safely between multiple containers [4]. The volume itself is stored on the host workstation in the area dedicated to running Docker infrastructure, and can be easily mounted to any container [4]. The volume is managed by Docker [4]. I have to be honest, though, and admit that, although I have figured out how to create a volume, I have not figured out how to populate it with data (such as a training dataset) so I will save that for a future blog post when I figure it out.</p>
<p>The alternative to using volumes, although not preferred, is to use a <strong>bind-mount</strong> [4]. The difference between a bind-mount and a volume is that a bind-mount simply points the container to a directory located elsewhere on the host computer [4]. This is an easier setup for me to understand, but I believe it is not preferred by Docker because the directory that gets mounted to the Docker container is not managed by Docker [5]. Also, you need to reference the bind-mount using an absolute path which is dependent on the specific file structure of the host computer - if you want to be able to transfer your image between workstations, you will need to edit any references to the bind-mount accordingly [5].</p>
<p>One other thing to note about using either bind-mounts or volumes is that there are two flags you can use to tell Docker to use one of these storage devices [5]. The older flag is “-v”, but the newer flag “--mount” is supposed to be easier to read and configure [5]. I have been using “--mount” and I have to agree that I prefer it because you can modify it with readable key-value pairs like “type=bind” [5].</p>
<h4 id="footnotes">Footnotes:</h4>
<p>*1 Persistent means that the data remains on the host computer even after the container has stopped and has been removed from the hard disk.</p>
<h4 id="references">References:</h4>
<p>[1] “Docker development best practices.” Docker. <a href="https://docs.docker.com/develop/dev-best-practices/">https://docs.docker.com/develop/dev-best-practices/</a> Visited 20 Oct 2020.</p>
<p>[2] Turner-Trauring, I. “Broken by default: why you should avoid most Dockerfile examples.” Python Speed. 27 Mar 2020. <a href="https://pythonspeed.com/articles/dockerizing-python-is-hard/">https://pythonspeed.com/articles/dockerizing-python-is-hard/</a> Visited 20 Oct 2020.</p>
<p>[3] Turner-Trauring, I. “Faster or slower: the basics of Docker build caching.” Python Speed. 6 Aug 2020. <a href="https://pythonspeed.com/articles/docker-caching-model/">https://pythonspeed.com/articles/docker-caching-model/</a> Visited 20 Oct 2020.</p>
<p>[4] “Use volumes.” Docker. <a href="https://docs.docker.com/storage/volumes/">https://docs.docker.com/storage/volumes/</a> Visited 21 Oct 2020.</p>
<p>[5] “Use bind mounts.” Docker. <a href="https://docs.docker.com/storage/bind-mounts/">https://docs.docker.com/storage/bind-mounts/</a> Visited 21 Oct 2020.</p>In a previous post, I introduced the basic ideas behind Docker containers and using them on shared workstations. In this post, I want to build on that discussion by talking about some of the best practices recommended by experts when working with Docker containers. This is going to be a grab bag of topics, so feel free to scan through the headings to find information that is useful to you.Docker Basics2020-10-19T00:00:00+00:002020-10-19T00:00:00+00:00http://sassafras13.github.io/Docker<p>In this post we are going to introduce a useful tool called Docker containers. I came across these recently when I decided to finally move my deep learning model off Google Colab and onto a computer equipped with GPUs.*1 The <a href="http://biorobotics.ri.cmu.edu/index.php">Biorobotics lab</a> uses Docker containers so that every lab member can run their own code inside a confined space on a shared workstation without affecting anybody else’s work. In this post I’m going to explain the motivation for using Docker containers, present the basic underlying principles for how they are set up, and provide some useful commands for using them. I may write a follow-up post focused on best practices, as well.</p>
<h2 id="what-are-docker-containers-and-why-use-them">What are Docker Containers and Why Use Them?</h2>
<p>I have heard cool computer scientists call Docker containers “sandboxes,” because they give you a place to play around with code without fear of affecting anything outside the container [1]. They are similar to virtual environments in Python or virtual machines in that they are a dedicated ecosystem inside your computer with their own operating system, packages and scripts installed [1]. You can set up your Docker container to run exactly the right version of Python for your application, for example, so that your application will always run no matter what version of Python is running on the workstation hosting your Docker container [1].</p>
<p>The other beautiful thing about Docker containers is that they don’t affect code located elsewhere on the host computer. This is great in a research lab where the computer itself is a shared resource, and I may need to use different versions of Python or Tensorflow than other people in the lab. In this scenario, I can set up my Docker container to match my application requirements without messing up anybody else’s setup [1]. This is especially comforting if you are, like me, new to running deep learning models and you’re a little afraid that you’ll destroy other people’s codebases. It’s hard to make friends with people after irrevocably destroying their thesis project.</p>
<p>To get a little more technical, we can say that Docker containers are an open-source tool that allows you to package your application inside a “standardized unit,” along with all of the application’s dependencies [1]. To the best of my knowledge, Docker containers are primarily run in Linux; I’m not sure if they can be used on other operating systems [1]. The difference between a Docker container and a virtual machine is that the Docker container is designed to be more lightweight [1]. I don’t understand exactly how the Docker engineers were able to do this, but they were able to design containers that use the host computer’s compute resources very efficiently [1]. This is good for you because it means you get a larger share of the host computer’s resources for your application, and you are spending less of it on the infrastructure of your sandbox [1].</p>
<h2 id="underlying-principles-of-docker-containers">Underlying Principles of Docker Containers</h2>
<p>In this section I want to introduce some of the terminology for talking about Docker containers. (I always find that understanding technical things gets a lot easier once you have mastered the vocabulary.) We start with the <strong>Docker image</strong>, which is the “blueprint” of the Docker container [1]. That is, the Docker image is the source code which you can download and compile to run a Docker container [1]. You can store Docker images in your Docker repository so that you can put them on any computer [1]. As you might imagine, this is especially useful when you have a lab with multiple workstations available for use and you want to be able to switch easily between using different machines based on their availability.</p>
<p>Now that we have downloaded a Docker image, we create a <strong>Docker container</strong> from that image (i.e. an instance of the image on your host machine). There is a <strong>Docker daemon</strong> running in the background on the host machine which is always managing the building, running and distribution of all the Docker containers on that machine [1]. The daemon is working with the host machine’s operating system to allocate resources [1]. The user (you) is interacting with the daemon through the <strong>Docker client</strong>, which is just a command line tool that lets you manage your containers [1]. (Apparently there is also a GUI available if you prefer to use that [1].)</p>
<p>Let’s talk a little more about images for a minute. There are two kinds of images: <strong>base</strong> and <strong>child</strong> images [1]. Base images do not have parent images - they usually contain fundamental codebases like operating systems or Python 3.6.9 [1]. Child images, in contrast, build on top of base images, adding more functionality for your specific application [1]. For example, I am currently building a child image that draws on the Python 3.6.9 base image - this ensures I have all the functionality of this version of Python which I can use in my particular deep learning model. You could also say that my image is a <strong>user image</strong> because I wrote it myself [1]. Conversely, <strong>official images</strong> are ones that are maintained and supported by engineers at Docker, and they usually consist of images for operating systems or coding languages [1].</p>
<h2 id="workflow">Workflow</h2>
<p>So now that we have some familiarity with the Docker container paradigm, let’s walk through how we might use one in practice. Srivastav provides some excellent in-depth examples in his tutorial [1], and I will only give a brief overview here. In short, the steps to creating and using a Docker image are as follows:</p>
<ol>
<li>Create an image by preparing your application code and writing a Dockerfile.</li>
<li>Build the Docker image.</li>
<li>Run the Docker container based on that image.</li>
<li>Commit the image to Docker Hub.</li>
</ol>
<h3 id="create-an-image">Create an Image</h3>
<p>Let’s say that I want to set up a Docker image for a particular deep learning model that I have written. I already have a Github repository with the deep learning model codebase. The first thing I need to do is create a <strong>Dockerfile</strong> that defines what commands the daemon will call while building my Docker image [1]. The Dockerfile is similar to a bash script in Linux which automates the image creation process by telling the daemon what base image to use, what packages to install and what scripts to run [1]. In my next post I’m going to go into more detail about some best practices for writing Dockerfiles, because there are some subtleties to writing them well. But a simple Dockerfile is shown below [1].</p>
<p><img src="/images/2020-10-19-Docker-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [1]</p>
<p>The FROM command indicates what the base image is for this Docker image - in this example, the base image is Python 3. We choose a directory to locate the application inside the image using WORKDIR. Then we COPY all the files and install the application’s dependencies using RUN. This particular example creates a simple website so we need to EXPOSE a port for the webpage. Finally, we run the primary script for the application with the CMD command [1].</p>
<h3 id="build-an-image">Build an Image</h3>
<p>Once you have written the Dockerfile, you are ready to build the Docker image. We use this command [1]:</p>
<p>docker build -t username/image-name .</p>
<p>Don’t forget to put the full stop at the end - it tells Docker to use the current directory as the build context. The “-t” flag just adds a tag to the Docker image with your username/image-name. Notice that it is best practice to name all the images that you create following this format of using your Docker username and then the image name after a forward slash [1]. This build command will run through all the commands in the Dockerfile, and may take a while the first time you run this command because you may need to install a lot of packages in support of your application. (There are ways to cache your packages when your dependencies don’t change much from build to build - we’ll talk about that in another post.)</p>
<h3 id="run-the-container">Run the Container</h3>
<p>Once the Docker daemon has successfully built the image, you can run it using [1]:</p>
<p>docker run username/image-name</p>
<p>This will open your Docker container and run through the code specified in the Dockerfile. If you use the Terminal command as listed here, the Docker container will close again (but not delete itself) when it’s done. There are additional flags you can add to open a container and run commands inside the container, for example:</p>
<p>docker run -it username/image-name sh</p>
<p>The “-it” flag provides an interactive TeleTYpe (tty) interface to the container, which just looks like inputting commands to the Terminal. You can also use:</p>
<p>docker run --rm --interactive --tty username/image-name</p>
<h3 id="commit-to-docker-hub">Commit to Docker Hub</h3>
<p>Similar to Github, Docker hosts a repository service called the Docker Hub that allows you to publish your images [1]. It’s free to create an account and host your images on the Docker Hub. Use this command to log in [1]:</p>
<p>docker login</p>
<p>And this command to publish your image [1]:</p>
<p>docker push username/image-name</p>
<p>When you want to pull your image on a new workstation, use this command [1]:</p>
<p>docker pull username/image-name</p>
<p>There you go! Now you have a way to build your own images and store them in a repository so that you can access them from any workstation. In the next section we’ll go over a couple more useful commands for working with Docker containers.</p>
<h2 id="other-useful-commands">Other Useful Commands</h2>
<p>You can see all of the Docker containers that you have run using:</p>
<p>docker ps -a</p>
<p>This command allows you to see the containers, even if you have stopped running them. Note that after a container finishes running, it is not completely removed from your hard disk; stopped containers can pile up and consume disk space if you don’t delete them [1]. To remove containers that you don’t need any more, you can either remove the specific container with [1]:</p>
<p>docker rm container-ID</p>
<p>Or you can use the prune function [1]:</p>
<p>docker container prune</p>
<p>Remember to use these commands occasionally, especially when using a shared workstation.</p>
<p>You can also see all of your images using [1]:</p>
<p>docker images</p>
<p>So that’s it for this post - I hope it was a useful introduction to Docker containers. I plan to write a follow-on post soon with some more best practices for building Dockerfiles and other activities.</p>
<h2 id="footnotes">Footnotes:</h2>
<p>*1 Don’t get me wrong, I love Google Colab and how easy it is to use, but the conventional wisdom is that once you get your model working, you need to move it to a workstation with a dedicated GPU. The problem with staying on Google Colab as you train your model for many epochs is that you can be kicked off the server’s GPUs at any time, which can seriously interrupt your run or cause it to fail completely. This was happening to me as I was preparing a short paper submission and it made me very nervous. So now I’m investing the time in learning how to use Docker containers so it doesn’t happen again!</p>
<h2 id="references">References</h2>
<p>[1] Srivastav, P. “Docker for Beginners.” <a href="https://docker-curriculum.com/">https://docker-curriculum.com/</a> Visited 19 Oct 2020.</p>In this post we are going to introduce a useful tool called Docker containers. I came across these recently when I decided to finally move my deep learning model off Google Colab and onto a computer equipped with GPUs.*1 The Biorobotics lab uses Docker containers so that every lab member can run their own code inside a confined space on a shared workstation without affecting anybody else’s work. In this post I’m going to explain the motivation for using Docker containers, present the basic underlying principles for how they are set up, and provide some useful commands for using them. I may write a follow-up post focused on best practices, as well.Control Theory Grab Bag and MPC2020-08-28T00:00:00+00:002020-08-28T00:00:00+00:00http://sassafras13.github.io/ControlTheoryBasics<p>Recently I have been looking into various topics on control and motion planning, and I wanted to just write a quick post to define some terms. I am also providing a somewhat detailed explanation of one control technique in particular, called <strong>model predictive control</strong> (MPC).</p>
<h2 id="definitions">Definitions</h2>
<p>First, some definitions. <strong>Visual servoing</strong> seems to be a somewhat generic term that describes any form of closed loop control that uses a camera to collect information about your system’s state [1]. There are two primary categories of visual servoing: (1) directly controlling the system’s degrees of freedom using information pulled from the camera feed and (2) geometrically interpreting the information from the camera feed before using it in a control loop [1]. This second method usually requires some knowledge about the shape of the system that is being observed with the camera so that we can draw conclusions about the system’s pose (position and orientation) [1].</p>
<p>The second term that I hear all the time but have never developed a great intuition for is <strong>trajectory optimization</strong>. Trajectory optimization refers to a broad class of methods that can be used to design a system’s trajectory which maximizes (or minimizes) some measure of performance, given some constraints [2]. Usually we perform trajectory optimization in an open-loop environment - that is, we assume that we will only be able to send commands to our system, but we will not necessarily be able to track the system’s state and respond to disturbances during the execution of the trajectory [2]. One simple example of this might be shooting a cannon ball (the control input is the angle of the cannon, which we want to optimize to hit our target).</p>
<p>We often use trajectory optimization when it is not necessary or practical to find a closed-loop solution to controlling our system [2]. We can also use trajectory optimization techniques when we only need to compute the first control step for an infinite-horizon problem (i.e. where we are trying to control the system forever) [2]. This last situation actually lends itself well to using model predictive control, which we will describe more below [2]. Some other methods of trajectory optimization include single shooting, multiple shooting, direct collocation, and differential dynamic programming [2]. I should probably write about all of these methods at some point, too.</p>
<p>Finally, I wanted to quickly explain <strong>linear-quadratic regulator</strong> (LQR) control. I have studied this technique multiple times and even implemented it myself on real hardware, but for some reason it always feels like a difficult concept to remember. LQR control is a way of finding a static gain, <em>K</em>, that can be used to perform feedback control according to <em>u</em> = -<em>K x</em> [3]. The matrix, <em>K</em>, is found by minimizing some cost function, <em>J</em>, which can be written as follows [3]:</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>Where <em>U</em> contains the control inputs at every time step, <em>Q</em> is the state cost, and <em>R</em> is the control cost [4]. In Equation 1, we are computing the cost over some horizon which is defined as <em>N</em> time steps [4]. The first term in the cost function computes the cost of the error between our desired and actual states at a given time step [4]. The second term computes the cost of applying a control input at a particular time [4]. Finally, the last term computes the cost of the error between our desired and actual <em>final</em> states [4]. The engineer must choose values for <em>Q</em> and <em>R</em> in order to optimize the controller for the particular problem at hand - this is often considered to be easier and more intuitive than other forms of controller design [5].</p>
<p>How do we use the cost function in Equation 1 to find our controller, <em>K</em>? If we assume that we can write the system dynamics as a linear system as follows [3]:</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>Then we can solve the associated Riccati equation for this system [3]. We use the solution, <em>S</em>, to write an expression for <em>K</em> [3].</p>
<p>Where did this <strong>Riccati equation</strong> come from? In general, we can define a Riccati equation as any first order ordinary differential equation that is quadratic in terms of the unknown function [6]. More specifically, for the particular case where we are designing a controller for our system’s steady state, we use the <strong>algebraic Riccati equation</strong> [6].</p>
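<p>To make this concrete, here is a minimal Python sketch of the finite-horizon, discrete-time case: the backward Riccati recursion produces one gain per time step of the cost in Equation 1. The double-integrator matrices below are made up for the example.</p>

```python
import numpy as np

def dlqr_finite_horizon(A, B, Q, R, Qf, N):
    """Finite-horizon discrete LQR via the backward Riccati recursion.

    Returns gains K_0, ..., K_{N-1} such that u_k = -K_k @ x_k
    minimizes the quadratic cost in Equation 1."""
    P = Qf                     # cost-to-go matrix, initialized at the terminal cost
    gains = []
    for _ in range(N):
        # K = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P <- Q + A' P (A - B K)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()            # gains[0] is the gain to apply at time step 0
    return gains

# Example: discrete-time double integrator with time step dt
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Qf = np.eye(2), np.array([[1.0]]), 10.0 * np.eye(2)
Ks = dlqr_finite_horizon(A, B, Q, R, Qf, N=50)
```

<p>With <em>N</em> large enough, the gains settle toward a steady-state value, which is exactly the solution, <em>S</em>, of the algebraic Riccati equation mentioned above.</p>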
<p>LQR is a very popular method of designing controllers and is often one of the first techniques you learn in a course on modern control theory. In the next section I’ll talk about MPC, which is another highly popular control algorithm, especially in recent years with the development of more powerful computers.</p>
<h2 id="model-predictive-control">Model Predictive Control</h2>
<p>Okay, now let’s dive into model predictive control (MPC) because it is a really cool tool that is getting a lot of air time in recent years in applications like drone control and autonomous vehicle control. MPC actually originated in the 1980s for use in chemical plants, because the processes occurring in the plants were slow enough that 1980s-era computers could perform the calculations needed to implement MPC [7].</p>
<p>Model predictive control does basically what the name advertises: it uses a model of the plant to make predictions about the future, and finds the optimal control input for the current time based on those predictions [7]. MPC assumes that you already know what your desired trajectory, or setpoint, is - in other words, we should already have done trajectory optimization for our application, and we are using MPC to make sure our system follows this desired trajectory. MPC is a computationally expensive approach - especially as the number of states gets larger and the control horizon is longer*1 - but as mentioned above, we can manage this to some extent with the more powerful computing resources we have today.</p>
<p>While PID control is still the dominant control strategy in industry, MPC is a better choice for MIMO (multi-input, multi-output) systems that have many inputs [7]. This is because PID controllers must be applied to each control input individually, and it can be challenging to tune the gains in each controller to optimize the entire system, because one control input could affect multiple output states [7]. MPC sidesteps this problem by optimizing the entire system at once [7].</p>
<p>Okay, let’s lay out the structure of the standard MPC algorithm, and then we will discuss how to choose some of the key parameters of the algorithm.</p>
<h3 id="mpc-algorithm">MPC Algorithm</h3>
<p>A brief block diagram of the MPC algorithm is shown in Figure 1. Here we are controlling some plant with our MPC controller, which contains two components: the plant model and the optimizer [8]. Our goal is to make the plant follow the reference, which could be a setpoint value or an optimized trajectory [8]. Let’s say we are designing an MPC controller for a car, and our goal is to make the car follow a trajectory which stays in the middle of a lane on the highway.</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [8]</p>
<p>Our MPC controller works as follows: at the current time step, we use the model of the car to predict the car’s trajectory some number of time steps, <em>p</em>, into the future, assuming we give the car some control input, <em>u</em> [8]. In general, the controller will make this forward prediction for a number of possible control inputs [8]. The optimizer inside the MPC controller is using the predictions to find the sequence of control inputs that minimize some cost function [8]. The cost function is somewhat similar to Equation 1, and it is given below [9]:</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-eqn3.png" alt="Eqn 3" title="Equation 3" /> <br />
Equation 3</p>
<p>The first term in Equation 3 computes the cost of deviating from the desired trajectory, and the second term is trying to minimize the control inputs so that we don’t abruptly apply large control inputs (this is generally a bad thing to do) [8]. The weights, <em>w</em>, are used to convey to the optimizer how important it is to minimize each cost [8]. For example, we may want to put more weight on minimizing the control inputs because our car is a passenger vehicle and we want our passengers to be comfortable. We can also apply <strong>constraints</strong> to the system’s control inputs and state: for example, we might say that you cannot turn the steering wheel more than 90 degrees, and the car cannot be allowed to move outside of its lane [8].</p>
<p>Once we have found the optimal sequence of control inputs, we apply the first control input in that sequence to the car [8]. Then we throw away all the rest of the control inputs, and move forward one time step [8]. On the next time step, we take a measurement of our system *2, and re-do the entire optimization process [8]. In this way, if there is some disturbance to our system that we did not predict, MPC is able to take that into account and generate a new sequence of control inputs [8]. Because we are constantly re-computing the control sequence at every new time step, MPC is also called receding horizon control [8].</p>
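<p>A minimal sketch of this receding-horizon loop follows; it is my own toy example, unconstrained and with a made-up double-integrator model, so the inner optimization reduces to a linear least-squares solve rather than the full constrained program a real MPC implementation would solve.</p>

```python
import numpy as np

def mpc_step(A, B, Q, R, x0, x_ref, p):
    """One receding-horizon step: predict p steps ahead, solve for the
    input sequence minimizing the tracking + control-effort cost
    (unconstrained here, so it reduces to a linear solve), and
    return only the first input of that sequence."""
    n, m = B.shape
    # Condensed prediction matrices: stacked states X = F @ x0 + G @ U
    F = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(p)])
    G = np.zeros((n * p, m * p))
    for i in range(p):
        for j in range(i + 1):
            G[n*i:n*(i+1), m*j:m*(j+1)] = np.linalg.matrix_power(A, i - j) @ B
    Qbar = np.kron(np.eye(p), Q)          # state weights over the horizon
    Rbar = np.kron(np.eye(p), R)          # input weights over the horizon
    Xref = np.tile(x_ref, p)
    # Minimize (F x0 + G U - Xref)' Qbar (...) + U' Rbar U in closed form
    H = G.T @ Qbar @ G + Rbar
    f = G.T @ Qbar @ (F @ x0 - Xref)
    U = np.linalg.solve(H, -f)
    return U[:m]                          # discard all but the first input

# Drive a double integrator to the origin with a prediction horizon of 15
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)
x = np.array([1.0, 0.0])
for _ in range(100):
    u = mpc_step(A, B, Q, R, x, np.zeros(2), p=15)
    x = A @ x + B @ u                     # plant update (disturbances would enter here)
```

<p>Note how the loop body mirrors the algorithm: optimize over the whole horizon, apply only the first input, re-measure, repeat.</p>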
<p>Now that we have a basic idea of how MPC works, let’s talk about how we choose some of these parameters.</p>
<h3 id="parameter-selection">Parameter Selection</h3>
<p>As we mentioned earlier, choosing the parameters for the MPC algorithm is important for ensuring that the computational complexity does not blow up and make solving the algorithm intractable [10]. But these parameters also affect the performance of the algorithm and how effectively it can control your application.</p>
<p>Let’s start by considering the <strong>sample time</strong> of the algorithm. If the sample time is too long, then the controller will not be able to respond to disturbances quickly enough to avoid violating one of the constraints [10]. For instance, you can imagine that if the controller for the autonomous vehicle updates very slowly, it will not be able to correct for a pedestrian in the road. Conversely, if the sample time is too short, then it might not be possible for the controller to finish a set of computations before the end of the time step [10]. MATLAB recommends choosing a sample time that allows the controller to fit 10 to 20 steps into the open loop rise time of the system [10].</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [10]</p>
<p>Let’s talk about the horizons of the algorithm. First we have the <strong>prediction horizon</strong>, which determines how many time steps into the future we need to compute a predicted trajectory using our model of the system [10]. If our prediction horizon is too short, then we fail to find a control policy that is truly optimal for our situation [10]. For example, imagine we are planning the autonomous vehicle’s speed as it approaches a bend in the road [10]. If our prediction horizon is not long enough to include the bend in the road, we may keep traveling at a high speed until the bend in the road is just ahead of us [10]. At that point, we will have to slam on the brakes which will be very uncomfortable for the passengers in the car [10].</p>
<p><img src="/images/2020-08-28-ControlTheoryBasics-fig3.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Source: [10]</p>
<p>So why not just make the prediction horizon very long? Remember that we are still trying to avoid the Curse of Dimensionality - if we make the prediction horizon very long, then we consume a lot of computational resources and time to compute our next control signal [10]. Furthermore, remember that we’re going to throw away everything after the first time step and repeat this process, so it’s wasteful to compute more steps than we truly need [10]. MATLAB recommends sizing the prediction horizon so that you can fit 20 to 30 time steps into the open loop transient system response [10].</p>
<p>The other horizon in the MPC algorithm is the <strong>control horizon</strong> [10]. This determines how many time steps into the future we need to compute our control sequence [10]. That is, when we are optimizing our control policy, the optimizer has to choose <em>n</em> control inputs, where <em>n</em> is the control horizon [10]. Again the Curse of Dimensionality comes into play here, because we can think of each control input as a separate variable that must be optimized [10]. Therefore, we don’t want to choose a horizon that is too long, because that will make the algorithm intractable. Similarly, if the control horizon is too short, we may not be looking far enough into the future to find the truly optimal control sequence [10]. In general, assuming that we are trying to reach some steady state setpoint, the first couple of control inputs are very important to our success, while control inputs in the more distant future have less importance [10]. MATLAB recommends choosing a control horizon that is 10 - 20% as long as the prediction horizon and contains at least 2 to 3 time steps [10].</p>
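<p>The rules of thumb above can be collected into a tiny helper. The specific midpoint choices here are my own assumptions, since the cited guidance only gives ranges:</p>

```python
def mpc_design_params(rise_time):
    """Rule-of-thumb MPC sizing based on the MATLAB guidance above.
    Sample time: fit ~15 steps (middle of the 10-20 range) into the
    open-loop rise time. Prediction horizon: ~25 steps (middle of the
    20-30 range over the transient). Control horizon: ~15% of the
    prediction horizon, but never fewer than 3 steps."""
    sample_time = rise_time / 15
    prediction_horizon = 25
    control_horizon = max(3, round(0.15 * prediction_horizon))
    return sample_time, prediction_horizon, control_horizon

# e.g. a plant with a 1.5 s open-loop rise time
Ts, p, m = mpc_design_params(rise_time=1.5)
```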
<p>Finally, we can apply constraints to the MPC algorithm. We’ve touched on these already, but I just wanted to introduce the idea of <strong>hard</strong> and <strong>soft</strong> constraints [10]. Hard constraints cannot, under any circumstances, be broken by the algorithm [10]. On the other hand, soft constraints can be broken as needed, as long as the algorithm continues to try to minimize these states as much as possible [10]. Constraints can be applied to both the control inputs and the system’s states [10]. Let me illustrate the idea with another example. Let’s say that our autonomous car is fully loaded and has to climb a steep hill [10]. We apply a hard constraint to the control input - we cannot exceed some maximum input to the gas pedal - and we also apply a hard constraint to the car’s velocity - we cannot drop below the minimum allowable speed on the highway [10]. These hard constraints on both the control inputs and the state could produce an intractable situation where it is impossible to find a solution to the problem, because we cannot give the car enough gas to keep it moving at 45 mph [10].</p>
<p>The solution to this problem of hitting an intractable situation is to never apply hard constraints to <em>both</em> the control inputs and the states [10]. If we had instead made the constraint on the car’s velocity soft, then we <em>could</em> find a solution to this problem: the car would be allowed to slow down while it was climbing the hill, and then it would return to the desired speed after returning to level ground [10].</p>
<h4 id="footnotes">Footnotes:</h4>
<p>*1 The Curse of Dimensionality strikes again!</p>
<p>*2 Note that if we are lucky, we can directly measure the system state, but if that is difficult to do directly, we can also use state estimators like <a href="https://sassafras13.github.io/Filters/">Kalman filters</a> [8].</p>
<h4 id="references">References:</h4>
<p>[1] “Visual servoing.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Visual_servoing">https://en.wikipedia.org/wiki/Visual_servoing</a> Visited 26 Aug 2020.</p>
<p>[2] “Trajectory optimization.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Trajectory_optimization">https://en.wikipedia.org/wiki/Trajectory_optimization</a> Visited 28 Aug 2020.</p>
<p>[3] “lqr.” MathWorks. <a href="https://www.mathworks.com/help/control/ref/lqr.html">https://www.mathworks.com/help/control/ref/lqr.html</a> Visited 28 Aug 2020.</p>
<p>[4] “Lecture 1: Linear quadratic regulator: Discrete-time finite horizon.” EE363 Stanford University. Winter 2008-09. <a href="https://stanford.edu/class/ee363/lectures/dlqr.pdf">https://stanford.edu/class/ee363/lectures/dlqr.pdf</a> Visited 28 Aug 2020.</p>
<p>[5] “Linear-quadratic regulator.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator">https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator</a> Visited 28 Aug 2020.</p>
<p>[6] “Riccati equation.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Riccati_equation">https://en.wikipedia.org/wiki/Riccati_equation</a> Visited 28 Aug 2020.</p>
<p>[7] “Understanding Model Predictive Control, Part 1: Why Use MPC?” MATLAB. YouTube. 15 May 2018. <a href="https://www.youtube.com/watch?v=8U0xiOkDcmw">https://www.youtube.com/watch?v=8U0xiOkDcmw</a> Visited 27 Aug 2020.</p>
<p>[8] “Understanding Model Predictive Control, Part 2: What is MPC?” MATLAB. YouTube. 30 May 2018. <a href="https://www.youtube.com/watch?v=cEWnixjNdzs">https://www.youtube.com/watch?v=cEWnixjNdzs</a> Visited 28 Aug 2020.</p>
<p>[9] “Model predictive control.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Model_predictive_control">https://en.wikipedia.org/wiki/Model_predictive_control</a> Visited 28 Aug 2020.</p>
<p>[10] “Understanding Model Predictive Control, Part 3: MPC Design Parameters.” MATLAB. YouTube. 19 Jun 2018. <a href="https://www.youtube.com/watch?v=dAPRamI6k7Q">https://www.youtube.com/watch?v=dAPRamI6k7Q</a> Visited 28 Aug 2020.</p>The Silver Challenge - Lecture 10 (2020-08-15) http://sassafras13.github.io/Silver10<p>Woohoo! I have completed the Silver Challenge! Today we wrapped up the course with a review of different RL algorithms used to play classic games like backgammon, Scrabble, chess and Go [1]. I’ll briefly explain some of the intuition behind the TreeStrap algorithm, present the state of the art (at least in 2015), and then close by highlighting some of the techniques Silver used in his teaching that I thought were really effective [1].</p>
<h2 id="the-treestrap-algorithm">The TreeStrap Algorithm</h2>
<p>The TreeStrap algorithm is a way of combining search and learning strategies in an information-rich method [1]. TreeStrap uses a search process to compute the minimax values at all nodes of a search tree [1]. These minimax values are then used to update the value function at all points on the tree, not just at the root node [1]. This idea is presented in Figure 1, where we can see that the minimax values are propagated backwards through the tree towards the root [1]. TreeStrap is different from TD root or TD leaf because it propagates the new minimax values to all the nodes in the tree, not just to the root [1].</p>
<p><img src="/images/2020-08-15-Silver10-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [2]</p>
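<p>To illustrate the minimax backup that TreeStrap builds on, here is a toy sketch. It only performs the backup over a hand-made tree; the full TreeStrap update would additionally regress the value function toward these backed-up values at every node.</p>

```python
def minimax(node, maximizing):
    """Back up minimax values through a toy game tree.
    A node is either a leaf score (a number) or a list of child nodes."""
    if isinstance(node, (int, float)):
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)

# Depth-2 tree: the root (maximizing) chooses between two minimizing nodes
tree = [[3, 5], [2, 9]]
root_value = minimax(tree, maximizing=True)   # min(3,5)=3 vs min(2,9)=2, so 3
```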
<p>Silver explains that, even though he believed that the TreeStrap algorithm was a very effective method of playing games using a linear value function, he actually found with <a href="https://www.youtube.com/watch?v=WXuK6gekU1Y">his work on AlphaGo</a> that neural networks could also be extremely effective [1]. He uses this experience as an example that we as researchers should constantly be challenging our assumptions and testing them further [1].</p>
<h2 id="the-state-of-the-art-in-2015">The State of the Art (in 2015)</h2>
<p>This lecture closed with a slide that showed the state of the art algorithms for playing classic games (not Atari or Starcraft), as of 2015. I think more neural networks have been applied since then (most notably with AlphaGo), but it was still interesting to see how well RL could perform in playing some classic games without using neural networks as function approximators [1]. In fact, Silver argues in this lecture that there are applications where neural networks (at least in 2015) were not the best function approximator for some of the more deterministic games in this repertoire [1].</p>
<p><img src="/images/2020-08-15-Silver10-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [2]</p>
<h2 id="teaching-strategies">Teaching Strategies</h2>
<p>Finally, I just want to close by listing a couple of Silver’s teaching strategies that I found quite effective and would like to use myself in the future:</p>
<ul>
<li>
<p><strong>Tell a Story</strong> - Silver always added narration that linked the ideas in each lecture together. He told us the outline of the story at the beginning of each lecture, often placing the current lecture in the context of others. Then at each transition in the lecture, he would link ideas together, which helped his audience to track the flow of the concepts.</p>
</li>
<li>
<p><strong>Use Repetition</strong> - Silver was a master of structuring ideas so that we could see repeated forms in the structure and link different concepts together. For example, when he taught Monte Carlo and TD planning methods, he set them both up as different forms of computing the update for the policy function parameters, which helped us to see how these methods were the same and different. The repetitive structure really helped me to remember ideas from previous lectures and to understand the mechanism of how different methods worked.</p>
</li>
<li>
<p><strong>Explain the Mechanisms Behind the Math</strong> - Silver would always narrate the way an equation worked. That is, he would highlight in words what different parts of any equation were doing - sometimes they were pulling the update in a certain direction, sometimes they were measuring the similarities of two distributions or comparing the weights of two properties. But Silver always gave his students an intuitive way of understanding how the equations worked, which I think is part of the secret sauce of being a great researcher.</p>
</li>
<li>
<p><strong>Use Clear Language and Examples</strong> - Silver always had 1-3 great examples to make his points concrete in every lecture, and they made all the difference. He was good at only using the minimum necessary technical vocabulary to convey his points, and then he focused on giving students an intuitive, conceptual understanding through concrete examples.</p>
</li>
<li>
<p><strong>Repeat Students’ Questions</strong> - Silver was excellent at listening to his students’ questions and very quickly finding ways to respond to them. One thing in particular that he did was to repeat the question to the class - I love it when professors do this because it helps bring everyone into the discussion, and prevents us from zoning out or losing the thread because we couldn’t hear the original question.</p>
</li>
<li>
<p><strong>Don’t Add Superfluous Information</strong> - Silver’s slides were very clean, and every piece of information on them was explained and important to the discussion. I really appreciated this because extra information can really make it unnecessarily harder to understand the core concepts.</p>
</li>
</ul>
<h4 id="references">References:</h4>
<p>[1] Silver, D. “RL Course by David Silver - Lecture 10: Classic Games.” YouTube. 29 Jul 2015. <a href="https://www.youtube.com/watch?v=kZ_AUmFcZtk&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=10">https://www.youtube.com/watch?v=kZ_AUmFcZtk&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=10</a> Visited 15 Aug 2020.</p>
<p>[2] Silver, D. “Lecture 10: Classic Games.” <a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/games.pdf">https://www.davidsilver.uk/wp-content/uploads/2020/03/games.pdf</a> Visited 15 Aug 2020.</p>The Silver Challenge - Lecture 9 (2020-08-14) http://sassafras13.github.io/Silver9<p>Today’s lecture discussed more intelligent methods of trading off exploration and exploitation than epsilon-greedy, which is the only method we have seen so far [1]. Silver reviewed three approaches to making this trade-off, and he explained that for all of them, there is a lower bound on how well they can perform [1]. I’m going to briefly explain that lower bound and then introduce the idea behind one of these three approaches which I thought was particularly interesting.</p>
<h2 id="lai-and-robbins-theorem">Lai and Robbins Theorem</h2>
<p>This theorem states that the asymptotic total regret (i.e. total missed opportunity for an exploration-exploitation trade-off algorithm) is at least logarithmic in the number of steps [1]:</p>
<p><img src="/images/2020-08-14-Silver9-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>Here the log(t) term shows that regret grows at least logarithmically in the number of time steps [1]. The numerator in the fraction represents the <strong>gap</strong> between the expected rewards of the different actions we can take [1]. A larger gap means that two actions have very different means, so if we pick the worse-performing action, we incur a large regret [1]. The denominator, the KL-divergence, is a measure of the similarity of the reward distributions [1]. If the KL-divergence is small, the distributions of rewards over the two actions look very similar, which makes the actions hard to tell apart and increases the amount of regret we accrue [1]. The most difficult problems therefore combine a small KL-divergence (the two actions appear to have very similar reward distributions) with a large gap (the true means are very different, so choosing the wrong action incurs a large penalty) [1].</p>
<h2 id="optimism-in-the-face-of-uncertainty">Optimism in the Face of Uncertainty</h2>
<p>One of the three approaches to balancing the exploration/exploitation dilemma was called optimism in the face of uncertainty [1]. I thought it was an interesting idea that we had not yet seen, because instead of assuming that unknown states have zero reward until proven otherwise, with optimism in the face of uncertainty we actually assume that they have maximum reward until proven otherwise [1]. The result of this assumption is that we are motivated to visit the states we know the least about (because they were initialized with the highest reward), and we progressively decrement those states’ rewards until they reflect their true values [1]. This method drives us to frequently choose the states that are the most unknown if there is a possibility that they could yield a higher reward than any of our known states [1].</p>
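<p>A quick sketch of this idea on a toy multi-armed bandit (the arm means, noise level and initial value here are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def optimistic_bandit(true_means, steps, q_init=5.0):
    """Purely greedy bandit with optimistic initial value estimates.
    Every arm starts at an unrealistically high estimate q_init, so the
    agent is driven to try each arm and decrement its estimate until
    the estimates reflect the true mean rewards."""
    n = len(true_means)
    q = np.full(n, q_init)       # optimistic value estimates
    counts = np.zeros(n, dtype=int)
    for _ in range(steps):
        a = int(np.argmax(q))                            # greedy: no epsilon needed
        r = true_means[a] + 0.1 * rng.standard_normal()  # noisy reward
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                   # incremental sample mean
    return q, counts

# Three arms with true mean rewards 0.1, 0.5 and 1.0
q, counts = optimistic_bandit(np.array([0.1, 0.5, 1.0]), steps=2000)
```

<p>Even though the policy is purely greedy, the optimistic initialization forces every arm to be tried at least once before the agent settles on the best one.</p>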
<h4 id="references">References:</h4>
<p>[1] Silver, D. “RL Course by David Silver - Lecture 9: Exploration and Exploitation.” YouTube. 13 May 2015. <a href="https://www.youtube.com/watch?v=sGuiWX07sKw&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=9">https://www.youtube.com/watch?v=sGuiWX07sKw&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=9</a> Visited 14 Aug 2020.</p>The Gumbel-Softmax Distribution (2020-08-13) http://sassafras13.github.io/GumbelSoftmax<p>I have been meaning to write this post about the <strong>Gumbel-softmax distribution</strong> for several months, but I put it on a back burner after I had dug myself into a hole of deep confusion and couldn’t get out. After some encouragement from my advisor, I decided to pick it up again, and this time I think I was able to figure things out.*1 So in this post, we are going to learn how the Gumbel-softmax distribution can be used to incorporate categorical distributions into algorithms that use neural networks and still allow for optimization via backpropagation [1, 2].</p>
<p>First we will discuss why it is difficult to work with categorical distributions, and then we will build up the Gumbel-softmax distribution from the <a href="https://sassafras13.github.io/ReparamTrick/">Reparameterization Trick</a> and the Gumbel-Max trick.</p>
<h2 id="why-is-this-hard">Why Is This Hard?</h2>
<p>The problem that we will address in this post is how to work with discrete data generated from a categorical distribution [1-4]. A <strong>categorical distribution</strong> is a probability distribution made up of discrete categories [5]. For example, let’s draw inspiration from the <a href="https://sassafras13.github.io/GraphGANS/">MolGAN work</a> and think about generating graphs that represent molecules. We can limit ourselves to generating graphs that only contain carbon (C), oxygen (O) and fluorine (F) atoms. In this situation, we will have a categorical distribution of the probabilities of choosing each atom type for each node in our graph. This idea is shown in Figure 1.</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1</p>
<p>Now let’s say that I have a neural network that is going to output samples, <em>z</em>, pulled from this categorical distribution of atoms. These samples, <em>z</em>, will represent the atoms in my generated molecule. During the forward pass, the nodes in the final layer of my neural network return these samples. But when I need to optimize my neural network via backpropagation, I will not be able to compute the gradient across these nodes [1-4]. This is because these nodes represent a stochastic process defined by a discrete distribution, and it is impossible to compute a smooth gradient for either a <strong>stochastic</strong> or <strong>discrete</strong> process [1-4]. So I need to find another way to generate graphs that will allow me to perform backpropagation.</p>
<h2 id="the-reparameterization-trick">The Reparameterization Trick</h2>
<p>The first thing I am going to do is apply the Reparameterization Trick. We have talked about it before but I’m going to briefly repeat it here. Let’s imagine for a moment that I have a <strong>continuous</strong> distribution of atoms that I can draw samples from, instead of a categorical distribution. (I know that doesn’t really make sense physically, but bear with me.) Imagining my distribution of atoms as continuous solves one of my problems, because now I don’t have to take the gradient of a discrete process. But how do I deal with the stochasticity of the sampling process?</p>
<p>We use the Reparameterization Trick to recast the stochastic sampling process as a linear combination of a deterministic and a stochastic element [1-4]. In other words, instead of saying that the nodes in the final layer of my neural network are directly sampling from the continuous distribution (which is a completely stochastic process), I can say that the nodes in the final layer are a linear combination of two nodes in the previous layer [1-4]. Specifically, I can say that the samples, <em>z</em>, come from a sum of the mean of the distribution plus some stochastic noise [1-4]. Why is this better? Because now I can directly compute the gradient of the mean and the variance of the continuous distribution, and I can bypass the stochastic node completely [1-4]. I illustrate this in Figure 2 and also give some equations below to help demonstrate what I’m talking about [1-4]:</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-eqn1.png" alt="Eqn 1" title="Equation 1" /> <br />
Equation 1</p>
<p>Note that the noise we are adding is zero-mean white noise, so it does not add bias to our samples.</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Inspired by [1, 3]</p>
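<p>The linear combination in Equation 1 can be sketched in a few lines of Python. NumPy stands in here for a real autodiff framework, and the means and variances are made up:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_sample(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, 1). The stochasticity lives
    entirely in the parameter-free eps node, so gradients can flow
    through the deterministic mu and log_var nodes."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Draw 100,000 samples from N(mu=1, sigma=2) via the trick
mu = np.full(100_000, 1.0)
log_var = np.full(100_000, np.log(4.0))   # variance sigma^2 = 4
z = reparameterized_sample(mu, log_var, rng)
```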
<p>Okay, so now I have a way of dividing a sampling process into separate deterministic and stochastic components. However, I did ask you to temporarily imagine that our distribution of atoms was continuous, and of course that’s not actually true. So how can we draw samples from a discrete distribution instead of a continuous one? That’s where the Gumbel-Max Trick comes in.</p>
<h2 id="the-gumbel-max-trick">The Gumbel-Max Trick</h2>
<p>The Gumbel-Max Trick was described a couple of years before the Gumbel-softmax distribution, in work by some of the same researchers [6]. The value of the Gumbel-Max Trick is that it allows us to sample from a categorical distribution during the forward pass through a neural network [1-4, 6]. Let’s see how it works by following Figure 3.</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-fig3.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Source: [2]</p>
<p>First, the Gumbel-Max Trick uses the approach from the Reparameterization Trick to separate the deterministic and stochastic parts of the sampling process [1-4,6]. We do this by computing the log probabilities of all the classes in the distribution (the deterministic part) and adding noise to them (the stochastic part) [1-4,6]. In this case, we use noise drawn from the Gumbel distribution, which I will discuss more in a minute [1-4,6]. This step mirrors the Reparameterization Trick above, where we added normally distributed noise to the mean.</p>
<p>Once we have combined the deterministic and stochastic parts of the sampling process, we use the argmax function to find the class that has the maximum value for each sample [1-4,6]. The class (or sample, <em>z</em>) is encoded as a one-hot vector for use by the rest of the neural network [1-4,6].</p>
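<p>The two steps above - add Gumbel noise to the log probabilities, then take the argmax and one-hot encode the winner - can be sketched as follows (again, my own illustrative NumPy, not code from [6]):</p>

```python
import numpy as np

def gumbel_max_sample(log_probs, rng):
    """One draw from a categorical distribution via the Gumbel-Max Trick.

    Gumbel(0, 1) noise can be generated from uniform noise u as
    g = -log(-log(u)); argmax(log_probs + g) then lands on class k
    with exactly the original categorical probability p_k.
    """
    u = rng.uniform(size=len(log_probs))
    g = -np.log(-np.log(u))              # stochastic part: Gumbel noise
    k = int(np.argmax(log_probs + g))    # deterministic max operation
    z = np.zeros(len(log_probs))
    z[k] = 1.0                           # encode the sample z as one-hot
    return z

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
z = gumbel_max_sample(np.log(probs), rng)
```

<p>If you tally many such draws, the empirical class frequencies converge to the original probabilities.</p>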
<p>So now we can see that the Gumbel-Max Trick is very similar to the Reparameterization Trick, but we are adding the argmax function to it, and using noise sampled from a different kind of distribution. In fact, why are we using the Gumbel distribution to generate the noise here? An exact mathematical explanation escapes me, but I know that the Gumbel distribution is typically used to model the distribution of the maximums for a number of samples pulled from other distributions [7]. For example, if you wanted to predict which month in 2020 the Monongahela River will flood, the Gumbel distribution could be used to model monthly river level data over the past 10 years and thus extrapolate which month in 2020 will have the highest water levels [7]. Since the Gumbel distribution is used to model the distribution of maximums, it makes sense to me that Maddison et al. explained the selection of the Gumbel distribution by stating that it “is stable under max operations” [2].</p>
<p>One final note on this point - there is also a <a href="https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/">mathematical proof</a> that adding Gumbel noise to the log probabilities and taking the argmax yields samples drawn exactly according to the softmax of those log probabilities.</p>
<h2 id="bringing-it-all-together">Bringing It All Together</h2>
<p>We now have almost all of the pieces of the puzzle assembled. We have a way to separate the stochastic from the deterministic in the sampling process, and we have provided a means for sampling from a categorical distribution, as opposed to a continuous distribution. However, the argmax function in the Gumbel-Max Trick is <em>not</em> differentiable [1-4, 6]. So can we replace the argmax function with something that <em>is</em> differentiable?</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-fig4.png" alt="Fig 4" title="Figure 4" /> <br />
Figure 4 - Source: [2]</p>
<p>In fact you can, and both [1] and [2] proposed using the softmax function as a replacement for the argmax function, as demonstrated in Figure 4 [1-4]. In this approach, we still combine the log probabilities with Gumbel noise, but now we take the softmax over the samples instead of the argmax. Both groups also added a temperature factor, denoted by <em>lambda</em>, which allows us to control how closely the Gumbel-softmax distribution approximates the categorical distribution [1-4]. This modified softmax function can be written as follows [1-4]:</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-eqn2.png" alt="Eqn 2" title="Equation 2" /> <br />
Equation 2</p>
<p>Notice that I am following Jang’s convention of using <em>y</em> to denote “a differentiable proxy of the corresponding discrete sample, <em>z</em>” [1].</p>
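<p>A minimal sketch of Equation 2 in NumPy (my own implementation, not code from [1] or [2]) makes the role of the temperature explicit:</p>

```python
import numpy as np

def gumbel_softmax_sample(log_probs, lam, rng):
    """Differentiable proxy y for a discrete sample z.

    Identical to the Gumbel-Max Trick except that the argmax is
    replaced by a softmax at temperature lam: as lam -> 0 the output
    approaches a one-hot vector, and at large lam it flattens toward
    a uniform vector.
    """
    g = -np.log(-np.log(rng.uniform(size=len(log_probs))))  # Gumbel noise
    logits = (np.asarray(log_probs) + g) / lam
    logits -= logits.max()        # subtract max for numerical stability
    y = np.exp(logits)
    return y / y.sum()            # entries sum to 1 (a point on the simplex)

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
y_cold = gumbel_softmax_sample(np.log(probs), lam=0.01, rng=rng)   # nearly one-hot
y_hot = gumbel_softmax_sample(np.log(probs), lam=100.0, rng=rng)   # nearly uniform
```

<p>Because every operation here is differentiable, gradients can flow through <em>y</em> back to the log probabilities during backpropagation.</p>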
<p><img src="/images/2020-08-13-GumbelSoftmax-fig5.png" alt="Fig 5" title="Figure 5" /> <br />
Figure 5 - Source: [1]</p>
<p>Let’s see how the temperature factor, <em>lambda</em>, can affect the shape of the Gumbel-softmax distribution. In Figure 5, we see Jang et al.’s presentation of how the Gumbel-softmax distribution becomes more uniform as the temperature increases [1]. In Figure 6, Maddison et al. also show how increasing the temperature redistributes the probability uniformly across all classes of the discrete distribution [2]. I like Figure 6 because it also gives us a visual example of what a simplex*2 looks like, since a 2-dimensional simplex is, in fact, a triangle [8]. We can see that at zero temperature, the distribution is discrete, and the probability of choosing one of the classes is concentrated at the vertices of the simplex [2]. As the temperature increases, the probability density is redistributed gradually until it is more centered in the middle of the simplex [2]. For more intuition about the effect of the temperature on the Gumbel-softmax distribution, Eric Jang has a fantastic interactive model on his <a href="https://blog.evjang.com/2016/11/tutorial-categorical-variational.html">personal blog</a>.</p>
<p><img src="/images/2020-08-13-GumbelSoftmax-fig6.png" alt="Fig 6" title="Figure 6" /> <br />
Figure 6 - Source: [2]</p>
<p>Both papers [1] and [2] explain that they use an annealing schedule to gradually reduce the temperature while training their neural networks [1-4]. That is, they begin training with the temperature set to some large value, and they gradually decrease it over the course of training until it approaches zero [1-4]. This approach balances a trade-off between accuracy and variance: at high temperatures, the gradient estimates have very low variance, which is good for robust training of neural networks [1-4]. However, as the temperature decreases (which means that our Gumbel-softmax distribution looks more like a categorical distribution), the variance of those gradients increases, which is bad for training [1-4]. The annealing process ensures that we train robustly while we are learning how to perform a task well; then, as the weights of the neural network converge, we can decrease the temperature without worrying that the increased variance will cause significant instability in our model at that point [1-4].</p>
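<p>As a sketch of what such a schedule might look like, here is one common choice, an exponential decay clamped at a minimum temperature. The functional form and the constants are illustrative choices of mine, not the exact schedules used in [1] or [2]:</p>

```python
import numpy as np

def annealed_temperature(step, lam_start=1.0, lam_min=0.1, rate=1e-4):
    """Exponentially decay the temperature over training steps,
    clamped below by lam_min so the gradient variance never blows up
    entirely at the end of training."""
    return max(lam_min, lam_start * np.exp(-rate * step))

# Early in training the temperature sits near lam_start (smooth,
# low-variance gradients); late in training it is clamped at lam_min
# (samples close to categorical).
temps = [annealed_temperature(s) for s in (0, 10_000, 100_000)]
```

<p>At each training step, the current temperature would be passed into the Gumbel-softmax sampling step described above.</p>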
<p>I hope this explanation at least made some sense in explaining why the Gumbel-softmax distribution is important and how it is used. The next question I’m interested in answering is: now that we know how it works, how do we use it?</p>
<h4 id="footnotes">Footnotes:</h4>
<p>*1 One mistake I made while initially trying to understand the Gumbel-Softmax technique was that I avoided reading the two papers [1, 2] that introduced the topic and instead read a large number of blogs that did not lay out the material as clearly as the papers themselves. Up until this experience I usually believed that blog posts were easier ways to grok an idea than academic papers, but now I am realizing that academic papers can be more clearly organized in some instances.</p>
<p>*2 Both papers describe the categorical distribution samples as “lying on the simplex,” which I did not understand until I saw this diagram [1,2].</p>
<h4 id="references">References:</h4>
<p>[1] E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax,” 5th Int. Conf. Learn. Represent. ICLR 2017 - Conf. Track Proc., Nov. 2016. ArXiv ID: 1611.01144. <a href="https://arxiv.org/abs/1611.01144">https://arxiv.org/abs/1611.01144</a></p>
<p>[2] C. J. Maddison, A. Mnih, and Y. W. Teh, “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,” 5th Int. Conf. Learn. Represent. ICLR 2017 - Conf. Track Proc., Nov. 2016. ArXiv ID: 1611.00712 <a href="https://arxiv.org/abs/1611.00712">https://arxiv.org/abs/1611.00712</a></p>
<p>[3] Jang, E. “Categorical Reparameterization with Gumbel-Softmax & The Concrete Distribution.” YouTube. 2 Mar 2017. <a href="https://www.youtube.com/watch?v=JFgXEbgcT7g">https://www.youtube.com/watch?v=JFgXEbgcT7g</a> Visited 12 Aug 2020.</p>
<p>[4] Jang, E. “Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax.” 8 Nov 2016. <a href="https://blog.evjang.com/2016/11/tutorial-categorical-variational.html">https://blog.evjang.com/2016/11/tutorial-categorical-variational.html</a> Visited 12 Aug 2020.</p>
<p>[5] “Categorical distribution.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Categorical_distribution">https://en.wikipedia.org/wiki/Categorical_distribution</a> Visited 12 Aug 2020.</p>
<p>[6] C. J. Maddison, D. Tarlow, and T. Minka, “A* Sampling,” in Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.</p>
<p>[7] “Gumbel distribution.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">https://en.wikipedia.org/wiki/Gumbel_distribution</a> Visited 12 Aug 2020.</p>
<p>[8] “Simplex.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Simplex">https://en.wikipedia.org/wiki/Simplex</a> Visited 13 Aug 2020.</p>