In a recent post, we talked about some fundamental mathematical operations presented in a robotics textbook written by Murray, Li and Sastry. One of these operations was a matrix exponential, which was unfamiliar to me. It turns out that matrix exponentials are a really cool idea that appear in lots of fields of science and engineering. Thanks to a fabulous video by 3Blue1Brown , I am going to present some of the basic concepts behind matrix exponentials and why they are useful in robotics when we are writing down the kinematics and dynamics of a robot.
As I have been learning the mathematics behind robot kinematics, I have been struggling to understand the difference between a spatial velocity and a body velocity. In this post, I am going to try to write a good definition of each of these velocity types.
This week we return to reading some papers about DNA nanotechnology that my PI recommended to me a while ago and I hadn’t read before now.
I have been trying to learn some of the fundamentals of writing equations of motion for multilink robots from the classic text by Murray, Li and Sastry*1. There were some key ideas and mathematical notations that I wanted to record for myself as I continue to work through the text . I also wanted to try out using MathJax to write equations instead of my former method. Many thanks to Ian Goodfellow’s blog post on the topic .
Last week I explored some ideas around motif-centric learning and then began to investigate the importance of equivariance and invariance in deep learning models. This week, I continue to explore why equivariance is important and useful. I also dive into the idea of an attention mechanism and transformers, ideas that I have heard of but never studied.
With the start of summer, and the transition to the final two years of the PhD program (yikes!), I have been thinking a lot about what I want to focus on learning during these next two years. One thing I want to do is read papers more frequently so that I stay on top of what is happening in my field. I have never been very good at doing this but now that I have completed most of the requirements in the PhD program (classes, TA assignments, etc.) I have more time to dedicate to doing this well. I’m going to try to write a blog post every week summarizing the papers I have read in that time, and my target is to read 4-5 papers a week for right now. This is going to be my first attempt at recording what I’ve read and learned.
Recently I wrote a post describing the fundamental operating principles for atomic force microscopy (AFM). Today, I would like to follow up with a more focused discussion about the different kinds of AFM probes and how to select one for your application. I will start by talking about probe design in general, and then walk us through some different parameters that we need to consider when selecting probes, and how they affect the probe’s performance. Let’s get started!
I have been writing a series of posts on different aspects of atomic force microscopy, starting with this post describing the fundamental operating principles, and this post on how to choose a probe for your application. Right now, I want to talk about two related aspects of AFM imaging: how to optimize your scanning parameters, and what common image artifacts you should look out for while you’re imaging. Let’s get to it!
The Microsystems and Mechanobiology Lab’s spectrophotometer, a Nanodrop 2000 model, is one of the workhorses of our lab. We use this instrument to measure the concentration of DNA in a sample, which is a critical piece of information when building DNA origami structures for use in studying cells, nanoscale self-assembly and shear forces on the microscale. I wanted to do a deep dive into how spectrophotometry works and what the data is telling us, so I’m going to take you on the journey with me. We will start by introducing the spectrophotometer’s operating principle, then discuss how the Nanodrop 2000 works, and conclude with some tips and tricks for taking good measurements. Let’s get started!
This post has been a long time in the making - I originally learned to use an atomic force microscope in January (I began on Inauguration Day, actually) and have done a lot of reading since then, but I have not paid my technical debt on my newfound knowledge. This post (and likely a follow-up as well) will outline how atomic force microscopy works, what some of the common variations of the technique are, and how to collect good data using AFM. I think I will save discussing probe selection for a separate post, though, because it can get a little complicated. So without further ado, let’s get started!
There are many ways to characterize and measure DNA nanostructures, and today I wanted to write about one tool in particular, dynamic light scattering (or photon correlation spectroscopy). Dynamic light scattering (DLS) is used to measure very small particles, typically on the order of 0.6nm - 6um, by studying how they move in solution . I actually first learned to use DLS two years ago, but I didn’t write down a good summary of what I had learned so I am making sure that I write down the basic operating principle this time! In this post, I’m going to give an outline of how the method works, then go into some of the mathematics used in the analysis. Let’s get started!
I was recently introduced to the concept of factor graphs and I am interested in using them in my research. I’m going to take a two-step approach to learning about factor graphs this time: first, we are going to look at what a factor graph is and some of its key advantages, as explained by the excellent blog posts by Prof. Frank Dallaert’s group at Georgia Tech. Then I am going to dive in deeper by reviewing a paper written by Dallaert and Kaess  that gives a more thorough discussion of using factor graphs to perform simultaneous localization and mapping (SLAM). My hope is that this first post will give us enough of the big picture and some of the mechanics that reading the paper will make more sense than it would otherwise. Okay, let’s give this a go!
In this post we are going to cover a couple of fundamental algorithms, including a common search algorithm known as binary search, as well as two sorting methods, merge sort and quick sort . For each algorithm, we will talk through how they work and some details on runtime. If you are looking for some example code, please see my implementations for binary sort, taken from , merge sort taken from  and quick sort taken from .
In this post we are briefly going to discuss two common graph traversal methods, depth-first search (DFS) and breadth-first search (BFS). These are not the only ways to visit all the nodes in a graph in a certain order, but they are two common methods that are useful for many applications . As always, if you are looking for a code implementation, you can find code for both DFS [2-4] and BFS .
Welcome back to another blog post about computer science fundamentals. Today I’m writing about heaps, with a focus on binary heaps and min-heaps. Heaps started out as an amorphous, nebulous concept in my mind - I wasn’t really sure what they were, and they were confusing because they looked like trees and graphs and I didn’t know why they were considered a separate category. I’m going to try to make clear the distinction of heaps as a unique data structure with specific applications today. And as always, if you want to see an implementation of min-heaps in code, please take a look here at work based on a tutorial by .
Okay, today we are going to talk about a really fundamental pair of concepts: stacks and queues. I have written about these before in the linked lists post but there are more details to their runtimes and when they’re useful that I wanted to cover today. As always, if you’re interested in implementing these abstract data structures in code, feel free to look at this code that draws from a tutorial by .
Today we are going to briefly introduce graphs in the context of computer science and implementations in Python. I have written about graphs before so I don’t want to belabor the point too much, but there are a couple ideas about how graphs can be implemented in code that I haven’t explicitly written about before. Also, if you are interested, I have played around with graph implementations in code thanks to tutorials by  and .
In this post we are going to talk about a trie (and the related suffix tree), which is a data structure similar to a binary search tree, but designed specifically for strings [1-2]. (Here we pronounce “trie” as “try”.) Each node of a trie contains a character (in general trees allow for many different data types to be stored in the nodes), and every path down the trie returns a complete word . Tries and suffix trees can be used to search through large sequences, like genomes, because they make it possible to search very efficiently [1-2]. We will learn about the structure of a trie, and how it can be modified to create a suffix tree. We will also briefly review some of the operations that can be performed with tries and suffix trees. If you are interested in a Python implementation of a trie, you can see my code here, which draws heavily from .
In this post we are going to talk about binary search trees, which is a specific type of tree, which itself is a more general data structure. Trees have a lot of uses and variants in computer science, so this post is only going to scratch the surface of what it is possible to do with trees. Laakmann McDowell points out that, because there are so many varieties of trees in computer science, as well as varied terminology, it is important to ask lots of clarification questions with your interviewer if you encounter a tree during a coding interview . In this post, we will discuss some of the basic theory behind trees, the specific properties of binary search trees, and some of the operations you can perform on them. I also have some implementations of trees in Python available here; these implementations draw heavily on tutorials by [2-3].
In my post on hash tables, I encountered the concept of immutable and mutable data for the first time, and promised to write a post about it in the future. The future is here, and so we’re going to talk briefly about what mutable and immutable data types are in the context of Python. Many thanks to Lorraine Li’s excellent post on the topic which guides this blog post .
Welcome back to another blog post about computer science basics by a wannabe coder. This time we are going to talk about hash tables, which Gayle Laakmann McDowell says is one of the most important topics to know for a coding interview . This post will be heavily influenced by Davide Mastromatteo’s excellent article on the subject . We’ll talk about why hash tables are useful, what hash functions are, and how common problems with hash tables, like collisions, can be managed. You can also see an implementation of hash tables in Python (drawing heavily from ) here. Let’s get started!
While working on my post about hash tables, I came across the concept of dynamically resized arrays/lists, which I didn’t know anything about. I wanted to write a post specifically about how we can resize arrays or lists because I think the details are useful to know when discussing more complex data structures. Many thanks to Yasufumi Taniguchi’s excellent article for guiding this post . We will first talk about where dynamically resized arrays and lists come into play and some factors to consider when resizing them, then we will get into some of the details of how efficient resizing can be.
Over the winter break I spent some time reflecting on the past year, and I came up with three insights/research techniques that I want to start/keep using in 2021. My long term goal is to emerge from the PhD program with better problem solving skills, so I am trying to keep track of the times when I hit upon a research method or strategy that seems to work well. These three techniques are:
In DNA nanotechnology, there is a rapidly growing collection of software tools for simulating how nanoscale DNA structures will appear when fully formed and at equilibrium. I have been learning to use tools for conducting these molecular dynamics simulations in my research. I wanted to write about the theory behind these simulations. My goal is to become more knowledgeable about how to intelligently leverage simulation in my research, and also to be aware of its limitations.
As part of my efforts to learn the basics of computer science (or at least enough for a coding interview), I am trying to build my own linked list implementation in Python. I have been relying heavily on the awesome tutorial by Pedro Pregueiro of Real Python , as well as on the overview provided by Gayle Laakman McDowell in her book, Cracking the Coding Interview . In this post, I’m going to give an overview of what linked lists are and why they’re useful, and include some of the code I wrote.
Recently I have been interested in constructing scene graphs directly from raw image data for a research project. Scene graphs are a representation of the objects in an image, as well as their relationships . Typically, a scene graph represents the objects as nodes and joins these objects together using edges and nodes that represent specific types of relationships . It can be useful to construct graphs from images so that we can use the graph information as part of a larger AI solution that operates on graphs. In the post we are going to focus on a particular method for deriving scene graphs from images as proposed by Prof. Fei-Fei Li’s group at Stanford in “Scene Graph Generation by Iterative Message Passing” by Xu et al. .
In my last post, I introduced the concept of discrete time for digital systems. We discussed difference equations and the z-transform. In this post, we are going to use these ideas to write the standard state space expressions for a system’s dynamics in discrete time. That is, we are going to find the matrices phi, gamma and H for the discrete system dynamics as written below :
We have been talking a lot lately about implementing model predictive control (MPC) in discrete time. In this final post on the subject (for now), we are going to look at how we can write MPC as a convex optimization problem. We are going to use fast solvers in our codebase to solve MPC as an optimization problem, and so this post is intended to understand how to structure the MPC problem so it is easy to implement in code. We are going to start by discussing what convex optimization is (at a high level) and then work through the math of casting MPC as a convex optimization problem.
Our lab is working on implementing model predictive control (MPC) for an ongoing project, and so I have been learning the finer points of MPC in order to build this implementation. Since MPC is usually implemented with a computer, it is helpful to understand something about discrete time and how it affects our control algorithms. In this post, I am going to review some of the basics to help inform our understanding of MPC. I will start by explaining why we care about discrete time, the fundamentals of sampling in discrete time, and then introduce difference equations and (briefly) the z-transform. This is going to set up a discussion of how we can write a full dynamics model in discrete time in a follow-on post.
In a previous post, I introduced the basic ideas behind Docker containers and using them on shared workstations. In this post, I want to build on that discussion by talking about some of the best practices recommended by experts when working with Docker containers. This is going to be a grab bag of topics, so feel free to scan through the headings to find information that is useful to you.
In this post we are going to introduce a useful tool called Docker containers. I came across these recently when I decided to finally move my deep learning model off Google Colab and onto a computer equipped with GPUs.*1 The Biorobotics lab uses Docker containers so that every lab member can run their own code inside a confined space on a shared workstation without affecting anybody else’s work. In this post I’m going to explain the motivation for using Docker containers, present the basic underlying principles for how they are set up, and provide some useful commands for using them. I may write a follow-up post focused on best practices, as well.
Recently I have been looking into various topics on control and motion planning, and I wanted to just write a quick post to define some terms. I am also providing a somewhat detailed explanation of one control technique in particular, called model predictive control (MPC).
Woohoo! I have completed the Silver Challenge! Today we wrapped up the course with a review of different RL algorithms used to play classic games like backgammon, Scrabble, chess and Go . I’ll briefly explain some of the intuition behind the TreeStrap algorithm, presenting the state of the art (at least in 2015), and then I wanted to close by highlighting some of the techniques Silver used in his teaching that I thought were really effective .
Today’s lecture discussed more intelligent methods of trading off exploration and exploitation than epsilon-greedy, which is the only method we have seen so far . Silver reviewed three approaches to making this trade-off, and he explained that for all of them, there is a lower bound on how well they can perform . I’m going to briefly explain that lower bound and then introduce the idea behind one of these three approaches which I thought was particularly interesting.
Okay, so I realize this post is coming on Friday, but I did watch Lecture 8 on Thursday. Fortunately I wrote the rules for this challenge myself and so as the official challenge arbiter I say it’s okay to post a day late.
I have been meaning to write this post about the Gumbel-softmax distribution for several months, but I put it on a back burner after I had dug myself into a hole of deep confusion and couldn’t get out. After some encouragement from my advisor, I decided to pick it up again, and this time I think I was able to figure things out.*1 So in this post, we are going to learn how the Gumbel-softmax distribution can be used to incorporate categorical distributions into algorithms that use neural networks and still allow for optimization via backpropagation [1, 2].
Today was such a weird lecture because we did not get to actually see Silver move around and point at things on the slides like we usually do. It added a weird little disconnect. Nevertheless, today’s lecture was very interesting because it covered policy gradient approaches to learning the optimal policy. This topic has been my focus recently as I have been learning about DDPG and other policy gradient methods for MolGAN. So although I have previously written about many of the topics he covered today, I wanted to share one “aha!” moment that I had, and a nice summary slide that Silver provided.
Today’s lecture was on value function approximation, which was good but, I have to be honest, not quite as ground-breaking as some of the other lectures we’ve seen so far. The key idea that I took away from this lecture was that we can think of learning the value function approximation as a supervised learning problem, but replace the training data supervisor with a target provided by our different policy evaluation approaches (Monte Carlo, TD(0) and TD-lambda) .
Things are getting exciting on the Silver Challenge! Today we started learning about model-free control, and various on-policy and off-policy approaches to solving the problem. We ended the lecture with an introduction to Q-learning, which I want to recap here because it seems to be a very important algorithm to know in the world of RL.
It’s Day 4 of the Silver Challenge. I am going to write a very brief post today, because I just wanted to talk about bootstrapping. I have never understood what the term meant before, but Silver explained it quite nicely. Bootstrapping is used during policy evaluation to substitute the remainder of a trajectory with an estimate of what will happen from that point onwards . Bootstrapping allows you to run your policy evaluation in real time because you do not need to watch the entire episode run before evaluating the policy - you can make an update part way through because you are making a prediction about the final outcome of that trajectory .
Welcome back to Day 3 of the Silver Challenge! Today we learned about dynamic programming. I was really glad to encounter this topic again because although I have heard about dynamic programming in other controls courses, I have never felt as though I had a good grasp of what it was. In fact, this lecture solidified my understanding of several key concepts, which I will review briefly here, including: dynamic programming, the Principle of Optimality, and the different approaches to applying dynamic programming to reinforcement learning. Let’s dig in.
It is Day 2 of my (entirely made-up) Silver Challenge! Today I want to highlight what I learned about the Bellman Equation, which I have also struggled to understand in the past. The Bellman Equation (in various forms) is used to recursively solve for the value function in a reinforcement learning problem . We saw the Bellman Equation used to solve Markov Reward Processes (MRPs) and Markov Decision Processes (MDPs) . In both cases, the Bellman Equation recursively decomposes the value function into two parts: (1) the immediate reward and (2) the discounted value of the successor state . Silver further emphasizes that the key idea of the Bellman Equation is that it allows you to solve for the value function .
Okay, it has taken us a little while to get to this point, but I think we are ready to introduce the Deep Deterministic Policy Gradient (DDPG) method of learning the policy function for a reinforcement learning situation. This approach was developed by a team from DeepMind and presented in 2016 by Lillicrap et al. . In this post we will explain in some detail how DDPG works and then see how the MolGAN authors adapted it for their purposes.
I have challenged myself to watch the 10 David Silver lectures on reinforcement learning in 10 days. I’m calling it the Silver Challenge! For each day I am going to try to write about one thing that I found useful from the lecture - not that I will only be learning one thing a day, but I just want to capture a quick highlight.
In my last post, I introduced some reinforcement learning concepts that we will need to understand the policy gradient method used by MolGAN: deep deterministic policy gradient (DDPG). In this post, we are going to touch on a couple more concepts - namely actor-critic methods and off-policy approaches. Then in the next post I promise that we are going to dive into DDPG itself. I know this is taking a long time to set up but I had a lot to learn about reinforcement learning before DDPG would make sense. Let’s get started.
We are back again to revisit another piece of the MolGAN paper and this time we are looking at how they trained their GAN to generate molecules that satisfied particular design criteria . If you recall from the overview of the MolGAN work, they designed a GAN that could generate novel molecules which met some design criteria, like being water soluble or druglike . In order to train the GAN to meet these criteria, De Cao and Kipf used tools from reinforcement learning to optimize their generator . Specifically, they used a policy gradient called Deep Deterministic Policy Gradient (DDPG) . It has been a long time since I have read research related to reinforcement learning, so this post is going to cover some of the basics of reinforcement learning (RL) and introduce the role that policy gradients have in RL. We will introduce algorithms that solve for the policy gradient, such as REINFORCE. In a follow-on post, we will dig into DDPG itself and explain how it works in the context of MolGAN.
Welcome back to the blog. Today we are (still) talking about MolGAN, this time with a focus on the loss function used to train the entire architecture. De Cao and Kipf use a Wasserstein GAN (WGAN) to operate on graphs, and today we are going to understand what that means . The WGAN was developed by another team of researchers, Arjovsky et al., in 2017, and it uses the Wasserstein distance to compute the loss function for training the GAN . In this post, we will provide some motivation as to why we care about the WGAN, and then we will learn about the Wasserstein distance that puts the “w” in WGAN. Next, we will look at how the Wasserstein distance leads to the definition of a new loss function for the WGAN, and various implementations of this loss function. Our discussion closes with a comparison of the original WGAN with an updated version, the WGAN-GP.
Okay, so we have now learned how graph convolution works, and we are ready to apply it to a relational graph convolutional network (R-GCN). In this post we will look at another paper by Prof. Welling’s group, this time applying the concept of graph convolution to R-GCNs . This paper uses R-GCNs to complete missing information in knowledge graphs . Specifically, the R-GCN is used to predict what the missing nodes are . The R-GCN is a form of encoder which converts a graph input to some latent space, that can then be used to predict the missing information in the knowledge graph . In this post we will build up the theoretical understanding for how this R-GCN works, and in a subsequent post we will then place the R-GCN in the context of the MolGAN architecture .
Today we are going to continue building on ideas from spectral graph theory to define an expression for convolution over a graph. This theory is based on several papers by Prof. Max Welling’s team of researchers at the University of Amsterdam into generating novel graphs using relational graph convolutional networks (R-GCNs) [1, 2] and generative adversarial models (GANs) .
We have talked a lot thus far about graph convolutional networks, starting with spectral graph theory, then developing an expression for graph convolution, and then modifying that expression to include directed, labeled relationships between different nodes. Now that we know more about how R-GCNs work, we are going to return to the original MolGAN paper we started with , and see how R-GCNs fit into the MolGAN architecture. We will begin by looking at the propagation function used by MolGAN, and then examine how the final output from several layers of convolution can be aggregated and pooled into a scalar output for the discriminator and reward function.
In this post about the MolGAN architecture , we will actually begin to explore another type of graph neural network architecture, the relational graph convolutional network (R-GCN) [2,3]. In the MolGAN architecture, both the discriminator and the reward function must process an input graph and return a scalar that evaluates whether the graph is real or fake (in the case of the discriminator) and if it successfully meets some design criteria (in the case of the reward function). Both the discriminator and the reward function use a relational graph convolutional network (R-GCN) [2, 3] to evaluate the input graphs. In this post, we will start to understand how R-GCNs work.
There is a lot of amazing work in the area of applying neural networks and AI approaches to graphs. Graphs, as I’ve explained before, are a powerful way of conveying the relationships between objects. They also represent a way that humans interpret and think about the world around them, which gives researchers hope that applying AI tools to graphs will bring us closer to building Artificial General Intelligence (AGI) solutions that mimic the way the human brain works. In this post, I want to explore one specific application of an AI tool to graphs.
We are back to our regularly scheduled programming! In other words, it is time to nerd out on a topic related to artificial intelligence again. Today I wanted to write a very quick post on binary cross-entropy (also called the log-loss function). I am currently learning to build a generative adversarial network (GAN) in TensorFlow. The tutorial computes the loss for both the generator and discriminator networks with binary cross-entropy. I wanted to better understand how this loss function works and why it is an appropriate choice for this application.
Earlier this week my advisors and I submitted a grant proposal to the National Science Foundation (NSF) and now that I have had a little time to celebrate that achievement, I wanted to reflect on the process. The grant writing experience was unlike anything I’d ever done before. I have mixed feelings about writing grants, but I definitely learned a lot and I wanted to share some tips with others who may be going through this process as well. In this post, I will give a brief overview of the writing process, some general reflections on what I learned from writing the grant, and then a list of tips of actions that helped me through the process of writing the grant.
About 2 years ago, a museum exhibit opened on Dec 6, 2018, titled “Women Empowered: Fashions from the Frontline” . I never saw this exhibit in person, but I read about it because the exhibit contained a piece of history that I found particularly inspiring: Alexandria Ocasio-Cortez’s shoes were on display. And they were not just any pair of shoes - they were a worn out, beat-up pair of sensible flats that were literally falling apart because she had canvassed so many houses during her campaign. I think about AOC’s shoes a lot, because to me they are an answer for why knocking on doors and calling voters really matters. AOC had a historic win in the 2018 mid-term election, and a huge contributing factor to her success was her team’s excellent canvassing of potential voters in her district [2,3].
I am reformatting my Bookshelf page and I decided that I would move my reflections on the books I had read to blog posts so that I could keep the Bookshelf list of titles clean and easy to scan. Here are 3 reflections on books that I read in 2019.
We first encountered the reparameterization trick when learning about variational autoencoders and how they approximate posterior distributions using KL divergence and the Evidence Lower Bound (ELBO). We saw that, if we were training a neural network to act as a VAE, then eventually we would need to perform backpropagation across a node in the network that was stochastic, because the ELBO has a stochastic term. But we cannot perform backpropagation across a stochastic node, because we cannot take the derivative of a random variable. So instead, we used something called the reparameterization trick, to restructure the problem. In this post, I am going to go into this solution in more detail. First I will outline why we need the trick in the first place, and then I will try to provide more insight into the mathematics behind the reparameterization trick.
We have been discussing a recent paper by Veerapaneni et al. that presents a novel way of teaching a computer to think and reason by representing its environments as entities that are related to each other by some laws . In this final post summarizing this paper, I want to explain (at a high level) how the OP3 learner is able to connect objects that it sees in video frames to entities in its latent space . We concluded the previous post by saying that the method OP3 uses is called “amortized interactive inference” . Let’s find out what that means.
In the last post, we introduced a new paper presenting an object-centric perception, prediction and planning (OP3) learner . We ended the post by noting that this learner must approximate a posterior predictive distribution of observations using Bayesian variational inference . In this post I am briefly going to summarize Bayesian variational inference and then explain how it is used in this paper. If you want to see another machine learning tool that uses variational inference, please take a look at my posts about variational autoencoders.
Today I am going to try to tackle a very impressive (and very dense) paper that was recently published by Veerapaneni et al., a group of researchers at UC Berkeley, Stanford and MIT . Their paper explains how they taught a computer to learn to identify objects in an image, model them as a graph, learn something about their dynamics from videos and then perform manipulation tasks without prior training . It looks like it could be a huge step forward in teaching a computer to reason like a human about its environment (I think it has been out for about 2 weeks as of this post so we should probably wait for the experts to tell us if it’s truly as impactful as I think it is). This paper builds on the ideas about graph networks we previously discussed, and I was particularly interested in how the authors taught a computer to construct a graph network on its own. I am going to try to summarize the major contributions of this paper, and it may take a couple of posts to really capture everything, so please hang on to your hats and get ready to read!
I’m going to switch gears in this post and write about a wildly different topic from machine learning. Today, I am going to share what I have recently learned about a fluorescent microscopy imaging technique which is called widefield Fluorescence Resonance Energy Transfer* (W-FRET) [1,2]. I want to understand how it works because it is a useful tool for studying DNA origami, but as simple as the underlying concept is, the analysis process was a little tricky to understand. I’m sure I still don’t have a complete grasp of it but I’ll share what I know so far. I’ll start by outlining the basic principle, then I will explain how the analysis is done to obtain a precision FRET image .
As part of a project I am working on, I have encountered the two concepts of the maximum likelihood estimate and the maximum a posteriori estimate. I don’t really understand their significance (although we have touched on the maximum likelihood estimate, or MLE, in our discussions of VAEs and filtering), so I wanted to write a post dedicated to describing these two ideas and their tradeoffs. The first thing to understand about both of these ideas is that they are both ways to estimate the parameters for a model of a process . These concepts are based in probability theory and Bayesian inference, but I am interested in them because they are also powerful ideas in machine learning and related topics. In this post, I will start by describing MLE because we have seen it before, and then introduce the maximum a posteriori estimate (MAP) as a generalization of the MLE approach. But first, we’ll start with some basic terms.
In this post I am going to cover a new (to me) machine learning algorithm called support vector machines. Support vector machines (SVMs) are another form of supervised learning that can be used for classification and to perform regression* . In this post we will start by learning the cost function for SVMs, then we’ll discuss why they are also called “large margin classifiers”, we will introduce kernels, and then we’ll finish with a discussion of how to choose the parameters for SVMs and when to use SVMs or logistic regression . Let’s go!
In this third and final post on filters, I want to explain how another kind of filter, the particle filter, works. We saw during our discussion of Kalman filters that they limit the user to thinking of the system as a Gaussian process, which may not be true in practice. The particle filter is a more general approach, and is popular in robotics and computer vision. If you want a brief, intuitive overview of the particle filter, I would strongly recommend watching this YouTube video by Uppsala University which is a fantastic explanation for what the particle filter is doing . In this post, I will begin by laying out the framework of the particle filter, and then present the algorithm.
In my previous post, I introduced the basic concepts behind filtering (which is used to track objects through time). In this post, I want to introduce the first filter that we will consider, the Kalman Filter. I should note that this filter is less popular in robotics and computer vision, but it is an integral part of classical and modern control theory . I have spent a lot of time studying control theory and I have always wanted to write a description of a Kalman filter, which is why I include it here, even though particle filters are more popular in the robotics and computer vision worlds . Without further ado, let’s get to it.
Recently I wanted to learn more about how filtering works, and so I am going to spend a little time explaining the theory behind filtering, with a particular focus on Kalman filters and particle filters. I actually would argue that “filtering” is a confusing term to use to describe these tools, because even though we typically use them to track objects in space and time, I usually think of a filter as the thing I put my coffee grounds in every morning. But that’s just me.
Whew! That run of posts on variational autoencoders was fun! The reason I wanted to investigate them in detail is because I want to understand how they can be used in conjunction with graphs, instead of single data points. In fact, I want to understand in general how neural networks can be used to process graphs as inputs, instead of scalar or vector values. This is a rich area of research and I thought I would start by reading an excellent review/position paper by a group of researchers at DeepMind . In this post (and possibly a few more, this paper was really long), I am going to discuss some of the basic concepts behind graph networks. Let’s dive in!
In my previous post, I presented the motivation for graph networks as described by a paper by DeepMind that I have been reading . We have covered the ideas of relational inductive bias and combinatorial generalization, which describe how we, as humans, think. The DeepMind paper argues that using graph networks to think like humans can help us solve some key challenges, such as (1) complex language/scene understanding, (2) reasoning about structured data, (3) transferring learning beyond training conditions and (4) learning from small amounts of data/experience . In this post, I want to introduce the basic structure of a graph network and some of the probability concepts that make them work. Specifically, we will discuss the concepts of joint probability distributions and conditional independence. Let’s get started.
This is the third post in a series on graph networks, based on a recent paper by researchers at DeepMind . So far we have discussed the motivation for why graph networks are useful for advancing AI research, and how they work from a probability perspective. Now I am going to briefly explain how the structure of a graph is useful in learning to model real-world processes, and present some of the mathematical framework for how graph networks learn.
Today is going to be a learning bonanza of topics related to machine learning (ML) and reinforcement learning (RL). I am working on a project that involves a lot of advanced ML/RL tools that I am trying to learn about very quickly. To that end, I have decided that today I am just going to write as many posts as possible on the topics that I need to understand. These posts may not be the most detailed essays I have ever written, but as always I will do my best to read several sources, cite them as appropriate, and only write about things I actually understand. Let’s go!
In this second post on VAE’s, I am going to dive into the details of calculating the loss function. You will recall from my previous post that this is going to be more complicated than the loss function for a standard neural network because the variational autoencoder is learning to represent the latent space as Gaussian distribution .
We are now on our third post about VAE’s, and this time we are going to do a deep dive into the probability that dictates how they function. If you recall from our last post, we ended on the observation that the encoder is learning a posterior distribution p(z given x) so that it can convert input data points, x, into a latent space representation, z. We saw that we can use Bayes’ Theorem to write an expression for this posterior distribution, but that it would be intractable to actually calculate the denominator in the expression, p(x). I hinted that we will find a way to approximate the posterior distribution p(z given x) instead, and that we will use the KL divergence to do so. Let’s figure out how this will work.
I just want to write a brief post about overfitting and how we can try to deal with it using regularization. Overfitting occurs when our model can accurately predict the outcome for the training data, but it does not accurately predict outcomes for new data it has not seen before . This happens because the algorithm has used a higher-order polynomial to draw a decision boundary, which very precisely matches the dataset but also produces a very irregularly shaped decision boundary which may not accurately model the true difference between the two datasets, see Figure 1 below .
In the Machine Learning Coursera class by Andrew Ng, I am in the middle of understanding how to improve the learning performance of a neural network. Ng provides some fantastic guidelines on how to properly train a neural network and improve its performance, which I want to discuss here.
In my last post we introduced neural networks, and in today’s post I would like to explain how neural networks “learn” by giving an overview of the learning algorithm. Neural networks learn by iteratively trying to recreate an output, given some inputs, by comparing the output that it generates to the desired value. The difference between the neural network’s output and the desired value can be used to calculate the error for a given learning iteration. The neural network uses the error to update the weights on each node in order to reduce that error on the next iteration. Over many iterations, the neural network can learn to complete the assigned task with a high degree of accuracy. Let’s dive in to understand how the learning process works.
Logistic regression is a form of classification, which is a kind of supervised learning . Logistic regression differs from linear regression because it fits a sigmoid function to the data set, instead of fitting a straight line . (The sigmoid function is going to appear again when we discuss neural networks because it is used there as well .) In this post we will discuss when logistic regression is useful, the mathematics of the algorithm and how to implement it on a simple dataset.
In this post, I am going to introduce the basics of neural networks, because they are fundamentally useful tools in reinforcement learning (RL). If you recall from my earlier post, deep RL uses neural networks as “function approximators” for the Q-function in reinforcement learning. The Q function can be very complex so we need to use something like a neural network to represent it. (In fact, we use neural networks to represent many different complex functions, not just the Q function, although it is a good first example.) In this post we will discuss the architecture of a neural network, some of the mathematics used to make it work, and different flavors of neural networks as well as their applications. It is likely that this post will lead to several others because there is a lot of material to cover!
In a previous post I discussed the basics of using git locally and remotely, and today I am going to extend that discussion to two popular workflows for teams that use git. I will also cover some best practices for using git.
In this first post about using git, a version control system, I am going to explain how it works conceptually, and how to complete basic tasks using the command line. For much more detail I would highly recommend the Software Carpentry tutorial on this topic as a starting point - I wish I had found it years ago! In a future post I will explain two workflows for using git - git-flow and github-flow.
Originally I had intended to make this a single blog post but as I was making notes I decided that there was too much information for one article. Instead, I will break up my lessons learned into two categories: general information on how to write good, robust code, and more specific approaches for implementing algorithms (especially those drawn from academic papers) and how to debug them. This post will start with some general lessons on how to write good code. Please note that I am not a trained computer scientist so I welcome your thoughts on the topic if you have different opinions or better ideas!
This afternoon Carnegie Mellon University announced that they will be moving all classes online for the rest of the semester to combat the coronavirus. It’s a scary message to get, and I wanted to reassure myself and others by sharing some information that I learned thanks to Greta Thunberg’s Instagram account this morning .
In today’s post, I want to discuss the basics of linear regression and gradient descent. These are two basic tools used in supervised learning (as discussed in my previous post). Linear regression is used to fit a mathematical function (like a line or a parabola) to a set of data . And gradient descent is used to find the local optima of such a function .
In addition to reading about reinforcement learning, I have also been following Andrew Ng’s course on machine learning (hosted on Coursera), to understand some of the fundamentals of machine learning as a whole. In this post, I’m going to discuss some of the basic principles of machine learning and hopefully follow on with more posts as I keep working through the course.
Machine learning has interested me for a couple years, ever since I read Pedro Domingos’ book, The Master Algorithm. In the lab, we have been discussing reinforcement learning lately and I have been trying to teach myself about this topic. This post is going to be my first introduction to reinforcement learning, and I hope to follow it with more posts soon.
I have been having some phenomenal conversations over the past several days about the way we teach and how we can do better, specifically at the university level. These conversations have been slowly introducing a new idea to me: that education is an act of democratization. If we teach well, we empower our students to do anything they want with their careers and their lives. If we teach poorly, our students are subtly pushed out of areas they were once interested in, or denied entry to challenging fields.
Recently several members of the Microsystems and Mechanobiology Lab attended the annual meeting of the Biomedical Engineering Society in Philadelphia, including four of my undergraduate student mentees, who came to present their work at a poster session. Several times during the conference they asked for advice on how to make the most of it. I wanted to write a post to share what I know (and invite anyone reading it to share their own thoughts). Hopefully I will get better at this as time goes on. But for now, let’s talk about How To Conference!
This week I am learning about the fundamentals of designing and building DNA origami structures. I should probably write a post just about cool stuff you can build with DNA origami to get you excited, but in the meantime, check out this cool TED Talk by one of the fathers of the field, Paul Rothemund. This post is going to dive into the fundamentals of designing DNA origami structures. Everything I present here I have learned from my lab mates at the MMBL lab.
We are pretty much down to the wire (the qual is tomorrow) so I just want to write a couple notes on how to design lead and lag compensators with Bode plots. All of the information presented here comes from Nise .
This post will show my solution to Example 11.4 in Nise . The design objective is to build a lag-lead compensator for the system shown below. The performance specifications are:
In this post I am going to walk through the lead compensator design process using Bode plots. This question is drawn from Nise’s Example 11.3 . The objective of this question is to design a lead compensator for the plant shown below. The design requirements are:
Although I have written a post on drawing Nyquist plots before, I was still feeling shaky in my understanding of the fundamentals, so I am going to use this post to try to cement my knowledge of where the Nyquist plot comes from. Please note that this entire explanation comes from Nise ; since this post is so detailed I am not going to cite specific sentences. Please understand that all information here comes exclusively from Nise .
In this post I am going to design a proportional controller to meet a set of specifications using the Bode plot of the plant. This is presented as Example 11.1 in Nise . The question asks for the gain required to yield a 9.5% overshoot in the transient response, and requires the solution to be presented in the frequency domain . Here’s the flow of the answer :
I am one week away from my qualifying exam and I am hustling to cover a couple more topics before the exam. Today I am going to do a couple posts on designing lag compensators with Bode plots as my design tool. (Hopefully I will cover lead compensators tomorrow.) I have several earlier posts on lead and lag compensators if you need a basic review.
I’m beginning to feel like there is more to know than I will ever have time for…during our study session last week I also realized that my Taylor series knowledge was pretty rusty. I wanted to write a very short post on how the Taylor series is derived to help me remember it in case I get asked in the exam.
Last week during a controls qual study session, I realized that I did not fully understand what non-minimum phase systems were. So I am going to briefly discuss what they are physically, and how that manifests itself in Bode plots, time domain responses and Nyquist plots. Let’s get started!
I am still finding linearization a tricky subject, but I had to linearize an inverted pendulum system for a class this weekend, and going through that process helped me to clarify for myself how linearization should work ,. If you haven’t read it already, please start with my earlier post on the fundamental concepts behind linearization.
This is a bonus post! I wasn’t sure if I should share this solution too for fear of exhausting the reader, but I’d be curious to compare my solution with others. This is my solution to Nise’s Skills Assessment Problem 9.1 . I’m not going to lay out a detailed explanation (please see my post on designing lag compensators) but my solution is shown below.
This is the 3rd post in my series on solving controller design problems from Nise . This time we are going to design a pure derivative controller to meet a settling time requirement. The interesting part of this problem is how we used the settling time requirement to choose the controller zero’s location.
This is the second post in a series where I am solving examples of controller design problems from Nise . This example is focused on designing a true lag compensator. The useful piece of this problem is using the gain ratio (of compensated : uncompensated gain) to choose the pole and zero locations.
In this next series of posts, I am going to do a couple of examples from Nise  to practice designing different kinds of controllers. Some of these have been solved fully in the textbook, and others are provided as practice problems, so if you cannot read my solutions, please look in the textbook for more clarity. I should note that all my images today were taken from the whiteboard, instead of my customary paper-based figures, because I am also trying to practice good board management for the actual qualifying exam.
Although I do not think it is going to play a major role in our upcoming controls qualifying exam, I realized that I could not remember how to derive the equations of motion for some simple systems. Here I am going to present some examples out of the undergraduate textbook  and in a future post I would like to focus on something a little more challenging, like the levitating magnetic ball.
This is now my third post on designing lead and lag compensators (I wrote a post on lead compensators, and a summary of lead, lag and lead-lag compensators. I am finding controller design to be the most challenging concept that I have encountered so far in studying classical control theory. I just did my first mock qualifying exam this morning and this was the part of the exam that tripped me up. So, I’m going to write one post here that summarizes the basic concepts as clearly and succinctly as possible. Then I will post a couple of design examples to walk through the process of designing controllers.
I just finished my mock qualifying exam this morning and one of the pieces of advice I got from some of the more senior students was to know when to use Bode plots, Nyquist plots and root locus plots. I am going to start a list of when to use each kind of plot, and I may come back to update this post occasionally based on what I learn as I do more problems.
Recently I wrote a post on stability of linear systems where I introduced the Nyquist plot. Now I want to go into the details of how to draw one without using a Bode plot. If I have time in the future I will also explain how to draw a Nyquist plot using the Bode plot - it is an easier method, in my opinion, but it assumes that you have a reliable Bode plot to start with.
Gain and phase margins are a useful concept to understand when designing controllers using Bode plots. Today I am going to explain what they are and how we can find them on both a Bode plot and a Nyquist plot. If we have time, I will go into how the margins are related to the sensitivity function.
In practicing for quals last week I managed to get myself very confused about linearization, so I wanted to go back to fundamentals. In this post I will cover why we linearize, the basic concept behind it, and the equations for Jacobian linearization, which is a standard method for linearizing. As it turns out, I’m still a little confused about the particular example that started me down this road in the first place, so hopefully in the near future I will return with another post that straightens that issue out. But for now, let’s begin with the stuff I do understand.
We recently encountered some exam problems where the Bode plot of the system looked normal, but the root locus plot was telling us that the system was actually unstable. I wanted to understand in more detail what a stable system was, and why the Bode plot and the root locus plot could tell us two very different things. First I will describe some of the basic concepts of stability, and then I will introduce a new tool, the Nyquist plot, which helps us address the mismatch between Bode and root locus plots.
In this post I want to review designing PI/lag and PD/lead and PID/lag-lead controllers/compensators, primarily using root locus plots. All of the information provided here comes from Nise; I have not cited individual sentences because in this post because I have only one source and I am stating explicitly here that all of the information comes from Nise .
In this blog post I want to describe the basics of linear time invariant (LTI) systems.
I have never fully understood the diffraction limit in microscopy and I thought it would be a good idea to learn the basic principles behind this concept before my research qualifying exam tomorrow. I am going to try to explain it succinctly here in a utilitarian way.
Recently I realized that I do not know how the DNA research field models the structural deformation properties of DNA and its various constructs. This is something that I am going to need to learn in much more detail, but as a first exploration of the field (before my research qualifying exam tomorrow), I wanted to write a brief summary of what I have learned so far. Please take this post with a grain of salt and know that I have a lot more to learn on this topic!
I recently have been talking to some of my colleagues about how to do literature reviews effectively, and I thought it might be useful to share the techniques that I have been using over the past year. I’d like to say right away that I am definitely not an expert (I have never even written a literature review article!) but I’ve tried many different things and I thought those adventures might be worth sharing. I also noticed that I couldn’t find any articles like this one when I was trying to teach myself how to do a literature review - that is, no one was sharing the activities they actually did on a regular basis to stay up to date on the literature. I’m sure most of us don’t need a perfectly rigorous method, but all of us just need to have some means of staying abreast of the key developments in our field. This article is meant to share what I have been doing because I have found it possible to do these things regularly, and I also want to start a conversation with my colleagues because I think I can improve this in many ways.
This post is going to be about understanding Bode plots and how to draw them. The Bode plot is the preferred way of depicting a system’s response in the frequency domain. It is popular among hardware engineers because it allows you to see how your system will behave over a wide range of frequencies and design accordingly. It can be obtained experimentally by providing inputs to the system and measuring the output frequencies (if the system is stable) and it can also be obtained by hand (even if the system is not stable) . The tricky thing about Bode plots is that they will not always tell you if your system is unstable or not. As we will see in an example, it is possible to draw a perfectly normal-looking Bode plot for a system that we know is unstable .
In this post I want to briefly cover the basics of the time and frequency domains and how we can switch between them with the Fourier and Laplace transforms.
In this post I want to talk about the Final Value Theorem and steady state error because these concepts will be useful as we go on to discuss controller design in more detail. Steady state error describes how far the system deviates from the desired reference signal as time goes to infinity . It is a commonly used performance metric for controller design, which is why I want to establish the concept now before we apply it in future posts on controller design . I will start the discussion, though, with the Final Value Theorem because that concept is critical to understanding steady state error.
I am really excited to be writing this post because I have been working on my understanding of this topic for a long time, and I can’t wait to practice explaining it to others. Today I want to explain the Scallop Theorem, a central tenet of the microswimmer research that I am doing at Carnegie Mellon University. The Scallop Theorem was introduced in a seminal paper in 1977 by the Nobel Prize winning physicist, E.M. Purcell . In my mind, it is also one of the most convivial scientific papers I have ever read, which makes coming back to it a joyful occasion, instead of the frustrating endeavor that it could have been. It is fun to read the paper in its entirety, but today I will focus on one particular part of it. Today, we will understand why it is so hard to build microswimmers.
This post is a follow-on to my original introduction to drawing root locus plots. Last time we learned what a root locus is and how it can be drawn. Today I am going to focus on how we can derive useful information from the plot. Then I will close out this series with at least one more post on how we can then use root locus plots to design different kinds of controllers.
We are going to bounce around a couple different controls topics in these next posts, I’m sorry for not choosing a more logical order. Please bear with me!
Word count: 445
I have been studying the basics of control theory in preparation for the Mechanical Engineering qualifying exams coming this fall. I want to write a series of posts that will help me consolidate what I am learning and practice explaining it to others. There is a lot of material to cover, but some of the topics that I am most concerned about are related to designing dynamic compensators and controllers. Today, I am going to review lead compensation.
Word Count: 499
Word Count: 500
Wow, many thanks to Barry Clark and his Jekyll Now repository! I have literally set up my website in 10 minutes while sitting at an airport terminal.