Welcome back! Previously we talked about how Fontaine & Nikolaidis (2021) and Tjanaka et al. (2022) used gradients to simultaneously optimize fitness and explore diverse behavior descriptors in MAP-Elites, as well as how to estimate those gradients using noisy samples (see Differentiable Quality-Diversity part 1). Towards the end we saw that this technique had comparable performance to CMA-ME (Fontaine et al. (2020)) when evolving simple genomes with few parameters, but we also hypothesized that using gradients would give us a considerable advantage when evolving high-dimensional genomes such as neural networks (NN). This is exactly what we're going to attempt in this blog – evolving a diverse archive of NNs using the Differentiable Quality-Diversity (DQD) technique. Buckle up, it's going to be a long one!

CMA-MEGA(ES): A first attempt

Actually, though not optimal, we can use exactly the same workflow as last time. In this approach, we use noisy samples to estimate gradients w.r.t. the genome for maximizing fitness and each behavioral descriptor (BD), then we take a weighted sum of these gradients, where the weights are either randomly sampled or learned with CMA-ES. This sum gives us the gradient for exploring along a direction in the fitness-BD space, so we update the genome with this gradient – rinse and repeat. The only difference is that we are now evolving the weights and biases of an NN, which is easily bridged by using packages such as TensorFlow or PyTorch for the NNs and serializing (flattening) their weights and biases into a long 1D array whenever we need the genome.
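
For instance, here is a minimal sketch of that serialization step, assuming PyTorch and the (28, 64, 64, 8) architecture used later in this post (not the repo's actual code):

```python
# Serialize a policy network's weights and biases into a flat 1D genome,
# and load a (possibly mutated) genome back into the network.
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(28, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 8), nn.Tanh(),
)

def to_genome(net: nn.Module) -> torch.Tensor:
    # Flatten every parameter tensor and concatenate into one long 1D array.
    return torch.cat([p.detach().flatten() for p in net.parameters()])

def from_genome(net: nn.Module, genome: torch.Tensor) -> None:
    # Slice the flat genome back into the original parameter shapes.
    idx = 0
    for p in net.parameters():
        n = p.numel()
        p.data.copy_(genome[idx:idx + n].view_as(p))
        idx += n

genome = to_genome(policy)   # the 1D array handed to CMA-MEGA as the genome
from_genome(policy, genome)  # restore a genome into a usable network
```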

I tried this simple adaptation, and although CMA-MEGA(ES) outperformed CMA-ME, I couldn't get either to produce satisfactory results because they simply took too long to train. On my stock 8GB 2017 MacBook Pro, 1000 generations of CMA-MEGA(ES) evolving (28, 64, 64, 8) NNs took approximately 40 minutes. So I will drop CMA-MEGA(ES) for the moment and switch to CMA-MEGA(TD3) instead (though you can certainly try out CMA-MEGA(ES) on NNs by running ES_agent.py in the repo provided below). CMA-MEGA(TD3) is another variant of CMA-MEGA, also introduced in Tjanaka et al. (2022), and as we will see soon, it produces the best results we've seen so far.

CMA-MEGA(TD3): What is TD3 and why do we need it?

The main bottleneck slowing down CMA-MEGA(ES), I would say, was its gradient estimation procedure, which required sampling and evaluating a large number of noisy genomes. The evaluations were especially costly since they required running complete rollouts in simulation. Given this, wouldn't it be nice if we could somehow learn a model to predict gradients without needing additional evaluations? That's what TD3 is for! TD3 is an off-policy actor-critic reinforcement learning (RL) algorithm. Over the next few paragraphs, I will explain how we can use it to estimate gradients. As before, I will quickly explain the intuitions without focusing too much on the details, but if you want a closer look at TD3, I recommend reading Chris Yoon's DDPG blog followed by Donal Byrne's TD3 blog (there's also always OpenAI Spinning Up if you want to see the code).

RL background leading to TD3 (feel free to skip if you already know RL well)

RL basics

Anyway, let's begin by reviewing what we want our NN to accomplish. We have a quadruped robot that will run in simulation for up to 1000 timesteps. At each timestep, the robot expects an action command specifying the angles of each of its 8 joints, while reporting back its 28 sensory states, including things such as IMU readings and proprioception. We want our NN to take in the 28 sensory states and output 8 joint angles to command the robot. Moreover, we want the NN to output actions at each timestep such that, when chained together, these actions make the robot walk as far as possible.
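
To make this concrete, here is a minimal rollout sketch (not the repo's actual loop), assuming a gym-style QDgym environment with the `QDAntBulletEnv-v0` id used by the repo and the old 4-tuple `step` API; the small MLP stands in for the (28, 64, 64, 8) controller:

```python
# The policy reads the 28-dimensional sensor state and outputs 8 joint
# commands, for up to 1000 timesteps; the summed reward is the walking score.
import gym
import torch
import torch.nn as nn
import QDgym  # assumed to register the QD*BulletEnv environments

policy = nn.Sequential(nn.Linear(28, 64), nn.Tanh(),
                       nn.Linear(64, 64), nn.Tanh(),
                       nn.Linear(64, 8), nn.Tanh())

env = gym.make("QDAntBulletEnv-v0")
state = env.reset()
fitness = 0.0
for t in range(1000):
    with torch.no_grad():
        action = policy(torch.as_tensor(state, dtype=torch.float32)).numpy()
    state, reward, done, info = env.step(action)
    fitness += reward
    if done:
        break
print("walking score (fitness):", fitness)
```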

Compared to our earlier controller strategy based on sinusoidal gaits, an NN controller, if properly trained, would be much more robust: unlike a fixed gait, the NN reads in the sensor states and can therefore correct itself when the robot starts losing balance. However, the NN is also much harder to train, firstly because of its large number of parameters, and secondly because it is learning a much harder problem. Unlike a gait, the NN has to learn every state-action mapping from scratch, and even a single bad action among otherwise good ones can ruin the walking score.

Obviously, when a bad final score is caused by a single bad action while all the other actions are good, we would want to penalize only that bad action. It therefore makes much more sense to assign a reward to each individual action, rather than assigning a single fitness to the entire rollout as we had been doing.

But how do we assign rewards to each action at each timestep? For a start, we can hardcode some rewards for behaviors that are usually indicative of good walking, such as "not falling at the current timestep" (i.e. the robot's torso not touching the ground). However, on their own, these would not be fair rewards for the actions, because an action at an earlier timestep affects later timesteps as well. For example, as bipedal walkers, we know that we fall down not because of the action we take the moment before hitting the ground, but because of an earlier bad action that caused us to lose balance. We want this kind of long-term effect to be represented when assigning rewards to each action predicted by the NN. Such an action evaluator that recognizes long-term effects is known as the Q-value, $Q(s,a)$, where $a$ is the action being evaluated and $s$ is the sensor state when action $a$ is taken (since actions are contextual: a good action in one state may be a bad action in another). Formally, we define $Q(s,a)$ as follows:

$ Q_{\pi}(s_t,a_t) = E_{\pi}[\Sigma_{j=t}^T \gamma^{j-t} r_j] $

$\pi(s) = a$ is the current policy (NN), $t$ is the current timestep, $r_j$ is the immediate reward at timestep $j$ (from $t$ onwards), and $\gamma$ is the reward discount. So in layman's terms, this formula defines $Q(s,a)$ as the total sum of rewards we expect to receive after executing action $a$ in state $s$ at timestep $t$, if we choose all following actions according to the policy $\pi$. The discount factor $\gamma$ is there because, intuitively, we are less certain about whether we can actually receive a reward far into the future, since anything can happen between the current and future timesteps.
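
As a toy illustration, the discounted return computed from one recorded rollout's stepwise rewards is a single (Monte Carlo) sample of this expectation:

```python
# Given the stepwise rewards of one rollout that followed policy pi, the
# discounted return from timestep t is one sample of Q_pi(s_t, a_t).
def discounted_return(rewards, t, gamma=0.99):
    return sum(gamma ** (j - t) * rewards[j] for j in range(t, len(rewards)))

rewards = [1.0, 1.0, 0.5, -1.0]          # hypothetical stepwise rewards
print(discounted_return(rewards, t=0))   # one sample of Q(s_0, a_0)
```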

DQN

Ok, so we have named and defined this action evaluator $Q(s,a)$, but how do we obtain it? This is where temporal difference (TD) learning comes in. To begin, let's first rewrite the aforementioned definition of $Q(s,a)$ into its bootstrapped form:

\[\begin{align*} Q_{\pi}(s_t,a_t) &= E_{\pi}[\Sigma_{j=t}^T \gamma^{j-t} r_j]\\ &= E_{\pi}[r_t + \Sigma_{j=t+1}^T \gamma^{j-t} r_j]\\ &= E_{\pi}[r_t] + E_{\pi}[\gamma\Sigma_{j=t+1}^T \gamma^{j-(t+1)} r_j]\\ &= r_t + \gamma E_{\pi}[\Sigma_{j=t+1}^T \gamma^{j-(t+1)} r_j]\\ &= r_t + \gamma Q_{\pi}(s_{t+1}, a_{t+1}) \hspace{1cm} \text{($s_{t+1}$ is the new state after executing action $a_t$ in state $s_t$)}\\ &= r_t + \gamma Q_{\pi}(s_{t+1}, \pi(s_{t+1})) \hspace{0.75cm} \text{(because we are following policy $\pi$)} \end{align*}\]

After this derivation, we see that $Q(s_t,a_t)$ can be expressed in terms of the next timestep's Q-value $Q(s_{t+1},a_{t+1})$. By the same logic, $Q(s_{t+1},a_{t+1})$ can be written in terms of $Q(s_{t+2},a_{t+2})$, $Q(s_{t+2},a_{t+2})$ in terms of $Q(s_{t+3},a_{t+3})$… until the last step $Q(s_{T-1},a_{T-1})$ is written in terms of the final, overall walking score, which we know as the fitness! Does this mean we can obtain all Q-values by tracing back from the fitness? Unfortunately it's not that simple. Firstly, we don't know the dynamics of the state transitions (i.e. we don't know what the new state $s_{t+1}$ will be after executing action $a_t$ in state $s_t$), and we cannot trace back from the fitness if we don't even know which states lead to the terminal state. Secondly, we don't know the policy (NN) $\pi$ either, since that's also what we need to learn! We defined the Q-value earlier as "the total sum of rewards… if we choose all following actions according to the policy $\pi$", so without a policy, we cannot calculate Q-values.

Let's address these problems one by one, starting with the first problem of not knowing the state transition dynamics (for the moment, assume we have a policy $\pi(s) = a$). Well… it wouldn't be entirely accurate to say we know nothing about the state dynamics. We do have a simulator (in our case QDgym) for evaluation, and if we simulate with it for a large number of episodes, recording all state-action transitions in the process and arranging them into a huge table, we can use this table as a dataset for training a dynamics model. Many RL algorithms learn these dynamics, and they are known as model-based RL. In our case, however, we will learn Q-values without explicitly learning the dynamics, i.e. model-free RL. Earlier we established the relation between "neighboring" Q-values as $Q_{\pi}(s_t,a_t) = r_t + \gamma Q_{\pi}(s_{t+1}, a_{t+1})$. The relation was abstract, but all we need are the $s_t$, $a_t$, $r_t$, $s_{t+1}$, and $a_{t+1}$ values to make it concrete. By recording the states $s_1, s_2, \ldots, s_{T-1}, s_T$, actions $a_1, a_2, \ldots, a_{T-1}$, and stepwise rewards $r_1, r_2, \ldots, r_{T-1}, r_T$ during simulation, we can obtain exactly those values!

For this purpose, we create what is called a replay buffer, where we arrange the recorded values into tuples $(s_t, a_t, s_{t+1}, r_t)$. Notice we are arranging the values into units of "transition steps", which might seem a little confusing at first. We have the entire transition sequence at the end of each simulation, so why break it down into steps instead of using the whole sequence? We could, but it would restrict us to the on-policy approach to RL. In this approach, we evaluate a policy $\pi$ in simulation, record all states, actions, and rewards, trace back to learn the Q-values $Q_{\pi}$, and after learning we often throw this recorded sequence away. Why? Because the sequence $s_1, a_1, s_2, \ldots, s_{T-1}, a_{T-1}, s_T$, as a whole, was specific to $\pi$ (i.e. only by following policy $\pi$ could we experience this exact sequence), and we wouldn't be able to learn anything more from it when, for example, we switch to a different policy $\pi'$. (Edit: In hindsight, this is NOT accurate, because we can actually learn even when the behavior and target policies are different, albeit with lesser efficiency. The difficulty of learning from data collected with a different policy is known as the distribution shift problem. Using importance sampling, and/or constraining the learning according to the difference between the behavior and target policies, are both solutions to it.) The on-policy approach is fast and reliable, but as you might have observed, it is not very data-efficient, since each evaluation can only be used to learn about one policy.
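
Here is a minimal replay buffer sketch (with a `done` flag added, which most implementations also store, though it is not part of the tuple above):

```python
# Transitions are stored as individual (s_t, a_t, s_{t+1}, r_t, done) steps and
# sampled uniformly at random, independent of the policy that generated them.
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest steps are dropped when full

    def add(self, state, action, next_state, reward, done):
        self.buffer.append((state, action, next_state, reward, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)
        states, actions, next_states, rewards, dones = map(np.array, zip(*batch))
        return states, actions, next_states, rewards, dones
```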

That's exactly why we divide the sequence into individual steps in our replay buffer, and thus follow the off-policy approach to RL. Whereas a sequence is specific to a policy, steps aren't: $(s_t, a_t, s_{t+1}, r_t)$ simply records (a part of) the state transition dynamics, and remains valid regardless of which policy we are following! To exploit the advantage of steps being policy-neutral, we employ a learning method different from backtracking. Specifically, we sample a random $(s_t, a_t, s_{t+1}, r_t)$ tuple from the replay buffer, and update $Q_{\pi}(s_t,a_t)$ with the aforementioned formula $Q_{\pi}(s_t,a_t) = r_t + \gamma Q_{\pi}(s_{t+1}, \pi(s_{t+1}))$ (we currently assume an arbitrary $\pi$, but we'll soon talk about how to learn it). In the beginning, these updates will be wildly inaccurate because we don't know $Q_{\pi}(s_{t+1}, \pi(s_{t+1}))$ and thus have to initialize it randomly. However, somewhat like in the tabular value iteration you learned in your intro AI class, accurate Q-values will eventually start propagating back from the terminal states, until all Q-values become accurate.

If the states and actions are discrete (i.e. they have a finite number of possible values), we can store and update Q-values in a table. But if the states and actions are continuous, like in our case, we can use an NN to predict and learn the Q-values. We adapt the update formula into a TD error $L = MSE(Q_{\pi}(s_t,a_t),\ r_t + \gamma Q_{\pi}(s_{t+1}, \pi(s_{t+1})))$ and minimize this error to train the NN. And with that, we have the general intuition behind Deep Q-Networks (DQN)!
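
As a rough sketch of what this update looks like in code (with hypothetical actor/critic architectures, glossing over TD3's twin critics and target networks, and assuming the replay-buffer batch has already been converted to float tensors):

```python
# The critic predicts Q(s, a) and is regressed towards the bootstrapped target
# r_t + gamma * Q(s_{t+1}, pi(s_{t+1})) with an MSE loss (the TD error L).
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(28 + 8, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(28, 64), nn.Tanh(), nn.Linear(64, 8), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
gamma = 0.99

def critic_update(states, actions, next_states, rewards, dones):
    # states/actions/next_states: (batch, dim); rewards/dones: (batch,) tensors.
    with torch.no_grad():  # the target is treated as a constant
        next_actions = actor(next_states)
        target_q = critic(torch.cat([next_states, next_actions], dim=1))
        target = rewards.unsqueeze(1) + gamma * (1 - dones.unsqueeze(1)) * target_q
    q = critic(torch.cat([states, actions], dim=1))
    loss = nn.functional.mse_loss(q, target)  # the TD error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```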

DDPG

Learning a DQN from the replay buffer by minimizing the TD error allows us to learn Q-values without learning the state transition dynamics. Now we can address the second problem, which is the chicken-and-egg paradox: on the one hand, we assume a policy when learning Q-values, and on the other, we need Q-values to learn the policy. We'll have to take turns learning the policy and the Q-values. We start with a random policy (e.g. an NN with random weights and biases), evaluate it to fill some transition steps into the replay buffer, and use the replay buffer and the random policy to learn the DQN. Obviously, since the policy is bad, its corresponding Q-values are going to be low in general. But still, the random nature of the policy likely means the evaluation will stumble upon some relatively good states and actions, which will also be reflected in the Q-values learned by the DQN. The policy NN can then learn to reproduce these slightly better actions by finding actions that maximize the Q-value evaluations provided by the DQN. This leads to a better policy NN, which allows the DQN to learn higher Q-values, which will again improve the policy NN, and so on.

That was the general intuition of how we can simultaneously learn the DQN and the policy NN by iteratively using one to improve the other. Since the policy NN chooses the actions with which to explore in simulation and hence train the DQN, the policy NN is also called the actor. On the other hand, since the DQN provides the Q-values that guide policy training, the DQN is also called the critic. Let's formalize things by writing down the update rules. For the critic, we already defined the update as minimizing the TD error $L = MSE(Q_{\pi}(s_t,a_t),\ r_t + \gamma Q_{\pi}(s_{t+1}, \pi(s_{t+1})))$ (MSE stands for Mean Squared Error), and we can leave the gradient calculation w.r.t. the critic's weights and biases to backpropagation. For the actor, we mentioned that we want to "maximize" the Q-values, so we can simply set the loss to the negated Q-value, $L(\pi(s)) = -Q(s, \pi(s))$, and again leave the exact gradients to backpropagation.
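
Continuing the previous sketch, the actor update is then just gradient descent on that negated Q-value (again omitting TD3's target networks and delayed updates):

```python
# The actor's loss is -Q(s, pi(s)); backprop flows through the critic into the
# actor, pushing it towards actions the critic rates higher.
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_update(states):
    actions = actor(states)
    loss = -critic(torch.cat([states, actions], dim=1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```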

By the way, Deep Deterministic Policy Gradient (DDPG) is called "Deterministic" because its policy NN outputs a deterministic action $\pi(s) = a$ (which is what we've been doing). Some other RL algorithms produce stochastic actions; for example, some output the mean and covariance of a normal distribution from which the action is sampled.

TD3 and DDPG also have a few more tricks to help with the algorithm’s stability, but they are not that important for our intuitions, so I will refer you to the quoted tutorials for this information.

I know TD3, just tell me how to use it in CMA-MEGA

We mentioned at the start that TD3 would be used to calculate the gradients for CMA-MEGA. Specifically, we need $\nabla_{\theta}Fitness, \nabla_{\theta}BD_{1}, \nabla_{\theta}BD_{2}, \ldots$, where $\theta$ stands for the genome, which in CMA-MEGA is the long array containing all the serialized weights and biases of a policy NN. We just saw that we can negate the critic DQN's Q-value to get the loss for the actor, and then backpropagate this loss through the actor to get gradients w.r.t. its weights and biases. Notice that this process really consists of two modules: (1) getting the negated Q-value and (2) backpropagating through the actor NN, and module (1) doesn't really care which actor NN sits in module (2). We can essentially swap in any NN, give it the negated Q-value as the loss, and backpropagate to get the gradients for maximizing the Q-value w.r.t. its weights and biases. In other words, once we have the critic DQN, we can calculate the Q-value's gradient w.r.t. any genome NN!
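
Here is a minimal sketch of that idea: the same negated-Q loss, backpropagated through an arbitrary genome network and flattened into the genome's shape (reusing the hypothetical `critic` from the earlier sketches):

```python
# The trained critic can provide the -Q loss for *any* policy network, not just
# the TD3 actor; backpropagating it gives a gradient estimate w.r.t. the genome.
import torch

def objective_gradient(genome_net, critic, states):
    genome_net.zero_grad()
    actions = genome_net(states)
    loss = -critic(torch.cat([states, actions], dim=1)).mean()
    loss.backward()
    # Flatten per-parameter gradients into one 1D array matching the genome.
    return torch.cat([p.grad.flatten() for p in genome_net.parameters()])
```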

This gives us $\nabla_{\theta}Fitness$, as we have been discussing the Q-value in the context of maximizing fitness. However, the process can be trivially adapted to get $\nabla_{\theta}BD_{k}$ for maximizing each BD as well. So in total, we need one TD3 agent for the fitness and one for each of the BDs. In our case, that is 4+1=5 TD3 agents, since QDgym Ant has 4 BDs. Once we have these TD3 agents, their critic DQNs provide the losses for maximizing the fitness and each of the BDs, and for any arbitrary genome, we can backpropagate each of these losses separately to get $\nabla_{\theta}Fitness, \nabla_{\theta}BD_{1}, \nabla_{\theta}BD_{2}, \ldots$.

Quick note: In Tjanaka et al. (2022), only the fitness gradient was estimated with TD3, whereas the BDs' gradients were estimated with the noisy sampling procedure, the argument being that BDs are defined as black boxes and thus shouldn't leverage the state-action structure. I think that's a fair reason, but I really needed the data-efficiency of TD3 because I only have an average computer…

As explained in Differentiable Quality-Diversity part 1, we can take a weighted sum of these gradients with CMA-ES-learned weights to explore along a random direction within the fitness-BD space: $\theta' = \theta + |c_0|\nabla_{\theta}Fitness + \Sigma_{j=1}^{k} c_j \nabla_{\theta}BD_{j}$, where $k$ is the number of BDs. The neat part is that we can interleave TD3 training within the CMA-MEGA explorations, and therefore remove the need to run additional evaluations. The losses predicted by the TD3 critics will be inaccurate at the start, but as we run more generations, TD3 will become better trained, until eventually we have a mechanism that not only performs the task well, but also knows how to adapt to find other, different, yet still high-performing policies!
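
A sketch of one such branching step, reusing the hypothetical `objective_gradient` and `to_genome` helpers from the earlier snippets (`critics` would be the list of the 5 critic networks, fitness first, and `coeffs` the coefficients maintained by CMA-ES):

```python
# theta' = theta + |c_0| * grad(Fitness) + sum_j c_j * grad(BD_j)
def branch_genome(genome_net, critics, states, coeffs):
    grads = [objective_gradient(genome_net, crit, states) for crit in critics]
    step = abs(coeffs[0]) * grads[0]           # |c_0| * grad Fitness
    for c, g in zip(coeffs[1:], grads[1:]):    # + c_j * grad BD_j
        step = step + c * g
    return to_genome(genome_net) + step        # the new, branched genome
```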

Implementation

Code available here! This time I am not including the trained models because they exceed the 100MB git push limit. You can train the algorithm from scratch by running TD3_agent.py.

If you want to speed up the training, you can run test.py to train the fitness TD3 first, and then copy the trained TD3 (stored in TEST_QDAntBulletEnv-v0_{numiters}/tf_models as a_0, ta_0, c1_0, tc1_0, c2_0, tc2_0) into the main model, overwriting the existing files. That's what I did – I trained the fitness TD3 alone for 1000 iterations, then copied it into the main model and trained for 10000 generations.
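
In case it helps, here is a small convenience sketch of that copy step (the destination directory name is hypothetical; adjust both paths to wherever test.py and the main model actually write their files):

```python
# Copy the separately-trained fitness TD3 networks into the main model's folder.
import os
import shutil

src_dir = "TEST_QDAntBulletEnv-v0_1000/tf_models"  # output of test.py (1000 iters)
dst_dir = "QDAntBulletEnv-v0_10000/tf_models"      # hypothetical main-model folder
for name in ["a_0", "ta_0", "c1_0", "tc1_0", "c2_0", "tc2_0"]:
    src, dst = os.path.join(src_dir, name), os.path.join(dst_dir, name)
    if os.path.isdir(src):   # saved as a directory (e.g. TF SavedModel)
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:                    # saved as a single checkpoint file
        shutil.copy(src, dst)
```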

Because I trained the fitness TD3 separately, showing just one best result might not be enough to prove that CMA-MEGA(TD3) is actually working, so I'm showing multiple good results. In the end, CMA-MEGA(TD3) produced hundreds of solutions with fitness > 1500!

Good and diverse NN controllers generated with CMA-MEGA(TD3) (sorry for the bad compression)

Partial archive generated with CMA-MEGA(TD3) after 1000+10000 generations