Finding a learning rate in Deep Reinforcement Learning

M N
3 min read · Jul 10, 2020

The learning rate is one of the most important hyperparameters in Deep Learning. When training an RL agent, you want the learning process to have a visible impact on some metric of how well the agent solves the problem it is tasked with.

You should therefore give the agent the same task (same starting state, same environment) for periodic evaluation. During learning you want the score of this periodic evaluation to vary. In the case of Deep Q-Learning, variation means that the parameter updates have altered the Q-function enough for a different action to have the highest value in a given state. In other words, learning has not stagnated: the agent is changing its policy (for better or for worse).
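A periodic evaluation can be as simple as one greedy episode from a fixed seed. Here is a minimal sketch, assuming a Gymnasium-style environment API and an agent with a hypothetical best_action method:

```python
def evaluate_greedy(agent, env, seed=0, max_steps=1000):
    """Run one greedy episode from a fixed starting state and return the total reward."""
    obs, _ = env.reset(seed=seed)            # same seed -> same starting state every call
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.best_action(obs)      # hypothetical: greedy action w.r.t. the current Q-function
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

If this score never changes between evaluations, the parameter updates are not actually changing the policy.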

If you set the learning rate too high, the score of the periodic evaluation will indeed fluctuate, but the Q-function approximated by the neural network will change drastically after each optimizer step. This may prevent the network from ever converging to an optimum.

I suggest the following process for finding a learning rate that works (a sketch of the loop follows the steps):
Step 1. Start from a very low learning rate, e.g. 1e-8.
Step 2. Run a small number of training steps, e.g. 200 (each including an optimizer step).
Step 3. Check whether there were any fluctuations in the evaluation score during those training steps. If not, increase the learning rate by a factor of 10 and go back to Step 2.
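Here is a minimal sketch of that loop. The make_agent and train_agent_for_steps callables are hypothetical stand-ins for your own agent construction and training/evaluation code:

```python
def find_working_learning_rate(
    make_agent,             # hypothetical factory: learning_rate -> fresh agent
    train_agent_for_steps,  # hypothetical: (agent, n_steps) -> list of eval scores, one per step
    start_lr=1e-8,
    max_lr=1e-1,
    steps_per_trial=200,
):
    lr = start_lr
    while lr <= max_lr:
        agent = make_agent(lr)                            # fresh agent for each trial
        scores = train_agent_for_steps(agent, steps_per_trial)
        # "Fluctuation" here simply means the evaluation score changed at all.
        if len(set(scores)) > 1:
            return lr                                     # lowest lr that changes the policy
        lr *= 10                                          # otherwise increase by a factor of 10
    raise RuntimeError("No learning rate in the tested range changed the policy")
```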

You should disable batch normalization updates during the experiment. Updates to the running means and variances can change the network's predictions, causing different actions from step to step even if the learning rate were 0.
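If your network uses batch normalization and you happen to be working in PyTorch, a minimal sketch of this looks like the following (the function name is mine):

```python
import torch.nn as nn

def freeze_batchnorm_stats(model: nn.Module) -> None:
    """Put every batch-norm layer in eval mode so that its running mean and
    variance are not updated during the learning rate search."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
```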

If you do not stop the learning rate search at a reasonably low learning rate, you can no longer tell from changes in the evaluation score alone whether the policy is actually converging toward an optimum.

If you suspect that the learning rate may be too high, check the network's predictions. If you expect the Q-function values to lie in the range [-1, 1] and you see predictions like 5429, the network has probably diverged and you should restart training from scratch with a lower learning rate.
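As a concrete illustration, assuming a PyTorch Q-network that maps a batch of states to per-action values, such a divergence check might look like this (the tolerance factor is arbitrary):

```python
import torch

@torch.no_grad()
def q_values_look_sane(q_network, states, expected_max_abs=1.0, tolerance=10.0):
    """Return False if the predicted Q-values are far outside the expected range,
    which usually means the network has diverged."""
    q_values = q_network(states)                  # assumed shape: (batch, n_actions)
    return q_values.abs().max().item() <= expected_max_abs * tolerance
```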

By picking the lowest learning rate that regularly produces changes in the policy during training, you are picking the learning rate most likely to let the network converge to an optimum (at least a local one). I suggest this approach if you are not seeing any improvement in your RL agent's performance despite long training times; it may be that your learning rate is too high.

Below you can see the experiment I ran for an RL task I'm working on.
I started with a learning rate of 1e-8 and ran 160 optimization steps.
I noticed no change in the evaluation metric. I then ran 160 steps with learning rate 1e-7: still no change. I then ran 160 steps with learning rate 1e-6 and noticed some rare changes in the evaluation metric. I then ran 160 steps with learning rate 1e-5 and the evaluation metric was changing often. I checked whether the predictions were in the range I expected, and they were. I therefore concluded that this is a working learning rate for this task.

Plots from the learning rate finding experiment: four 160-step runs with learning rates 1e-8, 1e-7, 1e-6, and 1e-5. The bottom row shows the evaluation metric calculated after each optimizer step.
