Epochs variation
The number of epochs in the configuration has a strong impact on the wall time of each iteration. num_epochs determines how many times PPO re-uses the same batch of collected experiences for gradient updates before sampling a fresh batch. Essentially, num_epochs is a knob that trades off data efficiency and stability against compute time: increasing it makes training more stable, but each iteration takes more compute time.
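A minimal sketch of this control flow in Python, assuming a generic PPO-style training loop; collect_batch and ppo_gradient_step are hypothetical stubs here, not the actual trainer API:

```python
import random

def collect_batch(policy, batch_size=512):
    # Stub: in a real setup this would roll out the policy in the environment.
    return [random.random() for _ in range(batch_size)]

def ppo_gradient_step(policy, minibatch):
    # Stub: in a real setup this would compute the clipped PPO loss and step the optimizer.
    pass

def train(policy, num_iterations, num_epochs, minibatch_size=128):
    for iteration in range(num_iterations):
        batch = collect_batch(policy)            # sample a fresh batch once per iteration
        for epoch in range(num_epochs):          # re-use that same batch num_epochs times
            random.shuffle(batch)
            for start in range(0, len(batch), minibatch_size):
                ppo_gradient_step(policy, batch[start:start + minibatch_size])
        # Higher num_epochs means more gradient updates per sampled batch:
        # better data re-use, but more compute (wall time) per iteration.

# Example: train(policy=None, num_iterations=2, num_epochs=30)
```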
In Display 1 we depict the training results for the default PPO configuration (num_epochs=30), alongside the results for num_epochs=20, 10, and 5. This shows that the long-term stability of training degrades as num_epochs decreases.

Display 1: num_epochs=30, 20, 10, 5
However, Display 2, which depicts the same training runs but with wall time on the x-axis, shows that the lower num_epochs is, the faster learning occurs, especially at the beginning. On the other hand, training is less sustainable and less stable.


Display 2: num_epochs=30, 20, 10, 5 (wall time on the x-axis)
This in particular shows the trade-off between stability and speed when choosing num_epochs in PPO training. An interesting follow-up might be to increase num_epochs during training: num_epochs=5 gives fast results in the first 100 iterations, with the same reward output as the experiments with higher num_epochs, but after 100 iterations this configuration breaks down. So, starting training with num_epochs=5 and increasing it after 100 iterations might result in a significant speed-up in wall time, with roughly the same sustainability of training after 100 iterations.
TODO:
- num_epochs=5 if iteration <= 100.
- num_epochs=20 if iteration > 100.
- Truncation at max_iters=500. This might result in a full experiment wall time of 3-3.5 hours, compared to the current 9.7 hours (for num_epochs=30 and truncation at max_iters=1000), with roughly the same quality of training.
- Another experiment might be to gradually increase num_epochs from 5 after iteration > 100 in some deterministic way.
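A minimal sketch of such a num_epochs schedule, assuming a generic training loop in which num_epochs can be updated between iterations; the helper name and the linear ramp are illustrative assumptions, only the 5/20 values and the iteration-100 threshold come from the TODO items above:

```python
def num_epochs_schedule(iteration, warmup_iters=100, low=5, high=20, gradual=False):
    """Hypothetical schedule for num_epochs during PPO training."""
    if iteration <= warmup_iters:
        return low                      # fast, cheap iterations early on
    if not gradual:
        return high                     # step schedule: jump to 20 after iteration 100
    # Gradual variant: ramp linearly from `low` to `high` over the next 100 iterations.
    ramp = min(1.0, (iteration - warmup_iters) / 100)
    return int(round(low + ramp * (high - low)))

# Usage inside a (hypothetical) training loop:
# for iteration in range(max_iters):
#     config.num_epochs = num_epochs_schedule(iteration)
#     result = trainer.train()
```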