Population Based Training (PBT) with the Pred-Prey-Grass Environment

This experiment uses Population Based Training (PBT) to explore adaptive learning dynamics in a multi-agent Predator-Prey-Grass simulation. The goal is not to converge to a single optimal policy, but to allow diverse agent types to co-evolve and adapt their hyperparameters over time in response to a dynamic environment and each other.

🌱 Environment: PredPreyGrass

The environment simulates interactions between:

  • Predators of different movement speeds
  • Prey of different movement speeds
  • Grass as a renewable resource

Each agent group has its own policy and configuration, allowing asymmetric capabilities and behaviors; a minimal wiring sketch follows the list below. The environment is designed to support evolutionary dynamics such as:

  • Arms races (e.g., faster prey vs. faster predators)
  • Emergent survival strategies
  • Shifting population balances
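For illustration, here is a minimal sketch of how per-group policies could be wired up in RLlib's new API stack. The group names, the agent-ID naming scheme, and the registered environment name "PredPreyGrass" are assumptions made for the sketch, not the project's actual identifiers:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Hypothetical per-group policy IDs; the actual project may name them differently.
GROUPS = {"predator_fast", "predator_slow", "prey_fast", "prey_slow"}

def policy_mapping_fn(agent_id, episode, **kwargs):
    # Assumes agent IDs of the form "<group>_<index>", e.g. "prey_fast_3".
    return agent_id.rsplit("_", 1)[0]

config = (
    PPOConfig()
    .environment("PredPreyGrass")  # assumes the env was registered under this name
    .multi_agent(
        policies=GROUPS,                      # one policy per agent group
        policy_mapping_fn=policy_mapping_fn,  # route each agent to its group's policy
    )
)
```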

🧪 Why Population Based Training?

Traditional RL trains a single agent or population with hyperparameters that stay fixed throughout training. PBT offers a more dynamic and biologically inspired alternative:

  • Exploration: Each trial (i.e., agent population) begins with slightly different hyperparameters.
  • Selection: Periodically, poorly performing trials are replaced by clones of better-performing ones.
  • Mutation: Cloned trials mutate their hyperparameters to explore new niches.

This encourages continual adaptation, mimicking natural selection and evolution.
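
As a concrete sketch, these three mechanisms map onto Ray Tune's PopulationBasedTraining scheduler roughly as follows. The metric name and perturbation interval mirror the setup described below; the mutation ranges, quantile fraction, and resample probability are illustrative assumptions:

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="env_runners/episode_return_mean",  # selection criterion
    mode="max",
    perturbation_interval=3,   # selection: compare trials every 3 iterations
    quantile_fraction=0.25,    # bottom 25% of trials clone from the top 25%
    resample_probability=0.25, # mutation: chance to resample a fresh value
    hyperparam_mutations={     # exploration/mutation space (subset; full list below)
        "lr": tune.loguniform(1e-5, 1e-3),
        "entropy_coeff": tune.uniform(0.0, 0.02),
    },
)
```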

⚙️ Technical Setup

  • Framework: Ray Tune + RLlib (new API stack)
  • Algorithm: Proximal Policy Optimization (PPO)
  • Resource Scaling: Automatically adapts to 8 CPU or 32 CPU / 1 GPU setups
  • Parallel Trials: 6 PBT trials run concurrently on available hardware
  • Hyperparameters mutated (declared as the mutation space sketched after this list):
    • Learning rate
    • Entropy coefficient
    • Minibatch size
    • Epoch count
    • Train batch size per learner
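
A sketch of how these five parameters might be declared as a PBT mutation space, using the new API stack's PPO config key names; the value ranges below are illustrative assumptions, not the experiment's actual settings:

```python
from ray import tune

# Illustrative mutation spaces; the real bounds are a design choice of the experiment.
hyperparam_mutations = {
    "lr": tune.loguniform(1e-5, 1e-3),              # learning rate
    "entropy_coeff": tune.uniform(0.0, 0.02),       # entropy coefficient
    "minibatch_size": [64, 128, 256],               # minibatch size
    "num_epochs": [10, 20, 30],                     # epoch count
    "train_batch_size_per_learner": [512, 1024, 2048],
}
```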

📈 Metrics and Checkpoints

Each trial is evaluated on the env_runners/episode_return_mean metric:

  • Cloning is triggered every perturbation_interval = 3 training iterations
  • Frequent checkpoints are saved so that cloned trials can restore recent weights (see the wiring sketch below)
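
Putting the pieces together, a minimal sketch of the Tuner wiring, assuming the config and pbt objects from the earlier sketches:

```python
from ray import train, tune

tuner = tune.Tuner(
    "PPO",
    param_space=config,              # PPOConfig from the environment sketch above
    tune_config=tune.TuneConfig(
        scheduler=pbt,               # PopulationBasedTraining scheduler from above
        num_samples=6,               # 6 concurrent PBT trials
    ),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            checkpoint_frequency=1,  # checkpoint every iteration so clones
            checkpoint_at_end=True,  # always have recent weights to restore
        ),
    ),
)
results = tuner.fit()
```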

🧭 Experiment Goals

This experiment is part of a broader investigation into:

  • Open-ended evolution of agent behaviors
  • Emergent adaptation in multi-agent ecosystems
  • Identifying conditions under which diversity persists rather than collapsing into a single dominant strategy

Ultimately, this setup provides a testbed for studying co-evolutionary learning and the spontaneous emergence of complex dynamics under competitive and cooperative pressures.

