PPO

The Proximal Policy Optimization algorithm combines ideas from A2C (using multiple workers) and TRPO (using a trust region to improve the actor).

The main idea is that, after an update, the new policy should not be too far from the old policy. To that end, PPO uses clipping to avoid overly large updates.
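The clipping idea can be illustrated with a minimal, per-sample sketch of the clipped surrogate objective. This is not the library's implementation: the function name `clipped_surrogate` is hypothetical, and the default `clip_range=0.2` is an assumption chosen for illustration.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """Per-sample clipped surrogate objective (a quantity to maximize).

    Hypothetical helper for illustration only.
    """
    # Probability ratio between the new and the old policy for the taken action
    ratio = math.exp(log_prob_new - log_prob_old)
    # Keep the ratio inside [1 - clip_range, 1 + clip_range]
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Pessimistic bound: take the smaller of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes a good action 1.5x more likely:
# the objective is capped at (1 + clip_range) * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
```

Because the objective stops increasing once the ratio leaves the clipping interval in the favorable direction, there is no incentive to move the new policy further away from the old one on that sample.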

Note

PPO contains several modifications from the original algorithm not documented by OpenAI: advantages are normalized and the value function can also be clipped.
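A minimal sketch of what these two modifications might look like in isolation. The helper names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; the library's own code may differ in these details.

```python
from statistics import fmean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to roughly zero mean and unit variance.

    Hypothetical helper; eps guards against division by zero.
    """
    mean = fmean(advantages)
    std = pstdev(advantages)  # population std, an assumption of this sketch
    return [(a - mean) / (std + eps) for a in advantages]


def clip_value_prediction(value_new, value_old, clip_range_vf):
    """Limit how far the new value estimate may move from the old one."""
    delta = value_new - value_old
    delta = max(-clip_range_vf, min(delta, clip_range_vf))
    return value_old + delta


normalized = normalize_advantages([1.0, 2.0, 3.0])
clipped_value = clip_value_prediction(value_new=2.0, value_old=1.0, clip_range_vf=0.5)
```

Note that value clipping depends on the scale of the rewards, so a clipping range that works for one environment may be inappropriate for another.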

Notes

Can I use?

Note

A recurrent version of PPO is available in ##

  • Recurrent policies: ❌

  • Multi processing: ✔️

  • Gym spaces:

Space          Action  Observation
Discrete       ✔️      ✔️
Box            ✔️      ✔️
MultiDiscrete  ✔️      ✔️

Example

This example only demonstrates the use of the library and its functions; the trained agents may not solve the environments. Optimized hyperparameters can be found in the RL Zoo repository.

Train a PPO agent on CartPole-v1 using 4 environments.

import gymnasium as gym

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Parallel environments
vec_env = make_vec_env("CartPole-v1", n_envs=4)

model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=25000)
model.save("ppo_cartpole")

del model # remove to demonstrate saving and loading

model = PPO.load("ppo_cartpole")

obs = vec_env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")
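As a rough back-of-the-envelope sketch of the training run above: with several parallel environments, each rollout collects `n_envs * n_steps` transitions before an update. The function name `rollout_budget` is hypothetical, the values `n_steps=2048`, `batch_size=64` and `n_epochs=10` are assumed defaults that should be checked against the installed version, and the sketch assumes training only stops at a rollout boundary.

```python
import math


def rollout_budget(total_timesteps, n_envs, n_steps=2048, batch_size=64, n_epochs=10):
    """Estimate rollout and update counts for an on-policy run.

    Hypothetical helper; the default values are assumptions.
    """
    # Transitions gathered across all environments before each update
    rollout_size = n_envs * n_steps
    # Whole rollouts needed to reach at least total_timesteps
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    # Minibatches per pass over the rollout buffer
    minibatches_per_epoch = rollout_size // batch_size
    return {
        "rollout_size": rollout_size,
        "n_rollouts": n_rollouts,
        "timesteps_collected": n_rollouts * rollout_size,
        "gradient_steps": n_rollouts * n_epochs * minibatches_per_epoch,
    }


budget = rollout_budget(total_timesteps=25000, n_envs=4)
```

Under these assumptions, the requested 25000 timesteps round up to 4 rollouts of 8192 transitions each, so slightly more experience is collected than requested.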

Results

PyBullet Environments

"Gaussian" means that unstructured Gaussian noise is used for exploration; otherwise, gSDE (generalized State-Dependent Exploration) is used.

Environments  A2C Gaussian   A2C gSDE       PPO Gaussian   PPO gSDE
HalfCheetah   2003 +/- 54    2032 +/- 122   1976 +/- 479   2826 +/- 45
Ant           2286 +/- 72    2443 +/- 89    2364 +/- 120   2782 +/- 76
Hopper        1627 +/- 158   1561 +/- 220   1567 +/- 339   2512 +/- 21
Walker2D      577 +/- 65     839 +/- 56     1230 +/- 147   2019 +/- 64

Parameters