Training a model

The environments produced by this framework are standard gymnasium environments that expose the usual reset / step / action_space / observation_space surface. Any RL library that consumes those should work. Verification status:

Tested: Stable Baselines 3 via sb3_ros_support (the primary path used in the paper experiments and in rl_training_validation).
Likely works: plain Stable Baselines 3 without the support wrapper; hand-written training loops.
Unverified but expected to work: CleanRL, Tianshou, RLlib, Tensorforce. They each inspect a few env attributes (spec, metadata, unwrapped, vector-env assumptions) that the proxy forwards correctly via __getattr__, but the full matrix hasn’t been exercised end-to-end.

uniros.make() returns a proxy that behaves like gym.Env but runs the underlying env in a worker process. That’s the only framework-specific detail; everything downstream is gymnasium- shaped.

This page shows three increasingly involved options:

Option 1 — Plain Stable Baselines 3 — pure Stable Baselines 3 with no ROS-specific wrappers.
Option 2 — sb3_ros_support — the convenience layer that ships with this ecosystem (YAML config, ROS-aware paths, uniform train / save / load surface, HER ready for goal envs).
Option 3 — Any other gymnasium-compatible framework — pointers for using CleanRL, Tianshou, RLlib, or a custom loop.

For the dedicated joint-sim-and-real training pattern (Use Case C from the paper), see Joint sim + real training.

Option 1 — Plain Stable Baselines 3

The simplest possible training script. No YAML, no extra wrappers, just SB3 against a uniros-managed env.

#!/bin/python3
import rospy
from multiros.utils import gazebo_core
import uniros as gym
import rl_environments  # registers gym IDs

from stable_baselines3 import SAC


if __name__ == "__main__":
    gazebo_core.launch_gazebo(launch_roscore=True, gui=False)
    rospy.init_node("rx200_reach_train_plain_sb3")

    env = gym.make("RX200ReacherSim-v0")

    model = SAC(
        "MlpPolicy",
        env,
        learning_rate=3e-4,
        buffer_size=1_000_000,
        batch_size=256,
        tensorboard_log="./tb_logs/",
        verbose=1,
    )
    model.learn(total_timesteps=100_000)
    model.save("rx200_reach_sac")

    env.close()

This works because uniros.make returns an object that responds to reset / step / close exactly like gym.Env. SB3 doesn’t know or care about ROS.

Option 2 — sb3_ros_support

If your script already lives in a ROS package and you want config-driven training (so swapping PPO for SAC for TD3 is a YAML edit, not a code rewrite), sb3_ros_support API reference adds:

A BasicModel base class and one subclass per algorithm — PPO, A2C, DDPG, TD3, SAC, DQN, plus their goal-conditioned *_GOAL variants for HER.
YAML-driven hyperparameter loading via ros_load_yaml.
Convenient train / save_model / load_trained_model / predict surface that wraps the underlying SB3 model.

#!/bin/python3
import rospy
from multiros.utils import gazebo_core
import uniros as gym
import rl_environments

from sb3_ros_support.td3 import TD3


if __name__ == "__main__":
    gazebo_core.launch_gazebo(launch_roscore=True, gui=False)
    rospy.init_node("rx200_reach_train_sim")

    env = gym.make("RX200ReacherSim-v0")
    env.reset()

    # YAML config lives inside the rl_training_validation package's
    # ``config/`` directory. Replace the filename for SAC, PPO, etc.
    pkg_path = "rl_training_validation"
    model = TD3(
        env,
        save_model_path="/models/td3/",
        log_path="/logs/td3/",
        model_pkg_path=pkg_path,
        config_file_pkg=pkg_path,
        config_filename="rx200_reacher_td3.yaml",
    )

    model.train()
    model.save_model()

    env.close()

Working examples live under rl_training_validation/src/rl_training_validation/rx200/reach/:

rx200_reach_train_sim.py / rx200_reach_validate_sim.py
rx200_reach_train_real.py / rx200_reach_validate_real.py

See sb3_ros_support API reference for the full algorithm list, and rl_training_validation API reference for the working scripts.

Option 3 — Any other gymnasium-compatible framework

CleanRL, Tianshou, RLlib, Tensorforce, and hand-written training loops should all be adaptable because they consume the Gymnasium API and that’s what uniros.make produces. SB3 via sb3_ros_support is the tested path; the snippets below are integration sketches — they haven’t been exercised end-to-end on this codebase.

CleanRL

CleanRL training scripts are single-file. Replace the line that creates env with uniros.make:

import uniros as gym
import rl_environments

def make_env(env_id):
    def thunk():
        env = gym.make(env_id)
        return env
    return thunk

# ... rest of CleanRL ppo_continuous_action.py / sac_continuous_action.py
# uses `make_env` as is.

Tianshou

import uniros as gym
import rl_environments
from tianshou.env import DummyVectorEnv
from tianshou.policy import SACPolicy

env = DummyVectorEnv([lambda: gym.make("RX200ReacherSim-v0")
                     for _ in range(4)])
# ... continue with the standard Tianshou trainer.

RLlib

RLlib expects a registered env. Register a thin wrapper:

from ray.tune.registry import register_env
import uniros as uniros_gym
import rl_environments

def _make(config):
    return uniros_gym.make(config["env_id"])

register_env("rx200_reacher", _make)

# algo = ppo.PPO(config={"env": "rx200_reacher",
#                        "env_config": {"env_id": "RX200ReacherSim-v0"},
#                        ...})

Hand-written training loop

import uniros as gym
import rl_environments

env = gym.make("RX200ReacherSim-v0")
obs, _ = env.reset(seed=0)
for step in range(100_000):
    action = your_policy(obs)
    obs, reward, term, trunc, info = env.step(action)
    your_learner.observe(obs, action, reward, term)
    if term or trunc:
        obs, _ = env.reset()

The only point where the framework’s identity matters is the call to uniros.make (which runs the env in a worker process). Once you have the proxy in hand, treat it as a normal gymnasium env.

Configuration via YAML (sb3_ros_support)

When using sb3_ros_support, hyperparameters live in a YAML file under any ROS package you control. The working examples ship under rl_training_validation/config/:

rx200_reacher_sac.yaml / rx200_reacher_sac_goal.yaml
rx200_reacher_td3.yaml / rx200_reacher_td3_goal.yaml
rx200_push_td3.yaml / rx200_push_td3_goal.yaml
multi_task_td3.yaml / multi_task_td3_goal.yaml

A typical file:

total_timesteps: 100000
learning_starts: 1000

policy: "MlpPolicy"
policy_kwargs:
  net_arch: [256, 256]
learning_rate: 0.0003

buffer_size: 1000000
batch_size: 256
gamma: 0.99
tau: 0.005

action_noise:
  type: "normal"
  mean: 0.0
  stddev: 0.1

# HER block (only for *_GOAL algorithms)
her:
  n_sampled_goal: 4
  goal_selection_strategy: "future"

Pass the filename to the algorithm wrapper’s config_filename; all the keys above are read at train() time.

Logging and checkpoints

Whichever option you use, TensorBoard is the standard reader:

tensorboard --logdir /path/to/logs/

Saved models from sb3_ros_support are SB3 .zip files that can be loaded back via sb3_ros_support.core.BasicModel.load_trained_model() or SB3’s own Algorithm.load(...).