rl_training_validation API reference

Training and validation scripts for the pre-built environments in rl_environments. Each script wires a gym.make() env to an sb3_ros_support algorithm wrapper and a YAML config file.

For the user-facing training guide see Training a model.

Note

This page covers the reach training/validation scripts across all four supported robots plus the multi-task learning script. The push and pick-and-place scripts share docstring patterns with their underlying task envs and are intentionally omitted while those docstrings are tidied — the scripts themselves are shipped and runnable from the source tree. See Training a model for the user-facing inventory.

Top-level package

Utilities

env_safety

Real-robot motion consent gate (--allow-real-robot-motion), env-ID parsing helpers, and the registry-introspection helpers used by the scripts/ smoke tools.

Environment availability + real-robot safety helpers for the training and validation scripts in this repository.

Pure registry-based: an env is “available” iff it’s registered in the Gymnasium registry (after import rl_environments). No separate implementation-status table — the registry IS the source of truth.

Goal-env detection is by id suffix (...Goal*Sim-v0/v1 or ...Goal*Real-v0/v1); real-env detection is by id suffix (...Real-v0/v1).

Nothing in this module touches Gazebo or any ROS topic.

rl_training_validation.utils.env_safety.is_registered(env_id)[source]
Parameters:

env_id (str)

Return type:

bool

rl_training_validation.utils.env_safety.is_implemented(env_id)
Parameters:

env_id (str)

Return type:

bool

rl_training_validation.utils.env_safety.is_real(env_id)[source]
Parameters:

env_id (str)

Return type:

bool

rl_training_validation.utils.env_safety.is_goal_env(env_id)[source]
Parameters:

env_id (str)

Return type:

bool

rl_training_validation.utils.env_safety.list_implemented()[source]

List the registered task-env ids (one per documented task).

Filters out the abstract robot-base registrations (e.g. RX200RobotEnv-v0, NED2RobotGoalBaseSimEnv-v0, VX300SRobotGoalBaseRealEnv-v0). Those are class-only entries used to share an env class across multiple task envs — they are not gym.make-able tasks and should not appear in user-facing “available envs” listings.

Return type:

List[str]

rl_training_validation.utils.env_safety.list_unimplemented()[source]
Return type:

List[str]

Check the env-var consent flag; rosparam check would require rospy.

Return type:

bool

rl_training_validation.utils.env_safety.add_real_motion_cli(parser)[source]

Add a --allow-real-robot-motion flag to parser.

Parameters:

parser (ArgumentParser)

Return type:

None

Single-channel consent gate for real-robot env construction.

Raises SystemExit for any real env id unless --allow-real-robot-motion was passed. When the flag IS passed, also exports ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent from a single source.

The env var is a propagation of the CLI flag, not a second independent channel — you don’t need to set it manually, and unsetting it doesn’t lock motion out once the flag is already passed. If you want a kill-switch that survives the CLI flag, enforce it inside the real RobotEnv itself (e.g. a rosparam check inside _set_init_pose / the publish path).

Parameters:
  • env_id (str)

  • allow_real_flag (bool)

Return type:

None

rl_training_validation.utils.env_safety.add_cube_tracker_cli(parser)[source]

Add –cube-tracker / –cube-tracker-camera / –cube-tracker-target-frame.

Default behaviour is unchanged: vision pipeline is external. Pass --cube-tracker auto to have the env roslaunch rl_envs_cube_tracker/<camera>.launch automatically.

Parameters:

parser (ArgumentParser)

Return type:

None

rl_training_validation.utils.env_safety.apply_cube_tracker_kwargs(env_kwargs, args)[source]

Merge cube-tracker CLI args into env_kwargs. Returns it for chaining.

Parameters:
Return type:

dict

rl_training_validation.utils.env_safety.add_wrist_camera_cli(parser)[source]

Add --wrist-camera flag (default off).

NED2 envs accept use_wrist_camera: bool as a kwarg. Default off so there’s no extra Gazebo / vision-node load when the user doesn’t need it.

Parameters:

parser (ArgumentParser)

Return type:

None

rl_training_validation.utils.env_safety.apply_wrist_camera_kwargs(env_kwargs, args)[source]

Merge --wrist-camera into env_kwargs. Returns it for chaining.

Parameters:
Return type:

dict

rl_training_validation.utils.env_safety.add_goal_pose_cli(parser)[source]

Add --goal-pose-topic for push real validation.

Empty by default → the env keeps using random / hard-coded goals. When set (e.g. /goal_pose), the env subscribes; on each reset the latest pose (if fresh) overrides the random / hard-coded goal. Use with --cube-tracker auto to also auto-launch the second AprilTag adapter for the goal tag (id 1 by default).

Parameters:

parser (ArgumentParser)

Return type:

None

rl_training_validation.utils.env_safety.apply_goal_pose_kwargs(env_kwargs, args)[source]

Merge --goal-pose-topic into env_kwargs. Returns it for chaining.

Parameters:
Return type:

dict

rl_training_validation.utils.env_safety.with_seed_suffix(path, seed)[source]

Append seed_<N>/ to a save / log path.

Keeps runs trained with different seeds in separate directories so a later run does not silently overwrite (or load from) a previous seed’s checkpoints. The trailing slash is preserved when present in the input path.

Parameters:
Return type:

str

rl_training_validation.utils.env_safety.check_env_constructable(env_id, allow_real_flag=False)[source]

Raise SystemExit if env_id is unregistered or a real env without explicit consent.

Parameters:
  • env_id (str)

  • allow_real_flag (bool)

Return type:

None

rl_training_validation.utils.env_safety.filter_to_implemented(env_ids)[source]

Split a list of env ids into (registered, missing).

Parameters:

env_ids (Iterable[str])

Return type:

Tuple[List[str], List[str]]

rl_training_validation.utils.env_safety.assert_goal_env(env_id)[source]

Raise SystemExit if env_id is not a goal-conditioned env.

Parameters:

env_id (str)

Return type:

None

rl_training_validation.utils.env_safety.assert_non_goal_env(env_id)[source]

Raise SystemExit if env_id is a goal-conditioned env.

Parameters:

env_id (str)

Return type:

None

rl_training_validation.utils.env_safety.parse_env_id(env_id)[source]

Return (robot, mode, task, is_goal) or None.

Parses bare ids like RX200PushGoalSim-v0 / NED2ReacherReal-v0. Also accepts the Zed2-flavoured RX200 ids (e.g. RX200Zed2PnPGoalSim-v0); the implicit (no-prefix) kinect variants fall through the empty-prefix branch so callers don’t need to special-case the default sensor. Verified to round-trip every currently-registered task id without returning None.

Parameters:

env_id (str)

MultiTaskEnv

Wrapper for training a single policy across several task envs in one process. Used by the multi-task training scripts below.

class rl_training_validation.utils.multi_task_env.MultiTaskEnv(env_list, env_args_list=None, wrapper_list=None, wrapper_args_dict=None)[source]

Bases: Env

A wrapper that trains multiple UniROS-based Gym environments in one agent. Not MPI-parallel; it simply samples a different task each reset.

Parameters:
  • env_list (List[str]) – list of registered gym env names (e.g. UniROS names)

  • env_args_list (List[dict] | None) – list of dicts, one per env in env_list, passed to make()

  • wrapper_list (List[str] | None) – list of wrapper class names (strings) to apply to every env

  • wrapper_args_dict (dict | None) – mapping from wrapper name → kwargs dict

step(action)[source]

Step the currently-active sub-env with action trimmed to its action dimension, then pad the returned observation to the unified max_obs_dim and stamp info["task_id"].

reset(*, seed=None, options=None, **kwargs)[source]

Resets the environment to an initial internal state, returning an initial observation and info.

This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter otherwise if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.

Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.

For Custom environments, the first line of reset() should be super().reset(seed=seed) which implements the seeding correctly.

Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.

Parameters:
  • seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.

  • options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)

Returns:

Observation of the initial state. This will be an element of observation_space

(typically a numpy array) and is analogous to the observation returned by step().

info (dictionary): This dictionary contains auxiliary information complementing observation. It should be analogous to

the info returned by step().

Return type:

observation (ObsType)

close()[source]

After the user has finished using the environment, close contains the code necessary to “clean up” the environment.

This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.

MultiTaskGoalEnv

Goal-conditioned variant of MultiTaskEnv, routes HER reward recomputation per sub-env via info["task_id"].

class rl_training_validation.utils.multi_task_goal_env.MultiTaskGoalEnv(env_list, env_args_list=None, wrapper_list=None, wrapper_args_dict=None)[source]

Bases: GoalEnv

A wrapper that trains multiple UniROS-based GoalEnv gymnasium environments in one agent. Not MPI-parallel; it simply samples a different task each reset.

Parameters:
  • env_list (List[str]) – list of registered gym env names (must be GoalEnv)

  • env_args_list (List[dict] | None) – list of dicts, one per env in env_list, passed to make()

  • wrapper_list (List[str] | None) – list of wrapper class names (strings) to apply to every env

  • wrapper_args_dict (dict | None) – mapping from wrapper name → kwargs dict

step(action)[source]

Step the currently-active sub-env with action trimmed to its action dimension, then pad each piece of the returned observation dict (observation / achieved_goal / desired_goal) to the unified maximum dims and stamp info["task_id"].

reset(*, seed=None, options=None, **kwargs)[source]

Reset the environment.

In addition, check if the observation space is correct by inspecting the observation, achieved_goal, and desired_goal keys.

close()[source]

After the user has finished using the environment, close contains the code necessary to “clean up” the environment.

This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.

compute_reward(achieved_goal, desired_goal, info)[source]

Called by HER to recompute rewards when relabelling experience.

Two call shapes have to be supported:

  • Live step / single transition: info is a dict; ag / dg are 1-D arrays. Route by info["task_id"] (stamped at reset() and step() time) so a relabel for sub-env i is computed by sub-env i, not whichever sub-env happens to be current_env now.

  • HER batched relabel: info is a Python list / ndarray of per-transition info dicts (SB3 HER copy_info_dict=False passes a list). Group transitions by info["task_id"] and let each sub-env compute its slice, then scatter the per-slice rewards back into a single float32 array shaped like ag.shape[:-1]. Without this, the previous isinstance(info, dict) check failed on the list path and silently routed every transition through self.current_env.

Falls back to self.current_env when task_id is missing from info (e.g. a single-task unit test that constructs info by hand). The fallback may misroute under HER if the active sub-env has changed since the transition was collected — preserve task_id through the replay buffer to avoid it.

Parameters:

RX200 — Reach

Sim training / validation

Train an SB3 policy on the RX200 sim Reach task.

Standard env id: RX200ReacherSim-v0 Goal env id: RX200ReacherGoalSim-v0

Requires Gazebo + roscore to already be running. The env class launches the appropriate MoveIt stack itself.

rl_training_validation.rx200.reach.rx200_reach_train_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.rx200.reach.rx200_reach_train_sim.main()[source]
Return type:

int

Validate a trained policy against the RX200 sim Reach task.

Mirrors the train script for env construction; loads a saved model and rolls it out for --episodes episodes, logging success rate, truncations, and sensor timeouts from info.

rl_training_validation.rx200.reach.rx200_reach_validate_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.rx200.reach.rx200_reach_validate_sim.main()[source]
Return type:

int

Real training / validation

Train an SB3 policy on the RX200 real Reach task.

This is the explicitly-opt-in real-robot trainer. Real motion is gated by check_env_constructable: it refuses to construct any ...Real env unless --allow-real-robot-motion is passed on the command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent from a single source — that env var is a propagation of the same gate, not an independent channel.

You MUST also have:
  • the actual RX200 connected and powered up,

  • the interbotix MoveIt / driver dependencies installed,

  • (optional) rosparam set /allow_real_robot_motion true if your launch chain prefers to query the parameter server.

Default behaviour without --allow-real-robot-motion is a clear SystemExit with no motion.

rl_training_validation.rx200.reach.rx200_reach_train_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.rx200.reach.rx200_reach_train_real.main()[source]
Return type:

int

Validate a trained policy against the RX200 real Reach task.

Same single-channel CLI gate as rx200_reach_train_real.

rl_training_validation.rx200.reach.rx200_reach_validate_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.rx200.reach.rx200_reach_validate_real.main()[source]
Return type:

int

Niryo Ned2 — Reach

Sim training / validation

Train an SB3 policy on the Ned2 sim Reach task.

Standard env id: NED2ReacherSim-v0 Goal env id: NED2ReacherGoalSim-v0

Requires Gazebo + roscore to already be running. The env class launches the appropriate MoveIt stack itself.

rl_training_validation.ned2.reach.ned2_reach_train_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.ned2.reach.ned2_reach_train_sim.main()[source]
Return type:

int

Validate a trained policy against the Ned2 sim Reach task.

Mirrors the train script for env construction; loads a saved model and rolls it out for --episodes episodes, logging success rate, truncations, and sensor timeouts from info.

rl_training_validation.ned2.reach.ned2_reach_validate_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.ned2.reach.ned2_reach_validate_sim.main()[source]
Return type:

int

Real training / validation

Train an SB3 policy on the Ned2 real Reach task.

Real motion is gated by check_env_constructable: it refuses to construct any ...Real env unless --allow-real-robot-motion is passed on the command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent from a single source — that env var is a propagation of the same gate, not an independent channel.

Reach does not need a cube tracker. --wrist-camera is optional (off by default); when set, the env subscribes to the niryo wrist camera and exposes the decoded frame as self.cv_image_wrist.

rl_training_validation.ned2.reach.ned2_reach_train_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.ned2.reach.ned2_reach_train_real.main()[source]
Return type:

int

Validate a trained policy against the Ned2 real Reach task.

Same single-channel CLI gate as ned2_reach_train_real.

rl_training_validation.ned2.reach.ned2_reach_validate_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.ned2.reach.ned2_reach_validate_real.main()[source]
Return type:

int

Trossen VX300S — Reach

Sim training / validation

Train an SB3 policy on the VX300S sim Reach task.

Standard env id: VX300SReacherSim-v0 Goal env id: VX300SReacherGoalSim-v0

The env launches its own Gazebo/roscore by default and starts the appropriate MoveIt stack itself.

rl_training_validation.vx300s.reach.vx300s_reach_train_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.vx300s.reach.vx300s_reach_train_sim.main()[source]
Return type:

int

Validate a trained policy against the VX300S sim Reach task.

Mirrors the train script for env construction; loads a saved model and rolls it out for --episodes episodes, logging success rate, truncations, and sensor timeouts from info.

rl_training_validation.vx300s.reach.vx300s_reach_validate_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.vx300s.reach.vx300s_reach_validate_sim.main()[source]
Return type:

int

Real training / validation

Train an SB3 policy on the VX300S real Reach task.

This is the explicitly-opt-in real-robot trainer. Real motion is gated by check_env_constructable: it refuses to construct any ...Real env unless --allow-real-robot-motion is passed on the command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent from a single source — that env var is a propagation of the same gate, not an independent channel.

You MUST also have:
  • the actual VX300S connected and powered up,

  • the interbotix MoveIt / driver dependencies installed,

  • (optional) rosparam set /allow_real_robot_motion true if your launch chain prefers to query the parameter server.

Default behaviour without --allow-real-robot-motion is a clear SystemExit with no motion.

rl_training_validation.vx300s.reach.vx300s_reach_train_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.vx300s.reach.vx300s_reach_train_real.main()[source]
Return type:

int

Validate a trained policy against the VX300S real Reach task.

Same single-channel CLI gate as vx300s_reach_train_real.

rl_training_validation.vx300s.reach.vx300s_reach_validate_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.vx300s.reach.vx300s_reach_validate_real.main()[source]
Return type:

int

Universal Robots UR5e — Reach

Sim training / validation

Train an SB3 policy on the UR5e sim Reach task.

Standard env id: UR5eReacherSim-v0 Goal env id: UR5eReacherGoalSim-v0

The env launches its own Gazebo/roscore by default and starts the appropriate MoveIt stack itself.

rl_training_validation.ur5e.reach.ur5e_reach_train_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.ur5e.reach.ur5e_reach_train_sim.main()[source]
Return type:

int

Validate a trained policy against the UR5e sim Reach task.

Mirrors the train script for env construction; loads a saved model and rolls it out for --episodes episodes, logging success rate, truncations, and sensor timeouts from info.

rl_training_validation.ur5e.reach.ur5e_reach_validate_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.ur5e.reach.ur5e_reach_validate_sim.main()[source]
Return type:

int

Real training / validation

Train an SB3 policy on the UR5e real Reach task.

This is the explicitly-opt-in real-robot trainer. Real motion is gated by check_env_constructable: it refuses to construct any ...Real env unless --allow-real-robot-motion is passed on the command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent from a single source — that env var is a propagation of the same gate, not an independent channel.

You MUST also have:
  • the actual UR5e connected and powered up,

  • the interbotix MoveIt / driver dependencies installed,

  • (optional) rosparam set /allow_real_robot_motion true if your launch chain prefers to query the parameter server.

Default behaviour without --allow-real-robot-motion is a clear SystemExit with no motion.

rl_training_validation.ur5e.reach.ur5e_reach_train_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.ur5e.reach.ur5e_reach_train_real.main()[source]
Return type:

int

Validate a trained policy against the UR5e real Reach task.

Same single-channel CLI gate as ur5e_reach_train_real.

rl_training_validation.ur5e.reach.ur5e_reach_validate_real.parse_args()[source]
Return type:

Namespace

rl_training_validation.ur5e.reach.ur5e_reach_validate_real.main()[source]
Return type:

int

Multi-task learning

Multi-task TD3 training across several UniROS sim envs simultaneously.

By default this uses the two implemented RX200 sim envs that share similar dynamics: Reach and Push. The user can switch in any other implemented env via the CLI; the script refuses to construct an unimplemented env id.

This script is sim-only. Real envs cannot be added to the multi-task mix from here — they require explicit per-env consent and are not covered by the multi-task wrapper’s resampling logic.

rl_training_validation.multi_task_learning.multi_train_sim.parse_args()[source]
Return type:

Namespace

rl_training_validation.multi_task_learning.multi_train_sim.main()[source]
Return type:

int