rl_training_validation API reference
Training and validation scripts for the pre-built environments in
rl_environments. Each script wires a gym.make() env to an
sb3_ros_support algorithm wrapper and a YAML config file.
For the user-facing training guide see Training a model.
Note
This page covers the reach training/validation scripts across all four supported robots plus the multi-task learning script. The push and pick-and-place scripts share docstring patterns with their underlying task envs and are intentionally omitted while those docstrings are tidied — the scripts themselves are shipped and runnable from the source tree. See Training a model for the user-facing inventory.
Top-level package
Utilities
env_safety
Real-robot motion consent gate
(--allow-real-robot-motion), env-ID parsing helpers, and the
registry-introspection helpers used by the scripts/ smoke tools.
Environment availability + real-robot safety helpers for the training and validation scripts in this repository.
Pure registry-based: an env is “available” iff it’s registered in the
Gymnasium registry (after import rl_environments). No separate
implementation-status table — the registry IS the source of truth.
Goal-env detection is by id suffix (...Goal*Sim-v0/v1 or
...Goal*Real-v0/v1); real-env detection is by id suffix
(...Real-v0/v1).
Nothing in this module touches Gazebo or any ROS topic.
- rl_training_validation.utils.env_safety.is_implemented(env_id)
- rl_training_validation.utils.env_safety.list_implemented()[source]
List the registered task-env ids (one per documented task).
Filters out the abstract robot-base registrations (e.g.
RX200RobotEnv-v0,NED2RobotGoalBaseSimEnv-v0,VX300SRobotGoalBaseRealEnv-v0). Those are class-only entries used to share an env class across multiple task envs — they are notgym.make-able tasks and should not appear in user-facing “available envs” listings.
- rl_training_validation.utils.env_safety.real_motion_consent_present()[source]
Check the env-var consent flag; rosparam check would require rospy.
- Return type:
- rl_training_validation.utils.env_safety.add_real_motion_cli(parser)[source]
Add a
--allow-real-robot-motionflag toparser.- Parameters:
parser (ArgumentParser)
- Return type:
None
- rl_training_validation.utils.env_safety.enforce_real_motion_consent(env_id, allow_real_flag)[source]
Single-channel consent gate for real-robot env construction.
Raises
SystemExitfor any real env id unless--allow-real-robot-motionwas passed. When the flag IS passed, also exportsALLOW_REAL_ROBOT_MOTION=1so downstream code can read consent from a single source.The env var is a propagation of the CLI flag, not a second independent channel — you don’t need to set it manually, and unsetting it doesn’t lock motion out once the flag is already passed. If you want a kill-switch that survives the CLI flag, enforce it inside the real RobotEnv itself (e.g. a rosparam check inside
_set_init_pose/ the publish path).
- rl_training_validation.utils.env_safety.add_cube_tracker_cli(parser)[source]
Add –cube-tracker / –cube-tracker-camera / –cube-tracker-target-frame.
Default behaviour is unchanged: vision pipeline is external. Pass
--cube-tracker autoto have the env roslaunchrl_envs_cube_tracker/<camera>.launchautomatically.- Parameters:
parser (ArgumentParser)
- Return type:
None
- rl_training_validation.utils.env_safety.apply_cube_tracker_kwargs(env_kwargs, args)[source]
Merge cube-tracker CLI args into
env_kwargs. Returns it for chaining.
- rl_training_validation.utils.env_safety.add_wrist_camera_cli(parser)[source]
Add
--wrist-cameraflag (default off).NED2 envs accept
use_wrist_camera: boolas a kwarg. Default off so there’s no extra Gazebo / vision-node load when the user doesn’t need it.- Parameters:
parser (ArgumentParser)
- Return type:
None
- rl_training_validation.utils.env_safety.apply_wrist_camera_kwargs(env_kwargs, args)[source]
Merge
--wrist-cameraintoenv_kwargs. Returns it for chaining.
- rl_training_validation.utils.env_safety.add_goal_pose_cli(parser)[source]
Add
--goal-pose-topicfor push real validation.Empty by default → the env keeps using random / hard-coded goals. When set (e.g.
/goal_pose), the env subscribes; on each reset the latest pose (if fresh) overrides the random / hard-coded goal. Use with--cube-tracker autoto also auto-launch the second AprilTag adapter for the goal tag (id 1 by default).- Parameters:
parser (ArgumentParser)
- Return type:
None
- rl_training_validation.utils.env_safety.apply_goal_pose_kwargs(env_kwargs, args)[source]
Merge
--goal-pose-topicintoenv_kwargs. Returns it for chaining.
- rl_training_validation.utils.env_safety.with_seed_suffix(path, seed)[source]
Append
seed_<N>/to a save / log path.Keeps runs trained with different seeds in separate directories so a later run does not silently overwrite (or load from) a previous seed’s checkpoints. The trailing slash is preserved when present in the input path.
- rl_training_validation.utils.env_safety.check_env_constructable(env_id, allow_real_flag=False)[source]
Raise SystemExit if
env_idis unregistered or a real env without explicit consent.
- rl_training_validation.utils.env_safety.filter_to_implemented(env_ids)[source]
Split a list of env ids into
(registered, missing).
- rl_training_validation.utils.env_safety.assert_goal_env(env_id)[source]
Raise SystemExit if
env_idis not a goal-conditioned env.- Parameters:
env_id (str)
- Return type:
None
- rl_training_validation.utils.env_safety.assert_non_goal_env(env_id)[source]
Raise SystemExit if
env_idis a goal-conditioned env.- Parameters:
env_id (str)
- Return type:
None
- rl_training_validation.utils.env_safety.parse_env_id(env_id)[source]
Return
(robot, mode, task, is_goal)or None.Parses bare ids like
RX200PushGoalSim-v0/NED2ReacherReal-v0. Also accepts theZed2-flavoured RX200 ids (e.g.RX200Zed2PnPGoalSim-v0); the implicit (no-prefix) kinect variants fall through the empty-prefix branch so callers don’t need to special-case the default sensor. Verified to round-trip every currently-registered task id without returningNone.- Parameters:
env_id (str)
MultiTaskEnv
Wrapper for training a single policy across several task envs in one process. Used by the multi-task training scripts below.
- class rl_training_validation.utils.multi_task_env.MultiTaskEnv(env_list, env_args_list=None, wrapper_list=None, wrapper_args_dict=None)[source]
Bases:
EnvA wrapper that trains multiple UniROS-based Gym environments in one agent. Not MPI-parallel; it simply samples a different task each reset.
- Parameters:
env_list (List[str]) – list of registered gym env names (e.g. UniROS names)
env_args_list (List[dict] | None) – list of dicts, one per env in env_list, passed to make()
wrapper_list (List[str] | None) – list of wrapper class names (strings) to apply to every env
wrapper_args_dict (dict | None) – mapping from wrapper name → kwargs dict
- step(action)[source]
Step the currently-active sub-env with
actiontrimmed to its action dimension, then pad the returned observation to the unifiedmax_obs_dimand stampinfo["task_id"].
- reset(*, seed=None, options=None, **kwargs)[source]
Resets the environment to an initial internal state, returning an initial observation and info.
This method generates a new starting state often with some randomness to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the
seedparameter otherwise if the environment already has a random number generator andreset()is called withseed=None, the RNG is not reset.Therefore,
reset()should (in the typical use case) be called with a seed right after initialization and then never again.For Custom environments, the first line of
reset()should besuper().reset(seed=seed)which implements the seeding correctly.Changed in version v0.25: The
return_infoparameter was removed and now info is expected to be returned.- Parameters:
seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random). If the environment does not already have a PRNG and
seed=None(the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG andseed=Noneis passed, the PRNG will not be reset. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again. Please refer to the minimal example above to see this paradigm in action.options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment)
- Returns:
- Observation of the initial state. This will be an element of
observation_space (typically a numpy array) and is analogous to the observation returned by
step().- info (dictionary): This dictionary contains auxiliary information complementing
observation. It should be analogous to the
inforeturned bystep().
- Observation of the initial state. This will be an element of
- Return type:
observation (ObsType)
MultiTaskGoalEnv
Goal-conditioned variant of MultiTaskEnv, routes HER
reward recomputation per sub-env via info["task_id"].
- class rl_training_validation.utils.multi_task_goal_env.MultiTaskGoalEnv(env_list, env_args_list=None, wrapper_list=None, wrapper_args_dict=None)[source]
Bases:
GoalEnvA wrapper that trains multiple UniROS-based GoalEnv gymnasium environments in one agent. Not MPI-parallel; it simply samples a different task each reset.
- Parameters:
env_list (List[str]) – list of registered gym env names (must be GoalEnv)
env_args_list (List[dict] | None) – list of dicts, one per env in env_list, passed to make()
wrapper_list (List[str] | None) – list of wrapper class names (strings) to apply to every env
wrapper_args_dict (dict | None) – mapping from wrapper name → kwargs dict
- step(action)[source]
Step the currently-active sub-env with
actiontrimmed to its action dimension, then pad each piece of the returned observation dict (observation/achieved_goal/desired_goal) to the unified maximum dims and stampinfo["task_id"].
- reset(*, seed=None, options=None, **kwargs)[source]
Reset the environment.
In addition, check if the observation space is correct by inspecting the observation, achieved_goal, and desired_goal keys.
- close()[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling
closeon an already closed environment has no effect and won’t raise an error.
- compute_reward(achieved_goal, desired_goal, info)[source]
Called by HER to recompute rewards when relabelling experience.
Two call shapes have to be supported:
Live step / single transition:
infois adict;ag/dgare 1-D arrays. Route byinfo["task_id"](stamped at reset() and step() time) so a relabel for sub-env i is computed by sub-env i, not whichever sub-env happens to becurrent_envnow.HER batched relabel:
infois a Python list / ndarray of per-transition info dicts (SB3 HERcopy_info_dict=Falsepasses a list). Group transitions byinfo["task_id"]and let each sub-env compute its slice, then scatter the per-slice rewards back into a single float32 array shaped likeag.shape[:-1]. Without this, the previousisinstance(info, dict)check failed on the list path and silently routed every transition throughself.current_env.
Falls back to
self.current_envwhentask_idis missing frominfo(e.g. a single-task unit test that constructsinfoby hand). The fallback may misroute under HER if the active sub-env has changed since the transition was collected — preservetask_idthrough the replay buffer to avoid it.
RX200 — Reach
Sim training / validation
Train an SB3 policy on the RX200 sim Reach task.
Standard env id: RX200ReacherSim-v0
Goal env id: RX200ReacherGoalSim-v0
Requires Gazebo + roscore to already be running. The env class launches the appropriate MoveIt stack itself.
Validate a trained policy against the RX200 sim Reach task.
Mirrors the train script for env construction; loads a saved
model and rolls it out for --episodes episodes, logging
success rate, truncations, and sensor timeouts from info.
Real training / validation
Train an SB3 policy on the RX200 real Reach task.
This is the explicitly-opt-in real-robot trainer. Real motion is
gated by check_env_constructable: it refuses to construct any
...Real env unless --allow-real-robot-motion is passed on the
command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1
so downstream code can read consent from a single source — that env
var is a propagation of the same gate, not an independent channel.
- You MUST also have:
the actual RX200 connected and powered up,
the interbotix MoveIt / driver dependencies installed,
(optional)
rosparam set /allow_real_robot_motion trueif your launch chain prefers to query the parameter server.
Default behaviour without --allow-real-robot-motion is a clear
SystemExit with no motion.
Validate a trained policy against the RX200 real Reach task.
Same single-channel CLI gate as rx200_reach_train_real.
Niryo Ned2 — Reach
Sim training / validation
Train an SB3 policy on the Ned2 sim Reach task.
Standard env id: NED2ReacherSim-v0
Goal env id: NED2ReacherGoalSim-v0
Requires Gazebo + roscore to already be running. The env class launches the appropriate MoveIt stack itself.
Validate a trained policy against the Ned2 sim Reach task.
Mirrors the train script for env construction; loads a saved
model and rolls it out for --episodes episodes, logging
success rate, truncations, and sensor timeouts from info.
Real training / validation
Train an SB3 policy on the Ned2 real Reach task.
Real motion is gated by check_env_constructable: it refuses to
construct any ...Real env unless --allow-real-robot-motion is
passed on the command line. The helper also exports
ALLOW_REAL_ROBOT_MOTION=1 so downstream code can read consent
from a single source — that env var is a propagation of the same
gate, not an independent channel.
Reach does not need a cube tracker. --wrist-camera is optional
(off by default); when set, the env subscribes to the niryo wrist
camera and exposes the decoded frame as self.cv_image_wrist.
Validate a trained policy against the Ned2 real Reach task.
Same single-channel CLI gate as ned2_reach_train_real.
Trossen VX300S — Reach
Sim training / validation
Train an SB3 policy on the VX300S sim Reach task.
Standard env id: VX300SReacherSim-v0
Goal env id: VX300SReacherGoalSim-v0
The env launches its own Gazebo/roscore by default and starts the appropriate MoveIt stack itself.
Validate a trained policy against the VX300S sim Reach task.
Mirrors the train script for env construction; loads a saved
model and rolls it out for --episodes episodes, logging
success rate, truncations, and sensor timeouts from info.
Real training / validation
Train an SB3 policy on the VX300S real Reach task.
This is the explicitly-opt-in real-robot trainer. Real motion is
gated by check_env_constructable: it refuses to construct any
...Real env unless --allow-real-robot-motion is passed on the
command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1
so downstream code can read consent from a single source — that env
var is a propagation of the same gate, not an independent channel.
- You MUST also have:
the actual VX300S connected and powered up,
the interbotix MoveIt / driver dependencies installed,
(optional)
rosparam set /allow_real_robot_motion trueif your launch chain prefers to query the parameter server.
Default behaviour without --allow-real-robot-motion is a clear
SystemExit with no motion.
Validate a trained policy against the VX300S real Reach task.
Same single-channel CLI gate as vx300s_reach_train_real.
Universal Robots UR5e — Reach
Sim training / validation
Train an SB3 policy on the UR5e sim Reach task.
Standard env id: UR5eReacherSim-v0
Goal env id: UR5eReacherGoalSim-v0
The env launches its own Gazebo/roscore by default and starts the appropriate MoveIt stack itself.
Validate a trained policy against the UR5e sim Reach task.
Mirrors the train script for env construction; loads a saved
model and rolls it out for --episodes episodes, logging
success rate, truncations, and sensor timeouts from info.
Real training / validation
Train an SB3 policy on the UR5e real Reach task.
This is the explicitly-opt-in real-robot trainer. Real motion is
gated by check_env_constructable: it refuses to construct any
...Real env unless --allow-real-robot-motion is passed on the
command line. The helper also exports ALLOW_REAL_ROBOT_MOTION=1
so downstream code can read consent from a single source — that env
var is a propagation of the same gate, not an independent channel.
- You MUST also have:
the actual UR5e connected and powered up,
the interbotix MoveIt / driver dependencies installed,
(optional)
rosparam set /allow_real_robot_motion trueif your launch chain prefers to query the parameter server.
Default behaviour without --allow-real-robot-motion is a clear
SystemExit with no motion.
Validate a trained policy against the UR5e real Reach task.
Same single-channel CLI gate as ur5e_reach_train_real.
Multi-task learning
Multi-task TD3 training across several UniROS sim envs simultaneously.
By default this uses the two implemented RX200 sim envs that share similar dynamics: Reach and Push. The user can switch in any other implemented env via the CLI; the script refuses to construct an unimplemented env id.
This script is sim-only. Real envs cannot be added to the multi-task mix from here — they require explicit per-env consent and are not covered by the multi-task wrapper’s resampling logic.