UR5e — Reach

The arm must move its end-effector to a 3D target sampled in the workspace above the cafe-table. There is no cube to manipulate, and the gripper is not commanded. The achieved goal is the EE position; the desired goal is the sampled target.

This page covers four registered Gymnasium env IDs:

UR5eReacherSim-v0 — standard, Gazebo
UR5eReacherGoalSim-v0 — goal-conditioned (HER), Gazebo
UR5eReacherReal-v0 — standard, real hardware
UR5eReacherGoalReal-v0 — goal-conditioned, real hardware

Description

A UR5e arm with a Robotiq 2F-85 gripper sits on a 4-legged ur5_base (top at z = 0.59) next to a cafe_table workspace at world (0.7, 0, 0). In joint mode (the default), the agent commands joint-space deltas or, with delta_action=False, absolute joint positions. In EE mode (ee_action_type=True), it commands end-effector positions instead, again as deltas by default. Every commanded action is checked link by link against the workspace floor before being published: actions that would dip a link below the safety floor or into the cafe-table footprint are rejected with a penalty reward.
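The per-link check described above can be pictured with a minimal sketch. The table position and top height come from this page; SAFETY_FLOOR_Z, TABLE_HALF_EXTENT, and both function names are hypothetical and chosen only for illustration, and the real env may define the forbidden regions differently.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float, float]

# Values stated on this page.
TABLE_CENTER_XY = (0.7, 0.0)   # cafe_table position in the world frame
TABLE_TOP_Z = 0.775            # height of the cafe-table top

# Hypothetical values, for illustration only.
SAFETY_FLOOR_Z = 0.05          # assumed height of the safety floor
TABLE_HALF_EXTENT = 0.45       # assumed half-width of the table footprint


def link_is_safe(p: Point) -> bool:
    """Return False if a link point is below the floor or inside the table volume."""
    x, y, z = p
    if z < SAFETY_FLOOR_Z:
        return False
    inside_footprint = (
        abs(x - TABLE_CENTER_XY[0]) <= TABLE_HALF_EXTENT
        and abs(y - TABLE_CENTER_XY[1]) <= TABLE_HALF_EXTENT
    )
    # Above the table top is allowed; below it, inside the footprint, is not.
    return not (inside_footprint and z < TABLE_TOP_Z)


def action_is_safe(link_positions: Iterable[Point]) -> bool:
    """An action is accepted only if every predicted link position is safe."""
    return all(link_is_safe(p) for p in link_positions)
```

In this sketch a single unsafe link is enough to reject the whole action, which matches the "rejected with a penalty reward" behaviour described above.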

The env loop runs at environment_loop_rate (default 10 Hz). In real-time mode (realtime_mode=True, the default), Gazebo physics is never paused; step() reads the latest cached obs / reward / done values. With realtime_mode=False, the env instead follows the standard MDP loop and pauses physics around each action.

Action Space

Joint mode (default, ee_action_type=False). Box(6,):

Num	Action	Min	Max	Joint	Unit
0	shoulder pan delta (or absolute, per `delta_action`)	-3.14	+3.14	`shoulder_pan_joint`	rad
1	shoulder lift delta	-3.14	+3.14	`shoulder_lift_joint`	rad
2	elbow delta	-3.14	+3.14	`elbow_joint`	rad
3	wrist 1 delta	-3.14	+3.14	`wrist_1_joint`	rad
4	wrist 2 delta	-3.14	+3.14	`wrist_2_joint`	rad
5	wrist 3 delta	-3.14	+3.14	`wrist_3_joint`	rad

When delta_action=True (default), the action is scaled by delta_coeff = 0.05 and added to the current joint position. When delta_action=False the action is the absolute joint target, clipped to the box bounds.
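As a minimal sketch of the two interpretations, assuming the scaling and clipping happen exactly as described above (the function name joint_target is hypothetical, not part of the env's API):

```python
from typing import List, Sequence

JOINT_LOW = -3.14    # lower bound of the joint-mode action box
JOINT_HIGH = 3.14    # upper bound of the joint-mode action box
DELTA_COEFF = 0.05   # default delta_coeff


def joint_target(
    action: Sequence[float],
    current: Sequence[float],
    delta_action: bool = True,
    delta_coeff: float = DELTA_COEFF,
) -> List[float]:
    """Turn a Box(6,) action into a joint-position target."""
    if delta_action:
        # Delta mode: scale the action and add it to the current joint position.
        return [q + delta_coeff * a for q, a in zip(current, action)]
    # Absolute mode: the action is the target, clipped to the box bounds.
    return [min(max(a, JOINT_LOW), JOINT_HIGH) for a in action]
```

With the defaults, a full-scale action of 3.14 moves a joint by 3.14 × 0.05 = 0.157 rad in one step.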

EE mode (ee_action_type=True). Box(3,) — ΔEE position in the robot’s base frame:

Num	Action	Min	Max	Notes
0	Δx (or absolute x)	-0.90	1.20	EE x in base_link frame
1	Δy (or absolute y)	-0.90	0.90	EE y
2	Δz (or absolute z)	0.00	1.50	EE z

The env solves IK for the target EE pose, then runs the resulting joint-space trajectory through the same per-link safety check as joint mode before publishing it.

Observation Space

Standard env (UR5eReacherSim-v0 / UR5eReacherReal-v0). Box(27,) by default (24 if ee_action_type=True):

Idx	Dim	Component	Source	Unit
0–2	3	EE position	MoveIt FK	m
3–5	3	Unit vector EE → goal	normalized	unitless
6	1	Euclidean distance EE → goal	‖goal − ee‖	m
7–13	7	Current joint positions	`/ur5e/joint_states.position`	rad
14–19	6 (or 3)	Previous action	cached	matches action space
20–26	7	Current joint velocities	`/ur5e/joint_states.velocity`	rad/s
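The layout in the table can be sketched as a plain concatenation. The function name build_observation is hypothetical, and returning a zero unit vector when the EE sits exactly on the goal is an assumption made here to avoid a division by zero; the env may handle that case differently.

```python
import math
from typing import List, Sequence


def build_observation(
    ee: Sequence[float],
    goal: Sequence[float],
    joint_pos: Sequence[float],
    prev_action: Sequence[float],
    joint_vel: Sequence[float],
) -> List[float]:
    """Concatenate the observation components in the order of the table."""
    diff = [g - e for g, e in zip(goal, ee)]
    dist = math.sqrt(sum(d * d for d in diff))
    # Assumption: fall back to a zero vector if the EE is exactly at the goal.
    unit = [d / dist for d in diff] if dist > 0.0 else [0.0, 0.0, 0.0]
    return (
        list(ee)             # EE position (3)
        + unit               # unit vector EE -> goal (3)
        + [dist]             # Euclidean distance (1)
        + list(joint_pos)    # joint positions (7)
        + list(prev_action)  # previous action (6 in joint mode, 3 in EE mode)
        + list(joint_vel)    # joint velocities (7)
    )
```

The component sizes add up to 3 + 3 + 1 + 7 + 6 + 7 = 27 in joint mode and 24 in EE mode, matching the Box shapes stated above.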

The 7-element joint vectors are in alphabetical order from /joint_states: elbow_joint, robotiq_85_left_knuckle_joint, shoulder_lift_joint, shoulder_pan_joint, wrist_1_joint, wrist_2_joint, wrist_3_joint.
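Because this alphabetical order differs from the arm's kinematic order used by the action space, downstream code often needs to reorder the vector. A minimal sketch, assuming the order listed above (the helper name arm_joints is hypothetical):

```python
from typing import List, Sequence

# Order in which the 7-element joint vectors arrive (alphabetical).
JOINT_STATE_ORDER = [
    "elbow_joint",
    "robotiq_85_left_knuckle_joint",
    "shoulder_lift_joint",
    "shoulder_pan_joint",
    "wrist_1_joint",
    "wrist_2_joint",
    "wrist_3_joint",
]

# Kinematic order used by the joint-mode action table.
ARM_ORDER = [
    "shoulder_pan_joint",
    "shoulder_lift_joint",
    "elbow_joint",
    "wrist_1_joint",
    "wrist_2_joint",
    "wrist_3_joint",
]


def arm_joints(values: Sequence[float]) -> List[float]:
    """Map a 7-element alphabetical vector to the 6 arm joints, dropping the gripper."""
    by_name = dict(zip(JOINT_STATE_ORDER, values))
    return [by_name[name] for name in ARM_ORDER]
```

Looking the values up by name rather than by hard-coded index keeps the mapping correct if the gripper joint is ever added or removed from the message.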

Goal env (UR5eReacherGoalSim-v0 / UR5eReacherGoalReal-v0). Gymnasium Dict with three keys:

observation — Box(24,) (or 21 in EE mode). Same as the standard env’s Box minus the EE→goal feature columns (no goal info leaks into the policy’s plain observation).

desired_goal — Box(3,). Sampled target XYZ in base frame:

Idx	Dim	Component	Min	Max
0	1	goal x	0.40	0.80
1	1	goal y	-0.30	0.30
2	1	goal z	0.85	1.10

achieved_goal — Box(3,). Current EE XYZ (same coordinate frame as desired_goal).

Rewards

The env supports two reward modes selected by the reward_type kwarg.

Sparse (reward_type="Sparse", required for HER on goal envs):

reward = 0.0  if ‖ee − goal‖ < reach_tolerance else -1.0
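
The sparse rule can be written as a small function of the two goals, which is the form HER needs when it relabels transitions. This is a sketch, not the env's own code: the function name sparse_reward is hypothetical, and 0.02 is the reach_tolerance default listed on this page.

```python
import math
from typing import Sequence

REACH_TOLERANCE = 0.02  # default reach_tolerance, in metres


def sparse_reward(
    achieved_goal: Sequence[float],
    desired_goal: Sequence[float],
    reach_tolerance: float = REACH_TOLERANCE,
) -> float:
    """Return 0.0 inside the tolerance sphere around the goal, otherwise -1.0."""
    dist = math.dist(achieved_goal, desired_goal)
    return 0.0 if dist < reach_tolerance else -1.0
```

When a transition is relabelled so that the achieved goal becomes the desired goal, the distance is zero and the recomputed reward is 0.0.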

Dense (reward_type="Dense", default for std env):

reward = -multiplier_dist_reward * ‖ee − goal‖   # step shaping
       + reached_goal_reward     if ‖ee − goal‖ < reach_tolerance
       + step_reward             every step
       + joint_limits_reward     if action outside joint bounds
       + none_exe_reward         if MoveIt plan / FK safety rejects
       + not_within_goal_space_reward  if goal sampling failed

Defaults (from config/ur5e_reach_task_config.yaml): reach_tolerance=0.02, multiplier_dist_reward=2.0, reached_goal_reward=20, step_reward=-0.5, joint_limits_reward=-2.0, none_exe_reward=-5.0, not_within_goal_space_reward=-2.0.
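
A minimal sketch of the dense reward with these defaults, reading the formula above as a plain sum of the terms whose conditions hold. The function name and the three boolean flags are hypothetical stand-ins for conditions the env detects internally.

```python
import math
from typing import Dict, Sequence

DEFAULTS: Dict[str, float] = {
    "reach_tolerance": 0.02,
    "multiplier_dist_reward": 2.0,
    "reached_goal_reward": 20.0,
    "step_reward": -0.5,
    "joint_limits_reward": -2.0,
    "none_exe_reward": -5.0,
    "not_within_goal_space_reward": -2.0,
}


def dense_reward(
    ee: Sequence[float],
    goal: Sequence[float],
    joint_limit_violation: bool = False,
    not_executed: bool = False,
    goal_sampling_failed: bool = False,
    cfg: Dict[str, float] = DEFAULTS,
) -> float:
    """Sum the distance shaping, the per-step cost, and any bonus or penalties."""
    dist = math.dist(ee, goal)
    reward = -cfg["multiplier_dist_reward"] * dist   # distance shaping
    reward += cfg["step_reward"]                     # applied every step
    if dist < cfg["reach_tolerance"]:
        reward += cfg["reached_goal_reward"]
    if joint_limit_violation:
        reward += cfg["joint_limits_reward"]
    if not_executed:
        reward += cfg["none_exe_reward"]
    if goal_sampling_failed:
        reward += cfg["not_within_goal_space_reward"]
    return reward
```

Under this reading, a step that ends 0.1 m from the goal scores -2.0 × 0.1 - 0.5 = -0.7, and a step that ends 0.01 m from the goal scores -0.02 + 20 - 0.5 = 19.48.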

Code example:

import uniros as gym
import rl_environments  # noqa: F401  (triggers registration)

# Standard env, dense reward
env = gym.make("UR5eReacherSim-v0", reward_type="Dense")
# Alternatively: goal env, sparse reward (HER)
env = gym.make("UR5eReacherGoalSim-v0", reward_type="Sparse")

Starting State

Initial joint pose (folded upright, set via gazebo_msgs/SetModelConfiguration while Gazebo is paused, then unpaused):

shoulder_pan_joint  =  0.000
shoulder_lift_joint = -1.5707  (-90°, upper arm vertical up)
elbow_joint         =  1.5707  (+90°, forearm horizontal forward)
wrist_1_joint       = -1.5707
wrist_2_joint       = -1.5707
wrist_3_joint       =  0.000

The arm’s all-zeros URDF pose puts it horizontal at base height (z = 0.59), colliding with the cafe-table column at x = 0.7. The folded pose above puts the EE over the workspace at roughly (0.40, 0, 0.95) in world coordinates.

Goal sampling. Each reset() draws a fresh desired_goal ∈ Box(3,) from [position_goal_min, position_goal_max]:

x ∈ [0.40, 0.80]   y ∈ [-0.30, 0.30]   z ∈ [0.85, 1.10]

This box sits above the cafe-table top (z = 0.775) and within the UR5e’s ≈ 0.85 m reach from the arm base at (0, 0, 0.59).
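
Goal sampling from this box can be sketched as follows. The uniform distribution, the use of Python's random module, and the name sample_goal are assumptions made for illustration; the env's own sampler and RNG may differ.

```python
import random
from typing import Tuple

POSITION_GOAL_MIN = (0.40, -0.30, 0.85)  # lower corner of the goal box
POSITION_GOAL_MAX = (0.80, 0.30, 1.10)   # upper corner of the goal box


def sample_goal(rng: random.Random) -> Tuple[float, ...]:
    """Draw one goal, independently and uniformly per axis (assumed distribution)."""
    return tuple(
        rng.uniform(lo, hi) for lo, hi in zip(POSITION_GOAL_MIN, POSITION_GOAL_MAX)
    )


# A seeded generator makes the sequence of goals reproducible.
rng = random.Random(0)
goal = sample_goal(rng)
```

Seeding the generator plays the same role as the seed kwarg: two runs with the same seed draw the same sequence of goals.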

Episode End

Truncation. Episodes truncate after max_episode_steps (default 100, set at registration time; override via the TimeLimitWrapper in the train scripts). On the real env, an episode also ends (terminated or truncated) if the joint-state staleness gate fires, that is, if /joint_states has not been updated for joint_state_timeout_s = 0.5 seconds.

Termination. The episode terminates when the EE reaches the goal (‖ee − goal‖ < reach_tolerance). Termination is set only on the sparse-reward path; with dense reward, the agent keeps accumulating the shaping signal after reaching the goal until the time limit.
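
The episode-end rules can be condensed into two small helpers. This is a sketch under the defaults stated on this page; the function names are hypothetical, and how the env reports a fired staleness gate (terminated or truncated) is deliberately left out.

```python
from typing import Tuple


def episode_flags(
    dist: float,
    step_count: int,
    reward_type: str,
    reach_tolerance: float = 0.02,
    max_episode_steps: int = 100,
) -> Tuple[bool, bool]:
    """Return (terminated, truncated) for the current step."""
    reached = dist < reach_tolerance
    # Reaching the goal terminates the episode only on the sparse-reward path.
    terminated = reached and reward_type == "Sparse"
    truncated = step_count >= max_episode_steps
    return terminated, truncated


def joint_state_is_stale(
    now: float, last_msg_time: float, joint_state_timeout_s: float = 0.5
) -> bool:
    """Staleness gate: True if no joint-state update arrived within the timeout."""
    return (now - last_msg_time) > joint_state_timeout_s
```

With dense reward the first flag stays False even inside the tolerance sphere, so only the time limit (or the staleness gate on real hardware) ends the episode.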

Arguments

Top-level kwargs to gym.make("UR5eReacher*-v0", ...). All have sensible defaults; only gazebo_gui and reward_type are commonly overridden.

Kwarg	Default	Meaning
`seed`	`None`	RNG seed for goal sampling.
`gazebo_gui`	`False`	Set `True` to launch Gazebo with the GUI.
`reward_type`	`"Dense"` (std) / `"Sparse"` (goal)	One of `"Sparse"` or `"Dense"`.
`ee_action_type`	`False`	`True` → Box(3,) EE action; `False` → Box(6,) joint action.
`delta_action`	`True`	`True` → action interpreted as delta (× `delta_coeff`); `False` → action is the absolute target.
`delta_coeff`	`0.05`	Scale factor when `delta_action=True`.
`environment_loop_rate`	`10.0`	Hz for the internal env loop / obs cache update.
`action_cycle_time`	`0.5`	Seconds the env waits between actions. Must be ≥ 1 / `environment_loop_rate`.
`action_speed`	`0.2` (sim) / configurable (real)	Time the controller has to interpolate to the commanded joint target.
`realtime_mode`	`True`	`True` → UniROS real-time loop (physics never paused). `False` → MDP-style pause-step-resume.
`use_kinect`	`False`	Opt-in subscribe to `/head_mount_kinect2/*` for RGB / depth.
`log_internal_state`	`False`	Verbose `rospy.loginfo` for debugging.
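
The constraint between action_cycle_time and environment_loop_rate can be checked before constructing the env. A minimal sketch; check_timing is a hypothetical helper, not part of the package.

```python
def check_timing(
    environment_loop_rate: float = 10.0, action_cycle_time: float = 0.5
) -> float:
    """Validate the timing kwargs and return the number of loop ticks per action."""
    loop_period = 1.0 / environment_loop_rate
    if action_cycle_time < loop_period:
        raise ValueError(
            f"action_cycle_time={action_cycle_time} s is shorter than one "
            f"env loop period ({loop_period} s)"
        )
    return action_cycle_time * environment_loop_rate
```

With the defaults, the loop period is 0.1 s, so each action spans 0.5 × 10 = 5 loop ticks.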

Real-only kwargs (UR5eReacher*Real-v0): the real envs inherit all of the above and must additionally pass the --allow-real-robot-motion gate, enforced by rl_training_validation.utils.env_safety.check_env_constructable.

Version History

v0 — first release (rl_environments v0.1.0). Per-link FK safety check; SetModelConfiguration init-pose path; 27-dim Box obs (standard) or 24-dim Box + 3-dim Box × 2 (goal).