UR5e — Reach
The arm must move its end-effector to a 3D target sampled in the workspace above the cafe-table. No cube; the gripper is not commanded. The achieved goal is the EE position; the desired goal is the sampled target.
This page covers four registered Gymnasium env IDs:
- UR5eReacherSim-v0: standard, Gazebo
- UR5eReacherGoalSim-v0: goal-conditioned (HER), Gazebo
- UR5eReacherReal-v0: standard, real hardware
- UR5eReacherGoalReal-v0: goal-conditioned, real hardware
Description
A UR5e arm with a Robotiq 2F-85 gripper sits on a 4-legged
ur5_base (top at z = 0.59) next to a cafe_table workspace at
world (0.7, 0, 0). The agent commands joint-space deltas (default)
or absolute joint positions, or alternatively end-effector position
deltas in EE mode (ee_action_type=True). Every commanded action
is checked link-by-link against the workspace floor before being
published — actions that would dip a link below the safety floor or
into the cafe-table footprint are rejected with a penalty reward.
The env loop runs at environment_loop_rate (default 10 Hz). In
real-time mode (realtime_mode=True, the default), Gazebo physics
is never paused; step() reads the latest cached obs / reward /
done values. Otherwise the standard MDP loop pauses physics around
each action.
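The per-link safety check described above can be pictured with a short sketch. It is not the env's implementation: it assumes the link positions have already been computed by FK, and the helper name `is_pose_safe`, the floor height, and the table half-extents are hypothetical values chosen for illustration. Only the table centre (0.7, 0) and the table-top height (0.775) come from this page.

```python
def is_pose_safe(link_positions, floor_z, table_center, table_half_extents,
                 table_top_z):
    """Return False if any link is below the floor or inside the table volume.

    link_positions: iterable of (x, y, z) points, one per link (from FK).
    All geometry arguments are supplied by the caller; none are env defaults.
    """
    cx, cy = table_center
    hx, hy = table_half_extents
    for x, y, z in link_positions:
        if z < floor_z:
            return False  # link dips below the safety floor
        inside_footprint = abs(x - cx) <= hx and abs(y - cy) <= hy
        if inside_footprint and z < table_top_z:
            return False  # link enters the table footprint below its top
    return True


# Hypothetical geometry: floor_z and table_half_extents are illustrative only.
GEOMETRY = dict(floor_z=0.60, table_center=(0.7, 0.0),
                table_half_extents=(0.4, 0.4), table_top_z=0.775)

print(is_pose_safe([(0.0, 0.0, 0.8), (0.4, 0.0, 0.95)], **GEOMETRY))
print(is_pose_safe([(0.7, 0.0, 0.70)], **GEOMETRY))
```

A check of this kind runs before the action is published, so a rejected action never reaches the controller and the step can return the penalty reward instead.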
Action Space
Joint mode (default, ee_action_type=False). Box(6,):
| Num | Action | Min | Max | Joint | Unit |
|---|---|---|---|---|---|
| 0 | shoulder pan delta (or absolute) | -3.14 | +3.14 | | rad |
| 1 | shoulder lift delta | -3.14 | +3.14 | | rad |
| 2 | elbow delta | -3.14 | +3.14 | | rad |
| 3 | wrist 1 delta | -3.14 | +3.14 | | rad |
| 4 | wrist 2 delta | -3.14 | +3.14 | | rad |
| 5 | wrist 3 delta | -3.14 | +3.14 | | rad |
When delta_action=True (default), the action is scaled by
delta_coeff = 0.05 and added to the current joint position. When
delta_action=False the action is the absolute joint target,
clipped to the box bounds.
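The two joint-mode behaviours can be sketched in a few lines. This is an illustration of the rule stated above, not the env's code: `joint_target` is a hypothetical helper, and whether the env also clips the result in delta mode is not stated here, so the sketch leaves delta targets unclipped.

```python
def joint_target(current, action, delta_action=True, delta_coeff=0.05,
                 low=-3.14, high=3.14):
    """Return the six joint targets (rad) for one step."""
    if delta_action:
        # Delta mode: scale the action and add it to the current position.
        return [q + delta_coeff * a for q, a in zip(current, action)]
    # Absolute mode: the action is the target itself, clipped to the box.
    return [min(max(a, low), high) for a in action]


current = [0.0, -1.5707, 1.5707, -1.5707, -1.5707, 0.0]
print(joint_target(current, [1.0, 0.0, 0.0, 0.0, 0.0, -2.0]))
print(joint_target(current, [4.0, 0.0, 0.0, 0.0, 0.0, -4.0],
                   delta_action=False))
```

With the default coefficient, an action of 1.0 moves a joint by 0.05 rad per step, so even the largest action in the box (3.14) moves a joint by about 0.157 rad per step.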
EE mode (ee_action_type=True). Box(3,) — ΔEE position in
the robot’s base frame:
| Num | Action | Min | Max | Notes |
|---|---|---|---|---|
| 0 | Δx (or absolute x) | -0.90 | 1.20 | EE x in base_link frame |
| 1 | Δy (or absolute y) | -0.90 | 0.90 | EE y |
| 2 | Δz (or absolute z) | 0.00 | 1.50 | EE z |
The env solves IK against the target EE pose, then publishes the joint-space trajectory through the same per-link safety check as joint mode.
Observation Space
Standard env (``UR5eReacherSim-v0`` / ``UR5eReacherReal-v0``).
Box(27,) by default (24 if ee_action_type=True):
| Idx | Dim | Component | Source | Unit |
|---|---|---|---|---|
| 0–2 | 3 | EE position | MoveIt FK | m |
| 3–5 | 3 | Unit vector EE → goal | normalized | unitless |
| 6 | 1 | Euclidean distance EE → goal | ‖goal − ee‖ | m |
| 7–13 | 7 | Current joint positions | | rad |
| 14–19 | 6 (or 3) | Previous action | cached | matches action space |
| 20–26 | 7 | Current joint velocities | | rad/s |
The 7-element joint vectors are in alphabetical order from
/joint_states: elbow_joint, robotiq_85_left_knuckle_joint,
shoulder_lift_joint, shoulder_pan_joint, wrist_1_joint,
wrist_2_joint, wrist_3_joint.
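When debugging a policy it helps to split the flat vector back into named parts. The sketch below does this for the 27-dim joint-mode layout in the table above and labels the 7-element joint vectors with the alphabetical order just listed. The helper names `split_observation` and `joints_by_name` are hypothetical and not part of the package.

```python
# Index ranges of the standard 27-dim observation (joint mode).
OBS_SLICES = {
    "ee_position": slice(0, 3),
    "goal_direction": slice(3, 6),
    "goal_distance": slice(6, 7),
    "joint_positions": slice(7, 14),
    "previous_action": slice(14, 20),
    "joint_velocities": slice(20, 27),
}

# Alphabetical joint order as reported on /joint_states.
JOINT_ORDER = [
    "elbow_joint",
    "robotiq_85_left_knuckle_joint",
    "shoulder_lift_joint",
    "shoulder_pan_joint",
    "wrist_1_joint",
    "wrist_2_joint",
    "wrist_3_joint",
]


def split_observation(obs):
    """Split a 27-element observation into its named components."""
    if len(obs) != 27:
        raise ValueError(f"expected 27 values, got {len(obs)}")
    return {name: list(obs[s]) for name, s in OBS_SLICES.items()}


def joints_by_name(values):
    """Label a 7-element joint vector with the /joint_states joint names."""
    return dict(zip(JOINT_ORDER, values))


parts = split_observation(list(range(27)))
print(joints_by_name(parts["joint_positions"]))
```

Note that the alphabetical order differs from the kinematic order used in the action table: the shoulder pan joint is action index 0 but sits at position 3 of the joint vectors.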
Goal env (``UR5eReacherGoalSim-v0`` / ``UR5eReacherGoalReal-v0``).
Gymnasium Dict with three keys:
observation — Box(24,) (or 21 in EE mode). Same as the standard
env’s Box minus the EE→goal feature columns (no goal info leaks into
the policy’s plain observation).
desired_goal — Box(3,). Sampled target XYZ in base frame:
| Idx | Dim | Component | Min | Max |
|---|---|---|---|---|
| 0 | 1 | goal x | 0.40 | 0.80 |
| 1 | 1 | goal y | -0.30 | 0.30 |
| 2 | 1 | goal z | 0.85 | 1.10 |
achieved_goal — Box(3,). Current EE XYZ (same coordinate frame
as desired_goal).
Rewards
The env supports two reward modes selected by the reward_type
kwarg.
Sparse (reward_type="Sparse", required for HER on goal envs):

    reward = 0.0 if ‖ee − goal‖ < reach_tolerance else -1.0

Dense (reward_type="Dense", default for std env):

    reward = -multiplier_dist_reward * ‖ee − goal‖    # step shaping
           + reached_goal_reward             if ‖ee − goal‖ < reach_tolerance
           + step_reward                     every step
           + joint_limits_reward             if action outside joint bounds
           + none_exe_reward                 if MoveIt plan / FK safety rejects
           + not_within_goal_space_reward    if goal sampling failed
Defaults (from config/ur5e_reach_task_config.yaml):
reach_tolerance=0.02, multiplier_dist_reward=2.0,
reached_goal_reward=20, step_reward=-0.5,
joint_limits_reward=-2.0, none_exe_reward=-5.0,
not_within_goal_space_reward=-2.0.
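To make the arithmetic concrete, the two formulas can be written as plain Python using the defaults listed above. This is a sketch of the formulas as read from this page, not the env's implementation: the function names and the boolean flags are hypothetical, and it assumes the conditional terms simply add up when several apply in the same step.

```python
import math

DEFAULTS = {
    "reach_tolerance": 0.02,
    "multiplier_dist_reward": 2.0,
    "reached_goal_reward": 20.0,
    "step_reward": -0.5,
    "joint_limits_reward": -2.0,
    "none_exe_reward": -5.0,
    "not_within_goal_space_reward": -2.0,
}


def sparse_reward(ee, goal, cfg=DEFAULTS):
    """0.0 inside the reach tolerance, -1.0 otherwise."""
    return 0.0 if math.dist(ee, goal) < cfg["reach_tolerance"] else -1.0


def dense_reward(ee, goal, outside_joint_bounds=False, rejected=False,
                 goal_sampling_failed=False, cfg=DEFAULTS):
    """Distance shaping plus the per-step and conditional terms."""
    dist = math.dist(ee, goal)
    reward = -cfg["multiplier_dist_reward"] * dist  # step shaping
    reward += cfg["step_reward"]                    # applied every step
    if dist < cfg["reach_tolerance"]:
        reward += cfg["reached_goal_reward"]
    if outside_joint_bounds:
        reward += cfg["joint_limits_reward"]
    if rejected:
        reward += cfg["none_exe_reward"]
    if goal_sampling_failed:
        reward += cfg["not_within_goal_space_reward"]
    return reward


ee, goal = (0.5, 0.0, 1.0), (0.6, 0.0, 1.0)
print(sparse_reward(ee, goal), dense_reward(ee, goal))
```

For an EE 0.1 m from the goal, the dense reward works out to -2.0 × 0.1 - 0.5 = -0.7 (up to floating-point rounding), and a rejected action at the same distance would give -5.7.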
Code example:

    import uniros as gym
    import rl_environments  # noqa: F401 (triggers registration)

    # Standard env, dense reward
    env = gym.make("UR5eReacherSim-v0", reward_type="Dense")

    # Goal env, sparse reward (HER)
    env = gym.make("UR5eReacherGoalSim-v0", reward_type="Sparse")
Starting State
Initial joint pose (folded upright, set via
gazebo_msgs/SetModelConfiguration while Gazebo is paused, then
unpaused):

    shoulder_pan_joint  =  0.000
    shoulder_lift_joint = -1.5707   (-90°, upper arm vertical up)
    elbow_joint         =  1.5707   (+90°, forearm horizontal forward)
    wrist_1_joint       = -1.5707
    wrist_2_joint       = -1.5707
    wrist_3_joint       =  0.000
The arm’s all-zeros URDF pose puts it horizontal at base height (z = 0.59), colliding with the cafe-table column at x = 0.7. The folded pose above puts the EE over the workspace at roughly (0.40, 0, 0.95) in world coordinates.
Goal sampling. Each reset() draws a fresh
desired_goal ∈ Box(3,) from
[position_goal_min, position_goal_max]:
x ∈ [0.40, 0.80] y ∈ [-0.30, 0.30] z ∈ [0.85, 1.10]
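This box sits above the cafe-table top (z = 0.775). Measured from the arm base at (0, 0, 0.59), its near side is well inside the UR5e’s ≈ 0.85 m reach, while the far corners at x = 0.80, y = ±0.30 lie roughly 0.89 m to 1.0 m away, beyond that nominal figure.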
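The relation between the goal box and the quoted reach can be checked numerically. The sketch below samples a goal and measures the distance from the arm base to each corner of the box. It assumes uniform sampling per axis and that the goal bounds and the base position are expressed in the same frame. Neither assumption is confirmed here, and `sample_goal` and `corner_distances` are hypothetical helpers.

```python
import itertools
import math
import random

# Bounds and base position as quoted in this section.
POSITION_GOAL_MIN = (0.40, -0.30, 0.85)
POSITION_GOAL_MAX = (0.80, 0.30, 1.10)
ARM_BASE = (0.0, 0.0, 0.59)


def sample_goal(rng):
    """Draw one goal, independently per axis (uniform is an assumption)."""
    return tuple(rng.uniform(lo, hi)
                 for lo, hi in zip(POSITION_GOAL_MIN, POSITION_GOAL_MAX))


def corner_distances(base=ARM_BASE):
    """Sorted distances (m) from `base` to the 8 corners of the goal box."""
    corners = itertools.product(*zip(POSITION_GOAL_MIN, POSITION_GOAL_MAX))
    return sorted(math.dist(base, corner) for corner in corners)


rng = random.Random(0)
goal = sample_goal(rng)
dists = corner_distances()
print("sampled goal:", goal)
print("nearest / farthest corner:", dists[0], dists[-1])
```

The nearest corners come out at about 0.56 m and the farthest at about 0.995 m, so goals drawn near the far edge of the box sit at the limit of what the arm can reach.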
Episode End
Truncation. Episodes truncate after max_episode_steps (default
100, set at registration time; override via the TimeLimitWrapper
in the train scripts). Episodes also terminate / truncate if the
joint-state staleness gate fires on the real env (/joint_states
not updated for joint_state_timeout_s = 0.5 seconds).
Termination. Episode terminates when the EE reaches the goal
(‖ee − goal‖ < reach_tolerance). Termination is only set on the
sparse-reward path; on dense reward the agent keeps accumulating the
shaping signal even after reaching the goal until the time limit.
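The episode-end rules above combine into a pair of flags in the usual Gymnasium sense. The following sketch is one reading of them, not the env's code: `episode_flags` is a hypothetical helper, and it treats a stale /joint_states reading as a truncation, which this page leaves open ("terminate / truncate").

```python
def episode_flags(distance, step_count, reward_type,
                  max_episode_steps=100, reach_tolerance=0.02,
                  joint_state_age_s=0.0, joint_state_timeout_s=0.5):
    """Return (terminated, truncated) for the current step."""
    reached = distance < reach_tolerance
    # Reaching the goal ends the episode only on the sparse-reward path.
    terminated = reached and reward_type == "Sparse"
    # Real env only: /joint_states older than the timeout is treated as stale.
    stale = joint_state_age_s > joint_state_timeout_s
    truncated = step_count >= max_episode_steps or stale
    return terminated, truncated


print(episode_flags(0.01, 5, "Sparse"))
print(episode_flags(0.01, 5, "Dense"))
print(episode_flags(0.50, 100, "Dense"))
```

On the dense path the first flag stays False even inside the tolerance, so the episode runs until the step limit.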
Arguments
Top-level kwargs to gym.make("UR5eReacher*-v0", ...). All have
sensible defaults; only gazebo_gui and reward_type are
commonly overridden.
| Kwarg | Default | Meaning |
|---|---|---|
| | | RNG seed for goal sampling. |
| | | Set |
| | | One of |
| | | |
| | | |
| | | Scale factor when |
| | | Hz for the internal env loop / obs cache update. |
| | | Seconds the env waits between actions. Must be ≥ 1 / |
| | | Time the controller has to interpolate to the commanded joint target. |
| | | |
| | | Opt-in subscribe to |
| | | Verbose |
Real-only kwargs (UR5eReacher*Real-v0): these envs inherit the above,
plus the --allow-real-robot-motion gate enforced by
rl_training_validation.utils.env_safety.check_env_constructable.
Version History
v0: first release (rl_environments v0.1.0). Per-link FK safety check; SetModelConfiguration init-pose path; 27-dim Box obs (standard) or 24-dim Box + 3-dim Box × 2 (goal).