Training a model
================

The environments produced by this framework are **standard
gymnasium environments** that expose the usual
``reset`` / ``step`` / ``action_space`` / ``observation_space``
surface. Any RL library that consumes that interface should work.
Current verification status:

* **Tested**: Stable Baselines 3 via ``sb3_ros_support`` (the
  primary path used in the paper experiments and in
  ``rl_training_validation``).
* **Likely works**: plain Stable Baselines 3 without the support
  wrapper; hand-written training loops.
* **Unverified but expected to work**: CleanRL, Tianshou, RLlib,
  Tensorforce. They each inspect a few env attributes (``spec``,
  ``metadata``, ``unwrapped``, vector-env assumptions) that the
  proxy forwards correctly via ``__getattr__``, but the full
  matrix hasn't been exercised end-to-end.
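
Before pointing an untested library at an env, a quick pre-flight
check of the surface listed above can catch a missing attribute
early. The helper below is a minimal sketch, not part of uniros:
``missing_env_surface`` and ``_ToyEnv`` are hypothetical names used
only for illustration, and a real call would pass the object returned
by ``uniros.make``.

.. code-block:: python

   # Attribute names taken from the Gymnasium surface described above.
   REQUIRED_ATTRS = ("reset", "step", "close",
                     "action_space", "observation_space")


   def missing_env_surface(env, required=REQUIRED_ATTRS):
       """Return the names in ``required`` that ``env`` does not expose."""
       return [name for name in required if not hasattr(env, name)]


   class _ToyEnv:
       """Hypothetical stand-in env, used only to demonstrate the check."""

       action_space = object()
       observation_space = object()

       def reset(self, seed=None, options=None):
           return 0.0, {}

       def step(self, action):
           return 0.0, 0.0, True, False, {}

       def close(self):
           pass


   # An empty list means every required attribute was found.
   print(missing_env_surface(_ToyEnv()))

An empty result only shows that the attributes exist; it says nothing
about whether a given library's vector-env or wrapper assumptions hold.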

``uniros.make()`` returns a proxy that behaves like ``gym.Env``
but runs the underlying env in a worker process. That's the only
framework-specific detail; everything downstream is
gymnasium-shaped.
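
To illustrate why lookups such as ``metadata`` can succeed through a
proxy, here is a minimal in-process sketch of ``__getattr__``
forwarding. It is not the uniros implementation: the real proxy talks
to a worker process, whereas this one simply holds a reference to the
target. ``ForwardingProxy`` and ``_InnerEnv`` are hypothetical names.

.. code-block:: python

   class _InnerEnv:
       """Hypothetical stand-in for the env that the proxy wraps."""

       metadata = {"render_modes": []}

       def reset(self, seed=None, options=None):
           return [0.0], {}

       def step(self, action):
           return [float(action)], 1.0, False, False, {}


   class ForwardingProxy:
       """Forward any attribute the proxy lacks to the wrapped object."""

       def __init__(self, target):
           self._target = target

       def __getattr__(self, name):
           # __getattr__ runs only when normal lookup on the proxy fails.
           # Guard against recursion if _target itself is not yet set.
           if name == "_target":
               raise AttributeError(name)
           return getattr(self._target, name)


   proxy = ForwardingProxy(_InnerEnv())
   obs, info = proxy.reset(seed=0)   # method call is forwarded
   modes = proxy.metadata            # plain attribute is forwarded too

A cross-process proxy would replace the ``getattr`` call with a request
to the worker, but the lookup hook is the same.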

This page shows three increasingly involved options:

1. :ref:`training-raw-sb3` — pure Stable Baselines 3 with no
   ROS-specific wrappers.
2. :ref:`training-sb3-ros-support` — the convenience layer that
   ships with this ecosystem (YAML config, ROS-aware paths,
   uniform train / save / load surface, HER ready for goal envs).
3. :ref:`training-other-frameworks` — pointers for using CleanRL,
   Tianshou, RLlib, or a custom loop.

For the dedicated joint-sim-and-real training pattern (Use Case
C from the paper), see :doc:`joint_sim_real_training`.


.. _training-raw-sb3:

Option 1 — Plain Stable Baselines 3
-----------------------------------

The simplest possible training script. No YAML, no extra wrappers,
just SB3 against a uniros-managed env.

.. code-block:: python

   #!/usr/bin/env python3
   import rospy
   from multiros.utils import gazebo_core
   import uniros as gym
   import rl_environments  # registers gym IDs

   from stable_baselines3 import SAC


   if __name__ == "__main__":
       gazebo_core.launch_gazebo(launch_roscore=True, gui=False)
       rospy.init_node("rx200_reach_train_plain_sb3")

       env = gym.make("RX200ReacherSim-v0")

       model = SAC(
           "MlpPolicy",
           env,
           learning_rate=3e-4,
           buffer_size=1_000_000,
           batch_size=256,
           tensorboard_log="./tb_logs/",
           verbose=1,
       )
       model.learn(total_timesteps=100_000)
       model.save("rx200_reach_sac")

       env.close()

This works because ``uniros.make`` returns an object that responds
to ``reset`` / ``step`` / ``close`` exactly like ``gym.Env``. SB3
doesn't know or care about ROS.


.. _training-sb3-ros-support:

Option 2 — sb3_ros_support
--------------------------

If your script already lives in a ROS package and you want
config-driven training (so swapping PPO for SAC or TD3 is a YAML
edit, not a code rewrite), :doc:`/api/sb3_ros_support` adds:

* A ``BasicModel`` base class and one subclass per algorithm —
  ``PPO``, ``A2C``, ``DDPG``, ``TD3``, ``SAC``, ``DQN``, plus their
  goal-conditioned ``*_GOAL`` variants for HER.
* YAML-driven hyperparameter loading via ``ros_load_yaml``.
* Convenient ``train`` / ``save_model`` / ``load_trained_model``
  / ``predict`` surface that wraps the underlying SB3 model.

.. code-block:: python

   #!/usr/bin/env python3
   import rospy
   from multiros.utils import gazebo_core
   import uniros as gym
   import rl_environments

   from sb3_ros_support.td3 import TD3


   if __name__ == "__main__":
       gazebo_core.launch_gazebo(launch_roscore=True, gui=False)
       rospy.init_node("rx200_reach_train_sim")

       env = gym.make("RX200ReacherSim-v0")
       env.reset()

       # YAML config lives inside the rl_training_validation package's
       # ``config/`` directory. Replace the filename for SAC, PPO, etc.
       pkg_path = "rl_training_validation"
       model = TD3(
           env,
           save_model_path="/models/td3/",
           log_path="/logs/td3/",
           model_pkg_path=pkg_path,
           config_file_pkg=pkg_path,
           config_filename="rx200_reacher_td3.yaml",
       )

       model.train()
       model.save_model()

       env.close()

Working examples live under
``rl_training_validation/src/rl_training_validation/rx200/reach/``:

* ``rx200_reach_train_sim.py`` / ``rx200_reach_validate_sim.py``
* ``rx200_reach_train_real.py`` / ``rx200_reach_validate_real.py``

See :doc:`/api/sb3_ros_support` for the full algorithm list, and
:doc:`/api/rl_training_validation` for the working scripts.


.. _training-other-frameworks:

Option 3 — Any other gymnasium-compatible framework
---------------------------------------------------

CleanRL, Tianshou, RLlib, Tensorforce, and hand-written training
loops should all be adaptable because they consume the Gymnasium
API and that's what ``uniros.make`` produces. SB3 via
``sb3_ros_support`` is the tested path; the snippets below are
integration sketches — they haven't been exercised end-to-end on
this codebase.

**CleanRL**

CleanRL training scripts are single-file. Replace the line that
creates ``env`` with ``uniros.make``:

.. code-block:: python

   import uniros as gym
   import rl_environments

   def make_env(env_id):
       def thunk():
           env = gym.make(env_id)
           return env
       return thunk

   # ... rest of CleanRL ppo_continuous_action.py / sac_continuous_action.py
   # uses `make_env` as is.

**Tianshou**

.. code-block:: python

   import uniros as gym
   import rl_environments
   from tianshou.env import DummyVectorEnv
   from tianshou.policy import SACPolicy

   env = DummyVectorEnv([lambda: gym.make("RX200ReacherSim-v0")
                        for _ in range(4)])
   # ... continue with the standard Tianshou trainer.

**RLlib**

RLlib expects a registered env. Register a thin wrapper:

.. code-block:: python

   from ray.tune.registry import register_env
   import uniros as uniros_gym
   import rl_environments

   def _make(config):
       return uniros_gym.make(config["env_id"])

   register_env("rx200_reacher", _make)

   # algo = ppo.PPO(config={"env": "rx200_reacher",
   #                        "env_config": {"env_id": "RX200ReacherSim-v0"},
   #                        ...})

**Hand-written training loop**

.. code-block:: python

   import uniros as gym
   import rl_environments

   env = gym.make("RX200ReacherSim-v0")
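   obs, _ = env.reset(seed=0)
   for step in range(100_000):
       action = your_policy(obs)  # placeholder: your action selection
       next_obs, reward, term, trunc, info = env.step(action)
       # Pass both the pre-step and post-step observations so the learner
       # (also a placeholder) sees the full transition.
       your_learner.observe(obs, action, reward, next_obs, term)
       obs = next_obs
       if term or trunc:
           obs, _ = env.reset()
   env.close()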

The only point where the framework's identity matters is the call
to ``uniros.make`` (which runs the env in a worker process). Once
you have the proxy in hand, treat it as a normal gymnasium env.
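
Because the proxy follows the Gymnasium step contract, an evaluation
helper can be written against that contract alone. The sketch below
sums the reward over one episode. ``run_episode``, ``_CountdownEnv``,
and the ``max_steps`` cap are illustrative assumptions, not framework
API; with a real env you would pass the object from ``uniros.make``
and your own policy callable.

.. code-block:: python

   def run_episode(env, policy, seed=None, max_steps=1000):
       """Run one episode and return (total_reward, steps_taken)."""
       obs, _info = env.reset(seed=seed)
       total, steps = 0.0, 0
       for _ in range(max_steps):
           obs, reward, terminated, truncated, _info = env.step(policy(obs))
           total += float(reward)
           steps += 1
           if terminated or truncated:
               break
       return total, steps


   class _CountdownEnv:
       """Hypothetical toy env: reward 1.0 per step, ends after ``horizon``."""

       def __init__(self, horizon=3):
           self.horizon = horizon
           self.t = 0

       def reset(self, seed=None, options=None):
           self.t = 0
           return self.t, {}

       def step(self, action):
           self.t += 1
           return self.t, 1.0, self.t >= self.horizon, False, {}


   total, steps = run_episode(_CountdownEnv(horizon=3), policy=lambda obs: 0)

The ``max_steps`` cap is a safety net for envs that never signal
termination or truncation; it does not replace a proper time limit.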


Configuration via YAML (sb3_ros_support)
----------------------------------------

When using ``sb3_ros_support``, hyperparameters live in a YAML file
under any ROS package you control. The working examples ship under
``rl_training_validation/config/``:

* ``rx200_reacher_sac.yaml`` / ``rx200_reacher_sac_goal.yaml``
* ``rx200_reacher_td3.yaml`` / ``rx200_reacher_td3_goal.yaml``
* ``rx200_push_td3.yaml`` / ``rx200_push_td3_goal.yaml``
* ``multi_task_td3.yaml`` / ``multi_task_td3_goal.yaml``

A typical file:

.. code-block:: yaml

   total_timesteps: 100000
   learning_starts: 1000

   policy: "MlpPolicy"
   policy_kwargs:
     net_arch: [256, 256]
   learning_rate: 0.0003

   buffer_size: 1000000
   batch_size: 256
   gamma: 0.99
   tau: 0.005

   action_noise:
     type: "normal"
     mean: 0.0
     stddev: 0.1

   # HER block (only for *_GOAL algorithms)
   her:
     n_sampled_goal: 4
     goal_selection_strategy: "future"

Pass the filename as the algorithm wrapper's ``config_filename``
argument; all the keys above are read at ``train()`` time.
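
One way to picture how such a file could be consumed is to split the
parsed mapping into constructor arguments, training arguments, and the
optional noise and HER blocks. This is a sketch under assumptions, not
how ``sb3_ros_support`` actually parses its config: ``split_config`` is
a hypothetical helper, and the dict below stands in for the output of a
YAML parser.

.. code-block:: python

   def split_config(cfg):
       """Split a parsed config into (model_kwargs, train_kwargs, noise, her)."""
       cfg = dict(cfg)  # shallow copy so the caller's mapping is untouched
       train_kwargs = {"total_timesteps": cfg.pop("total_timesteps")}
       noise = cfg.pop("action_noise", None)  # optional block
       her = cfg.pop("her", None)             # only present for goal envs
       return cfg, train_kwargs, noise, her


   # Stand-in for a parsed YAML file with a subset of the keys shown above.
   parsed = {
       "total_timesteps": 100000,
       "learning_starts": 1000,
       "policy": "MlpPolicy",
       "learning_rate": 0.0003,
       "action_noise": {"type": "normal", "mean": 0.0, "stddev": 0.1},
   }

   model_kwargs, train_kwargs, noise, her = split_config(parsed)

Keeping ``total_timesteps`` separate mirrors SB3 itself, where it is an
argument to ``learn`` rather than to the algorithm constructor.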


Logging and checkpoints
-----------------------

Whichever option you use, TensorBoard is the standard way to read the
training logs:

.. code-block:: bash

   tensorboard --logdir /path/to/logs/

Saved models from ``sb3_ros_support`` are SB3 ``.zip`` files that
can be loaded back via
:func:`sb3_ros_support.core.BasicModel.load_trained_model` or
SB3's own ``Algorithm.load(...)``.