Robotics Training Experiments: What Worked, What Didn't, and What Surprised Me

Part 2 of a series on practical robotics experimentation. See part 1.

After building my simulation environment, the real learning began. I threw multiple approaches at the cube-to-bin task: Soft Actor-Critic (SAC) reinforcement learning, behavior cloning from demonstrations, and modern imitation learning with LeRobot. Each taught me something different about the practical realities of robot learning.

Here’s what I discovered—including some findings that challenged my assumptions and revealed the gap between academic ideals and development reality.

You can find the training code here.

Why SAC? Guidance from ChatGPT and the Literature

When it came time to choose a reinforcement learning algorithm for manipulation, I turned to ChatGPT for guidance on what would work well for the precise, continuous control that robotic manipulation demands.

The Algorithm Landscape:

  • PPO: Great for many RL tasks, but can struggle with the precise, smooth actions needed for manipulation
  • DDPG: Deterministic policy gradients, but known for training instability
  • TD3: Improved version of DDPG with better stability
  • SAC: Soft Actor-Critic with entropy regularization

Why SAC Emerged as the Right Choice: Through my conversations with ChatGPT and reading the literature, SAC stood out for manipulation tasks:

  • Smooth, continuous actions: Essential for robotic control
  • Sample efficiency: Better than policy gradient methods like PPO
  • Stability: More reliable than DDPG/TD3 for continuous control
  • Exploration: The entropy term helps discover diverse manipulation strategies

The Manipulation Sweet Spot: Robotic manipulation requires policies that can be both exploratory (to discover grasping strategies) and precise (for fine motor control). SAC’s soft policy approach naturally balances these needs.
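For reference, the maximum-entropy objective that SAC optimizes makes this trade-off explicit. Roughly, with α as the entropy temperature (a coefficient SAC can tune automatically during training):

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

A larger α rewards more random, exploratory behavior; a smaller α pushes the policy toward purely reward-driven precision.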

My background in ML and deep generative modeling helped me understand why the entropy regularization was important, but ChatGPT’s domain-specific guidance really pointed me toward SAC as the practical choice for this type of problem.

This background influenced my entire approach to the experiments that followed.

The SAC Experiments: Dense vs. Sparse Rewards

The Reward Shaping Struggle: I didn’t initially know about HER. I first spent several days training and tweaking reward shaping—trying sparse and dense variants, and experimenting with penalizing the agent on every step it missed the goal versus not penalizing at all. Poor success rates and weird behaviors from my trained agents eventually pushed me to investigate alternative strategies.

SAC with Dense Rewards: The Winner

For dense rewards, I shaped the reward function around manipulation primitives:

# Simplified reward structure (computed every step)
approach_reward = -distance_to_cube                              # shrink gripper-to-cube distance
grasp_reward = gripper_closure_bonus if near_cube else 0         # reward closing the gripper near the cube
lift_reward = cube_height_above_table * lift_multiplier          # reward lifting the cube off the table
placement_reward = -distance_cube_to_bin if cube_grasped else 0  # move the grasped cube toward the bin
success_reward = 100 if cube_in_bin else 0                       # large terminal bonus
reward = approach_reward + grasp_reward + lift_reward + placement_reward + success_reward

This approach worked reliably. The policy learned smooth approach trajectories, consistent grasping, and coordinated placement motions. Training was stable and converged predictably.

SAC + HER with Sparse Rewards: A Later Discovery

Why HER Seemed Appealing: When I discovered HER, it promised a more disciplined approach to the reward engineering problem I’d been struggling with: just define success and let the algorithm figure out the intermediate steps. The idea is elegant—when an episode fails to reach the intended goal, HER treats it as if it was trying to reach wherever it actually ended up. This creates synthetic “success” experiences from every failed attempt.

The HER Insight: Instead of manually crafting rewards for “getting closer to the cube,” HER automatically discovers that reaching certain intermediate positions is useful for eventually reaching the final goal. It’s like learning to play chess by treating every game as a success for reaching whatever position you actually achieved.

For robotic manipulation, this seemed perfect:

  • No manual reward engineering: Just define task success
  • Automatic curriculum: Learn easier goals (approach) before harder ones (grasp+place)
  • Sample efficiency: Every episode contributes training signal

The Implementation Reality: But first, I had to figure out how to actually implement HER, which wasn’t obvious. The documentation and examples didn’t make it clear how to structure the environment properly.

The Goal Environment Challenge: HER requires a “goal-conditioned” environment where observations include both the current state and a target goal. See my implementation example.
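For readers who want the shape of it without clicking through: here’s a minimal sketch of a goal-conditioned environment in the layout Stable-Baselines3’s HerReplayBuffer expects, with a dict observation (observation / achieved_goal / desired_goal) and a compute_reward method that HER calls when it relabels goals. The class name, the 3-D cube/bin goal vectors, and the _cube_position / _bin_position helpers are illustrative assumptions, not the code from my repo:

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CubeToBinGoalEnv(gym.Env):
    """Illustrative goal-conditioned wrapper in the layout HerReplayBuffer expects."""

    def __init__(self, base_env, goal_threshold=0.05):
        self.base_env = base_env
        self.goal_threshold = goal_threshold
        self.action_space = base_env.action_space
        goal_space = spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
        self.observation_space = spaces.Dict({
            "observation": base_env.observation_space,  # pixels / proprioception
            "achieved_goal": goal_space,                # where the cube currently is
            "desired_goal": goal_space,                 # where the bin wants it
        })

    def _make_obs(self, base_obs):
        return {
            "observation": base_obs,
            "achieved_goal": self._cube_position(),  # hypothetical helper reading sim state
            "desired_goal": self._bin_position(),    # hypothetical helper reading sim state
        }

    def reset(self, **kwargs):
        base_obs, info = self.base_env.reset(**kwargs)
        return self._make_obs(base_obs), info

    def step(self, action):
        base_obs, _, terminated, truncated, info = self.base_env.step(action)
        obs = self._make_obs(base_obs)
        # HER recomputes rewards for relabeled goals through this method, so keep it pure
        reward = float(self.compute_reward(obs["achieved_goal"], obs["desired_goal"], info))
        return obs, reward, terminated, truncated, info

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward: 0 within 5 cm of the target, -1 otherwise (also works on batches of goals)
        dist = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
        return -(dist > self.goal_threshold).astype(np.float32)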

The HER Training Setup: Once you have the goal environment structure:

from stable_baselines3 import HerReplayBuffer, SAC

model = SAC(
    "MultiInputPolicy",  # Required for dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                  # relabeled goals sampled per real transition
        goal_selection_strategy='future',  # relabel with goals reached later in the same episode
    ),
    verbose=1
)

The honest truth about my experience:

  • Learning progress was painfully slow—each step of improvement took much longer compared to dense rewards
  • I never actually had the patience to let it finish training—the progress was so glacial that I always gave up and went back to dense rewards
  • The few times I did wait longer, performance was unstable

Was I doing it right? It’s entirely possible that I wasn’t implementing the GoalEnv correctly, or that my hardware was too slow for HER to show its benefits. The implementation wasn’t obvious from the documentation, and debugging goal-conditioned environments is more complex than standard ones.

For the patient (or well-equipped): If you have more patience than I did—or faster hardware—this goal-conditioned setup might work well for you. But life is short and iteration speed matters.

Lesson learned: Academic papers often focus on sample efficiency, but wall-clock time and learning progress rate matter enormously for practical development. Sometimes the “less elegant” dense reward approach is the right engineering choice.

The SAC Training Setup: Real Implementation Details

Here’s what my actual SAC training pipeline looked like (full implementation in scripts/train_sac.py):

Multi-Environment Training:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecNormalize, VecTransposeImage

# Create vectorized environments for faster training
vec_env = make_vec_env(
    create_single_env,
    n_envs=num_envs,  # I used 2-6 parallel environments
    vec_env_cls=SubprocVecEnv,
    env_kwargs={"task": task},
)
vec_env = VecTransposeImage(vec_env)  # HWC images -> CHW for the CNN feature extractor
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=False, clip_obs=10.0)

Model Configuration:

model = SAC(
    policy="MultiInputPolicy",
    env=vec_env,
    learning_rate=1e-4,
    buffer_size=2_000,  # Started small, later increased
    batch_size=256,
    ent_coef="auto",
    target_entropy=-2.0,
    device=device,
    tensorboard_log=log_dir,
)

Stage-Based Training Approach: Rather than just running continuous training, I implemented a stage-based system that adjusted exploration over time:

class StageBasedTraining:
    def __init__(self, model, vec_env, callback=None, start_steps=0, num_envs=2):
        # Stage 1: High exploration (target_entropy = -2.0)
        # Stage 2: Balanced phase (target_entropy = -3.0) 
        # Stage 3: Exploitation (target_entropy = -7.0, lower learning rate)
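The class in the repo handles the bookkeeping, but the core idea is just chunked training with the entropy target tightened between stages. Here is a rough sketch of that loop, assuming that mutating model.target_entropy between learn() calls is how the stages take effect (SB3’s SAC reads that attribute at each gradient step when ent_coef="auto"); the step counts are made up, the learning-rate drop for stage 3 is omitted, and eval_callback stands in for the video callback described next:

# Hypothetical staging loop, not the repo's actual StageBasedTraining class
stages = [
    dict(steps=200_000, target_entropy=-2.0),  # Stage 1: high exploration
    dict(steps=200_000, target_entropy=-3.0),  # Stage 2: balanced phase
    dict(steps=200_000, target_entropy=-7.0),  # Stage 3: exploitation
]

for stage in stages:
    model.target_entropy = stage["target_entropy"]  # feeds the entropy-temperature loss
    model.learn(
        total_timesteps=stage["steps"],
        reset_num_timesteps=False,  # keep one continuous step counter across stages
        callback=eval_callback,
    )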

Custom Evaluation Callback: The most useful part was a custom callback that recorded videos during training:

from stable_baselines3.common.callbacks import BaseCallback

class EvaluationVideoCallback(BaseCallback):
    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            mean_reward, video_frames = self.evaluate_function(
                self.model, num_episodes=self.num_episodes
            )
            # Save video and update best model if improved
        return True  # returning False here would stop training

This let me visually track learning progress and automatically save the best-performing models with videos.

The Stability Challenge and Practical Constraints

Even with dense rewards, SAC wasn’t always smooth sailing. I encountered several issues that taught me about the practical side of RL:

The Buffer Size Discovery: An Accidental Experiment

Looking back at my code, I discovered an interesting inconsistency that taught me about buffer sizes the hard way:

# Fresh training: very small buffer
buffer_size=2_000,  

# Resumed training: much larger buffer  
buffer_size=50_000,

This 25x difference was completely unintentional, but it created an accidental experiment. The small buffer meant:

  • Faster iteration during initial experimentation
  • Lower memory usage on my M3 MacBook (I was easily hitting memory limits)
  • Less stable training due to limited experience replay

The M3 Memory Reality: On macOS M3, I was constantly running into memory constraints. The 2K buffer was actually a practical necessity rather than a design choice—larger buffers would cause memory issues, especially with multiple vectorized environments and image observations.

Episode Length Evolution: I started with 300-step episodes, but later extended them to 700 steps. The reason: 300 steps were too short for me to complete teleoperation demonstrations. This change also affected my buffer size calculations—longer episodes meant each episode took more buffer space, so the effective number of episodes in the buffer was smaller than I initially thought.
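A quick back-of-the-envelope check makes that concrete: buffer_size in Stable-Baselines3 counts transitions, so the small 2,000-transition buffer holds far fewer whole episodes once episodes get longer.

# How many complete episodes fit in the replay buffer (buffer_size counts transitions)
buffer_size = 2_000
for episode_len in (300, 700):
    print(f"{episode_len}-step episodes: ~{buffer_size // episode_len} episodes in buffer")
# 300-step episodes: ~6 episodes in buffer
# 700-step episodes: ~2 episodes in buffer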

Platform Dependencies: The Colab Surprise

Stable-Baselines3’s vectorized environments were genuinely useful for throughput, but there was a huge platform dependency: On macOS with MuJoCo running on CPU, vectorized environments were surprisingly fast. On Google Colab, the same setup was painfully slow—likely due to weaker single-thread CPU performance and different MuJoCo builds.

Lesson: Always benchmark on your target platform. Performance characteristics can vary dramatically.

The Real Development Workflow: Storage, Videos, and AI-Guided Decisions

The Storage Reality: I was checkpointing aggressively—saving models, replay buffers, and VecNormalize stats every few thousand steps. Between the checkpoints and training videos, I ran out of space on my MacBook’s 1TB drive multiple times. The main culprit was the replay buffers at 3-4GB each—pretty typical for SAC with image observations. The other files were small in comparison:

  • Model checkpoints (~100MB each)
  • Replay buffers (3-4GB each - the real storage killer)
  • VecNormalize stats (small)
  • Evaluation videos (small)

When you’re saving every few thousand steps and keeping multiple checkpoints for safety, those 3-4GB replay buffers add up fast.
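For reference, one full checkpoint with the standard Stable-Baselines3 calls looks roughly like this (the paths and step number are illustrative); save_replay_buffer is the call that produces the multi-gigabyte files:

# One full checkpoint with standard SB3 calls (illustrative paths)
step = 100_000
model.save(f"checkpoints/sac_{step}")                       # policy + optimizer state (~100MB)
model.save_replay_buffer(f"checkpoints/replay_{step}.pkl")  # 3-4GB with image observations
vec_env.save(f"checkpoints/vecnormalize_{step}.pkl")        # VecNormalize stats (tiny)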

Videos as the Primary Feedback Loop: The evaluation videos turned out to be absolutely critical for understanding training progress. Unlike the reward curves, videos immediately showed me:

  • Whether the robot was learning useful behaviors
  • If it was getting stuck in local minima (like endlessly circling the cube)
  • When training was diverging before the metrics made it obvious
  • The quality of grasping and placement behaviors

Divergence Patterns in Video: Videos were especially revealing when the model diverged. You could see the policy just curling up in some specific pose regardless of what was happening—the robot would move to the same configuration every time, completely ignoring the cube position or any environmental state. This was much more obvious in video than in the reward curves, which might still show some variance.

Visualizing Entropy: The videos also worked as an excellent entropy visualizer. When entropy was too high, you’d see the robot making erratic, inconsistent movements—approaching the cube from random angles, jerky motions, inconsistent grasping attempts. When entropy was too low, the robot would be overly deterministic, potentially getting stuck in the same suboptimal strategy repeatedly.

AI-Assisted Training Decisions: I regularly copied my tensorboard training stats and shared them with Claude to get guidance on whether to interrupt training and resume from an earlier checkpoint. Since I didn’t know exactly what learning rate and entropy target to set for SAC, having an AI co-pilot helped me interpret the training curves and decide when to intervene.

This back-and-forth with Claude became part of my training workflow—especially when deciding whether unstable training was worth continuing or if I should revert to a more stable checkpoint.

The Resolution Debugging Story

This was my biggest “check your assumptions” moment. I had been training with 64×48 pixel observations for speed, and the results were disappointing. Policies would approach the cube but fail at precise manipulation.

I spent weeks tuning hyperparameters, reward functions, and network architectures. The rendered videos looked fine—I could clearly see the cube and bin. What was wrong?

The breakthrough came when I examined the actual tensor inputs to the model:

# Debug: visualize the actual observation tensor the policy receives
import matplotlib.pyplot as plt

obs, _ = env.reset()  # a real observation, not a random observation_space sample
plt.imshow(obs['pixels'].transpose(1, 2, 0))  # CHW -> HWC for display
plt.title("What the model actually sees")
plt.show()

The 64×48 images were severely degraded compared to my 480×640 rendered videos. Critical details for grasping—cube edges, gripper alignment, depth perception—were lost in the downsampling.

Switching to 640×480 input resolution immediately improved performance. The lesson: Always verify your actual model inputs, not just what you think you’re feeding it. Rendered videos can be misleading about what the policy actually observes.

Behavior Cloning: The Simulation vs. Reality Gap

I also experimented with imitation learning using manually collected demonstrations (implementation in scripts/train_bc.py). The behavior cloning itself worked well—policies could replicate demonstrated behaviors with reasonable fidelity. But collecting quality demonstrations revealed a surprising insight.

Using keyboard or gamepad controls, I struggled to actually complete the task successfully. I would succeed on maybe 1 out of 5 attempts in simulation, spending most episodes fumbling with the controls and failing to grasp or place the cube properly.

The insight that changed everything: Later, when I got access to a leader arm for real-world teleoperation, the difference was stark. Tasks I could barely complete in simulation became trivial in reality—I could consistently complete the manipulation sequence on nearly every attempt.

This isn’t about speed—it’s about task completion rate. The same human (me) went from a ~20% success rate in simulation to near-100% success rate with physical teleoperation.

The Double-Edged Sword: This creates an interesting tension in imitation learning. Struggling with simulation controls means you’re training on lower-quality demonstrations, but imperfect demonstrations might actually be better for learning robust policies—perfect demonstrations can be brittle and don’t show recovery behaviors.

My simulation demos included:

  • Recovery behaviors when grasps failed
  • Multiple approach angles when obvious paths didn’t work
  • Natural variation in timing and trajectories

Future Improvements: Hybrid Demonstration Strategies

Looking ahead, I could significantly improve the setup:

IK + Noise Demonstrations: Generate synthetic demonstrations using inverse kinematics with added noise for volume and consistency, following the ALOHA approach.

Leader Arm → Simulation Bridge: Connect a physical leader arm directly to simulation—combining intuitive 6-DOF control with simulation benefits.

Hybrid Dataset Strategy:

  • IK + noise for bulk demonstrations with good coverage
  • Human “struggle” demos for recovery behaviors and natural variation
  • Leader-arm-in-sim teleoperation for high-quality expert demonstrations

Each source contributes different aspects of robust manipulation behavior.

VecNormalize: Mixed Feelings

Stable-Baselines3 offers VecNormalize to automatically normalize observations and rewards, but it couples environment statistics with your trained model. Deployment becomes more complex—you need to save normalization statistics alongside the policy.
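Concretely, deployment then means reloading both pieces together. A rough sketch with the standard SB3 API, reusing the create_single_env factory and checkpoint paths from earlier (both illustrative):

from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize, VecTransposeImage

# Load the policy AND the normalization stats it was trained with
model = SAC.load("checkpoints/sac_100000")
eval_env = VecTransposeImage(make_vec_env(create_single_env, n_envs=1, env_kwargs={"task": task}))
eval_env = VecNormalize.load("checkpoints/vecnormalize_100000.pkl", eval_env)
eval_env.training = False     # freeze the running mean/std
eval_env.norm_reward = False  # report raw rewards at evaluation time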

My current preference: Handle normalization inside the policy network rather than as environment wrappers. It’s more explicit and avoids deployment surprises.

What I’d Do Differently

Looking back on these experiments, several lessons stand out:

  1. Verify your actual model inputs early—don’t trust that preprocessing works as expected
  2. Checkpoint often—training is slow, you don’t want to lose progress, and it’s sometimes helpful to revert and change some hyperparameters (e.g. to avoid divergence)
  3. Visualize everything—videos reveal problems hours before metrics do
  4. Track your experiments and settings meticulously—there are settings scattered everywhere

The Settings Sprawl Problem: One of the biggest challenges was keeping track of all the different settings across multiple places:

  • Reward shaping: Defined in the task implementation
  • Camera resolution: Set in the environment creation
  • Environment setup: Goal vs. standard env, with or without wrappers
  • SAC training parameters: Learning rates, buffer sizes, entropy targets
  • VecNormalize settings: Observation normalization parameters

When experiments went wrong, it was often unclear which of these many settings was the culprit. Better experiment tracking from the start would have saved significant debugging time.
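One lightweight fix I’d reach for next time: collapse the scattered settings into a single config object that gets dumped next to every run. This is a generic sketch rather than something from my repo; every field name and value here is just illustrative:

import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentConfig:
    # One place for the settings that were previously scattered across files
    task: str = "cube_to_bin"
    image_size: tuple = (640, 480)
    num_envs: int = 2
    learning_rate: float = 1e-4
    buffer_size: int = 50_000
    target_entropy: float = -2.0
    norm_obs: bool = True
    reward_version: str = "dense_v3"  # hypothetical tag for the reward-shaping variant


config = ExperimentConfig()
with open("config.json", "w") as f:  # saved alongside the run's checkpoints and videos
    json.dump(asdict(config), f, indent=2)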

The Hyperparameter Uncertainty: Honestly, I didn’t know exactly what learning rates and entropy targets to set for SAC. Most of my hyperparameter choices came from a combination of:

  • Default values from Stable-Baselines3
  • Advice from Claude/ChatGPT
  • Trial and error based on video feedback
  • Academic papers (when I could find relevant ones)

This uncertainty made the video & checkpoint system even more valuable—I could experiment with different settings and revert when things went wrong.


Next up: How I integrated with LeRobot and what I learned about bridging different ML ecosystems in robotics.


Have you experienced similar challenges with RL stability or the simulation vs. real-world gap? What’s been your experience with different learning approaches?
