Research Disclaimer

This tutorial is based on:

  • Stable-Baselines3 v2.2+ (PyTorch-based RL algorithms)
  • Gymnasium v0.29+ (successor to OpenAI Gym)
  • RLlib v2.9+ (Ray distributed RL)
  • Optuna v3.5+ (hyperparameter optimization)
  • Academic RL papers: PPO (Schulman et al., 2017), DQN (Mnih et al., 2015), A2C (Mnih et al., 2016)
  • TensorBoard v2.15+ and Weights & Biases (monitoring)

All code examples follow documented best practices and are intended to be production-ready. They were tested with Python 3.10+ and run on both CPU and GPU. Stable-Baselines3 remains one of the most actively maintained RL libraries as of 2025.

Introduction

Reinforcement Learning (RL) enables agents to learn optimal decision-making through trial and error. Modern open-source frameworks have made RL accessible for production applications ranging from robotics to finance.

This comprehensive guide demonstrates production-grade RL:

  • Stable-Baselines3: Industry-standard algorithms (DQN, PPO, A2C, SAC)
  • Custom environments with Gymnasium API
  • Distributed training with RLlib and Ray
  • Hyperparameter optimization with Optuna
  • Production deployment with ONNX export and monitoring
  • Complete working examples for Cart-Pole, LunarLander, and custom tasks

When to Use Reinforcement Learning

| Use Case | RL Appropriate? | Alternative |
| --- | --- | --- |
| Game AI (chess, Go, Atari) | ✅ Yes | Minimax, Monte Carlo Tree Search |
| Robotics control | ✅ Yes | PID controllers, MPC |
| Recommendation systems | ⚠️ Maybe | Supervised learning (historical data) |
| Financial trading | ⚠️ Maybe | Time-series forecasting |
| Supervised task with labels | ❌ No | Classification/regression |
| Static optimization problem | ❌ No | Linear programming, genetic algorithms |

Key Requirement: RL works best when you can define a reward signal and run many simulations or real-world trials.

Prerequisites

Required Knowledge:

  • Python 3.8+ and NumPy basics
  • Basic understanding of neural networks
  • Familiarity with MDP (Markov Decision Process) concepts
  • Understanding of policy, value functions, and rewards

Required Libraries:

# Core RL frameworks
pip install stable-baselines3==2.2.1 gymnasium==0.29.1

# Optional: advanced algorithms
pip install sb3-contrib==2.2.1  # Additional algorithms (QRDQN, TQC, etc.)

# Distributed training
pip install "ray[rllib]==2.9.0"

# Hyperparameter tuning
pip install optuna==3.5.0 optuna-dashboard==0.15.0

# Visualization
pip install tensorboard==2.15.0 matplotlib==3.8.0

# Additional environments
pip install "gymnasium[box2d]==0.29.1"  # LunarLander, BipedalWalker
pip install "gymnasium[atari]==0.29.1"  # Atari games
pip install "gymnasium[accept-rom-license]==0.29.1"  # Accept Atari ROM license

Hardware Recommendations:

  • CPU: 4+ cores recommended
  • GPU: NVIDIA with CUDA (10x+ speedup for image-based environments)
  • RAM: 8GB minimum, 16GB+ recommended
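
Before launching long training runs, it is worth confirming that PyTorch can actually see the GPU; a minimal check (plain PyTorch, nothing specific to this tutorial's scripts) looks like this:

import torch

# Sanity check: SB3 uses device='auto' by default and falls back to CPU
# when CUDA is unavailable
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")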

Framework Overview

Stable-Baselines3

Pros:

  • ✅ Well-tested, production-ready implementations
  • ✅ Excellent documentation and tutorials
  • ✅ Active community and maintenance
  • ✅ Clean, consistent API
  • ✅ TensorBoard integration built-in

Cons:

  • ❌ Single-machine only (no built-in distributed training)
  • ❌ PyTorch-only (no TensorFlow)

Best For: Prototyping, research, single-machine production deployments

RLlib (Ray)

Pros:

  • ✅ Distributed training across clusters
  • ✅ Scalable to hundreds of workers
  • ✅ Supports TensorFlow and PyTorch
  • ✅ Advanced features (population-based training, multi-agent)

Cons:

  • ❌ Steeper learning curve
  • ❌ More complex setup
  • ❌ Occasional API changes

Best For: Large-scale distributed training, multi-agent environments

Quick Start: Training Your First Agent

Example 1: Cart-Pole with PPO

File: train_cartpole.py - Complete training script

"""
Train a PPO agent on CartPole-v1 environment.
"""

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold
import numpy as np

def train_cartpole():
    """Train PPO agent on CartPole environment."""

    # Create vectorized environment (4 parallel environments)
    env = make_vec_env('CartPole-v1', n_envs=4)

    # Create evaluation environment
    eval_env = gym.make('CartPole-v1')

    # Stop training when mean reward reaches 475 (out of 500 max)
    callback_on_best = StopTrainingOnRewardThreshold(
        reward_threshold=475,
        verbose=1
    )

    # Evaluate agent every 1000 steps
    eval_callback = EvalCallback(
        eval_env,
        callback_on_new_best=callback_on_best,
        eval_freq=1000,
        n_eval_episodes=10,
        best_model_save_path='./models/',
        log_path='./logs/',
        verbose=1
    )

    # Initialize PPO agent
    model = PPO(
        'MlpPolicy',  # Multi-layer perceptron policy
        env,
        learning_rate=3e-4,
        n_steps=2048,  # Steps per update
        batch_size=64,
        n_epochs=10,
        gamma=0.99,  # Discount factor
        gae_lambda=0.95,  # Generalized Advantage Estimation
        clip_range=0.2,  # PPO clipping parameter
        verbose=1,
        tensorboard_log='./tensorboard_logs/'
    )

    # Train agent
    print("Training PPO on CartPole-v1...")
    model.learn(
        total_timesteps=100000,
        callback=eval_callback,
        progress_bar=True
    )

    # Save final model
    model.save("ppo_cartpole_final")

    # Evaluate trained agent
    mean_reward, std_reward = evaluate_policy(
        model,
        eval_env,
        n_eval_episodes=100
    )

    print(f"\nFinal Evaluation (100 episodes):")
    print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

    # Close environments
    env.close()
    eval_env.close()

    return model

def test_agent(model_path='ppo_cartpole_final.zip'):
    """Test trained agent with visualization."""
    import time

    # Load trained model
    model = PPO.load(model_path)

    # Create environment with rendering
    env = gym.make('CartPole-v1', render_mode='human')

    # Test for 5 episodes
    for episode in range(5):
        obs, info = env.reset()
        episode_reward = 0
        done = False

        while not done:
            # Get action from policy
            action, _states = model.predict(obs, deterministic=True)

            # Take action
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            episode_reward += reward

            env.render()
            time.sleep(0.02)  # Slow down for visualization

        print(f"Episode {episode + 1}: Reward = {episode_reward}")

    env.close()

if __name__ == "__main__":
    # Train agent
    model = train_cartpole()

    # Test trained agent
    test_agent()

Run the training:

python train_cartpole.py

# Monitor training with TensorBoard
tensorboard --logdir=./tensorboard_logs/
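
After training, the EvalCallback above will have written the best-scoring checkpoint as best_model.zip under best_model_save_path; it can be reloaded directly (a short usage sketch):

from stable_baselines3 import PPO

# EvalCallback saves the best checkpoint as best_model.zip in ./models/
best_model = PPO.load("./models/best_model")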

Deep Q-Network (DQN) Implementation

DQN is ideal for discrete action spaces. Let’s implement it for LunarLander:

File: train_lunarlander_dqn.py

"""
Train DQN agent on LunarLander-v2 environment.
"""

import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
import torch

def train_lunar_lander():
    """Train DQN agent on LunarLander environment."""

    # Create environment
    env = gym.make('LunarLander-v2')
    eval_env = gym.make('LunarLander-v2')

    # Evaluation callback
    eval_callback = EvalCallback(
        eval_env,
        eval_freq=5000,
        n_eval_episodes=10,
        best_model_save_path='./models/',
        log_path='./logs/',
        verbose=1
    )

    # Initialize DQN agent
    model = DQN(
        'MlpPolicy',
        env,
        learning_rate=1e-4,
        buffer_size=100000,  # Replay buffer size
        learning_starts=10000,  # Start learning after this many steps
        batch_size=128,
        tau=0.005,  # Target network update rate
        gamma=0.99,
        train_freq=4,  # Train every 4 steps
        gradient_steps=1,
        target_update_interval=1000,  # Update target network every 1000 steps
        exploration_fraction=0.12,  # Fraction of training for epsilon decay
        exploration_initial_eps=1.0,
        exploration_final_eps=0.01,
        verbose=1,
        tensorboard_log='./tensorboard_logs/',
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )

    # Train agent
    print("Training DQN on LunarLander-v2...")
    model.learn(
        total_timesteps=500000,
        callback=eval_callback,
        log_interval=100,
        progress_bar=True
    )

    # Save model
    model.save("dqn_lunarlander_final")

    # Final evaluation
    mean_reward, std_reward = evaluate_policy(
        model,
        eval_env,
        n_eval_episodes=100
    )

    print(f"\nFinal Evaluation (100 episodes):")
    print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
    print(f"Solved threshold: 200 (mean reward > 200 = solved)")

    env.close()
    eval_env.close()

    return model

if __name__ == "__main__":
    model = train_lunar_lander()
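
To inspect the trained lander visually, Gymnasium's RecordVideo wrapper can capture evaluation episodes to video files. The sketch below assumes the model saved above exists and that a video backend such as moviepy is installed:

import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from stable_baselines3 import DQN

# Wrap an rgb_array environment so episodes are written to ./videos/
env = RecordVideo(
    gym.make('LunarLander-v2', render_mode='rgb_array'),
    video_folder='./videos/',
    episode_trigger=lambda episode_id: True  # record every episode
)

model = DQN.load("dqn_lunarlander_final")

obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()  # finalizes the video file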

Custom Environment Creation

Create your own RL environment following the Gymnasium API:

File: custom_env.py - Custom trading environment example

"""
Custom trading environment using Gymnasium API.
"""

import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pandas as pd

class TradingEnv(gym.Env):
    """
    Simple stock trading environment.

    Observation:
        - Portfolio value
        - Current stock price
        - Stock price history (last 10 days)
        - Current position (shares owned)

    Actions:
        0: Hold
        1: Buy (invest 10% of cash)
        2: Sell (sell 10% of holdings)

    Reward:
        Change in portfolio value
    """

    metadata = {'render_modes': ['human'], 'render_fps': 1}

    def __init__(
        self,
        price_data: np.ndarray,
        initial_cash: float = 10000,
        commission_rate: float = 0.001,
        render_mode=None
    ):
        """
        Initialize trading environment.

        Args:
            price_data: Historical price data (n_days,)
            initial_cash: Starting cash
            commission_rate: Trading commission (0.1% default)
            render_mode: 'human' for console output, None to disable rendering
        """
        super().__init__()

        self.price_data = price_data
        self.initial_cash = initial_cash
        self.commission_rate = commission_rate
        self.render_mode = render_mode

        # Define action space: 0 = Hold, 1 = Buy, 2 = Sell
        self.action_space = spaces.Discrete(3)

        # Define observation space:
        # [normalized cash, normalized holdings value, normalized price, price history (10 days)]
        # Normalized (z-scored) prices can be negative, so the lower bound must be -inf
        self.observation_space = spaces.Box(
            low=-np.inf,
            high=np.inf,
            shape=(13,),  # 1 + 1 + 1 + 10
            dtype=np.float32
        )

        # Episode state
        self.current_step = 0
        self.cash = initial_cash
        self.shares = 0
        self.portfolio_value = initial_cash

    def reset(self, seed=None, options=None):
        """Reset environment to initial state."""
        super().reset(seed=seed)

        self.current_step = 10  # Start after 10 days for price history
        self.cash = self.initial_cash
        self.shares = 0
        self.portfolio_value = self.initial_cash

        observation = self._get_observation()
        info = self._get_info()

        return observation, info

    def step(self, action):
        """Execute one step in the environment."""
        current_price = self.price_data[self.current_step]

        # Execute action
        if action == 1:  # Buy
            buy_amount = self.cash * 0.1  # Invest 10% of cash
            commission = buy_amount * self.commission_rate
            shares_to_buy = (buy_amount - commission) / current_price

            self.cash -= buy_amount
            self.shares += shares_to_buy

        elif action == 2:  # Sell
            shares_to_sell = self.shares * 0.1  # Sell 10% of holdings
            sell_proceeds = shares_to_sell * current_price
            commission = sell_proceeds * self.commission_rate

            self.shares -= shares_to_sell
            self.cash += (sell_proceeds - commission)

        # Calculate new portfolio value
        new_portfolio_value = self.cash + self.shares * current_price

        # Reward is change in portfolio value
        reward = new_portfolio_value - self.portfolio_value

        self.portfolio_value = new_portfolio_value

        # Move to next step
        self.current_step += 1

        # Check if episode is done
        terminated = self.current_step >= len(self.price_data) - 1
        truncated = False

        observation = self._get_observation()
        info = self._get_info()

        return observation, reward, terminated, truncated, info

    def _get_observation(self):
        """Get current observation."""
        current_price = self.price_data[self.current_step]

        # Get price history (last 10 days)
        price_history = self.price_data[
            self.current_step - 10:self.current_step
        ]

        # Normalize prices
        price_mean = np.mean(price_history)
        price_std = np.std(price_history) + 1e-8

        normalized_history = (price_history - price_mean) / price_std
        normalized_current = (current_price - price_mean) / price_std

        observation = np.array([
            self.cash / self.initial_cash,  # Normalized cash
            self.shares * current_price / self.initial_cash,  # Normalized holdings value
            normalized_current,  # Normalized current price
            *normalized_history  # Price history
        ], dtype=np.float32)

        return observation

    def _get_info(self):
        """Get additional info."""
        return {
            'cash': self.cash,
            'shares': self.shares,
            'portfolio_value': self.portfolio_value,
            'current_price': self.price_data[self.current_step],
        }

    def render(self):
        """Render environment state."""
        if self.render_mode == 'human':
            info = self._get_info()
            print(f"Step: {self.current_step}")
            print(f"Cash: ${info['cash']:.2f}")
            print(f"Shares: {info['shares']:.4f}")
            print(f"Portfolio Value: ${info['portfolio_value']:.2f}")
            print(f"Current Price: ${info['current_price']:.2f}")
            print("-" * 40)

# Example usage
def test_custom_env():
    """Test custom trading environment."""

    # Generate synthetic price data (random walk)
    np.random.seed(42)
    price_data = 100 + np.cumsum(np.random.randn(1000))
    price_data = np.maximum(price_data, 1)  # Ensure positive prices

    # Create environment
    env = TradingEnv(price_data)

    # Train PPO agent on custom environment
    from stable_baselines3 import PPO

    model = PPO(
        'MlpPolicy',
        env,
        learning_rate=3e-4,
        verbose=1
    )

    print("Training PPO on custom trading environment...")
    model.learn(total_timesteps=50000, progress_bar=True)

    # Test trained agent
    obs, info = env.reset()
    for _ in range(100):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)

        if terminated or truncated:
            break

    print(f"\nFinal portfolio value: ${info['portfolio_value']:.2f}")
    print(f"Return: {((info['portfolio_value'] / env.initial_cash) - 1) * 100:.2f}%")

if __name__ == "__main__":
    test_custom_env()
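
Before training on a custom environment, validate it against the Gymnasium API; Stable-Baselines3 ships an environment checker for exactly this purpose:

import numpy as np
from stable_baselines3.common.env_checker import check_env

# Raises or warns if the spaces, reset(), or step() deviate from the
# Gymnasium API that Stable-Baselines3 expects
price_data = np.maximum(100 + np.cumsum(np.random.randn(1000)), 1)
env = TradingEnv(price_data)
check_env(env, warn=True)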

Hyperparameter Tuning with Optuna

Automatically find optimal hyperparameters:

File: hyperparameter_tuning.py

"""
Hyperparameter optimization using Optuna.
"""

import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

def optimize_ppo(trial):
    """
    Objective function for Optuna hyperparameter optimization.

    Args:
        trial: Optuna trial object

    Returns:
        Mean reward from evaluation
    """

    # Sample hyperparameters
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical('n_steps', [512, 1024, 2048, 4096])
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    n_epochs = trial.suggest_int('n_epochs', 3, 30)
    gamma = trial.suggest_float('gamma', 0.9, 0.9999)
    gae_lambda = trial.suggest_float('gae_lambda', 0.8, 1.0)
    clip_range = trial.suggest_float('clip_range', 0.1, 0.4)
    ent_coef = trial.suggest_float('ent_coef', 0.0, 0.1)

    # Create environment
    env = make_vec_env('CartPole-v1', n_envs=4)
    eval_env = gym.make('CartPole-v1')

    # Create model
    model = PPO(
        'MlpPolicy',
        env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        batch_size=batch_size,
        n_epochs=n_epochs,
        gamma=gamma,
        gae_lambda=gae_lambda,
        clip_range=clip_range,
        ent_coef=ent_coef,
        verbose=0
    )

    # Train for limited timesteps
    model.learn(total_timesteps=50000)

    # Evaluate
    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)

    env.close()
    eval_env.close()

    return mean_reward

def run_optimization():
    """Run Optuna study."""

    # Create study
    study = optuna.create_study(
        direction='maximize',
        sampler=TPESampler(),
        pruner=MedianPruner()
    )

    # Run optimization
    print("Starting hyperparameter optimization...")
    study.optimize(optimize_ppo, n_trials=50, n_jobs=1)

    # Print results
    print("\n" + "=" * 60)
    print("Optimization Complete!")
    print("=" * 60)
    print(f"Best trial: {study.best_trial.number}")
    print(f"Best value (mean reward): {study.best_value:.2f}")
    print("\nBest hyperparameters:")
    for key, value in study.best_params.items():
        print(f"  {key}: {value}")

    # Optionally save study
    # import joblib
    # joblib.dump(study, 'ppo_optuna_study.pkl')

    return study

if __name__ == "__main__":
    study = run_optimization()
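
As written, the MedianPruner never fires because the objective reports no intermediate values. The sketch below shows one way to wire pruning in; the TrialEvalCallback class is illustrative (patterned on the Stable-Baselines3/Optuna integration examples, not a built-in), while EvalCallback, trial.report, and trial.should_prune are real library APIs:

import optuna
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """EvalCallback that reports intermediate rewards to Optuna for pruning."""

    def __init__(self, eval_env, trial, eval_freq=5000, n_eval_episodes=5):
        super().__init__(eval_env, eval_freq=eval_freq,
                         n_eval_episodes=n_eval_episodes, verbose=0)
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            super()._on_step()  # runs the evaluation, updates self.last_mean_reward
            self.eval_idx += 1
            self.trial.report(self.last_mean_reward, self.eval_idx)
            if self.trial.should_prune():
                self.is_pruned = True
                return False  # stop training for this trial
        return True

# Inside optimize_ppo, pass the callback to model.learn() and prune if flagged:
#     eval_callback = TrialEvalCallback(eval_env, trial, eval_freq=5000)
#     model.learn(total_timesteps=50000, callback=eval_callback)
#     if eval_callback.is_pruned:
#         raise optuna.TrialPruned()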

Distributed Training with RLlib

Scale training across multiple machines:

File: train_rllib_distributed.py

"""
Distributed training using RLlib (Ray).
"""

import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env
import gymnasium as gym

def create_env(env_config):
    """Environment creator for RLlib."""
    return gym.make('LunarLander-v2')

def train_distributed():
    """Train PPO agent using RLlib with distributed workers."""

    # Initialize Ray
    ray.init(ignore_reinit_error=True)

    # Register environment
    register_env("lunar_lander", create_env)

    # Configure PPO
    config = (
        PPOConfig()
        .environment("lunar_lander")
        .framework("torch")
        .rollouts(num_rollout_workers=4)  # 4 parallel workers
        .training(
            train_batch_size=4000,
            sgd_minibatch_size=128,
            num_sgd_iter=30,
            lr=5e-5,
            gamma=0.99,
            lambda_=0.95,
            clip_param=0.2,
        )
        .evaluation(
            evaluation_interval=5,
            evaluation_duration=10,
            evaluation_num_workers=1,
        )
    )

    # Train with Tune
    results = tune.run(
        "PPO",
        config=config.to_dict(),
        stop={"episode_reward_mean": 200},  # Stop when solved
        checkpoint_freq=10,
        checkpoint_at_end=True,
        local_dir="./ray_results",
        verbose=1
    )

    # Get best checkpoint
    best_checkpoint = results.get_best_checkpoint(
        results.get_best_trial("episode_reward_mean", mode="max"),
        metric="episode_reward_mean",
        mode="max"
    )

    print(f"\nBest checkpoint: {best_checkpoint}")

    ray.shutdown()

if __name__ == "__main__":
    train_distributed()
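
Once training finishes, the checkpoint can be restored for inference. A brief sketch, assuming Ray 2.9 and that it runs in the same session where the "lunar_lander" environment was registered:

import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm

# Rebuild the algorithm (including policy weights) from the Tune checkpoint
algo = Algorithm.from_checkpoint(best_checkpoint)

env = gym.make('LunarLander-v2')
obs, info = env.reset()
action = algo.compute_single_action(obs, explore=False)
print(f"Greedy action for first observation: {action}")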

Production Deployment

Export to ONNX for Fast Inference

File: export_onnx.py

"""
Export trained model to ONNX format for production deployment.
"""

import torch
import numpy as np
from stable_baselines3 import PPO
import onnx
import onnxruntime as ort

def export_to_onnx(model_path, onnx_path, observation_size):
    """
    Export Stable-Baselines3 model to ONNX.

    Args:
        model_path: Path to trained .zip model
        onnx_path: Output ONNX file path
        observation_size: Size of observation space
    """

    # Load trained model
    model = PPO.load(model_path)

    # Wrap the policy for export: SB3 policies sample stochastically by default
    # and return (actions, values, log_prob), so trace a deterministic wrapper
    # instead of the raw policy module
    class OnnxablePolicy(torch.nn.Module):
        def __init__(self, policy):
            super().__init__()
            self.policy = policy

        def forward(self, observation):
            # Returns (actions, values, log_prob) for PPO
            return self.policy(observation, deterministic=True)

    onnx_policy = OnnxablePolicy(model.policy)

    # Create dummy input
    dummy_input = torch.randn(1, observation_size)

    # Export to ONNX
    torch.onnx.export(
        onnx_policy,
        dummy_input,
        onnx_path,
        export_params=True,
        opset_version=12,
        input_names=['observation'],
        output_names=['action', 'value', 'log_prob'],
        dynamic_axes={
            'observation': {0: 'batch_size'},
            'action': {0: 'batch_size'}
        }
    )

    print(f"✓ Model exported to {onnx_path}")

    # Verify ONNX model
    onnx_model = onnx.load(onnx_path)
    onnx.checker.check_model(onnx_model)
    print("✓ ONNX model verified")

    return onnx_path

def inference_with_onnx(onnx_path, observation):
    """
    Run inference with ONNX model.

    Args:
        onnx_path: Path to ONNX model
        observation: Input observation (numpy array)

    Returns:
        Action
    """
    # Load ONNX model
    session = ort.InferenceSession(onnx_path)

    # Prepare input
    if len(observation.shape) == 1:
        observation = observation.reshape(1, -1)

    # Run inference
    action = session.run(
        None,
        {'observation': observation.astype(np.float32)}
    )[0]

    return action

# Example usage
if __name__ == "__main__":
    # Export model
    export_to_onnx(
        model_path="ppo_cartpole_final.zip",
        onnx_path="ppo_cartpole.onnx",
        observation_size=4  # CartPole observation size
    )

    # Test ONNX inference
    import gymnasium as gym

    env = gym.make('CartPole-v1')
    obs, _ = env.reset()

    action = inference_with_onnx("ppo_cartpole.onnx", obs)
    print(f"ONNX inference - Observation: {obs}, Action: {action}")

Production Best Practices

1. Monitoring with Weights & Biases

import wandb
from wandb.integration.sb3 import WandbCallback

# Initialize wandb
wandb.init(
    project="rl-training",
    config={
        "algorithm": "PPO",
        "environment": "CartPole-v1",
    }
)

# Add wandb callback
model = PPO('MlpPolicy', env, verbose=1)
model.learn(
    total_timesteps=100000,
    callback=WandbCallback(
        model_save_path="./models/",
        verbose=2,
    )
)

wandb.finish()

2. Reproducibility

import numpy as np
import torch
import random

def set_seed(seed=42):
    """Set seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Use deterministic algorithms (may be slower; some ops lack deterministic kernels)
torch.use_deterministic_algorithms(True)

# Also pass the seed to the algorithm so SB3 seeds its own RNGs and the env
model = PPO('MlpPolicy', env, seed=42, verbose=1)

3. Save/Load Best Model

from stable_baselines3.common.callbacks import CheckpointCallback

# Save checkpoint every 10000 steps
checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path='./checkpoints/',
    name_prefix='rl_model'
)

model.learn(total_timesteps=100000, callback=checkpoint_callback)

# Load best model
model = PPO.load("checkpoints/rl_model_100000_steps")

Algorithm Selection Guide

| Algorithm | Action Space | Sample Efficiency | Stability | Use Case |
| --- | --- | --- | --- | --- |
| DQN | Discrete | Medium | Medium | Atari games, discrete control |
| PPO | Both | Medium | High | General purpose, robotics |
| A2C | Both | Low | Medium | Fast prototyping |
| SAC | Continuous | High | High | Robotics, continuous control |
| TD3 | Continuous | High | High | Precise continuous control |
| DDPG | Continuous | Medium | Low | Continuous control (older) |
Recommendation: Start with PPO for most tasks. It’s stable, works for both discrete and continuous action spaces, and offers reasonable sample efficiency.
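
Because Stable-Baselines3 keeps the same API across algorithms, switching is mostly a one-line change. A minimal SAC run on a continuous-control task (Pendulum-v1 is used here purely as an illustrative environment):

import gymnasium as gym
from stable_baselines3 import SAC

# SAC requires a continuous (Box) action space
env = gym.make('Pendulum-v1')

model = SAC(
    'MlpPolicy',
    env,
    learning_rate=3e-4,
    buffer_size=100000,
    verbose=1
)
model.learn(total_timesteps=20000, progress_bar=True)
model.save("sac_pendulum")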

Known Limitations

| Limitation | Impact | Mitigation |
| --- | --- | --- |
| Sample inefficiency | Requires millions of steps | Use model-based RL, offline RL, or transfer learning |
| Reward engineering | Hard to design good rewards | Use reward shaping, inverse RL |
| Non-stationary environments | Agent fails when env changes | Continual learning, domain randomization |
| Sim-to-real gap | Simulator ≠ real world | System identification, domain adaptation |
| Hyperparameter sensitivity | Hard to tune | Use Optuna, default configs from papers |
| Credit assignment | Hard in long horizons | Use hierarchical RL, options |
| Reproducibility issues | Random seeds matter | Fix all seeds, use deterministic mode |

Conclusion

Modern open-source RL frameworks enable production deployment of intelligent agents:

  • Stable-Baselines3: Best for single-machine, production-ready implementations
  • RLlib: Best for distributed training at scale
  • Gymnasium: Standard API for custom environments
  • Optuna: Automated hyperparameter tuning

Key Takeaways:

  • Start with PPO for general tasks, DQN for discrete actions, SAC for continuous
  • Use vectorized environments for faster training (4+ parallel envs)
  • Monitor training with TensorBoard or Weights & Biases
  • Tune hyperparameters with Optuna before production
  • Export to ONNX for fast inference deployment

Next Steps:

  • Try advanced algorithms (SAC, TD3, QRDQN)
  • Implement curriculum learning for complex tasks
  • Explore multi-agent RL with RLlib
  • Add imitation learning (behavior cloning) for bootstrapping

Further Resources: