Research Disclaimer
This tutorial is based on:
- Stable-Baselines3 v2.2+ (PyTorch-based RL algorithms)
- Gymnasium v0.29+ (successor to OpenAI Gym)
- RLlib v2.9+ (Ray distributed RL)
- Optuna v3.5+ (hyperparameter optimization)
- Academic RL papers: PPO (Schulman et al., 2017), DQN (Mnih et al., 2015), A2C (Mnih et al., 2016)
- TensorBoard v2.15+ and Weights & Biases (monitoring)
The code examples aim to follow documented best practices for these libraries. They are written for Python 3.10+ and run on both CPU and GPU. Stable-Baselines3 remains among the most actively maintained RL libraries as of 2025.
Introduction
Reinforcement Learning (RL) enables agents to learn optimal decision-making through trial and error. Modern open-source frameworks have made RL accessible for production applications ranging from robotics to finance.
This comprehensive guide demonstrates production-grade RL:
- Stable-Baselines3: Industry-standard algorithms (DQN, PPO, A2C, SAC)
- Custom environments with Gymnasium API
- Distributed training with RLlib and Ray
- Hyperparameter optimization with Optuna
- Production deployment with ONNX export and monitoring
- Complete working examples for Cart-Pole, LunarLander, and custom tasks
When to Use Reinforcement Learning
| Use Case | RL Appropriate? | Alternative |
|---|---|---|
| Game AI (chess, Go, Atari) | ✅ Yes | Minimax, Monte Carlo Tree Search |
| Robotics control | ✅ Yes | PID controllers, MPC |
| Recommendation systems | ⚠️ Maybe | Supervised learning (historical data) |
| Financial trading | ⚠️ Maybe | Time-series forecasting |
| Supervised task with labels | ❌ No | Classification/regression |
| Static optimization problem | ❌ No | Linear programming, genetic algorithms |
Key Requirement: RL works best when you can define a reward signal and run many simulations or real-world trials.
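Concretely, the reward signal is just the scalar returned by each environment step. The sketch below shows the full agent-environment loop, with a random policy standing in for the agent and CartPole-v1 used only as a placeholder:
import gymnasium as gym

env = gym.make('CartPole-v1')
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # random policy stands in for the agent
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # the scalar reward signal that RL maximizes
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print(f"Return of a random policy: {total_reward}")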
Prerequisites
Required Knowledge:
- Python 3.8+ and NumPy basics
- Basic understanding of neural networks
- Familiarity with MDP (Markov Decision Process) concepts
- Understanding of policy, value functions, and rewards
Required Libraries:
# Core RL frameworks
pip install stable-baselines3==2.2.1 gymnasium==0.29.1
# Optional: advanced algorithms
pip install sb3-contrib==2.2.1 # Additional algorithms (QRDQN, TQC, etc.)
# Distributed training
pip install "ray[rllib]==2.9.0"
# Hyperparameter tuning
pip install optuna==3.5.0 optuna-dashboard==0.15.0
# Visualization
pip install tensorboard==2.15.0 matplotlib==3.8.0
# Additional environments
pip install "gymnasium[box2d]==0.29.1" # LunarLander, BipedalWalker
pip install "gymnasium[atari]==0.29.1" # Atari games
pip install "gymnasium[accept-rom-license]==0.29.1" # Accept Atari ROM license
Hardware Recommendations:
- CPU: 4+ cores recommended
- GPU: NVIDIA with CUDA (10x+ speedup for image-based environments)
- RAM: 8GB minimum, 16GB+ recommended
Framework Overview
Stable-Baselines3 (Recommended for Most Use Cases)
Pros:
- ✅ Well-tested, production-ready implementations
- ✅ Excellent documentation and tutorials
- ✅ Active community and maintenance
- ✅ Clean, consistent API
- ✅ TensorBoard integration built-in
Cons:
- ❌ Single-machine only (no built-in distributed training)
- ❌ PyTorch-only (no TensorFlow)
Best For: Prototyping, research, single-machine production deployments
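To see what the consistent API means in practice: switching algorithms usually amounts to changing the class name plus algorithm-specific hyperparameters. A minimal sketch, using SAC on Pendulum-v1 because SAC requires a continuous action space:
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make('Pendulum-v1')  # continuous action space
model = SAC('MlpPolicy', env, verbose=0)  # same constructor/learn/save API as PPO, A2C, DQN
model.learn(total_timesteps=10_000)
model.save('sac_pendulum')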
RLlib (Ray)
Pros:
- ✅ Distributed training across clusters
- ✅ Scalable to hundreds of workers
- ✅ Supports TensorFlow and PyTorch
- ✅ Advanced features (population-based training, multi-agent)
Cons:
- ❌ Steeper learning curve
- ❌ More complex setup
- ❌ Occasional API changes
Best For: Large-scale distributed training, multi-agent environments
Quick Start: Training Your First Agent
Example 1: Cart-Pole with PPO
File: train_cartpole.py - Complete training script
"""
Train a PPO agent on CartPole-v1 environment.
"""
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold
import numpy as np
def train_cartpole():
"""Train PPO agent on CartPole environment."""
# Create vectorized environment (4 parallel environments)
env = make_vec_env('CartPole-v1', n_envs=4)
# Create evaluation environment
eval_env = gym.make('CartPole-v1')
# Stop training when mean reward reaches 475 (out of 500 max)
callback_on_best = StopTrainingOnRewardThreshold(
reward_threshold=475,
verbose=1
)
# Evaluate agent every 1000 steps
eval_callback = EvalCallback(
eval_env,
callback_on_new_best=callback_on_best,
eval_freq=1000,
n_eval_episodes=10,
best_model_save_path='./models/',
log_path='./logs/',
verbose=1
)
# Initialize PPO agent
model = PPO(
'MlpPolicy', # Multi-layer perceptron policy
env,
learning_rate=3e-4,
n_steps=2048, # Steps per update
batch_size=64,
n_epochs=10,
gamma=0.99, # Discount factor
gae_lambda=0.95, # Generalized Advantage Estimation
clip_range=0.2, # PPO clipping parameter
verbose=1,
tensorboard_log='./tensorboard_logs/'
)
# Train agent
print("Training PPO on CartPole-v1...")
model.learn(
total_timesteps=100000,
callback=eval_callback,
progress_bar=True
)
# Save final model
model.save("ppo_cartpole_final")
# Evaluate trained agent
mean_reward, std_reward = evaluate_policy(
model,
eval_env,
n_eval_episodes=100
)
print(f"\nFinal Evaluation (100 episodes):")
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
# Close environments
env.close()
eval_env.close()
return model
def test_agent(model_path='ppo_cartpole_final.zip'):
"""Test trained agent with visualization."""
import time
# Load trained model
model = PPO.load(model_path)
# Create environment with rendering
env = gym.make('CartPole-v1', render_mode='human')
# Test for 5 episodes
for episode in range(5):
obs, info = env.reset()
episode_reward = 0
done = False
while not done:
# Get action from policy
action, _states = model.predict(obs, deterministic=True)
# Take action
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
episode_reward += reward
env.render()
time.sleep(0.02) # Slow down for visualization
print(f"Episode {episode + 1}: Reward = {episode_reward}")
env.close()
if __name__ == "__main__":
# Train agent
model = train_cartpole()
# Test trained agent
test_agent()
Run the training:
python train_cartpole.py
# Monitor training with TensorBoard
tensorboard --logdir=./tensorboard_logs/
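If training is interrupted before the reward threshold is reached, you can resume from the saved model instead of starting over. A minimal sketch, reusing the file names from the script above:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env('CartPole-v1', n_envs=4)
# Reattach an environment to the saved model and continue training
model = PPO.load("ppo_cartpole_final", env=env)
model.learn(
    total_timesteps=50_000,
    reset_num_timesteps=False,  # keep the original timestep counter in the logs
    progress_bar=True
)
model.save("ppo_cartpole_continued")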
Deep Q-Network (DQN) Implementation
DQN applies only to discrete action spaces, which makes it a natural fit here. Let’s implement it for LunarLander:
File: train_lunarlander_dqn.py
"""
Train DQN agent on LunarLander-v2 environment.
"""
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
import torch
def train_lunar_lander():
"""Train DQN agent on LunarLander environment."""
# Create environment
env = gym.make('LunarLander-v2')
eval_env = gym.make('LunarLander-v2')
# Evaluation callback
eval_callback = EvalCallback(
eval_env,
eval_freq=5000,
n_eval_episodes=10,
best_model_save_path='./models/',
log_path='./logs/',
verbose=1
)
# Initialize DQN agent
model = DQN(
'MlpPolicy',
env,
learning_rate=1e-4,
buffer_size=100000, # Replay buffer size
learning_starts=10000, # Start learning after this many steps
batch_size=128,
tau=0.005, # Target network update rate
gamma=0.99,
train_freq=4, # Train every 4 steps
gradient_steps=1,
target_update_interval=1000, # Update target network every 1000 steps
exploration_fraction=0.12, # Fraction of training for epsilon decay
exploration_initial_eps=1.0,
exploration_final_eps=0.01,
verbose=1,
tensorboard_log='./tensorboard_logs/',
device='cuda' if torch.cuda.is_available() else 'cpu'
)
# Train agent
print("Training DQN on LunarLander-v2...")
model.learn(
total_timesteps=500000,
callback=eval_callback,
log_interval=100,
progress_bar=True
)
# Save model
model.save("dqn_lunarlander_final")
# Final evaluation
mean_reward, std_reward = evaluate_policy(
model,
eval_env,
n_eval_episodes=100
)
print(f"\nFinal Evaluation (100 episodes):")
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"Solved threshold: 200 (mean reward > 200 = solved)")
env.close()
eval_env.close()
return model
if __name__ == "__main__":
model = train_lunar_lander()
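To inspect the learned behavior without a live window, you can record evaluation episodes to video with Gymnasium's RecordVideo wrapper. A sketch, assuming moviepy is installed and reusing the file names from the script above:
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from stable_baselines3 import DQN

model = DQN.load("dqn_lunarlander_final")
env = RecordVideo(
    gym.make('LunarLander-v2', render_mode='rgb_array'),
    video_folder='./videos/',
    episode_trigger=lambda episode_id: True  # record every evaluation episode
)
obs, info = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()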
Custom Environment Creation
Create your own RL environment following the Gymnasium API:
File: custom_env.py - Custom trading environment example
"""
Custom trading environment using Gymnasium API.
"""
import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pandas as pd
class TradingEnv(gym.Env):
"""
Simple stock trading environment.
Observation:
- Portfolio value
- Current stock price
- Stock price history (last 10 days)
- Current position (shares owned)
Actions:
0: Hold
1: Buy (invest 10% of cash)
2: Sell (sell 10% of holdings)
Reward:
Change in portfolio value
"""
metadata = {'render_modes': ['human'], 'render_fps': 1}
    def __init__(
        self,
        price_data: np.ndarray,
        initial_cash: float = 10000,
        commission_rate: float = 0.001,
        render_mode=None
    ):
        """
        Initialize trading environment.
        Args:
            price_data: Historical price data (n_days,)
            initial_cash: Starting cash
            commission_rate: Trading commission (0.1% default)
            render_mode: 'human' to print state on render() calls, or None
        """
        super().__init__()
        self.price_data = price_data
        self.initial_cash = initial_cash
        self.commission_rate = commission_rate
        self.render_mode = render_mode
# Define action space: 0 = Hold, 1 = Buy, 2 = Sell
self.action_space = spaces.Discrete(3)
        # Define observation space:
        # [normalized cash, holdings value, current price, price history (10 days)]
        # Normalized values can be negative, so the bounds must be unbounded
        self.observation_space = spaces.Box(
            low=-np.inf,
            high=np.inf,
            shape=(13,),  # 1 + 1 + 1 + 10
            dtype=np.float32
        )
# Episode state
self.current_step = 0
self.cash = initial_cash
self.shares = 0
self.portfolio_value = initial_cash
def reset(self, seed=None, options=None):
"""Reset environment to initial state."""
super().reset(seed=seed)
self.current_step = 10 # Start after 10 days for price history
self.cash = self.initial_cash
self.shares = 0
self.portfolio_value = self.initial_cash
observation = self._get_observation()
info = self._get_info()
return observation, info
def step(self, action):
"""Execute one step in the environment."""
current_price = self.price_data[self.current_step]
# Execute action
if action == 1: # Buy
buy_amount = self.cash * 0.1 # Invest 10% of cash
commission = buy_amount * self.commission_rate
shares_to_buy = (buy_amount - commission) / current_price
self.cash -= buy_amount
self.shares += shares_to_buy
elif action == 2: # Sell
shares_to_sell = self.shares * 0.1 # Sell 10% of holdings
sell_proceeds = shares_to_sell * current_price
commission = sell_proceeds * self.commission_rate
self.shares -= shares_to_sell
self.cash += (sell_proceeds - commission)
# Calculate new portfolio value
new_portfolio_value = self.cash + self.shares * current_price
# Reward is change in portfolio value
reward = new_portfolio_value - self.portfolio_value
self.portfolio_value = new_portfolio_value
# Move to next step
self.current_step += 1
# Check if episode is done
terminated = self.current_step >= len(self.price_data) - 1
truncated = False
observation = self._get_observation()
info = self._get_info()
return observation, reward, terminated, truncated, info
def _get_observation(self):
"""Get current observation."""
current_price = self.price_data[self.current_step]
# Get price history (last 10 days)
price_history = self.price_data[
self.current_step - 10:self.current_step
]
# Normalize prices
price_mean = np.mean(price_history)
price_std = np.std(price_history) + 1e-8
normalized_history = (price_history - price_mean) / price_std
normalized_current = (current_price - price_mean) / price_std
observation = np.array([
self.cash / self.initial_cash, # Normalized cash
self.shares * current_price / self.initial_cash, # Normalized holdings value
normalized_current, # Normalized current price
*normalized_history # Price history
], dtype=np.float32)
return observation
def _get_info(self):
"""Get additional info."""
return {
'cash': self.cash,
'shares': self.shares,
'portfolio_value': self.portfolio_value,
'current_price': self.price_data[self.current_step],
}
def render(self):
"""Render environment state."""
if self.render_mode == 'human':
info = self._get_info()
print(f"Step: {self.current_step}")
print(f"Cash: ${info['cash']:.2f}")
print(f"Shares: {info['shares']:.4f}")
print(f"Portfolio Value: ${info['portfolio_value']:.2f}")
print(f"Current Price: ${info['current_price']:.2f}")
print("-" * 40)
# Example usage
def test_custom_env():
"""Test custom trading environment."""
# Generate synthetic price data (random walk)
np.random.seed(42)
price_data = 100 + np.cumsum(np.random.randn(1000))
price_data = np.maximum(price_data, 1) # Ensure positive prices
# Create environment
env = TradingEnv(price_data)
# Train PPO agent on custom environment
from stable_baselines3 import PPO
model = PPO(
'MlpPolicy',
env,
learning_rate=3e-4,
verbose=1
)
print("Training PPO on custom trading environment...")
model.learn(total_timesteps=50000, progress_bar=True)
# Test trained agent
obs, info = env.reset()
for _ in range(100):
action, _ = model.predict(obs, deterministic=True)
obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
break
print(f"\nFinal portfolio value: ${info['portfolio_value']:.2f}")
print(f"Return: {((info['portfolio_value'] / env.initial_cash) - 1) * 100:.2f}%")
if __name__ == "__main__":
test_custom_env()
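Before training on a new environment, it is worth validating it against the Gymnasium API with Stable-Baselines3's built-in checker. A short sketch, run alongside the TradingEnv definition above:
import numpy as np
from stable_baselines3.common.env_checker import check_env

np.random.seed(42)
price_data = np.maximum(100 + np.cumsum(np.random.randn(1000)), 1)
env = TradingEnv(price_data)
check_env(env, warn=True)  # warns or raises if spaces, reset() or step() violate the API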
Hyperparameter Tuning with Optuna
Automatically find optimal hyperparameters:
File: hyperparameter_tuning.py
"""
Hyperparameter optimization using Optuna.
"""
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
def optimize_ppo(trial):
"""
Objective function for Optuna hyperparameter optimization.
Args:
trial: Optuna trial object
Returns:
Mean reward from evaluation
"""
# Sample hyperparameters
learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
n_steps = trial.suggest_categorical('n_steps', [512, 1024, 2048, 4096])
batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
n_epochs = trial.suggest_int('n_epochs', 3, 30)
gamma = trial.suggest_float('gamma', 0.9, 0.9999)
gae_lambda = trial.suggest_float('gae_lambda', 0.8, 1.0)
clip_range = trial.suggest_float('clip_range', 0.1, 0.4)
ent_coef = trial.suggest_float('ent_coef', 0.0, 0.1)
# Create environment
env = make_vec_env('CartPole-v1', n_envs=4)
eval_env = gym.make('CartPole-v1')
# Create model
model = PPO(
'MlpPolicy',
env,
learning_rate=learning_rate,
n_steps=n_steps,
batch_size=batch_size,
n_epochs=n_epochs,
gamma=gamma,
gae_lambda=gae_lambda,
clip_range=clip_range,
ent_coef=ent_coef,
verbose=0
)
# Train for limited timesteps
model.learn(total_timesteps=50000)
# Evaluate
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
env.close()
eval_env.close()
return mean_reward
def run_optimization():
"""Run Optuna study."""
# Create study
study = optuna.create_study(
direction='maximize',
sampler=TPESampler(),
        pruner=MedianPruner()  # only prunes trials that report intermediate values (see the callback sketch below)
)
# Run optimization
print("Starting hyperparameter optimization...")
study.optimize(optimize_ppo, n_trials=50, n_jobs=1)
# Print results
print("\n" + "=" * 60)
print("Optimization Complete!")
print("=" * 60)
print(f"Best trial: {study.best_trial.number}")
print(f"Best value (mean reward): {study.best_value:.2f}")
print("\nBest hyperparameters:")
for key, value in study.best_params.items():
print(f" {key}: {value}")
# Optionally save study
# import joblib
# joblib.dump(study, 'ppo_optuna_study.pkl')
return study
if __name__ == "__main__":
study = run_optimization()
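Note that the MedianPruner only takes effect when trials report intermediate results. One way to do that is a small Stable-Baselines3 callback that periodically evaluates the model and reports the reward to the trial. The sketch below (the class name TrialPruningCallback and its defaults are my own) shows the pattern:
import optuna
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.evaluation import evaluate_policy

class TrialPruningCallback(BaseCallback):
    """Report intermediate rewards to Optuna and stop unpromising trials early."""
    def __init__(self, trial, eval_env, eval_freq=10_000, n_eval_episodes=5):
        super().__init__(verbose=0)
        self.trial = trial
        self.eval_env = eval_env
        self.eval_freq = eval_freq
        self.n_eval_episodes = n_eval_episodes
        self.is_pruned = False
    def _on_step(self) -> bool:
        if self.n_calls % self.eval_freq == 0:
            mean_reward, _ = evaluate_policy(
                self.model, self.eval_env, n_eval_episodes=self.n_eval_episodes
            )
            self.trial.report(mean_reward, step=self.num_timesteps)
            if self.trial.should_prune():
                self.is_pruned = True
                return False  # stop training; the objective then raises TrialPruned
        return True
Inside optimize_ppo, pass callback=TrialPruningCallback(trial, eval_env) to model.learn() and, after training, raise optuna.TrialPruned() if the callback's is_pruned flag is set.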
Distributed Training with RLlib
Scale training across multiple machines:
File: train_rllib_distributed.py
"""
Distributed training using RLlib (Ray).
"""
import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.registry import register_env
import gymnasium as gym
def create_env(env_config):
"""Environment creator for RLlib."""
return gym.make('LunarLander-v2')
def train_distributed():
"""Train PPO agent using RLlib with distributed workers."""
# Initialize Ray
ray.init(ignore_reinit_error=True)
# Register environment
register_env("lunar_lander", create_env)
# Configure PPO
config = (
PPOConfig()
.environment("lunar_lander")
.framework("torch")
.rollouts(num_rollout_workers=4) # 4 parallel workers
.training(
train_batch_size=4000,
sgd_minibatch_size=128,
num_sgd_iter=30,
lr=5e-5,
gamma=0.99,
lambda_=0.95,
clip_param=0.2,
)
.evaluation(
evaluation_interval=5,
evaluation_duration=10,
evaluation_num_workers=1,
)
)
# Train with Tune
results = tune.run(
"PPO",
config=config.to_dict(),
stop={"episode_reward_mean": 200}, # Stop when solved
checkpoint_freq=10,
checkpoint_at_end=True,
local_dir="./ray_results",
verbose=1
)
# Get best checkpoint
best_checkpoint = results.get_best_checkpoint(
results.get_best_trial("episode_reward_mean", mode="max"),
metric="episode_reward_mean",
mode="max"
)
print(f"\nBest checkpoint: {best_checkpoint}")
ray.shutdown()
if __name__ == "__main__":
train_distributed()
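To use the trained policy for inference, the best checkpoint can be restored into an Algorithm object. A sketch, assuming Ray 2.9 and that it runs in a process where the custom env id is registered:
import gymnasium as gym
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune.registry import register_env

# The custom env id from training must be registered in this process too
register_env("lunar_lander", lambda env_config: gym.make('LunarLander-v2'))

algo = Algorithm.from_checkpoint(best_checkpoint)  # checkpoint returned by tune.run above

env = gym.make('LunarLander-v2')
obs, info = env.reset()
terminated = truncated = False
episode_reward = 0.0
while not (terminated or truncated):
    action = algo.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
print(f"Episode reward: {episode_reward:.1f}")
env.close()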
Production Deployment
Export to ONNX for Fast Inference
File: export_onnx.py
"""
Export trained model to ONNX format for production deployment.
"""
import torch
import numpy as np
from stable_baselines3 import PPO
import onnx
import onnxruntime as ort
def export_to_onnx(model_path, onnx_path, observation_size):
"""
Export Stable-Baselines3 model to ONNX.
Args:
model_path: Path to trained .zip model
onnx_path: Output ONNX file path
observation_size: Size of observation space
"""
    # Load trained model
    model = PPO.load(model_path)
    # SB3's policy.forward() takes a `deterministic` flag and returns
    # (actions, values, log_probs), which ONNX tracing cannot express directly,
    # so wrap the policy in a module exposing a plain forward pass that returns
    # only the deterministic action (the pattern recommended in the SB3 docs).
    class OnnxablePolicy(torch.nn.Module):
        def __init__(self, policy):
            super().__init__()
            self.policy = policy
        def forward(self, observation):
            actions, _values, _log_probs = self.policy(observation, deterministic=True)
            return actions
    onnxable_policy = OnnxablePolicy(model.policy)
    # Create dummy input
    dummy_input = torch.randn(1, observation_size)
    # Export to ONNX
    torch.onnx.export(
        onnxable_policy,
        dummy_input,
        onnx_path,
        export_params=True,
        opset_version=12,
        input_names=['observation'],
        output_names=['action'],
        dynamic_axes={
            'observation': {0: 'batch_size'},
            'action': {0: 'batch_size'}
        }
    )
print(f"✓ Model exported to {onnx_path}")
# Verify ONNX model
onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
print("✓ ONNX model verified")
return onnx_path
def inference_with_onnx(onnx_path, observation):
"""
Run inference with ONNX model.
Args:
onnx_path: Path to ONNX model
observation: Input observation (numpy array)
Returns:
Action
"""
# Load ONNX model
session = ort.InferenceSession(onnx_path)
# Prepare input
if len(observation.shape) == 1:
observation = observation.reshape(1, -1)
# Run inference
action = session.run(
None,
{'observation': observation.astype(np.float32)}
)[0]
return action
# Example usage
if __name__ == "__main__":
# Export model
export_to_onnx(
model_path="ppo_cartpole_final.zip",
onnx_path="ppo_cartpole.onnx",
observation_size=4 # CartPole observation size
)
# Test ONNX inference
import gymnasium as gym
env = gym.make('CartPole-v1')
obs, _ = env.reset()
action = inference_with_onnx("ppo_cartpole.onnx", obs)
print(f"ONNX inference - Observation: {obs}, Action: {action}")
Production Best Practices
1. Monitoring with Weights & Biases
import wandb
from wandb.integration.sb3 import WandbCallback
# Initialize wandb (sync_tensorboard streams SB3's TensorBoard metrics to W&B)
run = wandb.init(
    project="rl-training",
    config={
        "algorithm": "PPO",
        "environment": "CartPole-v1",
    },
    sync_tensorboard=True,
)
# Add wandb callback (metrics are picked up from the tensorboard_log directory)
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=f"./tensorboard_logs/{run.id}")
model.learn(
total_timesteps=100000,
callback=WandbCallback(
model_save_path="./models/",
verbose=2,
)
)
wandb.finish()
2. Reproducibility
import numpy as np
import torch
import random
def set_seed(seed=42):
"""Set seeds for reproducibility."""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
    # Use deterministic algorithms (on GPU this may also require setting the
    # CUBLAS_WORKSPACE_CONFIG environment variable, e.g. ":4096:8")
    torch.use_deterministic_algorithms(True)
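Global seeding alone is not enough for a reproducible run: the environment and the algorithm take their own seeds. A sketch, assuming PPO on CartPole:
import gymnasium as gym
from stable_baselines3 import PPO

set_seed(42)

env = gym.make('CartPole-v1')
env.reset(seed=42)          # seed the environment's RNG
env.action_space.seed(42)   # and its action-space sampler

model = PPO('MlpPolicy', env, seed=42, verbose=0)  # SB3 seeds its own torch/numpy state
model.learn(total_timesteps=10_000)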
3. Save/Load Best Model
from stable_baselines3.common.callbacks import CheckpointCallback
# Save checkpoint every 10000 steps
checkpoint_callback = CheckpointCallback(
save_freq=10000,
save_path='./checkpoints/',
name_prefix='rl_model'
)
model.learn(total_timesteps=100000, callback=checkpoint_callback)
# Load a specific checkpoint (the EvalCallback shown earlier saves the best model separately)
model = PPO.load("checkpoints/rl_model_100000_steps")
Algorithm Selection Guide
| Algorithm | Action Space | Sample Efficiency | Stability | Use Case |
|---|---|---|---|---|
| DQN | Discrete | Medium | Medium | Atari games, discrete control |
| PPO | Both | Medium | High | General purpose, robotics |
| A2C | Both | Low | Medium | Fast prototyping |
| SAC | Continuous | High | High | Robotics, continuous control |
| TD3 | Continuous | High | High | Precise continuous control |
| DDPG | Continuous | Medium | Low | Continuous control (older) |
Recommendation: Start with PPO for most tasks. It’s stable, works for both discrete and continuous actions, and has good sample efficiency.
Known Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Sample inefficiency | Requires millions of steps | Use model-based RL, offline RL, or transfer learning |
| Reward engineering | Hard to design good rewards | Use reward shaping, inverse RL |
| Non-stationary environments | Agent fails when env changes | Continual learning, domain randomization |
| Sim-to-real gap | Simulator ≠ real world | System identification, domain adaptation |
| Hyperparameter sensitivity | Hard to tune | Use Optuna, default configs from papers |
| Credit assignment | Hard in long horizons | Use hierarchical RL, options |
| Reproducibility issues | Random seeds matter | Fix all seeds, use deterministic mode |
Conclusion
Modern open-source RL frameworks enable production deployment of intelligent agents:
- Stable-Baselines3: Best for single-machine, production-ready implementations
- RLlib: Best for distributed training at scale
- Gymnasium: Standard API for custom environments
- Optuna: Automated hyperparameter tuning
Key Takeaways:
- Start with PPO for general tasks, DQN for discrete actions, SAC for continuous
- Use vectorized environments for faster training (4+ parallel envs)
- Monitor training with TensorBoard or Weights & Biases
- Tune hyperparameters with Optuna before production
- Export to ONNX for fast inference deployment
Next Steps:
- Try advanced algorithms (SAC, TD3, QRDQN)
- Implement curriculum learning for complex tasks
- Explore multi-agent RL with RLlib
- Add imitation learning (behavior cloning) for bootstrapping
Further Resources:
- Stable-Baselines3 Docs: https://stable-baselines3.readthedocs.io/
- Gymnasium: https://gymnasium.farama.org/
- RLlib: https://docs.ray.io/en/latest/rllib/index.html
- Spinning Up in Deep RL (OpenAI): https://spinningup.openai.com/
- RL Course (Hugging Face): https://huggingface.co/learn/deep-rl-course/
- Optuna: https://optuna.org/