I. Algorithm Principles and Core Ideas

1. Model-Based Reinforcement Learning

Model-based reinforcement learning (Model-Based RL) builds a model of the environment dynamics and performs trajectory planning and policy optimization inside that learned, virtual environment. Compared with model-free methods, it offers the following advantages (a minimal planning-loop sketch follows the list):

  • Data efficiency: far fewer interactions with the real environment are needed

  • Strong planning ability: long-horizon decisions can be made through model predictions

  • Safety and controllability: high-risk actions can be tested in the virtual environment first
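
To make that loop concrete, here is a minimal, illustrative sketch of a model-based RL training loop. The names `dynamics_model.fit` and `plan_with_model` are hypothetical placeholders, not part of the PETS code later in this article; they only mark where model fitting and model-based planning happen.

# Minimal model-based RL loop sketch (illustrative only).
# `dynamics_model.fit` and `plan_with_model` are hypothetical placeholders.
def model_based_rl_loop(env, dynamics_model, num_iterations=10):
    buffer = []  # stores (state, action, next_state) transitions
    for _ in range(num_iterations):
        obs, _ = env.reset()
        done = False
        while not done:
            # Plan against the learned model instead of the real environment
            action = plan_with_model(dynamics_model, obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            buffer.append((obs, action, next_obs))
            obs = next_obs
            done = terminated or truncated
        dynamics_model.fit(buffer)  # refit the dynamics model on all collected data
    return dynamics_model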

2. The Core of the PETS Algorithm

PETS (Probabilistic Ensembles with Trajectory Sampling) achieves efficient planning by combining a probabilistic ensemble model with cross-entropy trajectory optimization (a sketch of the Gaussian negative log-likelihood loss commonly used to train such ensembles follows the table below):

  • Probabilistic ensemble model: several neural networks jointly model the environment dynamics and capture uncertainty

  • Trajectory optimization: the cross-entropy method (CEM) searches for the best action sequence

Algorithm advantages:

  Feature                       | Description
  High data reuse               | Data from a single environment interaction can be reused many times
  Strong noise robustness       | The probabilistic model handles sensor noise effectively
  Handles complex action spaces | Supports continuous control and high-dimensional action outputs
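
A side note on training the ensemble: the original PETS paper fits each member with a Gaussian negative log-likelihood, whereas the code in Section III uses a simplified mean-squared error plus a variance penalty. A minimal sketch of the NLL loss, assuming mean/log-variance outputs like those of the `ProbabilisticEnsemble` in Section III, looks like this:

import torch

def gaussian_nll_loss(mean, logvar, target):
    """Per-sample Gaussian negative log-likelihood (up to a constant).

    `mean` and `logvar` are an ensemble member's predicted next-state
    distribution; `target` is the observed next state.
    """
    inv_var = torch.exp(-logvar)
    # (mu - s')^2 / sigma^2 + log sigma^2, averaged over batch and state dims
    return ((mean - target).pow(2) * inv_var + logvar).mean()

Applied to the stacked ensemble outputs returned by `forward` in Section III, it broadcasts over the ensemble dimension and could replace the simplified loss in `train_model`.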

II. PETS Implementation Steps

We use the MuJoCo HalfCheetah environment as the running example to walk through the PETS implementation:

1. Problem Description

  • Environment goal: make the simulated cheetah robot run forward as fast as possible

  • State space: a 17-dimensional vector (joint angles, velocities, etc.)

  • Action space: a 6-dimensional continuous vector (joint torques); a quick dimension check follows this list
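
If you want to confirm these dimensions on your own installation (they can differ across MuJoCo environment versions), a quick check looks like this:

import gymnasium as gym

# Quick sanity check of the stated dimensions; run once after installing
# gymnasium[mujoco]. Dimensions can differ between environment versions.
env = gym.make("HalfCheetah-v5")
print(env.observation_space.shape)  # expected: (17,)
print(env.action_space.shape)       # expected: (6,)
env.close()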

2. Implementation Steps

  1. Build the probabilistic ensemble model (5 independent neural networks)

  2. Implement the trajectory optimizer based on the cross-entropy method

  3. Design the data collection and model training system

  4. Integrate the training loop with the Gymnasium environment


III. Code Implementation

import gymnasium as gym
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader
from collections import deque
import random
import time


# ================== Configuration ==================
class PETSConfig:
    # Environment parameters
    env_name = "HalfCheetah-v5"
    max_episode_steps = 200

    # Training parameters
    batch_size = 256
    lr = 1e-3
    epochs = 400
    horizon = 30
    num_particles = 20
    num_ensembles = 5

    # Network architecture
    hidden_dim = 200
    num_layers = 4
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ================== Probabilistic Ensemble Model ==================
class ProbabilisticEnsemble(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim

        self.models = nn.ModuleList([
            self._build_network(state_dim + action_dim, 2 * state_dim)
            for _ in range(PETSConfig.num_ensembles)
        ])

    def _build_network(self, in_dim, out_dim):
        layers = []
        input_dim = in_dim
        for _ in range(PETSConfig.num_layers):
            layers.extend([
                nn.Linear(input_dim, PETSConfig.hidden_dim),
                nn.SiLU(),
            ])
            input_dim = PETSConfig.hidden_dim
        layers.append(nn.Linear(input_dim, out_dim))
        return nn.Sequential(*layers)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        outputs = [model(x) for model in self.models]
        means = torch.stack([out[:, :self.state_dim] for out in outputs])
        logvars = torch.stack([out[:, self.state_dim:] for out in outputs])
        return means, logvars

    def predict(self, state, action):
        """Predict the next state (accepts NumPy inputs)."""
        # Convert to tensors
        state_tensor = torch.FloatTensor(state).to(PETSConfig.device)
        action_tensor = torch.FloatTensor(action).to(PETSConfig.device)

        # Randomly pick one ensemble member
        model_idx = np.random.randint(0, PETSConfig.num_ensembles)
        model = self.models[model_idx]

        # Forward pass
        with torch.no_grad():
            output = model(torch.cat([state_tensor, action_tensor], dim=-1).unsqueeze(0))
            mean = output[:, :self.state_dim]
            logvar = output[:, self.state_dim:]

        # Sample from the predicted Gaussian
        std = torch.exp(0.5 * logvar)
        epsilon = torch.randn_like(std)
        next_state = mean + epsilon * std

        return next_state.cpu().numpy().squeeze(0)  # back to NumPy, drop the batch dimension


# ================== Trajectory Optimizer ==================
class TrajectoryOptimizer:
    def __init__(self, model, state_dim, action_dim):
        self.model = model
        self.model.eval()  # evaluation mode
        self.state_dim = state_dim
        self.action_dim = action_dim

    def optimize(self, initial_state):
        # Cross-entropy method (CEM)
        best_actions = np.random.uniform(-1, 1, size=(PETSConfig.horizon, self.action_dim))
        for _ in range(5):  # CEM iterations
            # Generate candidate action sequences
            noise = np.random.normal(scale=0.2, size=(100, PETSConfig.horizon, self.action_dim))
            candidates = np.clip(best_actions + noise, -1, 1)

            # Evaluate candidate sequences
            returns = []
            for action_seq in candidates:
                total_reward = self._evaluate_sequence(initial_state, action_seq)
                returns.append(total_reward)

            # Keep the elite samples
            elite_idx = np.argsort(returns)[-10:]
            best_actions = np.mean(candidates[elite_idx], axis=0)

        return best_actions[0]  # return the first action of the best sequence

    def _evaluate_sequence(self, state, action_seq):
        total_reward = 0
        current_state = state.copy()
        for action in action_seq:
            next_state = self.model.predict(current_state, action)
            # Use a reward function that matches your task
            reward = self._get_reward(current_state, action, next_state)
            total_reward += reward
            current_state = next_state
        return total_reward

    def _get_reward(self, state, action, next_state):
        """Example: an approximate reward for HalfCheetah."""
        # Prefer the environment's true reward computation where possible;
        # this example simply tracks forward velocity.
        forward_vel = next_state[8]  # assumes dimension 9 is the forward velocity
        return forward_vel  # maximize forward velocity


# ================== Training System ==================
class PETSTrainer:
    def __init__(self):
        # Initialize the Gymnasium environment
        self.env = gym.make(
            PETSConfig.env_name,
            max_episode_steps=PETSConfig.max_episode_steps
        )
        self.state_dim = self.env.observation_space.shape[0]
        self.action_dim = self.env.action_space.shape[0]

        # Initialize the model
        self.model = ProbabilisticEnsemble(self.state_dim, self.action_dim).to(PETSConfig.device)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=PETSConfig.lr)
        self.trajectory_optimizer = TrajectoryOptimizer(self.model, self.state_dim, self.action_dim)

        # Replay buffer
        self.buffer = deque(maxlen=100000)

    def collect_data(self, num_episodes=50):
        """Collect initial data with a random policy."""
        for _ in range(num_episodes):
            obs, _ = self.env.reset()
            done = False
            while not done:
                action = self.env.action_space.sample()
                next_obs, reward, terminated, truncated, _ = self.env.step(action)
                self.buffer.append((obs, action, next_obs))
                obs = next_obs
                done = terminated or truncated

    def train_model(self):
        """Train the probabilistic ensemble model."""
        states, actions, next_states = zip(*random.sample(self.buffer, PETSConfig.batch_size))
        states = torch.FloatTensor(np.array(states)).to(PETSConfig.device)
        actions = torch.FloatTensor(np.array(actions)).to(PETSConfig.device)
        next_states = torch.FloatTensor(np.array(next_states)).to(PETSConfig.device)

        self.optimizer.zero_grad()
        means, logvars = self.model(states, actions)
        # Simplified loss: prediction error plus a variance penalty
        loss = (means - next_states).pow(2).mean() + 0.5 * logvars.exp().mean()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def evaluate(self, num_episodes=5):
        """Evaluate the planning policy."""
        total_rewards = []
        for _ in range(num_episodes):
            obs, _ = self.env.reset()
            episode_reward = 0
            done = False
            while not done:
                action = self.trajectory_optimizer.optimize(obs)
                obs, reward, terminated, truncated, _ = self.env.step(action)
                episode_reward += reward
                done = terminated or truncated
            total_rewards.append(episode_reward)
        return np.mean(total_rewards)


# ================== Main ==================
if __name__ == "__main__":
    trainer = PETSTrainer()

    # Phase 1: collect initial data
    print("Collecting initial data...")
    trainer.collect_data(num_episodes=100)

    # Phase 2: alternate model training and evaluation
    for epoch in range(PETSConfig.epochs):
        # Model training
        loss = trainer.train_model()

        # Policy evaluation
        if (epoch + 1) % 20 == 0:
            avg_reward = trainer.evaluate()
            print(f"Epoch {epoch + 1:04d} | Loss: {loss:.2f} | Avg Reward: {avg_reward:.1f}")

    # Save the final model
    # torch.save(trainer.model.state_dict(), "pets_model.pth")

IV. Code Walkthrough

1. Probabilistic Ensemble Model

class ProbabilisticEnsemble(nn.Module):
    def __init__(self, state_dim, action_dim):
        # Build 5 independent forward-prediction networks
        self.models = nn.ModuleList([...])

    def predict(self, state, action):
        # Randomly pick one ensemble member for a probabilistic prediction
        model = self.models[np.random.randint(len(self.models))]
        # Sample a noisy next-state prediction from the predicted Gaussian
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

2. Trajectory Optimizer

class TrajectoryOptimizer:
    def optimize(self, initial_state):
        # Core loop of the cross-entropy method
        for _ in range(5):  # iterative refinement
            candidates = best_actions + noise                        # generate candidate sequences
            returns = [evaluate_sequence(seq) for seq in candidates] # score each sequence
            best_actions = select_elites(candidates, returns)        # keep the elite samples
        return best_actions[0]  # return the first action of the best sequence

3. Training System

class PETSTrainer:
    def train_model(self):
        # Sample a batch from the replay buffer
        states, actions, next_states = zip(*random.sample(self.buffer, PETSConfig.batch_size))
        # Two loss terms: prediction error + variance regularization
        loss = (means - next_states).pow(2).mean() + 0.5 * logvars.exp().mean()

V. Results and Tuning Suggestions

1. Typical Training Log

Epoch 0020 | Loss: 13.34 | Avg Reward: -34.7
Epoch 0040 | Loss: 12.66 | Avg Reward: -55.4
Epoch 0060 | Loss: 10.36 | Avg Reward: -69.6
Epoch 0080 | Loss: 7.20  | Avg Reward: -50.7
Epoch 0100 | Loss: 5.40  | Avg Reward: -58.4
Epoch 0120 | Loss: 4.18  | Avg Reward: -23.0
Epoch 0140 | Loss: 3.30  | Avg Reward: -69.8
Epoch 0160 | Loss: 2.93  | Avg Reward: -34.4
Epoch 0180 | Loss: 2.18  | Avg Reward: -58.6
Epoch 0200 | Loss: 2.01  | Avg Reward: -63.3
Epoch 0220 | Loss: 1.81  | Avg Reward: -71.7
Epoch 0240 | Loss: 1.71  | Avg Reward: -47.7
Epoch 0260 | Loss: 1.56  | Avg Reward: -32.1
Epoch 0280 | Loss: 1.53  | Avg Reward: -63.4
Epoch 0300 | Loss: 1.38  | Avg Reward: -49.7
Epoch 0320 | Loss: 1.44  | Avg Reward: -34.3
Epoch 0340 | Loss: 1.50  | Avg Reward: -59.0
Epoch 0360 | Loss: 1.49  | Avg Reward: -42.7
Epoch 0380 | Loss: 1.33  | Avg Reward: -47.1
Epoch 0400 | Loss: 1.40  | Avg Reward: -36.3

2. Performance Bottleneck Analysis

  Symptom                                    | Likely cause                 | Suggested fix
  Loss decreases but return does not improve | Poorly designed reward       | Shape the reward, e.g. add a survival bonus (see the sketch after this table)
  Returns fluctuate heavily early on         | Insufficient exploration     | Increase the CEM noise scale
  Returns plateau later in training          | Insufficient model capacity  | Add network layers and hidden units
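
For the first row of the table, a hedged sketch of a shaped planner reward is shown below. HalfCheetah-v5 never terminates early, so a survival bonus matters less here than in environments like Hopper; the sketch instead moves the planner's reward closer to the environment's true reward by combining forward velocity with a control cost. The velocity index (8) and the 0.1 weight are assumptions to verify against your environment:

import numpy as np

def _get_reward(self, state, action, next_state):
    """Hedged sketch of a shaped planner reward (drop-in for
    TrajectoryOptimizer._get_reward). The velocity index (8) and the
    control-cost weight (0.1) are assumptions loosely modeled on
    HalfCheetah-v5; verify them against your environment's documentation."""
    forward_vel = next_state[8]                       # assumed forward-velocity dimension
    ctrl_cost = 0.1 * float(np.square(action).sum())  # penalize large joint torques
    return forward_vel - ctrl_cost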

3. Key Tuning Suggestions

class ImprovedConfig(PETSConfig):
    num_ensembles = 7   # more ensemble members
    horizon = 40        # longer planning horizon
    batch_size = 512    # larger batch size
    hidden_dim = 256    # larger network capacity

VI. Summary and Extensions

  • This article implemented model-based reinforcement learning with the PETS algorithm and demonstrated its data efficiency on a continuous-control task. Readers may try the following extensions:

    1. Use a more expressive dynamics model, such as an LSTM (see the sketch at the end of this section)

    2. Test the algorithm on real-robot tasks

    3. Incorporate offline data to improve model accuracy

    In the next article, we will explore meta reinforcement learning (Meta RL) and implement the MAML algorithm!
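
For extension 1, a hedged sketch of a recurrent dynamics model is shown below. It only illustrates the shape conventions of an LSTM-based model and is not a drop-in replacement for the `ProbabilisticEnsemble` above:

import torch
import torch.nn as nn

class LSTMDynamicsModel(nn.Module):
    """Hedged sketch of a recurrent dynamics model (extension 1).

    Consumes a sequence of (state, action) pairs and predicts the
    corresponding next-state sequence.
    """
    def __init__(self, state_dim, action_dim, hidden_dim=200):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, states, actions, hidden=None):
        # states: (batch, seq_len, state_dim), actions: (batch, seq_len, action_dim)
        x = torch.cat([states, actions], dim=-1)
        out, hidden = self.lstm(x, hidden)
        return self.head(out), hidden  # predicted next states and LSTM hidden state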


Notes

  1. Install the dependencies before running:

    pip install gymnasium[mujoco] torch

  2. No MuJoCo license key is required: MuJoCo has been free and open source since version 2.1.0, and the package installed by gymnasium[mujoco] works without the old mjkey.txt setup.

  3. A GPU is recommended for training (RTX 3060 or better).

We hope this article helps you master the core methods of model-based reinforcement learning! If you run into any problems, feel free to discuss them in the comments.
