Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja; Aurick Zhou; Pieter Abbeel; Sergey Levine


Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
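The maximum entropy objective described in the abstract can be illustrated with a small numerical sketch. It assumes a discrete action space, a fixed reward sequence that does not depend on the policy, and a temperature weight `alpha` that trades reward against entropy; the function names, the value `alpha=0.2`, and the toy data are all hypothetical choices for illustration, not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_objective(rewards, action_probs, alpha=0.2):
    """Sum over time steps of reward plus alpha times policy entropy.

    rewards: per-step rewards (assumed fixed here for simplicity).
    action_probs: per-step action distributions of the policy.
    alpha: hypothetical temperature weighting the entropy bonus.
    """
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, action_probs))


# Toy comparison: same rewards, deterministic vs. uniform two-action policy.
rewards = [1.0, 1.0, 1.0]
deterministic = [[1.0, 0.0]] * 3
uniform = [[0.5, 0.5]] * 3

j_det = max_entropy_objective(rewards, deterministic)
j_uni = max_entropy_objective(rewards, uniform)
print(j_det, j_uni)
```

With equal rewards, the uniform policy scores higher because of its entropy bonus, which is the sense in which the actor is encouraged to succeed "while acting as randomly as possible". In a real task the rewards depend on the actions taken, so the two terms must be traded off.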


Bibliographic details

Source: JMLR: Workshop and Conference Proceedings, 2018, issue 12, 10 pages
Authors: Tuomas Haarnoja; Aurick Zhou; Pieter Abbeel; Sergey Levine
Original format: PDF
Chinese Library Classification: Artificial intelligence theory

Similar literature

1. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor [J]. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. JMLR: Workshop and Conference Proceedings, 2018, issue 12.
2. A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning [J]. Wesley Suttle, Zhuoran Yang, Kaiqing Zhang, et al. IFAC PapersOnLine, 2020, issue 2.
3. An actor-critic deep reinforcement learning approach for metro train scheduling with rolling stock circulation under stochastic demand [J]. Ying Cheng-shuo, Chow Andy H. F., Chin Kwai-Sang. Transportation Research Part B: Methodological, 2020, Oct. issue.
4. Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus [C]. Yan Zhang, Michael M. Zavlanos. IEEE Annual Conference on Decision and Control, 2019.
5. Acquiring Diverse Robot Skills via Maximum Entropy Deep Reinforcement Learning [D]. Tuomas Haarnoja. 2018.
6. Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor–Critic with Hindsight Experience Replay [O]. Evan Prianto, MyeongSeop Kim, Jae-Han Park, et al. 2020.
7. Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [O]. Jingliang Duan, Yang Guan, Shengbo Eben Li, et al. 2021.
