
An Action-Weighted Actor-Critic Algorithm in Continuous Spaces

Abstract

Classic reinforcement learning algorithms are designed mainly for discrete state and action spaces. In complex learning environments, or when the state and action spaces are continuous, discrete-space methods cannot meet practical requirements. One feasible workaround is to discretize the state and action spaces so that discrete-space methods can be applied, but choosing a reasonable discretization is itself a hard problem. Methods that operate directly in continuous spaces avoid discretization, yet most of them do not consider the constraint on the admissible action range, and their optimal actions tend to fluctuate heavily. For optimal control problems with interval-constrained continuous action spaces, this paper proposes an action-weighted actor-critic algorithm, Action Weight Policy Search Actor-Critic (AW-PS-AC). AW-PS-AC takes the actor-critic architecture, a classic framework for continuous spaces, as its basis. The optimal state-value function and the optimal policy are approximated with linear function approximators, and gradient descent is used to update one set of value-function parameters and two sets of policy parameters. The two sets of policy parameters are weighted to obtain the optimal policy, and the resulting action is constrained to the given interval so that it never leaves the action range and the policy does not fluctuate significantly; the exploration policy is Gaussian with the current optimal action as its mean, so the selected action deviates from it only by a small exploration factor. Weighting the actions both satisfies the range constraint and makes fuller use of the samples, which yields good performance from only a small amount of data. To further accelerate convergence, an improved temporal-difference scheme is designed: the temporal-difference error (TD-error) of the value function is used to update the optimal policy, and a policy eligibility trace is introduced to adjust the policy parameters. Under three stated assumptions, the convergence of AW-PS-AC is analyzed and proved. To verify its effectiveness, AW-PS-AC is evaluated on two classic reinforcement learning benchmarks with nonlinear dynamics, the pole-balancing problem and the puddle world problem, and compared with representative continuous-space algorithms: the continuous actor-critic learning automaton (CACLA), continuous-action Q-learning (CAQ), and the incremental natural actor-critic with scaling gradient (INAC-S). The results show that AW-PS-AC effectively solves the approximately optimal policy problem in continuous spaces in both experiments and outperforms these algorithms in both convergence speed and stability: it converges after only a few episodes and remains stable thereafter.
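To make the update scheme described in the abstract concrete, the following is a minimal Python sketch of one possible AW-PS-AC-style agent: a linear critic, two linear actor parameter vectors whose outputs are weighted and clipped to the action interval, Gaussian exploration centred on the greedy action, and a TD-error-driven actor update with policy eligibility traces. The concrete weighting scheme, the score term, and every name and hyperparameter below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch only: the exact AW-PS-AC weighting and update rules are
# not reproduced here; names and hyperparameters are assumptions.
class ActionWeightedActorCritic:
    def __init__(self, n_features, a_min, a_max,
                 alpha_v=0.1, alpha_p=0.01, gamma=0.99,
                 lam=0.9, sigma=0.3, w=0.5):
        self.theta_v = np.zeros(n_features)   # value-function parameters (critic)
        self.theta_p1 = np.zeros(n_features)  # first set of policy parameters
        self.theta_p2 = np.zeros(n_features)  # second set of policy parameters
        self.e1 = np.zeros(n_features)        # policy eligibility trace for set 1
        self.e2 = np.zeros(n_features)        # policy eligibility trace for set 2
        self.a_min, self.a_max = a_min, a_max
        self.alpha_v, self.alpha_p = alpha_v, alpha_p
        self.gamma, self.lam, self.sigma, self.w = gamma, lam, sigma, w

    def greedy_action(self, phi):
        """Weighted combination of the two linear policies, clipped to the interval."""
        a = self.w * (self.theta_p1 @ phi) + (1.0 - self.w) * (self.theta_p2 @ phi)
        return float(np.clip(a, self.a_min, self.a_max))

    def explore_action(self, phi):
        """Gaussian exploration centred on the current greedy (optimal) action."""
        a = np.random.normal(self.greedy_action(phi), self.sigma)
        return float(np.clip(a, self.a_min, self.a_max))

    def update(self, phi, a, r, phi_next, done):
        """One TD update: the value-function TD-error drives critic and actor."""
        v, v_next = self.theta_v @ phi, self.theta_v @ phi_next
        delta = r + (0.0 if done else self.gamma * v_next) - v
        self.theta_v += self.alpha_v * delta * phi            # critic update
        # Score of the Gaussian mean w.r.t. each parameter set (the 1/sigma^2
        # factor is folded into the step size), accumulated in the traces.
        score = a - self.greedy_action(phi)
        self.e1 = self.gamma * self.lam * self.e1 + self.w * score * phi
        self.e2 = self.gamma * self.lam * self.e2 + (1.0 - self.w) * score * phi
        self.theta_p1 += self.alpha_p * delta * self.e1       # actor update, set 1
        self.theta_p2 += self.alpha_p * delta * self.e2       # actor update, set 2
```

A fixed weight w is assumed for simplicity; in the paper the weighting of the two parameter sets is what keeps the combined action inside its interval, a detail the abstract does not spell out, so a hard clip stands in for it here.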
