
Research on Adaptive Reinforcement Learning Based on the Convex Polyhedra Abstract Domain

Abstract

Table-driven algorithms are an important class of methods for solving reinforcement learning problems, but because of the curse of dimensionality they cannot be applied directly to problems with continuous state spaces. Two main remedies have been proposed: discretization of the continuous state space and function approximation. Compared with function approximation, table-driven methods built on a discretized state space are straightforward in principle, simple in program structure, and computationally lightweight. The key to the discretization approach is to find a suitable discretization mechanism that balances computational cost against accuracy, so that the quantitative metrics defined over the discrete abstract state space, such as the V-function and the Q-function, support reasonably accurate policy evaluation and computation of the optimal policy π* for the original reinforcement learning problem.

This paper proposes an adaptive state-space discretization method based on the convex polyhedra abstract domain and uses it to implement an adaptive polyhedra-domain-based Q(λ) reinforcement learning algorithm, APDQ(λ) (Adaptive Polyhedra Domain based Q(λ)). A convex polyhedron is a representation of an abstract state that is widely used in performance evaluation of stochastic systems and in verification of numerical properties of programs. Through an abstraction function, the method maps the concrete state space onto an abstract state space of polyhedra, turning the computation of an optimal policy over a continuous state space into the tractable computation of an optimal policy over a finite abstract state space.

Based on the sample information associated with each abstract state, several adaptive refinement operators are designed, including BoxRefinement, LFRefinement, and MVLFRefinement. These operators continually refine the abstract state space, thereby optimizing the discretization of the concrete state space and producing a finite, discrete model that matches the statistical reward model implied by the online samples and approximates the dynamics and reward model of the continuous Markov system. APDQ(λ) is implemented on top of the Parma Polyhedra Library (PPL) and the GNU Multiple Precision library (GMP), which carry out the high-precision algebraic and geometric computations on polyhedra, and case studies are conducted to evaluate its performance.
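For illustration only, the following Python sketch shows the general idea of mapping concrete states to a finite set of abstract states and indexing the Q-table by abstract state. It uses axis-aligned boxes as a simplified stand-in for general convex polyhedra (the paper performs exact polyhedral computations via PPL/GMP); all class and function names here are hypothetical, and the Mountain Car bounds are only indicative.

```python
# A minimal sketch (not the paper's implementation): continuous states are
# mapped to a finite set of abstract states and Q-values are kept per
# abstract state. Axis-aligned boxes stand in for general convex polyhedra.
from collections import defaultdict
import random

class BoxAbstraction:
    """Partition of a continuous state space into axis-aligned boxes."""
    def __init__(self, lows, highs, splits_per_dim):
        self.lows, self.highs, self.splits = lows, highs, splits_per_dim

    def abstract(self, state):
        """Abstraction function: concrete state -> abstract state id (tuple of cell indices)."""
        cell = []
        for x, lo, hi, n in zip(state, self.lows, self.highs, self.splits):
            frac = (x - lo) / (hi - lo)
            cell.append(min(n - 1, max(0, int(frac * n))))
        return tuple(cell)

def epsilon_greedy(Q, abs_state, actions, eps=0.1):
    """Pick an action from a Q-table indexed by (abstract state, action)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(abs_state, a)])

# Example: Mountain Car's 2-D state (position, velocity) mapped onto a 10x10 grid.
abstraction = BoxAbstraction(lows=[-1.2, -0.07], highs=[0.6, 0.07], splits_per_dim=[10, 10])
Q = defaultdict(float)                      # Q-values over abstract states
s_abs = abstraction.abstract([-0.5, 0.0])   # abstract state of a concrete state
a = epsilon_greedy(Q, s_abs, actions=[0, 1, 2])
```

In the actual algorithm the partition is not fixed in advance as in this sketch; the abstract state space is refined adaptively as online samples accumulate, which is the role of the refinement operators described above.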
In the experiments, Mountain Car (MC) and the acrobatic robot (Acrobot), two typical reinforcement learning problems with continuous state spaces, are used as test subjects. The influence of the reinforcement learning parameters and of the refinement-related threshold parameters on the performance of APDQ(λ) is evaluated in detail, and the role of each parameter in policy optimization under a dynamically changing abstract state space is analyzed. The results show that (1) APDQ(λ) behaves well when the discount rate γ ≥ 0.7 (Figs. 6-13): the policy improves rapidly in the initial phase and then converges smoothly, and the algorithm adapts well to the learning rate α and to the various parameters governing the refinement of the abstract state space; (2) when γ ≤ 0.6, performance degrades quickly. In summary, applying abstract interpretation to the statistical learning process is a promising approach to reinforcement learning problems with continuous state spaces, and many topics deserve further study, such as the sampling policy and the update of value functions in the context of the abstract approximate model.
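As a rough illustration of how the value estimates over abstract states and the refinement decisions might interact, the sketch below shows a Watkins-style Q(λ) update indexed by abstract states together with a hypothetical variance-based refinement trigger. The paper's actual BoxRefinement, LFRefinement, and MVLFRefinement operators and their threshold parameters are defined differently; all identifiers below are assumptions made for the example.

```python
# A generic sketch of Watkins-style Q(λ) over abstract states plus a
# hypothetical refinement trigger; the concrete operators (BoxRefinement,
# LFRefinement, MVLFRefinement) and their thresholds follow the paper.
from collections import defaultdict
from statistics import pvariance

def q_lambda_update(Q, E, samples, transition, params):
    """One Q(λ) step; `transition` = (s_abs, a, r, s_abs_next, a_next, a_star)."""
    alpha, gamma, lam = params['alpha'], params['gamma'], params['lambda']
    s, a, r, s2, a2, a_star = transition
    delta = r + gamma * Q[(s2, a_star)] - Q[(s, a)]   # TD error w.r.t. greedy action
    E[(s, a)] += 1.0                                  # accumulating eligibility trace
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        # cut traces after exploratory actions (Watkins's Q(λ)), else decay them
        E[key] = gamma * lam * E[key] if a2 == a_star else 0.0
    samples[s].append(r)                              # per-abstract-state reward samples

def needs_refinement(samples, s_abs, var_threshold=1.0, min_samples=20):
    """Hypothetical trigger: split an abstract state whose reward samples vary too much."""
    xs = samples[s_abs]
    return len(xs) >= min_samples and pvariance(xs) > var_threshold

# Typical usage: Q, E = defaultdict(float), defaultdict(float); samples = defaultdict(list);
# call q_lambda_update after each step, and split an abstract state whenever
# needs_refinement(samples, s_abs) holds.
```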

