Engineering Applications of Artificial Intelligence

An automated signalized junction controller that learns strategies by temporal difference reinforcement learning


Abstract

This paper shows how temporal difference learning can be used to build a signalized junction controller that learns its own strategies through experience. Simulation tests detailed here show that the learned strategies can have high performance. This work builds upon previous work in which a neural network based junction controller that can learn strategies from a human expert was developed (Box and Waterson, 2012). In the simulations presented, vehicles are assumed to be broadcasting their position over WiFi, giving the junction controller rich information. The vehicles' position data are pre-processed to describe a simplified state. The state-space is classified into regions associated with junction control decisions using a neural network. This classification is the strategy and is parametrized by the weights of the neural network. The weights can be learned either through supervised learning with a human trainer or through reinforcement learning by temporal difference (TD). Tests on a model of an isolated T junction show average delays of 14.12 s and 14.36 s for the human-trained and TD-trained networks respectively. Tests on a model of a pair of closely spaced junctions show 17.44 s and 20.82 s respectively. Both methods of training produced strategies that were approximately equivalent in their equitable treatment of vehicles, defined here as the variance over the journey time distributions.
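The control loop the abstract describes, a neural network that scores junction control decisions from a simplified vehicle-derived state, with weights adjusted by a temporal difference error, can be sketched in a few lines. The sketch below is illustrative only: the JunctionSim toy queueing model, the state encoding (normalized queue lengths per approach), the reward (negative total queue), the network size, and the Q-learning flavour of TD are all assumptions for the demo, not the paper's actual formulation.

```python
# Minimal TD (Q-learning) sketch of a junction controller that learns a
# control strategy from experience. All model details here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

N_APPROACHES = 4           # assumed: one sensed queue per approach arm
N_PHASES = 2               # assumed: two mutually exclusive green phases
ALPHA, GAMMA, EPSILON = 0.01, 0.95, 0.1

# One-hidden-layer network mapping the simplified state to a value per phase.
W1 = rng.normal(0.0, 0.1, (16, N_APPROACHES))
W2 = rng.normal(0.0, 0.1, (N_PHASES, 16))

def q_values(state):
    """Forward pass: tanh hidden layer, linear output."""
    h = np.tanh(W1 @ state)
    return W2 @ h, h

def td_update(state, action, reward, next_state):
    """TD(0) update of the weights for the phase that was chosen."""
    global W1, W2
    q, h = q_values(state)
    q_next, _ = q_values(next_state)
    td_error = reward + GAMMA * q_next.max() - q[action]
    grad_h = W2[action] * (1.0 - h ** 2)          # backprop through tanh
    W2[action] += ALPHA * td_error * h
    W1 += ALPHA * td_error * np.outer(grad_h, state)

class JunctionSim:
    """Toy queueing model of an isolated junction; a stand-in for the
    WiFi-position simulation used in the paper."""
    def __init__(self):
        self.queues = np.zeros(N_APPROACHES)

    def step(self, phase):
        self.queues += rng.poisson(0.3, N_APPROACHES)     # random arrivals
        green = [0, 1] if phase == 0 else [2, 3]          # served approaches
        self.queues[green] = np.maximum(self.queues[green] - 1.0, 0.0)
        reward = -self.queues.sum()                       # delay proxy
        return self.queues / 10.0, reward                 # normalized state

sim = JunctionSim()
state = sim.queues / 10.0
for t in range(5000):
    q, _ = q_values(state)
    # Epsilon-greedy: mostly follow the learned strategy, sometimes explore.
    action = int(rng.integers(N_PHASES)) if rng.random() < EPSILON else int(q.argmax())
    next_state, reward = sim.step(action)
    td_update(state, action, reward, next_state)
    state = next_state

print("mean queue length after training:", sim.queues.mean())
```

The epsilon-greedy exploration and the negative-queue reward are standard choices that make the TD update well defined; the paper's supervised alternative would instead fit the same weights to phase decisions recorded from a human expert.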
