Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents
Zeng, Fanyu; Wang, Chen
2020-10-15 00:00:00
School of Computer Science and Engineering, Center for Robotics, University of Electronic Science and Technology of China, Chengdu 611731, China
Correspondence should be addressed to Fanyu Zeng; zengfanyu_cs@163.com
Journal of Robotics, Volume 2020, Article ID 8702962, https://doi.org/10.1155/2020/8702962

Vanilla policy gradient methods suffer from high variance, leading to unstable policies during training, where the policy's performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. A variant navigation method (asynchronous proximal policy optimization navigation, appoNav) is presented that guarantees monotonic policy improvement during policy optimization. Our experiments are conducted in DeepMind Lab, and the results show that artificial agents trained with appoNav perform better than those trained with the compared algorithm.

1. Introduction

Navigation in an unstructured environment is one of the most important abilities for mobile robots and artificial agents [1–3]. Traditional methods mainly divide navigation into several parts [4]: simultaneous localization and mapping (SLAM) [5–7], path planning [8], and semantic segmentation [9, 10]. These methods are not end-to-end algorithms: each part is a challenging research subject in its own right, and the fusion of the parts often leads to large computational errors. To reduce the fusion error, we focus on end-to-end navigation based on deep reinforcement learning, where navigational abilities emerge as a byproduct of an artificial agent learning a policy through reward maximization.

With the fast development of deep learning [11–14], a variety of DRL architectures have been proposed [2]. Mnih et al. [15] presented advances in training deep neural networks to develop the deep Q-network (DQN), which can learn successful policies directly from high-dimensional image inputs using end-to-end reinforcement learning. On-policy reinforcement learning methods such as actor-critic (AC) [16, 17] were proposed, in which the actor is the policy and the critic is the baseline. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents. Researchers can construct navigation agents based on these DRL algorithms. However, vanilla policy gradient methods have poor data efficiency [19], which leads to navigation agents suffering from high variance and unstable policies.

In this work, we take A3C as an example to show how to guarantee monotonic policy improvement. The training environment is DeepMind Lab [20], a first-person 3D virtual environment designed for the research and development of general artificial intelligence. DeepMind Lab can be used to study how autonomous artificial agents learn complex tasks in large, partially observed, and visually diverse worlds, rendered with rich science fiction-style visuals. Actions allow the agent to look around and move in the 3D virtual world, and example tasks include navigation in different mazes. Mirowski et al. [21] proposed a DRL navigation method based on A3C [18], augmented with auxiliary learning targets, to train artificial agents to navigate in DeepMind Lab. For ease of expression, we call the DRL navigation method that uses A3C a3cNav.
In this paper, the issues of policy optimization for navigation based on the vanilla policy gradient are analyzed; this type of navigation cannot control the change of the expected advantage when an artificial agent learns to navigate in a maze. Based on the navigation techniques presented in [21], we show how to reduce training variance and obtain higher reward when an artificial agent interacts with an environment. Inspired by [19, 22], we adjust the policy update process of the navigation method in [21] to guarantee monotonic improvement of the navigation policy. Experimental results show that an artificial agent trained with appoNav learns a better navigation policy in DeepMind Lab and suffers from lower standard deviation than a3cNav.

2. Related Work

Traditional navigation, which is model-based, includes simultaneous localization and mapping (SLAM) [5, 7, 23], path planning [8, 24], and semantic segmentation [9]. Each of these parts is a challenging research area in its own right, and their fusion often leads to large computation errors. Moreover, model-based navigation has to model the environment explicitly, which is difficult for dynamic and complex scenes and severely affects navigation performance.

With recent advances in DRL, many navigation methods based on DRL have been proposed [2]. DRL navigation, which is end to end, avoids the computation error caused by the fusion steps of traditional navigation. Mirowski et al. [21] addressed navigation via auxiliary depth prediction and loop-closure classification tasks. Jaderberg et al. [25] also used auxiliary tasks for navigation and incorporated A3C with control tasks and prediction tasks, including pixel control and reward prediction. Ha and Schmidhuber [26] used DRL to construct a world model and, using features extracted from the world model as inputs to the agent, applied it to a car navigation task. Bruce et al. [27] leveraged an interactive world model based on DRL built from a single traversal of the environment and utilized a pretrained visual feature encoder to demonstrate successful zero-shot transfer under real-world environmental variations without fine-tuning. Banino et al. [28] proposed a vector-based navigation method that fuses DRL with grid-like representations in the artificial agent. When these DRL navigation agents interact with environments, the state sequences of each interaction change a lot, leading to large fluctuations in rewards. Therefore, these DRL navigation methods suffer from high variance and have unstable policies during training.

3. Background

3.1. Reinforcement Learning. We consider the standard reinforcement learning setting in which an artificial agent interacts with an environment over a number of discrete time steps. At each time step t, the agent receives a state s_t from the environment and outputs an action a_t according to its learned policy π. In return, the environment gives the agent the next state s_{t+1} and a reward r_t. The goal of reinforcement learning is to maximize the accumulated reward R_t = Σ_{k=0}^{∞} γ^k r_{t+k}, which is a discounted sum of rewards. The action-value function Q^π(s, a) = E[R_t | s_t = s, a] is the expected return after taking action a in state s under policy π, and the value function V^π(s) = E[R_t | s_t = s] is the expected return from state s.

In policy-based methods, let π(a | s; θ) be a policy with parameters θ, which is updated by performing gradient ascent on E[R_t]. Policy gradient algorithms adjust the policy by updating the parameters θ in the direction ∇_θ log π(a_t | s_t; θ) R_t, which is an unbiased estimate of ∇_θ E[R_t]. To reduce the variance of this estimate, Williams [29] subtracted a learned baseline b_t(s_t) from the return, so the improved gradient becomes ∇_θ log π(a_t | s_t; θ)(R_t − b_t(s_t)). Since b_t(s_t) ≈ V^π(s_t), the term R_t − b_t(s_t) can be seen as an estimate of the advantage of action a_t in state s_t; because R_t is an estimate of Q^π(s_t, a_t), the advantage function can be rewritten as A(a_t, s_t) = Q(a_t, s_t) − V(s_t). This method is called the actor-critic (AC) architecture, in which the actor is the policy π and the critic is the baseline b_t [16, 17]. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents.
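As a concrete illustration of the advantage-based gradient above, the following is a minimal sketch in PyTorch, assuming a toy discrete-action problem with made-up network sizes and rollout data; it is not the authors' implementation.

```python
# Sketch of the update direction grad_theta log pi(a_t | s_t; theta) * (R_t - b_t(s_t)).
# The networks, dimensions, and rollout below are illustrative assumptions only.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))  # actor
value = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))           # critic / baseline

def discounted_returns(rewards, gamma):
    # R_t = sum_k gamma^k r_{t+k}
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

# One toy rollout; random tensors stand in for environment interaction.
T = 5
states = torch.randn(T, state_dim)
actions = torch.randint(n_actions, (T,))
returns = torch.tensor(discounted_returns([0.0, 0.0, 1.0, 0.0, 1.0], gamma))

log_probs = torch.log_softmax(policy(states), dim=-1)[torch.arange(T), actions]
baseline = value(states).squeeze(-1)              # b_t(s_t), an estimate of V^pi(s_t)
advantage = returns - baseline.detach()           # R_t - b_t(s_t)

policy_loss = -(log_probs * advantage).mean()     # gradient ascent on E[R_t]
value_loss = (returns - baseline).pow(2).mean()   # regress the baseline toward R_t
(policy_loss + 0.5 * value_loss).backward()       # one actor-critic gradient, before an optimizer step
```

Detaching the baseline inside the advantage keeps the critic from being updated through the policy term, mirroring the separate actor and critic updates described above.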
When a DRL agent interacts with its environment, the state sequences of each interaction change a lot, leading to fluctuations in rewards. Therefore, DRL algorithms such as DQN and A3C exhibit unstable fluctuations during training. A natural question is whether such fluctuations can be reduced while maintaining a steady improvement in the policy. Schulman et al. [22] proposed trust region policy optimization (TRPO) to make policy improvement monotonic. Furthermore, Schulman et al. [19] proposed proximal policy optimization (PPO) to simplify the calculation of TRPO. In addition, Heess et al. [30] proposed a distributed implementation of PPO, called distributed PPO. Beyond a gradient update process similar to that of A3C, distributed PPO includes various tricks, such as normalizations (observation normalization, reward reshape normalization, and per-batch normalization of the advantages), sharing of algorithm parameters across local workers, and an additional trust region constraint. These tricks make the computation of distributed PPO more complex than that of appoNav.

3.2. NavA3C + D_1D_2. In this work, we use the NavA3C + D_1D_2 architecture [21] shown in Figure 1, which includes 2 CNNs and 2 LSTMs. NavA3C + D_1D_2 has 4 inputs: the current RGB image x_t, the previous reward r_{t−1}, the previous action a_{t−1}, and the current velocity v_t. The 2 CNNs act as the encoder for the RGB image x_t, and the first LSTM makes associations between the reward r_{t−1} and the visual observations x_t, which are provided as context to the second LSTM, from which the policy π(a_t | s_t; θ) and the value V(s_t; θ_v) are computed. Artificial agents based on this architecture try to maximize the cumulative reward R_t during their interaction with the maze and to minimize the auxiliary depth losses L_{Depth1} and L_{Depth2}. Finally, the agent learns how to navigate in DeepMind Lab. For ease of expression, we rename NavA3C + D_1D_2 as a3cNav.

Figure 1: a3cNav architecture. Image x_t is the input of a3cNav; the two-layer CNN is followed by a fully connected layer, which outputs depth D_1, and by a two-layer stacked LSTM, which outputs depth D_2, the policy π, and the value V. In the auxiliary setup used in this architecture, the first LSTM receives only the reward, while the velocity and the previously selected action are fed into the second LSTM.
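The wiring just described can be made concrete with a small sketch. The following PyTorch module is an illustration only: the convolution and LSTM sizes, the 84x84 input resolution, the 3-dimensional velocity, the 8-action space, and the flat 64-dimensional depth outputs are assumptions, not values taken from this paper or from [21].

```python
# Illustrative NavA3C+D1D2-style wiring: CNN encoder -> LSTM1 (features + r_{t-1})
# -> LSTM2 (features + LSTM1 output + v_t + a_{t-1}) -> policy, value, and depth heads.
import torch
import torch.nn as nn

class NavA3CSketch(nn.Module):
    def __init__(self, n_actions=8, hidden=256):
        super().__init__()
        self.n_actions = n_actions
        # Two-layer CNN encoder for the RGB image x_t, followed by a fully connected layer.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, hidden), nn.ReLU(),   # 9x9 spatial size for 84x84 inputs
        )
        # First LSTM associates the previous reward r_{t-1} with the visual features.
        self.lstm1 = nn.LSTM(hidden + 1, hidden, batch_first=True)
        # Second LSTM additionally receives the velocity v_t and the previous action a_{t-1}.
        self.lstm2 = nn.LSTM(hidden + hidden + 3 + n_actions, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)  # pi(a_t | s_t; theta)
        self.value_head = nn.Linear(hidden, 1)           # V(s_t; theta_v)
        self.depth1_head = nn.Linear(hidden, 64)         # auxiliary depth D1 from the CNN features
        self.depth2_head = nn.Linear(hidden, 64)         # auxiliary depth D2 from the second LSTM

    def forward(self, image, prev_reward, prev_action, velocity):
        # image: (B, T, 3, 84, 84); prev_reward: (B, T, 1); prev_action: (B, T) int64; velocity: (B, T, 3)
        B, T = image.shape[:2]
        feats = self.encoder(image.flatten(0, 1)).view(B, T, -1)
        h1, _ = self.lstm1(torch.cat([feats, prev_reward], dim=-1))
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        h2, _ = self.lstm2(torch.cat([feats, h1, velocity, a_onehot], dim=-1))
        return {
            "policy_logits": self.policy_head(h2),
            "value": self.value_head(h2).squeeze(-1),
            "depth1": self.depth1_head(feats),
            "depth2": self.depth2_head(h2),
        }

# Toy forward pass with random tensors standing in for a DeepMind Lab rollout.
net = NavA3CSketch()
out = net(
    image=torch.randn(1, 5, 3, 84, 84),
    prev_reward=torch.randn(1, 5, 1),
    prev_action=torch.randint(8, (1, 5)),
    velocity=torch.randn(1, 5, 3),
)
print({k: tuple(v.shape) for k, v in out.items()})
```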
a3cNav is based on the A3C framework, into which unsupervised auxiliary tasks are incorporated. Therefore, its loss function includes the A3C loss L_{A3C} and the losses of the auxiliary tasks, and a3cNav is optimized as follows:

L_{a3cNav}(θ) = L_{A3C} + λ_{Depth1} L_{Depth1} + λ_{Depth2} L_{Depth2},   (1)

where λ_{Depth1} and λ_{Depth2} are weighting terms on the individual loss components.

The global parameters θ of a3cNav are updated in multithreaded environments, and θ is copied to the local worker parameters θ′. A local worker of a3cNav interacts with the maze, and the policy gradients with respect to θ′ and the value gradients with respect to θ′_v are computed from the policy loss and the value loss. The gradient for the policy parameter update is proportional to the advantage function A_t, as shown in equation (2):

dθ ← dθ + ∇_{θ′} log π(a_t | s_t; θ′) A(s_t, a_t).   (2)

The gradients computed by the different local workers are applied asynchronously to the global network of a3cNav, leading to unstable training of the agent. In this section, we improve the parameter updates of a3cNav to guarantee its monotonic policy improvement.

Following [22], the expected discounted cost of a policy can be rewritten as

η(π̃) = η(π) + Σ_s ρ_{π̃}(s) Σ_a π̃(a | s) A_π(s, a),   (3)

where π denotes a stochastic policy and π̃ is another policy, and η(π) and η(π̃) are the expected discounted costs of π and π̃, respectively. Here, ρ_{π̃}(s) is the distribution of the state s under π̃, and A_π is the advantage function under π. Equation (3) implies that if we want to reduce η, or at least leave it constant, we should keep the expected advantage Σ_a π̃(a | s) A_π(s, a) ≤ 0 at every state s when the policy is updated from π to π̃. This demonstrates that if we want to reduce the training variance of a3cNav and keep its policy improving monotonically, we must guarantee Σ_a π̃(a | s) A_π(s, a) ≤ 0. However, a3cNav cannot control the change of the expected advantage when the artificial agent learns to navigate in the maze.
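The sign condition from equation (3) can be checked numerically. The sketch below uses randomly generated toy policies and advantages, which are assumptions rather than data from the paper, to compute the per-state expected advantage Σ_a π̃(a | s) A_π(s, a) that an unconstrained asynchronous update does not control.

```python
# Numeric illustration of the condition in equation (3): for each state s, the expected
# advantage sum_a pi_new(a|s) * A_pi(s, a) determines whether the update pi -> pi_new
# can degrade the objective. All quantities below are randomly generated toy values.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 3

# Hypothetical old/new policies (rows sum to 1) and advantages A_pi(s, a) under the old policy.
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)
advantage = rng.normal(size=(n_states, n_actions))
advantage -= (pi_old * advantage).sum(axis=1, keepdims=True)  # enforce E_{a~pi_old}[A_pi(s, a)] = 0

# Per-state expected advantage of the candidate policy, sum_a pi_new(a|s) * A_pi(s, a).
expected_adv = (pi_new * advantage).sum(axis=1)
print(expected_adv)

# In the paper's cost formulation, monotonic improvement requires this quantity to be
# non-positive at every state; an unconstrained asynchronous update (as in a3cNav) does
# not control its sign, which is what a TRPO/PPO-style surrogate is meant to address.
print("condition holds at every state:", bool((expected_adv <= 0).all()))
```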