Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents

Hindawi Journal of Robotics, Volume 2020, Article ID 8702962, 7 pages. https://doi.org/10.1155/2020/8702962

Research Article

Fanyu Zeng and Chen Wang
School of Computer Science and Engineering, Center for Robotics, University of Electronic Science and Technology of China, Chengdu 611731, China
Correspondence should be addressed to Fanyu Zeng; zengfanyu_cs@163.com

Received 12 February 2020; Revised 10 August 2020; Accepted 21 September 2020; Published 15 October 2020
Academic Editor: Weitian Wang

Copyright © 2020 Fanyu Zeng and Chen Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Vanilla policy gradient methods suffer from high variance, leading to unstable policies during training, where the policy's performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. A variant navigation method (asynchronous proximal policy optimization navigation, appoNav) is presented that guarantees monotonic policy improvement during policy optimization. Our experiments are conducted in DeepMind Lab, and the results show that artificial agents trained with appoNav perform better than those trained with the compared algorithm.

1. Introduction

Navigation in an unstructured environment is one of the most important abilities for mobile robots and artificial agents [1–3]. Traditional methods mainly divide navigation into several parts [4]: simultaneous localization and mapping (SLAM) [5–7], path planning [8], and semantic segmentation [9, 10]. These methods are not end-to-end algorithms: each part is a challenging research subject in its own right, and the fusion of the parts often leads to large computational errors. To reduce the fusion error, we focus on end-to-end navigation based on deep reinforcement learning, where navigational abilities emerge as a byproduct of an artificial agent learning a policy through reward maximization.

With the fast development of deep learning [11–14], a variety of DRL architectures have been proposed [2]. Mnih et al. [15] presented advances in training deep neural networks to develop the deep Q-network (DQN), which can learn successful policies directly from high-dimensional image inputs using end-to-end reinforcement learning. On-policy reinforcement learning methods such as actor-critic (AC) [16, 17] were proposed, in which the actor is the policy and the critic is the baseline. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents. Researchers can construct navigation agents based on these DRL algorithms. However, vanilla policy gradient methods have poor data efficiency [19], which leaves navigation agents suffering from high variance and unstable policies.

In this work, we take A3C as an example to show how to guarantee monotonic policy improvement. The training environment is DeepMind Lab [20], a first-person 3D virtual environment designed for research and development of general artificial intelligence. DeepMind Lab can be used to study how autonomous artificial agents learn complex tasks in large, partially observed, and visually diverse worlds, rendered with rich science fiction-style visuals.
The available actions are to look around and move in the 3D virtual world, and example tasks include navigation in different mazes. Mirowski et al. [21] proposed a DRL navigation method based on A3C [18], augmented with auxiliary learning targets, to train artificial agents to navigate in DeepMind Lab. For ease of expression, we call DRL navigation using A3C a3cNav.

In this paper, the policy optimization issues of navigation based on the vanilla policy gradient are analyzed; this type of navigation cannot control the change of the expected advantage when an artificial agent learns to navigate in a maze. Based on the navigation techniques presented in [21], we show how to reduce training variance and obtain higher reward when an artificial agent interacts with an environment. Inspired by [19, 22], we adjust the policy update process of the navigation method in [21] to guarantee monotonic improvement of the navigation policy. Experimental results show that an artificial agent trained with appoNav learns a better navigation policy in DeepMind Lab and has a lower standard deviation of reward than a3cNav.

2. Related Work

Traditional navigation, which is model-based, includes simultaneous localization and mapping (SLAM) [5, 7, 23], path planning [8, 24], and semantic segmentation [9]. Each of these parts is a challenging research area, and their fusion often leads to large computation error. Moreover, model-based navigation needs to model the environment effectively in dynamic and complex scenes, which severely affects navigation performance.

With recent advances in DRL, many navigation methods based on DRL have been proposed [2].
DRL navigation, which is end to end, avoids the computation error caused by the fusion steps of traditional navigation. Mirowski et al. [21] addressed navigation via auxiliary depth prediction and loop-closure classification tasks. Jaderberg et al. [25] also used auxiliary tasks for navigation and incorporated A3C with control and prediction tasks, including pixel control and reward prediction. Ha and Schmidhuber [26] used DRL to construct a world model and, by feeding features extracted from the world model to an agent, applied it to a car navigation task. Bruce et al. [27] leveraged an interactive world model based on DRL built from a single traversal of the environment and utilized a pretrained visual feature encoder to demonstrate successful zero-shot transfer under real-world environmental variations without fine-tuning. Banino et al. [28] proposed a vector-based navigation method that fuses DRL with grid-like representations in the artificial agent. When these DRL navigation agents interact with environments, the state sequences of each interaction change a lot, leading to large fluctuations in rewards. Therefore, these DRL navigation methods suffer from high variance and have unstable policies during training.

When a DRL agent interacts with its environment, the state sequences of each interaction change a lot, leading to fluctuations in rewards. Therefore, DRL algorithms such as DQN and A3C exhibit unstable fluctuations during training. A natural question is whether such fluctuations can be reduced while maintaining a steady improvement in the policy. Schulman et al. [22] proposed trust region policy optimization (TRPO) to achieve monotonic policy improvement. Furthermore, Schulman et al. [19] proposed proximal policy optimization (PPO) to simplify the calculation of TRPO. In addition, Heess et al. [30] proposed a distributed implementation of PPO, called distributed PPO. Besides a gradient-update process similar to that of A3C, distributed PPO includes various tricks, such as normalizations (observation normalization, reward reshaping normalization, and per-batch normalization of the advantages), sharing of algorithm parameters across local workers, and an additional trust region constraint. These tricks make the computation of distributed PPO more complex than that of appoNav.

3. Background

3.1. Reinforcement Learning. We consider the standard reinforcement learning setting where an artificial agent interacts with an environment over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ from the environment and outputs an action $a_t$ according to its learned policy $\pi$. In return, the environment gives the agent the next state $s_{t+1}$ and a reward $r_t$. The goal of reinforcement learning is to maximize the accumulated reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, which is a discounted sum of rewards. The action-value function $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a]$ is the expected return after taking action $a$ from state $s$ under policy $\pi$. The value function $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ is the expected return from state $s$.

In policy-based methods, let $\pi(a \mid s; \theta)$ be a policy with parameters $\theta$, which is updated by performing gradient ascent on $\mathbb{E}[R_t]$. Policy gradient algorithms adjust the policy by updating the parameters $\theta$ in the direction $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) R_t$, which is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$. To reduce the variance of this estimate, Williams [29] subtracted a learned function called the baseline $b_t(s_t)$ from the return, so the improved gradient becomes $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)(R_t - b_t(s_t))$. Since $b_t(s_t) \approx V^{\pi}(s_t)$, the term $R_t - b_t(s_t)$ can be seen as an estimate of the advantage of action $a_t$ in state $s_t$. The numerical value of $Q^{\pi}(s, a)$ equals the value of $R_t$; hence, the advantage function can be rewritten as $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$. This method is called the actor-critic (AC) architecture, where the actor is the policy $\pi$ and the critic is the baseline $b_t$ [16, 17]. Mnih et al. [18] presented asynchronous variants of AC algorithms, termed asynchronous advantage actor-critic (A3C), and showed that parallel actor-learners have a stabilizing effect on training artificial agents.
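As a concrete illustration of the quantities above, the following sketch (not from the paper; the reward and value arrays are made-up toy numbers) computes the discounted return $R_t$ and the baseline-subtracted advantage estimate $R_t - V(s_t)$ for a short trajectory:

```python
import numpy as np

# Illustrative sketch: discounted return R_t and baseline-subtracted
# advantage R_t - V(s_t), the policy-gradient signal described in Section 3.1.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns

rewards = np.array([0.0, 0.0, 1.0, 0.0, 10.0])   # toy reward sequence r_t
values = np.array([2.0, 2.5, 3.0, 4.0, 5.0])     # toy baseline V(s_t)
advantages = discounted_returns(rewards) - values  # estimate of A(a_t, s_t)
print(advantages)
```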
3.2. NavA3C+D1D2. In this work, we use the NavA3C+D1D2 architecture [21] shown in Figure 1, which includes 2 CNNs and 2 LSTMs. NavA3C+D1D2 has 4 inputs: the current RGB image $x_t$, the previous reward $r_{t-1}$, the previous action $a_{t-1}$, and the current velocity $v_t$. The 2 CNNs act as the encoder for the RGB image $x_t$, and the first LSTM makes associations between the reward $r_{t-1}$ and the visual observations $x_t$ that are provided as context to the second LSTM, from which the policy $\pi(a_t \mid s_t; \theta)$ and the value $V(s_t; \theta_v)$ are computed. Artificial agents based on this architecture try to maximize the cumulative reward $R_t$ during their interaction with the maze and to minimize the auxiliary depth losses $L_{\mathrm{Depth1}}$ and $L_{\mathrm{Depth2}}$. Finally, the agent learns how to navigate in DeepMind Lab. For ease of expression, we rename NavA3C+D1D2 as a3cNav.

Figure 1: a3cNav architecture. The image $x_t$ is the input; the two-layer CNN followed by a fully connected layer outputs the depth prediction $D_1$, and the two-layer stacked LSTM outputs the depth prediction $D_2$, the policy $\pi$, and the value $V$. In this auxiliary-task design, the first LSTM receives only the reward, while the velocity $v_t$ and the previously selected action $a_{t-1}$ are fed into the second LSTM.

a3cNav is based on the A3C framework into which unsupervised auxiliary tasks are incorporated. Therefore, its loss function includes the A3C loss $L_{\mathrm{A3C}}$ and the losses of the auxiliary tasks. a3cNav is optimized as follows:

$$L_{\mathrm{a3cNav}}(\theta) = L_{\mathrm{A3C}} + \lambda_{\mathrm{Depth1}} L_{\mathrm{Depth1}} + \lambda_{\mathrm{Depth2}} L_{\mathrm{Depth2}}, \quad (1)$$

where $\lambda_{\mathrm{Depth1}}$ and $\lambda_{\mathrm{Depth2}}$ are weighting terms on the individual loss components.

The global parameters $\theta$ of a3cNav are updated in multithreaded environments, and $\theta$ is copied to the local worker parameters $\theta'$. The local worker of a3cNav interacts with the maze, and the policy gradients with respect to $\theta'$ and the value gradients with respect to $\theta'_v$ are computed from the policy loss and the value loss. The gradient for the parameter update is proportional to the advantage function $A_t$. Equation (2) shows the calculation of the gradients:

$$d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi\big(a_t \mid s_t; \theta'\big)\big(R_t - V(s_t; \theta'_v)\big) + \beta \nabla_{\theta'} H\big(\pi(s_t; \theta')\big),$$
$$d\theta_v \leftarrow d\theta_v + \frac{\partial \big(R_t - V(s_t; \theta'_v)\big)^2}{\partial \theta'_v}, \quad (2)$$

where $H(\pi(s_t; \theta'))$ is the entropy of the policy $\pi$, which improves exploration by discouraging premature convergence to suboptimal deterministic policies. Then, asynchronous updates of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$ are applied to the global network.
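As a rough illustration of how the composite loss in equation (1) and the actor-critic terms behind equation (2) could be assembled, consider the following sketch. It is not the authors' implementation; the function and argument names, and the default values of beta, lambda_d1, and lambda_d2, are assumptions.

```python
import tensorflow as tf

# Minimal sketch of the a3cNav loss composition, under assumed tensor names.
def a3cnav_loss(log_prob, advantage, value, returns, entropy,
                depth_loss_1, depth_loss_2,
                beta=0.01, lambda_d1=1.0, lambda_d2=1.0):
    # Policy (actor) term: log pi(a_t|s_t) weighted by the advantage,
    # plus the entropy bonus H(pi) that discourages premature convergence.
    policy_loss = -tf.reduce_mean(
        log_prob * tf.stop_gradient(advantage) + beta * entropy)
    # Value (critic) term: squared error (R_t - V(s_t; theta_v))^2.
    value_loss = 0.5 * tf.reduce_mean(tf.square(returns - value))
    l_a3c = policy_loss + value_loss
    # Equation (1): L_a3cNav = L_A3C + lambda_1 * L_Depth1 + lambda_2 * L_Depth2.
    return l_a3c + lambda_d1 * depth_loss_1 + lambda_d2 * depth_loss_2
```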
4. Approach

4.1. Monotonic Policy Improvement. The artificial agent interacts randomly with the environment, which in turn gives high-dimensional images to the agent. Hence, a3cNav has poor data efficiency and robustness. In addition, a complex navigation environment that sends rapidly changing images to the artificial agent aggravates the variance and instability of training. In detail, each local worker of a3cNav interacts with the maze, and gradients with large variance are applied to the global network of a3cNav, leading to unstable training of the agent. In this section, we improve the parameter updates of a3cNav to guarantee monotonic policy improvement.

Following [22], the performance of a policy can be written as

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a), \quad (3)$$

where $\pi$ denotes a stochastic policy and $\tilde{\pi}$ is another policy. $\eta(\pi)$ and $\eta(\tilde{\pi})$ are the expected discounted costs of $\pi$ and $\tilde{\pi}$, respectively. Here, $\rho_{\tilde{\pi}}(s)$ is the distribution of the state $s$ under $\tilde{\pi}$, and $A_{\pi}$ is the advantage function of $\pi$. Equation (3) implies that if we want to reduce $\eta$ or leave it constant, we should keep the expected advantage $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \leq 0$ at every state $s$ during a policy update $\pi \rightarrow \tilde{\pi}$. This demonstrates that if we want to reduce the training variance of a3cNav and keep its policy improvement monotonic, we must guarantee $\sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a) \leq 0$. However, a3cNav cannot control the change of the expected advantage when the artificial agent learns to navigate in the maze.

To make the policy improvement monotonic, Schulman et al. [22] proposed a trust region constraint over the policy update, as shown in equation (4):

$$\max_{\theta} \; \mathbb{E}_t\!\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t\right], \quad \text{subject to } \mathbb{E}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\big]\right] \leq \delta. \quad (4)$$

Equation (4) is relatively complex and is not compatible with architectures that share parameters between the policy function and the value function, or with auxiliary tasks [19]. The policy and value networks of a3cNav share the same network, and a3cNav uses auxiliary depth prediction. Therefore, TRPO cannot be used in a3cNav.

PPO [19] improves on TRPO by using only first-order optimization and replaces the constraint with the clipped surrogate objective in equation (5):

$$\mathbb{E}_t\!\left[\min\!\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t,\; \mathrm{clip}\!\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},\, 1 - \varepsilon,\, 1 + \varepsilon\right)\hat{A}_t\right)\right]. \quad (5)$$

Hence, PPO is a first-order optimization method and is compatible with parameter sharing and auxiliary tasks.

4.2. appoNav. To make the improvement of the navigation policy monotonic, we incorporate the features of PPO into the local workers of a3cNav. In each thread, the improved local policy tends to improve monotonically, and the new local gradients are applied to the global network, so that the whole network improves monotonically. As the navigation method is based on the monotonic policy improvement of PPO, we call it appoNav.

Assume that the global network has shared parameter vector $\theta$ and each local worker has parameter vector $\theta'$. Equation (6) is the policy optimization loss of A3C [18]:

$$L_{\mathrm{A3C}} = \log \pi\big(a_t \mid s_t; \theta\big)\hat{A}_t + \beta H\big(\pi(s_t; \theta)\big). \quad (6)$$

When incorporated into the local worker of a3cNav, the loss function takes the form of equation (5) with the entropy of the policy added, and it is rewritten for the local workers as

$$\mathbb{E}_t\!\left[\min\!\left(\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta'_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t,\; \mathrm{clip}\!\left(\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta'_{\mathrm{old}}}(a_t \mid s_t)},\, 1 - \varepsilon,\, 1 + \varepsilon\right)\hat{A}_t\right)\right] + \beta H\big(\pi(s_t; \theta')\big). \quad (7)$$

Equation (7) is the policy update of the local worker of a3cNav, that is, appoNav. Each local worker has lower variance than before and applies the new gradients to the global network for the policy update. As a result, the whole policy generated by appoNav has lower variance and more stable training performance.
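A minimal sketch of the local-worker objective in equation (7) is shown below: the PPO clipped surrogate combined with the A3C entropy bonus. The argument names and the default epsilon and beta values are assumptions, not taken from the paper.

```python
import tensorflow as tf

# Sketch of the appoNav local-worker policy loss (equation (7)).
def appo_policy_loss(log_prob, old_log_prob, advantage, entropy,
                     epsilon=0.2, beta=0.01):
    # Probability ratio pi_theta'(a_t|s_t) / pi_theta'_old(a_t|s_t).
    ratio = tf.exp(log_prob - tf.stop_gradient(old_log_prob))
    clipped = tf.clip_by_value(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = tf.minimum(ratio * advantage, clipped * advantage)
    # Negated because optimizers minimize, while equation (7) is maximized.
    return -tf.reduce_mean(surrogate + beta * entropy)
```

The only change relative to the A3C policy loss in equation (6) is the clipped ratio, which bounds how far each local update can move the policy and thus keeps the gradients sent to the global network from varying too much between iterations.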
5. Experiments

5.1. Experimental Settings. We implement our algorithm in TensorFlow and train it on an Nvidia GeForce GTX Titan X GPU and an Intel Xeon E5-2687W v2 @ 3.4 GHz CPU.

The proposed method is evaluated in DeepMind Lab environments [20]. The action space in DeepMind Lab has 8 actions: the agent can rotate in small increments, accelerate forward, backward, or sideways, or induce rotational acceleration while moving. Reward encourages the agent to learn navigation; a reward is obtained when the artificial agent reaches a goal from a random start location and orientation. If the agent reaches the goal, a new episode starts and the same interaction restarts. Fruit represents reward in DeepMind Lab: apples are worth 1 point, strawberries 2 points, and goals 10 points.

appoNav is evaluated by training the agent in the stairway_to_melon and nav_maze_static_01 environments of DeepMind Lab. For ease of expression, we refer to stairway_to_melon as the stairway maze and nav_maze_static_01 as the static01 maze. In each reward curve, the blue curve stands for a3cNav and the orange curve for appoNav. For experimental analysis, we run 2500 episodes for the stairway maze and 7800 episodes for the static01 maze.

5.2. Experimental Results and Analysis. Table 1 shows the images that the artificial agent sees in the stairway maze; we stochastically select 3 episodes from time 600 to 2500 with interval 100, which show three different states at the same time step in different episodes. The artificial agents receive different images and do not get stuck in one place, which demonstrates that the agents learn to navigate in the stairway maze.

Table 1: The states that the artificial agent sees in stairway_to_melon (image frames at times 600 to 2500, interval 100, for three episodes; images omitted).

Figure 2 shows the reward achieved by the artificial agent in stairway_to_melon; appoNav obtains higher reward than a3cNav. In addition, we calculate the standard deviation (std) of the reward curve. From Table 2, the reward std of appoNav and a3cNav is 27.24 and 30.16, respectively, which shows that the learning process of the former is more stable than that of the latter.

Figure 2: Reward achieved by the artificial agent in stairway_to_melon (episodes 0 to 2500; blue: a3cNav, orange: appoNav).

Table 2: Standard deviation of the reward in stairway_to_melon.
Algorithm    Standard deviation
a3cNav       30.16
appoNav      27.24

The reason why our method converges faster is that the local worker of appoNav generates a more stable policy with monotonic improvement when it interacts with the stairway maze. During the training iterations, improved accumulated gradients are applied to the parameter update of appoNav, which makes appoNav more stable than a3cNav.

For further verification of appoNav's effectiveness, we test our agent in the static01 maze, which is more complex than the stairway maze. Because the agent needs more time to converge, we stochastically select 3 episodes from time 1000 to 4800 with interval 200, as shown in Table 3.

Table 3: The states that the artificial agent sees in nav_maze_static_01 (image frames at times 1000 to 4800, interval 200, for three episodes; images omitted).

Figure 3 shows the reward achieved by the artificial agent in nav_maze_static_01; it demonstrates that appoNav performs better than a3cNav and obtains higher reward. Table 4 shows that the std of a3cNav is 28.99 and the std of appoNav is 24.79. The policy learned by appoNav is more stable than the policy learned by a3cNav.

Figure 3: Reward achieved by the artificial agent in nav_maze_static_01 (episodes 0 to 8000; blue: a3cNav, orange: appoNav).

Table 4: Standard deviation of the reward in nav_maze_static_01.
Algorithm    Standard deviation
a3cNav       28.99
appoNav      24.79

Because appoNav uses better gradient ascent steps to update each policy, the artificial agent trained with appoNav learns stronger navigation ability, as each local worker produces a more stable policy in the complex maze.
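For completeness, the following sketch shows how the reward-curve standard deviations reported in Tables 2 and 4 can be computed from logged per-episode rewards. The arrays are placeholders, not the actual experimental logs.

```python
import numpy as np

# Placeholder per-episode reward logs; the real curves come from the
# DeepMind Lab runs described above.
a3cnav_rewards = np.array([12.0, 35.0, 8.0, 50.0, 41.0])
apponav_rewards = np.array([20.0, 38.0, 25.0, 52.0, 47.0])

# Standard deviation of each reward curve, as in Tables 2 and 4.
print("a3cNav reward std:", np.std(a3cnav_rewards))
print("appoNav reward std:", np.std(apponav_rewards))
```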
6. Conclusion

Visual navigation based on vanilla policy gradient methods suffers from high variance and instability during training, where the navigation performance fluctuates greatly between iterations. We analyze the reason why visual navigation suffers from this issue and improve its policy update to guarantee monotonic policy improvement. The improved method, appoNav, has lower standard deviation and obtains higher reward. In short, appoNav can learn a better navigation policy.

Data Availability

The raw data required to reproduce these findings are available to download from https://github.com/deepmind/lab.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (U1813202, 61773093, and 62003381), the National Key R&D Program of China (2018YFC0831800), Research Programs of the Sichuan Science and Technology Department (17ZDYF3184), and Important Science and Technology Innovation Projects in Chengdu (2018-YF08-00039-GX).

References

[1] G. N. DeSouza and A. C. Kak, "Vision for mobile robot navigation: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 237–267, 2002.
[2] F. Zeng, C. Wang, and S. S. Ge, "A survey on visual navigation for artificial agents with deep reinforcement learning," IEEE Access, vol. 8, pp. 135426–135442, 2020.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, Cambridge, MA, USA, 2005.
[4] C. Cadena, L. Carlone, H. Carrillo et al., "Past, present, and future of simultaneous localization and mapping: toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[5] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: large-scale direct monocular SLAM," in Proceedings of the European Conference on Computer Vision, pp. 834–849, Springer, Zurich, Switzerland, September 2014.
[6] H. Lategahn, A. Geiger, and B. Kitt, "Visual SLAM for autonomous ground vehicles," in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1732–1737, IEEE, Shanghai, China, June 2011.
[7] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[8] S. S. Ge and Y. J. Cui, "New potential functions for mobile robot path planning," IEEE Transactions on Robotics and Automation, vol. 16, no. 5, pp. 615–620, 2000.
[9] Y. Zhang, H. Chen, Y. He, M. Ye, X. Cai, and D. Zhang, "Road segmentation for all-day outdoor robot navigation," Neurocomputing, vol. 314, pp. 316–325, 2018.
[10] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243, Las Vegas, NV, USA, June 2016.
[11] X. Li, M. Ye, Y. Liu, and C. Zhu, "Adaptive deep convolutional neural networks for scene-specific object detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2538–2551, 2017.
[12] X. Li, M. Ye, Y. Liu, F. Zhang, D. Liu, and S. Tang, "Accurate object detection using memory-based models in surveillance scenes," Pattern Recognition, vol. 67, pp. 73–84, 2017.
[13] J. Li, K. Lu, Z. Huang, L. Zhu, and H. T. Shen, "Transfer independently together: a generalized framework for domain adaptation," IEEE Transactions on Cybernetics, vol. 49, no. 6, pp. 2144–2155, 2018.
[14] J. Li, Y. Wu, and K. Lu, "Structured domain adaptation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, pp. 1700–1713, 2016.
[15] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 2018.
[17] T. Degris, P. M. Pilarski, and R. S. Sutton, "Model-free reinforcement learning with continuous action in practice," in Proceedings of the 2012 American Control Conference (ACC), pp. 2177–2182, IEEE, Montreal, Canada, June 2012.
[18] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York, NY, USA, June 2016.
[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, https://arxiv.org/abs/1707.06347.
[20] C. Beattie, J. Z. Leibo, D. Teplyashin et al., "DeepMind Lab," 2016, https://arxiv.org/abs/1612.03801.
[21] P. Mirowski, R. Pascanu, F. Viola et al., "Learning to navigate in complex environments," in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
[22] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.
[23] S. S. Ge, Q. Zhang, A. T. Abraham, and B. Rebsamen, "Simultaneous path planning and topological mapping (SP2ATM) for environment exploration and goal oriented navigation," Robotics and Autonomous Systems, vol. 59, no. 3-4, pp. 228–242, 2011.
[24] S. S. Ge, X. Lai, and A. A. Mamun, "Boundary following and globally convergent path planning using instant goals," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 35, no. 2, pp. 240–254, 2005.
[25] M. Jaderberg, V. Mnih, W. M. Czarnecki et al., "Reinforcement learning with unsupervised auxiliary tasks," 2016, https://arxiv.org/abs/1611.05397.
[26] D. Ha and J. Schmidhuber, "World models," 2018, https://arxiv.org/abs/1803.10122.
[27] J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, and M. Milford, "One-shot reinforcement learning for robot navigation with interactive replay," in Proceedings of the Conference and Workshop on Neural Information Processing Systems, Long Beach, CA, USA, December 2017.
[28] A. Banino, C. Barry, B. Uria et al., "Vector-based navigation using grid-like representations in artificial agents," Nature, vol. 557, no. 7705, pp. 429–433, 2018.
[29] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[30] N. Heess, D. TB, S. Sriram et al., "Emergence of locomotion behaviours in rich environments," 2017, https://arxiv.org/abs/1707.02286.
