Simulation Design of a Live Working Manipulator for Patrol Inspection in Power Grid

Hindawi Journal of Robotics, Volume 2022, Article ID 7318090, 8 pages. https://doi.org/10.1155/2022/7318090

Research Article

Simulation Design of a Live Working Manipulator for Patrol Inspection in Power Grid

Tao Xie (1), Zhengyu Li (2), Yongyun Zhang (2), Bin Yuan (2), and Xiao Liu (2)

(1) State Grid Shanxi Electric Power Company, Taiyuan, Shanxi 030021, China
(2) State Grid UHV Transformation Co. of SEPC, Taiyuan, Shanxi 030032, China

Correspondence should be addressed to Tao Xie; 220192216007@ncepu.edu.cn

Received 15 August 2022; Revised 24 August 2022; Accepted 26 August 2022; Published 14 September 2022

Academic Editor: Shahid Hussain

Copyright © 2022 Tao Xie et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract. The distribution line network is the electric power infrastructure that directly faces the users. It has large coverage and a complex topology, and its operational safety directly affects the stability and reliability of the power supply system, making it a key link in ensuring safe power supply and reliable residential electricity consumption. To realize autonomous obstacle avoidance and navigation of a live working manipulator for inspection and maintenance of power grid equipment, an intelligent control method for mobile manipulators combining SumTree-weighted sampling with the deep deterministic policy gradient (DDPG) is proposed. First, the traditional DDPG algorithm is improved by optimizing the Q-learning action value function to obtain a better control strategy, and a weighted sampling technique is used to assign a priority to each sample in the replay buffer, which improves the learning speed and accelerates convergence. The corresponding environmental state space is designed, and simulation experiments are conducted to verify the proposed manipulator control method. Simulation results demonstrate that the proposed method outperforms the traditional DDPG and DQN methods in obstacle avoidance and navigation tasks, with faster convergence, better path planning ability, and lower offset cost, providing theoretical and technical references for realizing fully autonomous power grid inspection operations.

1. Introduction

Mechanization, automation, and intelligentization of live working operation are the development trend and a vital technological direction for realizing intelligent inspection and maintenance of the power grid and safe live working operation [1]. At present, research on the automation of substation equipment in the power grid has been carried out worldwide, pilot applications have been conducted, and certain results have been achieved, which can replace or assist the manual completion of some typical inspection and maintenance projects [2]. This kind of technology has been innovatively developed and gradually applied in recent years. It enriches the technical means of live working power grid operation, improves the safety and automation level of the operation, and has good development prospects [3].

Since the 1980s, the United States, Canada, Spain, France, Japan, and other developed countries have successively carried out research on live working robots, for example at the Japan Kyushu Electric Power Company and the State Grid Corporation of China. According to their degree of automation, these robots can be divided into three categories: manual remote control, remote teleoperation, and autonomous operation [4]. Manipulator control is based on an automatic control algorithm, which automatically searches the locations of objects in the space and accurately identifies the obstacles in the travel path to plan the most reasonable motion trajectory [5]. Deep learning is an advanced stage in the development of artificial intelligence and one of its most important and popular research areas, but the robot passes through different states during obstacle avoidance and navigation, which raises two core problems: large differences in strategy distribution and sparse positive feedback rewards. The traditional reinforcement learning model cannot explore the space completely, so the strategy easily falls into local minima, and it cannot derive a full-space manipulator obstacle avoidance and navigation control strategy even after long training times and many training rounds [6].

In this article, by deeply analyzing the live working process and action sequence of the mobile manipulator, the traditional deep deterministic policy gradient (DDPG) algorithm is optimized to improve the dimensionality, scalability, and generalization capability of the algorithm, making it more applicable to trajectory planning and obstacle avoidance control of the mobile manipulator and improving the control accuracy of path planning. Specifically, based on the DDPG algorithm, this article adopts a weighted sampling method built on a SumTree data structure instead of uniform sampling, so that successful samples have a higher chance of being learned by the agent, and the Q-learning action value function is optimized and applied to autonomous power grid inspection and maintenance tasks. Simulation results prove that the proposed method enables the manipulator to complete the tasks more accurately and quickly.

The rest of this article is organized as follows: Section 2 introduces the related research. Section 3 explains the basic principles of the proposed method. Section 4 describes the proposed intelligent manipulator control method. Sections 5 and 6 give the experimental setup and the simulation results and discussion. Finally, Section 7 concludes the article.
2. Related Works

Many effective path planning methods have been proposed for mobile robots worldwide, but most of them are not applicable to robotic arms, because manipulators are complex nonlinear systems with challenging characteristics such as high degrees of freedom and coupling between connecting rods, which make planning more difficult [7].

Researchers have explored deep reinforcement learning for grasping, localization, and obstacle avoidance of industrial manipulators with some success. Mnih et al. [8] proposed the Deep Q-Network (DQN) algorithm, which combines neural networks with Q-learning. The model was trained on ATARI 2600 games, and its final performance was much higher than that of humans. Lillicrap et al. [9] proposed the DDPG algorithm and applied it to high-dimensional continuous action spaces. Rusu et al. [10] transferred training results from a simulation environment to a physical robotic arm and achieved results similar to the simulation after only simple training. Li et al. [11] proposed an improved DDPG algorithm in which a success experience pool and a collision experience pool are added to optimize the traditional experience pool. Compared with the traditional DDPG algorithm, the improved DDPG algorithm needs fewer training episodes but achieves better results. Luck et al. [12] combined the DDPG algorithm with a model-based trajectory optimization approach in which the learned deep dynamics model is used to compute the policy gradient and the value function is adopted as a criterion, thus improving training efficiency.

Regarding practical operation of robotic arms, in [13], based on deep reinforcement learning techniques, the optimal grasping point and grasping stability of robotic arms were evaluated under different tasks, in order to grasp objects of different shapes in various tasks and to achieve collision-free autonomous mobile robot navigation. Sangiovanni et al. [14] applied reinforcement learning to the obstacle avoidance and navigation tasks of robot manipulators and achieved obstacle-free trajectories during manipulator movement. However, the model was only trained and tested for a single target point, and full operational space decision-making was not achieved. Wei and Shan [15] proposed treating the three arms of the manipulator as three different agents; the three arms were restricted to remain in the same plane, and the control strategies were set separately to realize full working space decision-making of the robot manipulator. In summary, the above methods do not change the way samples are drawn but still collect samples uniformly from the replay buffer for training, so the agent cannot learn from successful samples efficiently, which leads to long training times [16].

3. Fundamental Principles

3.1. Live Working Manipulator and Kinematic Analysis. Considering that the weight of the end-effector for live working operation is more than 5 kg, this article takes the UR10 robot arm as the research object and carries out its abstract modeling and kinematic analysis, mainly studying the motion state and control strategy of the three-segment arm of the UR10 in the Cartesian coordinate system [17].

3.1.1. Cartesian Coordinate System. In the Cartesian coordinate system, (x, y, z) coordinates can represent any position in three-dimensional space. By constructing multiple Cartesian coordinate systems with the same basic orientation and with {o_1, o_2, o_3, o_4} as origins, the initial and final positions of each endpoint of the manipulator in three-dimensional space can be calculated by accumulating the increments of each segment, as shown in Figure 1.

Figure 1: Diagrams of the manipulator and its simplified model in the Cartesian coordinate system. (a) UR10 manipulator. (b) Simplified model.
3.1.2. Motion Degree of Freedom Analysis. The kinematic analysis of the manipulator used in this article is mainly a forward kinematics problem [18]: given the arm lengths l_1, l_2, and l_3 of the three-segment arm and the rotation angles θ_1, θ_2, θ_3, θ_4, θ_5, and θ_6, solve for the endpoint coordinates of the manipulator in three-dimensional space. Using the z-coordinate axis as the rotation axis and the endpoint of the first segment of the robot arm as the origin, a manipulator model is constructed, and the six rotation angles are defined as six degrees of freedom. θ_1 and θ_2 control the first segment, where θ_1 represents the rotation of l_1 around the z-axis and θ_2 represents the angle of l_1 with the x-y plane; l_1, l_2, and the z-axis are kept in the same plane. θ_3 controls l_2 and represents its angle with l_1; for convenience of calculation, it is expressed as the angle between the second segment and the x-y plane of the o_2 coordinate system. θ_4 ∼ θ_6 control l_3, where θ_4 and θ_5 act like θ_1 and θ_2 to move l_3 at any angle within the o_3 coordinate system, and θ_6 rotates l_3 around its own arm axis without affecting the endpoint coordinates. The coordinates of each node of the manipulator can be calculated as

Δl_1 = (l_1 cos θ_2 cos θ_1, l_1 cos θ_2 sin θ_1, l_1 sin θ_2),
Δl_2 = (l_2 cos θ_3 cos θ_1, l_2 cos θ_3 sin θ_1, l_2 sin θ_3),
Δl_3 = (l_3 cos θ_5 cos θ_4, l_3 cos θ_5 sin θ_4, l_3 sin θ_5),
o_2 = o_1 + Δl_1,
o_3 = o_2 + Δl_2,
o_4 = o_3 + Δl_3.    (1)
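As a minimal illustration of equation (1), the following Python sketch accumulates the three segment increments to obtain the joint origins o_2, o_3, and o_4. The function name and the example link lengths are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def forward_kinematics(o1, lengths, angles):
    """Accumulate segment increments as in equation (1).

    o1      -- origin of the first segment, shape (3,)
    lengths -- (l1, l2, l3)
    angles  -- (theta1, ..., theta6); theta6 spins the last segment
               about its own axis and does not move the endpoint,
               so it is unused here.
    """
    l1, l2, l3 = lengths
    t1, t2, t3, t4, t5, _ = angles

    dl1 = np.array([l1 * np.cos(t2) * np.cos(t1),
                    l1 * np.cos(t2) * np.sin(t1),
                    l1 * np.sin(t2)])
    dl2 = np.array([l2 * np.cos(t3) * np.cos(t1),
                    l2 * np.cos(t3) * np.sin(t1),
                    l2 * np.sin(t3)])
    dl3 = np.array([l3 * np.cos(t5) * np.cos(t4),
                    l3 * np.cos(t5) * np.sin(t4),
                    l3 * np.sin(t5)])

    o2 = o1 + dl1
    o3 = o2 + dl2
    o4 = o3 + dl3          # end-effector position
    return o2, o3, o4

# Example with illustrative (not paper-specified) link lengths in metres.
o2, o3, o4 = forward_kinematics(np.zeros(3), (0.6, 0.57, 0.12),
                                np.radians([30, 45, 10, 20, -15, 0]))
```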
3.2. Deep Deterministic Policy Gradient. In a standard reinforcement learning setting, an agent interacts with the environment with the ultimate goal of maximizing its return. This interactive process is formally described as a Markov decision process (MDP) [19], given by the quadruple (S, A, R, P), where S is the state space, A is the action space, R: S × A → ℝ is the reward function, and P: S × A × S → [0, 1] is the transition probability. In this environment, the agent learns a strategy π: S → A that maximizes the return

R_0 = Σ_{i=0}^{T} r(s_i, a_i),    (2)

where T is the number of steps taken before the interaction ends and r(s_i, a_i) denotes the gain obtained by executing a_i in state s_i. Typically, the long-term gain is discounted by a parameter γ:

R_γ = Σ_{i=0}^{T} γ^i r(s_i, a_i),    (3)

where γ ∈ (0, 1). The long-term benefit of performing action a_t in state s_t is represented by the action value function

Q(s_t, a_t) = E[R_γ | s = s_t, a = a_t] = E[ Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) ].    (4)

The optimal action value function is usually found using the Bellman equation [20]:

Q*(s_t, a_t) = E[ r(s_t, a_t) + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ].    (5)

However, this approach is only suitable for discrete action and state spaces. To apply reinforcement learning to problems with continuous action and state spaces, DDPG uses two deep neural networks: the action value network Q(s_t, a_t | θ^Q) and the action network μ(s_t | θ^μ), where θ^Q and θ^μ are the network parameters. The action network μ(s_t | θ^μ) is a mapping from the state space to the action space and directly generates the desired action from the state. The action value network Q(s_t, a_t | θ^Q) approximates the action value function and provides gradients for training the action network.

The action value network is trained by minimizing the loss function

L(θ^Q) = ( r(s_t, a_t) + γ Q′(s_{t+1}, a_{t+1} | θ^{Q′}) − Q(s_t, a_t | θ^Q) )²,    (6)

where Q′ is the target value network whose weights are synchronized from the Q-network. The update of the action network parameters uses the policy gradient algorithm, and the gradient update direction can be expressed as

∇_{θ^μ} Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} = ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t},    (7)

where the control strategy μ represents the actions of the manipulator and the Q-network is the action value function.

Training these deep neural networks requires the input data to be independent and identically distributed, whereas reinforcement learning under a Markov decision process collects data sequentially and does not satisfy this requirement. Therefore, the experience replay of DQN (deep Q-network) is introduced to break the data correlation [21]. After a certain amount of training data has been stored in the experience pool, data are drawn from the replay buffer by uniform sampling to train the Q-network.
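To make the actor-critic update of equations (6) and (7) concrete, a minimal PyTorch sketch of one DDPG gradient step is given below. The layer sizes, optimizer settings, and state/action dimensions are assumptions for illustration, not the configuration used in the paper.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 12, 6, 0.99   # illustrative dimensions

actor = nn.Sequential(nn.Linear(state_dim, 500), nn.ReLU(),
                      nn.Linear(500, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 500), nn.ReLU(),
                       nn.Linear(500, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next):
    """One update on a batch of transitions; r has shape (batch, 1)."""
    # Critic: minimize the squared TD error of equation (6).
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of equation (7)
    # by minimizing the negative Q-value of the actor's own action.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```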
3.3. Design and Optimization of the Action Value Function. The action value function is an important basis for judging the quality of the manipulator's current strategy. The input of the Q-network contains the current state and action of the manipulator, and a neural network is used to fit the Q-function. The Q-learning algorithm is used to optimize the Q-network together with the policy network, and the objective can be written as the temporal-difference error

δ_t = r_t + γ Q(s_{t+1}, μ(s_{t+1})) − Q(s_t, a_t).    (8)

In addition, target networks Q′ and μ′ are set for the Q-network and the policy network, respectively:

δ_t = r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ′)) − Q(s_t, a_t).    (9)

The Q-network generates delayed errors when its parameters are updated during training; therefore, the following soft updates are used:

ω′ ← τω + (1 − τ)ω′,
θ′ ← τθ + (1 − τ)θ′,    (10)

where ω and ω′ are the parameters of the Q-network and its target network, respectively, and θ and θ′ are the parameters of the manipulator strategy network and its target network, respectively.
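Continuing the sketch above, the following Python excerpt computes the temporal-difference error of equation (9), which is later reused as the sample priority δ_t, and performs the soft target update of equation (10). The update rate τ is an assumed value, not one reported in the paper.

```python
tau = 0.005  # soft-update rate; illustrative value, not given in the paper

def td_error(s, a, r, s_next):
    """Temporal-difference error of equation (9); also used later as the
    sample priority delta_t stored with each transition."""
    with torch.no_grad():
        q_target = r + gamma * critic_target(
            torch.cat([s_next, actor_target(s_next)], dim=1))
        q_now = critic(torch.cat([s, a], dim=1))
    return (q_target - q_now).squeeze(-1)

def soft_update(net, target_net):
    """Polyak averaging of equation (10)."""
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```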
4. Control Method of the Patrol Manipulator Combining SumTree-Weighted Sampling and DDPG

A schematic diagram of the proposed algorithm is given in Figure 2. The DDPG algorithm combined with SumTree-weighted sampling is explained in detail below.

Figure 2: Algorithm diagram.

The data (s_t, a_t, s_{t+1}, r_t) obtained from each interaction of the manipulator with the environment are stored in the replay buffer. The traditional DDPG algorithm treats all samples in the replay buffer as equally valuable for network training and extracts training samples by uniform random sampling. However, the samples in the replay buffer differ greatly in their value for network training, and in the actual training of manipulator trajectory planning there are few successful cases and many failures. Therefore, if uniform random sampling is adopted, successful samples are rarely extracted.

SumTree uses a binary tree structure to store and access data, and the proposed method applies it to the experience replay of the DDPG algorithm. The difference between the target Q-value and the real Q-value is used in the DDPG algorithm to update the parameters of the strategy network and the value network; a larger difference means the parameters are less accurate, that is, the corresponding samples need to be trained more. Firstly, the SumTree is initialized, its capacity is defined, and the initial state s is set as the first current state. Then, the state s_t is fed to the online actor network and the policy π is computed to obtain a_t. Finally, the action a_t is executed to obtain the reward r_t and the new state s_{t+1}.

In this article, the loss value resulting from the difference between the target Q-value and the real Q-value is used as the priority criterion, expressed as

P(i) = p_i^β / Σ_k p_k^β,    (11)

where P(i) is the probability that the ith tuple of training samples is sampled; β is a constant, and the larger β is, the greater the weight of the priority; and k is the total number of samples in the replay buffer. The tuple stored in the replay buffer is extended to (s_t, a_t, s_{t+1}, r_t, δ_t), compared with the original (s_t, a_t, s_{t+1}, r_t). The structure of the SumTree is shown in Figure 3.

Figure 3: SumTree structure.

As shown in the figure, the leaf nodes of the SumTree store the priorities of the samples, each node has a weight, and the weight of a parent node equals the sum of the weights of its two children, accumulating up to the root node. If the capacity of the replay buffer is k, the number of nodes is 2k − 1 and the value of the root node is ψ_total = Σ_k ψ_k.

After the SumTree is established, the data in the replay buffer are sampled as follows: first, a weight ψ is sampled from [0, ψ_total]; then the comparison starts from the root node, proceeding from top to bottom and from left to right. If the selected weight ψ is less than or equal to the weight of the left child, the left child is taken; if ψ is greater than the weight of the left child, the left child's weight is subtracted from ψ, the new value is assigned to ψ, and the right child is taken as the new node. The descent continues until the current node is a leaf, at which point its data are extracted and the search ends. In this way, and based on equation (11), N tuples (s_t^i, a_t^i, r_t^i, s_{t+1}^i) are sampled.
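A minimal array-backed Python sketch of the SumTree described above is given below. The class and method names are our own, and the stored transitions are assumed to be tuples of tensors of the form (s_t, a_t, s_{t+1}, r_t, δ_t).

```python
import random

class SumTree:
    """Array-backed binary tree: leaves hold sample priorities, and each
    parent holds the sum of its children, so the root equals psi_total."""

    def __init__(self, capacity):
        self.capacity = capacity            # number of leaves (k)
        self.tree = [0.0] * (2 * capacity - 1)
        self.data = [None] * capacity       # (s, a, s_next, r, delta) tuples
        self.write = 0                      # next leaf to overwrite

    def add(self, priority, transition):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                    # propagate the change upward
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, psi):
        """Descend from the root with sampled weight psi (see text)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):     # stop when idx is a leaf
            left = 2 * idx + 1
            if psi <= self.tree[left]:
                idx = left
            else:
                psi -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def total(self):
        return self.tree[0]

def sample_batch(tree, n):
    """Draw n transitions with probability proportional to priority."""
    return [tree.get(random.uniform(0.0, tree.total())) for _ in range(n)]
```

With this layout, both a priority update and a sample lookup only walk one root-to-leaf path, so they run in O(log k), which is what makes prioritized replay practical for large buffers.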
Thereafter, the current target Q-value is calculated as

y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}).    (12)

The loss function is expressed as

L = (1/N) Σ_i ( y_i − Q(s_i, a_i | θ^Q) )².    (13)

The strategy network parameters are then updated by gradient descent through backpropagation:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s_i}.    (14)

All sample errors are then recalculated and the priority values P of all affected nodes are updated in the SumTree. If S is the final state, the iteration ends.
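Tying the earlier sketches together, one weighted-sampling training step corresponding to equations (12) through (14) might look as follows. The batch assembly and the small positive offset added to the refreshed priorities are assumptions for illustration, not details taken from the paper.

```python
def train_step(tree, batch_size=64):
    """One weighted-sampling DDPG update, following equations (12)-(14):
    draw a prioritized batch from the SumTree, update critic and actor,
    then refresh the priorities with the new TD errors."""
    picked = sample_batch(tree, batch_size)
    leaves = [leaf for leaf, _, _ in picked]
    transitions = [data for _, _, data in picked]      # (s, a, s_next, r, delta)
    s, a, s_next, r = (torch.stack([tr[j] for tr in transitions])
                       for j in range(4))
    r = r.unsqueeze(-1)

    ddpg_step(s, a, r, s_next)                 # equations (12)-(14)
    soft_update(critic, critic_target)         # equation (10)
    soft_update(actor, actor_target)

    # Recompute TD errors and write them back as new leaf priorities.
    new_delta = td_error(s, a, r, s_next).abs()
    for leaf, delta in zip(leaves, new_delta.tolist()):
        tree.update(leaf, delta + 1e-6)        # small offset keeps priority > 0
```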
5. Experiment

5.1. Experimental Setup and Algorithm Parameters. In the power patrol simulation model, experimental environments with a full-space single target point and with random target points are designed to verify, on a validation set, that the proposed method is effective in high-dimensional space. The parameters of the proposed intelligent manipulator control algorithm combining SumTree and DDPG are shown in Table 1.

Table 1: Algorithm parameters.
Parameter | Value
Neural network learning rate α | 0.001
Number of hidden layer nodes | 500
Discount factor γ | 0.99
Initial greedy value ε | 0.05
Database capacity | 2 × 10
Batch size | 64
Maximum iterations | 2000
Epochs | 5000

5.2. Simulation Experimental Environment. Typical mobile manipulator live working operations in power distribution networks include energized line disconnecting and guiding, wire clearing, and insulator replacement. Considering the possible obstacles and restrictions in the operation, an abstract simulation model is established in Python, as shown in Figure 4, where xyz is the spatial coordinate system of the operation.

Figure 4: Simulation model of the live working operation in the power distribution grid.

When working on insulators and other equipment, the possible obstacles are other insulated equipment. Since the obstacles have a limited distribution in space, and the manipulator must perform full-space obstacle avoidance without restricting its degrees of freedom of movement, a general three-dimensional model is established for simulation. Taking insulator obstacles as an example, the appearance details of the insulators have no impact on the manipulator route planning because of the safety distance required in live working, so the insulators can be regarded as cylindrical obstacles. The operation area is divided into three parts: Region 1, which lies in the middle between the manipulator and the obstacles; Region 2, which lies on the right side; and Region 3, which lies on the left side. These regions are used to compare the success rates of obstacle avoidance and autonomous navigation of the mobile manipulator.
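The paper does not provide the environment code; as a rough sketch of how such a Python model might represent a cylindrical insulator obstacle and assign targets to the three regions, under our own geometric assumptions (corridor width and sign convention are illustrative):

```python
import numpy as np

def hits_cylinder(point, center_xy, radius, z_min, z_max):
    """True if a 3-D point lies inside a vertical cylindrical obstacle,
    the insulator approximation used in Section 5.2."""
    dx, dy = point[0] - center_xy[0], point[1] - center_xy[1]
    return (dx * dx + dy * dy <= radius * radius) and (z_min <= point[2] <= z_max)

def region_of(target_xy, obstacle_xy, half_width=100.0):
    """Assign a target to Region 1 (middle corridor between manipulator
    and obstacle), Region 2 (one side), or Region 3 (the other side).
    The corridor half-width and the side convention are assumptions."""
    offset = target_xy[1] - obstacle_xy[1]      # lateral offset along y
    if abs(offset) <= half_width:
        return 1
    return 2 if offset > 0 else 3
```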
6. Results and Discussion

Firstly, a single target point is set in the simulated environment of the manipulator power inspection scenario. The effectiveness of the hazard action determination and the reward functions can be verified in this single-target setting, so that the performance of the proposed method can be compared with other deep reinforcement learning algorithms. DQN [8], DDPG [9], and the proposed method are tested separately. Results show that, after training, the manipulator can successfully bypass the obstacle and reach the target point with each method. The cumulative reward curves of the training process are shown in Figure 5, and the mean numbers of training rounds and the cumulative reward values after convergence of the reward curves are given in Table 2.

Figure 5: Reward function curve comparison.

Table 2: Performance comparison with a single target point.
Method | Cumulative reward | Training rounds
DQN [8] | 50.88 ± 9.54 | 805 ± 48
DDPG [9] | 99.34 ± 6.77 | 560 ± 62
Proposed method | 121.77 ± 3.13 | 311 ± 28

It can be seen that after training, DQN, DDPG, and the proposed method all converge, and all network models drive the manipulator to accomplish the obstacle avoidance and navigation task. However, the training performance varies among the algorithms. The training of DDPG and DQN also approaches convergence but with greater oscillation, and the termination timing of training has more influence on their model performance than on the proposed method. In addition, the proposed method converges faster and obtains higher rewards.

Next, the full-space behavior of the different models is tested with random target points over the full space. Because different target points give the tasks different inherent rewards, the strategy no longer satisfies the independent and identical distribution assumption, so the training reward does not tend to a stable convergence value. The DQN and DDPG methods have difficulty learning a unified obstacle avoidance and navigation strategy. Although these two methods can complete the obstacle avoidance tasks in most cases, when the target is far away or its location is difficult to reach, they cannot reach the target points smoothly, leading to chaotic behaviors when no effective strategy can be executed.

The success rate, collision rate, and incomplete rate of each algorithm are shown in Table 3.

Table 3: Test results under random target point scenarios.
Method | Success (%) | Collision (%) | Incomplete (%)
DQN [8] | 35.77 | 9.85 | 43.67
DDPG [9] | 30.59 | 12.77 | 47.55
Proposed method | 88.43 | 7.17 | 3.99

It can be seen that the DQN and DDPG methods have low success rates for full-space target point tasks and can hardly complete the multiple-target-point tasks reliably. The proposed method achieves a better control strategy by enhancing the exploration ability of the agent and improving on the uniform sampling of the traditional DDPG algorithm. It learns a relatively safe path from training and can stably reach the target point to complete the task.

The completion rates of the obstacle avoidance and navigation subtasks using the proposed model are shown in Figure 6. From the convergence trend of the curves, it can be seen that the SumTree-based weighted sampling makes the training more stable; the task success rate stabilizes and maintains a high level. The high completion rate of the subtasks guarantees a high live working operation success rate for the proposed method.

Figure 6: Success rate of the proposed method under the random target point test. (a) Success rate of obstacle avoidance training. (b) Success rate of navigation training.

In addition, the test results in each area under the random target point scenario using the proposed method are shown in Table 4. Comparing the success rates of the areas, the collisions mainly appear in Region 2, and the incomplete cases mainly occur in Region 1. The main reason for the incomplete cases is that there are few samples in Region 1, so the variance of the output strategy distribution is large and unreasonable actions are easily sampled. The obstacle collisions are concentrated in the most difficult areas; the fundamental reason is that the model itself has an inherent error rate. The obstacle avoidance success rate of the proposed method is close to 91%. The overall failure probability of the proposed model is low, and it does not fall into a local minimum.

Table 4: Test results in different regions under the random target point scenario.
Location | Success (%) | Collision (%) | Incomplete (%)
Region 1 | 89.64 | 2.86 | 7.50
Region 2 | 85.47 | 13.85 | 0.68
Region 3 | 90.18 | 6.02 | 3.80

7. Conclusion

In this article, a live working manipulator control scheme for patrol inspection in the power grid is proposed. Firstly, the existing DDPG algorithm based on deep reinforcement learning is improved and the Q-learning action value function is optimized, which enhances the exploration ability of the agent and yields a better control strategy. Secondly, uniform sampling is replaced by the weighted sampling method of the SumTree data structure, which adds a priority to each sample in the replay buffer, improves the learning speed of the manipulator, and greatly reduces the training time. Simulation results show that the proposed model is suitable for solving the obstacle avoidance and navigation problem of live working manipulators and is an effective solution for realizing autonomous control of live working mobile robots. With the rapid improvement of industrial computer performance and the fast innovation of deep learning algorithms, autonomous navigation with high positioning accuracy and robustness is the inevitable development direction of power inspection robots. In the future, we will configure more types of sensors on the inspection robot to obtain diverse equipment status information, such as ultraviolet flaw detection and laser vibration measurement, and extend the function of the inspection robot from problem finding to problem solving, thereby further reducing the workload of electric inspection personnel.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

[1] R. S. Gonçalves and J. C. M. Carvalho, "Review and latest trends in mobile robots used on power transmission lines," International Journal of Advanced Robotic Systems, vol. 10, no. 12, p. 408, 2013.
[2] S. Lu, Y. Zhang, and J. Su, "Mobile robot for power substation inspection: a survey," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 830–847, 2017.
[3] W. Jiang, G. Zuo, D. H. Zou, L. Hongjun, J. Y. Jiu, and C. Y. Gao, "Autonomous behavior intelligence control of self-evolution mobile robot for high-voltage transmission line in complex smart grid," Complexity, vol. 2020, Article ID 8843178, 17 pages, 2020.
[4] T. Zhang and J. Dai, "Electric power intelligent inspection robot: a review," Journal of Physics: Conference Series, vol. 1750, no. 1, Article ID 012023, 2021.
[5] S. A. Ajwad, J. Iqbal, M. I. Ullah, and A. Mehmood, "A systematic review of current and emergent manipulator control approaches," Frontiers of Mechanical Engineering, vol. 10, no. 2, pp. 198–210, 2015.
[6] Š. Kružić, J. Musić, R. Kamnik, and P. Vladan, "Estimating robot manipulator end-effector forces using deep learning," in Proceedings of the 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pp. 1163–1168, IEEE, Opatija, Croatia, 2020.
[7] M. Yang, E. Yang, R. C. Zante, P. Mark, and L. Xuefeng, "Collaborative mobile industrial manipulator: a review of system architecture and applications," in Proceedings of the 2019 25th International Conference on Automation and Computing (ICAC), pp. 1–6, IEEE, Lancaster, UK, 2019.
[8] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," 2015, https://arxiv.org/abs/1509.02971.
[10] A. A. Rusu, N. C. Rabinowitz, G. Desjardins et al., "Progressive neural networks," 2016, https://arxiv.org/abs/1606.
[11] Z. Li, H. Ma, Y. Ding, W. Chen, and J. Ying, "Motion planning of six-dof arm robot based on improved DDPG algorithm," in Proceedings of the 2020 39th Chinese Control Conference (CCC), pp. 3954–3959, IEEE, Shenyang, China, 2020.
[12] K. S. Luck, M. Vecerik, S. Stepputtis, B. A. Heni, and S. Jonathan, "Improved exploration through latent trajectory optimization in deep deterministic policy gradient," in Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3704–3711, IEEE, Macau, China, 2019.
[13] K. Fang, Y. Zhu, A. Garg et al., "Learning task-oriented grasping for tool manipulation from simulated self-supervision," The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 202–216, 2020.
[14] B. Sangiovanni, G. P. Incremona, M. Piastra, and A. Ferrara, "Self-configuring robot path planning with obstacle avoidance via deep reinforcement learning," IEEE Control Systems Letters, vol. 5, no. 2, pp. 397–402, 2021.
[15] X. U. Wei and L. U. Shan, "Analysis of space manipulator route planning based on sarsa (λ) reinforcement learning," Journal of Astronautics, vol. 22, no. 3, pp. 12–15, 2019.
[16] W. Yuan, J. A. Stork, D. Kragic, Y. W. Michael, and H. Kaiyu, "Rearrangement with nonprehensile manipulation using deep reinforcement learning," in Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 270–277, IEEE, Brisbane, Australia, 2018.
[17] O. Ravn, N. A. Andersen, and T. T. Andersen, "UR10 performance analysis," 2014, https://backend.orbit.dtu.dk/ws/portalfiles/portal/105275650/ur10_performance_analysis.pdf.
[18] D. Zhang and Z. Gao, "Forward kinematics, performance analysis, and multi-objective optimization of a bio-inspired parallel manipulator," Robotics and Computer-Integrated Manufacturing, vol. 28, no. 4, pp. 484–492, 2012.
[19] M. S. Kim, D. K. Han, J. H. Park, and J. S. Kim, "Motion planning of robot manipulators for a smoother path using a twin delayed deep deterministic policy gradient with hindsight experience replay," Applied Sciences, vol. 10, no. 2, p. 575, 2020.
[20] Y. Wang, X. Lan, C. Feng et al., "An experience-based policy gradient method for smooth manipulation," in Proceedings of the 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 93–97, IEEE, Suzhou, China, 2019.
[21] S. Zhang and R. S. Sutton, "A deeper look at experience replay," 2017, https://arxiv.org/abs/1712.01275.
http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Robotics Hindawi Publishing Corporation

Simulation Design of a Live Working Manipulator for Patrol Inspection in Power Grid

Simulation Design of a Live Working Manipulator for Patrol Inspection in Power Grid

Abstract

The distribution line network is the electric power infrastructure directly facing the users, with the characteristics of large coverage and complex network, and its operation safety is directly related to the stability and reliability of the power supply system, which is the key link to ensure the safety of power supply and the reliability of residential electricity consumption. In order to realize the autonomous obstacle avoidance and autonomous navigation of the live working manipulator for inspection and maintenance of the power grid equipment, a mobile manipulator intelligent control method combining SumTree-weighted sampling and deep deterministic policy gradient (DDPG) is proposed. Firstly, the traditional DDPG algorithm is improved to optimize the action value function of <i>Q</i>-learning to get a better control strategy, and the weighted sampling technique is used to add priority to each sample in the replay buffer, which improves the learning speed and accelerates the convergence speed. The corresponding environmental state space is designed, and simulation experiments are conducted to verify the proposed manipulator control method. Simulation results demonstrate that the proposed method performs better than traditional DDPG and DQN methods in obstacle avoidance and navigation tasks, with faster convergence, better path planning ability, and lower offset cost, which can provide theoretical and technical references for realizing fully autonomous power grid inspection operations.

Loading next page...
 
/lp/hindawi-publishing-corporation/simulation-design-of-a-live-working-manipulator-for-patrol-inspection-TmM0XwHZyr

References (21)

Publisher
Hindawi Publishing Corporation
ISSN
1687-9600
eISSN
1687-9619
DOI
10.1155/2022/7318090
Publisher site
See Article on Publisher Site

Abstract

Hindawi Journal of Robotics Volume 2022, Article ID 7318090, 8 pages https://doi.org/10.1155/2022/7318090 Research Article Simulation Design of a Live Working Manipulator for Patrol Inspection in Power Grid 1 2 2 2 2 Tao Xie , Zhengyu Li , Yongyun Zhang , Bin Yuan , and Xiao Liu State Grid Shanxi Electric Power Company, Taiyuan, Shanxi 030021, China State Grid UHV Transformation Co. of SEPC, Taiyuan, Shanxi 030032, China Correspondence should be addressed to Tao Xie; 220192216007@ncepu.edu.cn Received 15 August 2022; Revised 24 August 2022; Accepted 26 August 2022; Published 14 September 2022 Academic Editor: Shahid Hussain Copyright © 2022 Tao Xie et al. �is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. �e distribution line network is the electric power infrastructure directly facing the users, with the characteristics of large coverage and complex network, and its operation safety is directly related to the stability and reliability of the power supply system, which is the key link to ensure the safety of power supply and the reliability of residential electricity consumption. In order to realize the autonomous obstacle avoidance and autonomous navigation of the live working manipulator for inspection and maintenance of the power grid equipment, a mobile manipulator intelligent control method combining SumTree-weighted sampling and deep deterministic policy gradient (DDPG) is proposed. Firstly, the traditional DDPG algorithm is improved to optimize the action value function of Q-learning to get a better control strategy, and the weighted sampling technique is used to add priority to each sample in the replay bu“er, which improves the learning speed and accelerates the convergence speed. �e corresponding environmental state space is designed, and simulation experiments are conducted to verify the proposed manipulator control method. Simulation results demonstrate that the proposed method performs better than traditional DDPG and DQN methods in obstacle avoidance and navigation tasks, with faster convergence, better path planning ability, and lower o“set cost, which can provide theoretical and technical references for realizing fully autonomous power grid inspection operations. Since the 1980s, the United States, Canada, Spain, 1. Introduction France, Japan, and other developed countries have succes- Mechanization, automation, and intelligentization of live sively carried out the research of like working robots, such as working operation are the development trend and vital Japan Kyushu Electric Power Company, the State Grid technological direction to realize intelligent inspection Corporation of China, and so on. According to their au- and maintenance of power grid and safe live working tomation degree, robots can be mainly divided into three operation [1]. At present, the research related to the categories, namely, manual remote control, remote tele- automation of substation equipment with power grid operation, and autonomous operation [4]. 
Manipulator operation has been carried out worldwide, pilot appli- control is based on an automation control algorithm, which cations have been carried out, and certain results have automatically searches the locations of the objects in the been achieved, which can replace or assist the manual space and accurately identi¢es the obstacles in the travel path completion of some typical inspection and maintenance to plan a most reasonable motion trajectory [5]. Deep projects [2]. �is kind of technology has been innovatively learning is an advanced stage in the development of arti¢cial developed and gradually applied in recent years, which is intelligence and one of the most important and popular conducive to enriching the technical means of power grid research areas of arti¢cial intelligence, but there are di“erent operation with a live working environment and improving states in the obstacle avoidance navigation process of the the safety and automation level of the operation, and has robot, which needs to solve the two core problems of large good development prospects [3]. di“erences in strategy distribution and sparse positive 2 Journal of Robotics feedback rewards. )e traditional reinforcement learning optimization approach in which the learned deep dynamic model cannot explore the space completely, causing the model was used to compute the policy gradient and the value function was adopted as a criterion, thus improving the strategy to fall into local minima easily, and cannot derive the full-space manipulator obstacle avoidance and naviga- efficiency of training. tion control strategy through long training time and mul- In the study of practical operations of robotic arms, in tiple training rounds [6]. [13], based on deep reinforcement learning techniques, the In this article, through deeply analyzing the live working optimal grasping point and grasping stability of the robotic process and action sequence of the mobile manipulator, the arms were evaluated considering the working performance traditional deep deterministic policy gradient (DDPG) al- under different tasks, in order to realize the requirement of gorithm is optimized and improved to improve the di- grasping different shapes of objects in various tasks and to mensionality, scalability, and generalization capability of the complete the collision-free autonomous mobile robot nav- algorithm, making it more applicable to the trajectory igation. Sangiovanni et al. [14] applied reinforcement planning and obstacle avoidance control of the mobile learning methods to the obstacle avoidance and navigation tasks of the robot manipulators and achieved obstacle-free manipulator and improving the control accuracy of the trajectories during the manipulator movement. Specifically, path planning for the robot. However, the model is only based on the DDPG algorithm, this article adopts a weighted trained and tested for a single target point, and the full sampling method with a SumTree data structure instead of operational space decision was not achieved. Wei and Shan uniform sampling, so that successful samples have a higher [15] proposed to treat the three arms of the manipulator as chance of being learned by the agent, and the Q-learning three different agents, and the three arms were restricted to action value function is optimized and applied to the au- constantly be in the same plane and the control strategies tonomous power grid inspection and maintenance tasks. 
were set separately to realize the full working space decision- Simulation results prove that the proposed method enables making of the robot manipulator. In summary, the above the manipulator to complete the tasks more accurately and methods do not change the way of sample extraction but still quickly. uniformly collect samples from the replay buffer for training, and the agents cannot learn the successful samples effi- )e rest of this article is organized as follows: Section 2 introduces the related research. )e basic principles of the ciently, which leads to a long training time [16]. proposed method are explained in Section 3. Section 4 describes the proposed intelligent manipulator control 3. Fundamental Principles method. Section 5 gives the simulation results and discus- 3.1. Live Working Manipulator and Kinematic Analysis. sion. Finally, Section 6 concludes the whole article. Considering that the weight of the end-effector for live working operation is more than 5 kg, this article takes the 2. Related Works UR10 robot arm as the research object and further carries out its abstract modeling and kinematic analysis, mainly A lot of effective path planning methods have been proposed studying the motion state and control strategy of the three- for mobile robots worldwide, but most of the path planning segment arm of UR10 in the Cartesian coordinate system algorithms for mobile robots are not applicable to robotic [17]. arms because manipulators are complex nonlinear systems with challenging factors such as high degrees of freedom and coupling between connecting rods, which make the planning 3.1.1. Cartesian Coordinate System. In the Cartesian coor- more difficult [7]. dinate system, (x, y, z) coordinates can be used to represent Researchers have carried out the exploration of deep any position information in three-dimensional space by reinforcement learning in the field of grasping, localization, constructing multiple Cartesian coordinate systems of the and obstacle avoidance of industrial manipulators with some same basic direction with {o , o , o , o } as the origin, the first 1 2 3 4 success. Mnih et al. [8] proposed the Deep Q-Network and last positions of each endpoint of the manipulator in (DQN) algorithm, which combines neural networks with Q- three-dimensional space can be calculated by accumulating learning. )e model was trained in the ATARI2600 game, the increments of each segment as shown in Figure 1. and the final performance was much higher than that of humans. Lillicrap et al. [9] proposed the DDPG algorithm and applies it to a high-dimensional continuous action 3.1.2. Motion Degree Freedom Analysis. )e kinematic space. Rusu et al. [10] transferred the training results in a analysis of the manipulator used in this article is mainly a simulation environment to a physical robotic arm and forward operation problem [18]. )e forward operation achieved similar results to the simulation after only a simple problem is to solve the endpoint coordinates of the ma- training. Liz et al. [11] proposed an improved DDPG al- nipulator in three-dimensional space given the arm lengths gorithm in which a success experience pool and a collision l , l , and l of the three-segment arm and the rotation angles 1 2 3 experience pool are added to optimize the traditional ex- θ , θ , θ , θ , θ , and θ . Using the z-coordinate axis as the 1 2 3 4 5 6 perience pool. 
Compared with the traditional DDPG al- rotation axis and the endpoint of the first segment of the gorithm, the improved DDPG algorithm has fewer training robot arm as the origin, a manipulator model is constructed, episodes, but achieves better results. Luck et al. [12] com- and six angles of rotation are defined as six degrees of bined the DDPG algorithm with a model-based trajectory freedom. θ and θ control the first segment, where θ 1 2 1 Journal of Robotics 3 θ 5 (a) (b) Figure 1: Diagrams of the manipulator and its simplified model in the Cartesian coordinate system. (a) UR10 manipulator. (b) Simplified model. represents the angle of rotation of l around the z-axis and θ executing a in the environment s . Typically, the environ- 1 2 i i represents the angle of l with the x-y plane. l , l , and the z- ment in which the long-term gain is scaled down by the 1 1 2 axis are kept in the same plane. θ controls l and represents parameter c: 3 2 its angle with l . For the convenience of calculation, it is c i expressed by the angle between the second segment arm and R � 􏽘 c r s , a􏼁 , (3) i i the x-y plane of the o coordinate system. θ ∼θ control l , 2 4 6 3 i�0 where θ and θ act similar to θ and θ to make l move at 4 5 1 2 3 where c ∈ (0, 1). )e long-term benefits of performing ac- any angle within the o coordinate system, and θ controls l 3 6 3 tion a in an environment s are generally represented by the to rotate around its own arm axis without affecting the action value function: endpoint coordinates. )e coordinates of each node of the manipulator can be calculated as T c i− t ⎡ ⎣ ⎤ ⎦ Q s , a􏼁 � E􏼂R |s � s , a � a 􏼃 � E 􏽘 c r s , a􏼁 . (4) t t t t i i ⎧ ⎪Δl � l cos θ cos θ , l cos θ sin θ , l sin θ 􏼁, i�t 1 1 2 1 1 2 1 1 2 Δl � l cos θ cos θ , l cos θ sin θ , l sin θ 􏼁, ⎪ 2 2 3 1 2 3 1 2 3 )is optimal action value function is usually found using the Bellman equation [20]: ⎨ Δl � l cos θ cos θ , l cos θ sin θ , l sin θ , 3 3 5 4 3 5 4 3 5 (1) o � o + Δl , 2 1 1 Q∗ s , a􏼁 � E􏼠r s , a􏼁 + c max Q∗ s , a 􏼁 􏼣. (5) t t t t t+1 t+1 o � o + Δl , t+1 3 1 2 o � o + Δl . However, this approach is only suitable for those situ- 4 3 3 ations where both action space and state space are discrete. To apply reinforcement learning to the problem where ac- tion space and state space are continuous, DDPG designed 3.2. Deep Deterministic Policy Gradient. In a standard re- Q two deep neural networks: action value network Q(s , a |θ ) t t inforcement learning environment, each agent interacts with μ Q μ and action network μ(s , a |θ ), where θ and θ are network t t the environment with the ultimate goal of maximizing μ parameters. Action network μ(s , a |θ ) is a mapping cor- t t environmental gain. )is interactive process is formally responding to the state space and action space, which can described as a Markov decision process (MDP) [19], which directly generate the desired action based on the state. )e can be described by the quadruplet (S, A, R, P). S represents Q action value network Q(s , a |θ ) is used to approach the t t the state space, A represents the action space, action value function and can provide gradients for the R: S × A ⟶ R is the reward function, and training of the action network. P: S × A × S ⟶ [0, 1] is the transfer probability. 
In this )e training of the action value networks is to minimize environment, an intelligence learns a strategy π: S ⟶ A to the loss function: maximize the gain in the environment: Q Q′ Q T ′ L􏼐θ 􏼑 � 􏼒r s , a􏼁 + cQ 􏼒s , a |θ 􏼓 − Q􏼐s , a |θ 􏼑􏼓 , t t t+1 t+1 t t R � 􏽘 r s , a􏼁 , (2) 0 i i (6) i�0 where T is the number of steps advanced at the end of the where Q′ is the target value network with synchronized interaction and r(s , a ) denotes the gain obtained by weights from the Q-network. And the update of the action i i 4 Journal of Robotics network parameters requires using the policy gradient al- Control gorithm in which the gradient update direction can be Strategy expressed as: Q Q μ (s , a , s , r , δ ) ∇ μ Q􏼐s, a|θ 􏼑| � ∇ Q􏼐s, a|θ 􏼑| ∇ μμ s|θ 􏼁 | , (7) t t t+1 t t θ s�s ,a�μ s ,v a s�s ,a�μ s ,v θ s�s states ( ) ( ) t t t t t Environment where the control strategy μ represents the actions of the manipulator, and the Q-network is the action value function. )e training of these deep neural networks requires the Weighted sampling Replay Q-Network buffer input data to be independent and uniformly distributed, while reinforcement learning models with Markov decision Figure 2: Algorithm diagram. processes where data are collected sequentially and do not satisfy the requirements. )erefore, experience replay of DQN (deep Q-learning network) is introduced to break the uniform random sampling. However, the samples in the data correlation [21]. After a certain amount of training data replay buffer make a big difference for network training. are stored in the experience pool, data are collected from the leading to few successful cases and more failures in the replay buffer for training the Q-network according to uni- actual training of manipulator trajectory planning. form sampling. )erefore, if the uniform random sampling method is adopted, it will make the extraction of successful samples 3.3. Design and Optimization of Action Value Function. difficult. )e action value function is an important basis for judging SumTree utilizes a binary tree structure to access data, the quality of the current strategy of the manipulator. )e the proposed method applies it to the experience replay of input of the Q-network contains the current state and action the DDPG algorithm. )e expectation of the difference of the manipulator, and the neural network is used to fit the between the target Q-value and the real Q-value is applied in Q-function. )e Q-learning algorithm is used to optimize the DDPG algorithm to update the parameters in the the Q-network along with the policy network, and the ob- strategy network and the value network, and the larger jective function can be optimized as difference represents that the parameters are not selected accurately, that is, the samples need to be trained more by δ � r + cQ s ,μ s 􏼁􏼁 − Q s , a􏼁 . (8) t t t+1 t+1 t t the manipulator. Firstly, SumTree is initialized, the capacity ′ ′ size is defined, and the initial state s is set as the first current In addition, the target networks Q and μ are set for the state. )en, the state s is taken as the input of the real actor Q-network and the policy network, respectively: t network and the policy π is computed to get a . Finally, the ′ ′ δ � r + cQ s ,μ s |θ􏼁 􏼁 − Q s , a􏼁 . (9) t t t+1 t+1 t t t action a is executed to get the reward value r and the new t t state s . 
t+1 )e Q-network generates delay errors in updating the In this article, the loss value resulting from the target Q- parameters during the training process; therefore, the fol- value and the real Q-value is used as the criterion of priority, lowing equations are used to perform updates: and it can be expressed as: ω ←τω + (1 − τ)ω , ′ ′ (10) P(i) � , (11) θ ←τθ + (1 − τ)θ . β ′ ′ 􏽐 p k k where ω and ω′ are the parameters of the Q-network and the where P(i) is the probability that the ith tuple of training target network, respectively, and θ and θ′ are the parameters samples is sampled; β is a constant; the larger the β is, the of the manipulator strategy network and target network, greater the weight of priority; and k is the total number of respectively. samples in the replay buffer. )e tuple stored in the replay buffer is optimized as (s , a , s , r ,δ ) compared to the t t t+1 t t 4. Control Method of Patrol Manipulator original one (s , a , s , r ). )e structure of SumTree is t t t+1 t shown in Figure 3. Combining SumTree-Weighted Sampling As shown in the figure, the leaf nodes of the SumTree and DDPG store the priority of the samples, each node has a weight, and A schematic diagram of the proposed algorithm is given in the weight of the parent node is equal to the sum of the Figure 2. )e deep DDPG algorithm combined with Sum- weights of the two child nodes, which finally converges to the Tree-weighted sampling is explained in detail in the fol- root node. )e capacity of the replay buffer is k, the number lowing section. of nodes is 2k − 1, and the value of the root node is )e data (s , a , s , r )obtained from each interaction ψ � 􏽐 ψ . t t t+1 t total k k of the manipulator with the environment are stored in the After establishing the SumTree structure, the data in the replay buffer. )e traditional DDPG algorithm treats all replay buffer are sampled in the following way: firstly, a samples in the replay buffer as having the same value for weight is sampled from [0,ψ ], then the comparison starts total network training and extracts the training samples by from the root node, taking the order from top to bottom and Deterministic Policy Gradient Journal of Robotics 5 Root node 15 20 6 10 3 12 14 6 2 4 9 1 Leaf node [0, 3] (3, 15] (15, 29] (29, 35] (35, 37] (37, 41] (41, 50] (50, 51] Figure 3: SumTree structure. from left to right. If the selected weight ψ is less than or guiding, wire clearing, and insulator replacement. Consid- equal to the left node weight, the left child node is taken; if ering the possible obstacles and restrictions in the operation, the selected weight ψ is greater than the left node weight, an abstract simulation model is established based on Python, the weight of the left node is subtracted from ψ and the new as shown in Figure 4, where xyz is the spatial coordinate weight obtained is assigned to ψ , then the right node is system of the operation. taken as the new node to continue down the collection until When working on insulators and other equipment, the possible obstacles are other insulated equipment. Since the the current node is a leaf node, then the data are extracted and the search is finished. In doing so and based on equation obstacles have limited distribution in space, the manipulator i i i i (11), N tuples of (s , a , r , s ) are sampled. 
is effective in full-space obstacle avoidance behavior and t t t t+1 )ereafter, the current target Q-value is calculated as cannot limit the degrees of freedom of movement, so a follows: general three-dimensional model is established for simu- lation. Taking insulator obstacles as an example, the ap- 􏼌 ′ ′ μ Q ′ ′ y � r + cQ s ,μ s θ |θ . (12) 􏼒 􏼒 􏼌 􏼓 􏼓 pearance details of these insulators in the actual operation i i i+1 i+1 will not have an impact on the manipulator route planning )e loss function is expressed as follows: due to the limitation of the safety distance of the live working operation, and the insulators can be regarded as cylindrical 1 2 L � 􏽘 􏼐y − Q􏼐s , a |θ 􏼑􏼑 . (13) obstacles. )e operation area is divided into three parts, i i i which are Region 1, which lies in the middle between the manipulator and obstacles; Region 2, which lies on the right Using gradient descent technique to reversely update side; and Region 3, which lies on the left side, to compare the strategy network parameters: success rate of obstacle avoidance and autonomous navi- Q μ gation of the mobile manipulator. μ μ ∇ J ≈ 􏽘∇ Q􏼐s, a|θ 􏼑| ∇ s|θ 􏼁 | . (14) θ a s�s ,a�μ s θ s ( ) i i 6. Results and Discussion All sample errors are recalculated and the priority value P of all nodes is updated in SumTree. If S is the final state, Firstly, a single target point is set in the simulated envi- the iteration is ended. ronment of the manipulator power inspection operation scenario. )e effectiveness of hazard action determination 5. Experiment and reward functions can be verified in a single object setting, so as to compare the performance of the proposed 5.1. Experimental Setup and Algorithm Parameters. In the method with other deep reinforcement learning algorithms. power patrol simulation model, the experimental environ- )e DQN [8], DDPG [9], and the proposed method are ment with full-space single target point and random target tested separately. Results show that after training, the ma- points is designed to verify that the proposed method is nipulator can successfully bypass the obstacle and reach the effective in high-dimensional space using the validation set. target point when using different methods. )e cumulative )e parameters of the proposed intelligent manipulator reward curve of the training process is shown in Figure 5, control algorithm combining SumTree and DDPG are and the mean values of the number of training rounds and shown in Table 1. cumulative reward values after the convergence of the re- ward curves are given in Table 2. It can be seen from the 5.2. Simulation Experimental Environment. )e typical results that after training, DQN, DDPG, and the proposed mobile manipulator live working operation in power dis- method converge, and all network models drive the ma- tribution networks include energized line disconnecting and nipulator to accomplish the obstacle avoidance and 6 Journal of Robotics Table 1: Algorithm parameters. Table 2: Performance comparison with single target point. Parameters Values Methods Cumulative reward Training rounds Neural network learning rate α 0.001 DQN [8] 50.88± 9.54 805± 48 Number of hidden layer nodes of neural network 500 DDPG [9] 99.34± 6.77 560± 62 Discount factor c 0.99 Proposed method 121.77± 3.13 311± 28 Initial greedy value ε 0.05 Database capacity 2 × 10 Batch size 64 Maximum iterations 2000 Table 3: Test results under random target point scenarios. 
5. Experiment

5.1. Experimental Setup and Algorithm Parameters. In the power patrol simulation model, experimental environments with a full-space single target point and with random target points are designed to verify, on a validation set, that the proposed method is effective in high-dimensional space. The parameters of the proposed intelligent manipulator control algorithm combining SumTree and DDPG are shown in Table 1.

Table 1: Algorithm parameters.
Parameter                                              Value
Neural network learning rate α                         0.001
Number of hidden layer nodes of the neural network     500
Discount factor γ                                      0.99
Initial greedy value ε                                 0.05
Replay buffer (database) capacity                      2 × 10
Batch size                                             64
Maximum iterations                                     2000
Epochs                                                 5000

5.2. Simulation Experimental Environment. Typical mobile manipulator live working operations in power distribution networks include energized line disconnecting and guiding, wire clearing, and insulator replacement. Considering the possible obstacles and restrictions in the operation, an abstract simulation model is established in Python, as shown in Figure 4, where xyz is the spatial coordinate system of the operation.

When working on insulators and other equipment, the possible obstacles are other insulated equipment. Since the obstacles have a limited distribution in space, while the manipulator must remain effective in full-space obstacle avoidance and its degrees of freedom of movement cannot be restricted, a general three-dimensional model is established for the simulation. Taking insulator obstacles as an example, the appearance details of the insulators have no impact on the manipulator route planning because of the safety distance required in live working operation, so the insulators can be regarded as cylindrical obstacles. To compare the success rates of obstacle avoidance and autonomous navigation of the mobile manipulator, the operation area is divided into three parts: Region 1, which lies in the middle between the manipulator and the obstacles; Region 2, which lies on the right side; and Region 3, which lies on the left side.

Figure 4: Simulation model of the live working operation in the power distribution grid (operation space in the xyz coordinate system with Regions 1–3).
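The abstract environment described above can be pictured with a short sketch: a vertical cylinder standing in for an insulator, a safety-distance check around it, and a rough split of the workspace into the three regions. All geometry here (dimensions, margins, and the region boundaries along x) is assumed for illustration and is not taken from the article.

```python
import math
from dataclasses import dataclass

@dataclass
class Cylinder:
    """Cylindrical obstacle standing along the z-axis (illustrative abstraction
    of an insulator; all dimensions are placeholders, not values from the article)."""
    cx: float
    cy: float
    radius: float
    z_min: float
    z_max: float

def collides(point, obstacle, safety_margin=0.1):
    """True if a point (x, y, z) violates the safety distance around the cylinder."""
    x, y, z = point
    if not (obstacle.z_min <= z <= obstacle.z_max):
        return False
    dist = math.hypot(x - obstacle.cx, y - obstacle.cy)
    return dist <= obstacle.radius + safety_margin

def region_of(point, x_left, x_right):
    """Classify a target point into Region 1 (middle), 2 (right), or 3 (left);
    the boundary values x_left / x_right are assumed, not given in the article."""
    x = point[0]
    if x > x_right:
        return 2
    if x < x_left:
        return 3
    return 1

# Example: one insulator modeled as a cylinder and one candidate end-effector point.
insulator = Cylinder(cx=300.0, cy=200.0, radius=40.0, z_min=0.0, z_max=250.0)
print(collides((320.0, 210.0, 100.0), insulator))                 # True: inside the safety envelope
print(region_of((320.0, 210.0, 100.0), x_left=150.0, x_right=450.0))  # 1: middle region
```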
6. Results and Discussion

Firstly, a single target point is set in the simulated environment of the manipulator power inspection operation scenario. The effectiveness of the hazard action determination and reward functions can be verified in this single-target setting, so that the performance of the proposed method can be compared with other deep reinforcement learning algorithms. DQN [8], DDPG [9], and the proposed method are tested separately. The results show that, after training, the manipulator can successfully bypass the obstacle and reach the target point with each of the methods. The cumulative reward curves of the training process are shown in Figure 5, and the mean numbers of training rounds and mean cumulative reward values after convergence of the reward curves are given in Table 2.

Figure 5: Reward function curve comparison (cumulative reward versus training rounds for DQN, DDPG, and the proposed method).

Table 2: Performance comparison with a single target point.
Method             Cumulative reward    Training rounds
DQN [8]            50.88 ± 9.54         805 ± 48
DDPG [9]           99.34 ± 6.77         560 ± 62
Proposed method    121.77 ± 3.13        311 ± 28

It can be seen from the results that, after training, DQN, DDPG, and the proposed method all converge, and all network models drive the manipulator to accomplish the obstacle avoidance and navigation task. However, the training performance varies among the algorithms. The training of DDPG and DQN also approaches convergence but with greater oscillation, and the termination timing of training has a larger influence on model performance than it does for the proposed method. In addition, it can be seen that the proposed method converges faster and obtains higher rewards.

Next, the full-space behavior of the different models is tested with full-space random target points. Because different target points assign different inherent rewards to the tasks, the samples no longer satisfy the independent and identical distribution assumption, so the training reward does not tend to a stable convergence value, and the DQN and DDPG methods have difficulty learning a unified obstacle avoidance and navigation strategy. Although these two methods can complete the obstacle avoidance tasks in most cases, when the target is far away or its location is difficult to reach, they cannot reach the target points smoothly, which leads to chaotic behavior when no effective strategy can be executed.

The success rate, collision rate, and incomplete rate corresponding to each algorithm are shown in Table 3. It can be seen that the DQN and DDPG methods have low success rates for the full-space target point tasks and find it difficult to reliably complete tasks with multiple target points. The proposed method achieves a better control strategy by enhancing the exploration ability of the agent and by improving on the uniform sampling of the traditional DDPG algorithm. It learns a relatively safe path during training and can stably reach the target point to complete the task.

Table 3: Test results under random target point scenarios.
Method             Success (%)    Collision (%)    Incomplete (%)
DQN [8]            35.77          9.85             43.67
DDPG [9]           30.59          12.77            47.55
Proposed method    88.43          7.17             3.99

The completion rates of the obstacle avoidance and navigation subtasks using the proposed model are shown in Figure 6. From the convergence trend of the curves, it can be seen that the weighted sampling based on SumTree makes the training more stable; the task success rate becomes stable and maintains a high level of performance. The high completion rate of the subtasks guarantees a high live working operation success rate for the proposed method.

Figure 6: Success rate of the proposed method under the random target point test. (a) Success rate of obstacle avoidance training. (b) Success rate of navigation training.

In addition, the test results in each area under the random target point scenario using the proposed method are shown in Table 4. Comparing the success rates of the areas, it can be seen that collisions mainly appeared in Region 2, while incomplete cases mainly occurred in Region 1. The main reason for the incomplete cases is that few samples are drawn in Region 1, so the variance of the output strategy distribution is large and unreasonable actions are more likely to be sampled. The obstacle collisions are concentrated in the areas with the highest difficulty; the fundamental reason is that the model itself has an inherent error rate. The obstacle avoidance success rate of the proposed method is close to 91%. The overall failure probability of the proposed model is low, and it does not fall into a local minimum.

Table 4: Test results in different regions under the random target point scenario.
Location    Success (%)    Collision (%)    Incomplete (%)
Region 1    89.64          2.86             7.50
Region 2    85.47          13.85            0.68
Region 3    90.18          6.02             3.80

7. Conclusion

In this article, a live working manipulator control scheme for patrol inspection in the power grid is proposed. Firstly, the existing DDPG algorithm based on deep reinforcement learning is improved and the action value function of Q-learning is optimized, which enhances the exploration ability of the agent and yields a better control strategy. Secondly, the uniform sampling is replaced by the weighted sampling method based on the SumTree data structure, which adds a priority to each sample in the replay buffer, improves the learning speed of the manipulator, and greatly reduces the training time. Simulation results show that the proposed model is suitable for solving the obstacle avoidance and navigation problem of live working manipulators and is an effective solution for realizing autonomous control of live working mobile robots. With the rapid improvement of industrial computer performance and the fast-paced innovation of deep learning algorithms, autonomous navigation with high positioning accuracy and robustness is the inevitable development direction of power inspection robots. In the future, we will try to configure more types of sensors on the inspection robot to obtain diverse equipment status information, such as ultraviolet flaw detection and laser vibration measurement, and to transform the function of the inspection robot from problem finding to problem solving, thereby further reducing the work pressure of electric power inspection personnel.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

[1] R. S. Gonçalves and J. C. M. Carvalho, "Review and latest trends in mobile robots used on power transmission lines," International Journal of Advanced Robotic Systems, vol. 10, no. 12, p. 408, 2013.
[2] S. Lu, Y. Zhang, and J. Su, "Mobile robot for power substation inspection: a survey," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 830–847, 2017.
[3] W. Jiang, G. Zuo, D. H. Zou, L. Hongjun, J. Y. Jiu, and C. Y. Gao, "Autonomous behavior intelligence control of self-evolution mobile robot for high-voltage transmission line in complex smart grid," Complexity, vol. 2020, Article ID 8843178, 17 pages, 2020.
[4] T. Zhang and J. Dai, "Electric power intelligent inspection robot: a review," Journal of Physics: Conference Series, vol. 1750, no. 1, Article ID 012023, 2021.
[5] S. A. Ajwad, J. Iqbal, M. I. Ullah, and A. Mehmood, "A systematic review of current and emergent manipulator control approaches," Frontiers of Mechanical Engineering, vol. 10, no. 2, pp. 198–210, 2015.
[6] S. Kružić, J. Musić, R. Kamnik, and P. Vladan, "Estimating robot manipulator end-effector forces using deep learning," in Proceedings of the 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), pp. 1163–1168, IEEE, Opatija, Croatia, 2020.
[7] M. Yang, E. Yang, R. C. Zante, P. Mark, and L. Xuefeng, "Collaborative mobile industrial manipulator: a review of system architecture and applications," in Proceedings of the 2019 25th International Conference on Automation and Computing (ICAC), pp. 1–6, IEEE, Lancaster, UK, 2019.
[8] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," 2015, https://arxiv.org/abs/1509.02971.
[10] A. A. Rusu, N. C. Rabinowitz, G. Desjardins et al., "Progressive neural networks," 2016, https://arxiv.org/abs/1606.
[11] Z. Li, H. Ma, Y. Ding, W. Chen, and J. Ying, "Motion planning of six-DOF arm robot based on improved DDPG algorithm," in Proceedings of the 2020 39th Chinese Control Conference (CCC), pp. 3954–3959, IEEE, Shenyang, China, 2020.
[12] K. S. Luck, M. Vecerik, S. Stepputtis, B. A. Heni, and S. Jonathan, "Improved exploration through latent trajectory optimization in deep deterministic policy gradient," in Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3704–3711, IEEE, Macau, China, 2019.
[13] K. Fang, Y. Zhu, A. Garg et al., "Learning task-oriented grasping for tool manipulation from simulated self-supervision," The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 202–216, 2020.
[14] B. Sangiovanni, G. P. Incremona, M. Piastra, and A. Ferrara, "Self-configuring robot path planning with obstacle avoidance via deep reinforcement learning," IEEE Control Systems Letters, vol. 5, no. 2, pp. 397–402, 2021.
[15] X. U. Wei and L. U. Shan, "Analysis of space manipulator route planning based on sarsa (λ) reinforcement learning," Journal of Astronautics, vol. 22, no. 3, pp. 12–15, 2019.
[16] W. Yuan, J. A. Stork, D. Kragic, Y. W. Michael, and H. Kaiyu, "Rearrangement with nonprehensile manipulation using deep reinforcement learning," in Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 270–277, IEEE, Brisbane, Australia, 2018.
[17] O. Ravn, N. A. Andersen, and T. T. Andersen, "UR10 performance analysis," 2014, https://backend.orbit.dtu.dk/ws/portalfiles/portal/105275650/ur10_performance_analysis.pdf.
[18] D. Zhang and Z. Gao, "Forward kinematics, performance analysis, and multi-objective optimization of a bio-inspired parallel manipulator," Robotics and Computer-Integrated Manufacturing, vol. 28, no. 4, pp. 484–492, 2012.
[19] M. S. Kim, D. K. Han, J. H. Park, and J. S. Kim, "Motion planning of robot manipulators for a smoother path using a twin delayed deep deterministic policy gradient with hindsight experience replay," Applied Sciences, vol. 10, no. 2, p. 575.
[20] Y. Wang, X. Lan, C. Feng et al., "An experience-based policy gradient method for smooth manipulation," in Proceedings of the 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 93–97, IEEE, Suzhou, China, 2019.
[21] S. Zhang and R. S. Sutton, "A deeper look at experience replay," 2017, https://arxiv.org/abs/1712.01275.
