Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and Dynamic Calibration

Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and... Hindawi Journal of Robotics Volume 2022, Article ID 9916292, 10 pages https://doi.org/10.1155/2022/9916292 Research Article Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and Dynamic Calibration Jiseong Heo and Hyoung woo Lim Agency for Defense Development, Daejeon, Republic of Korea Correspondence should be addressed to Hyoung woo Lim; hwlim@add.re.kr Received 17 January 2022; Accepted 31 May 2022; Published 26 June 2022 Academic Editor: Keigo Watanabe Copyright © 2022 Jiseong Heo and Hyoung woo Lim. �is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Applying reinforcement learning algorithms to autonomous driving is di€cult because of mismatches between the simulation in which the algorithm was trained and the real world. To address this problem, data from global navigation satellite systems and inertial navigation systems (GNSS/INS) were used to gather pseudolabels for semantic segmentation. A very simple dynamics model was used as a simulator, and dynamic parameters were obtained from the linear regression of manual driving records. Segmentation and a dynamic calibration method were found to be eŠective in easing the transition from a simulation to the real world. Pseudosegmentation labels are found to be more suitable for reinforcement learning models. We conducted tests on the e€cacy of our proposed method, and a vehicle using the proposed system successfully drove on an unpaved track for ap- proximately 1.8 km at an average speed of 26.57 km/h without incident. OŠ-road driving environments are quite diŠerent from 1. Introduction road driving environments, wherein the path is kept much Due to the improvements in deep convolutional neural more standard and uniform. In oŠ-road environments, there network architectures and graphical processing units, recent exists much more variability, such as grass growing in the research has aimed at applying deep learning algorithms to middle of the road, road curvature changes due to seasonal factors like rain and snow, and even color changes in the autonomous driving tasks. Previously, traditional computer vision algorithms, including edge detection and template road itself before and after rain. Autonomous driving based on images in oŠ-road environments is inevitably aŠected by matching, were used to infer how the vehicle should drive. Researchers utilized deep learning methods, including these various disturbances. To address this problem, we convolutional neural networks (CNNs) to leverage their require the ability to generalize oŠ-road driving environ- complicated features and enable the self-driving algorithm ments, and we also need a robust algorithm that can make to behave more intelligently. the best choice in a wide variety of driving situations. Imitation learning has been a common approach for In contrast, reinforcement learning (RL) has several autonomous driving [1]. During imitation learning, a CNN is advantages over imitation learning. Agents can learn how to trained to learn human-like control from a given image and drive over many trials in a simulation, and they can be features. However, there are some drawbacks associated with trained from a near-inŸnite number of possible cases without the need for labeled data. Moreover, reinforcement imitation learning. First, imitation learning cannot encom- pass the very diverse number of possible cases that can occur learning has the potential to outperform human drivers in association with driving. Additionally, imitation learning because the driving performance of systems trained through requires large amounts of labeled data for training, which reinforcement learning is not limited by the training dataset. must be collected from actual driving environments and is Nevertheless, deploying reinforcement learning in a real therefore cost-ineŠective and labor-intensive. vehicle remains challenging, in part because of distribution 2 Journal of Robotics Steering, rottle PPO2 Agent Simulator Share Road width Reward, Observation Camera Parameters GNSS/INS North East Elevation Synthesized label Le Camera Loss Center Camera Right Camera image FCN Segmentation Backpropagation steering FCN PPO2 throttle Figure 1: Overall architecture of our method. shifts between the simulations in which they are trained and regression to mediate problems that may occur from dis- the real world [2]. Distribution shift is one of the main tribution shift, covariate shift, and dynamics. *e experi- reasons why a trained model might perform poorly in a real- ments conducted in this article demonstrated that this world test environment. Covariate shift, specifically, refers to simple method was effective in reducing the concept shift the difference between trained input data and testing input between the simulation and real environment. However, modeling the complicated car dynamics in a way that data [3]. Because a simulation cannot perfectly reconstruct the real world, there are often mismatches between the real considers the many relevant parameters requires significant and simulation scenes, which results in negative effects on a computational effort and can often not be generalized to model’s driving performance [4]. Alternatively, concept shift other types of vehicles, our simple and data-driven approach refers to the difference in the relationships between labels can be used on different devices and vehicles with little and their given inputs [2]. For example, the correct control modification. command for a given image might be different in a simu- *e overall architecture of the proposed method, and in lation and in the real world because of the differences in particular its training phase, is illustrated in Figure 1. *ere dynamics of the two environments. are two parts of the training phase. *e first is for training *e covariate shift between a simulator and the real the semantic segmentation network with synthesized labels world can be relieved using intermediate representations of from the global navigation satellite systems (GNSS) and inertial navigation systems (INS) data. *e second is to train the input [5]. For example, two-class semantic segmentation narrows the gap between the simulator and the real world. the RL model using our calibrated simulator. Our simulator For this to work, simulators can easily produce binary and synthesized labels share the same road width and images. In the real world, images can be processed into camera parameters, including focal length, camera height, binary images by semantic segmentation networks. Using and tilt angle. these binary images instead of the raw images can help Figure 2 shows the testing phase of our method, wherein reduce covariance shift. semantic segmentation and RL model inference are con- In this article, dynamic calibration was used to enable a ducted sequentially. *e steering and throttle values pro- trained model to behave similarly in both simulations and duced by the RL model are then passed to the control system of the test vehicle. the real world. A simple vehicle dynamics model was used, and only four parameters were fine-tuned using linear In this article, our main contributions are as follows: Journal of Robotics 3 steering throttle Center Camera Input FCN Segmentation PPO2 policy Action Figure 2: Pipeline of our method used in inference mode. (i) Reconstruction of pseudoroad segmentation labels trained in simulations perform well in the real world [14–17]. from GNSS/INS records (ii) Simulator dynamic calibration through the linear regression of actual driving records and execution of 3. Method the RL model 3.1. Semantic Segmentation Using GNSS/INS. For the se- (iii) Driving test of our algorithm in a real vehicle and mantic segmentation of the road area, a fully convolutional road, demonstrating its efficacy network (FCN) [7] was trained with images and labels. As semantic segmentation is trained using supervised learning, 2. Related Works segmentation labels are necessary for each corresponding raw image. To gather the tremendous number of necessary Autonomous driving has become a key research area in the segmentation labels, GNSS/INS data were utilized to syn- field of artificial intelligence. Pomerleau [6] introduced thesize pseudosegmentation labels, rather than rely on ALVINN, a military vehicle driven by an algorithm, and human labor. demonstrated that it could successfully drive on paved roads. GNSS/INS data can accurately measure the current lo- In 2016, Bojarski et al. [1] applied and achieved end-to-end cation of a vehicle, localizing its position within an error of learning of a self-driving task using a convolutional network. 0.40 meters at 20 Hz. As a vehicle drives around a track, its *eir model had five convolutional layers and two fully location data, including longitude and latitude, are recorded connected layers to output throttle and steering control. *e for every frame. Each location component is estimated in authors demonstrated that each of the convolutional filters meters. was able to successfully recognize the edges of roads without Figure 3 shows how the segmentation labels are pro- explicitly being provided that information. duced from GNSS/INS data. *e left image shows the lo- Semantic segmentation is a traditional computer vision cation points projected on a camera input, which is taken at task that classifies every pixel in a given image. Long et al. [7] one of the recorded locations. *e image in the middle shows used fully convolutional networks (FCNs) and skip con- the lateral points of each location points. *ey are predefined nections to leverage both coarse and fine-grained features distance away from its corresponding location point. Lastly, from the image for semantic segmentation. *ey achieved the road segmentation label is produced by polygons state-of-the-art performance on several segmentation composed of lateral points of each location point. challenges, including PASCAL-VOC [8] and SIFT flows. An FCN [7] with ResNet50 [18] backbone was used as Chen et al. [9] proposed DeepLab, which uses Atrous the semantic segmentation network. *e network consists of convolution and conditional random fields to increase the 57 layers, which takes 224 × 224 sized RGB image (3 performance of semantic segmentation, and achieved state- channels) as an input and produces an output tensor with of-the-art performance at PASCAL-VOC-2012 [8]. the shape of 21 × 224 × 224. We used the default number of Reinforcement learning is one of the most important classes, which is 21, to load the pretrained weights of FCN. branches of artificial intelligence. Mnih et al. [10] applied *e training dataset consisted of two types of labeled deep convolutional networks to reinforcement learning to datasets. *e first was the dataset with synthesized seg- allow an agent to play Atari games. *eir trained model mentation labels from the GNSS/INS records. *e other outperformed humans for most of the games it played, and contained labels produced by humans. *e number of labels the authors showed that only a few consecutive images were in the two datasets was 6,492 and 969, respectively. *e necessary to train the reinforcement learning algorithm. model was trained for 100 epochs, with a learning rate of Deep reinforcement learning has also been applied to self- 0.001. Adam was used as the optimizer during training. driving to handle various scenarios, which cannot be solved with traditional rule-based algorithms [11–13]. Sim-to-real transfer, lastly, regards transferring a model 3.2.Simulator. *e simulator was designed to simulate real- that was trained in a virtual environment to the real world. world situations. *e global map in the simulator was Domain randomization is one of the representative tech- reconstructed as a 10,000 ×10,000 array filled with zeros. *e niques of sim-to-real. It randomizes various property of the global map was derived through accumulating the trajectory inputs, including brightness, contrast, and dynamics, to positions driven by an expert driver. Figure 4 shows the allow a trained model to consider the real-world input as just global map of the simulator and one of the rendered images. one of the randomized simulation data. Researchers have *e car dynamics in the simulator were designed to be as applied domain randomization techniques to make vehicles simple as possible. In the simulator, the car was considered a 4 Journal of Robotics Figure 3: Segmentation labeling process using GNSS/INS data. *e left image shows the location points projected on the corresponding image. *e center image shows the lateral points of each location. *e right image shows a segmentation label constructed from those lateral points. simple dynamics. Moreover, by using complicated and deep networks can cause an overfitting problem, which is critical for sim-to-real transfer. Observation refers to the data that are input to an RL model. *e simulator provides binary images of roads from the camera’s point of view. To compress the input image to a feature vector, the lengths of ten evenly spaced vertical lines were taken as observations. Figure 5 visualizes the obser- vation from the perspective of the simulation. In addition, to Figure 4: Simulator preview. *e left image shows the global map provide the agent with temporal information, the previous of our simulator. *e right image is a rendered image taken from (i.e., historical) steering and throttle values were added to the the camera’s point of view. observation. *erefore, the total number of observation features was 12. *e objective of reinforcement learning is to maximize point with a heading vector. *e movements depend on the the rewards. *e reward function is composed of four types steering and throttle inputs. Steering s ∈ [0, 1] determines of reward. We describe each reward in detail: the change in the heading angle (∆θ), and the throttle t ∈ [0, 1] determines the distance advanced along the R � R + P + R + P , t i e c (2) heading vector (∆n). We assume these variables have linear R � λ T. t t correlations between them. *us, those relationships can be modeled as the following equations. *e weights (w , w ) and Here, R is the throttle reward. *e throttle reward in- biases (b , b ) in the equations are parameters optimized by s t duces the vehicle to move forward. T refers to the throttle linear regression: value. λ is the weight of the throttle reward, which can be empirically determined. *e imbalance penalty is given as Δθ � w ∗ s + b , s s (1) |l − r| Δn � w ∗ t + b . t t (3) P � − . l + r *e outputs from RL models often show unrealistic Here, P is the imbalance penalty, which measures how actions. For example, the steering values from an RL model close the vehicle is to the center of a road. l is the distance of may rapidly change between the maximum and minimum the left side to the road boundary from the car, and r is the values, or the trembling of steering can help an agent to distance from the opposite direction. If the vehicle gets maximize rewards within the learning paradigm. However, closer to one of the road boundaries, the penalty increases. these unrealistic movements can often cause a catastrophic *e imbalance penalty was found to be useful in preventing breakdown of wheel motors in the real world. For this vehicles from driving in zigzags and encouraged a straighter reason, the maximum steering change was limited to M , path: which is obtained from actual driving data. ⎧ ⎪ , if c ∉ V, R � (4) 3.3. Reinforcement Learning. Proximal Policy Optimization (PPO2) [19] with Multilayer Perceptron Policy (MlpPolicy) 0, else. was used as the reinforcement learning model. MlpPolicy consists of two layers, with 64 features each. *e depth of the Here, R is the exploration reward that induces an agent policy network is shallow, and the number of features is to visit an unseen area. It outputs 1000/N when the car significantly lower than many modern CNN networks. reaches a new track tile [20], where N is the total number of However, MlpPolicy is still sufficient to train tasks where location points used to build the global map. Our simulator inputs are simple and state transitions are very consistent. determines that the agent arrives at an unvisited point only Our simulator provides binary images of roads and uses when the current closest location point c is not included in a Journal of Robotics 5 heading (a) (b) Figure 5: (a) Visualization of line segments for constructing the observation, with the lengths of 10 lines taken as an observation. (b) Visualization of l and r for calculating the imbalance penalty. Table 1: Hardware specification used in experiments. 0.8 GPU GTX 2080 Ti Camera field of view (v) 66.9 2 R : 0.84113 0.7 Camera field of view (h) 82.4 Total road length 1.8 km 0.6 Camera height 1.4 m 0.5 Average road width 8 m Vehicle width 2.5 m 0.4 Image resolution 224 × 224 Action frequency 10 Hz 0.3 Camera tilt degree 10 degrees GNSS/INS error <0.40 m 0.2 GNSS/INS frequency >20 Hz 0.5 0.6 0.7 0.8 0.9 1.0 Weight of the vehicle 6,480 kg Figure 6: Results of linear regression between throttle and ∆n, which is the distance advanced toward the heading direction. visited point set V. *is reward prevents the vehicle from continuously driving along a small circle. Lastly, the crash penalty is given as −λ T, if l � 0 or r � 0, 0.03 P � 􏼨 (5) 0, else. R : 0.92475 0.02 Here, P is the crash penalty that an agent receives 0.01 whenever the vehicle touches the edges of the road. P is proportional to the throttle value. 0.00 –0.01 3.4.ExperimentalSettings. *e total distance of the test road was 1.8 km. *e road was unpaved and covered with gravel –0.02 and dirt. *e average road width was approximately 8 m. *e –0.2 0.0 0.2 0.4 boundaries of the road were not clear, and there were grasses and trees outside the road. *e height of the camera attached Figure 7: Results of linear regression between steering and ∆θ, to the vehicle was approximately 1.4 m from the ground. *e which is the change of heading angle. test vehicle had six wheels and was a skid-type vehicle, which can reach speeds of up to 50 km/h. Table 1 depicts the details of the experimental setup. the difference between the two consecutive GNSS/INS po- sitions to heading angle vector of the vehicle. 4. Results and Analysis Figure 6 shows the scattered blue points, which represent To gather driving data, an expert driver drove along the the throttle and∆n pair recorded at each moment. Likewise, whole course of the test road. At each frame, the information the blue points in Figure 7 denote the pairs of steering and of the vehicle including steering, throttle, heading angle, and ∆θ recorded at each frame. *e red lines in Figures 6 and 7 position coordinates is recorded. We obtain ∆θ by calcu- visualize the results of linear regression conducted by the lating the difference between heading angle of the recorded least squared error method. Table 2 shows the calibrated two consecutive frames. Also,∆n is computed by projecting value of the model parameters w , b , w , and b , which were s t s t Δθ Δn 6 Journal of Robotics Table 2: Results of dynamic calibration. Parameter Regression result w 0.04495 b 1.25525e − 05 w 0.51856 b 0.0022277 Image SLIC Threshold Watershed Ours Figure 8: Qualitative results of various image processing methods for road segmentation. obtained from linear regression. Our simulator used this *e first row of Figure 9 shows the input images. *e last two rows are the segmented images that were inferred from model and these parameters to mimic the real-world dynamics. the two different models. *e first model was trained using Several pseudosegmentation labeling methods were the ground-truth labels of the test road, whereas the second implemented and compared in the present study. SLIC [21] model was trained using the synthesized pseudolabels. and Watershed [22] are methods based on superpixel al- Both Figures 9(b) and 9(c) show reasonable seg- gorithms. *e threshold method filters pixels whose values mentation performance. *e intersection over union are below a threshold. *e specific threshold was determined (IoU) was higher in (b), except for the three rightmost using the method from Otsu [23]. columns. (c) often predicted a road area that was narrower According to Figure 8, the SLIC algorithm appears to be than the ground truth because the road width of the synthesized labels was fixed at 6 m. *e road widths of the the most promising method, relative to our own. However, the SLIC method is vulnerable to discrepancies resulting synthesized labels and simulator were the same. *us, our from shade. Similarly, the threshold method produces noisy model can produce segmentation images that is more labels and misclassifies the sky as part of the road. Lastly, similar to the simulation scenes. *e input to our RL Watershed barely provided any useful segmentation labels. model was in the form of the lengths of 10 lines in a Two methods have recently been published for pseudo- segmentation image. *e Kullback–Leibler divergence semantic segmentation labeling [24, 25]. *ose methods use was calculated to compare the similarity between the class activation maps from GradCAM [26]. Unfortunately, distributions in simulator observations and the obser- the activation maps of our road images were not suitable for vations from the segmentation models. To calculate the obtaining the road areas, because these pretrained classifi- KL divergence, histograms of each line length were cation models classify roads as a part of the background. generated. *e formula is as follows: Journal of Robotics 7 (a) (b) (c) Figure 9: (a) Camera input image. (b) Inferenced segmentation results from the model trained with the ground-truth labels. (c) Seg- mentation results from the model trained with pseudolabels. p (x) KL � Σp (x)log . (6) q (x) i 2.0 where p (x) is the value of the i-th bin in the histogram of a segmentation model outputs and q (x) represents that of the 1.5 simulator outputs. *e KL divergence of each line is shown in Figure 10. According to Figure 10, the KL divergences 1.0 were lower in our segmentation model for each line, which implies that our model produces much more similar output to the simulator scenes than the comparator models. *e 0.5 average KL divergences for the model trained with ground truths and our model were 1.5450 and 0.42644, respectively. 0.0 *erefore, our pseudosegmentation labeling algorithm sig- 12345 6789 10 nificantly reduced the covariate shift between the simulator Observation lines and the real world. Ground truth labels To compare the suitability of the segmentation outputs Pseudo labels from these models, a dataset collected through actual human driving was used. *e dataset contains images and corre- Figure 10: Comparison by KL divergence. Our segmentation model output results that are more similar to the simulation than sponding values of the throttle and steering. *e images were does the model trained by the ground-truth dataset. processed by both segmentation models, and the RL model produced values of steering and throttle from the seg- the RL model than the model trained from the ground mentation images. *e steering values were compared with the values from the dataset, which are representative of truth. It is remarkable that the RL model behaved simi- actual human decisions. larly to human driving without requiring any steering or In Figure 11, the blue lines represent the visualization throttle data. of the steering values from the dataset collected from In the real environment experiment, our model was manual driving. *e above orange line shows the steering deployed in the test vehicle. Figure 12 shows the trajectory of outputs from segmentation model trained by the our model and the trajectory from human driving. Our model ground-truth labels. *e below orange line represents drove around the entire track without crashing, driving at an the steering values obtained from our model, which is average speed of 26.57 km/h. *e minimum, maximum, and trained with pseudolabels. From the figure, it is clear the speed during the 270-degree hairpin curve were 23.2 km/h, that our segmentation model is more suitable for use in 28.7 km/h, and 23.4 km/h, respectively. KL divergence with simulation 8 Journal of Robotics 0.4 0.2 0.0 –0.2 –0.4 0 500 1000 1500 2000 2500 3000 3500 0.4 0.2 0.0 –0.2 0 500 1000 1500 2000 2500 3000 3500 gt seg_rl_ours Figure 11: Steering recorded by human decisions (blue) and our algorithm (orange). Our method is scaled down to half for better comparison. 0 100 200 300 400 East (m) Manual driving trajectory Ours Figure 12: Recorded locations of the vehicle during the real environment test. 600 600 30 30 500 500 25 25 400 400 20 20 300 300 15 15 200 200 10 10 100 100 5 5 0 0 0 100 200 300 400 0 100 200 300 400 East (m) East (m) Figure 13: Visualized velocities at each point on the track. North (m) North (m) North (m) Journal of Robotics 9 Table 3: Results obtained from various deep reinforcement models typically output full throttle values. Table 5 shows learning models. that using both pseudolabeling and dynamic calibration resulted in steering values that were most similar to manual Deep RL method Reward driving. PPO2 3855.0± 5.00 *e equally distributed velocity heatmap in Figure 13 SAC 3756± 42.47 represents the optimal steering and throttle values that can A2C 3455.0± 105.19 be provided to the vehicle to drive the course. According to TD3 3054.0± 793.85 the results, the speed was maintained during the curved course unless it was a 270 hairpin curve course. *is may be Table 4: Student’s t-test results for comparing PPO2 with other RL considered an un-human-like driving method, but this style algorithms. of driving can be useful for strategic defense purposes. Strategic purposed vehicles, such as self-propelled artillery Method T P value Conclusion and armored vehicles, are required to move swiftly in curves SAC 6.94 1.7276e − 06 Reject H without decelerating, because doing so will leave them A2C 11.40 1.1552e − 09 Reject H vulnerable to enemy fire. TD3 3.03 0.007248 Reject H 5. Conclusion Table 5: Mean squared error of testing models. Applying reinforcement learning to autonomous driving has Pseudolabeling Dynamic calibration MSE been a significant challenge for researchers because of the O O 0.00643 severe mismatches between simulations and the real world. O X 0.01697 Our simulator used dynamic calibration to predict the ve- X O 0.05227 X X 0.07953 hicle’s next location from the given control commands. Moreover, two-class semantic segmentation, which distin- guishes the road from the background, was found to be Figure 13 shows the velocities recorded at each point on effective in reducing the gap between simulation scenes and the track. *e left image shows the velocities from when the real images. *ese methods demonstrated a positive effect on human was driving, and the right image is from our RL the sim-to-real performance of self-driving RL models. As a model. *e human driver was instructed to drive run the result, our model successfully drove on an unpaved road track clockwise along the center of the road at about 30 km/ track without derailment. h. According to the figure, the driver slowed down the vehicle at each turning points and accelerated at the straight 6. Discussion parts of the road. In contrast, our model drove at an almost constant speed. Human drivers consider the safety of the When the driving algorithm passes the simulation stage on human and vehicle when driving. However, the RL model the computer and tested in the real driving environment, was trained in a way that it drives as fast as possible without there are many restrictions besides the core algorithm that considering safety. must be considered. *is is because it is no longer a sim- Table 3 shows the results of applying various deep RL ulation, but a real driving in the off-road condition. When a models to our simulator. For performance comparison, we large vehicle weighing nearly 6.5 tons drives off-road with a used the representative deep reinforcement learning algo- high altitude difference at an average of 28 km/h, the re- rithms including Proximal Policy Optimization (PPO2), Soft strictions are more severe. *is is because driving in a sit- Actor Critic (SAC), Advantage Actor Critic (A2C), and Twin uation where errors and problems exist in the overall system Delayed Deep Deterministic Policy Gradients (TD3). *ose integration. Problems between tests continuously occur, algorithms are chosen because they show state-of-the-art which causes delay, and can continue only when these performances with appropriate hyperparameters and are problems are resolved. It was practically difficult to prove the recommended for continuous action environments. superiority and performance of one method to another Accordingtothereward comparisonresults,PPO2turnedout through autonomous driving results in a real environment to provide the highest reward among the methods. To validate by implementing various reinforcement learning-based statistical superiority of PPO2, we conductedt-test with other autonomous driving due to the project schedule and realistic RL methods, and the results are shown in Table 4. *erefore, conditions. Instead, the most probable and realistic algo- PPO2 was chosen to be the main RL algorithm to test on our rithm through selection and concentration process was vehicle. chosen through simulation, and then, the goal was to im- To validate the effectiveness of pseudolabeling and dy- plement it in actual driving. namic calibration, we evaluated the mean squared error (MSE) between the steering values from the manual driving and the values from the four types of testing models. *e Data Availability testing models were trained with and without pseudolab- eling and dynamic calibration. *e steering values from the *e data that support the findings of this study are available testing models were adjusted to half because the testing from the corresponding author upon reasonable request. 10 Journal of Robotics to-canonical adaptation networks,” 2019, https://arxiv.org/ Disclosure abs/1812.07252. [16] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, *e authors are with the Advanced Defense Technology “Sim-to-real transfer of robotic control with dynamics ran- Research Institute, Agency for Defense Development, domization,” in Proceedings of the IEEE International Conference Daejeon, 34186, South Korea. on Robotics and Automation (ICRA), pp. 3803–3810, IEEE, Philadelphia, PA, USA, 2018. Conflicts of Interest [17] Z. Xie, X. Da, M. v. d. Panne, B. Buck, and A. Garg, “Dynamics randomization revisited: a case study for quadrupedal loco- *e authors declare that they have no conflicts of interest. motion,” 2020, https://arxiv.org/abs/2011.02404. [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning References for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Silver Spring, [1] M. Bojarski, D. D. Testa, D. Dworakowski et al., “End to end MD, USA, 2016. learning for self-driving cars,” 2016, https://arxiv.org/abs/ [19] J. Schulman, W. Filip, P. Dhariwal, A. Radford, and 1604.07316. O. Klimov, “Proximal policy optimization algorithms,” 2017, [2] M. Kull and P. Flach, “Patterns of dataset shift,” in Proceedings https://arxiv.org/abs/1707.06347. of the First International Workshop on Learning over Multiple [20] R. Tan, J. Zhou, H. Du, S. Shang, and L. Dai, “A modeling Contexts, LMCE) at ECML-PKDD, Bristol, UK, 2014. processing method for video games based on deep rein- [3] K. Kisamori, M. Kanagawa, and K. Yamazaki, “Simulator forcement learning,” in Proceedings of the 2019 IEEE 8th Joint calibration under covariate shift with kernels,” in Proceedings International Information Technology and Artificial Intelli- of the International Conference on Artificial Intelligence and gence Conference (ITAIC), pp. 939–942, IEEE, Chongqing, Statistics, pp. 1244–1253, PMLR, San Diego, CA, USA, 2020. China, 2019. [4] J. D. Chang, M. Uehara, D. Sreenivas, R. Kidambi, and [21] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and W. Sun, “Mitigating covariate shift in imitation learning via S. Susstrunk, ¨ “SLIC superpixels compared to state-of-the-art offline data without great coverage,” 2021, https://arxiv.org/ superpixel methods,” IEEE Transactions on Pattern Analysis abs/2106.03207. and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012. [5] H. Zhang-Wei, Y.-M. Chen, H.-K. Yang et al., “Virtual-to- [22] Z. Hu, Z. Qin, and Q. Li, “Watershed superpixel,” in Pro- real: learning to control in visual semantic segmentation,” in ceedings of the 2015 IEEE International Conference on Image Proceedings of the ACM International Joint Conferences on Processing (ICIP), Quebec City, Canada, 2015. Artificial Intelligence (IJCAI), Vienna, Austria, 2018. [23] N. Otsu, “A threshold selection method from gray-level [6] D. A. Pomerleau, “Alvinn: an autonomous land vehicle in a histograms,” IEEE transactions on systems, man, and cyber- neural network,” in Proceedings of the (NeurIPS) Neural In- netics, vol. 9, no. 1, pp. 62–66, 1979. formation Processing Systems, La Jolla, CA, USA, 1989. [24] X. Shi, S. Khademi, Y. Li, and J. V. Gemert, “Zoom-cam: [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional generating fine-grained pixel annotations from image labels,” networks for semantic segmentation,” in Proceedings of the in Proceedings of the 2020 25th International Conference on IEEE Conference on Computer Vision and Pattern Recognition, Pattern Recognition (ICPR), pp. 10289–10296, IEEE, Milano, pp. 3431–3440, Honolulu, HI, USA, 2015. Italy, 2021. [8] M. Everingham, S. M. A. Eslami, L. Van Gool, [25] Y. Zou, Z. Zhang, H. Zhang et al., “Pseudoseg: designing C. K. I. Williams, J. Winn, and A. Zisserman, “*e pascal pseudo labels for semantic segmentation,” 2020, https://arxiv. visual object classes challenge: a retrospective,” International org/abs/2010.09713. Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015. [26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, [9] L.-C. Chen, P. George, L. Kokkinos, K. Murphy, and and D. Batra, “Grad-cam: visual explanations from deep A. L. Yuille, “Deeplab: semantic image segmentation with networks via gradient-based localization,” in Proceedings of deep convolutional nets, atrous convolution, and fully con- the IEEE International Conference on Computer Vision, nected crfs,” IEEE Transactions on Pattern Analysis and pp. 618–626, Venice, Italy, 2017. Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017. [10] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing atari with deep reinforcement learning,” 2013, https://arxiv.org/abs/ 1312.5602. [11] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 29, no. 19, pp. 70–76, 2017. [12] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi- agent, reinforcement learning for autonomous driving,” 2016, https://arxiv.org/abs/1610.03295. [13] S. Wang, D. Jia, and X. Weng, “Deep reinforcement learning for autonomous driving,” 2018, https://arxiv.org/abs/2002. [14] Y. Chebotar, A. Handa, V. Makoviychuk et al., “Closing the sim-to-real loop: adapting simulation randomization with real world experience,” in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), 2019. [15] S. James, W. Paul, M. Kalakrishnan et al., “Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized- http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Robotics Hindawi Publishing Corporation

Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and Dynamic Calibration

Journal of Robotics , Volume 2022 – Jun 26, 2022

Loading next page...
 
/lp/hindawi-publishing-corporation/sim-to-real-reinforcement-learning-for-autonomous-driving-using-0tI0v2XS20

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2022 Jiseong Heo and Hyoung woo Lim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-9600
eISSN
1687-9619
DOI
10.1155/2022/9916292
Publisher site
See Article on Publisher Site

Abstract

Hindawi Journal of Robotics Volume 2022, Article ID 9916292, 10 pages https://doi.org/10.1155/2022/9916292 Research Article Sim-to-Real Reinforcement Learning for Autonomous Driving Using Pseudosegmentation Labeling and Dynamic Calibration Jiseong Heo and Hyoung woo Lim Agency for Defense Development, Daejeon, Republic of Korea Correspondence should be addressed to Hyoung woo Lim; hwlim@add.re.kr Received 17 January 2022; Accepted 31 May 2022; Published 26 June 2022 Academic Editor: Keigo Watanabe Copyright © 2022 Jiseong Heo and Hyoung woo Lim. �is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Applying reinforcement learning algorithms to autonomous driving is di€cult because of mismatches between the simulation in which the algorithm was trained and the real world. To address this problem, data from global navigation satellite systems and inertial navigation systems (GNSS/INS) were used to gather pseudolabels for semantic segmentation. A very simple dynamics model was used as a simulator, and dynamic parameters were obtained from the linear regression of manual driving records. Segmentation and a dynamic calibration method were found to be eŠective in easing the transition from a simulation to the real world. Pseudosegmentation labels are found to be more suitable for reinforcement learning models. We conducted tests on the e€cacy of our proposed method, and a vehicle using the proposed system successfully drove on an unpaved track for ap- proximately 1.8 km at an average speed of 26.57 km/h without incident. OŠ-road driving environments are quite diŠerent from 1. Introduction road driving environments, wherein the path is kept much Due to the improvements in deep convolutional neural more standard and uniform. In oŠ-road environments, there network architectures and graphical processing units, recent exists much more variability, such as grass growing in the research has aimed at applying deep learning algorithms to middle of the road, road curvature changes due to seasonal factors like rain and snow, and even color changes in the autonomous driving tasks. Previously, traditional computer vision algorithms, including edge detection and template road itself before and after rain. Autonomous driving based on images in oŠ-road environments is inevitably aŠected by matching, were used to infer how the vehicle should drive. Researchers utilized deep learning methods, including these various disturbances. To address this problem, we convolutional neural networks (CNNs) to leverage their require the ability to generalize oŠ-road driving environ- complicated features and enable the self-driving algorithm ments, and we also need a robust algorithm that can make to behave more intelligently. the best choice in a wide variety of driving situations. Imitation learning has been a common approach for In contrast, reinforcement learning (RL) has several autonomous driving [1]. During imitation learning, a CNN is advantages over imitation learning. Agents can learn how to trained to learn human-like control from a given image and drive over many trials in a simulation, and they can be features. However, there are some drawbacks associated with trained from a near-inŸnite number of possible cases without the need for labeled data. Moreover, reinforcement imitation learning. First, imitation learning cannot encom- pass the very diverse number of possible cases that can occur learning has the potential to outperform human drivers in association with driving. Additionally, imitation learning because the driving performance of systems trained through requires large amounts of labeled data for training, which reinforcement learning is not limited by the training dataset. must be collected from actual driving environments and is Nevertheless, deploying reinforcement learning in a real therefore cost-ineŠective and labor-intensive. vehicle remains challenging, in part because of distribution 2 Journal of Robotics Steering, rottle PPO2 Agent Simulator Share Road width Reward, Observation Camera Parameters GNSS/INS North East Elevation Synthesized label Le Camera Loss Center Camera Right Camera image FCN Segmentation Backpropagation steering FCN PPO2 throttle Figure 1: Overall architecture of our method. shifts between the simulations in which they are trained and regression to mediate problems that may occur from dis- the real world [2]. Distribution shift is one of the main tribution shift, covariate shift, and dynamics. *e experi- reasons why a trained model might perform poorly in a real- ments conducted in this article demonstrated that this world test environment. Covariate shift, specifically, refers to simple method was effective in reducing the concept shift the difference between trained input data and testing input between the simulation and real environment. However, modeling the complicated car dynamics in a way that data [3]. Because a simulation cannot perfectly reconstruct the real world, there are often mismatches between the real considers the many relevant parameters requires significant and simulation scenes, which results in negative effects on a computational effort and can often not be generalized to model’s driving performance [4]. Alternatively, concept shift other types of vehicles, our simple and data-driven approach refers to the difference in the relationships between labels can be used on different devices and vehicles with little and their given inputs [2]. For example, the correct control modification. command for a given image might be different in a simu- *e overall architecture of the proposed method, and in lation and in the real world because of the differences in particular its training phase, is illustrated in Figure 1. *ere dynamics of the two environments. are two parts of the training phase. *e first is for training *e covariate shift between a simulator and the real the semantic segmentation network with synthesized labels world can be relieved using intermediate representations of from the global navigation satellite systems (GNSS) and inertial navigation systems (INS) data. *e second is to train the input [5]. For example, two-class semantic segmentation narrows the gap between the simulator and the real world. the RL model using our calibrated simulator. Our simulator For this to work, simulators can easily produce binary and synthesized labels share the same road width and images. In the real world, images can be processed into camera parameters, including focal length, camera height, binary images by semantic segmentation networks. Using and tilt angle. these binary images instead of the raw images can help Figure 2 shows the testing phase of our method, wherein reduce covariance shift. semantic segmentation and RL model inference are con- In this article, dynamic calibration was used to enable a ducted sequentially. *e steering and throttle values pro- trained model to behave similarly in both simulations and duced by the RL model are then passed to the control system of the test vehicle. the real world. A simple vehicle dynamics model was used, and only four parameters were fine-tuned using linear In this article, our main contributions are as follows: Journal of Robotics 3 steering throttle Center Camera Input FCN Segmentation PPO2 policy Action Figure 2: Pipeline of our method used in inference mode. (i) Reconstruction of pseudoroad segmentation labels trained in simulations perform well in the real world [14–17]. from GNSS/INS records (ii) Simulator dynamic calibration through the linear regression of actual driving records and execution of 3. Method the RL model 3.1. Semantic Segmentation Using GNSS/INS. For the se- (iii) Driving test of our algorithm in a real vehicle and mantic segmentation of the road area, a fully convolutional road, demonstrating its efficacy network (FCN) [7] was trained with images and labels. As semantic segmentation is trained using supervised learning, 2. Related Works segmentation labels are necessary for each corresponding raw image. To gather the tremendous number of necessary Autonomous driving has become a key research area in the segmentation labels, GNSS/INS data were utilized to syn- field of artificial intelligence. Pomerleau [6] introduced thesize pseudosegmentation labels, rather than rely on ALVINN, a military vehicle driven by an algorithm, and human labor. demonstrated that it could successfully drive on paved roads. GNSS/INS data can accurately measure the current lo- In 2016, Bojarski et al. [1] applied and achieved end-to-end cation of a vehicle, localizing its position within an error of learning of a self-driving task using a convolutional network. 0.40 meters at 20 Hz. As a vehicle drives around a track, its *eir model had five convolutional layers and two fully location data, including longitude and latitude, are recorded connected layers to output throttle and steering control. *e for every frame. Each location component is estimated in authors demonstrated that each of the convolutional filters meters. was able to successfully recognize the edges of roads without Figure 3 shows how the segmentation labels are pro- explicitly being provided that information. duced from GNSS/INS data. *e left image shows the lo- Semantic segmentation is a traditional computer vision cation points projected on a camera input, which is taken at task that classifies every pixel in a given image. Long et al. [7] one of the recorded locations. *e image in the middle shows used fully convolutional networks (FCNs) and skip con- the lateral points of each location points. *ey are predefined nections to leverage both coarse and fine-grained features distance away from its corresponding location point. Lastly, from the image for semantic segmentation. *ey achieved the road segmentation label is produced by polygons state-of-the-art performance on several segmentation composed of lateral points of each location point. challenges, including PASCAL-VOC [8] and SIFT flows. An FCN [7] with ResNet50 [18] backbone was used as Chen et al. [9] proposed DeepLab, which uses Atrous the semantic segmentation network. *e network consists of convolution and conditional random fields to increase the 57 layers, which takes 224 × 224 sized RGB image (3 performance of semantic segmentation, and achieved state- channels) as an input and produces an output tensor with of-the-art performance at PASCAL-VOC-2012 [8]. the shape of 21 × 224 × 224. We used the default number of Reinforcement learning is one of the most important classes, which is 21, to load the pretrained weights of FCN. branches of artificial intelligence. Mnih et al. [10] applied *e training dataset consisted of two types of labeled deep convolutional networks to reinforcement learning to datasets. *e first was the dataset with synthesized seg- allow an agent to play Atari games. *eir trained model mentation labels from the GNSS/INS records. *e other outperformed humans for most of the games it played, and contained labels produced by humans. *e number of labels the authors showed that only a few consecutive images were in the two datasets was 6,492 and 969, respectively. *e necessary to train the reinforcement learning algorithm. model was trained for 100 epochs, with a learning rate of Deep reinforcement learning has also been applied to self- 0.001. Adam was used as the optimizer during training. driving to handle various scenarios, which cannot be solved with traditional rule-based algorithms [11–13]. Sim-to-real transfer, lastly, regards transferring a model 3.2.Simulator. *e simulator was designed to simulate real- that was trained in a virtual environment to the real world. world situations. *e global map in the simulator was Domain randomization is one of the representative tech- reconstructed as a 10,000 ×10,000 array filled with zeros. *e niques of sim-to-real. It randomizes various property of the global map was derived through accumulating the trajectory inputs, including brightness, contrast, and dynamics, to positions driven by an expert driver. Figure 4 shows the allow a trained model to consider the real-world input as just global map of the simulator and one of the rendered images. one of the randomized simulation data. Researchers have *e car dynamics in the simulator were designed to be as applied domain randomization techniques to make vehicles simple as possible. In the simulator, the car was considered a 4 Journal of Robotics Figure 3: Segmentation labeling process using GNSS/INS data. *e left image shows the location points projected on the corresponding image. *e center image shows the lateral points of each location. *e right image shows a segmentation label constructed from those lateral points. simple dynamics. Moreover, by using complicated and deep networks can cause an overfitting problem, which is critical for sim-to-real transfer. Observation refers to the data that are input to an RL model. *e simulator provides binary images of roads from the camera’s point of view. To compress the input image to a feature vector, the lengths of ten evenly spaced vertical lines were taken as observations. Figure 5 visualizes the obser- vation from the perspective of the simulation. In addition, to Figure 4: Simulator preview. *e left image shows the global map provide the agent with temporal information, the previous of our simulator. *e right image is a rendered image taken from (i.e., historical) steering and throttle values were added to the the camera’s point of view. observation. *erefore, the total number of observation features was 12. *e objective of reinforcement learning is to maximize point with a heading vector. *e movements depend on the the rewards. *e reward function is composed of four types steering and throttle inputs. Steering s ∈ [0, 1] determines of reward. We describe each reward in detail: the change in the heading angle (∆θ), and the throttle t ∈ [0, 1] determines the distance advanced along the R � R + P + R + P , t i e c (2) heading vector (∆n). We assume these variables have linear R � λ T. t t correlations between them. *us, those relationships can be modeled as the following equations. *e weights (w , w ) and Here, R is the throttle reward. *e throttle reward in- biases (b , b ) in the equations are parameters optimized by s t duces the vehicle to move forward. T refers to the throttle linear regression: value. λ is the weight of the throttle reward, which can be empirically determined. *e imbalance penalty is given as Δθ � w ∗ s + b , s s (1) |l − r| Δn � w ∗ t + b . t t (3) P � − . l + r *e outputs from RL models often show unrealistic Here, P is the imbalance penalty, which measures how actions. For example, the steering values from an RL model close the vehicle is to the center of a road. l is the distance of may rapidly change between the maximum and minimum the left side to the road boundary from the car, and r is the values, or the trembling of steering can help an agent to distance from the opposite direction. If the vehicle gets maximize rewards within the learning paradigm. However, closer to one of the road boundaries, the penalty increases. these unrealistic movements can often cause a catastrophic *e imbalance penalty was found to be useful in preventing breakdown of wheel motors in the real world. For this vehicles from driving in zigzags and encouraged a straighter reason, the maximum steering change was limited to M , path: which is obtained from actual driving data. ⎧ ⎪ , if c ∉ V, R � (4) 3.3. Reinforcement Learning. Proximal Policy Optimization (PPO2) [19] with Multilayer Perceptron Policy (MlpPolicy) 0, else. was used as the reinforcement learning model. MlpPolicy consists of two layers, with 64 features each. *e depth of the Here, R is the exploration reward that induces an agent policy network is shallow, and the number of features is to visit an unseen area. It outputs 1000/N when the car significantly lower than many modern CNN networks. reaches a new track tile [20], where N is the total number of However, MlpPolicy is still sufficient to train tasks where location points used to build the global map. Our simulator inputs are simple and state transitions are very consistent. determines that the agent arrives at an unvisited point only Our simulator provides binary images of roads and uses when the current closest location point c is not included in a Journal of Robotics 5 heading (a) (b) Figure 5: (a) Visualization of line segments for constructing the observation, with the lengths of 10 lines taken as an observation. (b) Visualization of l and r for calculating the imbalance penalty. Table 1: Hardware specification used in experiments. 0.8 GPU GTX 2080 Ti Camera field of view (v) 66.9 2 R : 0.84113 0.7 Camera field of view (h) 82.4 Total road length 1.8 km 0.6 Camera height 1.4 m 0.5 Average road width 8 m Vehicle width 2.5 m 0.4 Image resolution 224 × 224 Action frequency 10 Hz 0.3 Camera tilt degree 10 degrees GNSS/INS error <0.40 m 0.2 GNSS/INS frequency >20 Hz 0.5 0.6 0.7 0.8 0.9 1.0 Weight of the vehicle 6,480 kg Figure 6: Results of linear regression between throttle and ∆n, which is the distance advanced toward the heading direction. visited point set V. *is reward prevents the vehicle from continuously driving along a small circle. Lastly, the crash penalty is given as −λ T, if l � 0 or r � 0, 0.03 P � 􏼨 (5) 0, else. R : 0.92475 0.02 Here, P is the crash penalty that an agent receives 0.01 whenever the vehicle touches the edges of the road. P is proportional to the throttle value. 0.00 –0.01 3.4.ExperimentalSettings. *e total distance of the test road was 1.8 km. *e road was unpaved and covered with gravel –0.02 and dirt. *e average road width was approximately 8 m. *e –0.2 0.0 0.2 0.4 boundaries of the road were not clear, and there were grasses and trees outside the road. *e height of the camera attached Figure 7: Results of linear regression between steering and ∆θ, to the vehicle was approximately 1.4 m from the ground. *e which is the change of heading angle. test vehicle had six wheels and was a skid-type vehicle, which can reach speeds of up to 50 km/h. Table 1 depicts the details of the experimental setup. the difference between the two consecutive GNSS/INS po- sitions to heading angle vector of the vehicle. 4. Results and Analysis Figure 6 shows the scattered blue points, which represent To gather driving data, an expert driver drove along the the throttle and∆n pair recorded at each moment. Likewise, whole course of the test road. At each frame, the information the blue points in Figure 7 denote the pairs of steering and of the vehicle including steering, throttle, heading angle, and ∆θ recorded at each frame. *e red lines in Figures 6 and 7 position coordinates is recorded. We obtain ∆θ by calcu- visualize the results of linear regression conducted by the lating the difference between heading angle of the recorded least squared error method. Table 2 shows the calibrated two consecutive frames. Also,∆n is computed by projecting value of the model parameters w , b , w , and b , which were s t s t Δθ Δn 6 Journal of Robotics Table 2: Results of dynamic calibration. Parameter Regression result w 0.04495 b 1.25525e − 05 w 0.51856 b 0.0022277 Image SLIC Threshold Watershed Ours Figure 8: Qualitative results of various image processing methods for road segmentation. obtained from linear regression. Our simulator used this *e first row of Figure 9 shows the input images. *e last two rows are the segmented images that were inferred from model and these parameters to mimic the real-world dynamics. the two different models. *e first model was trained using Several pseudosegmentation labeling methods were the ground-truth labels of the test road, whereas the second implemented and compared in the present study. SLIC [21] model was trained using the synthesized pseudolabels. and Watershed [22] are methods based on superpixel al- Both Figures 9(b) and 9(c) show reasonable seg- gorithms. *e threshold method filters pixels whose values mentation performance. *e intersection over union are below a threshold. *e specific threshold was determined (IoU) was higher in (b), except for the three rightmost using the method from Otsu [23]. columns. (c) often predicted a road area that was narrower According to Figure 8, the SLIC algorithm appears to be than the ground truth because the road width of the synthesized labels was fixed at 6 m. *e road widths of the the most promising method, relative to our own. However, the SLIC method is vulnerable to discrepancies resulting synthesized labels and simulator were the same. *us, our from shade. Similarly, the threshold method produces noisy model can produce segmentation images that is more labels and misclassifies the sky as part of the road. Lastly, similar to the simulation scenes. *e input to our RL Watershed barely provided any useful segmentation labels. model was in the form of the lengths of 10 lines in a Two methods have recently been published for pseudo- segmentation image. *e Kullback–Leibler divergence semantic segmentation labeling [24, 25]. *ose methods use was calculated to compare the similarity between the class activation maps from GradCAM [26]. Unfortunately, distributions in simulator observations and the obser- the activation maps of our road images were not suitable for vations from the segmentation models. To calculate the obtaining the road areas, because these pretrained classifi- KL divergence, histograms of each line length were cation models classify roads as a part of the background. generated. *e formula is as follows: Journal of Robotics 7 (a) (b) (c) Figure 9: (a) Camera input image. (b) Inferenced segmentation results from the model trained with the ground-truth labels. (c) Seg- mentation results from the model trained with pseudolabels. p (x) KL � Σp (x)log . (6) q (x) i 2.0 where p (x) is the value of the i-th bin in the histogram of a segmentation model outputs and q (x) represents that of the 1.5 simulator outputs. *e KL divergence of each line is shown in Figure 10. According to Figure 10, the KL divergences 1.0 were lower in our segmentation model for each line, which implies that our model produces much more similar output to the simulator scenes than the comparator models. *e 0.5 average KL divergences for the model trained with ground truths and our model were 1.5450 and 0.42644, respectively. 0.0 *erefore, our pseudosegmentation labeling algorithm sig- 12345 6789 10 nificantly reduced the covariate shift between the simulator Observation lines and the real world. Ground truth labels To compare the suitability of the segmentation outputs Pseudo labels from these models, a dataset collected through actual human driving was used. *e dataset contains images and corre- Figure 10: Comparison by KL divergence. Our segmentation model output results that are more similar to the simulation than sponding values of the throttle and steering. *e images were does the model trained by the ground-truth dataset. processed by both segmentation models, and the RL model produced values of steering and throttle from the seg- the RL model than the model trained from the ground mentation images. *e steering values were compared with the values from the dataset, which are representative of truth. It is remarkable that the RL model behaved simi- actual human decisions. larly to human driving without requiring any steering or In Figure 11, the blue lines represent the visualization throttle data. of the steering values from the dataset collected from In the real environment experiment, our model was manual driving. *e above orange line shows the steering deployed in the test vehicle. Figure 12 shows the trajectory of outputs from segmentation model trained by the our model and the trajectory from human driving. Our model ground-truth labels. *e below orange line represents drove around the entire track without crashing, driving at an the steering values obtained from our model, which is average speed of 26.57 km/h. *e minimum, maximum, and trained with pseudolabels. From the figure, it is clear the speed during the 270-degree hairpin curve were 23.2 km/h, that our segmentation model is more suitable for use in 28.7 km/h, and 23.4 km/h, respectively. KL divergence with simulation 8 Journal of Robotics 0.4 0.2 0.0 –0.2 –0.4 0 500 1000 1500 2000 2500 3000 3500 0.4 0.2 0.0 –0.2 0 500 1000 1500 2000 2500 3000 3500 gt seg_rl_ours Figure 11: Steering recorded by human decisions (blue) and our algorithm (orange). Our method is scaled down to half for better comparison. 0 100 200 300 400 East (m) Manual driving trajectory Ours Figure 12: Recorded locations of the vehicle during the real environment test. 600 600 30 30 500 500 25 25 400 400 20 20 300 300 15 15 200 200 10 10 100 100 5 5 0 0 0 100 200 300 400 0 100 200 300 400 East (m) East (m) Figure 13: Visualized velocities at each point on the track. North (m) North (m) North (m) Journal of Robotics 9 Table 3: Results obtained from various deep reinforcement models typically output full throttle values. Table 5 shows learning models. that using both pseudolabeling and dynamic calibration resulted in steering values that were most similar to manual Deep RL method Reward driving. PPO2 3855.0± 5.00 *e equally distributed velocity heatmap in Figure 13 SAC 3756± 42.47 represents the optimal steering and throttle values that can A2C 3455.0± 105.19 be provided to the vehicle to drive the course. According to TD3 3054.0± 793.85 the results, the speed was maintained during the curved course unless it was a 270 hairpin curve course. *is may be Table 4: Student’s t-test results for comparing PPO2 with other RL considered an un-human-like driving method, but this style algorithms. of driving can be useful for strategic defense purposes. Strategic purposed vehicles, such as self-propelled artillery Method T P value Conclusion and armored vehicles, are required to move swiftly in curves SAC 6.94 1.7276e − 06 Reject H without decelerating, because doing so will leave them A2C 11.40 1.1552e − 09 Reject H vulnerable to enemy fire. TD3 3.03 0.007248 Reject H 5. Conclusion Table 5: Mean squared error of testing models. Applying reinforcement learning to autonomous driving has Pseudolabeling Dynamic calibration MSE been a significant challenge for researchers because of the O O 0.00643 severe mismatches between simulations and the real world. O X 0.01697 Our simulator used dynamic calibration to predict the ve- X O 0.05227 X X 0.07953 hicle’s next location from the given control commands. Moreover, two-class semantic segmentation, which distin- guishes the road from the background, was found to be Figure 13 shows the velocities recorded at each point on effective in reducing the gap between simulation scenes and the track. *e left image shows the velocities from when the real images. *ese methods demonstrated a positive effect on human was driving, and the right image is from our RL the sim-to-real performance of self-driving RL models. As a model. *e human driver was instructed to drive run the result, our model successfully drove on an unpaved road track clockwise along the center of the road at about 30 km/ track without derailment. h. According to the figure, the driver slowed down the vehicle at each turning points and accelerated at the straight 6. Discussion parts of the road. In contrast, our model drove at an almost constant speed. Human drivers consider the safety of the When the driving algorithm passes the simulation stage on human and vehicle when driving. However, the RL model the computer and tested in the real driving environment, was trained in a way that it drives as fast as possible without there are many restrictions besides the core algorithm that considering safety. must be considered. *is is because it is no longer a sim- Table 3 shows the results of applying various deep RL ulation, but a real driving in the off-road condition. When a models to our simulator. For performance comparison, we large vehicle weighing nearly 6.5 tons drives off-road with a used the representative deep reinforcement learning algo- high altitude difference at an average of 28 km/h, the re- rithms including Proximal Policy Optimization (PPO2), Soft strictions are more severe. *is is because driving in a sit- Actor Critic (SAC), Advantage Actor Critic (A2C), and Twin uation where errors and problems exist in the overall system Delayed Deep Deterministic Policy Gradients (TD3). *ose integration. Problems between tests continuously occur, algorithms are chosen because they show state-of-the-art which causes delay, and can continue only when these performances with appropriate hyperparameters and are problems are resolved. It was practically difficult to prove the recommended for continuous action environments. superiority and performance of one method to another Accordingtothereward comparisonresults,PPO2turnedout through autonomous driving results in a real environment to provide the highest reward among the methods. To validate by implementing various reinforcement learning-based statistical superiority of PPO2, we conductedt-test with other autonomous driving due to the project schedule and realistic RL methods, and the results are shown in Table 4. *erefore, conditions. Instead, the most probable and realistic algo- PPO2 was chosen to be the main RL algorithm to test on our rithm through selection and concentration process was vehicle. chosen through simulation, and then, the goal was to im- To validate the effectiveness of pseudolabeling and dy- plement it in actual driving. namic calibration, we evaluated the mean squared error (MSE) between the steering values from the manual driving and the values from the four types of testing models. *e Data Availability testing models were trained with and without pseudolab- eling and dynamic calibration. *e steering values from the *e data that support the findings of this study are available testing models were adjusted to half because the testing from the corresponding author upon reasonable request. 10 Journal of Robotics to-canonical adaptation networks,” 2019, https://arxiv.org/ Disclosure abs/1812.07252. [16] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, *e authors are with the Advanced Defense Technology “Sim-to-real transfer of robotic control with dynamics ran- Research Institute, Agency for Defense Development, domization,” in Proceedings of the IEEE International Conference Daejeon, 34186, South Korea. on Robotics and Automation (ICRA), pp. 3803–3810, IEEE, Philadelphia, PA, USA, 2018. Conflicts of Interest [17] Z. Xie, X. Da, M. v. d. Panne, B. Buck, and A. Garg, “Dynamics randomization revisited: a case study for quadrupedal loco- *e authors declare that they have no conflicts of interest. motion,” 2020, https://arxiv.org/abs/2011.02404. [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning References for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Silver Spring, [1] M. Bojarski, D. D. Testa, D. Dworakowski et al., “End to end MD, USA, 2016. learning for self-driving cars,” 2016, https://arxiv.org/abs/ [19] J. Schulman, W. Filip, P. Dhariwal, A. Radford, and 1604.07316. O. Klimov, “Proximal policy optimization algorithms,” 2017, [2] M. Kull and P. Flach, “Patterns of dataset shift,” in Proceedings https://arxiv.org/abs/1707.06347. of the First International Workshop on Learning over Multiple [20] R. Tan, J. Zhou, H. Du, S. Shang, and L. Dai, “A modeling Contexts, LMCE) at ECML-PKDD, Bristol, UK, 2014. processing method for video games based on deep rein- [3] K. Kisamori, M. Kanagawa, and K. Yamazaki, “Simulator forcement learning,” in Proceedings of the 2019 IEEE 8th Joint calibration under covariate shift with kernels,” in Proceedings International Information Technology and Artificial Intelli- of the International Conference on Artificial Intelligence and gence Conference (ITAIC), pp. 939–942, IEEE, Chongqing, Statistics, pp. 1244–1253, PMLR, San Diego, CA, USA, 2020. China, 2019. [4] J. D. Chang, M. Uehara, D. Sreenivas, R. Kidambi, and [21] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and W. Sun, “Mitigating covariate shift in imitation learning via S. Susstrunk, ¨ “SLIC superpixels compared to state-of-the-art offline data without great coverage,” 2021, https://arxiv.org/ superpixel methods,” IEEE Transactions on Pattern Analysis abs/2106.03207. and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012. [5] H. Zhang-Wei, Y.-M. Chen, H.-K. Yang et al., “Virtual-to- [22] Z. Hu, Z. Qin, and Q. Li, “Watershed superpixel,” in Pro- real: learning to control in visual semantic segmentation,” in ceedings of the 2015 IEEE International Conference on Image Proceedings of the ACM International Joint Conferences on Processing (ICIP), Quebec City, Canada, 2015. Artificial Intelligence (IJCAI), Vienna, Austria, 2018. [23] N. Otsu, “A threshold selection method from gray-level [6] D. A. Pomerleau, “Alvinn: an autonomous land vehicle in a histograms,” IEEE transactions on systems, man, and cyber- neural network,” in Proceedings of the (NeurIPS) Neural In- netics, vol. 9, no. 1, pp. 62–66, 1979. formation Processing Systems, La Jolla, CA, USA, 1989. [24] X. Shi, S. Khademi, Y. Li, and J. V. Gemert, “Zoom-cam: [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional generating fine-grained pixel annotations from image labels,” networks for semantic segmentation,” in Proceedings of the in Proceedings of the 2020 25th International Conference on IEEE Conference on Computer Vision and Pattern Recognition, Pattern Recognition (ICPR), pp. 10289–10296, IEEE, Milano, pp. 3431–3440, Honolulu, HI, USA, 2015. Italy, 2021. [8] M. Everingham, S. M. A. Eslami, L. Van Gool, [25] Y. Zou, Z. Zhang, H. Zhang et al., “Pseudoseg: designing C. K. I. Williams, J. Winn, and A. Zisserman, “*e pascal pseudo labels for semantic segmentation,” 2020, https://arxiv. visual object classes challenge: a retrospective,” International org/abs/2010.09713. Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015. [26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, [9] L.-C. Chen, P. George, L. Kokkinos, K. Murphy, and and D. Batra, “Grad-cam: visual explanations from deep A. L. Yuille, “Deeplab: semantic image segmentation with networks via gradient-based localization,” in Proceedings of deep convolutional nets, atrous convolution, and fully con- the IEEE International Conference on Computer Vision, nected crfs,” IEEE Transactions on Pattern Analysis and pp. 618–626, Venice, Italy, 2017. Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017. [10] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing atari with deep reinforcement learning,” 2013, https://arxiv.org/abs/ 1312.5602. [11] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 29, no. 19, pp. 70–76, 2017. [12] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi- agent, reinforcement learning for autonomous driving,” 2016, https://arxiv.org/abs/1610.03295. [13] S. Wang, D. Jia, and X. Weng, “Deep reinforcement learning for autonomous driving,” 2018, https://arxiv.org/abs/2002. [14] Y. Chebotar, A. Handa, V. Makoviychuk et al., “Closing the sim-to-real loop: adapting simulation randomization with real world experience,” in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), 2019. [15] S. James, W. Paul, M. Kalakrishnan et al., “Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-

Journal

Journal of RoboticsHindawi Publishing Corporation

Published: Jun 26, 2022

References