Learning a Robust Hybrid Descriptor for Robot Visual Localization
Qingwu Shi, Junjun Wu, Zeqin Lin, and Ningwei Qin
2022-05-19
Hindawi Journal of Robotics, Volume 2022, Article ID 9354909, 11 pages. https://doi.org/10.1155/2022/9354909

Research Article

Learning a Robust Hybrid Descriptor for Robot Visual Localization

Qingwu Shi, Junjun Wu, Zeqin Lin, and Ningwei Qin
School of Mechatronic Engineering and Automation, Foshan University, Foshan, China
Correspondence should be addressed to Junjun Wu; jjunwu@fosu.edu.cn
Received 11 February 2022; Revised 23 March 2022; Accepted 19 April 2022; Published 19 May 2022
Academic Editor: Xianfeng Yuan
Copyright © 2022 Qingwu Shi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract. Long-term robust visual localization is one of the main challenges of long-term visual navigation for mobile robots. Because of factors such as illumination, weather, and season, a mobile robot that navigates continuously with visual information in a complex scene may fail to localize within a few hours. Semantic segmentation images, however, are more stable than the original images under drastically changing environments. To exploit the advantages of both the semantic segmentation image and its original image, this paper builds on recent work in semantic segmentation and proposes a novel hybrid descriptor for long-term visual localization, generated by combining a semantic image descriptor extracted from segmentation images and an image descriptor extracted from RGB images with a certain weight, and then trained by a convolutional neural network. Our experiments show that the localization performance of our method, which combines the advantages of the semantic image descriptor and the image descriptor, is superior to long-term visual localization methods that use only an image descriptor or only a semantic image descriptor. Finally, our experimental results mostly exceed state-of-the-art 2D image-based localization methods under various challenging environmental conditions in the Extended CMU Seasons and RobotCar Seasons datasets at specific precision metrics.

1. Introduction

Visual localization is a key part of SLAM for mobile robots; it helps the robot determine its approximate position and orientation. In GPS-constrained environments, it plays a vital role in navigation for mobile robots [1]. When a robot performs visual navigation, it usually builds an environmental map based on a scene representation acquired under certain environmental conditions. However, owing to influences such as weather, illumination, and season, when the robot moves over a large range for a long time, the environmental conditions of the currently observed scene may differ greatly from those of the map. Therefore, long-term visual localization methods need to deal with all these appearance variations [2, 3]. Visual localization under such challenging environments has attracted extensive attention from researchers [4–8]. This paper therefore focuses on the long-term visual localization problem for robots in complex environments, namely, finding the pose in the currently constructed map whose image is most similar to the currently observed image. With this coarse global localization, an initial pose can be provided for the regression of a locally high-precision 6-DOF camera pose by hierarchical localization methods [9–11].

Traditional methods such as SIFT, SURF, and ORB rely on point descriptors for visual localization. Recently, global image descriptors [12] extracted by deep convolutional neural networks have performed better than these traditional point descriptors. However, such methods only aggregate the features of image regions without considering the semantic information they contain. Intuitively, one of the main challenges for mobile robots performing long-term work is still obtaining a representation of images that is stable under changing conditions. Semantic information about objects in the scene, extracted by semantic segmentation or object detection, can provide such an invariant representation. For example, the semantic label of a tree does not change whether or not the tree is covered with snow, so visual localization methods that use semantic information have attracted the attention of researchers [4–7, 13].
In summary, to improve the accuracy of long-term visual localization for mobile robots in complex and changing environments, a novel long-term visual localization method based on a hybrid descriptor is proposed; the descriptor is generated by combining a semantic image descriptor extracted from segmentation images and an image descriptor extracted from RGB images. However, the performance of CNN-based semantic segmentation depends heavily on semantic labels, which are expensive and time-consuming to obtain. Therefore, to reduce the large cost of manual labeling, this paper introduces 3D geometric consistency supervision into the training of the segmentation network PSPNet [14], so that segmentations of the same scene under changing environmental conditions are largely consistent. Finally, the effectiveness of the method is verified on the Extended CMU Seasons and RobotCar Seasons datasets. The contributions of this paper are as follows:

(i) A new long-term visual localization method based on a hybrid descriptor is proposed. The compact hybrid descriptor is generated by concatenating a semantic image descriptor extracted from semantic segmentation images and an image descriptor extracted from RGB images with a certain weight, and is then trained by a convolutional neural network.

(ii) This paper introduces 3D geometric consistency supervision into the training of the segmentation network PSPNet to obtain the semantic labels of the training and testing datasets at little labor cost.

(iii) This paper shows that the localization performance of our method, which combines the advantages of the semantic image descriptor and the image descriptor, is superior to that of long-term visual localization methods using only an image descriptor or only a semantic image descriptor. Moreover, our method is comparable to state-of-the-art 2D image-based localization methods under various challenging environmental conditions in the Extended CMU Seasons and RobotCar Seasons datasets at specific precision metrics.

The paper is organized as follows. Section 1 introduces the research background and defines the challenges and our contributions. Section 2 reviews related work, mainly covering semantic segmentation, domain adaptation, and long-term visual localization in changing environments. Section 3 describes the network architecture and the loss functions of our method in detail. Section 4 presents the experimental protocol, including datasets, evaluation metrics, experimental results, and their analysis. Section 5 summarizes the work and outlines future research.

2. Related Works

2.1. Semantic Segmentation. Semantic segmentation is the task of assigning a category label to each pixel of an input image and is a very important task for the visual perception of mobile robots. Early work mainly used manually designed descriptors or probabilistic graphical models. In recent years, semantic segmentation based on deep convolutional neural networks has proved superior to traditional methods. The pioneering work of Long et al. [15] showed that convolutional neural networks (CNN) originally used for classification, such as AlexNet or VGG, can be transformed into fully convolutional networks (FCN) for semantic segmentation. Follow-up work improved on the network structure of [15], for example by expanding the receptive field [16, 17], exploiting global context information [14], or fusing multiscale features [18, 19]. In addition, some work combined FCN [15] with probabilistic graphical models such as conditional random fields as a post-processing step [16].

However, the performance of CNN-based semantic segmentation depends heavily on semantic labels, which are expensive and time-consuming to obtain. Consequently, many weakly supervised methods have been proposed that use labels in the form of bounding boxes [20], image-level tags [21], or points [22]. In addition, [23] obtains semantic labels in a semiautomatic way, which requires lower manual cost than pixel-level annotation while improving segmentation performance. This paper adopts a method similar to [23] to obtain the segmentation maps of Mapillary street-level sequences [24].
2.2. Domain Adaptation. Training deep learning models requires a large amount of labeled data, but manually labeling such data is time-consuming and laborious. When pixel-level annotation is available for a source task, the purpose of domain adaptation is to transfer knowledge from the source task so that the model also performs well on the target task. Early work includes [25, 26], which maps features of the target domain into the source feature space [25] or into a domain-invariant feature space [26]. Other researchers focus on domain adaptation for CNN models [27, 28]. These methods mainly aim to make the learned model produce domain-invariant features, either by training the network with an adversarial loss that promotes confusion between the source and target domains [27] or by keeping the feature distributions of the source and target domains consistent [28]. Recently, several domain adaptation methods have been proposed for semantic segmentation [29–33]. Most of them [29–31] use synthetic datasets such as [34], which can automatically generate a large number of annotated synthetic images. The method proposed in [31] uses an image translation technique that converts images from the source domain into the target domain before performing segmentation. Another common approach is to train the network with an adversarial loss, as in [32], so that the network fools a domain discriminator into seeing roughly the same feature distribution from both domains.

Although domain adaptation can also provide semantic labels for the training and test datasets, its performance is limited. We therefore introduce 3D geometric consistency as the supervision signal for training the segmentation network PSPNet to obtain the semantic labels of the training and test datasets used in this paper, so that the semantic labels of images of the same scene under different environmental conditions are largely consistent.
2.3. Long-Term Visual Localization. Because the benchmark datasets proposed in [2, 3] are challenging and provide convincing evaluation metrics, they have greatly promoted research on long-term visual localization. Current long-term visual localization methods generally include sequence-based image retrieval methods [35], learning-based local feature localization methods [36, 37], 3D structure-based localization methods [38–40], 2D image-based localization methods [5, 12, 41–45], and hierarchical localization methods [9–11].

2D image-based localization methods have great advantages in robustness and efficiency, so this paper focuses on them. They do not use any form of 3D reasoning to compute the pose of the query image and are usually used for place recognition or loop closure detection in visual SLAM. Given a set of environmental map images with known camera poses, 2D image-based localization approximates the pose of the currently observed image (i.e., the query image) by the pose of the map image with the most similar visual appearance. Since 2D image-based localization generally performs well only at coarse precision, hierarchical localization methods use the initial pose obtained this way to further regress a high-precision 6-DOF camera pose.

VLAD [46] is a classic method for 2D image-based localization or place recognition under ideal conditions, but it is not robust for long-term visual localization under dramatically changing conditions. Building on it, DenseVLAD [41] matches VLAD-aggregated RootSIFT descriptors for visual localization. Subsequently, 2D image-based long-term visual localization has advanced considerably with the help of CNN models; NetVLAD [12] integrates the traditional VLAD algorithm into a CNN to achieve end-to-end visual localization.

To improve localization in complex environments, many works exploit semantic information [5, 13, 23, 44], context information [45], and depth information [5, 47] within convolutional neural network architectures to learn scene descriptors that are invariant to environmental conditions. However, such auxiliary information usually requires large labor costs to obtain. This paper also makes full use of semantic information as auxiliary information to overcome the impact of illumination, weather, and season on visual localization tasks. To reduce the high manual labeling cost of obtaining this auxiliary information, we introduce 3D geometric consistency as the supervision signal for training the segmentation network PSPNet to obtain the semantic labels of the training and testing datasets used in this paper.

3. Proposed Method

3.1. Network Model Structure. Figure 1 shows the network model structure of the proposed long-term visual localization method. First, this paper adopts a method similar to [23] to obtain the segmentation images. The specific steps are as follows: (1) the 2D-2D matches between two images of the same scene taken under different conditions, provided by [23], constrain the training of the segmentation network PSPNet, i.e., the segmentation maps of two images of the same scene should be consistent; and (2) using the cross-season correspondence dataset of [23], some coarsely annotated images, and the correspondence loss, the segmentations of images of the same scene become roughly consistent under changing environmental conditions. For more details, please refer to [23].
The network model consists of four parts: (1) a VGG16 network trained to extract 16K-dimensional semantic image descriptors from segmentation images; (2) another VGG16 network trained to extract 16K-dimensional image descriptors from RGB images; (3) a combination step that concatenates the semantic image descriptor and the image descriptor with weights λ1 and λ2 (λ1 + λ2 = 1), respectively, to obtain a 16K-dimensional hybrid descriptor; and (4) a convolutional neural network composed of three convolutional layers and two fully connected layers that converts the 16K-dimensional hybrid descriptor into a 1024-dimensional learned hybrid descriptor for visual localization.

When the mobile robot builds the environment map incrementally, every observed image is processed with our method to generate a 1024-dimensional learned hybrid descriptor for the visual localization module. This descriptor carries an invariant representation of the image, so the environment map built by the robot exists as a feature database. The main task of the visual localization module is to continuously measure the distance between the feature generated from the currently observed image and the feature database using the L1 distance; the pose of the currently observed image is then approximated by the known pose of the candidate image with the smallest distance.
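As a concrete illustration of steps (3) and (4), the following PyTorch sketch builds the 1024-dimensional learned hybrid descriptor from the two 16K-dimensional branch descriptors. The channel counts and kernel sizes are our assumptions (the paper specifies only the Conv + PReLU / Max Pooling / FC layout of Figure 1 and the 16K to 1024 dimensions), and the weighted combination is read here as a weighted sum so that the hybrid descriptor stays 16K-dimensional; note that Section 3.1 lists the weights in the order semantic/image while the ablation in Section 4.13 associates λ1 with the RGB branch, and the sketch follows the ablation.

    import torch
    import torch.nn as nn

    class HybridDescriptorHead(nn.Module):
        """Sketch of the 16K -> 1024 head in Figure 1:
        Conv+PReLU, MaxPool, Conv+PReLU, MaxPool, Conv+PReLU, FC, FC.
        Channel counts and kernel sizes are assumptions, not from the paper."""
        def __init__(self, in_dim=16384, out_dim=1024):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=7, padding=3), nn.PReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(8, 16, kernel_size=5, padding=2), nn.PReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(16, 16, kernel_size=3, padding=1), nn.PReLU(),
            )
            flat = 16 * (in_dim // 16)          # 16 channels x 1024 positions
            self.fc = nn.Sequential(
                nn.Linear(flat, 4096), nn.PReLU(),
                nn.Linear(4096, out_dim),
            )

        def forward(self, d_img, d_sem, lam1=0.9, lam2=0.1):
            # Weighted combination of the RGB and semantic branch descriptors
            # (lam1 + lam2 = 1), taken here as a weighted sum so the hybrid
            # descriptor keeps the 16K dimension stated in Section 3.1.
            hybrid = lam1 * d_img + lam2 * d_sem          # (B, 16384)
            x = self.conv(hybrid.unsqueeze(1))            # (B, 16, 1024)
            return self.fc(x.flatten(1))                  # (B, 1024)

    # Usage: d_img and d_sem stand in for the 16K descriptors of the two VGG16 branches.
    head = HybridDescriptorHead()
    d_img, d_sem = torch.randn(2, 16384), torch.randn(2, 16384)
    out = head(d_img, d_sem)      # torch.Size([2, 1024])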
3.2. Loss Function. As shown in Figure 1, this paper optimizes two types of loss functions: the total loss of the segmentation task, used to obtain the semantic segmentation images, and the triplet loss, used for the visual localization task.

[Figure 1: Overview of the network model structure. The RGB image and its semantic segmentation are each encoded by a CNN into a 16K descriptor (image descriptor and semantic descriptor, each trained with a triplet loss); the two are combined with weights λ1 and λ2 into a hybrid descriptor, which passes through Conv + PReLU, Max Pooling, Conv + PReLU, Max Pooling, Conv + PReLU, FC, FC.]

3.3. Total Loss for the Segmentation Task. To obtain the high-quality segmentation images used in this paper at little labor cost, we introduce 3D geometric consistency as the supervision signal for training the segmentation network PSPNet and optimize a segmentation loss composed of the standard cross-entropy loss L_sce and the correspondence loss L_corr. The total loss for the segmentation task, L, is defined as

    L = L_sce + ω L_corr,                                   (1)

where ω is a weight, set to 1 in this paper, and the correspondence loss L_corr is defined as

    L_corr = Σ_{(r,t)} l(I_r, I_t, p_r, p_t),               (2)

where I_r is the reference-traversal image, I_t is the target-traversal image, and p_r and p_t are the pixel positions of matching points in the reference and target traversal images, respectively. The term l is a hinge loss or a correspondence cross-entropy loss; for more details about l, please refer to [23].

3.4. Triplet Loss. To guide the model to learn a robust descriptor for visual localization, we construct a triplet loss when training one VGG16 network to extract 16K-dimensional semantic image descriptors from segmentation images and another VGG16 network to extract 16K-dimensional image descriptors from RGB images. To enhance the robustness of the descriptor used for visual localization, tuples are drawn from the training dataset. Each tuple is composed of an anchor image (p), a positive sample (q, i.e., the same scene as the anchor image), and i negative samples (n_i, i.e., scenes different from the anchor image). To decrease the distance between positive pairs and increase the distance between negative pairs, the triplet loss used in this paper is

    L = max(0, 1 − ‖p − q‖₂ / (m + ‖p − n_i‖₂)),            (3)

where m is the margin, set to 0.1, and p, q, and n_i are the cached embeddings of the anchor, positive, and negative images.
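Below is a minimal PyTorch sketch of the ratio-style triplet term of eq. (3), as printed, averaged over the negatives of one training tuple, together with the combined segmentation loss of eq. (1). The function and variable names are ours; the margin m = 0.1 and ω = 1 follow the paper, and the correspondence term is kept as an input because its exact form (hinge or correspondence cross-entropy) is defined in [23].

    import torch

    def triplet_ratio_loss(p, q, negatives, m=0.1):
        """Eq. (3): L = max(0, 1 - ||p - q||_2 / (m + ||p - n_i||_2)),
        averaged over the i negative embeddings of one tuple.
        p, q: (D,) anchor / positive embeddings; negatives: (N, D)."""
        d_pos = torch.linalg.norm(p - q)                              # anchor-positive distance
        d_neg = torch.linalg.norm(p.unsqueeze(0) - negatives, dim=1)  # (N,) anchor-negative distances
        losses = torch.clamp(1.0 - d_pos / (m + d_neg), min=0.0)
        return losses.mean()

    def segmentation_total_loss(ce_loss, corr_loss, omega=1.0):
        """Eq. (1): L = L_sce + omega * L_corr, with omega = 1 in the paper.
        corr_loss is the correspondence term of eq. (2), computed as in [23]."""
        return ce_loss + omega * corr_loss

    # Usage with dummy 1024-D embeddings and 5 negatives:
    p, q, n = torch.randn(1024), torch.randn(1024), torch.randn(5, 1024)
    print(triplet_ratio_loss(p, q, n))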
4. Experiment

This section describes the experimental protocol in detail, including the experimental datasets, experimental settings, evaluation metrics, comparison models, experimental results, a visualization experiment, and an ablation experiment.

4.1. Experimental Dataset

4.1.1. Training Dataset. Mapillary street-level sequences [24] is currently the most diverse publicly available dataset for long-term place recognition, covering the regional environments of 30 major cities across six continents, from Tokyo to San Francisco, over more than seven years. It contains more than 1.6 million images collected from Mapillary and exhibits huge perceptual changes due to dynamic objects, seasons, regions, weather, cameras, and illumination.

4.1.2. Testing Dataset. The Extended CMU Seasons dataset [2] is a subset of the CMU Visual Localization dataset [48]. It records scene images in a variety of challenging environments (suburban, urban, and park) in Pittsburgh, United States, over more than a year. The dataset contains 1 reference traversal and 11 query traversals; the environmental condition of the reference traversal is Sunny + No Foliage. The 11 query traversals cover different regional environments (suburban, urban, and park), different vegetation conditions (no foliage, mixed foliage, and foliage), and different weather conditions (sunny, cloudy, low sun, overcast, and snow).

The RobotCar Seasons dataset [2] is derived from the publicly available Oxford RobotCar dataset [49], which records scene images under various changing conditions in Oxford, UK, over one year. It contains 1 reference traversal (overcast conditions) and 9 query traversals covering 5 weather conditions (snow, dusk, sun, dawn, and rain), 2 seasons (overcast winter and overcast summer), and 2 night conditions (night and night rain). The latter two query traversals constitute Night All, which contrasts with Day All, formed by the previous seven query traversals, for comparing different illumination conditions.

4.1.3. Evaluation Metrics. Reference [2] hosts a performance evaluation server for comparing visual localization methods, which has attracted extensive attention from researchers, so our experiments use the metrics of [2] to test the localization performance of the proposed method. We upload the 6-DOF pose files of the query images obtained by our method to this server and obtain the performance results and ranking on the public evaluation website. The site uses three precision metrics: high precision (0.25 m, 2°), medium precision (0.5 m, 5°), and coarse precision (5 m, 10°), and reports the percentage of query poses whose error falls within each threshold.

4.2. Comparison Models. Four typical and advanced methods are selected as comparison models. NetVLAD [12] performs 2D image-based localization by integrating the classic VLAD algorithm into a CNN. DenseVLAD [41] performs 2D image-based localization using VLAD-aggregated RootSIFT descriptors. WASABI [32] proposes a global image descriptor that integrates semantic and topological information, constructed with a wavelet transform of semantic edges, and performs 2D image-based localization by matching semantic edge representations. DIFL [29] introduces a feature-consistency loss to train an encoder to generate domain-invariant features in a self-supervised manner for 2D image-based localization.

4.3. Experimental Setting. The Mapillary street-level sequences dataset and the segmentation maps obtained with the method similar to [23] are used for model training. We adopt the Extended CMU Seasons and RobotCar Seasons datasets to test the performance of our method, and the test results were uploaded to the long-term visual localization evaluation website provided by [2, 3] (https://www.visuallocalization.net/).

We implemented the proposed method in PyTorch on a computer with two 2080Ti GPUs. The training images were resized to 640 × 480, and we trained on Mapillary street-level sequences using the ADAM optimizer with a batch size of 8 tuples (each containing no more than 15 negative samples). The initial learning rate is set to 0.0002, the margin to 0.1, and the number of epochs to 30.
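For concreteness, the following sketch collects the hyperparameters reported in Section 4.3 into a training-loop skeleton. The model and tuple-sampling dataset passed in are hypothetical placeholders, and the loss call reuses triplet_ratio_loss from the sketch in Section 3.4; only the optimizer, learning rate, batch size, tuple composition, margin, image size, and epoch count come from the paper.

    import torch
    from torch.utils.data import DataLoader

    # Hyperparameters reported in Section 4.3.
    IMAGE_SIZE = (640, 480)      # training images resized to 640 x 480
    BATCH_TUPLES = 8             # 8 tuples per batch
    MAX_NEGATIVES = 15           # at most 15 negatives per tuple
    LEARNING_RATE = 2e-4
    MARGIN = 0.1
    EPOCHS = 30

    def train(model, dataset):
        # model: hypothetical network producing 1024-D hybrid descriptors
        # dataset: hypothetical tuple sampler over Mapillary street-level sequences,
        #          yielding (anchor, positive, negatives) image tensors per tuple
        loader = DataLoader(dataset, batch_size=BATCH_TUPLES, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
        for epoch in range(EPOCHS):
            for anchor, positive, negatives in loader:
                emb_a = model(anchor)                                  # (B, 1024)
                emb_p = model(positive)                                # (B, 1024)
                emb_n = model(negatives.flatten(0, 1))                 # (B*N, 1024)
                emb_n = emb_n.view(*negatives.shape[:2], -1)           # (B, N, 1024)
                loss = sum(
                    triplet_ratio_loss(a, p, n, m=MARGIN)
                    for a, p, n in zip(emb_a, emb_p, emb_n)
                ) / emb_a.shape[0]
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()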
4.4. Testing Experiment on the Extended CMU Seasons Dataset. The pose files obtained by the proposed method were uploaded to the visual localization evaluation website described above. Several state-of-the-art 2D image-based localization methods from this website are selected to compare localization performance with our method under different regional environments, vegetation conditions, and weather conditions.

4.5. Environment under Different Regional Environments. The localization performance of the proposed method and the selected comparison models under different regional environments of the Extended CMU Seasons dataset is shown in Table 1. When mobile robots move over a large range, the environmental conditions of the currently observed scene may differ significantly from earlier observations, so a long-term visual localization method needs to cover as many regional environments as possible. The testing environments selected in this experiment therefore include three typical regional environments: urban, suburban, and park. According to Table 1, for the suburban and park environments, the performance of our model is 14.59% and 14.62% higher than the state-of-the-art baselines under the coarse precision metric. In the urban environment, our model performs best in all cases except the coarse precision metric, where it ranks second.

Table 1: Results for different regional environments (Extended CMU Seasons; percentages at 0.25 m, 2° / 0.5 m, 5° / 5 m, 10°).

Method            Park             Suburban         Urban
NetVLAD [12]      2.6/10.4/55.9    3.7/13.9/74.7    12.2/31.5/89.8
DIFL-FCL [29]     6.1/20.7/69.1    5.6/18.2/69.8    14.8/35.1/79.6
DenseVLAD [41]    5.2/19.1/62.0    5.3/18.7/73.9    14.7/36.3/83.9
WASABI [32]       2.4/9.1/54.5     3.8/13.9/67.3    7.9/21.3/75.2
Ours              7.0/24.5/79.2    6.1/20.7/85.6    16.1/39.0/87.7

It can be seen that the proposed method is clearly ahead in the park and suburban environments; its advantage is weaker in the urban environment, but it is still more competitive than the other state-of-the-art baselines. This is mainly because the park and suburban environments contain a large number of trees and other static objects, and the proposed model can make full use of the semantic information of these two types of scenes to improve localization accuracy; hence the largest improvements appear under the coarse precision metric in these two environments. In the urban environment, however, the semantic information of the same scene changes because of the large number of dynamic objects such as pedestrians or cars, which affects the performance of our model.

In summary, the performance of the proposed method in different regional environments is significantly better than that of the selected representative existing methods. The model therefore plays a positive role in long-term localization tasks for mobile robots, especially in park and suburban environments.

4.6. Environment under Different Vegetation Conditions. The results for environments with different vegetation conditions in the Extended CMU Seasons dataset are shown in Table 2. For the two complex vegetation conditions, mixed foliage and foliage, the proposed method shows the best robustness compared with the other state-of-the-art baselines under all three precision metrics.

Table 2: Results for different vegetation conditions (Extended CMU Seasons; percentages at 0.25 m, 2° / 0.5 m, 5° / 5 m, 10°).

Method            Foliage          Mixed foliage
NetVLAD [12]      6.2/18.5/74.3    5.8/17.6/71.1
DIFL-FCL [29]     8.2/22.2/69.0    9.6/26.0/74.4
DenseVLAD [41]    7.4/21.1/68.0    8.5/24.5/73.0
WASABI [32]       4.9/15.2/67.6    4.8/14.8/64.9
Ours              9.5/26.5/81.2    10.5/29.4/86.7

What is particularly encouraging is that the different vegetation conditions are the most challenging environmental conditions in the Extended CMU Seasons dataset, owing to the varying types, numbers, and positions of leaves. The outstanding localization performance of the proposed method is mainly due to the segmentation images used in this paper, which give the extracted environmental features and the constructed scene descriptor a stronger invariant representation; this is of practical value for mobile robots performing long-term outdoor navigation.
4.7. Environment under Different Weather Conditions. Mobile robots inevitably face weather variations when they work for a long time. Therefore, in addition to testing under different regional environments and vegetation conditions in the Extended CMU Seasons dataset, we also tested under different weather conditions. The results are shown in Table 3: our model outperforms the other state-of-the-art baselines under most weather conditions, and it performs particularly well under the overcast and low sun conditions across the three precision metrics.

Table 3: Results for different weather conditions (Extended CMU Seasons; percentages at 0.25 m, 2° / 0.5 m, 5° / 5 m, 10°).

Method            Overcast         Low sun          Cloudy           Snow
NetVLAD [12]      6.7/19.1/76.3    5.5/17.5/71.3    6.8/20.1/79.5    5.0/16.4/68.0
DIFL-FCL [29]     9.7/25.3/70.9    8.7/25.3/74.4    8.8/24.7/76.9    7.4/26.7/73.5
DenseVLAD [41]    8.4/23.3/72.1    8.3/26.1/76.0    9.3/27.3/80.5    8.3/29.0/78.9
WASABI [32]       5.4/15.8/70.8    4.2/14.0/62.1    5.1/15.3/71.0    3.4/13.2/58.0
Ours              11.0/29.3/84.4   9.2/28.0/85.5    9.3/27.1/88.2    7.0/24.5/79.2

4.8. Testing Experiment on the RobotCar Seasons Dataset. This section uses the same methodology as Section 4.5 to verify the effectiveness of our method, and several state-of-the-art 2D image-based localization methods from the evaluation website are selected for comparison. The experimental environments include two kinds of changing conditions: different weather and different illumination.

4.9. Testing under Different Weather Conditions. The robustness of the proposed model and three existing comparison models under different weather conditions in the RobotCar Seasons dataset is compared in Table 4. The proposed method has the best localization performance at the medium and high precision metrics under the snow condition. Moreover, although the RobotCar Seasons dataset contains a large number of dynamic targets such as pedestrians or cars, which change the semantic information of the same scene, our model still achieves decent results.

Table 4: Results for different weather conditions (RobotCar Seasons; percentages at 0.25 m, 2° / 0.5 m, 5°).

Method            Dawn         Sun          Snow
NetVLAD [12]      6.2/22.8     5.7/16.5     7.0/25.2
DIFL-FCL [29]     9.5/30.2     9.1/23.7     9.0/25.2
DenseVLAD [41]    8.7/36.9     5.7/16.3     8.6/30.1
Ours              10.1/35.4    8.9/25.9     10.8/30.3

4.10. Testing under Different Illumination Conditions. We also conducted experiments under different illumination conditions; the results are shown in Table 5. Compared with the other three comparison models, the proposed model improves by 24.00% and 24.62% in the high and medium precision metrics, respectively, under night conditions, which is particularly encouraging because night conditions are the most challenging environmental conditions in the RobotCar Seasons dataset.

Table 5: Results for different illumination conditions (RobotCar Seasons; percentages at 0.25 m, 2° / 0.5 m, 5°).

Method            Night all    Day all
NetVLAD [12]      0.3/2.3      6.4/26.3
DIFL-FCL [29]     2.5/6.5      7.6/26.2
DenseVLAD [41]    1.0/4.4      7.6/31.2
Ours              3.3/8.1      8.6/30.5
4.11. Visualization Experiment. For the Extended CMU Seasons and RobotCar Seasons datasets, we obtained segmentation images using self-supervised methods similar to [23], as shown in Figures 2 and 3, respectively. Since the ultimate purpose of this paper is to take the obtained segmentation images as input and train the model to generate a semantic image descriptor that improves visual localization, the predicted semantic segmentations of the same scene under different environments in the Extended CMU Seasons and RobotCar Seasons datasets should be consistent. As can be seen from Figures 2 and 3, the predicted segmentations of the same scene under different environments are indeed largely the same for both datasets.

[Figure 2: Predicted semantic segmentations of the same scene under different environments in the Extended CMU Seasons dataset.]

[Figure 3: Predicted semantic segmentations of the same scene under different environments in the RobotCar Seasons dataset.]

4.12. Experiment for Testing the Proposed Algorithm. Considering the computational efficiency of the proposed algorithm, we also measured the average retrieval time against databases of different sizes, focusing on the time needed to match a query image against the database images. The RobotCar Seasons and Extended CMU Seasons datasets are used, with 20,862 and 10,338 database images, respectively. As shown in Table 6, the algorithm takes 31.42 ms per query frame on the Extended CMU Seasons dataset (10k database images), i.e., about 32 frames per second on average, and 52.17 ms per query frame on the RobotCar Seasons dataset (20k database images), i.e., about 19 frames per second on average. Since the acquisition frequency of mobile robots is usually 15 frames per second and a large number of invalid frames need to be removed, our algorithm has practical value for mobile robots.

Table 6: Query time per frame for databases of different sizes.

Dataset              RobotCar Seasons (20k)    Extended CMU Seasons (10k)
Query time (ms)      52.17                     31.42
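The retrieval step being timed here is, per Section 3.1, a nearest-neighbour search over 1024-dimensional learned hybrid descriptors using the L1 distance, with the query inheriting the pose of the closest database image. The brute-force search and timing code below is a sketch under those assumptions; the paper does not give the actual implementation details, and the pose list is a placeholder.

    import time
    import torch

    def localize(query_desc, db_descs, db_poses):
        """Approximate the query pose by the pose of the database image whose
        1024-D hybrid descriptor has the smallest L1 distance to the query."""
        dists = torch.cdist(query_desc.unsqueeze(0), db_descs, p=1).squeeze(0)  # (N,)
        best = torch.argmin(dists).item()
        return db_poses[best], dists[best].item()

    # Timing sketch with a database of the Extended CMU Seasons size (10,338 images).
    db = torch.randn(10338, 1024)
    poses = [None] * db.shape[0]          # placeholder for the known 6-DOF poses
    query = torch.randn(1024)

    t0 = time.perf_counter()
    pose, dist = localize(query, db, poses)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    print(f"query time: {elapsed_ms:.2f} ms")   # the paper reports 31.42 ms on this dataset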
4.13. Ablation Study. One VGG16 network is trained on semantic segmentation images to extract a 16K-dimensional semantic image descriptor and another VGG16 network is trained on RGB images to extract a 16K-dimensional image descriptor; the two are combined with weights λ1 and λ2 (λ1 + λ2 = 1). This section therefore studies the influence of different weights λ1 and λ2 on the performance of the proposed method; the results are shown in Table 7.

λ1 = 0, λ2 = 1 means that only the VGG16 network trained on semantic segmentation images is used, i.e., the 16K-dimensional semantic image descriptor alone, while λ1 = 1, λ2 = 0 means that only the VGG16 network trained on RGB images is used, i.e., the 16K-dimensional image descriptor alone, for visual localization. Park, suburban, and urban in Table 7 are the regional environmental conditions of the Extended CMU Seasons dataset, while Day All and Night All are the illumination conditions of the RobotCar Seasons dataset. As can be seen from Table 7, localization performance is best at λ1 = 0.9, λ2 = 0.1 for the regional conditions of the Extended CMU Seasons dataset and at λ1 = 0.8, λ2 = 0.2 for the illumination conditions of the RobotCar Seasons dataset. This shows that trusting the image descriptor more than the semantic image descriptor gives the best results on both datasets, and that the semantic image descriptor is more helpful for the illumination conditions of the RobotCar Seasons dataset than for the regional environments of the Extended CMU Seasons dataset. Table 7 also shows that, for both the regional conditions of the Extended CMU Seasons dataset and the illumination conditions of the RobotCar Seasons dataset, localization is not ideal when solely using the semantic image descriptor trained on semantic segmentation images or solely using the image descriptor trained on RGB images.

Table 7: Ablation study of our method with different weights λ1 and λ2 (percentages at 0.25 m, 2° / 0.5 m, 5° / 5 m, 10°).

λ1    λ2    Park             Suburban         Urban            Night all        Day all
0     1     2.1/8.3/49.2     2.9/10.9/65.4    11.9/28.9/73.9   1.1/2.8/11.3     4.7/15.4/54.7
0.1   0.9   3.6/10.1/52.4    3.9/13.7/74.1    12.5/32.6/77.8   1.9/6.1/14.5     5.4/20.1/63.0
0.2   0.8   4.9/20.1/75.3    5.1/16.7/75.3    13.1/34.4/78.5   2.2/6.7/18.7     6.1/22.3/71.0
0.3   0.7   5.5/21.3/76.1    5.4/20.4/77.3    14.6/35.7/79.2   1.9/7.3/21.1     5.8/22.8/73.2
0.4   0.6   6.2/21.2/76.9    5.3/19.7/77.5    15.6/37.3/80.8   2.4/6.4/18.1     6.5/21.3/74.3
0.5   0.5   6.5/22.1/79.1    5.8/21.3/76.9    15.7/37.6/81.2   2.7/6.7/17.3     6.3/23.4/73.2
0.6   0.4   6.4/23.3/77.5    5.5/20.1/78.6    15.7/37.5/81.0   2.4/7.8/19.7     7.2/24.1/72.1
0.7   0.3   6.9/24.1/78.7    5.9/20.4/79.6    15.9/37.9/81.4   3.0/7.7/22.1     8.2/29.1/76.5
0.8   0.2   6.7/23.9/79.1    5.7/20.9/85.0    16.0/38.7/82.0   3.3/8.1/26.1     8.6/30.5/82.2
0.9   0.1   7.0/24.5/79.2    6.1/20.7/85.6    16.1/39.0/87.7   2.6/6.9/18.9     6.8/25.3/71.0
1     0     5.6/11.2/69.1    4.9/17.2/79.3    14.1/35.4/83.5   1.7/3.5/14.5     5.7/19.6/65.3
5. Conclusion

Aiming at the robustness challenges faced by a mobile robot performing long-term work under complex changing conditions, a new long-term visual localization method based on a hybrid descriptor is proposed. The compact hybrid descriptor is generated by concatenating a semantic image descriptor extracted from semantic segmentation images and an image descriptor extracted from RGB images with a certain weight, and is then trained by a convolutional neural network. We verify that the localization performance of solely using a semantic image descriptor trained on semantically segmented images, or solely using an image descriptor trained on RGB images, is not better than that of the hybrid descriptor obtained by combining both with a certain weight. The model was trained on the Mapillary street-level sequences dataset and subsequently tested on the Extended CMU Seasons and RobotCar Seasons datasets. The experimental results verify that the localization performance of the proposed method is significantly better than that of other state-of-the-art baselines on these two datasets under different regions, vegetation conditions, weather, and illumination conditions. It can meet the requirements of mobile robots performing long-term visual localization tasks in a variety of complex environments.

The performance of the proposed visual localization method depends on the performance of the chosen semantic segmentation method. In addition, the depth information of objects in the same scene has been shown to remain stable under changing environmental conditions. In future work, we will therefore integrate depth information to handle visual variation between images and explore the impact of different semantic segmentation methods on the performance of the proposed method.

Data Availability

All data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Key Area Research Projects of Universities of Guangdong Province under Grant 2019KZDZX1026, in part by the Natural Science Foundation of Guangdong Province under Grant 501100003453, in part by the Innovation Team Project of Universities of Guangdong Province under Grant 2020KCXTD015, and in part by the Free Exploration Foundation of Foshan University under Grant 2020ZYTS11.

References

[1] D. Li, "DXSLAM: a robust and efficient visual SLAM system with deep features," in Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4958–4965, Las Vegas, NV, USA, January 2020.
[2] T. Sattler, "Benchmarking 6DOF outdoor visual localization in changing conditions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8601–8610, Salt Lake City, UT, USA, June 2018.
[3] C. Toft, "Long-term visual localization revisited," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 2074–2088, 2022.
[4] Y. You, "MISD-SLAM: multimodal semantic SLAM for dynamic environments," in Proceedings of the Wireless Communications and Mobile Computing 2022, Dubrovnik, Croatia, June 2022.
[5] J. Wu, Q. Shi, Q. Lu, X. Liu, X. Zhu, and Z. Lin, "Learning invariant semantic representation for long-term robust visual localization," Engineering Applications of Artificial Intelligence, vol. 111, Article ID 104793, 2022.
[6] J. Ni, "An improved deep residual network-based semantic simultaneous localization and mapping method for monocular vision robot," Computational Intelligence and Neuroscience, vol. 2020, Article ID 7490840, 14 pages, 2020.
[7] J. Li, "Loop closure detection based on image semantic segmentation in indoor environment," Mathematical Problems in Engineering, vol. 2022, Article ID 7765479, 14 pages, 2022.
[8] M. Aladem, S. Baek, and S. A. Rawashdeh, "Evaluation of image enhancement techniques for vision-based navigation under low illumination," Journal of Robotics, vol. 2019, Article ID 5015741, 15 pages, 2019.
[9] P.-E. Sarlin, "From coarse to fine: robust hierarchical localization at large scale," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725, Long Beach, CA, USA, June 2019.
[10] H. Germain, G. Bourmaud, and V. Lepetit, "Sparse-to-dense hypercolumn matching for long-term visual localization," in Proceedings of the 2019 International Conference on 3D Vision (3DV), pp. 513–523, Québec City, QC, Canada, September 2019.
[11] T. Shi, "Visual localization using sparse semantic 3D map," in Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), pp. 315–319, Taipei, China, September 2019.
[12] R. Arandjelovic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307, Las Vegas, NV, USA, June 2016.
[13] M. Larsson, "Fine-grained segmentation networks: self-supervised segmentation for improved long-term visual localization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 31–41, Seoul, South Korea, October 2019.
[14] H. Zhao, "Pyramid scene parsing network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890, Honolulu, HI, USA, July 2017.
[15] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, Boston, MA, USA, June 2015.
[16] L.-C. Chen and G. I. K. A. L. Papandreou, "DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[17] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," 2015, https://arxiv.org/abs/1511.07122.
[18] L.-C. Chen et al., "Attention to scale: scale-aware semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3640–3649, Las Vegas, NV, USA, June 2016.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: convolutional networks for biomedical image segmentation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, Munich, Germany, October 2015.
[20] A. Khoreva, "Simple does it: weakly supervised instance and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 876–885, Honolulu, HI, USA, July 2017.
[21] N. Souly, C. Spampinato, and M. Shah, "Semi supervised semantic segmentation using generative adversarial network," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5688–5696, Venice, Italy, October 2017.
[22] A. Bearman, "What's the point: semantic segmentation with point supervision," in Proceedings of the European Conference on Computer Vision, pp. 549–565, Springer, Amsterdam, The Netherlands, October 2016.
[23] M. Larsson, "A cross-season correspondence dataset for robust semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9532–9542, Long Beach, CA, USA, June 2019.
[24] F. Warburg, "Mapillary street-level sequences: a dataset for lifelong place recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2626–2635, Seattle, WA, USA, June 2020.
[25] B. Kulis, K. Saenko, and T. Darrell, "What you saw is not what you get: domain adaptation using asymmetric kernel transforms," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 1785–1792, Colorado Springs, CO, USA, June 2011.
[26] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting visual category models to new domains," in Proceedings of the European Conference on Computer Vision, pp. 213–226, Springer, Crete, Greece, September 2010.
[27] E. Tzeng, "Adversarial discriminative domain adaptation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, Honolulu, HI, USA, July 2017.
[28] M. Long, "Unsupervised domain adaptation with residual transfer networks," 2016, https://arxiv.org/abs/1602.04433.
[29] Y. Chen, W. Li, and L. Van Gool, "ROAD: reality oriented adaptation for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7892–7901, Salt Lake City, UT, USA, June 2018.
[30] Y.-H. Tsai, "Learning to adapt structured output space for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481, Salt Lake City, UT, USA, June 2018.
[31] S. Sankaranarayanan, "Learning from synthetic data: addressing domain shift for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3752–3761, Salt Lake City, UT, USA, June 2018.
[32] M. Wulfmeier, A. Bewley, and I. Posner, "Incremental adversarial domain adaptation for continually changing environments," in Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4489–4495, IEEE, Brisbane, Australia, May 2018.
[33] X. Wu, "DANNet: a one-stage domain adaptation network for unsupervised nighttime semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15769–15778, Nashville, TN, USA, June 2021.
[34] G. Ros, "The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243, Las Vegas, NV, USA, June 2016.
[35] T. Naseer, "Robust visual robot localization across seasons using network flows," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Canada, July 2014.
[36] R. Clark, "VidLoc: a deep spatio-temporal model for 6-DoF video-clip relocalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6856–6864, Honolulu, HI, USA, July 2017.
[37] Z. Chen, "Deep learning features at scale for visual place recognition," in Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3223–3230, IEEE, Marina Bay Sands, Singapore, June 2017.
[38] L. Liu, H. Li, and Y. Dai, "Efficient global 2D-3D matching for camera localization in a large-scale 3D map," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2372–2381, Venice, Italy, October 2017.
[39] T. Sattler, B. Leibe, and L. Kobbelt, "Efficient & effective prioritized matching for large-scale image-based localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1744–1756, 2017.
[40] L. Svarm and O. F. M. Enqvist, "City-scale localization for cameras with known vertical direction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1455–1461, 2017.
[41] A. Torii, "24/7 place recognition by view synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817, Boston, MA, USA, June 2015.
[42] H. Hu, "Retrieval-based localization based on domain-invariant feature learning under changing environments," in Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3684–3689, IEEE, Macau, China, November 2019.
[43] A. Benbihi, "Image-based place recognition on bucolic environment across seasons from semantic edge description," in Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3032–3038, IEEE, Paris, France, August 2020.
[44] H. Hu, Z. Qiao, M. Cheng, Z. Liu, and H. Wang, "DASGIL: domain adaptation for semantic and geometric-aware image-based localization," IEEE Transactions on Image Processing, vol. 30, pp. 1342–1353, 2021.
[45] Z. Xin, "Localizing discriminative visual landmarks for place recognition," in Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), pp. 5979–5985, IEEE, Montréal, Canada, May 2019.
[46] H. Jégou, "Aggregating local descriptors into a compact image representation," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3304–3311, San Francisco, CA, USA, June 2010.
[47] N. Piasco and D. V. C. Sidibé, "Improving image description with auxiliary modality for visual localization in challenging conditions," International Journal of Computer Vision, vol. 129, no. 1, pp. 185–202, 2021.
[48] H. Badino, D. Huber, and T. Kanade, "Visual topometric localization," in Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 794–799, IEEE, Baden-Baden, Germany, July 2011.
[49] W. Maddern and G. C. P. Pascoe, "1 year, 1000 km: the Oxford RobotCar dataset," The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.