Occlusion-Free Road Segmentation Leveraging Semantics for Autonomous Vehicles

Sensors 2019, 19, 4711; doi:10.3390/s19214711

Kewei Wang 1,2,3, Fuwu Yan 1,2,3, Bin Zou 1,2,3,*, Luqi Tang 1,2,3, Quan Yuan 1,2,3 and Chen Lv 4

1 Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan 430070, China; wkw199q@whut.edu.cn (K.W.); yanfuwu@vip.sina.com (F.Y.); tlqqidong@163.com (L.T.); 231943@whut.edu.cn (Q.Y.)
2 Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Wuhan 430070, China
3 Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan 430070, China
4 School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore; lyuchen@ntu.edu.sg
* Correspondence: zoubin@whut.edu.cn; Tel.: +86-138-7115-3253

Received: 3 September 2019; Accepted: 24 October 2019; Published: 30 October 2019

Abstract: The deep convolutional neural network has led the trend of vision-based road detection; however, obtaining a full road area despite occlusion from monocular vision remains challenging due to the dynamic scenes in autonomous driving. Inferring the occluded road area requires a comprehensive understanding of the geometry and the semantics of the visible scene. To this end, we create a small but effective dataset based on the KITTI dataset, named the KITTI-OFRS (KITTI-occlusion-free road segmentation) dataset, and propose a lightweight and efficient fully convolutional neural network called OFRSNet (occlusion-free road segmentation network) that learns to predict occluded portions of the road in the semantic domain by looking around foreground objects and the visible road layout. In particular, the global context module is used to build up the down-sampling and joint context up-sampling blocks in our network, which promotes the performance of the network. Moreover, a spatially-weighted cross-entropy loss is designed to significantly increase the accuracy of this task. Extensive experiments on different datasets verify the effectiveness of the proposed approach, and comparisons with current excellent methods show that the proposed method outperforms the baseline models by obtaining a better trade-off between accuracy and runtime, which makes our approach applicable to autonomous vehicles in real time.

Keywords: autonomous vehicles; scene understanding; occlusion reasoning; road detection

1. Introduction

Reliable perception of the surrounding environment plays a crucial role in autonomous driving vehicles, in which robust road detection is one of the key tasks. Many types of road detection methods have been proposed in the literature based on monocular camera, stereo vision, or LiDAR (Light Detection and Ranging) sensors. With the rapid progress in deep learning techniques, significant achievements in segmentation techniques have significantly promoted road detection in monocular images [1–5]. Generally, these algorithms label each and every pixel in the image with one of the object classes by color and textural features. However, the road is often occluded by dynamic traffic participants as well as static transport infrastructure when measured with on-board cameras, which makes it hard to directly obtain a full road area.
When performing decision-making in extremely challenging scenarios, such as dynamic urban scenes, a comprehensive understanding of the environment needs to carefully tackle the occlusion problem. As to the road detection task, road segmentation of the visible area is not sufficient for path planning and decision-making. It is necessary to get the whole structure and layout of the local road with an occlusion reasoning process in complex driving scenarios where clutter and occlusion occur with high frequency.

Inspired by the fact that human beings are capable of completing the road structure in their minds by understanding the on-road objects and the visible road area, we believe that a powerful convolutional network could learn to infer the occluded road area as human beings do. Intuitively, for the occlusion reasoning task, the color and texture features are of relatively low importance; what matters is the semantic and spatial features of the elements in the environment. As far as we know, semantic segmentation [6–8] is one of the most complete forms of visual scene understanding, where the goal is to label each pixel with the corresponding semantic label (e.g., tree, pedestrian, car, etc.). So, instead of an RGB image, we performed the occlusion reasoning road segmentation using a semantic representation as input, which could be obtained by popular deep learning methods in real applications or human-annotated ground truth in the training phase. As shown in Figure 1, traditional road segmentation takes an RGB image as input and labels road only in the visible area. As a comparison, our proposed occlusion-free road segmentation (OFRS) intends to leverage the semantic representation to infer the occluded road area in the driving scene. Note that the semantic input in the figure is just a visualization of the semantic representation; the actual input is the one-hot type of semantic label.

Figure 1. Comparison of road segmentation and proposed occlusion-free road segmentation. (a) RGB image; (b) visualization of the results of road segmentation; (c) visualization of the semantic representation of the scene, which could be obtained by semantic segmentation algorithms in real applications or human annotation in the training phase; (d) visualization of the results of the proposed occlusion-free road segmentation. Green refers to the road area in (b) and (d).

In this paper, we aim to infer the occluded road area utilizing the semantic features of visible scenes and name this new task as occlusion-free road segmentation.
First, a suitable dataset is created based on the popular KITTI dataset, which is referred to as the KITTI-OFRS dataset in the following. Second, an end-to-end lightweight and efficient fully convolutional neural network for the new task is proposed to learn the ability of occlusion reasoning. Moreover, a spatially-dependent weight is applied to the cross-entropy loss to increase the performance of our network. We evaluate our model on different datasets and compare it with some other excellent algorithms which pursue the trade-off between accuracy and runtime in the semantic segmentation task.

The main contributions of this paper are as follows:

• We analyze the occlusion problem in road detection and propose the novel task of occlusion-free road segmentation in the semantic domain, which infers the occluded road area using semantic features of the dynamic scenes.
• To complete this task, we create a small but efficient dataset based on the popular KITTI dataset named the KITTI-OFRS dataset, design a lightweight and efficient encoder–decoder fully convolutional network referred to as OFRSNet, and optimize the cross-entropy loss for the task by adding a spatially-dependent weight that could significantly increase the accuracy.

• We elaborately design the architecture of OFRSNet to obtain a good trade-off between accuracy and runtime. The down-sampling block and joint context up-sampling block in the network are designed to effectively capture the contextual features that are essential for the occlusion reasoning process and increase the generalization ability of the model.

The remainder of this paper is organized as follows: First, the related works in road detection are briefly introduced in Section 2. Section 3 introduces the methodology in detail, and Section 4 shows the experimental results. Finally, we draw conclusions in Section 5.

2. Related Works

Road detection in autonomous driving has benefited from the development of deep convolutional neural networks in recent years. Generally, the road is represented by its boundaries [9,10] or regions [1,2,11]. Moreover, road lane [12–14] and drivable area [15,16] detection also attract much attention from researchers, which concern the ego lane and the obstacle-free region of the road, respectively. The learning-based methods usually outperform the model-based methods due to the developed segmentation techniques. The model-based methods identify the road structure and road areas by shape [17,18] or appearance models [19]. The learning-based methods [3,6,7,16,20,21] classify the pixels in images as road and non-road, or road boundaries and non-road boundaries. However, the presence of foreground objects makes it hard to obtain the full road despite the occlusion.

To infer the road boundaries despite the occlusion, Suleymanov et al. [22] presented a convolutional neural network that contained intra-layer convolutions and produced outputs in a hybrid discrete-continuous form. Becattini et al. [23] proposed a GAN-based (Generative Adversarial Network) semantic segmentation inpainting model to remove all dynamic objects from the scene and focus on understanding its static components (such as streets, sidewalks, and buildings) to get a comprehension of the static road scene. In contrast to the above solutions, we conduct occlusion-free road segmentation to infer the occluded road area as a pixel-wise classification task.

Even though deep-learning methods have achieved remarkable performance in the pixel-wise classification task, achieving the best trade-off between accuracy and efficiency is still a challenging problem. Vijay et al. [20] presented a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet, which follows an encoder–decoder architecture that is designed to be efficient both in memory and computational time in the inference phase. Adam et al. [24] proposed a fast and compact encoder–decoder architecture named ENet that has significantly fewer parameters and provides similar or better accuracy to SegNet. Romera et al. [25] proposed a novel layer design that leverages skip connections and convolutions with 1D kernels, which highly reduces the computational cost and increases the accuracy.
Inspired by these networks, we follow the encoder–decoder architecture and enhance the down-sampling and up-sampling blocks with contextual extraction operations [26–28], which have proved to be helpful for segmentation-related tasks. This contextual information is even more essential and effective for our occlusion reasoning task, which needs a comprehensive understanding of the driving scenes.

3. Occlusion-Free Road Segmentation

3.1. Task Definition

The occlusion-free road segmentation task is defined as a pixel-level classification like the traditional road segmentation but with an occlusion reasoning process to obtain a full representation of the road area. The input is fed to the model as a one-hot encoded tensor of the semantic segmentation labels or a predicted semantic segmentation probabilities tensor I ∈ [0, 1]^(W×H×C), where W is the width of the image, H its height, and C the number of classes. In the same way, we trained the network to output a new tensor O ∈ [0, 1]^(W×H×2) with the same width and height but containing only two categories belonging to road and non-road.
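For clarity, the following is a minimal sketch of how such input and output tensors can be prepared in PyTorch (the framework reported later in Section 4.3); the class count, dummy label map, and shapes in NCHW order are illustrative assumptions rather than the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: preparing the one-hot semantic input and the two-class road target.
C = 11                                                      # assumed number of unified semantic classes (Section 4.1)
semantic_label = torch.randint(0, C, (1, 384, 1248))        # H x W map of class indices (dummy data)
one_hot_input = F.one_hot(semantic_label, num_classes=C)    # 1 x H x W x C
one_hot_input = one_hot_input.permute(0, 3, 1, 2).float()   # 1 x C x H x W, fed to the network

road_label = torch.randint(0, 2, (1, 384, 1248))            # ground-truth full-road mask (dummy data)
# The network output O has two channels (road / non-road) with the same spatial size:
# output = model(one_hot_input)  ->  shape 1 x 2 x H x W
```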
3.2. Network Architecture

The proposed model is illustrated in Table 1 and visualized in Figure 2, and was designed to get the best possible trade-off between accuracy and runtime. We followed the current trend of using convolutions with residual connections [29] as the core elements of our architecture, to leverage their success in classification and segmentation problems. Inspired by SegNet and ENet, an encoder–decoder architecture was adopted for the whole network. Residual bottleneck blocks of different types were used as the basic blocks in the encoder and decoder. Dilated convolution was applied in the blocks to enlarge the receptive field of the encoder. What is more, the context module was combined with regular convolution to obtain a global understanding of the environment, which is really essential to infer the occluded road area. In the decoder, we proposed a joint context up-sampling block to leverage the features of different resolutions to obtain richer and global information.

Figure 2. The proposed occlusion-free road segmentation network architecture. (Legend: down-sampling block, factorized block, dilated block, residual block, joint contextual up-sampling block, deconv block.)

Table 1. Our network architecture in detail. Size refers to the output feature map size for an input size of 384 × 1248.

Stage    | Block Type                | Size
---------|---------------------------|----------------
Encoder  | Context Down-sampling     | 192 × 624 × 16
         | Context Down-sampling     | 96 × 312 × 32
         | Factorized blocks         | 96 × 312 × 32
         | Context Down-sampling     | 48 × 156 × 64
         | Dilated blocks            | 48 × 156 × 64
         | Context Down-sampling     | 24 × 78 × 128
         | Dilated blocks            | 24 × 78 × 128
Decoder  | Joint Context Up-sampling | 48 × 156 × 64
         | Bottleneck blocks         | 48 × 156 × 64
         | Joint Context Up-sampling | 96 × 312 × 32
         | Bottleneck blocks         | 96 × 312 × 32
         | Joint Context Up-sampling | 192 × 624 × 16
         | Bottleneck blocks         | 192 × 624 × 16
         | Deconv                    | 384 × 1248 × 2

Context Convolution Block. Recent works have shown that contextual information is helpful for models to predict high-quality segmentation results. Modules which could enlarge the receptive field, such as ASPP [21], DenseASPP [30], and CRFasRNN [31], have been proposed in the past years. Most of these works explore context information in the decoder phase and ignore the surrounding context when encoding the features in the early stage. On the other hand, the attention mechanism has been widely used for increasing model capability. Inspired by the non-local block [27] and the SE block [26], we proposed the context convolution block, as shown in Figure 3. A context branch from [28] was added, bypassing the main branch of the convolution operation. As can be seen in Equation (1), the context branch first adopted a 1 × 1 convolution W_k and a softmax function to obtain the attention weights, and then performed attention pooling to obtain the global context features; the global context features were then transformed via a 1 × 1 convolution W_v and added to the features of the main convolution branch.
z_i = x_i + W_v \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} x_j ,   (1)

where W_k and W_v denote linear transformation matrices and N_p is the number of positions in the feature map.

Figure 3. The context convolution block. (Main branch: a k × k convolution with C1 output channels followed by BN and ReLU; context branch: a 1 × 1 convolution and softmax over the H × W positions for attention pooling, then a 1 × 1 convolution W_v, whose output is added to the main branch.)
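To make the block concrete, below is a minimal PyTorch sketch of a context convolution block following Equation (1) and Figure 3; the module name, channel arguments, and the stride option (used by the down-sampling block described next) are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class ContextConvBlock(nn.Module):
    """Sketch of the context convolution block: a regular k x k convolution (main branch)
    plus a global-context attention branch (Equation (1)) whose pooled features are
    transformed by a 1 x 1 convolution and added back to the main branch."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.context_key = nn.Conv2d(in_ch, 1, kernel_size=1)        # W_k: attention logits
        self.context_value = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # W_v: transform pooled context

    def forward(self, x):
        n, c, h, w = x.shape
        feat = self.main(x)                                 # main branch, possibly strided
        # Context branch: softmax over all H*W positions, then attention pooling.
        attn = self.context_key(x).view(n, 1, h * w)
        attn = torch.softmax(attn, dim=-1)                  # 1 x (H*W) attention weights
        context = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2))   # n x c x 1
        context = self.context_value(context.view(n, c, 1, 1))           # n x out_ch x 1 x 1
        return feat + context                               # broadcast add to every position
```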
Down-Sampling Block. In our work, the down-sampling block performed down-sampling by using a 3 × 3 convolution with stride 2 in the main branch of a context convolution block, as stated above. The context branch extracted the global context information to obtain a global understanding of the features. Down-sampling lets the deeper layers gather more context (to improve classification) and helps to reduce computation. We used two down-sampling blocks at the start of the network to reduce the feature size and make the network work efficiently for large inputs.

Joint Context Up-Sampling Block. In the decoder, we proposed a joint context up-sampling block, which takes two feature maps from different stages in the encoder, as shown in Figure 4. The feature map from the earlier stage, with a bigger resolution and fewer channels, carries sufficient spatial details, and the feature map from the later stage, with a smaller resolution and more channels, contains the necessary contextual information. The joint context up-sampling block combines these two feature maps gently and efficiently using context convolution blocks and bilinear up-sampling. The two branches of the two feature maps were concatenated along the channels, and a context convolution block was applied to the concatenated feature map. As shown in Figure 2, the joint context up-sampling blocks follow a sequential architecture; the current block utilized the former results and the corresponding decoder features, which made the up-sampling operation more effective.

Figure 4. The joint context up-sampling block. (Two feature maps of different resolutions each pass through a 1 × 1 context convolution block; the smaller one is bilinearly up-sampled; the results are concatenated and fused by another 1 × 1 context convolution block.)
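A rough sketch of how such a joint context up-sampling block could be assembled, reusing the ContextConvBlock sketch above; the exact fusion order is our reading of Figure 4, and the module and argument names are again illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointContextUpsampling(nn.Module):
    """Sketch of the joint context up-sampling block (Figure 4): the low-resolution
    feature map is reduced by a 1 x 1 context convolution and bilinearly up-sampled,
    the high-resolution map is reduced likewise, and the concatenation is fused by
    another 1 x 1 context convolution block."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.reduce_high = ContextConvBlock(high_ch, out_ch, kernel_size=1)
        self.reduce_low = ContextConvBlock(low_ch, out_ch, kernel_size=1)
        self.fuse = ContextConvBlock(2 * out_ch, out_ch, kernel_size=1)

    def forward(self, high_res, low_res):
        # high_res: earlier-stage feature map (bigger resolution, fewer channels)
        # low_res: later-stage feature map (smaller resolution, more channels)
        low = self.reduce_low(low_res)
        low = F.interpolate(low, size=high_res.shape[2:], mode='bilinear', align_corners=False)
        high = self.reduce_high(high_res)
        return self.fuse(torch.cat([high, low], dim=1))
```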
Residual Bottleneck Blocks. Between the down-sampling and up-sampling blocks, some residual blocks were inserted to perform the encoding and decoding. In the early stage of the encoder, we applied factorized residual blocks to extract dense features. As shown in Figure 5b, a 3 × 3 convolution was replaced by a 3 × 1 convolution and a 1 × 3 convolution in the residual branch to reduce parameters and computation. In the later stage of the encoder, we stacked dilated convolution blocks with different rates to obtain a larger receptive field and more contextual information. The dilated convolution block applies a dilated convolution on the 3 × 3 convolution in the residual branch compared to the regular residual block, as shown in Figure 5c. The dilation rates in the stacked dilated residual blocks were 1, 2, 5, and 9, which were carefully chosen to avoid the gridding problem that arises when inappropriate dilation rates are used. One dilated residual block consisted of two groups of stacked dilated residual blocks in our network. In the decoder phase, two continuous regular residual blocks were inserted between the joint context up-sampling blocks.

Figure 5. Residual blocks in our network. (a) Regular residual block; (b) factorized residual block; (c) dilated residual block. Each uses a 1 × 1 reduction to c/4 channels, a 3 × 3 convolution (split into 3 × 1 and 1 × 3 in (b), dilated with rate r in (c)), and a 1 × 1 expansion back to c channels, with a residual addition.

3.3. Loss Function

As to classification tasks, the cross-entropy loss has proved very effective. However, in our task, the road edge area needs more attention when performing the inference process, and the faraway road in the image takes fewer pixels. We proposed a spatially-dependent weight to handle this problem by enhancing the loss on the road edge region and the faraway road area. The road edge region (ER) was defined as a set of the pixels around the road edge pixels E, which was obtained from the ground truth label image using the Canny algorithm [32], as shown in Figure 6. The Manhattan distance was adopted to calculate the distance between other pixels and edge pixels, and T ∈ R was used to control the region size. The weight is then defined as in Equation (3), which takes into account the road edge region and the faraway distance factor. The loss function with the spatial weight is shown in Equation (4), which is referred to as CE-SW, and the traditional cross-entropy loss is referred to as CE in our paper. The experiments showed that CE-SW could significantly improve the performance of the models on the occlusion-free road segmentation task.

ER = \{ v(i', j') \mid |i - i'| + |j - j'| \le T, \ e(i, j) \in E, \ v(i', j') \in \mathrm{Img} \} ,   (2)
w(i, j) = \begin{cases} 1, & \text{if } p(i, j) \notin ER \\ \frac{k|i - i_0| + |j - j_0|}{kh + w/2} \times 2 + 2, & \text{if } p(i, j) \in ER \end{cases} ,   (3)

where w and h are the width and height of the image, k = h/w is the rate to balance the height and width of the image, i and j are the pixel indices, and i_0 and j_0 are the bottom center pixel indices.

Loss(y, p) = \sum_{i=1}^{H} \sum_{j=1}^{W} -w(i, j) \left[ y_{i,j} \log p_{i,j} + (1 - y_{i,j}) \log(1 - p_{i,j}) \right] ,   (4)

where y is the ground truth, p is the predicted probability, and i and j are the pixel indices in the image.

Figure 6. Visualization of the road edge region. (a) The road segmentation label; (b) road edge obtained from (a) by the Canny algorithm; (c) road edge region with a width of 10 pixels.
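As a concrete illustration, here is a rough PyTorch sketch of the spatially-weighted cross-entropy loss of Equations (2)–(4); the edge map is assumed to be precomputed (e.g., with Canny on the label image), and the box-dilation approximation of the Manhattan neighbourhood is our own simplification rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def spatial_weight_map(edge_mask, T=10):
    """Build w(i, j): 1 outside the edge region, and a distance-scaled value (>= 2)
    inside the region within distance T of a road edge pixel (Equations (2)-(3)).
    edge_mask: (H, W) binary tensor of Canny edges extracted from the road label."""
    H, W = edge_mask.shape
    k = H / W
    # Dilate the edge mask with a (2T+1) box as a simple stand-in for the
    # Manhattan-distance neighbourhood of Equation (2) (an approximation).
    er = F.max_pool2d(edge_mask[None, None].float(), kernel_size=2 * T + 1,
                      stride=1, padding=T)[0, 0] > 0
    i = torch.arange(H).view(H, 1).expand(H, W).float()
    j = torch.arange(W).view(1, W).expand(H, W).float()
    i0, j0 = H - 1, (W - 1) / 2.0                     # bottom-center pixel
    dist = (k * (i - i0).abs() + (j - j0).abs()) / (k * H + W / 2.0)
    weight = torch.ones(H, W)
    weight[er] = dist[er] * 2.0 + 2.0
    return weight

def ce_sw_loss(pred_prob, target, weight):
    """Equation (4): pixel-wise binary cross-entropy scaled by the spatial weight map.
    pred_prob, target, weight: (H, W) tensors; pred_prob holds road probabilities."""
    bce = -(target * torch.log(pred_prob + 1e-7)
            + (1 - target) * torch.log(1 - pred_prob + 1e-7))
    return (weight * bce).sum()
```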
We present results based on evaluations with our manually annotated dataset based on the KITTI dataset named quantitative results based on evaluations with our manually annotated dataset based on the KITTI KITTI-OFRS dataset. The presented results appear all to be based on the manual dataset annotations dataset named KITTI-OFRS dataset. The presented results appear all to be based on the manual except the qualitative results on Cityscapes dataset using predicted semantics as input. We first trained dataset annotations except the qualitative results on Cityscapes dataset using predicted semantics as the models on the proposed KITTI-OFRS dataset, and the experimental results demonstrate that the input. We first trained the models on the proposed KITTI-OFRS dataset, and the experimental results proposed approach spends less time on inference and obtains better performance. Then, we compared demonstrate that the proposed approach spends less time on inference and obtains better the performance of those models when trained with traditional cross-entropy loss function and the performance. Then, we compared the performance of those models when trained with traditional proposed cross-entrop spatially-weighted y loss function and cross-entr the p opy roposed loss function. spatially-we Mor igh eover ted cros , wes-en tested tropy lo the generalization ss function. performance Moreover, we of tes the ted models the geon nera the liza Cityscapes tion perform dataset ance of . the m Finally odel , the s operformance n the Cityscapof es da the tamodels set. Finabased lly, the performance of the models based on automatically inferred semantics was visualized to show on automatically inferred semantics was visualized to show that our network works well in the that our network works well in the real system. real system. 4.1. 4.1. Data Datasets sets There were no available datasets for the proposed occlusion-free road segmentation task, so we There were no available datasets for the proposed occlusion-free road segmentation task, so we built built our own datasets. We built a real-world dataset named KITTI-OFRS based on the public KITTI our own datasets. We built a real-world dataset named KITTI-OFRS based on the public KITTI semantic semantic segmentation benchmark, which is used for training and evaluation. Moreover, we segmentation benchmark, which is used for training and evaluation. Moreover, we qualitatively tested qualitatively tested our well-trained model on the Cityscape dataset [33] for a view of its our well-trained model on the Cityscape dataset [33] for a view of its generalization ability. generalization ability. KITTI-OFRS Dataset The real-world dataset was built on the public KITTI semantic segmentation KITTI-OFRS Dataset The real-world dataset was built on the public KITTI semantic benchmark, which is part of the KITTI dataset [34]. The KITTI dataset is the largest data collection for segmentation benchmark, which is part of the KITTI dataset [34]. The KITTI dataset is the largest data computer vision algorithms in the world’s largest autopilot scenario. The dataset is used to evaluate collection for computer vision algorithms in the world’s largest autopilot scenario. 
The dataset is used the performance of computer vision technologies and contains real-world image data collected from to evaluate the performance of computer vision technologies and contains real-world image data scenes such as urban, rural, and highways, with up to 15 vehicles and 30 pedestrians per image, collected from scenes such as urban, rural, and highways, with up to 15 vehicles and 30 pedestrians as well as varying degrees of occlusion. The KITTI semantic segmentation benchmark consists of 200 per image, as well as varying degrees of occlusion. The KITTI semantic segmentation benchmark semantically annotated train as well as 200 test samples corresponding to the KITTI Stereo and Flow consists of 200 semantically annotated train as well as 200 test samples corresponding to the KITTI Benchmark 2015. We only annotated the available 200 semantically annotated training samples for Stereo and Flow Benchmark 2015. We only annotated the available 200 semantically annotated our task and randomly split them into two parts, one contained 160 samples for training, and the training samples for our task and randomly split them into two parts, one contained 160 samples for other contained 40 samples for evaluation. We named this real-world dataset as KITTI-OFRS dataset. training, and the other contained 40 samples for evaluation. We named this real-world dataset as One sample in this dataset contained the RGB image, normal semantic labels, and occlusion-free road KITTI-OFRS dataset. One sample in this dataset contained the RGB image, normal semantic labels, segmentation labels, as demonstrated in Figure 7. and occlusion-free road segmentation labels, as demonstrated in Figure 7. Figure Figure 7. 7. An An e example xample of the of the KITTI-occlusion-fr KITTI-occlusion-free road ee roadsegmentati segmentation on (KITTI (KITTI-OFRS) -OFRS) dadataset taset sampl sample. e. (a)(a the ) the RGB im RGB image; age; ( (b) b) annotation of semantic segmentation; annotation of semantic segmentation; (c (c ) annotation ) annotation of full road ar of full road ea, white area, white denotes road. denotes road. Cityscapes Dataset The Cityscapes dataset contains 5000 images collected in street scenes from Cityscapes Dataset The Cityscapes dataset contains 5000 images collected in street scenes from 5050 di di er ffer ent ent citie citie s. sThe . The dataset dataset is is divided divided into three sub into three subsets, sets, inc including luding 29 2975 75 im images ages in the tra in the training ining set,set, 50050 images 0 imagin es i the n validation the validaset, tion and set, and 1525 images 1525 im inag the estesting in the te set.st Hin igh-quality g set. High pixel-level -quality pannotations ixel-level annotations of 19 semantic classes are provided in this dataset. We only used this dataset for the of 19 semantic classes are provided in this dataset. We only used this dataset for the generalization generalization ability test. ability test. Sensors 2019, 19, 4711 9 of 15 Classes Transformation The occlusion-free road segmentation network was designed to apply in the semantic domain. However, di erent semantic segmentation datasets may have di erent categories, and one category may have a di erent class labels in di erent datasets. 
Data Augmentation. In the training phase, the training data were augmented with random cropping and padding, and flipping left to right. Moreover, to tackle the uncertainty of the semantic labels due to annotation errors, we augmented the training data by the technique of label smoothing, which was first proposed in InceptionV2 [35] to reduce over-fitting and increase the adaptive ability of the model. We used this method to add noise to the semantic one-hot input, which could make our model more adaptive to annotation errors and prediction errors from other semantic segmentation methods. Unlike the original usage that takes a constant value of ε for all the samples, we chose ε as a random value between 0.1 and 0.2 following a uniform distribution, sampled independently for each pixel in a training batch:

y^{LS} = y (1 - ε) + ε / K .   (7)

4.2. Evaluation Metrics

For quantitative evaluation, precision (PRE), recall (REC), F1 score, average precision (ACC), and intersection-over-union (IoU) were used as the metrics within a region around the road edges within 4 pixels. The metrics acting on such a region are more powerful for testing the network performance than on the whole set of pixels, taking into account the primary task of occlusion reasoning. The metrics are calculated as in Equations (8)–(12), where TP, TN, FP, FN are, respectively, the number of true positives, true negatives, false positives, and false negatives at the pixel level. Our experiments considered an assessment that demonstrates the effectiveness of our approach for inferring the occluded road in the semantic domain.

PRE = TP / (TP + FP) ,   (8)
REC = TP / (TP + FN) ,   (9)
F1 = 2 · PRE · REC / (PRE + REC) ,   (10)
ACC = (TP + TN) / (TP + FP + TN + FN) ,   (11)
IoU = TP / (TP + FP + FN) .   (12)
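For reference, a sketch of how these edge-region metrics could be computed; restricting the evaluation to a band within 4 pixels of the road edges is implemented here with a simple mask dilation, which is our assumption about the details of the evaluation protocol.

```python
import torch
import torch.nn.functional as F

def edge_region_metrics(pred, gt, edge_mask, band=4):
    """Compute PRE, REC, F1, ACC, and IoU (Equations (8)-(12)) inside a band of
    `band` pixels around the road edges. pred, gt: (H, W) binary road masks;
    edge_mask: (H, W) binary Canny edges of the ground-truth road label."""
    region = F.max_pool2d(edge_mask[None, None].float(), kernel_size=2 * band + 1,
                          stride=1, padding=band)[0, 0] > 0
    p, g = pred[region].bool(), gt[region].bool()
    tp = (p & g).sum().item()
    tn = (~p & ~g).sum().item()
    fp = (p & ~g).sum().item()
    fn = (~p & g).sum().item()
    pre = tp / (tp + fp + 1e-9)
    rec = tp / (tp + fn + 1e-9)
    f1 = 2 * pre * rec / (pre + rec + 1e-9)
    acc = (tp + tn) / (tp + tn + fp + fn + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    return pre, rec, f1, acc, iou
```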
4.3. Implementation Details

In the experiments, we implemented our architectures in PyTorch version 1.2 [36] (Facebook, CA, USA) with CUDA 10.0 and cuDNN back-ends. All experiments were run on a single NVIDIA GTX-1080Ti GPU. Due to GPU memory limitations, we had a maximum batch size of 4. During optimization, we used the SGD optimizer [37] with a weight decay of 0.0001 and a momentum of 0.9. The learning rate was set using the poly strategy with a start value of 0.01 and a power of 0.9. The edge region width T was set to 10 in the training phase and 4 in the evaluation phase.
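The optimizer and schedule described above can be set up roughly as follows; the lambda-based poly schedule, the max_iter value, and the placeholder model are illustrative assumptions about how the "poly" strategy is wired up.

```python
import torch

# Illustrative training setup matching the reported hyper-parameters:
# SGD with lr=0.01, momentum=0.9, weight decay=1e-4, and a poly learning-rate decay
# lr = base_lr * (1 - iter / max_iter) ** 0.9.
model = torch.nn.Conv2d(11, 2, 3, padding=1)   # placeholder standing in for OFRSNet
max_iter = 30000                               # assumed total number of iterations
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** 0.9)

# Typical loop (sketch):
# for it, (x, y, w_map) in enumerate(loader):
#     prob = model(x).softmax(dim=1)[0, 1]          # road probability map
#     loss = ce_sw_loss(prob, y[0], w_map[0])       # spatially-weighted loss (see above)
#     optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```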
4.4. Results and Analysis

To evaluate the effectiveness of our method on the occlusion-free road segmentation task, we trained the proposed model on the KITTI-OFRS dataset, as well as some other lightweight baseline models, such as ENet, SegNet, ERFNet, and ORBNet. The samples were resized to 384 × 1248 during training and testing. The quantitative and qualitative results are shown in Table 2 and Figure 8, respectively. As shown in Table 2 and Figure 8, the models achieved comparable results on the proposed task, and our method was superior to the baseline models in both accuracy and runtime. In Figure 8, red denotes false negatives, blue areas correspond to false positives, and green represents true positives. The models all performed well in the semantic domain containing more compact information of the driving environment, which indicates that the semantic and spatial information is more essential for occlusion reasoning than color and textural features. As can be seen from Figure 8, the models obtained significant results on both simple straight roads and complex intersection areas. Variable occlusion situations could be handled well, even in some heavily occluded scenes. Based on the results of the proposed task, the whole road structure could be obtained and could be easily transformed into a 3D world representation by an inverse perspective transformation without the affectation of the on-road objects. Empirically, higher road detection precision may lead to a better road model for better path planning.

Comparison of accuracy and computation complexity. Our model achieved a significant trade-off between accuracy and efficiency, a conclusion drawn by comparing it with other models. To compare the computation complexity, we employed the number of parameters, GFLOPs, and frames per second (FPS) as the evaluation metrics. FPS was measured on an Nvidia GTX-1080Ti GPU with an input size of 384 × 1248 and was averaged over 100 runs. As can be seen from Table 2, our model outperformed ENet by 1.5% in the F1 score and 2.6% in the IoU while running only a little slower. Our model ran almost two times faster than ERFNet and improved 1.0% in the F1 score and 1.7% in the IoU. Compared to SegNet and ORBNet, our model got a little improvement in accuracy but was about three times faster in the inference phase. In conclusion, our model achieved a better trade-off between accuracy and efficiency.

Table 2. Evaluation results of models trained with spatially-weighted cross-entropy loss (CE-SW).

Model   | Parameters | GFLOPs | FPS  | ACC   | PRE   | REC   | F1    | IoU
--------|------------|--------|------|-------|-------|-------|-------|------
ENet    | 0.37M      | 3.83   | 52   | 91.8% | 92.1% | 89.3% | 90.7% | 82.9%
ERFNet  | 2.06M      | 24.43  | 25   | 92.3% | 92.6% | 89.7% | 91.2% | 83.8%
SegNet  | 29.46M     | 286.03 | 16   | 92.9% | 93.6% | 90.2% | 91.8% | 84.9%
ORBNet  | 1.91M      | 48.48  | 11.5 | 92.7% | 93.4% | 89.9% | 91.6% | 84.5%
OFRSNet | 0.39M      | 2.99   | 46   | 93.2% | 94.2% | 90.3% | 92.2% | 85.5%

Figure 8. Qualitative results on the KITTI-OFRS dataset. The columns from left to right are the results of GT, ENet, ORBNet, and OFRSNet, respectively. Red denotes false negatives; blue areas correspond to false positives, and green represents true positives.

Comparison of loss function. To evaluate the effectiveness of the proposed spatially-weighted cross-entropy loss, we trained the models both with the traditional cross-entropy loss (CE) and the spatially-weighted cross-entropy loss (CE-SW); the evaluation results with CE and the metrics degradation are shown in Table 3. When trained with CE, the models saw obvious metrics degradation compared to CE-SW. The values in parentheses are the metrics degradation compared to when the models were trained with CE-SW, which shows that the spatially-weighted cross-entropy loss was very beneficial for increasing accuracy. Intuitively, the spatially-weighted cross-entropy loss forced the models to take care of the road edge region where the occlusion mostly occurs.

Table 3. Evaluation results of models trained with cross-entropy loss (CE). The values in parentheses are the metrics degradation compared to that when models were trained with spatially-weighted cross-entropy loss (CE-SW).

Model   | ACC           | PRE           | REC           | F1            | IoU
--------|---------------|---------------|---------------|---------------|---------------
ENet    | 90.4% (−1.4%) | 90.5% (−1.6%) | 87.6% (−1.7%) | 89.0% (−1.7%) | 80.2% (−2.7%)
ERFNet  | 90.5% (−1.8%) | 90.9% (−1.7%) | 87.3% (−2.4%) | 89.1% (−2.1%) | 80.3% (−3.5%)
SegNet  | 92.1% (−0.8%) | 92.6% (−1.0%) | 89.4% (−0.8%) | 91.0% (−0.8%) | 83.5% (−1.4%)
ORBNet  | 91.5% (−1.2%) | 92.2% (−1.2%) | 88.4% (−1.5%) | 90.2% (−1.4%) | 82.2% (−2.3%)
OFRSNet | 91.7% (−1.5%) | 92.4% (−1.8%) | 88.6% (−1.7%) | 90.5% (−1.7%) | 82.6% (−2.9%)
Comparison of convolution with and without context. To evaluate the benefits of the context convolution block, we replaced the context convolution block with a regular convolution operation in the down-sampling and up-sampling blocks. As shown in Table 4, the model with context information outperformed the model without it by 0.6% in the F1 score and 1.0% in the IoU, which demonstrates that the context information is desirable for the proposed approach.

Table 4. Performance comparison of the model with and without context.

Model   | Context | Parameters | GFLOPs | ACC   | PRE   | REC   | F1    | IoU
--------|---------|------------|--------|-------|-------|-------|-------|------
OFRSNet | w/o     | 0.34M      | 2.96   | 92.7% | 92.8% | 90.4% | 91.6% | 84.5%
OFRSNet | w/      | 0.39M      | 2.99   | 93.2% | 94.2% | 90.3% | 92.2% | 85.5%

Generalization on Cityscapes Dataset. To further test the generalization ability of our model, we conducted qualitative test experiments on the Cityscapes dataset with the model trained only on the KITTI-OFRS dataset. As can be seen from Figure 9, the well-trained model performed well on the complex real-world Cityscapes dataset, which indicates that our model obtained quite a good generalization ability on the occlusion-free road segmentation task. The generalization ability of our model benefited from inferring the occluded road in the semantic domain, which made the model focus on learning the occlusion mechanism in the driving scenes without the affectation of sensing noise. Across scenes, the color and textural features may differ very much at the same position due to different camera configurations and lighting conditions, while the semantic features share a similar distribution.
The occlusion situations were able to understand that the occluded road area was inferred in variable occlusion scenes by the proposed method according to the results. As shown in correctly inferred in variable occlusion scenes by the proposed method according to the results. As Figure 9, the detection results obtained the overall structure of the road and accurate segmentation shown in Figure 9, the detection results obtained the overall structure of the road and accurate despite occlusion. Moreover, it is applicable to combine our method with other semantic segmentation segmentation despite occlusion. Moreover, it is applicable to combine our method with other algorithms in the real system due to its lightweight and eciency. As shown in Figure 10, when taking semantic segmentation algorithms in the real system due to its lightweight and efficiency. As shown the predicted semantics obtained by the DeepLabv3+ algorithm as input, the proposed OFRSNet still in Figure 10, when taking the predicted semantics obtained by the DeepLabv3+ algorithm as input, works well to predict the occluded road areas and outperforms ENet and ORBNet in terms of accuracy the proposed OFRSNet still works well to predict the occluded road areas and outperforms ENet and and robustness. ORBNet in terms of accuracy and robustness. Figure 9. Qualitative results on the Cityscapes dataset using ground truth semantics as input. Green represents the detected full road area. Sensors 2019, 19, x FOR PEER REVIEW 13 of 15 Sensors 2019 Figure 9. , 19, 4711 Qualitative results on the Cityscapes dataset using ground truth semantics as input. Green 13 of 15 represents the detected full road area. Figure 10. Qualitative results on the Cityscapes dataset using predicted semantics as input, which Figure 10. Qualitative results on the Cityscapes dataset using predicted semantics as input, which were were obtained by the DeepLabv3+ algorithm. Green represents the detected full road area. obtained by the DeepLabv3+ algorithm. Green represents the detected full road area. 5. Conclusions 5. Conclusions In In t this his pa paper per, we present , we presented ed a annocclusion-fr occlusion-free ee roa road d segmenta segmentation tion nnetwork etwork to to inf infer er the oc the occluded cluded road road areaare ofaan ofurban an urb driving an driv scenario ing scen frar om io from monocular mono vision. cular v The ision model . The m weopr del esented we pris esen a lightweight ted is a lightweight and efficient encoder–decoder fully convolutional architecture that contains down- and ecient encoder–decoder fully convolutional architecture that contains down-sampling and sampling and up-sampling blocks combined with global contextual operations. Meanwhile, a up-sampling blocks combined with global contextual operations. Meanwhile, a spatially-weighted spatially-weighted cross-entropy loss was proposed to induce the network to pay more attention to cross-entropy loss was proposed to induce the network to pay more attention to the road edge region the road edge region in the training phase. We showed the effectiveness of the model on the self-built in the training phase. We showed the e ectiveness of the model on the self-built small but ecient small but efficient KITTI-OFRS dataset. Compared to other recent lightweight semantic segmentation KITTI-OFRS dataset. Compared to other recent lightweight semantic segmentation algorithms, our algorithms, our network obtained a better trade-off between accuracy and runtime. 
5. Conclusions

In this paper, we presented an occlusion-free road segmentation network to infer the occluded road area of an urban driving scenario from monocular vision. The model is a lightweight and efficient encoder–decoder fully convolutional architecture that contains down-sampling and up-sampling blocks combined with global contextual operations. Meanwhile, a spatially-weighted cross-entropy loss was proposed to induce the network to pay more attention to the road edge region in the training phase. We showed the effectiveness of the model on the self-built small but efficient KITTI-OFRS dataset. Compared to other recent lightweight semantic segmentation algorithms, our network obtained a better trade-off between accuracy and runtime. The comparisons of the models trained with different loss functions highlighted the benefits of the proposed spatially-weighted cross-entropy loss for the occlusion-reasoning road segmentation task. The generalization ability of our model was further qualitatively tested on the Cityscapes dataset, and the results clearly demonstrated our model's ability to infer the occluded road even in complex scenes. Moreover, the proposed OFRSNet can be efficiently combined with other semantic segmentation algorithms due to its small size and minimal runtime. We believe that being able to infer occluded road regions in autonomous driving systems is a key component of achieving a full comprehension of the scene and will allow better planning of the ego-vehicle trajectories.

Author Contributions: Conceptualization, K.W., F.Y., and B.Z.; Data curation, K.W. and Q.Y.; Formal analysis, K.W., B.Z., L.T., and C.L.; Funding acquisition, F.Y.; Investigation, K.W., L.T., Q.Y., and C.L.; Methodology, K.W.; Project administration, F.Y. and B.Z.; Resources, B.Z.; Software, K.W. and L.T.; Supervision, K.W.; Validation, K.W., B.Z., L.T., and C.L.; Visualization, K.W. and Q.Y.; Writing—original draft, K.W. and B.Z.; Writing—review and editing, K.W. and B.Z.

Funding: This research was funded by the National Natural Science Foundation of China (Grant No. 51975434), the Overseas Expertise Introduction Project for Discipline Innovation (Grant No. B17034), the Innovative Research Team in University of Ministry of Education of China (Grant No. IRT_17R83), and the Department of Science and Technology, the Hubei Provincial People's Government (Grant No. 2017BEC196).

Acknowledgments: The authors thank the reviewers and editors for their help.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Oliveira, G.L.; Burgard, W.; Brox, T. Efficient deep models for monocular road segmentation. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4885–4891.
2. Mendes, C.C.T.; Fremont, V.; Wolf, D.F. Exploiting Fully Convolutional Neural Networks for Fast Road Detection. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; Okamura, A., Menciassi, A., Ude, A., Burschka, D., Lee, D., Arrichiello, F., Liu, H., Eds.; IEEE: New York, NY, USA, 2016; pp. 3174–3179.
3. Zhang, X.; Chen, Z.; Wu, Q.M.J.; Cai, L.; Lu, D.; Li, X. Fast Semantic Segmentation for Scene Perception. IEEE Trans. Ind. Inform. 2019, 15, 1183–1192. [CrossRef]
4. Wang, B.; Fremont, V.; Rodriguez, S.A. Color-based Road Detection and its Evaluation on the KITTI Road Benchmark. In Proceedings of the 2014 IEEE Intelligent Vehicles Symposium Proceedings, Dearborn, MI, USA, 8–11 June 2014; pp. 31–36.
5. Song, X.; Rui, T.; Zhang, S.; Fei, J.; Wang, X. A road segmentation method based on the deep auto-encoder with supervised learning. Comput. Electr. Eng. 2018, 68, 381–388. [CrossRef]
6. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
7. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
8. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
9. Mano, K.; Masuzawa, H.; Miura, J.; Ardiyanto, I. Road Boundary Estimation for Mobile Robot Using Deep Learning and Particle Filter; IEEE: New York, NY, USA, 2018; pp. 1545–1550.
10. Li, K.; Shao, J.; Guo, D. A Multi-Feature Search Window Method for Road Boundary Detection Based on LIDAR Data. Sensors 2019, 19, 1551. [CrossRef]
11. Khalilullah, K.M.I.; Jindai, M.; Ota, S.; Yasuda, T. Fast Road Detection Methods on a Large Scale Dataset for Assisting Robot Navigation Using Kernel Principal Component Analysis and Deep Learning; IEEE: New York, NY, USA, 2018; pp. 798–803.
12. Son, J.; Yoo, H.; Kim, S.; Sohn, K. Real-time illumination invariant lane detection for lane departure warning system. Expert Syst. Appl. 2015, 42, 1816–1824. [CrossRef]
13. Li, Q.; Zhou, J.; Li, B.; Guo, Y.; Xiao, J. Robust Lane-Detection Method for Low-Speed Environments. Sensors 2018, 18, 4274. [CrossRef] [PubMed]
14. Cao, J.; Song, C.; Song, S.; Xiao, F.; Peng, S. Lane Detection Algorithm for Intelligent Vehicles in Complex Road Conditions and Dynamic Environments. Sensors 2019, 19, 3166. [CrossRef] [PubMed]
15. Liu, X.; Deng, Z. Segmentation of Drivable Road Using Deep Fully Convolutional Residual Network with Pyramid Pooling. Cogn. Comput. 2017, 10, 272–281. [CrossRef]
16. Cai, Y.; Li, D.; Zhou, X.; Mou, X. Robust Drivable Road Region Detection for Fixed-Route Autonomous Vehicles Using Map-Fusion Images. Sensors 2018, 18, 4158. [CrossRef] [PubMed]
17. Aly, M. Real time Detection of Lane Markers in Urban Streets. In Proceedings of the Intelligent Vehicles Symposium, Eindhoven, The Netherlands, 4–6 June 2008.
18. Laddha, A.; Kocamaz, M.K.; Navarro-Serment, L.E.; Hebert, M. Map-supervised road detection. In Proceedings of the Intelligent Vehicles Symposium, Gothenburg, Sweden, 19–22 June 2016; pp. 118–123.
19. Alvarez, J.M.; Salzmann, M.; Barnes, N. Learning Appearance Models for Road Detection. In Proceedings of the Intelligent Vehicles Symposium, Gold Coast, QLD, Australia, 23–26 June 2013.
20. Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling. arXiv 2015, arXiv:1505.07293.
21. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
22. Suleymanov, T.; Amayo, P.; Newman, P. Inferring Road Boundaries Through and Despite Traffic. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 4–7 November 2018; pp. 409–416.
23. Becattini, F.; Berlincioni, L.; Galteri, L.; Seidenari, L.; Del Bimbo, A. Semantic Road Layout Understanding by Generative Adversarial Inpainting. arXiv 2018, arXiv:1805.11746.
24. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
25. Romera, E.; Álvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [CrossRef]
26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
27. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
28. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond. arXiv 2019, arXiv:1904.11492.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
31. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H.S. Conditional Random Fields as Recurrent Neural Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
32. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [CrossRef] [PubMed]
33. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
34. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [CrossRef]
35. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
36. PyTorch. Available online: http://pytorch.org/ (accessed on 1 September 2019).
37. Bottou, L. Large-Scale Machine Learning with Stochastic Gradient Descent; Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

sensors Article Occlusion-Free Road Segmentation Leveraging Semantics for Autonomous Vehicles 1 , 2 , 3 1 , 2 , 3 1 , 2 , 3 , 1 , 2 , 3 1 , 2 , 3 4 Kewei Wang , Fuwu Yan , Bin Zou *, Luqi Tang , Quan Yuan and Chen Lv Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan 430070, China; wkw199q@whut.edu.cn (K.W.); yanfuwu@vip.sina.com (F.Y.); tlqqidong@163.com (L.T.); 231943@whut.edu.cn (Q.Y.) Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Wuhan 430070, China Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan 430070, China School of Mechanical and Aerospace Engineering, Nanyang Technological University, 639798, Singapore; lyuchen@ntu.edu.sg * Correspondence: zoubin@whut.edu.cn; Tel.: +86-138-7115-3253 Received: 3 September 2019; Accepted: 24 October 2019; Published: 30 October 2019 Abstract: The deep convolutional neural network has led the trend of vision-based road detection, however, obtaining a full road area despite the occlusion from monocular vision remains challenging due to the dynamic scenes in autonomous driving. Inferring the occluded road area requires a comprehensive understanding of the geometry and the semantics of the visible scene. To this end, we create a small but e ective dataset based on the KITTI dataset named KITTI-OFRS (KITTI-occlusion-free road segmentation) dataset and propose a lightweight and ecient, fully convolutional neural network called OFRSNet (occlusion-free road segmentation network) that learns to predict occluded portions of the road in the semantic domain by looking around foreground objects and visible road layout. In particular, the global context module is used to build up the down-sampling and joint context up-sampling block in our network, which promotes the performance of the network. Moreover, a spatially-weighted cross-entropy loss is designed to significantly increases the accuracy of this task. Extensive experiments on di erent datasets verify the e ectiveness of the proposed approach, and comparisons with current excellent methods show that the proposed method outperforms the baseline models by obtaining a better trade-o between accuracy and runtime, which makes our approach is able to be applied to autonomous vehicles in real-time. Keywords: autonomous vehicles; scene understanding; occlusion reasoning; road detection 1. Introduction Reliable perception of the surrounding environment plays a crucial role in autonomous driving vehicles, in which robust road detection is one of the key tasks. Many types of road detection methods have been proposed in the literature based on monocular camera, stereo vision, or LiDAR (Light Detector and Ranging) sensors. With the rapid progress in deep learning techniques, significant achievements in segmentation techniques have significantly promoted road detection in monocular images [1–5]. Generally, these algorithms label each and every pixel in the image with one of the object classes by color and textual features. However, the road is often occluded by dynamic trac participants as well as static transport infrastructures when measured with on-board cameras, which makes it hard to directly obtain a full road area. 
When performing decision-making in extremely challenging scenarios, such as dynamic urban scenes, a comprehensive understanding of the environment needs to Sensors 2019, 19, 4711; doi:10.3390/s19214711 www.mdpi.com/journal/sensors Sensors 2019, 19, 4711 2 of 15 carefully tackle the occlusion problem. As to the road detection task, road segmentation of the visible area is not sucient for path planning and decision-making. It is necessary to get the whole structure and layout of the local road with an occlusion reasoning process in complex driving scenarios where clutter and occlusion occur with high frequency. Inspired by the fact that human beings are capable of completing the road structure in their minds by understanding the on-road objects and the visible road area, we believe that a powerful convolution network could learn to infer the occluded road area as human beings do. Intuitively, to the occlusion reasoning task, the color and texture features are of relatively low importance, what matters is the semantic and spatial features of the elements in the environment. As far as we know, semantic segmentation [6–8] is one of the most complete forms of visual scene understanding, where the goal is to label each pixel with the corresponding semantic label (e.g., tree, pedestrian, car, etc.). So, instead of an RGB image, we performed the occlusion reasoning road segmentation using semantic representation as input, which could be obtained by popular deep learning methods in real applications or human-annotated ground truth in the training phase. As shown in Figure 1, traditional road segmentation takes RGB image as input and labels road only in the visible area. As a comparison, our Sensors 2019, 19, x FOR PEER REVIEW 2 of 15 proposed occlusion-free road segmentation (OFRS) intends to leverage the semantic representation to necessary to get the whole structure and layout of the local road with an occlusion reasoning process infer the occluded road area in the driving scene. Note that the semantic input in the figure is just a in complex driving scenarios where clutter and occlusion occur with high frequency. visualization of the semantic representation, the actual input is the one-hot type of semantic label. Figure 1. Comparison of road segmentation and proposed occlusion-free road segmentation. (a) RGB Figure 1. Comparison of road segmentation and proposed occlusion-free road segmentation. (a) RGB image; (b) visualization of the results of road segmentation; (c) visualization of the semantic image; (b) visualization of the results of road segmentation; (c) visualization of the semantic representation of the scene, which could be obtained by semantic segmentation algorithms in real representation of the scene, which could be obtained by semantic segmentation algorithms in real applications or human annotation in training phase; (d) visualization of the results of the proposed applications or human annotation in training phase; (d) visualization of the results of the proposed occlusion-free road segmentation. Green refers to the road area in (b) and (d). occlusion-free road segmentation. Green refers to the road area in (b) and (d). In this paper, we aim to infer the occluded road area utilizing the semantic features of visible Inspired by the fact that human beings are capable of completing the road structure in their scenes and name this new task as occlusion-free road segmentation. 
First, a suitable dataset is created minds by understanding the on-road objects and the visible road area, we believe that a powerful based on the popular KITTI dataset, which is referred to as the KITTI-OFRS dataset in the following. convolution network could learn to infer the occluded road area as human beings do. Intuitively, to Second, the occl an usion end-to-end reasoning lightweight task, the co and lor eand cient textfully ure fe convolutional atures are of rel neural atively networks low impor fortanc theenew , what task matters is the semantic and spatial features of the elements in the environment. As far as we know, is proposed to learn the ability of occlusion reasoning. Moreover, a spatially-dependent weight is semantic segmentation [6–8] is one of the most complete forms of visual scene understanding, where applied to the cross-entropy loss to increase the performance of our network. We evaluate our model the goal is to label each pixel with the corresponding semantic label (e.g., tree, pedestrian, car, etc.). on di erent datasets and compare it with some other excellent algorithms which pursue the trade-o So, instead of an RGB image, we performed the occlusion reasoning road segmentation using between accuracy and runtime in the semantic segmentation task. semantic representation as input, which could be obtained by popular deep learning methods in real The main contributions of this paper are as follows: applications or human-annotated ground truth in the training phase. As shown in Figure 1, We analyze the occlusion problem in road detection and propose the novel task of occlusion-free traditional road segmentation takes RGB image as input and labels road only in the visible area. As road segmentation in the semantic domain, which infers the occluded road area using semantic a comparison, our proposed occlusion-free road segmentation (OFRS) intends to leverage the features of the dynamic scenes. semantic representation to infer the occluded road area in the driving scene. Note that the semantic input in the figure is just a visualization of the semantic representation, the actual input is the one- hot type of semantic label. In this paper, we aim to infer the occluded road area utilizing the semantic features of visible scenes and name this new task as occlusion-free road segmentation. First, a suitable dataset is created based on the popular KITTI dataset, which is referred to as the KITTI-OFRS dataset in the following. Second, an end-to-end lightweight and efficient fully convolutional neural networks for the new task is proposed to learn the ability of occlusion reasoning. Moreover, a spatially-dependent weight is applied to the cross-entropy loss to increase the performance of our network. We evaluate our model on different datasets and compare it with some other excellent algorithms which pursue the trade- off between accuracy and runtime in the semantic segmentation task. The main contributions of this paper are as follows: • We analyze the occlusion problem in road detection and propose the novel task of occlusion-free road segmentation in the semantic domain, which infers the occluded road area using semantic features of the dynamic scenes. 
Sensors 2019, 19, 4711 3 of 15 To complete this task, we create a small but ecient dataset based on the popular KITTI dataset named the KITTI-OFRS dataset, design a lightweight and ecient encoder–decoder fully convolution network referred to as OFRSNet and optimize the cross-entropy loss for the task by adding a spatially-dependent weight that could significantly increase the accuracy. We elaborately design the architecture of OFRSNet to obtain a good trade-o between accuracy and runtime. The down-sampling block and joint context up-sampling block in the network are designed to e ectively capture the contextual features that are essential for the occlusion reasoning process and increase the generalization ability of the model. The remainder of this paper is organized as follows: First, the related works in road detection are briefly introduced in Section 2. Section 3 introduces the methodology in detail, and Section 4 shows the experimental results. Finally, we draw conclusions in Section 5. 2. Related Works Road detection in autonomous driving has benefited from the development of the deep convolutional neural networks in recent years. Generally, the road is represented by its boundaries [9,10] or regions [1,2,11]. Moreover, road lane [12–14] and drivable area [15,16] detection also attract much attention from researchers, which concern the ego lane and the obstacle-free region of the road, respectively. The learning-based methods usually outperform the model-based methods due to the developed segmentation techniques. The model-based methods identify the road structure and road areas by shape [17,18] or appearance models [19]. The learning-based methods [3,6,7,16,20,21] classify the pixels in images as road and non-road, or road boundaries and non-road boundaries. However, the presence of foreground objects makes it hard to obtain full road despite the occlusion. To infer the road boundaries despite the occlusion, Suleymanov et al. [22] presented a convolutional neural network that contained intra-layer convolutions and produced outputs in a hybrid discrete-continuous form. Becattini et al. [23] proposed a GAN-based (Generative Adversarial Network) semantic segmentation inpainting model to remove all dynamic objects from the scene and focus on understanding its static components (such as streets, sidewalks, and buildings) to get a comprehension of the static road scene. In contrast to the above solutions, we conduct occlusion-free road segmentation to infer the occluded road area as a pixel-wise classification task. Even though the deep-learning methods have achieved remarkable performance in the pixel-wise classification task, to achieve the best trade-o between accuracy and eciency is still a challenging problem. Vijay et al. [20] presented a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet, which follows encoder–decoder architecture that is designed to be ecient both in memory and computational time in inference phase. Adam et al. [24] proposed a fast and compact encoder–decoder architecture named ENet that significantly has fewer parameters, and provides similar or better accuracy to SegNet. Romera et al. [25] proposed a novel layer design that leverages skip connections and convolutions with 1D kernels, which highly reduces the compute cost and increase the accuracy. 
Inspired by these networks, we follow the encoder–decoder architecture and enhance the down-sampling and up-sampling blocks with contextual extraction operations [26–28], which are proved to be helpful for segmentation-related tasks. This contextual information is even more essential and e ective for our occlusion reasoning task, which needs a comprehensive understanding of the driving scenes. 3. Occlusion-Free Road Segmentation 3.1. Task Definition The occlusion-free road segmentation task is defined as a pixel-level classification as the traditional road segmentation but with occlusion reasoning process to obtain a full representation of the road area. The input is fed to the model as a one-hot encoded tensor of the semantic segmentation labels WHC or predicted semantic segmentation probabilities’ tensor I 2 [0, 1] , where W is the width of the Sensors 2019, 19, 4711 4 of 15 Sensors 2019, 19, x FOR PEER REVIEW 4 of 15 image, H its height, and C the number of classes. In the same way, we trained the network to output WH2 ×× output a new tensor O∈[0,1] with the same width and height but containing only two a new tensor O 2 [0, 1] with the same width and height but containing only two categories belonging categories belonging to road and to non-r road and non-ro oad. ad. 3.2. Network Architecture 3.2. Network Architecture The proposed model is illustrated in Table 1 and visualized in Figure 2, and was designed to get The proposed model is illustrated in Table 1 and visualized in Figure 2, and was designed to get the best possible trade-o between accuracy and runtime. We followed the current trend of using the best possible trade-off between accuracy and runtime. We followed the current trend of using convolutions with residual connections [29] as the core elements of our architecture, to leverage their convolutions with residual connections [29] as the core elements of our architecture, to leverage their success in classification and segmentation problems. Inspired by SegNet and ENet, an encoder–decoder success in classification and segmentation problems. Inspired by SegNet and ENet, an encoder– architecture was adopted for the whole network architecture. The residual bottleneck blocks of di erent decoder architecture was adopted for the whole network architecture. The residual bottleneck blocks types were used as the basic blocks in the encoder and decoder. Dilated convolution was applied in the of different types were used as the basic blocks in the encoder and decoder. Dilated convolution was blocks to enlarge the respective field of the encoder. What is more, the context module was combined applied in the blocks to enlarge the respective field of the encoder. What is more, the context module with regular convolution to obtain a global understanding of the environment, which is really essential was combined with regular convolution to obtain a global understanding of the environment, which is to infer the occluded road area. In the decoder, we proposed a joint context up-sampling block to really essential to infer the occluded road area. In the decoder, we proposed a joint context up-sampling leverage the features of di erent resolutions to obtain richer and global information. block to leverage the features of different resolutions to obtain richer and global information. Deconv Down-sampling Joint Contextual Deliated Block Residual Block Factorized Block Block Upsampling Block Figure 2. The proposed occlusion-free road segmentation network architecture. Figure 2. 
The proposed occlusion-free road segmentation network architecture. Sensors 2019, 19, 4711 5 of 15 Sensors 2019, 19, x FOR PEER REVIEW 5 of 15 Table 1. Our network architecture in detail. Size refers to output feature maps size for an input size of 384 1248. Table 1. Our network architecture in detail. Size refers to output feature maps size for an input size of 384 × 1248. Stage Block Type Size Stage Block Type Size Context Down-sampling 192 624 16 Context Down-sampling 192 × 624 × 16 Context Down-sampling 96 312 32 Context Down-sampling 96 × 312 × 32 Factorized blocks 96 312 32 Encoder Context Down-sampling 48 156 64 Factorized blocks 96 × 312 × 32 Dilated blocks 48 156 64 Encoder Context Down-sampling 48 × 156 × 64 Context down-sampling 24 78 128 Dilated blocks 48 × 156 × 64 Dilated blocks 24 78 128 Context down-sampling 24 × 78 × 128 Joint Context Up-sampling 48 156 64 Dilated blocks 24 × 78 × 128 Bottleneck Blocks 48 156 64 Joint Context Up-sampling 48 × 156 × 64 Joint Context Up-sampling 96 312 32 Bottleneck Blocks 48 × 156 × 64 Decoder Bottleneck Blocks 96 312 32 Joint Context Up-sampling 96 × 312 × 32 Joint Context Up-sampling 192 624 16 Decoder Bottleneck Blocks 96 × 312 × 32 Bottleneck Blocks 192 624 16 Joint Context Up-sampling 192 × 624 × 16 Deconv 384 1248 2 Bottleneck Blocks 192 × 624 × 16 Deconv 384 × 1248 × 2 Context Convolution Block Recent works have shown that contextual information is helpful for Context Convolution Block Recent works have shown that contextual information is helpful for models to predict high-quality segmentation results. Modules which could enlarge the receptive field, models to predict high-quality segmentation results. Modules which could enlarge the receptive such as ASPP [21], DenseASPP [30], and CRFasRNN [31], have been proposed in the past years. Most field, such as ASPP [21], DenseASPP [30], and CRFasRNN [31], have been proposed in the past years. of these works explore context information in the decoder phase and ignore the surrounding context Most of these works explore context information in the decoder phase and ignore the surrounding when encoding the features in the early stage. On the other hand, the attention mechanism has been context when encoding the features in the early stage. On the other hand, the attention mechanism widely used for increasing model capability. Inspired by the non-local block [27] and SE block [26], has been widely used for increasing model capability. Inspired by the non-local block [27] and SE we proposed the context convolution, as shown in Figure 3. A context branch from [28] was added, block [26], we proposed the context convolution, as shown in Figure 3. A context branch from [28] bypassing the main branch of the convolution operation. As can be seen in Equation (1), the context was added, bypassing the main branch of the convolution operation. As can be seen in Equation (1), branch first adopted a 1 1 convolution W and softmax function to obtain the attention weights, and the context branch first adopted a 1 × 1 convolution 𝑊 and softmax function to obtain the attention then performed the attention pooling to obtain the global context features; then the global context weights, and then performed the attention pooling to obtain the global context features; then the features were transformed via a 1  1 convolution W and was added to the features of the main global context features were transformed via a 1 × 1 convolution 𝑊 and was added to the features convolution branch. of the main convolution branch. 
exp W x N k j ( ) z = x + W x , (1) i i  j ∑ N 𝑧 = 𝑥 +W j=1 p 𝑥 , (1) exp(W x ) ∑ ( )m m =1 where W and W denote linear transformation matrices. where 𝑊 and  𝑊 denote linear transformation matrices. C × H × W conv(1×1) 1 × H × W C × HW HW × 1 × 1 softmax conv(k×k), C1 C × 1 × 1 conv(1×1) Wv C1 × 1 × 1 C1 × H × W BN, ReLU C1 × H × W Figure 3. The context convolution block. Figure 3. The context convolution block. Sensors 2019, 19, 4711 6 of 15 Sensors 2019, 19, x FOR PEER REVIEW 6 of 15 Down-Sampling Block In our work, the down-sampling block performed down-sampling by Down-Sampling Block In our work, the down-sampling block performed down-sampling by using a 3  3 convolution with stride 2 in the main branch of a context convolution block, as stated using a 3 × 3 convolution with stride 2 in the main branch of a context convolution block, as stated above. The context branch extracted the global context information to obtain a global understanding above. The context branch extracted the global context information to obtain a global understanding of features. Down-sampling lets the deeper layers gather more context (to improve classification) and of features. Down-sampling lets the deeper layers gather more context (to improve classification) and helps to reduce computation. And we used two down-sampling blocks at the start of the network to helps to reduce computation. And we used two down-sampling blocks at the start of the network to reduce the feature size and make the network works eciently for large input. reduce the feature size and make the network works efficiently for large input. Joint Context Up-Sampling Block In the decoder, we proposed a joint context up-sampling block, Joint Context Up-Sampling Block In the decoder, we proposed a joint context up-sampling which takes two feature maps from di erent stages in the encoder, as shown in Figure 4. The feature block, which takes two feature maps from different stages in the encoder, as shown in Figure 4. The map from the earlier stage with bigger resolution and fewer channels carry sucient details in spatial, feature map from the earlier stage with bigger resolution and fewer channels carry sufficient details and the feature map from the later stage with a smaller resolution and more channels contain the in spatial, and the feature map from the later stage with a smaller resolution and more channels necessary facts in context. The joint context up-sampling block combines these two feature maps gently contain the necessary facts in context. The joint context up-sampling block combines these two feature and eciently using a context convolution block and bilinear up-sampling. The two branches of the maps gently and efficiently using a context convolution block and bilinear up-sampling. The two two feature maps were concatenated along the channels, and a context convolution block was applied branches of the two feature maps were concatenated along the channels, and a context convolution to the concatenated feature map. As shown in Figure 2, the joint context up-sampling blocks follow a block was applied to the concatenated feature map. As shown in Figure 2, the joint context up- sequential architecture, the current block utilized the former results and the corresponding decoder sampling blocks follow a sequential architecture, the current block utilized the former results and the features, which made the up-sampling operation more e ective. corresponding decoder features, which made the up-sampling operation more effective. 
1×,C1 Context Convolution Context Convolution 1×,C1 Concat Block (1x1) Block (1x1) Context Convolution Bilinear 2×,C2 Block (1x1) Up-sampling Figure 4. The joint context up-sampling block. Figure 4. The joint context up-sampling block. Residual Bottleneck Blocks Between the down-sampling and up-sampling blocks, some residual Residual Bottleneck Blocks Between the down-sampling and up-sampling blocks, some blocks were inserted to perform the encoding and decoding. In the early stage of the encoder, we residual blocks were inserted to perform the encoding and decoding. In the early stage of the encoder, applied factorized residual blocks to extract dense features. As shown in Figure 5b, a 3 3 convolution we applied factorized residual blocks to extract dense features. As shown in Figure 5b, a 3 × 3 was replaced by a 3 1 convolution and a 1 3 convolution in the residual branch to reduce parameters convolution was replaced by a 3 × 1 convolution and a 1 × 3 convolution in the residual branch to and computation. In the later stage of the encoder, we stacked dilated convolution blocks with di erent reduce parameters and computation. In the later stage of the encoder, we stacked dilated convolution rates to obtain a larger receptive field and obtain more contextual information. The dilated convolution blocks with different rates to obtain a larger receptive field and obtain more contextual information. block applied a dilated convolution on the 3 3 convolution in the residual branch compared to the The dilated convolution block applied a dilated convolution on the 3 × 3 convolution in the residual regular residual block, as shown in Figure 5c. The dilate rates in the stacked dilated residual blocks branch compared to the regular residual block, as shown in Figure 5c. The dilate rates in the stacked were 1, 2, 5, and 9, which were carefully chosen to avoid the gridding problem when inappropriate dilated residual blocks were 1, 2, 5, and 9, which were carefully chosen to avoid the gridding problem dilation rate is used. One dilated residual block consisted of two groups of stacked dilated residual when inappropriate dilation rate is used. One dilated residual block consisted of two groups of blocks in our network. In the decoder phase, two continuous regular residual blocks were inserted stacked dilated residual blocks in our network. In the decoder phase, two continuous regular residual between the joint context up-sampling blocks. blocks were inserted between the joint context up-sampling blocks. Sensors 2019, 19, 4711 7 of 15 Sensors 2019, 19, x FOR PEER REVIEW 7 of 15 C x H x W C x H x W C x H x W conv(1x1), c/4 Sensors 2019, 19, x FOR PEER REVIEW conv(1x1), c/4 7 of 15 conv(1x1), c/4 conv(3x1), c/4 conv(3x3), conv(3x3), C x H x W C x H x W C x H x W c/4 c/4, r conv(3x1), c/4 conv(1x1), c/4 conv(1x1), c/4 conv(1x1), c/4 conv(1x1), c conv(1x1), c conv(1x1), c conv(3x1), c/4 conv(3x3), conv(3x3), c/4 c/4, r conv(3x1), c/4 C x H x W C x H x W C x H x W conv(1x1), c conv(1x1), c conv(1x1), c (a) Regular Residual Block (b) Factorized Residual Block (c) Dilated Residual Block Figure 5. Residual blocks in our network. + Figure 5. + Residual blocks in our network. C x H x W C x H x W C x H x W 3.3. Loss Function 3.3. Loss Function (a) Regular Residual Block (b) Factorized Residual Block (c) Dilated Residual Block As to the classification tasks, the cross-entropy loss has proved very e ective. However, in our As to the classification tasks, the cross-entropy loss has proved very effective. 
However, in our Figure 5. Residual blocks in our network. task, the road edge area needs more attention paid to it when performing the inference process, and the task, the road edge area needs more attention paid to it when performing the inference process, and faraway road in the image took fewer pixels. We proposed a spatially-dependent weight to handle this the faraway road in the image took fewer pixels. We proposed a spatially-dependent weight to handle 3.3. Loss Function problem to enhance the loss on the road edge region and faraway road area. The road edge region (ER) this problem to enhance the loss on the road edge region and faraway road area. The road edge region As to the classification tasks, the cross-entropy loss has proved very effective. However, in our was defined as a set of the pixels around the road edge pixels E, which was obtained from the ground (ER) was defined as a set of the pixels around the road edge pixels E, which was obtained from the task, the road edge area needs more attention paid to it when performing the inference process, and truth label image using the Canny algorithm [32], as shown in Figure 6. The Manhattan distance was ground truth label image using the Canny algorithm [32], as shown in Figure 6. The Manhattan the faraway road in the image took fewer pixels. We proposed a spatially-dependent weight to handle adopted to calculate the distance between other pixels and edge pixels, and T 2 R was used to control distance was adopted to calculate the distance between other pixels and edge pixels, and T ∈𝑅 was this problem to enhance the loss on the road edge region and faraway road area. The road edge region the region size. Then the weight is defined as Equation (3), which takes into account the road edge used to control the region size. Then the weight is defined as Equation (3), which takes into account (ER) was defined as a set of the pixels around the road edge pixels E, which was obtained from the region and the faraway distance factor. The loss function with spatial weight is shown in Equation (4), the road edge region and the faraway distance factor. The loss function with spatial weight is shown ground truth label image using the Canny algorithm [32], as shown in Figure 6. The Manhattan which is referred to as CE-SW, and the traditional cross-entropy loss is referred to as CE in our paper. distance was adopted to calculate the distance between other pixels and edge pixels, and T ∈𝑅 was in Equation (4), which is referred to as CE-SW, and the traditional cross-entropy loss is referred to as The experiment used to contshowed rol the rethat gion the size. Then CE-SW the could weigh significantly t is defined as impr Equ ove ation ( the3performance ), which takes of into the acmodels count on CE in our paper. The experiment showed that the CE-SW could significantly improve the the road edge region and the faraway distance factor. The loss function with spatial weight is shown the occlusion-free road segmentation task. performance of the models on the occlusion-free road segmentation task. in Equation (4), which is referred to as CE-SW, and the traditional cross-entropy loss is referred to as 0 0 0 0 0 0 ER = v(i ,j ) | |i−i | + |j− j | T ,e(i, j) ∈E, v(i ,j )∈ Img , (2) CE in our paper. The experiment showed that the CE-SW could significantly improve the ER = fv(i , j ) i i + j j < T , e(i, j) 2 E, v(i , j ) 2 Img , (2) performance of the models on the occlusion-free road segmentation task. 
1, 𝑖𝑓 𝑝 (𝑖 , 𝑗 ) ∈ > 1, i f p(i, j) 2 ER ER = v(i ,j ) | |i−i | + |j− j | T ,e(i, j) ∈E, v(i ,j )∈ Img , (2) w(i, j) = | | | | , (3) ( ) w i, j = > , (3) kjii j+j j j j > 0 0 ∗2 +2, 𝑖𝑓 𝑝 (𝑖 , 𝑗 )∈ ∗ /  2 + 2, i f p(i, j) 2 ER kh+w/2 ( ) 1, 𝑖𝑓 𝑝 𝑖 , 𝑗 ∈ w(i, j) = ∗ | | | | , (3) where w and h are the width and height of the i mage, k=h/w is the rate to balance the height and where w and h are the width and height of the image, k=h/w is the rate to balance the height and width ∗2 +2, 𝑖𝑓 𝑝 (𝑖 , 𝑗 )∈ ∗ / width of the image, i and j are the pixel index, 𝑖 and 𝑗 the bottom center pixel index. of the image, i and j are the pixel index, i and j the bottom center pixel index. 0 0 where w and h are the width and height of the image, k=h/w is the rate to balance the height and Loss(y,p) =X ∑ ∑ X−𝑤(𝑖, 𝑗 )[y log 𝑝 +(1−𝑦 )log(1 − 𝑝 ))] , h   i (4) H W , , , , width of the image, i and j are the pixel index, 𝑖 and 𝑗 the bottom center pixel index. Loss(y, p) = w(i, j) y log p + 1 y ) log 1 p , (4) i, j i, j i, j i,j i j where y is the ground truth, p is the pre ∑ ∑ dict logits, i and j are the pixel index in the image. Loss(y,p) = −𝑤(𝑖, 𝑗 )[y log 𝑝 +(1−𝑦 )log(1 − 𝑝 ))] , (4) , , , , where y is the ground truth, p is the predict logits, i and j are the pixel index in the image. where y is the ground truth, p is the predict logits, i and j are the pixel index in the image. Figure 6. Visualization of the road edge region. (a) The road segmentation label; (b) road edge Figure 6. Visualization of the road edge region. (a) The road segmentation label; (b) road edge Figure 6. Visualization of the road edge region. (a) The road segmentation label; (b) road edge obtained obtained from (a) by the Canny algorithm; (c) road edge region with a width of 10 pixels. obtained from (a) by the Canny algorithm; (c) road edge region with a width of 10 pixels. from (a) by the Canny algorithm; (c) road edge region with a width of 10 pixels. 𝐸𝑅 𝐸𝑅 𝐸𝑅 𝐸𝑅 Sensors 2019, 19, 4711 8 of 15 Sensors 2019, 19, x FOR PEER REVIEW 8 of 15 4. Experiments 4. Experiments In this section, we provide qualitative and quantitative results for experiments carried out to In this section, we provide qualitative and quantitative results for experiments carried out to test test the performance of our approach. There are numerous approaches in semantic segmentation; we the performance of our approach. There are numerous approaches in semantic segmentation; we mainly compare our method to those pursuing a good tradeo between high quality and computation, mainly compare our method to those pursuing a good tradeoff between high quality and such as SegNet, ENet, and ERFNet. Moreover, to compare [22], we verified the model of inferring computation, such as SegNet, ENet, and ERFNet. Moreover, to compare [22], we verified the model occluded road boundaries by replacing the decoder part of the model with a new one that is suitable of inferring occluded road boundaries by replacing the decoder part of the model with a new one for our task. The verified model is referred to as ORBNet in our work, which retained the encoder that is suitable for our task. The verified model is referred to as ORBNet in our work, which retained and employed a decoder similar to that in the DeepLabv3+ algorithm [6]. We present quantitative the encoder and employed a decoder similar to that in the DeepLabv3+ algorithm [6]. 
We present results based on evaluations with our manually annotated dataset based on the KITTI dataset named quantitative results based on evaluations with our manually annotated dataset based on the KITTI KITTI-OFRS dataset. The presented results appear all to be based on the manual dataset annotations dataset named KITTI-OFRS dataset. The presented results appear all to be based on the manual except the qualitative results on Cityscapes dataset using predicted semantics as input. We first trained dataset annotations except the qualitative results on Cityscapes dataset using predicted semantics as the models on the proposed KITTI-OFRS dataset, and the experimental results demonstrate that the input. We first trained the models on the proposed KITTI-OFRS dataset, and the experimental results proposed approach spends less time on inference and obtains better performance. Then, we compared demonstrate that the proposed approach spends less time on inference and obtains better the performance of those models when trained with traditional cross-entropy loss function and the performance. Then, we compared the performance of those models when trained with traditional proposed cross-entrop spatially-weighted y loss function and cross-entr the p opy roposed loss function. spatially-we Mor igh eover ted cros , wes-en tested tropy lo the generalization ss function. performance Moreover, we of tes the ted models the geon nera the liza Cityscapes tion perform dataset ance of . the m Finally odel , the s operformance n the Cityscapof es da the tamodels set. Finabased lly, the performance of the models based on automatically inferred semantics was visualized to show on automatically inferred semantics was visualized to show that our network works well in the that our network works well in the real system. real system. 4.1. 4.1. Data Datasets sets There were no available datasets for the proposed occlusion-free road segmentation task, so we There were no available datasets for the proposed occlusion-free road segmentation task, so we built built our own datasets. We built a real-world dataset named KITTI-OFRS based on the public KITTI our own datasets. We built a real-world dataset named KITTI-OFRS based on the public KITTI semantic semantic segmentation benchmark, which is used for training and evaluation. Moreover, we segmentation benchmark, which is used for training and evaluation. Moreover, we qualitatively tested qualitatively tested our well-trained model on the Cityscape dataset [33] for a view of its our well-trained model on the Cityscape dataset [33] for a view of its generalization ability. generalization ability. KITTI-OFRS Dataset The real-world dataset was built on the public KITTI semantic segmentation KITTI-OFRS Dataset The real-world dataset was built on the public KITTI semantic benchmark, which is part of the KITTI dataset [34]. The KITTI dataset is the largest data collection for segmentation benchmark, which is part of the KITTI dataset [34]. The KITTI dataset is the largest data computer vision algorithms in the world’s largest autopilot scenario. The dataset is used to evaluate collection for computer vision algorithms in the world’s largest autopilot scenario. 
The dataset is used the performance of computer vision technologies and contains real-world image data collected from to evaluate the performance of computer vision technologies and contains real-world image data scenes such as urban, rural, and highways, with up to 15 vehicles and 30 pedestrians per image, collected from scenes such as urban, rural, and highways, with up to 15 vehicles and 30 pedestrians as well as varying degrees of occlusion. The KITTI semantic segmentation benchmark consists of 200 per image, as well as varying degrees of occlusion. The KITTI semantic segmentation benchmark semantically annotated train as well as 200 test samples corresponding to the KITTI Stereo and Flow consists of 200 semantically annotated train as well as 200 test samples corresponding to the KITTI Benchmark 2015. We only annotated the available 200 semantically annotated training samples for Stereo and Flow Benchmark 2015. We only annotated the available 200 semantically annotated our task and randomly split them into two parts, one contained 160 samples for training, and the training samples for our task and randomly split them into two parts, one contained 160 samples for other contained 40 samples for evaluation. We named this real-world dataset as KITTI-OFRS dataset. training, and the other contained 40 samples for evaluation. We named this real-world dataset as One sample in this dataset contained the RGB image, normal semantic labels, and occlusion-free road KITTI-OFRS dataset. One sample in this dataset contained the RGB image, normal semantic labels, segmentation labels, as demonstrated in Figure 7. and occlusion-free road segmentation labels, as demonstrated in Figure 7. Figure Figure 7. 7. An An e example xample of the of the KITTI-occlusion-fr KITTI-occlusion-free road ee roadsegmentati segmentation on (KITTI (KITTI-OFRS) -OFRS) dadataset taset sampl sample. e. (a)(a the ) the RGB im RGB image; age; ( (b) b) annotation of semantic segmentation; annotation of semantic segmentation; (c (c ) annotation ) annotation of full road ar of full road ea, white area, white denotes road. denotes road. Cityscapes Dataset The Cityscapes dataset contains 5000 images collected in street scenes from Cityscapes Dataset The Cityscapes dataset contains 5000 images collected in street scenes from 5050 di di er ffer ent ent citie citie s. sThe . The dataset dataset is is divided divided into three sub into three subsets, sets, inc including luding 29 2975 75 im images ages in the tra in the training ining set,set, 50050 images 0 imagin es i the n validation the validaset, tion and set, and 1525 images 1525 im inag the estesting in the te set.st Hin igh-quality g set. High pixel-level -quality pannotations ixel-level annotations of 19 semantic classes are provided in this dataset. We only used this dataset for the of 19 semantic classes are provided in this dataset. We only used this dataset for the generalization generalization ability test. ability test. Sensors 2019, 19, 4711 9 of 15 Classes Transformation The occlusion-free road segmentation network was designed to apply in the semantic domain. However, di erent semantic segmentation datasets may have di erent categories, and one category may have a di erent class labels in di erent datasets. 
It is obvious that some categories are not involved in occluding the road, such as sky, and some categories could be aggregated to one category to get a more compact representation, for example, car, truck, bus, train, motorcycle, and bicycle could be aggregated to vehicle. Therefore, a classes transformation layer is proposed to transform di erent semantic representations to a unify form before being fed to the occlusion-free road segmentation network. The classes transformation layer is a matrix multiplication operation, taking one-hot liked WHC encoded semantic representation of variable categories R 2 [0, 1] as input and output one-hot in WHC [ ] representation of a unify categories R 2 0, 1 . out R = R T, (5) out in 1, i f C(i) ! C ( j) < u ( ) T i, j = > , (6) 0, otherwise CC where T 2 f0, 1g is the transformation matrix, C is the set of original class labels and C the set of target class labels. C(i) ! C ( j) refers to that the i-th label in C should be set to the j-th label in C . u u The classes transformation layer could aggregate and unify labels of di erent semantic segmentation representations from di erent datasets or di erent semantic segmentation algorithms. In our work, the unified semantic representation contained 11 classes, namely road, sidewalk, building, wall, fence, pole, trac sign, vegetation, person, vehicle, and unlabeled. Data Augmentation In the training phase, the training data was augmented with random cropping and padding, flipping left to right. Moreover, to tackle the uncertainty of the semantic labels due to annotation errors, we augmented the training data by the technique of label smoothing, which is firstly proposed in InceptionV2 [35] to reduce over-fitting and increase the adaptive ability of the model. We used this method to add noise to the semantic one-hot, which could make our model more adaptive to annotation errors and prediction errors from other semantic segmentation methods. Unlike the original usage that takes a constant value for all the samples, we choose as a random value between 0.1 and 0.2 following uniform distribution, which was independent of each pixel in a training batch. LS y = y + (1 ) + /K. (7) 4.2. Evaluation Metrics For quantitative evaluation, precision (PRE), recall (REC), F1 score, average precision (ACC), and intersection-over-union (IoU) were used as the metrics within a region around the road edges within 4 pixels. The metrics acting on such a region are more powerful to test the network performance than on the whole pixels taking into account the primary task of occlusion reasoning. The metrics are calculated as in Equations (8)–(12), where TP, TN, FP, FN are, respectively, the number of true positives, true negatives, false positives, and false negatives at the pixel level. Our experiments considered an assessment that demonstrates the e ectiveness of our approach for inferring occluded road in the semantic domain. TP PRE = , (8) TP + FP TP REC = , (9) TP + FN 2PRE REC F1 = , (10) PRE + REC Sensors 2019, 19, 4711 10 of 15 TP + TN ACC = , (11) TP + FP + TN + FN TP IoU = . (12) TP + FP + FN 4.3. Implementation Details In the experiments, we implemented our architectures in PyTorch version 1.2 [36] (FaceBook, State of California, USA) with CUDA 10.0 and cuDNN back-ends. All experiments were run on a single NVIDIA GTX-1080Ti GPU. Due to GPU memory limitations, we had a maximum batch size of 4. During optimization, we used the SGD optimizer [37] with a weight decay of 0.0001 and a momentum of 0.9. 
The learning rate was set using the poly strategy with a start value of 0.01 and a power of 0.9. The edge region width T was set to 10 in the training phase and 4 in the evaluation phase. 4.4. Results and Analysis To evaluate the e ectiveness of our method on the occlusion-free road segmentation task, we trained the proposed model on the KITTI-OFRS dataset, as well as some other lightweight baseline models, such as ENet, SegNet, ERFNet, and ORBNet. The samples were resized to 384 1248 when training and testing. The quantitative and qualitative results are shown in Table 2 and Figure 8, respectively. As shown in Table 2 and Figure 8, both models achieved comparable results on the proposed task, and our method was superior to the baseline models in both accuracy and runtime. In Figure 8, red denotes false negatives; blue areas correspond to false positives, and green represents true positives. The models both performed well in the semantic domain containing more compact information of the driving environment, which indicates that the semantic and spatial information were more essential for occlusion reasoning than color and textural features. As can be seen from Figure 8, the models obtained significant results on both simple straight roads and complex intersection areas. Variable occlusion situations could be handled well, even though there were some heavy occlusion scenes. Based on the results of the proposed task, the whole road structure could be obtained and could be easily transformed into 3D world representations by an inverse perspective transformation without the a ectation of the on-road objects. Empirically, higher road detection precision may lead to a better road model for better path planning. Comparison of accuracy and computation complexity Our model achieved a significant trade-o between accuracy and eciency, which conclusion is drawn by comparing with other models. To compare the computation complexity, we employed several parameters, GFLOPs, and frames per second (FPS) as the evaluation metrics. FPS was measured on an Nvidia GTX1080Ti GPU with an input size of 384 1248 and was averaged among 100 runs. As can be seen from Table 2, our model outperformed ENet by 1.5% in the F1 score and 2.6% in the IoU while runs were only a little slower than it. Our model ran almost two times faster than ERFNet and improved 1.0% in the F1 score and 1.7% in the IoU. Compared to SegNet and ORBNet, our model got a little improvement in accuracy but achieved three times faster in the inference phase. In conclusion, our model achieved a better trade-o between accuracy and eciency. Table 2. Evaluation results of models trained with spatially-weighted cross-entropy loss (CE-SW). Model Parameters GFLOPs FPS ACC PRE REC F1 IoU ENet 0.37M 3.83 52 91.8% 92.1% 89.3% 90.7% 82.9% ERFNet 2.06M 24.43 25 92.3% 92.6% 89.7% 91.2% 83.8% SegNet 29.46M 286.03 16 92.9% 93.6% 90.2% 91.8% 84.9% ORBNet 1.91M 48.48 11.5 92.7% 93.4% 89.9% 91.6% 84.5% OFRSNet 0.39M 2.99 46 93.2% 94.2% 90.3% 92.2% 85.5% Sensors 2019, 19, 4711 11 of 15 Sensors 2019, 19, x FOR PEER REVIEW 11 of 15 Figure Figure 8 8. Qualitative . Qualitative results results o onnthe theKITTI-OFRS KITTI-OFRS data dataset. set. The The co columns lumns from from left left toto right are right ar the result e the results s of GT, ENet, ORBNet, and OFRSNet, respectively. Red denotes false negatives; blue areas correspond of GT, ENet, ORBNet, and OFRSNet, respectively. 
Comparison of loss function To evaluate the effectiveness of the proposed spatially-weighted cross-entropy loss, we trained the models both with the traditional cross-entropy loss (CE) and with the spatially-weighted cross-entropy loss (CE-SW); the evaluation results for CE and the corresponding metric degradations are shown in Table 3. When trained with CE, the models showed an obvious degradation of the metrics compared to CE-SW. The values in parentheses are the metric degradations relative to the models trained with CE-SW, which shows that the spatially-weighted cross-entropy loss is very beneficial for increasing accuracy. Intuitively, the spatially-weighted cross-entropy loss forces the models to take care of the road-edge region, where occlusion mostly occurs.

Table 3. Evaluation results of models trained with cross-entropy loss (CE). The values in parentheses are the metric degradations compared to when the models were trained with spatially-weighted cross-entropy loss (CE-SW).

Model ACC PRE REC F1 IoU
ENet 90.4% (−1.4%) 90.5% (−1.6%) 87.6% (−1.7%) 89.0% (−1.7%) 80.2% (−2.7%)
ERFNet 90.5% (−1.8%) 90.9% (−1.7%) 87.3% (−2.4%) 89.1% (−2.1%) 80.3% (−3.5%)
SegNet 92.1% (−0.8%) 92.6% (−1.0%) 89.4% (−0.8%) 91.0% (−0.8%) 83.5% (−1.4%)
ORBNet 91.5% (−1.2%) 92.2% (−1.2%) 88.4% (−1.5%) 90.2% (−1.4%) 82.2% (−2.3%)
OFRSNet 91.7% (−1.5%) 92.4% (−1.8%) 88.6% (−1.7%) 90.5% (−1.7%) 82.6% (−2.9%)
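Table 3 suggests that most of the gain comes from weighting the loss toward the road-edge band. The sketch below shows one plausible way to implement such a spatially-weighted cross-entropy in PyTorch, assuming the per-pixel weight map up-weights pixels inside a band of width T around the ground-truth road edge; the band construction and the weight value `w_edge` are our assumptions, not the exact scheme used in the paper.

```python
import torch
import torch.nn.functional as F

def spatially_weighted_ce(logits, target, edge_band, w_edge=2.0):
    """Cross-entropy with a larger per-pixel weight inside the road-edge band.

    logits:    (B, 2, H, W) scores for {background, road}.
    target:    (B, H, W) long tensor of ground-truth labels.
    edge_band: (B, H, W) bool tensor, True within T pixels of the road edge.
    w_edge:    assumed up-weighting factor for edge-band pixels.
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")      # (B, H, W)
    weights = torch.where(edge_band,
                          torch.full_like(per_pixel, w_edge),
                          torch.ones_like(per_pixel))
    return (weights * per_pixel).sum() / weights.sum()
```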
Comparison of convolution with and without context To evaluate the benefits of the context convolution block, we replaced the context convolution block with a regular convolution operation in the down-sampling and up-sampling blocks. As shown in Table 4, the model with context information outperformed the model without it by 0.6% in the F1 score and 1.0% in the IoU, which demonstrates that the context information is desirable for the proposed approach.

Table 4. Performance comparison of the model with and without context.

Model Context Parameters GFLOPs ACC PRE REC F1 IoU
OFRSNet w/o 0.34M 2.96 92.7% 92.8% 90.4% 91.6% 84.5%
OFRSNet w/ 0.39M 2.99 93.2% 94.2% 90.3% 92.2% 85.5%

Generalization on the Cityscapes dataset To further test the generalization ability of our model, we conducted qualitative experiments on the Cityscapes dataset with the model trained only on the KITTI-OFRS dataset. As can be seen from Figure 9, the well-trained model performed well on the complex real-world Cityscapes dataset, which indicates that our model obtained quite a good generalization ability on the occlusion-free road segmentation task. The generalization ability of our model benefits from inferring the occluded road in the semantic domain, which lets the model focus on learning the occlusion mechanism of the driving scene without being affected by sensing noise. In real scenes, the color and textural features at the same position may differ greatly due to different camera configurations and lighting conditions, while the semantic features share a similar distribution.
According to the results, the proposed method understood these occlusion situations well, and the occluded road area was correctly inferred under various occlusion conditions. As shown in Figure 9, the detection results captured the overall structure of the road and gave accurate segmentation despite occlusion. Moreover, it is practical to combine our method with other semantic segmentation algorithms in a real system thanks to its light weight and efficiency. As shown in Figure 10, when taking the predicted semantics obtained by the DeepLabv3+ algorithm as input, the proposed OFRSNet still predicts the occluded road areas well and outperforms ENet and ORBNet in terms of accuracy and robustness.

Figure 9. Qualitative results on the Cityscapes dataset using ground truth semantics as input. Green represents the detected full road area.

Figure 10. Qualitative results on the Cityscapes dataset using predicted semantics as input, which were obtained by the DeepLabv3+ algorithm. Green represents the detected full road area.

5. Conclusions

In this paper, we presented an occlusion-free road segmentation network to infer the occluded road area of an urban driving scenario from monocular vision. The presented model is a lightweight and efficient encoder–decoder fully convolutional architecture that contains down-sampling and up-sampling blocks combined with global contextual operations. Meanwhile, a spatially-weighted cross-entropy loss was proposed to induce the network to pay more attention to the road-edge region in the training phase. We showed the effectiveness of the model on the self-built, small but effective KITTI-OFRS dataset. Compared to other recent lightweight semantic segmentation algorithms, our network obtained a better trade-off between accuracy and runtime.
The comparisons of the models trained with different loss functions highlighted the benefits of the proposed spatially-weighted cross-entropy loss for the occlusion-reasoning road segmentation task. The generalization ability of our model was further qualitatively tested on the Cityscapes dataset, and the results clearly demonstrated our model's ability to infer the occluded road even in complex scenes. Moreover, the proposed OFRSNet can be efficiently combined with other semantic segmentation algorithms due to its small size and minimal runtime. We believe that being able to infer occluded road regions in autonomous driving systems is a key component for achieving a full comprehension of the scene and will allow better planning of the ego-vehicle trajectories.

Author Contributions: Conceptualization, K.W., F.Y., and B.Z.; Data curation, K.W. and Q.Y.; Formal analysis, K.W., B.Z., L.T., and C.L.; Funding acquisition, F.Y.; Investigation, K.W., L.T., Q.Y., and C.L.; Methodology, K.W.; Project administration, F.Y. and B.Z.; Resources, B.Z.; Software, K.W. and L.T.; Supervision, K.W.; Validation, K.W., B.Z., L.T., and C.L.; Visualization, K.W. and Q.Y.; Writing—original draft, K.W. and B.Z.; Writing—review and editing, K.W. and B.Z.

Funding: This research was funded by the National Natural Science Foundation of China (Grant No. 51975434), the Overseas Expertise Introduction Project for Discipline Innovation (Grant No. B17034), the Innovative Research Team in University of Ministry of Education of China (Grant No. IRT_17R83) and the Department of Science and Technology, the Hubei Provincial People's Government (Grant No. 2017BEC196).

Acknowledgments: Thanks for the help of the reviewers and editors.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Oliveira, G.L.; Burgard, W.; Brox, T. Efficient deep models for monocular road segmentation. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4885–4891.
2. Mendes, C.C.T.; Fremont, V.; Wolf, D.F. Exploiting Fully Convolutional Neural Networks for Fast Road Detection.
In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; Okamura, A., Menciassi, A., Ude, A., Burschka, D., Lee, D., Arrichiello, F., Liu, H., Eds.; IEEE: New York, NY, USA, 2016; pp. 3174–3179.
3. Zhang, X.; Chen, Z.; Wu, Q.M.J.; Cai, L.; Lu, D.; Li, X. Fast Semantic Segmentation for Scene Perception. IEEE Trans. Ind. Inform. 2019, 15, 1183–1192. [CrossRef]
4. Wang, B.; Fremont, V.; Rodriguez, S.A. Color-based Road Detection and its Evaluation on the KITTI Road Benchmark. In Proceedings of the 2014 IEEE Intelligent Vehicles Symposium Proceedings, Dearborn, MI, USA, 8–11 June 2014; pp. 31–36.
5. Song, X.; Rui, T.; Zhang, S.; Fei, J.; Wang, X. A road segmentation method based on the deep auto-encoder with supervised learning. Comput. Electr. Eng. 2018, 68, 381–388. [CrossRef]
6. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
7. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
8. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
9. Mano, K.; Masuzawa, H.; Miura, J.; Ardiyanto, I. Road Boundary Estimation for Mobile Robot Using Deep Learning and Particle Filter; IEEE: New York, NY, USA, 2018; pp. 1545–1550.
10. Li, K.; Shao, J.; Guo, D. A Multi-Feature Search Window Method for Road Boundary Detection Based on LIDAR Data. Sensors 2019, 19, 1551. [CrossRef]
11. Khalilullah, K.M.I.; Jindai, M.; Ota, S.; Yasuda, T. Fast Road Detection Methods on a Large Scale Dataset for Assisting Robot Navigation Using Kernel Principal Component Analysis and Deep Learning; IEEE: New York, NY, USA, 2018; pp. 798–803.
12. Son, J.; Yoo, H.; Kim, S.; Sohn, K. Real-time illumination invariant lane detection for lane departure warning system. Expert Syst. Appl. 2015, 42, 1816–1824. [CrossRef]
13. Li, Q.; Zhou, J.; Li, B.; Guo, Y.; Xiao, J. Robust Lane-Detection Method for Low-Speed Environments. Sensors 2018, 18, 4274. [CrossRef] [PubMed]
14. Cao, J.; Song, C.; Song, S.; Xiao, F.; Peng, S. Lane Detection Algorithm for Intelligent Vehicles in Complex Road Conditions and Dynamic Environments. Sensors 2019, 19, 3166. [CrossRef] [PubMed]
15. Liu, X.; Deng, Z. Segmentation of Drivable Road Using Deep Fully Convolutional Residual Network with Pyramid Pooling. Cogn. Comput. 2017, 10, 272–281. [CrossRef]
16. Cai, Y.; Li, D.; Zhou, X.; Mou, X. Robust Drivable Road Region Detection for Fixed-Route Autonomous Vehicles Using Map-Fusion Images. Sensors 2018, 18, 4158. [CrossRef] [PubMed]
17. Aly, M. Real time Detection of Lane Markers in Urban Streets. In Proceedings of the Intelligent Vehicles Symposium, Eindhoven, The Netherlands, 4–6 June 2008.
18. Laddha, A.; Kocamaz, M.K.; Navarro-Serment, L.E.; Hebert, M. Map-supervised road detection. In Proceedings of the Intelligent Vehicles Symposium, Gothenburg, Sweden, 19–22 June 2016; pp. 118–123.
19. Alvarez, J.M.; Salzmann, M.; Barnes, N. Learning Appearance Models for Road Detection.
In Proceedings of the Intelligent Vehicles Symposium, Gold Coast, QLD, Australia, 23–26 June 2013.
20. Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling. arXiv 2015, arXiv:1505.07293.
21. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
22. Suleymanov, T.; Amayo, P.; Newman, P. Inferring Road Boundaries Through and Despite Traffic. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 4–7 November 2018; pp. 409–416.
23. Becattini, F.; Berlincioni, L.; Galteri, L.; Seidenari, L.; Del Bimbo, A. Semantic Road Layout Understanding by Generative Adversarial Inpainting. arXiv 2018, arXiv:1805.11746.
24. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
25. Romera, E.; Álvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [CrossRef]
26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
27. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
28. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond. arXiv 2019, arXiv:1904.11492.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
31. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H.S. Conditional Random Fields as Recurrent Neural Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
32. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [CrossRef] [PubMed]
33. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
34. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [CrossRef]
35. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
36. PyTorch. Available online: http://pytorch.org/ (accessed on 1 September 2019).
37. Bottou, L.
Large-Scale Machine Learning with Stochastic Gradient Descent; Physica-Verlag HD: Heidelberg, Germany, 2010; pp. 177–186. © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
