Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Improving Faster R-CNN Framework for Fast Vehicle Detection

Improving Faster R-CNN Framework for Fast Vehicle Detection Hindawi Mathematical Problems in Engineering Volume 2019, Article ID 3808064, 11 pages https://doi.org/10.1155/2019/3808064 Research Article Hoanh Nguyen Faculty of Electrical Engineering Technology, Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam Correspondence should be addressed to Hoanh Nguyen; nguyenhoanh@iuh.edu.vn Received 21 August 2019; Revised 16 October 2019; Accepted 5 November 2019; Published 22 November 2019 Academic Editor: Daniel Zaldivar Copyright © 2019 Hoanh Nguyen. (is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Vision-based vehicle detection plays an important role in intelligent transportation systems. With the fast development of deep convolutional neural networks (CNNs), vision-based vehicle detection approaches have achieved significant improvements compared to traditional approaches. However, due to large vehicle scale variation, heavy occlusion, or truncation of the vehicle in an image, recent deep CNN-based object detectors still showed a limited performance. (is paper proposes an improved framework based on Faster R-CNN for fast vehicle detection. Firstly, MobileNet architecture is adopted to build the base convolution layer in Faster R-CNN. (en, NMS algorithm after the region proposal network in the original Faster R-CNN is replaced by the soft-NMS algorithm to solve the issue of duplicate proposals. Next, context-aware RoI pooling layer is adopted to adjust the proposals to the specified size without sacrificing important contextual information. Finally, the structure of depthwise separable convolution in MobileNet architecture is adopted to build the classifier at the final stage of the Faster R-CNN framework to classify proposals and adjust the bounding box for each of the detected vehicle. Experimental results on the KITTI vehicle dataset and LSVH dataset show that the proposed approach achieved better performance compared to original Faster R-CNN in both detection accuracy and inference time. More specific, the performance of the proposed method is improved comparing with the original Faster R-CNN framework by 4% on the KITTI test set and 24.5% on the LSVH test set. methods focus on modifying the base network to fit different 1. Introduction scales by applying multiscale feature maps of CNN [4] or Vision-based vehicle detection is an essential prerequisite in utilizing input images with multiple resolutions [3]. In most many intelligent transportation systems, such as advanced public test datasets, these methods show better detection driving assistance systems, autonomous driving, intelligent accuracy compared to traditional CNN-based object de- traffic management systems, and so on. Traditional methods tectors. However, these methods still need significant usually use motion and handcrafted features to detect ve- computation cost and thus are still incapable of real-time hicles from images directly. In recent years, deep con- vehicle detection. volutional neural networks (CNNs) have achieved incredible In view of the aforementioned research challenges, this success on object detection tasks as well as vehicle detection paper proposes an improved framework based on Faster R-CNN for real-time vehicle detection. First, MobileNet [1]. However, when applying CNNs to vehicle detection, real-time vehicle detection in driving environment is still architecture [5] is adopted to build the base network instead very challenging. (ese challenges come from many oc- of VGG architecture in the original Faster R-CNN frame- cluded and truncated vehicles with large vehicle scale var- work. MobileNet splits the convolution into a 3 ×3 iations in traffic images. (us, the popular CNN-based depthwise convolution and a 1 ×1 pointwise convolution, object detectors such as Faster R-CNN [2] and SSD [3] effectively reducing both computational cost and number of without modification did not achieve very good perfor- parameters. (us, the proposed framework improves both mance on vehicle detection. Many recent methods are based computation cost and inference time. In the region proposal on modifying the popular CNN-based object detectors to network, nonmaximum suppression algorithm is replaced enhance the performance of detection results. (ese by soft nonmaximum suppression algorithm [6] in order to 2 Mathematical Problems in Engineering detection performance improvement. SSD framework [18] solve the issue of heavy vehicle occlusion. Furthermore, context-aware RoI pooling [7] is used instead of RoI pooling skips the region proposal stage and directly uses multiple feature maps with different resolutions to perform object to maintain the original structures of the small objects. Finally, a classifier based on MobileNet architecture is built localization and classification. YOLOv2 [19] introduces to classify proposals and adjust the bounding box for each of improvements of batch normalization, high-resolution the proposal. (e proposed approach is evaluated on the classifier, convolutional with anchor boxes, and dimension KITTI benchmark dataset and the LSVH dataset. (e results clusters compared to original YOLO [20]. Comparing to show that the proposed approach achieved better perfor- YOLO, YOLOv2 achieves higher accuracy and higher speed. mance compared to other traditional deep CNN-based To better handle the detection problem of vehicles in object detectors. More specific, the performance of the complex conditions, Chu et al. [21] proposed a vehicle proposed method is improved comparing with the Faster detection scheme based on multitask deep CNN in which learning is trained on four tasks: category classification, R-CNN framework by 4% average with the KITTI test set and 24.5% with the LSVH test set. bounding box regression, overlap prediction, and sub- category classification. A region of interest voting scheme (is paper is organized as follows: an overview of pre- vious methods on vehicle detection is presented in Section 2. and multilevel localization are then used to further improve Section 3 describes in detail the proposed method. Section 4 detection accuracy and reliability. Experimental results on demonstrates experimental results. Finally, the conclusion is the standard test dataset showed better performance than made in Section 5. other methods. In [22], the authorsproposedthe deepmodel for vehicle detection which consists of feature extraction, deformation processing, occlusion processing, and classifier 2. Theoretical Basis training using the back propagation algorithm. Li et al. [23] In this section, this paper introduces previous methods on proposed a multivehicle detection method which consists of YOLO under the Darknet framework.Tomake the fulluse of vehicle detection, including traditional methods and re- cently proposed methods based on deep CNN. the advantages of the depth information of lidar and the obstacle classification ability of vision, Wang et al. [24] Vision-based vehicle detection system firstly locates vehicle candidate regions. (en, a classifier is constructed to proposed a real-time vehicledetection algorithm whichfuses vision and lidar point cloud information. (e experimental eliminate false vehicle candidate regions. Traditional methods can be divided into two categories: motion-based results showed that the proposed algorithm significantly methods and static appearance feature-based methods. improved the vehicle detection accuracy at different de- Motion-based methods use the motion to detect the vehicles tection difficulty levels compared to the original YOLOv3 in the image frame. Background subtraction methods [8] are algorithm, especially for the vehicles with severe occlusion. most widely used. (e backgroundremoval methods include In [25], the authors presented a two-stage detector based on Kalman filter [9], single Gaussian pixel distribution [10], Faster R-CNN for high-occluded vehicledetection. (e part- aware RPN is proposed to replace the original RPN at the Gaussian mixture model (GMM) [11], and wavelets [12]. Another method of motion feature is based on the optical first stage of the Faster R-CNN module, and the part-aware NMS is proposed to refine final results. Kim et al. [26] flow [13]. Optical flow is widely used in vehicle detection since it is less susceptible to occlusion issues. Static ap- proposed to integrate additional prediction layers into pearance feature-based methods focus on external physical conventional YOLOv3 using spatial pyramid pooling to features such as color, texture, edge, and shape. A variety of complement the detection accuracy of the vehicle for large- feature descriptors have been used in this field such as HOG scale changes or being occluded by other objects. (is ar- [14], SURF [15], Gabor [16], and Haar-like [17]. (ese chitecture showed a state-of-the-art mAP detection ratio feature descriptors are usually followed by classifiers like against the other vehicle detection approaches with rea- SVM [14], artificial neural network [16], and AdaBoost [17]. sonable run-time speed. Traditional methods showed high accuracy in limited conditions. However, with the effect of shadow, occluded 3. The Proposed Framework vehicle, complex scenarios, and environments, traditional methods showed poor performance. Figure 1 shows the overall framework of the proposed ap- Recently, deep CNN-based methods have become the proach. To differentiate from the original Faster R-CNN leading method for high-quality general object detection, framework, the proposed enhancements are highlighted by including vehicle detection. Faster region-based convolu- red boxes in Figure 1. In the first stage, MobileNet archi- tional neural network (Faster R-CNN) [2] defined a region tecture [5] is used to build the base convolution layer instead proposal network (RPN) for generating region proposals of VGG-16 architecture [27] in the original Faster R-CNN and a network using these proposals to detect objects. RPN framework. In the region proposal network, soft-NMS al- shares full-image convolutional features with the detection gorithm is used to solve the issue of heavy vehicle occlusion. network, thus enabling nearly cost-free region proposals. RoI pooling layer is then replaced by the context-aware RoI (is method has achieved state-of-the-art detection per- pooling layer to maintain the original structures of the small formance and becomes a commonly employed paradigm for vehicles. (e classifier based on MobileNet architecture is general object detection. MS-CNN [4] extends the detection built at the final stage to classify proposals into the vehicle over multiple scales of feature layers, which produce good and background and adjust the bounding box for each of the Mathematical Problems in Engineering 3 The classifier Classification The base network Context-aware Region RoI pooling proposal network Regression Fixed size Object proposals feature maps Input image Final detection Figure 1: (e overall framework of the proposed approach. Table 1: Comparison of the MobileNet model with the VGG detected vehicle. In the following section, the proposed model. approach is explained in detail. ImageNet Multiply-adds Parameters Model accuracy (%) (million) (million) 3.1. "e Base Network. (e original Faster R-CNN frame- MobileNet 70.6 569 4.2 work used VGG-16 [27] as the base network. In [18], Liu VGG-16 71.5 15300 138 et al. proved that about 80% of the forward time is spent on the base network so that using a faster base network can Table 2: (e architecture of the base network. greatly improve the speed of the whole framework. Mobi- leNet architecture [5] is an efficient network which splits the Type/stride Filter shape Input size convolution into a 3 ×3 depthwise convolution and a 1 ×1 Conv/s2 3 ×3 ×3 ×32 224 ×224 ×3 pointwise convolution, effectively reducing both computa- Conv dw/s1 3 ×3 ×32 dw 112 ×112 ×32 tional cost and number of parameters. Table 1 shows the Conv/s1 1 ×1 ×32 ×64 112 ×112 ×32 comparison of MobileNet and VGG-16 on ImageNet [28]. Conv dw/s2 3 ×3 ×64 dw 112 ×112 ×64 As shown, MobileNet is nearly as accurate as VGG-16 while Conv/s1 1 ×1 ×64 ×128 56 ×56 ×64 being 32 times smaller and 27 times less compute intensive. Conv dw/s1 3 ×3 ×128 dw 56 ×56 ×128 Conv/s1 1 ×1 ×128 ×128 56 ×56 ×128 With the purpose of real-time vehicle detection in traffic Conv dw/s2 3 ×3 ×128 dw 56 ×56 ×128 scenes, MobileNet architecture is used as the base network in Conv/s1 1 ×1 ×128 ×256 28 ×28 ×128 this study. MobileNet introduces two parameters which can Conv dw/s1 3 ×3 ×256 dw 28 ×28 ×256 be used to tune to fit the resource/accuracy trade-off, in- Conv/s1 1 ×1 ×256 ×256 28 ×28 ×256 cluding width multiplier and resolution multiplier. (e Conv dw/s2 3 ×3 ×256 dw 28 ×28 ×256 width multiplier allows us to thin the network, while the Conv/s1 1 ×1 ×256 ×512 14 ×14 ×256 resolution multiplier changes the input dimensions of the 5 ×conv dw/s1 3 ×3 ×512 dw 14 ×14 ×512 image, thus reducing the internal representation at every 5 ×conv/s1 1 ×1 ×512 ×512 14 ×14 ×512 layer. In this study, MobileNet is adopted to build the base convolutional layers in Faster R-CNN instead of VGG-16 in the original framework for fast vehicle detection. Since this of kernel size. More details about MobileNet architecture paper uses only the convolution layers in MobileNet ar- can be found in [5]. chitecture, the size of the input image does not have to be fixed. Supposing the size of the input image is 224 ×224 ×3, 3.2. Region Proposal Network (RPN). (e RPN first generates the architecture of the base network is defined, as shown in a set of anchor boxes from the convolution feature map Table 2. generated by the base network. An anchor is centered at the In Table 2, “Conv” represents as a standard convolution; sliding window and is associated with a scale and aspect “Conv dw” represents as a depthwise separable convolution; “s1” represents that the convolution stride is 1 ×1; and “s2” ratio. For the trade-off between recall and processing speed, three anchor box scales of 128, 256, and 512 and three represents that the convolution stride is 2 ×2. Depthwise separable convolution is made up of two anchor box ratios of 1:1, 1:2, and 2:1 are used for each anchor in this paper as in [2], yielding 9 anchors at each layers: depthwise convolutions and pointwise convolutions. Depthwise convolutions are used to apply a single filter per sliding position. For a convolutional feature map of a size 14 ×14,there are1,764anchors intotal,asshowninFigure2. each input channel, while pointwise convolution, a simple 1 ×1 convolution, is used to create a linear combination of (e RPN then takes all the anchor boxes and outputs two different outputs for each of the anchors. (e first one is the output of the depthwise layer. MobileNet architecture objectness score, which means the probability that an anchor uses both batch norm and ReLU nonlinearities for both is an object. (e second output is the bounding box re- layers. (e reduction of computational cost is in proportion gression for adjusting the anchors to better fit the object, as to the number of output feature map channel and the square Soft-NMS Conv layer FC layer FC layer 4 Mathematical Problems in Engineering Sliding window Convolution feature map For each position of the sliding window on convolution feature map, 9 anchor boxes with different scales and aspect ratios are created. Figure 2: Anchor boxes generated by RPN. 2k classification 4k bounding box (a) (b) scores offsets Figure 4:Detectionresults with(a)Soft-NMS and(b)NMS. Dueto heavy vehicle occlusion, NMS removed one car in detection results, Classification layer Regression layer while soft-NMS detected two cars separately. suppression (NMS) algorithm will be discussed more in detail in the following section. Figure 4 shows the example of detection results with NMS (right) and soft-NMS (left). As 256-d shown, NMS removed one car due to high overlap between two cars, while soft-NMS kept two cars separately. 3.2.1. Soft Nonmaximum Suppression Algorithm. Let P � in 􏼈p , p , p , . . . , p 􏼉 denote an initial proposal set output 1 2 3 n from the object proposal layers, in which the proposals are sorted by their objectiveness scores. For a proposal p , any Sliding window other proposal that has an overlap more than a predefined threshold Twith proposal p is called a neighbor proposal of proposal p . In this paper, the neighbor proposal threshold T Convolution feature map generated by is set to 0.5 by cross-validation. Let S denote the objec- the base network tiveness score of p , which is the maximum value in the Figure 3: (e region proposal network. classification score vector of p . For a proposal set, the proposal with the highest objectiveness score is called the winning proposal. Let p be a winning proposal and p be a i j shown in Figure 3. Using the final proposal coordinates and neighbor proposal of p . (e updated objectiveness score of their objectness score, a good set of proposals for vehicles is p (denoted by S ) is computed by the following formula [6]: j j created. Since anchors usually overlap, proposals end up also S � S 􏼐1 − O 􏼑, (1) i p ,p overlapping over the same object. Soft nonmaximum sup- i i pression (NMS) algorithm [6] is used to solve the issue of where O denotes the intersection of union (IoU) between p ,p i i duplicate proposals. In most state-of-the-art object de- proposal p and proposal p and is computed by the fol- i j tections, including Faster R-CNN, NMS algorithm is used to lowing formula: remove duplicate proposals. Traditional NMS removes any area􏼐p ∩ p 􏼑 other proposal that has an overlap more than a predefined i j O � . (2) p ,p threshold with a winning proposal. Due to heavy vehicle i i area􏼐p ∪ p 􏼑 i j occlusion in traffic scenes, traditional NMS algorithm may remove positive proposals unexpectedly (as shown in Fig- Soft-NMS algorithm is described by the flowchart in ure 4). To address the NMS issue with occluded vehicles, this Figure 5. paper adopts soft-NMS algorithm. With soft-NMS, the neighbor proposals of a winning proposal are not completely suppressed. Instead they are suppressed based on updated 3.3. Context-Aware RoI Pooling. In most two-stage object objectiveness scores of the neighbor proposals, which are detection algorithms, such as Fast R-CNN, Faster R-CNN, computed according to the overlap level of the neighbor and so on, the RoI pooling layer [29] is used to adjust the proposals and the winning proposal. Soft nonmaximum size of proposals to the fixed size. (e principle of an RoI Mathematical Problems in Engineering 5 Start P = input proposal set in Create a temporary proposal set P = P temp in Yes Check if P temp Final proposal set = P End out is empty? No Move the winning proposal in P to final proposal set P temp out Compute the updated score of the neighbor proposals of winning proposal in P based on (1) temp Update set P by removing the neighbor temp proposals of the winning proposal if their updated scores are lower than a predefined threshold T (T is set to 0.005 in this paper) s s Figure 5: (e flowchart of the soft-NMS algorithm. (b) RoI pooling process Max pooling Proposal operation Output feature map (a) e base network and the RPN network Max pooling operation Proposal Output feature map (c) CARoI pooling Deconvolution process operation Figure 6: Context-aware RoI pooling scheme. (a) Feature maps and proposals generated by the base network and the RPN. (b) Traditional RoI pooling process. (c) CARoI pooling process. pooling layer is illustrated in Figure 6(b). (e RoI pooling of approximate size (h/H) × (w/W), and then max pooling layer uses max pooling to convert the features inside any the values in each subwindow into the corresponding valid region of interest into a small feature map with a fixed output grid cell. If a proposal is smaller than H × W, it will spatial extent of H × W. RoI max pooling works by dividing be enlarged to H × W by adding replicated values to fill the the h × w RoI proposal into an H × W grid of subwindows new space. RoI pooling avoids repeatedly computing the 6 Mathematical Problems in Engineering Table 3: (e architecture of the classifier. convolutional layers, so it can significantly speed up both train and test time. However, adding replicated values to Type/stride Filter shape Input size small proposals is not appropriate, especially with small Conv dw/s2 3 ×3 ×512 dw 14 ×14 ×512 vehicles, as it may destroy the original structures of the Conv/s1 1 ×1 ×512 ×1024 7 ×7 ×512 small vehicles. Moreover, adding replicated values for small Conv dw/s2 3 ×3 ×1024 dw 7 ×7 ×1024 proposals will lead to inaccurate representations in the Conv/s1 1 ×1 ×1024 ×1024 7 ×7 ×1024 forward propagation and accumulation of errors in the Avg pool/s1 Pool 7 ×7 7 ×7 ×1024 backward propagation during the training process. (us, FC/s1 1024 ×2 1 ×1 ×1024 the performance of detecting small vehicles will be reduced. Softmax Classification RoI ×2 FC/s1 1024 ×4 1 ×1 ×1024 To adjust the size of proposals to the fixed size without Linear Regression RoI ×4 destroying the original structures of the small vehicles and enhance the performance of the proposed approach on detecting small vehicles, context-aware RoI pooling loss function of the RPN and the loss function of the (CARoI pooling) [7] is used in this paper. (e principle classifier share the same form but are optimized separately. of context-aware RoI pooling layer is illustrated in Figure 6(c). In CARoI pooling process, if the size of a 4. Results and Discussion proposal is larger than the fixed size of the output feature map, max pooling is used to reduce the size of the proposal In order to compare the effectiveness of the proposed ap- to a fixed size as in traditional RoI pooling. If the size of a proach with other state-of-the-art approaches on vehicle proposal is smaller than the fixed size of the output feature detection, this paper conducts experiments on widely used map, deconvolution operation is applied to enlarge the size public datasets: KITTI dataset [30] and LSVH dataset [7]. of the proposal to a fixed size as the following formula: (e proposed approach is implemented on a Window y � F ⊕ h , (3) k k k system machine with Intel Core i7 8700 CPU, NVIDIA GeForce GTX 1080Ti GPU and 16Gb of RAM. TensorFlow where y represents the output feature map with the fixed is adopted for implementing deep CNN frameworks, and size, F represents the input proposal, and h is the kernel of k k OpenCV library is used for real-time processing. the deconvolution operation. (e size of the kernel is equal to the ratio between the size of the output feature map and the size of the input proposal. Moreover, when the width of a 4.1. Dataset. KITTI dataset [30] is a widely used dataset for proposal is larger than the fixed width of the output feature evaluating vehicle detection algorithms. (is dataset consists map and the height of this proposal is smaller than the fixed of 7,481 images for training with available ground truth and height of the output feature map, deconvolution operation 7,518 images for testing with no available ground truth. as in (3) is adopted to enlarge the height of this proposal, and Images in this dataset include various scales of car in dif- max pooling is applied to this proposal to reduce the width ferent scenes and conditions and were divided into three of this proposal. With the CARoI pooling layer, the size of difficulty level groups: easy, moderate, and hard. If the proposals has been adjusted to the fixed size while dis- bounding box size was larger than 40 pixels, a completely criminative features from the small proposals can be still unshielded vehicle was considered to be an easy object, if the extracted. bounding box size was larger than 25 pixels but smaller than 40 pixels, a partially shielded vehicle was considered as a moderate object, and a vehicle with the bounding box size 3.4. "e Classifier. (e classifier is the final stage in the smaller than 25 pixels and an invisible vehicle that was proposed framework. After extracting features for each of difficult to see with the naked eye were considered as hard proposals via context-aware RoI pooling, these features are objects. LSVH dataset [7] contains 16 videos captured under used for classification. (e classifier has two different goals: different scenes, time, weathers, and resolutions and is di- classify proposals into the vehicle and background class and vided into two groups: sparse and crowded. A video scene adjust the bounding box for each of the detected vehicle containing more than 15 vehicles per frame on average is according to the predicted class. (e proposed classifier is considered as a crowded scene. Otherwise, it is considered as defined in Table 3. (is study uses the structure of depthwise a sparse scene. As in [7], this paper uses the eight videos in separable convolution in MobileNet architecture to build the the sparse group as the training data and the left four videos classifier. (e classifier has two fully connected (FC) layers, a in the sparse group as the testing data. box classification layer and a box regression layer. (e first FC layer is fed into the softmax layer to compute the confidence probabilities of being vehicle and background. 4.2. Evaluation Metrics. (is paper uses the average pre- (e second FC layer with linear activation functions re- cision(AP) andintersectionoverunion(IoU)metrics[31]to gresses the bounding box of the detected vehicle. All con- evaluate the performance of the proposed method in all volutional layers are followed by a batch normalization layer three difficulty level groups of the KITTI dataset and LSVH and a ReLU layer. (e loss function and the parameteri- dataset. (ese criteria have been used to assess various object zation of coordinates for bounding box regression are the detection algorithms [30, 31]. (e IoU is set to 0.7 in this same as in the original Faster R-CNN framework [2]. (e paper, which means only the overlap between the detected Mathematical Problems in Engineering 7 original Faster R-CNN. (ese results demonstrate the ef- bounding box and the ground truth bounding box greater than or equal to 70% is considered as a correct detection. fectiveness of soft-NMS on solving the issue of duplicate vehicles in driving environments. Comparing with RoI pooling in the original Faster R-CNN, context-aware RoI 4.3. Training. For the base network, this paper uses the poolingprocessdramatically improvesthe accuracywhileno MobileNet pretrained model on the ImageNet dataset [32] rd extra time is introduced (as shown in the 3 row). Par- and further fine-tuned on the KITTI dataset and the LSVH ticularly, the improvements are significant with the “mod- dataset. To accelerate training and reduce overfitting, the erate” and “hard” groups. (ese results demonstrate that the weights of each batch normalization layer in the pretrained recovered high-resolution semantic features are very useful model are frozen during the training process. (e RPN and for detecting small vehicles. (e last row in Table 4 shows the the classifier are trained by turns. First, the RPN is trained on AP results of Faster R-CNN with MobileNet architecture. As a mini-batch, and the parameters of the RPN and the base shown, MobileNet is nearly as accurate as VGG while network are updated once. (en, positive proposals and dramatically improving the inference time. More specific, negative proposals generated by the RPN are used to train Faster R-CNN with MobileNet needs 0.15 second to process and update the classifier. (e parameters of the classifier are an image, while Faster R-CNN with VGG-16 needs up to 2 updated once and the parameters of the base convolutional seconds. layers are updated once again. (e RPN and the MobileNet- Figure 7 presents some examples of detection results of based classifier share the base convolutional layers. (e loss the proposed method (shown in the left column) and the function and the parameterization of coordinates for original Faster R-CNN framework (shown in the right bounding box regression in this study are the same as those column) onthe KITTI validationset. As showninthis figure, in original Faster R-CNN. (e balancing parameter λ is set with the contribution of context-aware RoI pooling, the to1 in the loss function.(e Adamalgorithm [33] is adopted proposed approachcandetect moresmallvehicles compared for optimizing the loss functions. (e initial learning rates of to Faster R-CNN. Furthermore, with the contribution of the RPN and the classifier are set to 0.0001 with the learning soft-NMS postprocessing, the proposed method can avoid rate decay of 0.0005 per mini-batch. (e network is trained removing positive vehicle proposals unexpectedly (shown in for 200 epochs. the first row and the third row in Figure 7). 4.4. Performance Results. In this section, this study first 4.4.2. Experimental Results on the KITTI Test Set. Next, this checks the effectiveness of each enhanced module, and then study trains the proposed network with the KITTI training compares the performance of the proposed approach with set and compares the results of the proposed method with other state-of-the-art approaches on the KITTI dataset and recently published methods over the KITTI test set. Table 5 the LSVH dataset, including original Faster R-CNN [2], SSD shows the comparison of detection results on all three [18], YOLO [20], YOLOv2 [19], and MS-CNN [4]. categories of the KITTI test set. As shown from Table 5, the performance of the proposed method is improved com- paring with the Faster R-CNN framework by 2.49%, 5.92%, 4.4.1. Experimental Results on the KITTI Validation Set. and 3.6% in “easy,” “moderate,” and “hard” groups, re- Since the ground truth of the KITTI test set is not publicly available, this paper splits the KITTI training images into a spectively. Furthermore, comparing with the SSD frame- work, the proposed algorithm improves by 11.49%, 23.8%, train set and a validation set to conduct experiments as in [3], which results in 3,682 images for training and 3,799 and 18.55% in “easy,” “moderate,” and “hard” groups, respectively. For the computational efficiency, the pro- images for validation. To examine the effectiveness of each proposed enhancement, this paper conducts separate ex- posed method takes 0.15 second for processing an image, while the original Faster R-CNN framework takes up to 2 periments on each of the enhanced module and compares the resultswith the originalFaster R-CNN framework.In the seconds. (e MobileNet architecture dramatically im- proves processing speed of the proposed approach. (us, first experiment, the NMS algorithm is replaced by the soft- NMS algorithm, while other modules in original Faster the proposed approach meets the real-time detection standard and can be applied to the road driving envi- R-CNN are kept unchanged. In the second experiment, ronment of actual vehicles. Comparing the average pre- context-aware RoI pooling is adopted to replace RoI pooling process, and the NMS algorithm is kept unchanged in this cision and the processing time results in Table 5, it can be concluded that there is no absolute winner with dominant experiment. Finally, MobileNet architecture is adopted as the base network instead of VGG-16 architecture, while the performance over all the comparison aspects. Among the compared leading approaches, MS-CNN [4] ranked the NMS algorithm and RoI pooling layer are kept unchanged. Table 4 reports the AP results and the inference time of each first. However, MS-CNN has the second longest processing time (0.4 second), while the proposed approach needs only enhanced module and the original Faster R-CNN for vehicle detection over the KITTI validation set. It can be observed 0.15second. Other one-stage deep learning-based detectors (YOLO, YOLOv2, and SSD) are faster than the proposed that soft-NMS improves the performance in all groups with no extra computation time. More specific, the AP with soft- approach, but with very low accuracy. Figure 8 shows NMS increases by 1.33%, 0.6%, and 0.01% in “easy,” detection results of the proposed method on the KITTI test set (the left column). “moderate,” and “hard” groups, respectively, compared to 8 Mathematical Problems in Engineering Table 4: (e AP results and the inference time of each enhanced module and the original Faster R-CNN. Average precision Method Processing time (s) Easy (%) Moderate (%) Hard (%) Faster R-CNN [2] 87.33 86.67 76.78 2 Soft-NMS 88.66 87.27 76.79 2 Context-aware RoI pooling 88.05 90.84 80.16 2 MobileNet 86.25 86.07 76.18 0.15 (a) (b) Figure 7: Detection results of the proposed method (a) and the original Faster R-CNN framework (b) on the KITTI validation set. As shown, the proposed approach can detect more small vehicles and avoid removing positive vehicle proposals unexpectedly (shown in the first row and the third row) compared to the original Faster R-CNN. Mathematical Problems in Engineering 9 Table 5: Detection results of the proposed method and other methods on the KITTI test set. Average precision Method Processing time (s) Easy (%) Moderate (%) Hard (%) Faster R-CNN [2] 86.71 81.84 71.12 2 SSD [18] 77.71 64.06 56.17 0.06 MS-CNN [4] 90.03 89.02 76.11 0.4 YOLO [20] 47.69 35.74 29.65 0.03 YOLOv2 [19] 76.79 61.31 50.25 0.03 Proposed approach 89.20 87.86 74.72 0.15 (a) (b) Figure 8: Detection results of the proposed method on the KITTI test set (a) and the LSVH test set (b). 10 Mathematical Problems in Engineering Table 6: Comparison of the results of the proposed method and license plate, pedestrian, traffic sign, and so on. (e good other methods on the LSVH test set. performance of the proposed algorithm on vehicle de- tection has a high reference value in the field of intelligent Method Average precision (%) Processing time (s) driving. In the future, this paper will investigate more Faster R-CNN [2] 40.22 0.48 enhancements to improve detection results. MS-CNN [4] 72.66 0.35 YOLO [20] 23.78 0.03 YOLOv2 [19] 54.00 0.03 Data Availability Proposed approach 64.72 0.10 (e codes used in this paper are available from the author upon request. 4.4.3. Experimental Results on the LSVH Test Set. (is paper also compares the proposed method on another public Conflicts of Interest dataset: LSVH dataset [7]. In this paper, the eight videos in the sparse group are used for training the proposed network, (e author declares that there are no conflicts of interest and the left four videos in the sparse group are used for regarding the publication of this paper. testing the proposed network. To avoid retrieving similar images, this paper extracts one frame in every seven frames References of these videos as the training/testing images as in [7]. Table 6 shows the comparison of detection results on the [1] J. Huang, V. Rathod, C. Sun et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” 2017, https:// LSVH test set. As shown in Table 6, the performance of the arxiv.org/abs/1611.10012. proposed method is improved comparing with the Faster [2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards R-CNN framework by 24.5%. Figure 8 shows detection real-time object detection with region proposal networks,” in results of the proposed method on the LSVH test set (the Proceedings of the Advances in Neural Information Processing right column). Systems, pp. 91–99, Montreal, Canada, December 2015. [3] X. Chen, K. Kundu, Y. Zhu et al., “3d object proposals for accurate objectclass detection,” in Proceedings of the Advances 5. Conclusions in Neural Information Processing Systems 28 (NIPS 2015), Most of state-of-the-art approaches on vehicle detection Montreal, Canada, December 2015. [4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified are focused on detection accuracy. In driving environment, multi-scale deep convolutional neural network for fast object apart from the detection accuracy, the inference speed is detection,” Computer Vision—ECCV 2016, pp. 354–370, also a large concern. Moreover, vehicles are unlikely to be Springer, Berlin, Germany, 2016. equipped with high-end graphic cards as powerful as used [5] A. G. Howard, M. Zhu, B. Chen et al., “MobileNets: efficient in research environments. (us, it is necessary to build a convolutional neural networks for mobile vision applica- faster framework for vehicle detection in driving envi- tions,” 2017, https://arxiv.org/pdf/1704.04861.pdf. ronments. In this paper, an improved Faster R-CNN [6] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Improving framework for fast vehicle detection is proposed. To im- objectdetectionwith one lineofcode,”2017,https://arxiv.org/ prove the detection accuracy and the inference time in the abs/1704.04503. challenging driving environment such as large vehicle scale [7] X. Hu, X. Xu, Y. Xiao et al., “SINet: a scale-insensitive convolutional neural network for fast vehicle detection,” IEEE variation, vehicle occlusion, and bad light conditions, Transactions on Intelligent Transportation Systems, vol. 20, MobileNet architecture is first adopted to build the base no. 3, pp. 1010–1019, 2019. network of the Faster R-CNN framework. Soft-NMS al- [8] N. C. Mithun, N. U. Rashid, and S. M. M. Rahman, “Detection gorithm is used after the region proposal network to solve and classification of vehicles from video using multiple time- the issue of duplicate proposals. Context-aware RoI spatial images,” IEEE Transactions on Intelligent Trans- pooling is then used to adjust the proposals to the specified portation Systems, vol. 13, no. 3, pp. 1215–1225, 2012. size without sacrificing important contextual information. [9] S. Messelodi, C. M. Modena, and M. Zanin, “A computer Furthermore, the structure of depthwise separable con- vision system for the detection and classification of vehicles at volution in MobileNet architecture is adopted to build the urban road intersections,” Pattern Analysis and Applications, classifier at the final stage of the Faster R-CNN framework vol. 8, no. 1-2, pp. 17–31, 2005. to classify proposals and adjust the bounding box for each [10] B. T. Morris and M. M. Trivedi, “Learning, modeling, and classification of vehicle track patterns from live video,” IEEE of the proposal. (e proposed approach is evaluated on the Transactions on Intelligent Transportation Systems, vol. 9, KITTI dataset and the LSVH dataset. Compared with the no. 3, pp. 425–437, 2008. original Faster R-CNN framework, the proposed approach [11] J. Zheng, Y. Wang, N. L. Nihan, and M. E. Hallenbeck, showed better results in both detection accuracy and “Extracting roadway background image,” Transportation processing time. (e results demonstrated that the pro- Research Record: Journal of the Transportation Research posed network is simple, fast, and efficient. Moreover, Board, vol. 1944, no. 1, pp. 82–88, 2006. compared with other state-of-the-art methods on vehicle [12] T. Gao, Z.-G. Liu, W.-C. Gao, and J. Zhang, “A robust detection, the proposed framework can easily be extended technique for background subtraction in traffic video,” Ad- and applied to the detection and recognition of other types vances in Neuro-Information Processing, pp. 736–744, of objects encountered in the driving environment, such as Springer, Berlin, Germany, 2009. Mathematical Problems in Engineering 11 [13] A. Ottlik and H.-H. Nagel, “Initialization of model-based [31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and vehicle tracking in video sequences of inner-city in- A. Zisserman, “(e Pascal visual object classes (VOC) chal- tersections,” International Journal of Computer Vision, vol. 80, lenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2009. no. 2, pp. 211–225, 2008. [14] Q. Yuan, A. (angali, V. Ablavsky, and S. Sclaroff, “Learning a [32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, family of detectors via multiplicative kernels,” IEEE Trans- “ImageNet: a large-scale hierarchical image database,” in actions on Pattern Analysis and Machine Intelligence, vol. 33, Proceedings of the 2009 IEEE Conference on Computer Vision no. 3, pp. 514–530, 2011. and Pattern Recognition, pp. 248–255, Miami, FL, USA, June [15] J.-W. Hsieh, L.-C. Chen, and D.-Y. Chen, “Symmetrical SURF 2009. and its applications to vehicle detection and vehicle make and [33] D. Kingma and J. Ba, “Adam: a method for stochastic opti- model recognition,” IEEE Transactions on Intelligent Trans- mization,” 2014, https://arxiv.org/abs/1412.6980. portation Systems, vol. 15, no. 1, pp. 6–20, 2014. [16] R. M. Z. Sun and G. Bebis, “Monocular precrash vehicle detection: features and classifiers,” IEEE Trans. Image Process, vol. 15, no. 7, pp. 2019–2034, 2006. [17] W. C. Chang and C. W. Cho, “Online boosting for vehicle detection,” IEEE Transactions on Systems, Man, and Cyber- netics, Part B (Cybernetics), vol. 40, no. 3, pp. 892–902, 2010. [18] W. Liu, D. Anguelov, D. Erhan et al., “Single shot multibox detector,” in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, (e Netherlands, October 2016. [19] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” 2017, https://arxiv.org/abs/1612.08242. [20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” 2016, https:// arxiv.org/abs/1506.02640. [21] W. Chu, Y. Liu, C. Shen, D. Cai, and X.-S. Hua, “Multi-task vehicle detection with region-of-interest voting,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 432–441, [22] Y. Cai, Z. Liu, X. Sun, L. Chen, H. Wang, and Y. Zhang, “Vehicle detection based on deep dual-vehicle deformable PartModels,” Journal of Sensors,vol.2017,ArticleID5627281, 10 pages, 2017. [23] X. Li, Y. Liu, Z. Zhao, Y. Zhang, and L. He, “A deep learning approach of vehicle multitarget detection from traffic video,” Journal of Advanced Transportation, vol. 2018, Article ID 7075814, 11 pages, 2018. [24] H. Wang, X. Lou, Y. Cai, Y. Li, and L. Chen, “Real-time vehicle detection algorithm based on vision and lidar point cloud fusion,” Journal of Sensors, vol. 2019, Article ID 8473980, 9 pages, 2019. [25] W. Zhang, Y. Zheng, Q. Gao, and Z. Mi, “Part-aware region proposal for vehicle detection in high occlusion environ- ment,” IEEE Access, vol. 7, pp. 100383–100393, 2019. [26] K.-J. Kim, P.-K. Kim, Y.-S. Chung, and D.-H. Choi, “Multi- scale detector for accurate vehicle detection in traffic sur- veillance data,” IEEE Access, vol. 7, pp. 78311–78319, 2019. [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, https:// arxiv.org/abs/1409.1556. [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, Lake Tahoe, NV, USA, December [29] R. Girshick, “Fast R-CNN,” 2015, https://arxiv.org/abs/1504. [30] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for au- tonomous driving? (e KITTI vision benchmark suite,” in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, Providence, RI, USA, June 2012. The Scientific Advances in Advances in Journal of Journal of Operations Research Decision Sciences Applied Mathematics World Journal Probability and Statistics Hindawi Hindawi Hindawi Hindawi Publishing Corporation Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 http://www www.hindawi.com .hindawi.com V Volume 2018 olume 2013 www.hindawi.com Volume 2018 International Journal of Mathematics and Mathematical Sciences Journal of Optimization Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Submit your manuscripts at www.hindawi.com International Journal of International Journal of Engineering Mathematics Analysis Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Journal of A Ad dv va an nc ce es i s in n Discrete Dynamics in Mathematical Problems International Journal of Complex Analysis Num Num Num Num Num Num Num Num Num Num Num Nume e e e e e e e e e e er r r r r r r r r r r riiiiiiiiiiiic c c c c c c c c c c cal al al al al al al al al al al al A A A A A A A A A A A Anal nal nal nal nal nal nal nal nal nal nal naly y y y y y y y y y y ys s s s s s s s s s s siiiiiiiiiiiis s s s s s s s s s s s in Engineering Dierential Equations Nature and Society Hindawi Hindawi Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com V Volume 2018 olume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 International Journal of Journal of Journal of Abstract and Advances in Stochastic Analysis Mathematics Function Spaces Applied Analysis Mathematical Physics Hindawi Hindawi Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Mathematical Problems in Engineering Hindawi Publishing Corporation

Improving Faster R-CNN Framework for Fast Vehicle Detection

Mathematical Problems in Engineering , Volume 2019 – Nov 22, 2019

Loading next page...
 
/lp/hindawi-publishing-corporation/improving-faster-r-cnn-framework-for-fast-vehicle-detection-5p4dpB3Iof

References (36)

Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2019 Hoanh Nguyen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1024-123X
eISSN
1563-5147
DOI
10.1155/2019/3808064
Publisher site
See Article on Publisher Site

Abstract

Hindawi Mathematical Problems in Engineering Volume 2019, Article ID 3808064, 11 pages https://doi.org/10.1155/2019/3808064 Research Article Hoanh Nguyen Faculty of Electrical Engineering Technology, Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam Correspondence should be addressed to Hoanh Nguyen; nguyenhoanh@iuh.edu.vn Received 21 August 2019; Revised 16 October 2019; Accepted 5 November 2019; Published 22 November 2019 Academic Editor: Daniel Zaldivar Copyright © 2019 Hoanh Nguyen. (is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Vision-based vehicle detection plays an important role in intelligent transportation systems. With the fast development of deep convolutional neural networks (CNNs), vision-based vehicle detection approaches have achieved significant improvements compared to traditional approaches. However, due to large vehicle scale variation, heavy occlusion, or truncation of the vehicle in an image, recent deep CNN-based object detectors still showed a limited performance. (is paper proposes an improved framework based on Faster R-CNN for fast vehicle detection. Firstly, MobileNet architecture is adopted to build the base convolution layer in Faster R-CNN. (en, NMS algorithm after the region proposal network in the original Faster R-CNN is replaced by the soft-NMS algorithm to solve the issue of duplicate proposals. Next, context-aware RoI pooling layer is adopted to adjust the proposals to the specified size without sacrificing important contextual information. Finally, the structure of depthwise separable convolution in MobileNet architecture is adopted to build the classifier at the final stage of the Faster R-CNN framework to classify proposals and adjust the bounding box for each of the detected vehicle. Experimental results on the KITTI vehicle dataset and LSVH dataset show that the proposed approach achieved better performance compared to original Faster R-CNN in both detection accuracy and inference time. More specific, the performance of the proposed method is improved comparing with the original Faster R-CNN framework by 4% on the KITTI test set and 24.5% on the LSVH test set. methods focus on modifying the base network to fit different 1. Introduction scales by applying multiscale feature maps of CNN [4] or Vision-based vehicle detection is an essential prerequisite in utilizing input images with multiple resolutions [3]. In most many intelligent transportation systems, such as advanced public test datasets, these methods show better detection driving assistance systems, autonomous driving, intelligent accuracy compared to traditional CNN-based object de- traffic management systems, and so on. Traditional methods tectors. However, these methods still need significant usually use motion and handcrafted features to detect ve- computation cost and thus are still incapable of real-time hicles from images directly. In recent years, deep con- vehicle detection. volutional neural networks (CNNs) have achieved incredible In view of the aforementioned research challenges, this success on object detection tasks as well as vehicle detection paper proposes an improved framework based on Faster R-CNN for real-time vehicle detection. First, MobileNet [1]. However, when applying CNNs to vehicle detection, real-time vehicle detection in driving environment is still architecture [5] is adopted to build the base network instead very challenging. (ese challenges come from many oc- of VGG architecture in the original Faster R-CNN frame- cluded and truncated vehicles with large vehicle scale var- work. MobileNet splits the convolution into a 3 ×3 iations in traffic images. (us, the popular CNN-based depthwise convolution and a 1 ×1 pointwise convolution, object detectors such as Faster R-CNN [2] and SSD [3] effectively reducing both computational cost and number of without modification did not achieve very good perfor- parameters. (us, the proposed framework improves both mance on vehicle detection. Many recent methods are based computation cost and inference time. In the region proposal on modifying the popular CNN-based object detectors to network, nonmaximum suppression algorithm is replaced enhance the performance of detection results. (ese by soft nonmaximum suppression algorithm [6] in order to 2 Mathematical Problems in Engineering detection performance improvement. SSD framework [18] solve the issue of heavy vehicle occlusion. Furthermore, context-aware RoI pooling [7] is used instead of RoI pooling skips the region proposal stage and directly uses multiple feature maps with different resolutions to perform object to maintain the original structures of the small objects. Finally, a classifier based on MobileNet architecture is built localization and classification. YOLOv2 [19] introduces to classify proposals and adjust the bounding box for each of improvements of batch normalization, high-resolution the proposal. (e proposed approach is evaluated on the classifier, convolutional with anchor boxes, and dimension KITTI benchmark dataset and the LSVH dataset. (e results clusters compared to original YOLO [20]. Comparing to show that the proposed approach achieved better perfor- YOLO, YOLOv2 achieves higher accuracy and higher speed. mance compared to other traditional deep CNN-based To better handle the detection problem of vehicles in object detectors. More specific, the performance of the complex conditions, Chu et al. [21] proposed a vehicle proposed method is improved comparing with the Faster detection scheme based on multitask deep CNN in which learning is trained on four tasks: category classification, R-CNN framework by 4% average with the KITTI test set and 24.5% with the LSVH test set. bounding box regression, overlap prediction, and sub- category classification. A region of interest voting scheme (is paper is organized as follows: an overview of pre- vious methods on vehicle detection is presented in Section 2. and multilevel localization are then used to further improve Section 3 describes in detail the proposed method. Section 4 detection accuracy and reliability. Experimental results on demonstrates experimental results. Finally, the conclusion is the standard test dataset showed better performance than made in Section 5. other methods. In [22], the authorsproposedthe deepmodel for vehicle detection which consists of feature extraction, deformation processing, occlusion processing, and classifier 2. Theoretical Basis training using the back propagation algorithm. Li et al. [23] In this section, this paper introduces previous methods on proposed a multivehicle detection method which consists of YOLO under the Darknet framework.Tomake the fulluse of vehicle detection, including traditional methods and re- cently proposed methods based on deep CNN. the advantages of the depth information of lidar and the obstacle classification ability of vision, Wang et al. [24] Vision-based vehicle detection system firstly locates vehicle candidate regions. (en, a classifier is constructed to proposed a real-time vehicledetection algorithm whichfuses vision and lidar point cloud information. (e experimental eliminate false vehicle candidate regions. Traditional methods can be divided into two categories: motion-based results showed that the proposed algorithm significantly methods and static appearance feature-based methods. improved the vehicle detection accuracy at different de- Motion-based methods use the motion to detect the vehicles tection difficulty levels compared to the original YOLOv3 in the image frame. Background subtraction methods [8] are algorithm, especially for the vehicles with severe occlusion. most widely used. (e backgroundremoval methods include In [25], the authors presented a two-stage detector based on Kalman filter [9], single Gaussian pixel distribution [10], Faster R-CNN for high-occluded vehicledetection. (e part- aware RPN is proposed to replace the original RPN at the Gaussian mixture model (GMM) [11], and wavelets [12]. Another method of motion feature is based on the optical first stage of the Faster R-CNN module, and the part-aware NMS is proposed to refine final results. Kim et al. [26] flow [13]. Optical flow is widely used in vehicle detection since it is less susceptible to occlusion issues. Static ap- proposed to integrate additional prediction layers into pearance feature-based methods focus on external physical conventional YOLOv3 using spatial pyramid pooling to features such as color, texture, edge, and shape. A variety of complement the detection accuracy of the vehicle for large- feature descriptors have been used in this field such as HOG scale changes or being occluded by other objects. (is ar- [14], SURF [15], Gabor [16], and Haar-like [17]. (ese chitecture showed a state-of-the-art mAP detection ratio feature descriptors are usually followed by classifiers like against the other vehicle detection approaches with rea- SVM [14], artificial neural network [16], and AdaBoost [17]. sonable run-time speed. Traditional methods showed high accuracy in limited conditions. However, with the effect of shadow, occluded 3. The Proposed Framework vehicle, complex scenarios, and environments, traditional methods showed poor performance. Figure 1 shows the overall framework of the proposed ap- Recently, deep CNN-based methods have become the proach. To differentiate from the original Faster R-CNN leading method for high-quality general object detection, framework, the proposed enhancements are highlighted by including vehicle detection. Faster region-based convolu- red boxes in Figure 1. In the first stage, MobileNet archi- tional neural network (Faster R-CNN) [2] defined a region tecture [5] is used to build the base convolution layer instead proposal network (RPN) for generating region proposals of VGG-16 architecture [27] in the original Faster R-CNN and a network using these proposals to detect objects. RPN framework. In the region proposal network, soft-NMS al- shares full-image convolutional features with the detection gorithm is used to solve the issue of heavy vehicle occlusion. network, thus enabling nearly cost-free region proposals. RoI pooling layer is then replaced by the context-aware RoI (is method has achieved state-of-the-art detection per- pooling layer to maintain the original structures of the small formance and becomes a commonly employed paradigm for vehicles. (e classifier based on MobileNet architecture is general object detection. MS-CNN [4] extends the detection built at the final stage to classify proposals into the vehicle over multiple scales of feature layers, which produce good and background and adjust the bounding box for each of the Mathematical Problems in Engineering 3 The classifier Classification The base network Context-aware Region RoI pooling proposal network Regression Fixed size Object proposals feature maps Input image Final detection Figure 1: (e overall framework of the proposed approach. Table 1: Comparison of the MobileNet model with the VGG detected vehicle. In the following section, the proposed model. approach is explained in detail. ImageNet Multiply-adds Parameters Model accuracy (%) (million) (million) 3.1. "e Base Network. (e original Faster R-CNN frame- MobileNet 70.6 569 4.2 work used VGG-16 [27] as the base network. In [18], Liu VGG-16 71.5 15300 138 et al. proved that about 80% of the forward time is spent on the base network so that using a faster base network can Table 2: (e architecture of the base network. greatly improve the speed of the whole framework. Mobi- leNet architecture [5] is an efficient network which splits the Type/stride Filter shape Input size convolution into a 3 ×3 depthwise convolution and a 1 ×1 Conv/s2 3 ×3 ×3 ×32 224 ×224 ×3 pointwise convolution, effectively reducing both computa- Conv dw/s1 3 ×3 ×32 dw 112 ×112 ×32 tional cost and number of parameters. Table 1 shows the Conv/s1 1 ×1 ×32 ×64 112 ×112 ×32 comparison of MobileNet and VGG-16 on ImageNet [28]. Conv dw/s2 3 ×3 ×64 dw 112 ×112 ×64 As shown, MobileNet is nearly as accurate as VGG-16 while Conv/s1 1 ×1 ×64 ×128 56 ×56 ×64 being 32 times smaller and 27 times less compute intensive. Conv dw/s1 3 ×3 ×128 dw 56 ×56 ×128 Conv/s1 1 ×1 ×128 ×128 56 ×56 ×128 With the purpose of real-time vehicle detection in traffic Conv dw/s2 3 ×3 ×128 dw 56 ×56 ×128 scenes, MobileNet architecture is used as the base network in Conv/s1 1 ×1 ×128 ×256 28 ×28 ×128 this study. MobileNet introduces two parameters which can Conv dw/s1 3 ×3 ×256 dw 28 ×28 ×256 be used to tune to fit the resource/accuracy trade-off, in- Conv/s1 1 ×1 ×256 ×256 28 ×28 ×256 cluding width multiplier and resolution multiplier. (e Conv dw/s2 3 ×3 ×256 dw 28 ×28 ×256 width multiplier allows us to thin the network, while the Conv/s1 1 ×1 ×256 ×512 14 ×14 ×256 resolution multiplier changes the input dimensions of the 5 ×conv dw/s1 3 ×3 ×512 dw 14 ×14 ×512 image, thus reducing the internal representation at every 5 ×conv/s1 1 ×1 ×512 ×512 14 ×14 ×512 layer. In this study, MobileNet is adopted to build the base convolutional layers in Faster R-CNN instead of VGG-16 in the original framework for fast vehicle detection. Since this of kernel size. More details about MobileNet architecture paper uses only the convolution layers in MobileNet ar- can be found in [5]. chitecture, the size of the input image does not have to be fixed. Supposing the size of the input image is 224 ×224 ×3, 3.2. Region Proposal Network (RPN). (e RPN first generates the architecture of the base network is defined, as shown in a set of anchor boxes from the convolution feature map Table 2. generated by the base network. An anchor is centered at the In Table 2, “Conv” represents as a standard convolution; sliding window and is associated with a scale and aspect “Conv dw” represents as a depthwise separable convolution; “s1” represents that the convolution stride is 1 ×1; and “s2” ratio. For the trade-off between recall and processing speed, three anchor box scales of 128, 256, and 512 and three represents that the convolution stride is 2 ×2. Depthwise separable convolution is made up of two anchor box ratios of 1:1, 1:2, and 2:1 are used for each anchor in this paper as in [2], yielding 9 anchors at each layers: depthwise convolutions and pointwise convolutions. Depthwise convolutions are used to apply a single filter per sliding position. For a convolutional feature map of a size 14 ×14,there are1,764anchors intotal,asshowninFigure2. each input channel, while pointwise convolution, a simple 1 ×1 convolution, is used to create a linear combination of (e RPN then takes all the anchor boxes and outputs two different outputs for each of the anchors. (e first one is the output of the depthwise layer. MobileNet architecture objectness score, which means the probability that an anchor uses both batch norm and ReLU nonlinearities for both is an object. (e second output is the bounding box re- layers. (e reduction of computational cost is in proportion gression for adjusting the anchors to better fit the object, as to the number of output feature map channel and the square Soft-NMS Conv layer FC layer FC layer 4 Mathematical Problems in Engineering Sliding window Convolution feature map For each position of the sliding window on convolution feature map, 9 anchor boxes with different scales and aspect ratios are created. Figure 2: Anchor boxes generated by RPN. 2k classification 4k bounding box (a) (b) scores offsets Figure 4:Detectionresults with(a)Soft-NMS and(b)NMS. Dueto heavy vehicle occlusion, NMS removed one car in detection results, Classification layer Regression layer while soft-NMS detected two cars separately. suppression (NMS) algorithm will be discussed more in detail in the following section. Figure 4 shows the example of detection results with NMS (right) and soft-NMS (left). As 256-d shown, NMS removed one car due to high overlap between two cars, while soft-NMS kept two cars separately. 3.2.1. Soft Nonmaximum Suppression Algorithm. Let P � in 􏼈p , p , p , . . . , p 􏼉 denote an initial proposal set output 1 2 3 n from the object proposal layers, in which the proposals are sorted by their objectiveness scores. For a proposal p , any Sliding window other proposal that has an overlap more than a predefined threshold Twith proposal p is called a neighbor proposal of proposal p . In this paper, the neighbor proposal threshold T Convolution feature map generated by is set to 0.5 by cross-validation. Let S denote the objec- the base network tiveness score of p , which is the maximum value in the Figure 3: (e region proposal network. classification score vector of p . For a proposal set, the proposal with the highest objectiveness score is called the winning proposal. Let p be a winning proposal and p be a i j shown in Figure 3. Using the final proposal coordinates and neighbor proposal of p . (e updated objectiveness score of their objectness score, a good set of proposals for vehicles is p (denoted by S ) is computed by the following formula [6]: j j created. Since anchors usually overlap, proposals end up also S � S 􏼐1 − O 􏼑, (1) i p ,p overlapping over the same object. Soft nonmaximum sup- i i pression (NMS) algorithm [6] is used to solve the issue of where O denotes the intersection of union (IoU) between p ,p i i duplicate proposals. In most state-of-the-art object de- proposal p and proposal p and is computed by the fol- i j tections, including Faster R-CNN, NMS algorithm is used to lowing formula: remove duplicate proposals. Traditional NMS removes any area􏼐p ∩ p 􏼑 other proposal that has an overlap more than a predefined i j O � . (2) p ,p threshold with a winning proposal. Due to heavy vehicle i i area􏼐p ∪ p 􏼑 i j occlusion in traffic scenes, traditional NMS algorithm may remove positive proposals unexpectedly (as shown in Fig- Soft-NMS algorithm is described by the flowchart in ure 4). To address the NMS issue with occluded vehicles, this Figure 5. paper adopts soft-NMS algorithm. With soft-NMS, the neighbor proposals of a winning proposal are not completely suppressed. Instead they are suppressed based on updated 3.3. Context-Aware RoI Pooling. In most two-stage object objectiveness scores of the neighbor proposals, which are detection algorithms, such as Fast R-CNN, Faster R-CNN, computed according to the overlap level of the neighbor and so on, the RoI pooling layer [29] is used to adjust the proposals and the winning proposal. Soft nonmaximum size of proposals to the fixed size. (e principle of an RoI Mathematical Problems in Engineering 5 Start P = input proposal set in Create a temporary proposal set P = P temp in Yes Check if P temp Final proposal set = P End out is empty? No Move the winning proposal in P to final proposal set P temp out Compute the updated score of the neighbor proposals of winning proposal in P based on (1) temp Update set P by removing the neighbor temp proposals of the winning proposal if their updated scores are lower than a predefined threshold T (T is set to 0.005 in this paper) s s Figure 5: (e flowchart of the soft-NMS algorithm. (b) RoI pooling process Max pooling Proposal operation Output feature map (a) e base network and the RPN network Max pooling operation Proposal Output feature map (c) CARoI pooling Deconvolution process operation Figure 6: Context-aware RoI pooling scheme. (a) Feature maps and proposals generated by the base network and the RPN. (b) Traditional RoI pooling process. (c) CARoI pooling process. pooling layer is illustrated in Figure 6(b). (e RoI pooling of approximate size (h/H) × (w/W), and then max pooling layer uses max pooling to convert the features inside any the values in each subwindow into the corresponding valid region of interest into a small feature map with a fixed output grid cell. If a proposal is smaller than H × W, it will spatial extent of H × W. RoI max pooling works by dividing be enlarged to H × W by adding replicated values to fill the the h × w RoI proposal into an H × W grid of subwindows new space. RoI pooling avoids repeatedly computing the 6 Mathematical Problems in Engineering Table 3: (e architecture of the classifier. convolutional layers, so it can significantly speed up both train and test time. However, adding replicated values to Type/stride Filter shape Input size small proposals is not appropriate, especially with small Conv dw/s2 3 ×3 ×512 dw 14 ×14 ×512 vehicles, as it may destroy the original structures of the Conv/s1 1 ×1 ×512 ×1024 7 ×7 ×512 small vehicles. Moreover, adding replicated values for small Conv dw/s2 3 ×3 ×1024 dw 7 ×7 ×1024 proposals will lead to inaccurate representations in the Conv/s1 1 ×1 ×1024 ×1024 7 ×7 ×1024 forward propagation and accumulation of errors in the Avg pool/s1 Pool 7 ×7 7 ×7 ×1024 backward propagation during the training process. (us, FC/s1 1024 ×2 1 ×1 ×1024 the performance of detecting small vehicles will be reduced. Softmax Classification RoI ×2 FC/s1 1024 ×4 1 ×1 ×1024 To adjust the size of proposals to the fixed size without Linear Regression RoI ×4 destroying the original structures of the small vehicles and enhance the performance of the proposed approach on detecting small vehicles, context-aware RoI pooling loss function of the RPN and the loss function of the (CARoI pooling) [7] is used in this paper. (e principle classifier share the same form but are optimized separately. of context-aware RoI pooling layer is illustrated in Figure 6(c). In CARoI pooling process, if the size of a 4. Results and Discussion proposal is larger than the fixed size of the output feature map, max pooling is used to reduce the size of the proposal In order to compare the effectiveness of the proposed ap- to a fixed size as in traditional RoI pooling. If the size of a proach with other state-of-the-art approaches on vehicle proposal is smaller than the fixed size of the output feature detection, this paper conducts experiments on widely used map, deconvolution operation is applied to enlarge the size public datasets: KITTI dataset [30] and LSVH dataset [7]. of the proposal to a fixed size as the following formula: (e proposed approach is implemented on a Window y � F ⊕ h , (3) k k k system machine with Intel Core i7 8700 CPU, NVIDIA GeForce GTX 1080Ti GPU and 16Gb of RAM. TensorFlow where y represents the output feature map with the fixed is adopted for implementing deep CNN frameworks, and size, F represents the input proposal, and h is the kernel of k k OpenCV library is used for real-time processing. the deconvolution operation. (e size of the kernel is equal to the ratio between the size of the output feature map and the size of the input proposal. Moreover, when the width of a 4.1. Dataset. KITTI dataset [30] is a widely used dataset for proposal is larger than the fixed width of the output feature evaluating vehicle detection algorithms. (is dataset consists map and the height of this proposal is smaller than the fixed of 7,481 images for training with available ground truth and height of the output feature map, deconvolution operation 7,518 images for testing with no available ground truth. as in (3) is adopted to enlarge the height of this proposal, and Images in this dataset include various scales of car in dif- max pooling is applied to this proposal to reduce the width ferent scenes and conditions and were divided into three of this proposal. With the CARoI pooling layer, the size of difficulty level groups: easy, moderate, and hard. If the proposals has been adjusted to the fixed size while dis- bounding box size was larger than 40 pixels, a completely criminative features from the small proposals can be still unshielded vehicle was considered to be an easy object, if the extracted. bounding box size was larger than 25 pixels but smaller than 40 pixels, a partially shielded vehicle was considered as a moderate object, and a vehicle with the bounding box size 3.4. "e Classifier. (e classifier is the final stage in the smaller than 25 pixels and an invisible vehicle that was proposed framework. After extracting features for each of difficult to see with the naked eye were considered as hard proposals via context-aware RoI pooling, these features are objects. LSVH dataset [7] contains 16 videos captured under used for classification. (e classifier has two different goals: different scenes, time, weathers, and resolutions and is di- classify proposals into the vehicle and background class and vided into two groups: sparse and crowded. A video scene adjust the bounding box for each of the detected vehicle containing more than 15 vehicles per frame on average is according to the predicted class. (e proposed classifier is considered as a crowded scene. Otherwise, it is considered as defined in Table 3. (is study uses the structure of depthwise a sparse scene. As in [7], this paper uses the eight videos in separable convolution in MobileNet architecture to build the the sparse group as the training data and the left four videos classifier. (e classifier has two fully connected (FC) layers, a in the sparse group as the testing data. box classification layer and a box regression layer. (e first FC layer is fed into the softmax layer to compute the confidence probabilities of being vehicle and background. 4.2. Evaluation Metrics. (is paper uses the average pre- (e second FC layer with linear activation functions re- cision(AP) andintersectionoverunion(IoU)metrics[31]to gresses the bounding box of the detected vehicle. All con- evaluate the performance of the proposed method in all volutional layers are followed by a batch normalization layer three difficulty level groups of the KITTI dataset and LSVH and a ReLU layer. (e loss function and the parameteri- dataset. (ese criteria have been used to assess various object zation of coordinates for bounding box regression are the detection algorithms [30, 31]. (e IoU is set to 0.7 in this same as in the original Faster R-CNN framework [2]. (e paper, which means only the overlap between the detected Mathematical Problems in Engineering 7 original Faster R-CNN. (ese results demonstrate the ef- bounding box and the ground truth bounding box greater than or equal to 70% is considered as a correct detection. fectiveness of soft-NMS on solving the issue of duplicate vehicles in driving environments. Comparing with RoI pooling in the original Faster R-CNN, context-aware RoI 4.3. Training. For the base network, this paper uses the poolingprocessdramatically improvesthe accuracywhileno MobileNet pretrained model on the ImageNet dataset [32] rd extra time is introduced (as shown in the 3 row). Par- and further fine-tuned on the KITTI dataset and the LSVH ticularly, the improvements are significant with the “mod- dataset. To accelerate training and reduce overfitting, the erate” and “hard” groups. (ese results demonstrate that the weights of each batch normalization layer in the pretrained recovered high-resolution semantic features are very useful model are frozen during the training process. (e RPN and for detecting small vehicles. (e last row in Table 4 shows the the classifier are trained by turns. First, the RPN is trained on AP results of Faster R-CNN with MobileNet architecture. As a mini-batch, and the parameters of the RPN and the base shown, MobileNet is nearly as accurate as VGG while network are updated once. (en, positive proposals and dramatically improving the inference time. More specific, negative proposals generated by the RPN are used to train Faster R-CNN with MobileNet needs 0.15 second to process and update the classifier. (e parameters of the classifier are an image, while Faster R-CNN with VGG-16 needs up to 2 updated once and the parameters of the base convolutional seconds. layers are updated once again. (e RPN and the MobileNet- Figure 7 presents some examples of detection results of based classifier share the base convolutional layers. (e loss the proposed method (shown in the left column) and the function and the parameterization of coordinates for original Faster R-CNN framework (shown in the right bounding box regression in this study are the same as those column) onthe KITTI validationset. As showninthis figure, in original Faster R-CNN. (e balancing parameter λ is set with the contribution of context-aware RoI pooling, the to1 in the loss function.(e Adamalgorithm [33] is adopted proposed approachcandetect moresmallvehicles compared for optimizing the loss functions. (e initial learning rates of to Faster R-CNN. Furthermore, with the contribution of the RPN and the classifier are set to 0.0001 with the learning soft-NMS postprocessing, the proposed method can avoid rate decay of 0.0005 per mini-batch. (e network is trained removing positive vehicle proposals unexpectedly (shown in for 200 epochs. the first row and the third row in Figure 7). 4.4. Performance Results. In this section, this study first 4.4.2. Experimental Results on the KITTI Test Set. Next, this checks the effectiveness of each enhanced module, and then study trains the proposed network with the KITTI training compares the performance of the proposed approach with set and compares the results of the proposed method with other state-of-the-art approaches on the KITTI dataset and recently published methods over the KITTI test set. Table 5 the LSVH dataset, including original Faster R-CNN [2], SSD shows the comparison of detection results on all three [18], YOLO [20], YOLOv2 [19], and MS-CNN [4]. categories of the KITTI test set. As shown from Table 5, the performance of the proposed method is improved com- paring with the Faster R-CNN framework by 2.49%, 5.92%, 4.4.1. Experimental Results on the KITTI Validation Set. and 3.6% in “easy,” “moderate,” and “hard” groups, re- Since the ground truth of the KITTI test set is not publicly available, this paper splits the KITTI training images into a spectively. Furthermore, comparing with the SSD frame- work, the proposed algorithm improves by 11.49%, 23.8%, train set and a validation set to conduct experiments as in [3], which results in 3,682 images for training and 3,799 and 18.55% in “easy,” “moderate,” and “hard” groups, respectively. For the computational efficiency, the pro- images for validation. To examine the effectiveness of each proposed enhancement, this paper conducts separate ex- posed method takes 0.15 second for processing an image, while the original Faster R-CNN framework takes up to 2 periments on each of the enhanced module and compares the resultswith the originalFaster R-CNN framework.In the seconds. (e MobileNet architecture dramatically im- proves processing speed of the proposed approach. (us, first experiment, the NMS algorithm is replaced by the soft- NMS algorithm, while other modules in original Faster the proposed approach meets the real-time detection standard and can be applied to the road driving envi- R-CNN are kept unchanged. In the second experiment, ronment of actual vehicles. Comparing the average pre- context-aware RoI pooling is adopted to replace RoI pooling process, and the NMS algorithm is kept unchanged in this cision and the processing time results in Table 5, it can be concluded that there is no absolute winner with dominant experiment. Finally, MobileNet architecture is adopted as the base network instead of VGG-16 architecture, while the performance over all the comparison aspects. Among the compared leading approaches, MS-CNN [4] ranked the NMS algorithm and RoI pooling layer are kept unchanged. Table 4 reports the AP results and the inference time of each first. However, MS-CNN has the second longest processing time (0.4 second), while the proposed approach needs only enhanced module and the original Faster R-CNN for vehicle detection over the KITTI validation set. It can be observed 0.15second. Other one-stage deep learning-based detectors (YOLO, YOLOv2, and SSD) are faster than the proposed that soft-NMS improves the performance in all groups with no extra computation time. More specific, the AP with soft- approach, but with very low accuracy. Figure 8 shows NMS increases by 1.33%, 0.6%, and 0.01% in “easy,” detection results of the proposed method on the KITTI test set (the left column). “moderate,” and “hard” groups, respectively, compared to 8 Mathematical Problems in Engineering Table 4: (e AP results and the inference time of each enhanced module and the original Faster R-CNN. Average precision Method Processing time (s) Easy (%) Moderate (%) Hard (%) Faster R-CNN [2] 87.33 86.67 76.78 2 Soft-NMS 88.66 87.27 76.79 2 Context-aware RoI pooling 88.05 90.84 80.16 2 MobileNet 86.25 86.07 76.18 0.15 (a) (b) Figure 7: Detection results of the proposed method (a) and the original Faster R-CNN framework (b) on the KITTI validation set. As shown, the proposed approach can detect more small vehicles and avoid removing positive vehicle proposals unexpectedly (shown in the first row and the third row) compared to the original Faster R-CNN. Mathematical Problems in Engineering 9 Table 5: Detection results of the proposed method and other methods on the KITTI test set. Average precision Method Processing time (s) Easy (%) Moderate (%) Hard (%) Faster R-CNN [2] 86.71 81.84 71.12 2 SSD [18] 77.71 64.06 56.17 0.06 MS-CNN [4] 90.03 89.02 76.11 0.4 YOLO [20] 47.69 35.74 29.65 0.03 YOLOv2 [19] 76.79 61.31 50.25 0.03 Proposed approach 89.20 87.86 74.72 0.15 (a) (b) Figure 8: Detection results of the proposed method on the KITTI test set (a) and the LSVH test set (b). 10 Mathematical Problems in Engineering Table 6: Comparison of the results of the proposed method and license plate, pedestrian, traffic sign, and so on. (e good other methods on the LSVH test set. performance of the proposed algorithm on vehicle de- tection has a high reference value in the field of intelligent Method Average precision (%) Processing time (s) driving. In the future, this paper will investigate more Faster R-CNN [2] 40.22 0.48 enhancements to improve detection results. MS-CNN [4] 72.66 0.35 YOLO [20] 23.78 0.03 YOLOv2 [19] 54.00 0.03 Data Availability Proposed approach 64.72 0.10 (e codes used in this paper are available from the author upon request. 4.4.3. Experimental Results on the LSVH Test Set. (is paper also compares the proposed method on another public Conflicts of Interest dataset: LSVH dataset [7]. In this paper, the eight videos in the sparse group are used for training the proposed network, (e author declares that there are no conflicts of interest and the left four videos in the sparse group are used for regarding the publication of this paper. testing the proposed network. To avoid retrieving similar images, this paper extracts one frame in every seven frames References of these videos as the training/testing images as in [7]. Table 6 shows the comparison of detection results on the [1] J. Huang, V. Rathod, C. Sun et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” 2017, https:// LSVH test set. As shown in Table 6, the performance of the arxiv.org/abs/1611.10012. proposed method is improved comparing with the Faster [2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards R-CNN framework by 24.5%. Figure 8 shows detection real-time object detection with region proposal networks,” in results of the proposed method on the LSVH test set (the Proceedings of the Advances in Neural Information Processing right column). Systems, pp. 91–99, Montreal, Canada, December 2015. [3] X. Chen, K. Kundu, Y. Zhu et al., “3d object proposals for accurate objectclass detection,” in Proceedings of the Advances 5. Conclusions in Neural Information Processing Systems 28 (NIPS 2015), Most of state-of-the-art approaches on vehicle detection Montreal, Canada, December 2015. [4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified are focused on detection accuracy. In driving environment, multi-scale deep convolutional neural network for fast object apart from the detection accuracy, the inference speed is detection,” Computer Vision—ECCV 2016, pp. 354–370, also a large concern. Moreover, vehicles are unlikely to be Springer, Berlin, Germany, 2016. equipped with high-end graphic cards as powerful as used [5] A. G. Howard, M. Zhu, B. Chen et al., “MobileNets: efficient in research environments. (us, it is necessary to build a convolutional neural networks for mobile vision applica- faster framework for vehicle detection in driving envi- tions,” 2017, https://arxiv.org/pdf/1704.04861.pdf. ronments. In this paper, an improved Faster R-CNN [6] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Improving framework for fast vehicle detection is proposed. To im- objectdetectionwith one lineofcode,”2017,https://arxiv.org/ prove the detection accuracy and the inference time in the abs/1704.04503. challenging driving environment such as large vehicle scale [7] X. Hu, X. Xu, Y. Xiao et al., “SINet: a scale-insensitive convolutional neural network for fast vehicle detection,” IEEE variation, vehicle occlusion, and bad light conditions, Transactions on Intelligent Transportation Systems, vol. 20, MobileNet architecture is first adopted to build the base no. 3, pp. 1010–1019, 2019. network of the Faster R-CNN framework. Soft-NMS al- [8] N. C. Mithun, N. U. Rashid, and S. M. M. Rahman, “Detection gorithm is used after the region proposal network to solve and classification of vehicles from video using multiple time- the issue of duplicate proposals. Context-aware RoI spatial images,” IEEE Transactions on Intelligent Trans- pooling is then used to adjust the proposals to the specified portation Systems, vol. 13, no. 3, pp. 1215–1225, 2012. size without sacrificing important contextual information. [9] S. Messelodi, C. M. Modena, and M. Zanin, “A computer Furthermore, the structure of depthwise separable con- vision system for the detection and classification of vehicles at volution in MobileNet architecture is adopted to build the urban road intersections,” Pattern Analysis and Applications, classifier at the final stage of the Faster R-CNN framework vol. 8, no. 1-2, pp. 17–31, 2005. to classify proposals and adjust the bounding box for each [10] B. T. Morris and M. M. Trivedi, “Learning, modeling, and classification of vehicle track patterns from live video,” IEEE of the proposal. (e proposed approach is evaluated on the Transactions on Intelligent Transportation Systems, vol. 9, KITTI dataset and the LSVH dataset. Compared with the no. 3, pp. 425–437, 2008. original Faster R-CNN framework, the proposed approach [11] J. Zheng, Y. Wang, N. L. Nihan, and M. E. Hallenbeck, showed better results in both detection accuracy and “Extracting roadway background image,” Transportation processing time. (e results demonstrated that the pro- Research Record: Journal of the Transportation Research posed network is simple, fast, and efficient. Moreover, Board, vol. 1944, no. 1, pp. 82–88, 2006. compared with other state-of-the-art methods on vehicle [12] T. Gao, Z.-G. Liu, W.-C. Gao, and J. Zhang, “A robust detection, the proposed framework can easily be extended technique for background subtraction in traffic video,” Ad- and applied to the detection and recognition of other types vances in Neuro-Information Processing, pp. 736–744, of objects encountered in the driving environment, such as Springer, Berlin, Germany, 2009. Mathematical Problems in Engineering 11 [13] A. Ottlik and H.-H. Nagel, “Initialization of model-based [31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and vehicle tracking in video sequences of inner-city in- A. Zisserman, “(e Pascal visual object classes (VOC) chal- tersections,” International Journal of Computer Vision, vol. 80, lenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2009. no. 2, pp. 211–225, 2008. [14] Q. Yuan, A. (angali, V. Ablavsky, and S. Sclaroff, “Learning a [32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, family of detectors via multiplicative kernels,” IEEE Trans- “ImageNet: a large-scale hierarchical image database,” in actions on Pattern Analysis and Machine Intelligence, vol. 33, Proceedings of the 2009 IEEE Conference on Computer Vision no. 3, pp. 514–530, 2011. and Pattern Recognition, pp. 248–255, Miami, FL, USA, June [15] J.-W. Hsieh, L.-C. Chen, and D.-Y. Chen, “Symmetrical SURF 2009. and its applications to vehicle detection and vehicle make and [33] D. Kingma and J. Ba, “Adam: a method for stochastic opti- model recognition,” IEEE Transactions on Intelligent Trans- mization,” 2014, https://arxiv.org/abs/1412.6980. portation Systems, vol. 15, no. 1, pp. 6–20, 2014. [16] R. M. Z. Sun and G. Bebis, “Monocular precrash vehicle detection: features and classifiers,” IEEE Trans. Image Process, vol. 15, no. 7, pp. 2019–2034, 2006. [17] W. C. Chang and C. W. Cho, “Online boosting for vehicle detection,” IEEE Transactions on Systems, Man, and Cyber- netics, Part B (Cybernetics), vol. 40, no. 3, pp. 892–902, 2010. [18] W. Liu, D. Anguelov, D. Erhan et al., “Single shot multibox detector,” in Proceedings of the European Conference on Computer Vision, pp. 21–37, Springer, Amsterdam, (e Netherlands, October 2016. [19] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” 2017, https://arxiv.org/abs/1612.08242. [20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” 2016, https:// arxiv.org/abs/1506.02640. [21] W. Chu, Y. Liu, C. Shen, D. Cai, and X.-S. Hua, “Multi-task vehicle detection with region-of-interest voting,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 432–441, [22] Y. Cai, Z. Liu, X. Sun, L. Chen, H. Wang, and Y. Zhang, “Vehicle detection based on deep dual-vehicle deformable PartModels,” Journal of Sensors,vol.2017,ArticleID5627281, 10 pages, 2017. [23] X. Li, Y. Liu, Z. Zhao, Y. Zhang, and L. He, “A deep learning approach of vehicle multitarget detection from traffic video,” Journal of Advanced Transportation, vol. 2018, Article ID 7075814, 11 pages, 2018. [24] H. Wang, X. Lou, Y. Cai, Y. Li, and L. Chen, “Real-time vehicle detection algorithm based on vision and lidar point cloud fusion,” Journal of Sensors, vol. 2019, Article ID 8473980, 9 pages, 2019. [25] W. Zhang, Y. Zheng, Q. Gao, and Z. Mi, “Part-aware region proposal for vehicle detection in high occlusion environ- ment,” IEEE Access, vol. 7, pp. 100383–100393, 2019. [26] K.-J. Kim, P.-K. Kim, Y.-S. Chung, and D.-H. Choi, “Multi- scale detector for accurate vehicle detection in traffic sur- veillance data,” IEEE Access, vol. 7, pp. 78311–78319, 2019. [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, https:// arxiv.org/abs/1409.1556. [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, Lake Tahoe, NV, USA, December [29] R. Girshick, “Fast R-CNN,” 2015, https://arxiv.org/abs/1504. [30] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for au- tonomous driving? (e KITTI vision benchmark suite,” in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, Providence, RI, USA, June 2012. The Scientific Advances in Advances in Journal of Journal of Operations Research Decision Sciences Applied Mathematics World Journal Probability and Statistics Hindawi Hindawi Hindawi Hindawi Publishing Corporation Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 http://www www.hindawi.com .hindawi.com V Volume 2018 olume 2013 www.hindawi.com Volume 2018 International Journal of Mathematics and Mathematical Sciences Journal of Optimization Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Submit your manuscripts at www.hindawi.com International Journal of International Journal of Engineering Mathematics Analysis Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 Journal of A Ad dv va an nc ce es i s in n Discrete Dynamics in Mathematical Problems International Journal of Complex Analysis Num Num Num Num Num Num Num Num Num Num Num Nume e e e e e e e e e e er r r r r r r r r r r riiiiiiiiiiiic c c c c c c c c c c cal al al al al al al al al al al al A A A A A A A A A A A Anal nal nal nal nal nal nal nal nal nal nal naly y y y y y y y y y y ys s s s s s s s s s s siiiiiiiiiiiis s s s s s s s s s s s in Engineering Dierential Equations Nature and Society Hindawi Hindawi Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com V Volume 2018 olume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 International Journal of Journal of Journal of Abstract and Advances in Stochastic Analysis Mathematics Function Spaces Applied Analysis Mathematical Physics Hindawi Hindawi Hindawi Hindawi Hindawi www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018 www.hindawi.com Volume 2018

Journal

Mathematical Problems in EngineeringHindawi Publishing Corporation

Published: Nov 22, 2019

There are no references for this article.