Face Detection and Segmentation Based on Improved Mask R-CNN

Hindawi Discrete Dynamics in Nature and Society, Volume 2020, Article ID 9242917, 11 pages. https://doi.org/10.1155/2020/9242917

Research Article

Kaihan Lin, Huimin Zhao, Jujian Lv, Canyao Li, Xiaoyong Liu, Rongjun Chen, and Ruoyan Zhao
School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China
Correspondence should be addressed to Huimin Zhao (zhaohuimin@gpnu.edu.cn) and Jujian Lv (jujianlv@gpnu.edu.cn).
Received 18 December 2019; Accepted 11 March 2020; Published 1 May 2020. Guest Editor: Zheng Wang.
Copyright © 2020 Kaihan Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Deep convolutional neural networks have recently been applied to face detection with great success. Despite this remarkable progress, most existing detection methods only localize each face with a bounding box and cannot simultaneously segment each face from the background. To overcome this drawback, we present a face detection and segmentation method based on an improved Mask R-CNN, named G-Mask, which incorporates face detection and segmentation into one framework to obtain more fine-grained information about the face. Specifically, ResNet-101 is used to extract features, the Region Proposal Network (RPN) generates Regions of Interest (RoIs), and RoIAlign faithfully preserves exact spatial locations so that a binary mask can be generated through a Fully Convolutional Network (FCN). Furthermore, Generalized Intersection over Union (GIoU) is used as the bounding box loss function to improve detection accuracy. Compared with Faster R-CNN, Mask R-CNN, and Multitask Cascade CNN, the proposed G-Mask method achieves promising results on the FDDB, AFW, and WIDER FACE benchmarks.

1. Introduction

Face detection is a key step for subsequent face-related applications, such as face recognition [1], facial expression recognition [2], and face hallucination [3], because its quality directly affects the performance of those applications. Face detection has therefore become a research hotspot in pattern recognition and computer vision and has been widely studied over the past two decades.

A large number of approaches have been proposed for face detection. Early research [4-9] mainly focused on designing handcrafted features and used traditional machine learning algorithms to train effective classifiers for detection and recognition. Such approaches are limited in that effective feature design is complex and detection accuracy is relatively low. In recent years, face detection methods based on deep convolutional neural networks [10-13] have been widely studied; they are more robust and efficient than handcrafted-feature methods. In addition, a series of efficient object detection frameworks have been applied to face detection to improve performance [14-18], including R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21]. These methods mainly perform face detection by locating a face bounding box, which has some drawbacks: the extracted face features contain background noise, and the spatial quantization is coarse, so faces cannot be positioned accurately. These drawbacks directly affect subsequent face-related applications such as face recognition, facial expression recognition, and face alignment [22]. It is therefore necessary to study a method that both detects and segments faces.

Mask R-CNN [23], an improved object detection model based on Faster R-CNN, delivers impressive performance on various object detection and segmentation benchmarks, such as the COCO challenges [24] and the Cityscapes dataset [25]. Unlike traditional R-CNN series methods, Mask R-CNN adds a mask branch that predicts a segmentation mask on each Region of Interest (RoI), so it can fulfil both detection and segmentation tasks. To fulfil both face detection and segmentation and overcome the drawbacks of existing methods, this paper proposes a face detection and segmentation method based on an improved Mask R-CNN, named G-Mask. In particular, our scheme introduces Generalized Intersection over Union (GIoU) [26] as the loss function for bounding box regression to improve detection accuracy. The main contributions of this paper are as follows:

(1) A new dataset was created (details are described in Section 4.1), in which 5115 images randomly selected from the FDDB [27] and ChokePoint [28] datasets were annotated with masks.

(2) A face detection and segmentation method based on an improved Mask R-CNN was proposed, which detects faces correctly while also precisely segmenting each face in an image. Furthermore, the proposed method improves detection performance by introducing GIoU as the bounding box loss function. The experimental results verify that the proposed G-Mask method achieves promising performance on several mainstream benchmarks, including FDDB, AFW [29], and WIDER FACE [30].

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. The G-Mask framework for face detection and segmentation is described in detail in Section 3. Section 4 presents the experiments and a discussion of the proposed method. The last section summarizes the work and outlines directions for future research.
2. Related Work

Face detection, as one of the important research directions of computer vision, has been extensively studied in recent years. Following the development of the field, previous work can be broadly classified into handcrafted-feature-based and neural-network-based methods.

2.1. Handcrafted Feature Based Methods. With the appearance of the first real-time face detection method, Viola-Jones [4], in 2004, face detection began to be applied in practice. The well-known Viola-Jones detector performs real-time detection using Haar features and a cascaded structure, but it also has drawbacks, such as a large feature size and a low recognition rate in complex situations. To address these concerns, many new handcrafted features were proposed, such as HOG [5], SIFT [6], SURF [7], and LBP [8], which achieved outstanding results. Apart from these, one of the most significant advances was the Deformable Part Model (DPM) proposed by Felzenszwalb et al. [9]. In the DPM, the face is represented as a set of deformable parts, and improved HOG features with an SVM are used for detection, achieving remarkable performance. In general, the advantage of handcrafted features is that the models are intuitive and extensible; the disadvantage is that detection accuracy is limited on multi-objective tasks.

2.2. Neural Network Based Methods. As early as 1994, Vaillant et al. [10] first proposed using a neural network to detect faces: a Convolutional Neural Network (CNN) classifies whether each pixel is part of a face, and a second CNN then determines the location of the face. Much subsequent research built on this work. In recent years, deep learning approaches have significantly advanced computer vision, including face detection. Li et al. [11] proposed a cascaded CNN architecture for rapid face detection, a multiresolution structure that quickly eliminates background regions in the low-resolution stage and carefully evaluates challenging candidates in the final high-resolution stage. Ranjan et al. [12] proposed a deformable part model based on normalized features extracted by a deep convolutional neural network. Yang et al. [13] proposed the Convolutional Channel Features (CCF) method, which combines the advantages of filtered channel features and CNNs and has lower computational and storage costs than general end-to-end CNN methods.

Recently, witnessing the significant advances in object detection achieved by region-based methods, researchers have gradually applied the R-CNN series of methods to face detection. Qin et al. [14] proposed a joint training scheme for a CNN cascade, a Region Proposal Network (RPN), and Fast R-CNN. In [15], Jiang et al. trained a Faster R-CNN model on the WIDER dataset and verified its performance on the FDDB and IJB-A benchmarks. Sun et al. [16] improved the Faster R-CNN framework through a series of strategies such as multiscale training, hard negative mining, and feature concatenation. Wu et al. [17] proposed a multiscale face detection method based on Faster R-CNN to address the challenge of small-scale face detection. Liu et al. [18] proposed a cascaded backbone-branches fully convolutional network (BB-FCN) and used facial landmark localization results to guide R-CNN-based face detection. Neural-network-based methods have become the mainstream of face detection because of their efficiency and stability. In this work, we propose the G-Mask scheme, which achieves clear progress on the face detection task compared with the original architecture.
3. Improved Mask R-CNN

3.1. Network Architecture. The proposed method extends the Mask R-CNN [23] framework, a state-of-the-art object detection scheme that has demonstrated impressive performance on various object detection benchmarks. As shown in Figure 1, the proposed G-Mask method consists of two branches, one for face detection and the other for segmenting faces from the background. The ResNet-101 backbone extracts the facial features of the input image, and Regions of Interest (RoIs) are rapidly generated on the feature map by the Region Proposal Network (RPN). We also use RoIAlign to faithfully preserve exact spatial locations and pool the features of each RoI to a fixed size. At the end of the network, the bounding box is located and classified in the detection branch, and the corresponding face mask is generated in the segmentation branch through a Fully Convolutional Network (FCN) [31]. In the following, we introduce the key steps of our network in detail.

Figure 1: Network architecture of G-Mask (ResNet backbone, RPN, RoIAlign, fully connected layers for the class and box outputs, and a fully convolutional network for the mask output).
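To make the data flow above concrete, the following is a minimal, framework-free sketch of the forward pass. It is not the authors' code: every component name (backbone, rpn, roi_align, det_head, mask_head) is a hypothetical stand-in, and the shapes in the comments follow common Mask R-CNN conventions rather than values stated in the paper.

```python
# Hypothetical sketch of the G-Mask forward pass described in Section 3.1.
# The five callables are assumed stand-ins for the real network components.
def g_mask_forward(image, backbone, rpn, roi_align, det_head, mask_head):
    features = backbone(image)            # ResNet-101 feature map
    rois = rpn(features)                  # (N, 4) candidate face regions
    pooled = roi_align(features, rois)    # (N, h, w, C) fixed-size RoI features
    cls_scores, boxes = det_head(pooled)  # detection branch: class + box
    masks = mask_head(pooled)             # segmentation branch: per-RoI FCN mask
    return cls_scores, boxes, masks
```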
3.2. Region Proposal Network. Everyday images containing human faces generally include faces at different scales and aspect ratios. In our approach, the Region Proposal Network (RPN) therefore generates RoIs by sliding a window over the feature map with anchors of different scales and aspect ratios, as illustrated in Figure 2. The largest rectangle in the figure represents the feature map extracted by the convolutional neural network, and the dotted boxes denote the standard anchor. Assuming the standard anchor size is 64 pixels, the three dotted boxes represent anchors with aspect ratios of 1:1, 1:2, and 2:1. The dot-dash and solid boxes represent the 32- and 128-pixel anchors, respectively; each of these likewise has three aspect ratios. A traditional RPN slides these three scales and three aspect ratios over the feature map to generate RoIs. In this paper, we use 5 scales (16^2, 32^2, 64^2, 128^2, and 256^2) and 3 aspect ratios (1:1, 1:2, and 2:1), leading to 15 anchors at each location, which is more effective for detecting objects of different scales.

Figure 2: Illustration of the RPN network.
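As a sketch of how the 15 anchor shapes could be enumerated (this helper and its names are illustrative, not the authors' implementation), the following generates one box per scale/ratio pair, centered at the origin. In a full RPN these shapes are then replicated at every sliding-window position of the feature map:

```python
import numpy as np

def generate_anchor_shapes(scales=(16, 32, 64, 128, 256),
                           ratios=(1.0, 0.5, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centered at the origin, one per
    scale/aspect-ratio pair; ratio r is interpreted as height/width."""
    anchors = []
    for s in scales:
        area = float(s) * float(s)  # scale s means an s x s reference area
        for r in ratios:
            w = np.sqrt(area / r)   # keep w * h == s^2 while h / w == r
            h = w * r
            anchors.append((-w / 2.0, -h / 2.0, w / 2.0, h / 2.0))
    return np.array(anchors)

print(generate_anchor_shapes().shape)  # (15, 4): 5 scales x 3 aspect ratios
```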
3.3. RoIAlign Layer. Unlike general face detection methods, G-Mask performs a segmentation operation, which requires more refined spatial quantization during feature extraction. In traditional region-based approaches, RoIPool is the standard operation for extracting a small feature map from each RoI; it involves two quantization operations that cause misalignments between the RoI and the extracted features. For traditional detection methods this may not affect classification and localization, but for our approach it has a great impact on the prediction of pixel-accurate masks, as well as on small object detection.

In response to this problem, we introduce the RoIAlign layer, following the scheme of [23]. As shown in Figure 3, suppose the feature map is divided into 2 x 2 bins. The RoIAlign layer cancels the harsh quantization operations on the feature map and uses bilinear interpolation to preserve the floating-point coordinates, thereby avoiding misalignments between the RoI and the extracted features. The bilinear interpolation has two steps, defined as follows.

Interpolate in the x-axis direction:

f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \quad R_1 = (x, y_1),   (1)

f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}), \quad R_2 = (x, y_2).   (2)

Interpolate in the y-axis direction:

f(P) = f(x, y) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2),   (3)

where f(x, y) is the value of the sampling point P; f(Q_{11}), f(Q_{12}), f(Q_{21}), and f(Q_{22}) are the values of the four nearby grid points Q_{11} = (x_1, y_1), Q_{12} = (x_1, y_2), Q_{21} = (x_2, y_1), and Q_{22} = (x_2, y_2); and f(R_1), f(R_2) are the values obtained by interpolating in the x-axis direction.

Figure 3: Bilinear interpolation in RoIAlign, where the dashed background grid represents the feature map, the solid grid represents an RoI (with 2 x 2 bins in this example), and the dots represent the four sample points in each bin.
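The two interpolation steps of equations (1)-(3) translate directly into code. The sketch below samples a single-channel feature map at one floating-point location; it illustrates the formula, not the full RoIAlign of [23], which averages four such samples per bin. With unit grid spacing, the denominators x_2 - x_1 and y_2 - y_1 equal 1 and drop out:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Evaluate f(x, y) on a 2D feature map via equations (1)-(3)."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))  # nearest lower grid point
    x2, y2 = x1 + 1, y1 + 1
    # Values at the four nearby grid points Q11, Q21, Q12, Q22.
    q11, q21 = fmap[y1, x1], fmap[y1, x2]
    q12, q22 = fmap[y2, x1], fmap[y2, x2]
    # Equations (1) and (2): interpolate along the x-axis.
    f_r1 = (x2 - x) * q11 + (x - x1) * q21
    f_r2 = (x2 - x) * q12 + (x - x1) * q22
    # Equation (3): interpolate along the y-axis.
    return (y2 - y) * f_r1 + (y - y1) * f_r2

fmap = np.arange(16.0).reshape(4, 4)     # toy 4 x 4 feature map
print(bilinear_sample(fmap, 1.5, 2.25))  # 10.5, between grid rows 2 and 3
```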
3.4. Mask Branch. The mask branch performs the segmentation of the face object from the background in the G-Mask model; it predicts the segmentation mask in a pixel-to-pixel manner by applying a Fully Convolutional Network (FCN) [31] to each RoI. The FCN is one of the standard solutions for segmentation; it originates from the CNN but differs from a general CNN. In a traditional CNN architecture, the convolutional layers are followed by several fully connected layers in order to obtain a fixed-dimensional feature vector, so the final output is a numerical description of the input, which suits tasks such as image recognition and classification, object detection, and localization. The FCN framework is similar to a traditional CNN in that it also contains convolutional and pooling layers. In particular, however, the FCN uses deconvolution to upsample the feature map of the last convolutional layer so that the output can be restored to the original image size, and it finally uses a softmax classifier to predict the category of each pixel.
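The upsampling step can be illustrated with a minimal stride-2 transposed convolution (deconvolution). This toy single-channel version, which is not the FCN of [31] (that network learns multichannel kernels and adds skip connections), shows how each input activation "paints" a kernel-sized patch into a larger output:

```python
import numpy as np

def transposed_conv2d_stride2(x, kernel):
    """Upsample a 2D map with a stride-2 transposed convolution."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((2 * (h - 1) + kh, 2 * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            # Each input value scatters a scaled copy of the kernel.
            out[2 * i:2 * i + kh, 2 * j:2 * j + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(transposed_conv2d_stride2(x, np.ones((2, 2))).shape)  # (4, 4): 2x larger
```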
3.5. Generalized Intersection over Union. Bounding box regression, one of the fundamental components of many computer vision tasks, deserves further study [32]. However, unlike research on architectures and feature extraction strategies, which has made great progress in recent years [33], research on bounding box regression has lagged somewhat behind. Generalized Intersection over Union (GIoU) [26], the latest metric and bounding box regression method, demonstrates state-of-the-art results on various object detection benchmarks when incorporated into general object detection frameworks. The traditional IoU has two weaknesses when used as a metric or as a bounding box regression loss: (a) the IoU value is zero whenever two boxes do not overlap, which makes nonoverlapping bounding boxes difficult to optimize; and (b) the IoU value may be identical when two boxes intersect in different orientations, so the IoU does not reflect how the two boxes overlap. To overcome these drawbacks, GIoU considers not only the overlapping case but also the nonoverlapping one. The GIoU metric is illustrated in Figure 4.

Suppose B_p = (x_1^p, y_1^p, x_2^p, y_2^p) and B_g = (x_1^g, y_1^g, x_2^g, y_2^g) are the coordinates of an object's predicted bounding box and ground-truth bounding box, where x_2 > x_1 and y_2 > y_1 in both B_p and B_g; then their areas are

A_p = (x_2^p - x_1^p) \times (y_2^p - y_1^p),   (4)

A_g = (x_2^g - x_1^g) \times (y_2^g - y_1^g).   (5)

The coordinates and area of the intersection I of B_p and B_g can be calculated as

x_1^i = \max(x_1^p, x_1^g), \quad x_2^i = \min(x_2^p, x_2^g),   (6)

y_1^i = \max(y_1^p, y_1^g), \quad y_2^i = \min(y_2^p, y_2^g),   (7)

A_i = \begin{cases} (x_2^i - x_1^i) \times (y_2^i - y_1^i), & \text{if } x_2^i > x_1^i \text{ and } y_2^i > y_1^i, \\ 0, & \text{otherwise}. \end{cases}   (8)

Similarly, the smallest enclosing box B_c can be found through

x_1^c = \min(x_1^p, x_1^g), \quad x_2^c = \max(x_2^p, x_2^g),   (9)

y_1^c = \min(y_1^p, y_1^g), \quad y_2^c = \max(y_2^p, y_2^g),   (10)

and the area of B_c can be computed as

A_c = (x_2^c - x_1^c) \times (y_2^c - y_1^c).   (11)

The IoU between B_p and B_g is defined as

IoU = \frac{A_i}{A_p + A_g - A_i},   (12)

and therefore GIoU can be calculated by definition as

GIoU = IoU - \frac{A_c - (A_p + A_g - A_i)}{A_c}.   (13)

Figure 4: Illustration of the GIoU metric. The solid lines indicate the prediction box and the ground-truth box, the dotted line indicates the smallest enclosing box, and the shaded portion indicates the intersection of the prediction box and the ground-truth box.
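Equations (4)-(13) translate directly into a few lines of code. The helper below is an illustrative implementation with names of our choosing; boxes are (x1, y1, x2, y2) tuples. Note how a nonoverlapping pair still yields a nonzero, negative GIoU, which is what makes it usable as a regression loss:

```python
def giou(box_p, box_g):
    """GIoU of a predicted and a ground-truth box, per equations (4)-(13)."""
    xp1, yp1, xp2, yp2 = box_p
    xg1, yg1, xg2, yg2 = box_g
    a_p = (xp2 - xp1) * (yp2 - yp1)                  # equation (4)
    a_g = (xg2 - xg1) * (yg2 - yg1)                  # equation (5)
    xi1, xi2 = max(xp1, xg1), min(xp2, xg2)          # equation (6)
    yi1, yi2 = max(yp1, yg1), min(yp2, yg2)          # equation (7)
    a_i = max(xi2 - xi1, 0.0) * max(yi2 - yi1, 0.0)  # equation (8)
    xc1, xc2 = min(xp1, xg1), max(xp2, xg2)          # equation (9)
    yc1, yc2 = min(yp1, yg1), max(yp2, yg2)          # equation (10)
    a_c = (xc2 - xc1) * (yc2 - yc1)                  # equation (11)
    union = a_p + a_g - a_i
    iou = a_i / union                                # equation (12)
    return iou - (a_c - union) / a_c                 # equation (13)

print(giou((0, 0, 1, 1), (0, 0, 1, 1)))  # 1.0: perfect overlap
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # about -0.778: disjoint boxes
```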
3.6. Loss Function. The proposed G-Mask model consists of two stages, as in general region-based models. In the first stage, the RPN proposes candidate bounding boxes for face objects. The second stage, following the Fast R-CNN architecture, extracts features from each candidate box and then performs classification and bounding box localization. In addition, as in Mask R-CNN, we add a mask branch parallel to the classification and bounding box branches. We therefore define a multitask objective function comprising the classification loss L_{cls}, the bounding box loss L_{box}, and the segmentation loss L_{mask}. Our loss function for each image is defined as

L = L_{cls} + L_{box} + L_{mask}.   (14)

In (14), the classification loss L_{cls} and segmentation loss L_{mask} are defined as in Mask R-CNN. For the bounding box loss, several experiments showed that GIoU responds better to face detection tasks than the traditional bounding box regression method, so we adopt GIoU as the bounding box loss function. In more detail, the classification loss is defined as

L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*),   (15)

where N_{cls} is the minibatch size, i is the index of an anchor in the minibatch, and p_i is the predicted probability that anchor i is a face target. The ground-truth label p_i^* = 1 if the anchor is positive and p_i^* = 0 if it is negative. The classification loss L_{cls} of each anchor is the log loss of whether the object is a face:

L_{cls}(p_i, p_i^*) = -[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)].   (16)

For the bounding box loss, we introduce GIoU as the loss function; with the GIoU metric defined in (13), the bounding box loss is

L_{box} = 1 - GIoU.   (17)

For the segmentation loss, we adopt the average binary cross-entropy loss, defined as

L_{mask} = -\frac{1}{m^2} \sum_{1 \le i,j \le m} [y_{ij} \log \hat{y}_{ij}^k + (1 - y_{ij}) \log(1 - \hat{y}_{ij}^k)],   (18)

where y_{ij} is the label of cell (i, j) in a region of size m x m and \hat{y}_{ij}^k is the predicted value for the k-th class of this cell. L_{mask} is defined only on the mask associated with the ground-truth class k; the other mask outputs do not contribute to the loss.
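A sketch of the three loss terms, assuming the giou helper from the GIoU listing above; the clipping constant and function names are our own illustrative choices, not values from the paper:

```python
import numpy as np

EPS = 1e-7  # numerical guard for the logarithms (illustrative choice)

def classification_loss(p, p_star):
    """Equations (15)-(16): minibatch-averaged binary log loss."""
    p = np.clip(np.asarray(p, dtype=float), EPS, 1.0 - EPS)
    p_star = np.asarray(p_star, dtype=float)
    return float(np.mean(-(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))))

def box_loss(pred_boxes, gt_boxes):
    """Equation (17): L_box = 1 - GIoU, averaged over box pairs."""
    return float(np.mean([1.0 - giou(bp, bg)
                          for bp, bg in zip(pred_boxes, gt_boxes)]))

def mask_loss(y, y_hat):
    """Equation (18): average binary cross-entropy over the m x m mask
    of the ground-truth class; other class masks are ignored."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), EPS, 1.0 - EPS)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Equation (14): total multitask loss for one image.
# total = classification_loss(...) + box_loss(...) + mask_loss(...)
```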
4. Experiments

4.1. Experimental Setup. Unlike generic object detection and generic face detection, there are no off-the-shelf face datasets with mask annotations that could be used to train our model [34]. The first step of our work was therefore to create a new dataset with mask annotations. To enhance the reliability of the samples, we selected 5115 samples from the FDDB and ChokePoint datasets and annotated them with mask labels. After the annotation work, we trained the G-Mask model on this dataset.

For implementation, we adopt the Keras [35] framework to train the G-Mask model on Ubuntu 16.04, with ResNet-101 [36] as the backbone network. In the training phase, the G-Mask model is trained on the aforementioned dataset for 150,000 iterations (50 epochs of 3000 steps each) with the learning rate set to 0.001 and the weight decay set to 0.0001. We randomly sample one image per batch for training [37], resizing the short side of each image to 800 pixels and the long side to 1024 pixels. In the RPN, RoIs are generated by sliding a window over the feature map with anchors of different scales and aspect ratios; 2000 RoIs are kept after nonmaximum suppression, and an RoI is considered foreground only if its IoU with the ground truth is greater than 0.5. The testing-phase settings are the same as in training, and a region proposal is considered a face only if its confidence score is greater than 0.7. Training and testing are carried out on the same server, equipped with a Xeon E5 CPU, 128 GB of memory, and an NVIDIA GeForce GTX 1080 Ti GPU.
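The reported hyperparameters can be gathered into a single configuration block; this dictionary is a hypothetical convenience (the field names are ours), but every value comes from Section 4.1:

```python
# Training/testing hyperparameters reported in Section 4.1 (field names are
# illustrative; the values are the ones stated in the paper).
G_MASK_CONFIG = {
    "framework": "Keras",
    "os": "Ubuntu 16.04",
    "backbone": "resnet101",
    "epochs": 50,
    "steps_per_epoch": 3000,           # 50 * 3000 = 150,000 iterations
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "images_per_batch": 1,
    "image_short_side": 800,
    "image_long_side": 1024,
    "anchor_scales": (16, 32, 64, 128, 256),
    "anchor_ratios": (1.0, 0.5, 2.0),  # 1:1, 1:2, 2:1
    "rois_after_nms": 2000,
    "foreground_iou_threshold": 0.5,   # RoI counts as foreground if IoU > 0.5
    "test_confidence_threshold": 0.7,  # keep detections scoring above 0.7
}
```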
4.2. Experimental Results. The G-Mask model not only localizes the bounding box of each face target but also separates the face from the background image with a binary mask, so more detailed face information can be obtained. Comparison experiments were carried out on three popular face benchmarks: FDDB, AFW, and WIDER FACE.

The FDDB [27] dataset is a well-known face detection evaluation dataset and benchmark containing 2845 images with 5171 faces. The faces come from many different scenes, which makes the dataset quite challenging. We compared several methods on FDDB, including Faster R-CNN [15], Mask R-CNN [23], Pico [38], Viola-Jones [39], and Koestinger et al. [40]. For a fair comparison, the G-Mask, Mask R-CNN, and Faster R-CNN models were trained on the same data, namely, the dataset constructed in this work. We compared the true positive rates at 1500 false positives, and the results are shown in Figure 5. G-Mask outperforms Faster R-CNN beyond 160 false positives and outperforms Mask R-CNN beyond 280 false positives. Furthermore, our method achieves an 88.80% true positive rate at 1500 false positives, exceeding all the comparison methods. These results show that the proposed G-Mask method achieves promising performance and can segment face information while detecting effectively. Detection results of the Mask R-CNN and G-Mask models in complex FDDB scenes are shown in Figure 6; the G-Mask model clearly performs better on multiscale faces, which demonstrates its effectiveness in face detection.

Figure 5: Comparison of face detection with other methods on the FDDB benchmark (true positive rate versus false positives): Viola-Jones (0.6362), Pico (0.7393), Koestinger et al. (0.7414), Faster R-CNN (0.8083), Mask R-CNN (0.8298), and G-Mask (0.8880).

Figure 6: Different detection results of Mask R-CNN and G-Mask in complex scenes of the FDDB dataset. (a) Mask R-CNN model and (b) G-Mask model.

The AFW dataset [29], a face dataset and benchmark built from Flickr images, contains 205 images with 473 labeled faces. The precision-recall curve of our method on the AFW benchmark is shown in Figure 7; the G-Mask method achieves 95.97% average precision (AP). Although our dataset has a different label format from the AFW benchmark and the training set is only moderately sized, this result demonstrates the generalization ability of our method.

Figure 7: The precision-recall curve of our method on the AFW benchmark: DPM (AP 97.21), HeadHunter (AP 97.14), G-Mask (AP 95.97), SquaresChnFtrs-5 (AP 95.24), Structured Models (AP 95.19), Shen et al. (AP 89.03), TSM (AP 87.99), plus the commercial systems Face.com, Picasa, and Face++. Data for the other models and the evaluation code are derived from [41].

WIDER FACE [30], one of the largest and most challenging publicly available face detection datasets, has 32,203 images and 393,703 labeled faces. Large variations in face size, pose, and occlusion make the dataset very challenging, and it is divided into easy, medium, and hard subsets according to difficulty level. To further demonstrate the detection performance of our method, we trained the G-Mask model on the WIDER FACE training set and evaluated it on the validation set. The proposed method is compared with several major methods, including MSCNN [42], CMS-RCNN [43], ScaleFace [44], Multitask Cascade CNN [45], and Faceness-WIDER [46]. The precision-recall curves on the WIDER FACE benchmark are shown in Figure 8. Our method obtains 0.902 AP on the easy subset, 0.854 AP on the medium subset, and 0.662 AP on the hard subset, which exceeds most of the comparison methods. Compared with the state-of-the-art MSCNN method, the AP of the proposed method is only 0.014 lower on the easy subset and 0.049 lower on the medium subset. A larger gap remains between G-Mask and MSCNN on the hard subset, probably because MSCNN uses a series of strategies for small-scale face detection and can thus handle more challenging cases. Nevertheless, the G-Mask method still achieves promising performance, which demonstrates its effectiveness.

Figure 8: The precision-recall curves on the WIDER FACE benchmark. (a) Easy subset: MSCNN (0.916), G-Mask (0.902), CMS-RCNN (0.899), ScaleFace (0.868), Multitask Cascade CNN (0.848), LDCF+ (0.790), Faceness-WIDER (0.713), Multiscale Cascade CNN (0.691), Two-stage CNN (0.681), ACF-WIDER (0.659). (b) Medium subset: MSCNN (0.903), CMS-RCNN (0.874), ScaleFace (0.867), G-Mask (0.854), Multitask Cascade CNN (0.825), LDCF+ (0.769), Multiscale Cascade CNN (0.664), Faceness-WIDER (0.634), Two-stage CNN (0.618), ACF-WIDER (0.541). (c) Hard subset: MSCNN (0.802), ScaleFace (0.772), G-Mask (0.662), CMS-RCNN (0.624), Multitask Cascade CNN (0.598), LDCF+ (0.522), Multiscale Cascade CNN (0.424), Faceness-WIDER (0.345), Two-stage CNN (0.323), ACF-WIDER (0.273).

We further show more qualitative results of the G-Mask method in Figure 9. The proposed method detects faces correctly while also precisely segmenting each face in the image.

Figure 9: More results of the G-Mask method.

We also compared the running time of different region-based methods on the FDDB, AFW, and ChokePoint datasets. The WIDER FACE dataset was not used for timing because running times on its hard and easy subsets differ considerably. We randomly selected 100 images from each of the above datasets and measured the average time per image; the results are reported in Table 1. Faster R-CNN has the shortest running time because of its relatively simple structure, while the proposed method runs in a time similar to Mask R-CNN. Compared with Faster R-CNN, G-Mask adds a segmentation branch, which increases the computational complexity. Nevertheless, the G-Mask method achieves higher accuracy with competitive time consumption compared with the other region-based methods, and it also obtains more detailed face information through the segmentation branch while localizing accurately.

Table 1: Running time of different region-based methods (seconds per image).

Method          FDDB    AFW     ChokePoint
R-CNN           14.75   15.32   14.51
Fast R-CNN       3.12    3.08    2.84
Faster R-CNN     0.30    0.32    0.28
Mask R-CNN       0.32    0.35    0.33
G-Mask           0.35    0.42    0.33
5. Conclusions

In this paper, the G-Mask method was proposed for face detection and segmentation. The approach extracts features with ResNet-101, generates RoIs with an RPN, preserves precise spatial positions with RoIAlign, and generates binary masks through a fully convolutional network (FCN). In doing so, the proposed framework is able to detect faces correctly while also precisely segmenting each face in an image. Experimental results on a self-built face dataset as well as publicly available datasets verify that the proposed G-Mask method achieves promising performance. In future work, we will consider improving the speed of the proposed method.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partly supported by the Innovation Team Project of the Education Department of Guangdong Province (2017KCXTD021), the Key Laboratory of the Education Department of Guangdong Province (2019KSYS009), the Foundation for Youth Innovation Talents in Higher Education of Guangdong Province (2018KQNCX139), the Project for Distinctive Innovation of Ordinary Universities of Guangdong Province (2018KTSCX120), and the Ph.D. Start-Up Fund of the Natural Science Foundation of Guangdong Province (2016A030310335).
References

[1] J. Deng, J. Guo, N. Xue et al., "ArcFace: additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690-4699, Long Beach, CA, USA, June 2019.
[2] N. Zeng, H. Zhang, B. Song et al., "Facial expression recognition via learning deep sparse autoencoders," Neurocomputing, vol. 273, pp. 643-649, 2018.
[3] Y. Shi, G. Li, Q. Cao et al., "Face hallucination by attentive sequence optimization with reinforcement learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[5] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886-893, San Diego, CA, USA, June 2005.
[6] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[8] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester et al., "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2009.
[10] R. Vaillant, C. Monrocq, and Y. Le Cun, "Original approach for the localisation of objects in images," IEE Proceedings - Vision, Image, and Signal Processing, vol. 141, no. 4, pp. 245-250, 1994.
[11] H. Li, Z. Lin, X. Shen et al., "A convolutional neural network cascade for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5325-5334, Boston, MA, USA, June 2015.
[12] R. Ranjan, V. M. Patel, and R. Chellappa, "HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 121-135, 2017.
[13] B. Yang, J. Yan, Z. Lei et al., "Convolutional channel features," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 82-90, Santiago, Chile, December 2015.
[14] H. Qin, J. Yan, X. Li et al., "Joint training of cascaded CNN for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3456-3465, Las Vegas, NV, USA, July 2016.
[15] H. Jiang and E. Learned-Miller, "Face detection with the Faster R-CNN," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 650-657, Washington, DC, USA, June 2017.
[16] X. Sun, P. Wu, and S. C. H. Hoi, "Face detection using deep learning: an improved Faster RCNN approach," Neurocomputing, vol. 299, no. 1, pp. 42-50, 2018.
[17] W. Wu, Y. Yin, X. Wang, and D. Xu, "Face detection with different scales based on Faster R-CNN," IEEE Transactions on Cybernetics, vol. 49, no. 11, pp. 4017-4028, 2019.
[18] L. Liu, G. Li, Y. Xie et al., "Facial landmark machines: a backbone-branches architecture with progressive representation learning," IEEE Transactions on Multimedia, vol. 21, no. 9, 2019.
[19] R. Girshick, J. Donahue, T. Darrell et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580-587, Columbus, OH, USA, June 2014.
[20] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440-1448, Santiago, Chile, December 2015.
[21] S. Ren, K. He, R. Girshick et al., "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[22] W. Wu, C. Qian, S. Yang et al., "Look at boundary: a boundary-aware face alignment algorithm," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2129-2138, Salt Lake City, UT, USA, June 2018.
[23] K. He, G. Gkioxari, P. Dollar et al., "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961-2969, Venice, Italy, October 2017.
[24] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Computer Vision - ECCV 2014, pp. 740-755, Springer, Berlin, Germany, 2014.
[25] M. Cordts, M. Omran, S. Ramos et al., "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213-3223, Las Vegas, NV, USA, July 2016.
[26] H. Rezatofighi, N. Tsoi, J. Y. Gwak et al., "Generalized intersection over union: a metric and a loss for bounding box regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 658-666, Long Beach, CA, USA, June 2019.
[27] V. Jain and E. Learned-Miller, "FDDB: a benchmark for face detection in unconstrained settings," Technical Report UM-CS-2010-009, 2010.
[28] Y. Wong, S. Chen, S. Mau et al., "Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 74-81, Colorado Springs, CO, USA, June 2011.
[29] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2879-2886, Providence, RI, USA, June 2012.
[30] S. Yang, P. Luo, C. C. Loy et al., "WIDER FACE: a face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5525-5533, Las Vegas, NV, USA, June 2016.
[31] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, Boston, MA, USA, June 2015.
[32] J. Ren, A. Hussain, J. Han, and X. Jia, "Cognitive modelling and learning for multimedia mining and understanding," Cognitive Computation, vol. 11, no. 6, pp. 761-762, 2019.
[33] J. Tschannerl, J. Ren, P. Yuen et al., "MIMR-DGSA: unsupervised hyperspectral band selection based on information theory and a modified discrete gravitational search algorithm," Information Fusion, vol. 51, pp. 189-200, 2019.
[34] K. Lin, H. Zhao, J. Lv et al., "Face detection and segmentation with generalized intersection over union based on Mask R-CNN," in Proceedings of the International Conference on Brain Inspired Cognitive Systems, pp. 106-116, Guangzhou, China, July 2019.
[35] F. Chollet, "Keras," GitHub repository, 2015, https://github.com/fchollet/keras.
[36] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, Las Vegas, NV, USA, July 2016.
[37] P. Wan, C. Wu, Y. Lin et al., "Driving anger states detection based on incremental association Markov blanket and least square support vector machine," Discrete Dynamics in Nature and Society, vol. 2019, Article ID 2745381, 17 pages, 2019.
[38] N. Markuš, M. Frljak, I. S. Pandžić et al., "A method for object detection based on pixel intensity comparisons organized in decision trees," 2013, https://arxiv.org/abs/1305.4537.
[39] D. Hefenbrock, J. Oberg, N. T. N. Thanh et al., "Accelerating Viola-Jones face detection to FPGA-level using GPUs," in Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 11-18, Charlotte, NC, USA, May 2010.
[40] M. Köstinger, P. Wohlhart, P. M. Roth et al., "Robust face detection by simple means," in Proceedings of the DAGM 2012 CVAW Workshop, Graz, Austria, August 2012.
[41] M. Mathias, R. Benenson, M. Pedersoli et al., "Face detection without bells and whistles," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 720-735, Zurich, Switzerland, September 2014.
[42] Z. Cai, Q. Fan, R. S. Feris et al., "A unified multi-scale deep convolutional neural network for fast object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 354-370, Amsterdam, Netherlands, October 2016.
[43] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, "CMS-RCNN: contextual multi-scale region-based CNN for unconstrained face detection," in Deep Learning for Biometrics, pp. 57-79, Springer, Berlin, Germany, 2017.
[44] S. Yang, Y. Xiong, C. C. Loy et al., "Face detection through scale-friendly deep convolutional networks," 2017, https://arxiv.org/abs/1706.02863.
[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, 2016.
[46] S. Yang, P. Luo, C. C. Loy et al., "Faceness-Net: face detection through deep facial part responses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1845-1859, 2017.


Abstract

Hindawi Discrete Dynamics in Nature and Society Volume 2020, Article ID 9242917, 11 pages https://doi.org/10.1155/2020/9242917 Research Article Face Detection and Segmentation Based on Improved Mask R-CNN Kaihan Lin , Huimin Zhao , Jujian Lv , Canyao Li, Xiaoyong Liu, Rongjun Chen, and Ruoyan Zhao School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China Correspondence should be addressed to Huimin Zhao; zhaohuimin@gpnu.edu.cn and Jujian Lv; jujianlv@gpnu.edu.cn Received 18 December 2019; Accepted 11 March 2020; Published 1 May 2020 Guest Editor: Zheng Wang Copyright © 2020 Kaihan Lin et al. -is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Deep convolutional neural networks have been successfully applied to face detection recently. Despite making remarkable progress, most of the existing detection methods only localize each face using a bounding box, which cannot segment each face from the background image simultaneously. To overcome this drawback, we present a face detection and segmentation method based on improved Mask R-CNN, named G-Mask, which incorporates face detection and segmentation into one framework aiming to obtain more fine-grained information of face. Specifically, in this proposed method, ResNet-101 is utilized to extract features, RPN is used to generate RoIs, and RoIAlign faithfully preserves the exact spatial locations to generate binary mask through Fully Convolution Network (FCN). Furthermore, Generalized Intersection over Union (GIoU) is used as the bounding box loss function to improve the detection accuracy. Compared with Faster R-CNN, Mask R-CNN, and Multitask Cascade CNN, the proposed G-Mask method has achieved promising results on FDDB, AFW, and WIDER FACE benchmarks. face detection to improve detection performance [14–18], 1. Introduction including R-CNN [19], Fast R-CNN [20], and Faster R-CNN Face detection is a key link of subsequent face-related ap- [21]. -ese methods mainly implement face detection and plications, such as face recognition [1], facial expression the location of the face bounding box, which may have some recognition [2], and face hallucination [3], because its effect drawbacks such as the extracted face features have back- directly affects the subsequent applications performance. ground noise, spatial quantization is rough and cannot be -erefore, face detection has become a research hotspot in accurately positioned. -ese drawbacks will directly affect the field of pattern recognition and computer vision and has the follow-up subsequent face-related applications, such as been widely studied in the past two decades. face recognition, facial expression recognition, and face Large amounts of approaches have been proposed for alignment [22]. -erefore, it is necessary to study a face face detection. -e early research on face detection [4–9] detection and segmentation method. mainly focused on the design of handcraft feature and used Mask R-CNN [23], an improved object detection model traditional machine learning algorithms to train effective based on Faster R-CNN, has an impressive performance on classifiers for detection and recognition. Such approaches various object detection and segmentation benchmarks such are limited in that the efficient feature design is complex and as COCO challenges [24] and Cityscapes dataset [25]. 
Unlike the detection accuracy is relatively low. In recent years, face traditional R-CNN series methods, Mask R-CNN adds a detection methods based on deep convolutional neural mask branch for predicting segmentation masks on each network [10–13] have been widely studied, which are more Region of Interest (RoI), which can fulfil both detection and robust and efficient than handcraft feature methods. Besides, segmentation tasks. In order to fulfil both face detection and a series of efficient object detection frameworks are used for segmentation tasks from the image to overcome the 2 Discrete Dynamics in Nature and Society drawbacks of the existing methods, a face detection and 2.2. Neural Networks Based Methods. As early as 1994, segmentation method based on improved Mask R-CNN (G- Vaillant et al. [10] first proposed using neural network to detect faces. In this work, Convolutional Neural Networks Mask) is proposed in this paper. In particular, our scheme introduces Generalized Intersection over Union (GIoU) [26] (CNN) is used to classify whether each pixel is part of a face as the loss function for bounding box regression to improve and then determine the location of the face through another detection accuracy of face detection. -e main contributions CNN. After that, the researchers did a lot of research based of this paper are as follows: on this work. In recent years, the deep learning approaches has significantly promoted the development of the computer (1) A new dataset was created (more details are de- vision technology, including face detection. Li et al. [11] scribed in Section 4.1), which annotated 5115 images proposed a cascade CNN network architecture for rapid face randomly selected from the FDDB [27] and detection, which is a multiresolution network structure that ChokePoint datasets [28]. can quickly eliminate background regions in the low-res- (2) A face detection and segmentation method based on olution stage and carefully evaluate challenging candidates improved Mask R-CNN was proposed, which can in the last high resolution stage. Ranjan et al. [12] proposed a detect faces correctly while also precisely segmenting deformation part model based on normalized features each face in an image. Furthermore, the proposed extracted by deep convolutional neural network. Yang et al. method improves the detection performance by [13] proposed a method called Convolutional Channel introducing GIoU as a bounding box loss function. Feature (CCF) by combining the advantages of both filtered -e experimental results verify that our proposed channel features and CNN, which has a lower computational G-Mask method achieves promising performance on cost and storage cost than the general end-to-end CNN several mainstream benchmarks, including the method. FDDB, AFW [29], and WIDER FACE [30]. Recently, witnessing the significant advancement of object detection using region-based methods, researchers -e remainder of this paper is organized as follows. have gradually applied the R-CNN series of methods to face Section 2 briefly reviews the related work. -e G-Mask detection. Qinet al. [14] proposed ajoint training scheme for framework for face detection and segmentation is described CNN cascade, Region Proposal Network (RPN), and Fast in detail in Section 3. Section 4 presents the experiment and R-CNN. In [15], Jiang et al. trained the Faster R-CNN model discussion of the proposed method. 
In the last section, the by using WIDER dataset and verified performance on the work is summarized and the direction of future work is FDDB and IJB-A benchmarks. Sun et al. [16] improve the proposed. Faster R-CNN framework through a series of strategies such as multiscale training, hard negative mining, and feature concatenation. Wu et al. [17] proposed a different scales face 2. Related Work detection method based on Faster R-CNN for the challenge Face detection as one of the important research directions of of small-scale face detection. Liu et al. [18] proposed a computer vision has been extensively studied in recent years. cascaded backbone branches fully convolutional neural From the development process of face detection, we can network (BB-FCN) and used facial landmark localization simply classify previous work as handcraft feature based and results to guide R-CNN-based face detection. -e neural neural networks based methods. networks based methods are already the mainstream of face detection because of its high efficiency and stability. In this work, we propose a G-Mask scheme, which achieves fairly 2.1. Handcraft Feature Based Methods. With the appearance progress in face detection task compared to the original of the first real-time face detection method called Viola- architecture. Jones [4] in 2004, face detection has begun to be applied in practice. -e well-known Viola-Jones can perform real-time 3. Improved Mask R-CNN detection using Haar feature and cascaded structure, but it also has some drawbacks, such as large feature size and low 3.1. Network Architecture. -e proposed method is extended recognition rate for complex situations. To address these from the Mask R-CNN [23] framework, which is the state- concerns, a lot of new handcraft features are proposed, such of-the-art object detection scheme and demonstrated im- as HOG [5], SIFT [6], SUFT [7], and LBP [8], which have pressive performance on various object detection bench- achieved outstanding results. Apart from the above marks. As stated in Figure 1, the proposed G-Mask method methods, one of the significant advances was Deformable consists of two branches, one for face detection and the other Part Model (DPM), proposed by Felzenszwalb et al. [9]. In for face and background image segmentation. In this work, the DPM model, the face is represented as a set of de- the ResNet-101 backbone is used to extract the facial features formable parts, and the improved HOG feature and SVM are of the input image, and the Region of Interest (RoI) is rapidly used for detection, achieving remarkable performance. In generated on the feature map through the Region Proposal general, the advantages of handcraft features are that the Network (RPN). We also use the Region of Interest Align model is intuitive and extensible, and the disadvantage is (RoIAlign) to faithfully preserve exact spatial locations and that the detection accuracy is limited in the face of multi- output the feature map to a fixed size. At the end of the objective tasks. network, the bounding box is located and classified in the Discrete Dynamics in Nature and Society 3 Fixed size Box feature map Fully Class connected RoIAlign layers ResNet Mask RPN Fully convolution network Figure 1: Network architecture of the G-Mask. detection branch, and the corresponding face mask is generated on the image in the segmentation branch through the Fully Convolution Network (FCN) [31]. In the following, we will introduce the key steps of our network in detail. 3.2. 
Region Proposal Network. For images with human faces in our daily life, there are generally some face objects with different scales and aspect ratios. -erefore, in our approach, Region Proposal Network (RPN) generates RoIs by sliding windows on the feature map through anchors with different scales and different aspect ratios. Details are shown in Figure 2. -e largest rectangle in the figure represents the feature map extracted by the convolutional neural network, and the dotted line indicates that the anchor is the standard anchor. Assume that the standard anchor size is 64 pixels, and the three anchors it contained represent three anchors withaspect ratios of 1:1,1:2, and 2:1.-e dot-dashline and the solid line represent the anchors of 32 and 128 pixels, Figure 2: Illustration of RPN network. respectively. Similarly, each of them also has three aspect ratios anchors. For traditional RPN, the above three scales and three aspect ratios are used to slide on the feature map to 2 2 2 generate RoIs. In this paper, we use 5 scales (16 , 32 , 64 , 2 2 128 , and 256 ) and 3 aspect ratios (1:1, 1:2, and 2:1), Fixed size output leading to 15 anchors at each location, which was more effective in detecting objects of different scales. Pooling 3.3. RoIAlign Layer. G-Mask, unlike the general face de- tection methods, has a segmentation operation, which re- quires more refined spatial quantization for feature extraction. In the traditional region-based approaches, RoIPool is the standard operation for extracting small Figure 3: Bilinear interpolation in RoIAlign, where the dashed feature map from RoIs, which have two quantization op- background grid represents the feature map, the solid grid rep- erations that result in misalignments between the RoI and resents an RoI (with 2 ×2 bins in this example), and the dots the extracted features. For traditional detection methods, represent the four sample points in each bin. this may not affect classification and localization, while for our approach, it has a great impact on prediction of pixel- accurate masks, as well as for small object detection. It can be seen that the RoIAlign layer cancels the harsh In response to the above problem, we introduced the quantization operations on the feature map and uses bilinear RoIAlign layer, following the scheme of [23]. As shown in interpolation to preserve the floating-number coordinates, Figure 3, suppose the feature map is divided into 2 ×2 bins. thereby avoiding misalignments between the RoI and the 4 Discrete Dynamics in Nature and Society extracted features. -e bilinear interpolation function has difficult to optimize the nonoverlapping bounding boxes; (b) two steps, which are defined as follows: the IoU value may be the same when two objects intersect in different orientations, so the IoU function does not reflect Interpolate on the x-axis direction as follows: how the two objects overlap. To overcome these drawbacks, x − x x − x 2 1 f R ≈ f Q + f Q , R � x, y , 􏼁 􏼁 􏼁 􏼁 1 11 21 1 1 GIoU not only focuses on the situation where two objects x − x x − x 2 1 2 1 overlap but also considers the situation of nonoverlapping. (1) -e details of the GIoU metric are shown in Figure 4. p p p p g g g g Suppose B � (x , y , x , y ) and B � (x , y , x , y ) are x − x x − x p 1 1 2 2 g 1 1 2 2 2 1 f R ≈ f Q + f Q , R � x, y . 
􏼁 􏼁 􏼁 􏼁 2 12 22 2 2 the coordinates of an object’s predicted bounding box and x − x x − x 2 1 2 1 the ground-truth bounding box, where x > x and y > y 2 1 2 1 (2) in B and B ; then, the area of them is P g Interpolate on the y-axis direction as follows: p p p p A � 􏼐x − x 􏼑 × 􏼐y − y 􏼑, (4) 2 1 2 1 y − y y − y 2 1 f(P) � f(x, y) ≈ f R 􏼁 + f R 􏼁 , (3) 1 2 y − y y − y 2 1 2 1 g g g g A � x − x × y − y . 􏼁 􏼁 (5) g 2 1 2 1 where f(x, y) is the value of the sampling point P, f(Q ), -e coordinates and area of intersection I of B and B f(Q ), f(Q ), and f(Q ) are the values of the four P g 12 21 22 can be calculated as nearby grid points Q � (x , y ), Q � (x , y ), 11 1 1 12 1 2 Q � (x , y ), and Q � (x , y ), and f(R ), f(R ) are the 21 2 1 22 2 2 1 2 i p g x � max􏼐x , x 􏼑, 1 1 1 value obtained by interpolating in the x-axis direction. (6) i p g x � min􏼐x , x 􏼑, 2 2 2 3.4. Mask Branch. -e mask branch realizes the seg- i p g y � max􏼐y , y 􏼑, mentation of face object and background image in 1 1 1 (7) i p g G-Mask model, which predicts the segmentation mask in y � min􏼐y , y 􏼑, 2 2 2 a pixel to pixel manner by applying Full Convolutional Network (FCN) [31] to each RoI. -e FCN scheme is one i i i i i i i i x − x 􏼁 × y − y 􏼁 , if x > x , y > y , 2 1 2 1 2 1 2 1 of the solutions for instance segmentation, which orig- A � 􏼨 (8) 0, otherwise. inates from CNN but is also different from general CNN. For the traditional CNN network architecture, in order to Similarly, the smallest enclosing box B can be found obtain the feature vector of fixed dimensions, the con- through volutional layer is generally connected with several full p g connection layers, and finally the output is a numerical x � min􏼐x , x 􏼑, 1 1 1 (9) description of the input, which is generally applicable to p g x � max􏼐x , x 􏼑, 2 2 2 tasks such as image recognition and classification, object detection, and positioning. -e FCN framework is similar c p g y � min􏼐y , y 􏼑, to the traditional CNN network, which also includes the 1 1 1 (10) convolutional layer and the pooling layer. In particular, p g y � max y , y , 􏼐 􏼑 2 2 2 the FCN uses the deconvolution to up-sample the feature map in the end convolution layer so that the output image and the area of B can be computed as size can be restored to the original image size, and finally c c c c A � x − x 􏼁 × y − y 􏼁 . (11) c 2 1 2 1 uses the Softmax classifier to predict the category of each pixel. -e IoU between B and B is defined as P g IoU � . (12) 3.5. Generalized Intersection over Union. Bounding box re- A + A − A p g i gression, as one of the fundamental components of many -erefore, GIoU can be calculated by the definition of computer vision tasks, deserves further study by researchers [32]. However, unlike the architecture and feature extraction A − A + A − A 􏼐 􏼑 strategy improvement researches, which have made great c p g i (13) GIoU � IoU − . progress in recent years [33], the research of bounding box regression has lagged behind somewhat. -e Generalized Intersection over Union (GIoU) [26], as the latest metric and bounding box regression method, demonstrates state-of- 3.6. Loss Function. -e proposed G-Mask model consists of the-art results on various object detection benchmarks by two stages, which are the same as the general region-based incorporating with the general object detection frameworks. model. In the first stage, RPN proposes the candidate For traditional IoU, there are two weaknesses when it is used bounding boxes of the object face. 
3.4. Mask Branch. The mask branch realizes the segmentation of the face object from the background image in the G-Mask model; it predicts the segmentation mask in a pixel-to-pixel manner by applying a Fully Convolutional Network (FCN) [31] to each RoI. The FCN scheme is one of the solutions for instance segmentation; it originates from CNNs but differs from a general CNN. In a traditional CNN architecture, the convolutional layers are generally followed by several fully connected layers to obtain a feature vector of fixed dimensions, and the final output is a numerical description of the input, which suits tasks such as image recognition and classification, object detection, and localization. The FCN framework is similar in that it also includes convolutional and pooling layers; in particular, however, the FCN uses deconvolution to upsample the feature map of the last convolutional layer so that the output can be restored to the original image size, and finally applies a Softmax classifier to predict the category of each pixel.

3.5. Generalized Intersection over Union. Bounding box regression, one of the fundamental components of many computer vision tasks, deserves further study [32]. However, unlike research on architectures and feature extraction strategies, which has made great progress in recent years [33], research on bounding box regression has lagged somewhat behind. The Generalized Intersection over Union (GIoU) [26], the latest metric and bounding box regression method, demonstrates state-of-the-art results on various object detection benchmarks when incorporated into general object detection frameworks. Traditional IoU has two weaknesses when used as a metric or as a bounding box regression loss: (a) the IoU value is zero whenever two objects do not overlap, making nonoverlapping bounding boxes difficult to optimize; (b) the IoU value may be the same when two objects intersect in different orientations, so IoU does not reflect how the two objects overlap. To overcome these drawbacks, GIoU not only focuses on the situation where two objects overlap but also accounts for the nonoverlapping case. The details of the GIoU metric are shown in Figure 4.

[Figure 4: Illustration of the GIoU metric. The solid lines indicate the prediction box and the ground truth box, the dotted line indicates the smallest enclosing box, and the shaded portion indicates the intersection of the prediction box and the ground truth box.]

Suppose $B_p = (x_1^p, y_1^p, x_2^p, y_2^p)$ and $B_g = (x_1^g, y_1^g, x_2^g, y_2^g)$ are the coordinates of an object's predicted bounding box and ground-truth bounding box, where $x_2 > x_1$ and $y_2 > y_1$ in both $B_p$ and $B_g$; then their areas are

$$A_p = \left(x_2^p - x_1^p\right) \times \left(y_2^p - y_1^p\right), \quad (4)$$

$$A_g = \left(x_2^g - x_1^g\right) \times \left(y_2^g - y_1^g\right). \quad (5)$$

The coordinates and area of the intersection $I$ of $B_p$ and $B_g$ can be calculated as

$$x_1^i = \max\left(x_1^p, x_1^g\right), \quad x_2^i = \min\left(x_2^p, x_2^g\right), \quad (6)$$

$$y_1^i = \max\left(y_1^p, y_1^g\right), \quad y_2^i = \min\left(y_2^p, y_2^g\right), \quad (7)$$

$$A_i = \begin{cases} \left(x_2^i - x_1^i\right) \times \left(y_2^i - y_1^i\right), & \text{if } x_2^i > x_1^i \text{ and } y_2^i > y_1^i, \\ 0, & \text{otherwise}. \end{cases} \quad (8)$$

Similarly, the smallest enclosing box $B_c$ can be found through

$$x_1^c = \min\left(x_1^p, x_1^g\right), \quad x_2^c = \max\left(x_2^p, x_2^g\right), \quad (9)$$

$$y_1^c = \min\left(y_1^p, y_1^g\right), \quad y_2^c = \max\left(y_2^p, y_2^g\right), \quad (10)$$

and the area of $B_c$ can be computed as

$$A_c = \left(x_2^c - x_1^c\right) \times \left(y_2^c - y_1^c\right). \quad (11)$$

The IoU between $B_p$ and $B_g$ is defined as

$$\mathrm{IoU} = \frac{A_i}{A_p + A_g - A_i}. \quad (12)$$

Therefore, GIoU can be calculated by the definition

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{A_c - \left(A_p + A_g - A_i\right)}{A_c}. \quad (13)$$
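As a concrete check of (4)-(13), the following is a minimal sketch of the GIoU computation for two boxes in (x1, y1, x2, y2) form; the function name is our own.

```python
def giou(bp, bg):
    """Generalized IoU of boxes bp, bg given as (x1, y1, x2, y2)."""
    ap = (bp[2] - bp[0]) * (bp[3] - bp[1])             # eq. (4)
    ag = (bg[2] - bg[0]) * (bg[3] - bg[1])             # eq. (5)
    # Intersection, eqs. (6)-(8); zero if the boxes are disjoint.
    xi1, yi1 = max(bp[0], bg[0]), max(bp[1], bg[1])
    xi2, yi2 = min(bp[2], bg[2]), min(bp[3], bg[3])
    ai = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    # Smallest enclosing box, eqs. (9)-(11).
    xc1, yc1 = min(bp[0], bg[0]), min(bp[1], bg[1])
    xc2, yc2 = max(bp[2], bg[2]), max(bp[3], bg[3])
    ac = (xc2 - xc1) * (yc2 - yc1)
    union = ap + ag - ai
    iou = ai / union                                    # eq. (12)
    return iou - (ac - union) / ac                      # eq. (13)

print(giou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333...: overlapping boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # -0.777...: disjoint boxes
```

Note that the second call returns a negative value rather than a flat zero: disjoint boxes are still graded by how far apart they are, which is precisely the weakness (a) of plain IoU that GIoU addresses.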
3.6. Loss Function. The proposed G-Mask model consists of two stages, the same as in general region-based models. In the first stage, the RPN proposes candidate bounding boxes for face objects. The second stage, following the Fast R-CNN architecture, extracts features from each candidate box and then performs classification and bounding box localization. In addition, like Mask R-CNN, we add a mask branch parallel to the classification and bounding box localization branches. We therefore define a multitask objective function comprising a classification loss $L_{\mathrm{cls}}$, a bounding box localization loss $L_{\mathrm{box}}$, and a segmentation loss $L_{\mathrm{mask}}$. Our loss function for each image is defined as

$$L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}. \quad (14)$$

In (14), the classification loss $L_{\mathrm{cls}}$ and segmentation loss $L_{\mathrm{mask}}$ are defined the same as in Mask R-CNN. For the bounding box loss, several comparison experiments showed that GIoU responds better to face detection tasks than the traditional bounding box regression method; therefore, in this paper we introduce GIoU as the bounding box loss function.

In more detail, the classification loss is defined as

$$L_{\mathrm{cls}}\left(\{p_i\}\right) = \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}\left(p_i, p_i^{*}\right), \quad (15)$$

where $N_{\mathrm{cls}}$ is the minibatch size, $i$ is the index of an anchor in the minibatch, and $p_i$ is the predicted probability that anchor $i$ is a face target. The ground-truth label $p_i^{*} = 1$ if the anchor is positive, and $p_i^{*} = 0$ if the anchor is negative. The classification loss of each anchor is the log loss over face versus not-face:

$$L_{\mathrm{cls}}\left(p_i, p_i^{*}\right) = -\left[p_i^{*} \log p_i + \left(1 - p_i^{*}\right) \log\left(1 - p_i\right)\right]. \quad (16)$$

For the bounding box loss, we introduce GIoU as the loss function; with the GIoU metric defined in (13), the bounding box loss is

$$L_{\mathrm{box}} = 1 - \mathrm{GIoU}. \quad (17)$$

For the segmentation loss, we adopt the average binary cross-entropy loss

$$L_{\mathrm{mask}} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[y_{ij} \log \hat{y}_{ij}^{k} + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}^{k}\right)\right], \quad (18)$$

where $y_{ij}$ is the label of cell $(i, j)$ in a region of size $m \times m$ and $\hat{y}_{ij}^{k}$ is the predicted value of the $k$-th class for that cell. $L_{\mathrm{mask}}$ is defined only on the mask associated with the ground-truth class $k$; the other mask outputs do not affect the loss.
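Putting (14)-(18) together, here is a minimal NumPy sketch of the three loss terms for one image, reusing the giou helper from the previous sketch; the names and array shapes are illustrative, not the authors' training code.

```python
import numpy as np

def cls_loss(p, p_star):
    """Anchor classification log loss, eqs. (15)-(16), averaged over
    the minibatch of anchors (the 1/N_cls factor)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # guard the logarithms
    return np.mean(-(p_star * np.log(p) + (1 - p_star) * np.log(1 - p)))

def box_loss(pred_box, gt_box):
    """GIoU bounding box loss, eq. (17)."""
    return 1.0 - giou(pred_box, gt_box)

def mask_loss(y, y_hat):
    """Average binary cross-entropy over an m x m mask, eq. (18),
    computed only on the ground-truth class's mask output."""
    y_hat = np.clip(y_hat, 1e-7, 1 - 1e-7)
    return np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Total multitask loss of eq. (14) for one RoI:
# L = cls_loss(p, p_star) + box_loss(b_p, b_g) + mask_loss(y, y_hat)
```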
4. Experiments

4.1. Experimental Setup. Unlike for generic object detection and face detection, there is no off-the-shelf face dataset with mask annotations that can be employed to train our model [34]. The first step of our work was therefore to create a new dataset with mask annotations. To enhance the reliability of the samples, we selected 5115 samples from the FDDB and ChokePoint [28] datasets and annotated them with mask labels. After the annotation work, we trained the G-Mask model on this dataset.

For implementation, we adopt the Keras [35] framework to train the G-Mask model on Ubuntu 16.04, with ResNet-101 [36] as the backbone network architecture. In the training phase, the G-Mask model is trained on the aforementioned dataset for 150,000 iterations (50 epochs of 3000 steps each) with the learning rate set to 0.001 and the weight decay set to 0.0001. We randomly sample one image per batch for training [37], resizing the short side of each image to 800 and the long side to 1024 (a precise reading of this resize rule is sketched below). In the RPN part, RoIs are generated by sliding a window over the feature map using anchors of different scales and aspect ratios; 2000 RoIs are kept after nonmaximum suppression, and an RoI is treated as foreground only if its IoU with the ground truth exceeds 0.5. The testing phase uses the same settings as the training phase, and a region proposal is accepted as a face only if its confidence score exceeds 0.7. Training and testing are carried out on the same server, equipped with a Xeon E5 CPU, 128 GB of memory, and an NVIDIA GeForce GTX 1080 Ti GPU.
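The resize rule above can be pinned down with a small helper. One common reading, used in Matterport-style Mask R-CNN implementations, is to scale the short side toward 800 while never letting the long side exceed 1024; this interpretation is our assumption, not something the paper states explicitly.

```python
def resize_dims(h, w, short=800, long_cap=1024):
    """Return (new_h, new_w): scale so the short side reaches `short`,
    unless that would push the long side past `long_cap`."""
    scale = short / min(h, w)
    if scale * max(h, w) > long_cap:
        scale = long_cap / max(h, w)  # the long-side cap wins
    return round(h * scale), round(w * scale)

print(resize_dims(600, 800))   # (768, 1024): the long-side cap binds
print(resize_dims(900, 1000))  # (800, 889): the short-side target binds
```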
4.2. Experimental Results. In this work, the G-Mask model not only realizes bounding box localization of the face target but also separates the face information from the background image with a binary mask, so that more detailed face information can be obtained. Comparison experiments were carried out on three popular face benchmark datasets: FDDB, AFW, and WIDER FACE.

The FDDB [27] dataset is a well-known face detection evaluation dataset and benchmark containing 2845 images with 5171 human faces. The faces in each image come from different scenes, which makes the dataset quite challenging. We compared several methods on FDDB, including Faster R-CNN [15], Mask R-CNN [23], Pico [38], Viola-Jones [39], and Koestinger et al. [40]. For an effective comparison, the G-Mask, Mask R-CNN, and Faster R-CNN models were trained on the same data, namely, the dataset constructed in this work. We compared the true positive rates at 1500 false positives; the results are shown in Figure 5. G-Mask performs better than Faster R-CNN when there are more than 160 false positives, and better than Mask R-CNN when there are more than 280 false positives. Furthermore, our method achieves an 88.80% true positive rate at 1500 false positives, exceeding all the comparison methods. These results show that the proposed G-Mask method achieves promising performance and can segment face information while detecting effectively. Some detection results of the Mask R-CNN and G-Mask models in complex scenes of the FDDB dataset are shown in Figure 6; it is obvious that the G-Mask model performs better in the multiscale face task, which demonstrates the effectiveness of the proposed method in face detection.

[Figure 5: Comparisons of face detection with other methods on the FDDB benchmark (true positive rate versus number of false positives, 0 to 1500). At 1500 false positives: Viola-Jones 0.6362, Pico 0.7393, Koestinger et al. 0.7414, Faster R-CNN 0.8083, Mask R-CNN 0.8298, G-Mask 0.8880.]

[Figure 6: Different detection results of Mask R-CNN and G-Mask in complex scenes of the FDDB dataset. (a) Mask R-CNN model and (b) G-Mask model.]

The AFW dataset [29] is a face dataset and benchmark built from Flickr images, containing 205 images with 473 labeled faces. The precision-recall curve of our method on the AFW benchmark is shown in Figure 7: the G-Mask method achieves 95.97% average precision (AP). Although our dataset has a different label format from the AFW benchmark, and our training dataset is only moderately sized, this result demonstrates the generalization ability of our method.

[Figure 7: The precision-recall curve of our method on the AFW benchmark (AP: DPM 97.21, HeadHunter 97.14, G-Mask 95.97, SquaresChnFtrs-5 95.24, Structured Models 95.19, Shen et al. 89.03, TSM 87.99; Face.com, Picasa, and Face++ are plotted without AP values). Data of other models and the evaluation code are derived from [41].]

WIDER FACE [30], one of the largest and most challenging open-source face detection datasets, has 32,203 images and 393,703 labeled faces. Its large variations in face size, pose, and occlusion pose great challenges to face detection, and the dataset is divided into easy, medium, and hard subsets according to difficulty level. To further demonstrate the detection performance of our method, we trained the G-Mask model on the WIDER FACE dataset and verified it on the validation set. The proposed method is compared with several major methods, including MSCNN [42], CMS-RCNN [43], ScaleFace [44], Multitask Cascade CNN [45], and Faceness-WIDER [46]. The precision-recall curves of the G-Mask method on the WIDER FACE benchmark are shown in Figure 8. Our method obtains 0.902 AP on the easy subset, 0.854 AP on the medium subset, and 0.662 AP on the hard subset, exceeding most of the comparison methods. Compared with the state-of-the-art MSCNN method, the AP of the proposed method is only 0.014 lower on the easy subset and 0.049 lower on the medium subset. The gap between G-Mask and MSCNN on the hard subset is larger, probably because MSCNN uses a series of strategies for small-scale face detection and can therefore handle more challenging cases. Nevertheless, the G-Mask method still achieves promising performance, which demonstrates its effectiveness.

[Figure 8: Precision-recall curves on the WIDER FACE benchmark. (a) Easy subset: MSCNN 0.916, G-Mask 0.902, CMS-RCNN 0.899, ScaleFace 0.868, Multitask Cascade CNN 0.848, LDCF+ 0.790, Faceness-WIDER 0.713, Multiscale Cascade CNN 0.691, Two-stage CNN 0.681, ACF-WIDER 0.659. (b) Medium subset: MSCNN 0.903, CMS-RCNN 0.874, ScaleFace 0.867, G-Mask 0.854, Multitask Cascade CNN 0.825, LDCF+ 0.769, Multiscale Cascade CNN 0.664, Faceness-WIDER 0.634, Two-stage CNN 0.618, ACF-WIDER 0.541. (c) Hard subset: MSCNN 0.802, ScaleFace 0.772, G-Mask 0.662, CMS-RCNN 0.624, Multitask Cascade CNN 0.598, LDCF+ 0.522, Multiscale Cascade CNN 0.424, Faceness-WIDER 0.345, Two-stage CNN 0.323, ACF-WIDER 0.273.]

We further demonstrate more qualitative results of the G-Mask method in Figure 9. It can be observed that the proposed method detects faces correctly while also precisely segmenting each face in an image.

[Figure 9: More results of the G-Mask method.]

We also compared the running time of different region-based methods on the FDDB, AFW, and ChokePoint datasets. The WIDER FACE dataset was not used for this test because running times on its hard and easy subsets differ considerably. We randomly selected 100 images from each of the above datasets and computed the average time per image; the results are reported in Table 1. Faster R-CNN has the shortest running time because of its relatively simple structure, while the proposed method has a running time similar to Mask R-CNN. Compared with Faster R-CNN, G-Mask adds a segmentation branch, which increases the computational cost. Even so, the G-Mask method achieves higher accuracy with little extra time compared with other region-based methods, and it also obtains more detailed face information through the segmentation branch while localizing accurately.

Table 1: Running time (s) of different region-based methods.

Method         FDDB    AFW     ChokePoint
R-CNN          14.75   15.32   14.51
Fast R-CNN      3.12    3.08    2.84
Faster R-CNN    0.30    0.32    0.28
Mask R-CNN      0.32    0.35    0.33
G-Mask          0.35    0.42    0.33
5. Conclusions

In this paper, a G-Mask method was proposed for face detection and segmentation. The approach extracts features with ResNet-101, generates RoIs with an RPN, preserves precise spatial positions with RoIAlign, and generates binary masks through a fully convolutional network (FCN). In doing so, the proposed framework is able to detect faces correctly while also precisely segmenting each face in an image. Experimental results on a self-built face dataset as well as publicly available datasets verify that the proposed G-Mask method achieves promising performance. In future work, we will consider improving the speed of the proposed method.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partly supported by the Innovation Team Project of the Education Department of Guangdong Province (2017KCXTD021), the Key Laboratory of the Education Department of Guangdong Province (2019KSYS009), the Foundation for Youth Innovation Talents in Higher Education of Guangdong Province (2018KQNCX139), the Project for Distinctive Innovation of Ordinary Universities of Guangdong Province (2018KTSCX120), and the Ph.D. Start-Up Fund of the Natural Science Foundation of Guangdong Province (2016A030310335).

References

[1] J. Deng, J. Guo, N. Xue et al., "ArcFace: additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690-4699, Long Beach, CA, USA, June 2019.
[2] N. Zeng, H. Zhang, B. Song et al., "Facial expression recognition via learning deep sparse autoencoders," Neurocomputing, vol. 273, pp. 643-649, 2018.
[3] Y. Shi, G. Li, Q. Cao et al., "Face hallucination by attentive sequence optimization with reinforcement learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[5] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886-893, San Diego, CA, USA, June 2005.
[6] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[8] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester et al., "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2009.
[10] R. Vaillant, C. Monrocq, and Y. Le Cun, "Original approach for the localisation of objects in images," IEE Proceedings - Vision, Image, and Signal Processing, vol. 141, no. 4, pp. 245-250, 1994.
[11] H. Li, Z. Lin, X. Shen et al., "A convolutional neural network cascade for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5325-5334, Boston, MA, USA, June 2015.
[12] R. Ranjan, V. M. Patel, and R. Chellappa, "HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 121-135, 2017.
[13] B. Yang, J. Yan, Z. Lei et al., "Convolutional channel features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 82-90, Boston, MA, USA, June 2015.
[14] H. Qin, J. Yan, X. Li et al., "Joint training of cascaded CNN for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3456-3465, Las Vegas, NV, USA, July 2016.
[15] H. Jiang and E. Learned-Miller, "Face detection with the faster R-CNN," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 650-657, Washington, DC, USA, June 2017.
[16] X. Sun, P. Wu, and S. C. H. Hoi, "Face detection using deep learning: an improved faster RCNN approach," Neurocomputing, vol. 299, no. 1, pp. 42-50, 2018.
[17] W. Wu, Y. Yin, X. Wang, and D. Xu, "Face detection with different scales based on faster R-CNN," IEEE Transactions on Cybernetics, vol. 49, no. 11, pp. 4017-4028, 2019.
[18] L. Liu, G. Li, Y. Xie et al., "Facial landmark machines: a backbone-branches architecture with progressive representation learning," IEEE Transactions on Multimedia, vol. 21, no. 9, 2019.
[19] R. Girshick, J. Donahue, T. Darrell et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580-587, Columbus, OH, USA, June 2014.
[20] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1440-1448, Boston, MA, USA, June 2015.
[21] S. Ren, K. He, R. Girshick et al., "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.
[22] W. Wu, C. Qian, S. Yang et al., "Look at boundary: a boundary-aware face alignment algorithm," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2129-2138, Salt Lake City, UT, USA, June 2018.
[23] K. He, G. Gkioxari, P. Dollár et al., "Mask R-CNN," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2961-2969, Honolulu, HI, USA, July 2017.
[24] T.-Y. Lin, M. Maire, S. Belongie et al., "Microsoft COCO: common objects in context," in Computer Vision - ECCV 2014, pp. 740-755, Springer, Berlin, Germany, 2014.
[25] M. Cordts, M. Omran, S. Ramos et al., "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213-3223, Las Vegas, NV, USA, July 2016.
[26] H. Rezatofighi, N. Tsoi, J. Y. Gwak et al., "Generalized intersection over union: a metric and a loss for bounding box regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 658-666, Long Beach, CA, USA, June 2019.
[27] V. Jain and E. Learned-Miller, "FDDB: a benchmark for face detection in unconstrained settings," Technical Report UM-CS-2010-009, 2010.
[28] Y. Wong, S. Chen, S. Mau et al., "Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 74-81, Colorado Springs, CO, USA, June 2011.
[29] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2879-2886, Providence, RI, USA, June 2012.
[30] S. Yang, P. Luo, C. C. Loy et al., "WIDER FACE: a face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5525-5533, Las Vegas, NV, USA, June 2016.
[31] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, Boston, MA, USA, June 2015.
[32] J. Ren, A. Hussain, J. Han, and X. Jia, "Cognitive modelling and learning for multimedia mining and understanding," Cognitive Computation, vol. 11, no. 6, pp. 761-762, 2019.
[33] J. Tschannerl, J. Ren, P. Yuen et al., "MIMR-DGSA: unsupervised hyperspectral band selection based on information theory and a modified discrete gravitational search algorithm," Information Fusion, vol. 51, pp. 189-200, 2019.
[34] K. Lin, H. Zhao, J. Lv et al., "Face detection and segmentation with generalized intersection over union based on mask R-CNN," in Proceedings of the International Conference on Brain Inspired Cognitive Systems, pp. 106-116, Guangzhou, China, July 2019.
[35] F. Chollet, "Keras," GitHub repository, 2015, https://github.com/fchollet/keras.
[36] K. He, X. Zhang, S. Ren et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, Las Vegas, NV, USA, July 2016.
[37] P. Wan, C. Wu, Y. Lin et al., "Driving anger states detection based on incremental association markov blanket and least square support vector machine," Discrete Dynamics in Nature and Society, vol. 2019, Article ID 2745381, 17 pages, 2019.
[38] N. Markuš, M. Frljak, I. S. Pandžić et al., "A method for object detection based on pixel intensity comparisons organized in decision trees," 2013, https://arxiv.org/abs/1305.4537.
[39] D. Hefenbrock, J. Oberg, N. T. N. Thanh et al., "Accelerating Viola-Jones face detection to FPGA-level using GPUs," in Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 11-18, Charlotte, NC, USA, May 2010.
[40] M. Köstinger, P. Wohlhart, P. M. Roth et al., "Robust face detection by simple means," in Proceedings of the DAGM 2012 CVAW Workshop, Graz, Austria, August 2012.
[41] M. Mathias, R. Benenson, M. Pedersoli et al., "Face detection without bells and whistles," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 720-735, Zurich, Switzerland, September 2014.
[42] Z. Cai, Q. Fan, R. S. Feris et al., "A unified multi-scale deep convolutional neural network for fast object detection," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 354-370, Amsterdam, Netherlands, October 2016.
[43] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, "CMS-RCNN: contextual multi-scale region-based CNN for unconstrained face detection," in Deep Learning for Biometrics, pp. 57-79, Springer, Berlin, Germany, 2017.
[44] S. Yang, Y. Xiong, C. C. Loy et al., "Face detection through scale-friendly deep convolutional networks," 2017, https://arxiv.org/abs/1706.02863.
[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, 2016.
[46] S. Yang, P. Luo, C. C. Loy et al., "Faceness-Net: face detection through deep facial part responses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1845-1859, 2017.
