U$^2$-Net: Going Deeper with Nested U-Structure for Salient Object Detection

arXiv:2005.09007v3 [cs.CV] 8 Mar 2022

In this paper, we design a simple yet powerful deep network architecture, U$^2$-Net, for salient object detection (SOD). The architecture of our U$^2$-Net is a two-level nested U-structure. The design has the following advantages: (1) it is able to capture more contextual information from different scales thanks to the mixture of receptive fields of different sizes in our proposed ReSidual U-blocks (RSU), (2) it increases the depth of the whole architecture without significantly increasing the computational cost because of the pooling operations used in these RSU blocks. This architecture enables us to train a deep network from scratch without using backbones from image classification tasks. We instantiate two models of the proposed architecture, U$^2$-Net (176.3 MB, 30 FPS on GTX 1080Ti GPU) and U$^2$-Net$^\dagger$ (4.7 MB, 40 FPS), to facilitate the usage in different environments. Both models achieve competitive performance on six SOD datasets. The code is available: https://github.com/NathanUA/U-2-Net.

Figure 1. Comparison of model size and performance of our U$^2$-Net with other state-of-the-art SOD models. The $maxF_\beta$ measure is computed on dataset ECSSD [46]. The red star denotes our U$^2$-Net (Ours) (176.3 MB) and the blue star denotes our small version U$^2$-Net$^\dagger$ (Ours$^\dagger$) (4.7 MB).

1. Introduction

Salient Object Detection (SOD) aims at segmenting the most visually attractive objects in an image. It is widely used in many fields, such as visual tracking and image segmentation. Recently, with the development of deep convolutional neural networks (CNNs), especially the rise of Fully Convolutional Networks (FCN) [24] in image segmentation, salient object detection has been improved significantly. It is natural to ask, what is still missing? Let's take a step back and look at the remaining challenges.

There is a common pattern in the design of most SOD networks [18, 27, 41, 6], that is, they focus on making good use of deep features extracted by existing backbones, such as AlexNet [17], VGG [35], ResNet [12], ResNeXt [44], DenseNet [15], etc. However, these backbones are all originally designed for image classification. They extract features that are representative of semantic meaning rather than local details and global contrast information, which are essential to saliency detection. They also need to be pre-trained on ImageNet [5] data, which is data-inefficient, especially if the target data follows a different distribution than ImageNet.

This leads to our first question: can we design a new network for SOD that allows training from scratch and achieves comparable or better performance than those based on existing pre-trained backbones?

There are a few more issues with the network architectures for SOD. First, they are often overly complicated [58]. This is partially due to the additional feature aggregation modules that are added to the existing backbones to extract multi-level saliency features from these backbones. Second, the existing backbones usually achieve deeper architecture by sacrificing high resolution of feature maps [58]. To run these deep models with affordable memory and computational cost, the feature maps are downscaled to lower resolution at early stages. For instance, at the early layers of both ResNet and DenseNet [15], a convolution with stride of two followed by a max pooling with stride of two is utilized to reduce the size of the feature maps to one fourth of the input maps. However, high resolution also plays an important role in segmentation besides the deep architecture [21].

Hence, our follow-up question is: can we go deeper while maintaining high resolution feature maps, at a low memory and computation cost?

Our main contribution is a novel and simple network architecture, called U$^2$-Net, that addresses the two questions above. First, U$^2$-Net is a two-level nested U-structure that is designed for SOD without using any pre-trained backbones from image classification. It can be trained from scratch to achieve competitive performance. Second, the novel architecture allows the network to go deeper and attain high resolution without significantly increasing the memory and computation cost. This is achieved by a nested U-structure: on the bottom level, we design a novel ReSidual U-block (RSU), which is able to extract intra-stage multi-scale features without degrading the feature map resolution; on the top level, there is a U-Net like structure, in which each stage is filled by an RSU block. The two-level configuration results in a nested U-structure (see Fig. 5). Our U$^2$-Net (176.3 MB) achieves competitive performance against the state-of-the-art (SOTA) methods on six public datasets, and runs in real time (30 FPS, with input size of $320\times320\times3$) on a 1080Ti GPU. To facilitate the usage of our design in computation and memory constrained environments, we provide a small version of our U$^2$-Net, called U$^2$-Net$^\dagger$ (4.7 MB). The U$^2$-Net$^\dagger$ achieves competitive results against most of the SOTA models (see Fig. 1) at 40 FPS.

2. Related Works

In recent years, many deep salient object detection networks [22, 33] have been proposed. Compared with traditional methods [2] based on hand-crafted features like foreground consistency [49], hyperspectral information [20], superpixels' similarity [55], histograms [26, 25] and so on, deep salient object detection networks show more competitive performance.

Multi-level deep feature integration: Recent works [24, 45] have shown that features from multiple deep layers are able to generate better results [50]. Since then, many strategies and methods for integrating and aggregating multi-level deep features have been developed for SOD. Li et al. (MDF) [18] propose to feed an image patch around a target pixel to a network and then obtain a feature vector for describing the saliency of this pixel. Zhang et al. (Amulet) [53] predict saliency maps by aggregating multi-level features into different resolutions. Zhang et al. (UCF) [54] propose to reduce the checkerboard artifacts of deconvolution operators by introducing a reformulated dropout and a hybrid upsampling module. Luo et al. [27] design a saliency detection network (NLDF+) with a $4\times5$ grid architecture, in which deeper features are progressively integrated with shallower features. Zhang et al. (LFR) [52] predict saliency maps by extracting features from both original input images and their reflection images with a sibling architecture. Hou et al. (DSS+) [13] propose to integrate multi-level features by introducing short connections from deep layers to shallow layers. Chen et al. (RAS) [4] predict and refine saliency maps by iteratively using the side output saliency of a backbone network as the feature attention guidance. Zhang et al. (BMPM) [50] propose to integrate features from shallow and deep layers by a controlled bi-directional passing strategy. Deng et al. (R$^3$Net+) [6] alternately incorporate shallow and deep layers' features to refine the predicted saliency maps. Hu et al. (RADF+) [14] propose to detect salient objects by recurrently aggregating multi-level deep features. Wu et al. (MLMS) [42] improve the saliency detection accuracy by developing a novel Mutual Learning Module for better leveraging the correlation of boundaries and regions. Wu et al. [43] propose to use a Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection. Deep methods in this category take advantage of the multi-level deep features extracted by backbone networks and greatly raise the bar of salient object detection against traditional methods.

Multi-scale feature extraction: As mentioned earlier, saliency detection requires both local and global information. A $3\times3$ filter is good for extracting local features at each layer. However, it is difficult to extract global information by simply enlarging the filter size because it will increase the number of parameters and computation costs dramatically. Many works pay more attention to extracting global context. Wang et al. (SRM) [40] adapt the pyramid pooling module [57] to capture global context and propose a multi-stage refinement mechanism for saliency map refinement. Zhang et al. (PAGRN) [56] develop a spatial and a channel-wise attention module to obtain the global information of each layer and propose a progressive attention guidance mechanism to refine the saliency maps. Wang et al. (DGRL) [41] develop an inception-like [36] contextual weighting module to localize salient objects globally and then use a boundary refinement module to refine the saliency map locally. Liu et al. (PiCANet) [23] recurrently capture the local and global pixel-wise contextual attention and predict the saliency map by incorporating it with a U-Net architecture. Zhang et al. (CapSal) [51] design a local and global perception module to extract both local and global information from features extracted by the backbone network. Zeng et al. (MSWS) [48] design an attention module to predict the spatial distribution of foreground objects over image regions while aggregating their features. Feng et al. (AFNet) [9] develop a global perception module and attentive feedback modules to better explore the structure of salient objects. Qin et al. (BASNet) [33] propose a predict-refine model by stacking two differently configured U-Nets
sequentially and a Hybrid loss for boundary-aware salient object detection. Liu et al. (PoolNet) [22] develop an encoder-decoder architecture for salient object detection by introducing a global guidance module for extraction of global localization features and a multi-scale feature aggregation module adapted from the pyramid pooling module for fusing global and fine-level features. In these methods, many inspiring modules are proposed to extract multi-scale features from multi-level deep features extracted from existing backbones. Diversified receptive fields and richer multi-scale contextual features introduced by these novel modules significantly improve the performance of salient object detection models.

In summary, multi-level deep feature integration methods mainly focus on developing better multi-level feature aggregation strategies. On the other hand, methods in the category of multi-scale feature extraction aim at designing new modules for extracting both local and global information from features obtained by backbone networks. As we can see, almost all of the aforementioned methods try to make better use of feature maps generated by the existing image classification backbones. Instead of developing and adding more complicated modules and strategies to use these backbones' features, we propose a novel and simple architecture, which directly extracts multi-scale features stage by stage, for salient object detection.

Figure 2. Illustration of existing convolution blocks and our proposed residual U-block RSU: (a) Plain convolution block PLN, (b) Residual-like block RES, (c) Dense-like block DSE, (d) Inception-like block INC and (e) Our residual U-block RSU.

3. Proposed Method

First, we introduce the design of our proposed residual U-block and then describe the details of the nested U-architecture built with this block. The network supervision strategy and the training loss are described at the end of this section.

3.1. Residual U-blocks

Both local and global contextual information are very important for salient object detection and other segmentation tasks. In modern CNN designs, such as VGG, ResNet, DenseNet and so on, small convolutional filters with size of $1\times1$ or $3\times3$ are the most frequently used components for feature extraction. They are favored because they require less storage space and are computationally efficient. Figures 2(a)-(c) illustrate typical existing convolution blocks with small receptive fields. The output feature maps of shallow layers only contain local features because the receptive fields of $1\times1$ or $3\times3$ filters are too small to capture global information. To obtain more global information at high resolution feature maps from shallow layers, the most direct idea is to enlarge the receptive field. Fig. 2(d) shows an inception-like block [50], which tries to extract both local and non-local features by enlarging the receptive fields using dilated convolutions [3]. However, conducting multiple dilated convolutions on the input feature map (especially in the early stage) at the original resolution requires too much computation and memory. To decrease the computation costs, PoolNet [22] adapts the parallel configuration from pyramid pooling modules (PPM) [57], which uses small kernel filters on the downsampled feature maps rather than dilated convolutions on the original size feature maps. But fusion of different scale features by direct upsampling and concatenation (or addition) may lead to degradation of high resolution features.

Inspired by U-Net [34], we propose a novel ReSidual U-block, RSU, to capture intra-stage multi-scale features. The structure of RSU-$L$($C_{in}$, $M$, $C_{out}$) is shown in Fig. 2(e), where $L$ is the number of layers in the encoder, $C_{in}$, $C_{out}$ denote input and output channels, and $M$ denotes the number of channels in the internal layers of RSU. Hence, our RSU mainly consists of three components:

(i) an input convolution layer, which transforms the input feature map $x$ ($H \times W \times C_{in}$) to an intermediate map $F_1(x)$ with $C_{out}$ channels. This is a plain convolutional layer for local feature extraction.

(ii) a U-Net like symmetric encoder-decoder structure with height of $L$ which takes the intermediate feature map $F_1(x)$ as input and learns to extract and encode the multi-scale contextual information $U(F_1(x))$. $U$ represents the U-Net like structure as shown in Fig. 2(e). Larger $L$ leads to a deeper residual U-block (RSU), more pooling operations, a larger range of receptive fields and richer local and global features. Configuring this parameter enables extraction of multi-scale features from input feature maps with arbitrary spatial resolutions. The multi-scale features are extracted from gradually downsampled feature maps and encoded into high resolution feature maps by progressive upsampling, concatenation and convolution. This process mitigates the loss of fine details caused by direct upsampling with large scales.

(iii) a residual connection which fuses local features and the multi-scale features by the summation: $F_1(x) + U(F_1(x))$.

To better illustrate the intuition behind our design, we compare our residual U-block (RSU) with the original residual block [12] in Fig. 3. The operation in the residual block can be summarized as $H(x) = F_2(F_1(x)) + x$, where $H(x)$ denotes the desired mapping of the input features $x$; $F_2$, $F_1$ stand for the weight layers, which are convolution operations in this setting. The main design difference between RSU and the residual block is that RSU replaces the plain, single-stream convolution with a U-Net like structure, and replaces the original feature with the local feature transformed by a weight layer: $H_{RSU}(x) = U(F_1(x)) + F_1(x)$, where $U$ represents the multi-layer U-structure illustrated in Fig. 2(e). This design change empowers the network to extract features from multiple scales directly from each residual block. More notably, the computation overhead due to the U-structure is small, since most operations are applied on the downsampled feature maps. This is illustrated in Fig. 4, where we show the computation cost comparison between RSU and other feature extraction modules in Fig. 2(a)-(d). The FLOPs of the dense block (DSE), inception block (INC) and RSU all grow quadratically with the number of internal channels $M$. But RSU has a much smaller coefficient on the quadratic term, leading to an improved efficiency. Its computational overhead compared with the plain convolution (PLN) and residual (RES) blocks, which are both linear w.r.t. $M$, is not significant.

Figure 3. Comparison of the residual block and our RSU.

Figure 4. Computation costs (GFLOPS, Giga Floating Point Operations) of different blocks shown in Fig. 2: the computation costs are calculated based on transferring an input feature map with dimension $320\times320\times3$ to a $320\times320\times64$ output feature map. “PLN”, “RES”, “DSE”, “INC” and “RSU” denote plain convolution block, residual block, dense block, inception block and our residual U-block respectively.

3.2. Architecture of U$^2$-Net

Stacking multiple U-Net-like structures for different tasks has been explored for a while, e.g. stacked hourglass network [31], DocUNet [28], CU-Net [38] for pose estimation, etc. These methods usually stack U-Net-like structures sequentially to build cascaded models and can be summarized as “(U$\times$n-Net)”, where $n$ is the number of repeated U-Net modules. The issue is that the computation and the memory costs get magnified by $n$.

Figure 5. Illustration of our proposed U$^2$-Net architecture. The main architecture is a U-Net like Encoder-Decoder, where each stage consists of our newly proposed residual U-block (RSU). For example, En_1 is based on our RSU block shown in Fig. 2(e).
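The RSU-$L$($C_{in}$, $M$, $C_{out}$) structure of Section 3.1 can be traced numerically. The following is only a minimal sketch, not the released implementation: the function name `rsu_shape_trace` is hypothetical, and it assumes that consecutive encoder layers are separated by a 2x pooling with ceiling rounding and that the last (L-th) encoder layer is a dilated convolution applied without further pooling. It tracks channels and spatial size only, which is enough to see why most layers operate on downsampled maps and why the residual sum $F_1(x) + U(F_1(x))$ is shape-compatible.

```python
import math


def rsu_shape_trace(h, w, c_in, mid, c_out, L):
    """Trace (name, channels, height, width) through a hypothetical RSU-L block.

    Assumptions of this sketch (not confirmed by the text): 2x pooling with
    ceiling rounding between encoder layers, and a dilated, non-pooled layer L.
    """
    if L < 2:
        raise ValueError("L must be at least 2")
    # (i) input convolution: C_in -> C_out at full resolution, giving F1(x)
    trace = [("input", c_in, h, w), ("F1", c_out, h, w)]

    # (ii) encoder layers 1..L-1, each after the first preceded by 2x pooling
    enc_sizes = []
    eh, ew = h, w
    for i in range(1, L):
        if i > 1:
            eh, ew = math.ceil(eh / 2), math.ceil(ew / 2)
        enc_sizes.append((eh, ew))
        trace.append((f"enc{i}", mid, eh, ew))
    # bottom layer L: dilated convolution at the same resolution as layer L-1
    trace.append((f"enc{L}(dilated)", mid, eh, ew))

    # decoder layers L-1..1: each consumes a concatenation (2 * mid channels)
    # of the upsampled previous output and the matching encoder feature
    for i in range(L - 1, 0, -1):
        dh, dw = enc_sizes[i - 1]
        out_channels = c_out if i == 1 else mid
        trace.append((f"dec{i}", out_channels, dh, dw))

    # (iii) residual sum F1(x) + U(F1(x)) needs identical shapes
    assert trace[-1][1:] == trace[1][1:]
    trace.append(("output", c_out, h, w))
    return trace


if __name__ == "__main__":
    for name, channels, hh, ww in rsu_shape_trace(320, 320, 3, 32, 64, 7):
        print(f"{name:>14}: {channels:4d} x {hh} x {ww}")
```

Under these assumptions, an RSU-7 applied to a $320\times320$ input reaches a coarsest internal resolution of $10\times10$, while the block output keeps the input resolution.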
The detailed configuration of the RSU block in each stage of Fig. 5 is given in the last two rows of Table 1.

In this paper, we propose a different formulation, U$^n$-Net, of stacking U-structure for salient object detection. Our exponential notation refers to a nested U-structure rather than cascaded stacking. Theoretically, the exponent $n$ can be set as an arbitrary positive integer to achieve a single-level or multi-level nested U-structure. But architectures with too many nested levels will be too complicated to be implemented and employed in real applications.

Here, we set $n$ as 2 to build our U$^2$-Net. Our U$^2$-Net is a two-level nested U-structure shown in Fig. 5. Its top level is a big U-structure consisting of 11 stages (cubes in Fig. 5). Each stage is filled by a well configured residual U-block (RSU) (bottom level U-structure). Hence, the nested U-structure enables the extraction of intra-stage multi-scale features and aggregation of inter-stage multi-level features more efficiently.

As illustrated in Fig. 5, the U$^2$-Net mainly consists of three parts: (1) a six-stage encoder, (2) a five-stage decoder and (3) a saliency map fusion module attached to the decoder stages and the last encoder stage:

(i) In encoder stages En_1, En_2, En_3 and En_4, we use residual U-blocks RSU-7, RSU-6, RSU-5 and RSU-4, respectively. As mentioned before, “7”, “6”, “5” and “4” denote the heights ($L$) of RSU blocks. $L$ is usually configured according to the spatial resolution of the input feature maps. For feature maps with large height and width, we use greater $L$ to capture more large scale information. The resolution of the feature maps in En_5 and En_6 is relatively low, and further downsampling of these feature maps leads to loss of useful context. Hence, in both En_5 and En_6 stages, RSU-4F blocks are used, where “F” means that the RSU is a dilated version, in which we replace the pooling and upsampling operations with dilated convolutions (see Fig. 5). That means all of the intermediate feature maps of RSU-4F have the same resolution as its input feature maps.

(ii) The decoder stages have similar structures to their symmetrical encoder stages with respect to En_6. In De_5, we also use the dilated version residual U-block RSU-4F, which is similar to that used in the encoder stages En_5 and En_6. Each decoder stage takes the concatenation of the upsampled feature maps from its previous stage and those from its symmetrical encoder stage as the input, see Fig. 5.

(iii) The last part is the saliency map fusion module which is used to generate saliency probability maps. Similar to HED [45], our U$^2$-Net first generates six side output saliency probability maps $S_{side}^{(6)}$, $S_{side}^{(5)}$, $S_{side}^{(4)}$, $S_{side}^{(3)}$, $S_{side}^{(2)}$, $S_{side}^{(1)}$ from stages En_6, De_5, De_4, De_3, De_2 and De_1 by a $3\times3$ convolution layer and a sigmoid function. Then, it upsamples the logits (convolution outputs before sigmoid functions) of the side output saliency maps to the input image size and fuses them with a concatenation operation followed by a $1\times1$ convolution layer and a sigmoid function to generate the final saliency probability map $S_{fuse}$ (see bottom right of Fig. 5).

In summary, the design of our U$^2$-Net allows having a deep architecture with rich multi-scale features and relatively low computation and memory costs. In addition, since our U$^2$-Net architecture is only built upon our RSU blocks without using any pre-trained backbones adapted from image classification, it is flexible and easy to adapt to different working environments with insignificant performance loss. In this paper, we provide two instances of our U$^2$-Net by using different configurations of filter numbers: a normal version U$^2$-Net (176.3 MB) and a relatively smaller version U$^2$-Net$^\dagger$ (4.7 MB). Detailed configurations are presented in the last two rows of Table 1.

Table 1. Detailed configurations of different architectures used in ablation study. “PLN”, “RES”, “DSE”, “INC”, “PPM” and “RSU” denote plain convolution block, residual block, dense block, inception block, Pyramid Pooling Module and our residual U-block respectively. “NIV U$^2$-Net” denotes U-Net with each of its stages replaced by a naive U-Net block. “I”, “M” and “O” indicate the number of input channels ($C_{in}$), middle channels and output channels ($C_{out}$) of each block. “En_i” and “De_j” denote the encoder and decoder stages respectively. The number “L” in “NIV-L” and “RSU-L” denotes the height of the naive U-block and our residual U-block.

| Architecture | Row | En_1 | En_2 | En_3 | En_4 | En_5 | En_6 | De_5 | De_4 | De_3 | De_2 | De_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PLN U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| PLN U-Net | M | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| PLN U-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| RES U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| RES U-Net | M | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| RES U-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| DSE U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| DSE U-Net | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| DSE U-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| INC U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| INC U-Net | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| INC U-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| PPM U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| PPM U-Net | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| PPM U-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| NIV U$^2$-Net | Block | NIV-7 | NIV-6 | NIV-5 | NIV-4 | NIV-4F | NIV-4F | NIV-4F | NIV-4 | NIV-5 | NIV-6 | NIV-7 |
| NIV U$^2$-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| NIV U$^2$-Net | M | 32 | 32 | 64 | 128 | 256 | 256 | 256 | 128 | 64 | 32 | 16 |
| NIV U$^2$-Net | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| U$^2$-Net (Ours) | Block | RSU-7 | RSU-6 | RSU-5 | RSU-4 | RSU-4F | RSU-4F | RSU-4F | RSU-4 | RSU-5 | RSU-6 | RSU-7 |
| U$^2$-Net (Ours) | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| U$^2$-Net (Ours) | M | 32 | 32 | 64 | 128 | 256 | 256 | 256 | 128 | 64 | 32 | 16 |
| U$^2$-Net (Ours) | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| U$^2$-Net$^\dagger$ (Ours$^\dagger$) | Block | RSU-7 | RSU-6 | RSU-5 | RSU-4 | RSU-4F | RSU-4F | RSU-4F | RSU-4 | RSU-5 | RSU-6 | RSU-7 |
| U$^2$-Net$^\dagger$ (Ours$^\dagger$) | I | 3 | 64 | 64 | 64 | 64 | 64 | 128 | 128 | 128 | 128 | 128 |
| U$^2$-Net$^\dagger$ (Ours$^\dagger$) | M | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| U$^2$-Net$^\dagger$ (Ours$^\dagger$) | O | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |

3.3. Supervision

In the training process, we use deep supervision similar to HED [45]. Its effectiveness has been proven in HED and DSS. Our training loss is defined as:

$$L = \sum_{m=1}^{M} w_{side}^{(m)} \ell_{side}^{(m)} + w_{fuse}\,\ell_{fuse} \qquad (1)$$

where $\ell_{side}^{(m)}$ ($M = 6$, as the Sup1, Sup2, $\cdots$, Sup6 in Fig. 5) is the loss of the side output saliency map $S_{side}^{(m)}$ and $\ell_{fuse}$ (Sup7 in Fig. 5) is the loss of the final fusion output saliency map $S_{fuse}$. $w_{side}^{(m)}$ and $w_{fuse}$ are the weights of each loss term. For each term $\ell$, we use the standard binary cross-entropy to calculate the loss:

$$\ell = -\sum_{(r,c)}^{(H,W)} \left[ P_{G(r,c)} \log P_{S(r,c)} + (1 - P_{G(r,c)}) \log (1 - P_{S(r,c)}) \right] \qquad (2)$$

where $(r, c)$ is the pixel coordinates and $(H, W)$ is the image size: height and width. $P_{G(r,c)}$ and $P_{S(r,c)}$ denote the pixel values of the ground truth and the predicted saliency probability map, respectively. The training process tries to minimize the overall loss $L$ of Eq. (1). In the testing process, we choose the fusion output $S_{fuse}$ as our final saliency map.

4. Experimental Results

4.1. Datasets

Training dataset: We train our network on DUTS-TR, which is a part of the DUTS dataset [39]. DUTS-TR contains 10553 images in total. Currently, it is the largest and most frequently used training dataset for salient object detection. We augment this dataset by horizontal flipping to obtain 21106 training images offline.

Evaluation datasets: Six frequently used benchmark datasets are used to evaluate our method including: DUT-OMRON [47], DUTS-TE [39], HKU-IS [18], ECSSD [46], PASCAL-S [19], SOD [30]. DUT-OMRON includes 5168 images, most of which contain one or two structurally complex foreground objects. The DUTS dataset consists of two parts: DUTS-TR and DUTS-TE. As mentioned above, we use DUTS-TR for training. Hence, DUTS-TE, which contains 5019 images, is selected as one of our evaluation datasets. HKU-IS contains 4447 images with multiple foreground objects. ECSSD contains 1000 structurally complex images and many of them contain large foreground objects. PASCAL-S contains 850 images with complex foreground objects and cluttered background. SOD only contains 300 images, but it is very challenging because it was originally designed for image segmentation and many images are low contrast or contain complex foreground objects overlapping with the image boundary.

4.2. Evaluation Metrics

The outputs of deep salient object detection methods are usually probability maps that have the same spatial resolution as the input images. Each pixel of the predicted saliency maps has a value within the range of 0 and 1 (or [0, 255]). The ground truth is usually a binary mask, in which each pixel is either 0 or 1 (or 0 and 255), where 0 indicates the background pixels and 1 indicates the foreground salient object pixels.

To comprehensively evaluate the quality of those probability maps against the ground truth, six measures including (1) Precision-Recall (PR) curves, (2) maximal F-measure ($maxF_\beta$) [1], (3) Mean Absolute Error ($MAE$) [23, 33, 22], (4) weighted F-measure ($F_\beta^w$) [29], (5) structure measure ($S_m$) [8] and (6) relaxed F-measure of boundary ($relaxF_\beta^b$) [33] are used:

(1) PR curve is plotted based on a set of precision-recall pairs. Given a predicted saliency probability map, its precision and recall scores are computed by comparing its thresholded binary mask against the ground truth mask. The precision and recall of a dataset are computed by averaging the precision and recall scores of those saliency maps. By varying the thresholds from 0 to 1, we can obtain a set of average precision-recall pairs of the dataset.

(2) F-measure $F_\beta$ is used to comprehensively evaluate both precision and recall as:

$$F_\beta = \frac{(1+\beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall} \qquad (3)$$

We set $\beta^2$ to 0.3 and report the maximum $F_\beta$ ($maxF_\beta$) for each dataset similar to previous works [1, 23, 50].

(3) $MAE$ is the Mean Absolute Error which denotes the average per-pixel difference between a predicted saliency map and its ground truth mask. It is defined as:

$$MAE = \frac{1}{H \times W} \sum_{r=1}^{H} \sum_{c=1}^{W} |P(r,c) - G(r,c)| \qquad (4)$$

where $P$ and $G$ are the probability map of the salient object detection and the corresponding ground truth respectively, $(H, W)$ and $(r, c)$ are the (height, width) and the pixel coordinates.

(4) weighted F-measure ($F_\beta^w$) [29] is utilized as a complementary measure to $maxF_\beta$ for overcoming the possible unfair comparison caused by “interpolation flaw, dependency flaw and equal-importance flaw” [23]. It is defined as:

$$F_\beta^w = (1+\beta^2) \frac{Precision^w \cdot Recall^w}{\beta^2 \cdot Precision^w + Recall^w} \qquad (5)$$

(5) S-measure ($S_m$) is used to evaluate the structure similarity of the predicted non-binary saliency map and the ground truth. The S-measure is defined as the weighted sum of region-aware $S_r$ and object-aware $S_o$ structural similarity:

$$S_m = (1-\alpha) S_r + \alpha S_o \qquad (6)$$

where $\alpha$ is usually set to 0.5.

Table 2. Results of ablation study on different blocks, architectures and backbones.
“PLN”, “RES”, “DSE”, “INC”, “PPM” and “RSU” denote plain convolution block, residual block, dense block, inception block, pyramid pooling module and our residual U-block respectively. “NIV U$^2$-Net” denotes U-Net with each of its stages replaced by a naive U-Net block. The “Time (ms)” (ms: milliseconds) costs are computed by averaging the inference time costs of images in the ECSSD dataset. Values with bold fonts indicate the best two performance.

| Configuration | DUT-OMRON $maxF_\beta$ | DUT-OMRON $MAE$ | ECSSD $maxF_\beta$ | ECSSD $MAE$ | Time (ms) |
|---|---|---|---|---|---|
| Baseline U-Net | 0.725 | 0.082 | 0.896 | 0.066 | 14 |
| PLN U-Net | 0.782 | 0.062 | 0.928 | 0.043 | 16 |
| RES U-Net | 0.781 | 0.065 | 0.933 | 0.042 | 19 |
| DSE U-Net | 0.790 | 0.067 | 0.927 | 0.046 | 70 |
| INC U-Net | 0.777 | 0.069 | 0.921 | 0.047 | 57 |
| PPM U-Net | 0.792 | 0.062 | 0.928 | 0.049 | 105 |
| Stacked HourglassNet [31] | 0.756 | 0.073 | 0.905 | 0.059 | 103 |
| CU-NET [37] | 0.767 | 0.072 | 0.913 | 0.061 | 50 |
| NIV U$^2$-Net | 0.803 | 0.061 | 0.938 | 0.085 | 30 |
| U$^2$-Net w/ VGG-16 backbone | 0.808 | 0.063 | 0.942 | 0.038 | 23 |
| U$^2$-Net w/ ResNet-50 backbone | 0.813 | 0.058 | 0.937 | 0.041 | 41 |
| (Ours) RSU U$^2$-Net | 0.823 | 0.054 | 0.951 | 0.033 | 33 |
| (Ours$^\dagger$) RSU U$^2$-Net$^\dagger$ | 0.813 | 0.060 | 0.943 | 0.041 | 25 |

(6) relax boundary F-measure $relaxF_\beta^b$ [7] is utilized to quantitatively evaluate the boundary quality of the predicted saliency maps [33]. Given a saliency probability map $P \in [0, 1]$, its binary mask $P_{bw}$ is obtained by a simple thresholding operation (threshold is set to 0.5). Then, the $XOR(P_{bw}, P_{erd})$ operation is conducted to obtain its one pixel wide boundary, where $P_{erd}$ denotes the eroded binary mask [11] of $P_{bw}$. The boundaries of the ground truth mask are obtained in the same way. The computation of the relaxed boundary F-measure $relaxF_\beta^b$ is similar to equation (3). The difference is that $relaxPrecision^b$ and $relaxRecall^b$ rather than $Precision$ and $Recall$ are used in equation (3). The relaxed boundary precision ($relaxPrecision^b$) is defined as the fraction of predicted boundary pixels within a range of $\rho$ pixels from ground truth boundary pixels. The relaxed boundary recall ($relaxRecall^b$) is defined as the fraction of ground truth boundary pixels that are within $\rho$ pixels of predicted boundary pixels. The slack parameter $\rho$ is set to 3 as in the previous work [33]. Given a dataset, its average $relaxF_\beta^b$ of all predicted saliency maps is reported in this paper.

4.3. Implementation Details

In the training process, each image is first resized to $320\times320$ and randomly flipped vertically and cropped to $288\times288$. We are not using any existing backbones in our network. Hence, we train our network from scratch and all of our convolutional layers are initialized by Xavier [10]. The loss weights $w_{side}^{(m)}$ and $w_{fuse}$ are all set to 1. The Adam optimizer [16] is used to train our network and its hyperparameters are set to default (initial learning rate lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight decay=0). We train the network until the loss converges without using a validation set, which follows the previous methods [22, 23, 50]. After 600k iterations (with a batch size of 12), the training loss converges and the whole training process takes about 120 hours. During testing, the input images ($H \times W$) are resized to $320\times320$ and fed into the network to obtain the saliency maps. The predicted saliency maps with size of $320\times320$ are resized back to the original size of the input image ($H \times W$). Bilinear interpolation is used in both resizing processes. Our network is implemented based on PyTorch 0.4.0 [32]. Both training and testing are conducted on an eight-core, 16-thread PC with an AMD Ryzen 1800x 3.5 GHz CPU (32GB RAM) and a GTX 1080ti GPU (11GB memory). We will release our code later.

4.4. Ablation Study

To verify the effectiveness of our U$^2$-Net, ablation studies are conducted on the following three aspects: i) basic blocks, ii) architectures and iii) backbones. All the ablation studies follow the same implementation setup.

4.4.1 Ablation on Blocks

In the blocks ablation, the goal is to validate the effectiveness of our newly designed residual U-blocks (RSUs). Specifically, we fix the outside Encoder-Decoder architecture of our U$^2$-Net and replace its stages with other popular blocks including plain convolution blocks (PLN), residual-like blocks (RES), dense-like blocks (DSE), inception-like blocks (INC) and pyramid pooling module (PPM) rather than the RSU block, as shown in Fig. 2(a)-(d). Detailed configurations can be found in Table 1.

Table 2 shows the quantitative results of the ablation study. As we can see, the performance of the baseline U-Net is the worst, while PLN U-Net, RES U-Net, DSE U-Net, INC U-Net and PPM U-Net achieve better performance than the baseline U-Net because they are either deeper or have the capability of extracting multi-scale features. However, their performance is still inferior to both our full size U$^2$-Net and small version U$^2$-Net$^\dagger$. Particularly, our full size U$^2$-Net improves the $maxF_\beta$ by about 3.3% and 1.8%, and decreases the $MAE$ by over 12.9% and 21.4% against the second best model (in the blocks ablation study) on the DUT-OMRON and ECSSD datasets, respectively. Furthermore, our U$^2$-Net and U$^2$-Net$^\dagger$ increase the $maxF_\beta$ by 9.8% and 8.8% and decrease the $MAE$ by 34.1% and 27.0%, which are significant improvements, on the DUT-OMRON dataset against the baseline U-Net. On the ECSSD dataset, although the $maxF_\beta$ improvements (5.5%, 4.7%) of our U$^2$-Net and U$^2$-Net$^\dagger$ against the baseline U-Net are slightly less significant than those on DUT-OMRON, the improvements of $MAE$ are much greater (50.0%, 38.0%). Therefore, we believe that our newly designed residual U-block RSU is better than the others in this salient object detection task. Besides, there is no significant increase in time cost for our residual U-block (RSU) based U$^2$-Net architectures.

4.4.2 Ablation on Architectures

As we mentioned above, previous methods usually use cascaded ways to stack multiple similar structures for building more expressive models. One of the intuitions behind this idea is that multiple similar structures are able to refine the results gradually while reducing overfitting. Stacked HourglassNet [31] and CU-Net [37] are two representative models in this category. Therefore, we adapted the stacked HourglassNet and CU-Net to compare the performance between the cascaded architectures and our nested architectures. As shown in Table 2, both our full size U$^2$-Net and small size model U$^2$-Net$^\dagger$ outperform these two cascaded models. It is worth noting that both stacked HourglassNet and CU-Net utilize improved U-Net-like modules as their stacking sub-models. To further demonstrate the effective-

methods including one AlexNet based model: MDF; 10 VGG based models: UCF, Amulet, NLDF, DSS, RAS, PAGRN, BMPM, PiCANet, MLMS, AFNet; one DenseNet based model: MSWS; one ResNeXt based model: R$^3$Net; and seven ResNet based models: CapSal, SRM, DGRL, PiCANetR, CPD, PoolNet, BASNet. For fair comparison, we mainly use the salient object detection results provided by the authors. For the missing results on certain datasets of some methods, we run their released code with their trained models on their suggested environment settings.

4.5.1 Quantitative Comparison

Fig. 6 illustrates the precision-recall curves of our models (U$^2$-Net, 176.3 MB and U$^2$-Net$^\dagger$, 4.7 MB) and typical state-of-the-art methods on the six datasets. The curves are consistent with Tables 3 and 4, which demonstrate the state-of-the-art performance of our U$^2$-Net on DUT-OMRON, HKU-IS and ECSSD and competitive performance on other datasets. Tables 3 and 4 compare five (six including the model size) evaluation metrics and the model size of our proposed method with others.
As we can see, our U -Net achieves the ness of our nested architecture, we also illustrate the perfor- best performance on datasets DUT-OMRON, HKU-IS and mance of an U -Net based on naive U-blocks (NIV) other ECSSD in terms of almost all of the five evaluation metrics. than our newly proposed residual U-blocks. We can see that On DUTS-TE dataset our U -Net achieves the second best the NIV U -Net still achieves better performance than these overall performance, which is slightly inferior to PoolNet. two cascaded models. In addition, the nested architectures On PASCAL-S, the performance of our U -Net is slightly are faster than the cascaded ones. In summary, our nested inferior to AFNet, CPD and PoolNet. It is worth noting that architecture is able to achieve better performance than the U -Net achieves the second best performance in terms of cascaded architecture both in terms of accuracy and speed. the boundary quality evaluation metric relaxF . On SOD dataset, PoolNet performs the best and our U -Net is the second best in terms of the overall performance. 4.4.3 Ablation on Backbones 2 y Our U -Net is only 4.7 MB, which is currently the Different from the previous salient object detection mod- smallest model in the field of salient object detection. With els which use backbones (e.g. VGG, ResNet, etc.) as their much fewer number of parameters against other models, encoders, our newly proposed U -Net architecture is back- it still achieves surprisingly competitive performance. Al- bone free. To validate the backbone free design, we conduct though its performance is not as good as our full size U - ablation studies on replacing the encoder part of our full size Net, its small size will facilitate its applications in many U -Net with different backbones: VGG16 and ResNet50. computation and memory constrained environments. 
Practically, we adapt the backbones (VGG-16 and ResNet- 50) by adding an extra stage after their last convolutional 4.5.2 Qualitative Comparison: stages to achieve the same receptive fields with our origi- nal U -Net architecture design. As shown in Table 2, the To give an intuitive understanding of the promising perfor- models using backbones and our RSUs as decoders achieve mance of our models, we illustrate the sample results of our better performance than the previous ablations and compa- models and several other state-of-the-art methods in Fig. 7. 2 2 y rable performance against our small size U -Net. However, As we can see, our U -Net and U -Net are able to handle they are still inferior to our full size U -Net. Therefore, we different types of targets and produce accurate salient object believe that our backbones free design is more competitive detection results. than backbones-based design in this salient object detection The 1st and 2nd row of Fig. 7 show the results of small 2 2 task. and large objects. As we can observe, our U -Net and U - Net are able to produce accurate results on both small and 4.5. Comparison with State-of-the-arts large objects. Other models either prone to miss the small We compare our models (full size U -Net, 176.3 MB and target or produce large object with poor accuracy. The 3rd 2 y small size U -Net , 4.7 MB) with 20 state-of-the-art meth- row shows the results of target touching image borders. 
Our U$^2$-Net correctly segments all the regions. Although U$^2$-Net$^\dagger$ erroneously segments the bottom right hole, it is still much better than the other models. The 4th row demonstrates the performance of models in segmenting targets that consist of both large and thin structures. As we can see, most other models extract the large regions well while missing the cable-wise thin structure, except for AFNet (col (j)). The 5th row shows a tree with a relatively clean background of blue sky. It seems easy, but it is actually challenging for most of the models because of the complicated shape of the target. As we can see, our models segment both the trunk and the branches well, while others fail in segmenting the complicated tree branch region. Compared with the 5th row, the bench shown in the 6th row is more complex because of its hollow structure. Our U$^2$-Net produces a near perfect result. Although the bottom right of the prediction map of U$^2$-Net$^\dagger$ is imperfect, its overall performance on this target is much better than that of the other models. Besides, the results of our models are more homogeneous, with fewer gray areas, than those of models like PoolNet (col (f)), CPD (col (g)), PiCANetR (col (h)) and AFNet (col (j)). The 7th row shows that our models can produce results even finer than the ground truth. Labeling the small holes in the 7th image is burdensome and time-consuming. Hence, these repeated fine structures are usually ignored in the annotation process. Inferring the correct results from such imperfect labels is challenging, but our models show promising capability in segmenting these fine structures thanks to the well designed architectures for extracting and integrating high resolution local and low resolution global information. The 8th and 9th rows show the strong ability of our models in detecting targets with cluttered backgrounds and complicated foreground appearance. The 10th row shows that our models are able to segment multiple targets while capturing the details of the detected targets (see the gap region between the two pieces of sail of each sailboat). In summary, both our full size and small size models are able to handle various scenarios and produce high accuracy salient object detection results.

[Figure 6: six precision-recall plots, one per dataset (DUT-OMRON, DUTS-TE, HKU-IS, ECSSD, PASCAL-S, SOD). Axes: Recall (horizontal) and Precision (vertical), both from 0.5 to 1.0. Curves: Ours, Ours$^\dagger$, BASNet, PoolNet, CPD, PiCANetR, SRM, CapSal, R3Net+, MSWS, AFNet, MLMS, BMPM, DSS+, MDF.]

Figure 6. Precision-Recall curves of our models and other typical state-of-the-art models on six SOD datasets.

Table 3. Comparison of our method and 20 SOTA methods on DUT-OMRON, DUTS-TE, HKU-IS in terms of model size, maxF$_\beta$ (↑), MAE (↓), weighted $F^w_\beta$ (↑), structure measure $S_m$ (↑) and relax boundary F-measure $relaxF^b_\beta$ (↑). Red, Green, and Blue indicate the best, second best and third best performance. Each dataset column lists maxF$_\beta$, MAE, $F^w_\beta$, $S_m$, $relaxF^b_\beta$ in that order.

Method (venue) | Backbone | Size (MB) | DUT-OMRON (5168) | DUTS-TE (5019) | HKU-IS (4447)
MDF (TIP16) | AlexNet | 112.1 | 0.694 0.142 0.565 0.721 0.406 | 0.729 0.099 0.543 0.723 0.447 | 0.860 0.129 0.564 0.810 0.594
UCF (ICCV17) | VGG-16 | 117.9 | 0.730 0.120 0.573 0.760 0.480 | 0.773 0.112 0.596 0.777 0.518 | 0.888 0.062 0.779 0.875 0.679
Amulet (ICCV17) | VGG-16 | 132.6 | 0.743 0.098 0.626 0.781 0.528 | 0.778 0.084 0.658 0.796 0.568 | 0.897 0.051 0.817 0.886 0.716
NLDF+ (CVPR17) | VGG-16 | 428.0 | 0.753 0.080 0.634 0.770 0.514 | 0.813 0.065 0.710 0.805 0.591 | 0.902 0.048 0.838 0.879 0.694
DSS+ (CVPR17) | VGG-16 | 237.0 | 0.781 0.063 0.697 0.790 0.559 | 0.825 0.056 0.755 0.812 0.606 | 0.916 0.040 0.867 0.878 0.706
RAS (ECCV18) | VGG-16 | 81.0 | 0.786 0.062 0.695 0.814 0.615 | 0.831 0.059 0.740 0.828 0.656 | 0.913 0.045 0.843 0.887 0.748
PAGRN (CVPR18) | VGG-19 | - | 0.771 0.071 0.622 0.775 0.582 | 0.854 0.055 0.724 0.825 0.692 | 0.918 0.048 0.820 0.887 0.762
BMPM (CVPR18) | VGG-16 | - | 0.774 0.064 0.681 0.809 0.612 | 0.852 0.048 0.761 0.851 0.699 | 0.921 0.039 0.859 0.907 0.773
PiCANet (CVPR18) | VGG-16 | 153.3 | 0.794 0.068 0.691 0.826 0.643 | 0.851 0.054 0.747 0.851 0.704 | 0.921 0.042 0.847 0.906 0.784
MLMS (CVPR19) | VGG-16 | 263.0 | 0.774 0.064 0.681 0.809 0.612 | 0.852 0.048 0.761 0.851 0.699 | 0.921 0.039 0.859 0.907 0.773
AFNet (CVPR19) | VGG-16 | 143.0 | 0.797 0.057 0.717 0.826 0.635 | 0.862 0.046 0.785 0.855 0.714 | 0.923 0.036 0.869 0.905 0.772
MSWS (CVPR19) | Dense-169 | 48.6 | 0.718 0.109 0.527 0.756 0.362 | 0.767 0.908 0.586 0.749 0.376 | 0.856 0.084 0.685 0.818 0.438
R$^3$Net+ (IJCAI18) | ResNeXt | 215.0 | 0.795 0.063 0.728 0.817 0.599 | 0.828 0.058 0.763 0.817 0.601 | 0.915 0.036 0.877 0.895 0.740
CapSal (CVPR19) | ResNet-101 | - | 0.699 0.101 0.482 0.674 0.396 | 0.823 0.072 0.691 0.808 0.605 | 0.882 0.062 0.782 0.850 0.654
SRM (ICCV17) | ResNet-50 | 189.0 | 0.769 0.069 0.658 0.798 0.523 | 0.826 0.058 0.722 0.824 0.592 | 0.906 0.046 0.835 0.887 0.680
DGRL (CVPR18) | ResNet-50 | 646.1 | 0.779 0.063 0.697 0.810 0.584 | 0.834 0.051 0.760 0.836 0.656 | 0.913 0.037 0.865 0.897 0.744
PiCANetR (CVPR18) | ResNet-50 | 197.2 | 0.803 0.065 0.695 0.832 0.632 | 0.860 0.050 0.755 0.859 0.696 | 0.918 0.043 0.840 0.904 0.765
CPD (CVPR19) | ResNet-50 | 183.0 | 0.797 0.056 0.719 0.825 0.655 | 0.865 0.043 0.795 0.858 0.741 | 0.925 0.034 0.875 0.905 0.795
PoolNet (CVPR19) | ResNet-50 | 273.3 | 0.808 0.056 0.729 0.836 0.675 | 0.880 0.040 0.807 0.871 0.765 | 0.932 0.033 0.881 0.917 0.811
BASNet (CVPR19) | ResNet-34 | 348.5 | 0.805 0.056 0.751 0.836 0.694 | 0.860 0.047 0.803 0.853 0.758 | 0.928 0.032 0.889 0.909 0.807
U$^2$-Net (Ours) | RSU | 176.3 | 0.823 0.054 0.757 0.847 0.702 | 0.873 0.044 0.804 0.861 0.765 | 0.935 0.031 0.890 0.916 0.812
U$^2$-Net$^\dagger$ (Ours) | RSU | 4.7 | 0.813 0.060 0.731 0.837 0.676 | 0.852 0.054 0.763 0.847 0.723 | 0.928 0.037 0.867 0.908 0.794

Table 4. Comparison of our method and 20 SOTA methods on ECSSD, PASCAL-S, SOD in terms of model size, maxF$_\beta$ (↑), MAE (↓), weighted $F^w_\beta$ (↑), structure measure $S_m$ (↑) and relax boundary F-measure $relaxF^b_\beta$ (↑). Red, Green, and Blue indicate the best, second best and third best performance. Each dataset column lists maxF$_\beta$, MAE, $F^w_\beta$, $S_m$, $relaxF^b_\beta$ in that order.

Method (venue) | Backbone | Size (MB) | ECSSD (1000) | PASCAL-S (850) | SOD (300)
MDF (TIP16) | AlexNet | 112.1 | 0.832 0.105 0.705 0.776 0.472 | 0.759 0.142 0.589 0.696 0.343 | 0.746 0.192 0.508 0.643 0.311
UCF (ICCV17) | VGG-16 | 117.9 | 0.903 0.069 0.806 0.884 0.669 | 0.814 0.115 0.694 0.805 0.493 | 0.808 0.148 0.675 0.762 0.471
Amulet (ICCV17) | VGG-16 | 132.6 | 0.915 0.059 0.840 0.894 0.711 | 0.828 0.100 0.734 0.818 0.541 | 0.798 0.144 0.677 0.753 0.454
NLDF+ (CVPR17) | VGG-16 | 428.0 | 0.905 0.063 0.839 0.897 0.666 | 0.822 0.098 0.737 0.798 0.495 | 0.841 0.125 0.709 0.755 0.475
DSS+ (CVPR17) | VGG-16 | 237.0 | 0.921 0.052 0.872 0.882 0.696 | 0.831 0.093 0.759 0.798 0.499 | 0.846 0.124 0.710 0.743 0.444
RAS (ECCV18) | VGG-16 | 81.0 | 0.921 0.056 0.857 0.893 0.741 | 0.829 0.101 0.736 0.799 0.560 | 0.851 0.124 0.720 0.764 0.544
PAGRN (CVPR18) | VGG-19 | - | 0.927 0.061 0.834 0.889 0.747 | 0.847 0.090 0.738 0.822 0.594 | - - - - -
BMPM (CVPR18) | VGG-16 | - | 0.928 0.045 0.871 0.911 0.770 | 0.850 0.074 0.779 0.845 0.617 | 0.856 0.108 0.726 0.786 0.562
PiCANet (CVPR18) | VGG-16 | 153.3 | 0.931 0.046 0.865 0.914 0.784 | 0.856 0.078 0.772 0.848 0.612 | 0.854 0.103 0.722 0.789 0.572
MLMS (CVPR19) | VGG-16 | 263.0 | 0.928 0.045 0.871 0.911 0.770 | 0.855 0.074 0.779 0.844 0.620 | 0.856 0.108 0.726 0.786 0.562
AFNet (CVPR19) | VGG-16 | 143.0 | 0.935 0.042 0.887 0.914 0.776 | 0.863 0.070 0.798 0.849 0.626 | 0.856 0.111 0.723 0.774 -
MSWS (CVPR19) | Dense-169 | 48.6 | 0.878 0.096 0.716 0.828 0.411 | 0.786 0.133 0.614 0.768 0.289 | 0.800 0.167 0.573 0.700 0.231
R$^3$Net+ (IJCAI18) | ResNeXt | 215.0 | 0.934 0.040 0.902 0.910 0.759 | 0.834 0.092 0.761 0.807 0.538 | 0.850 0.125 0.735 0.759 0.431
CapSal (CVPR19) | ResNet-101 | - | 0.874 0.077 0.771 0.826 0.574 | 0.861 0.073 0.786 0.837 0.527 | 0.773 0.148 0.597 0.695 0.404
SRM (ICCV17) | ResNet-50 | 189.0 | 0.917 0.054 0.853 0.895 0.672 | 0.838 0.084 0.758 0.834 0.509 | 0.843 0.128 0.670 0.741 0.392
DGRL (CVPR18) | ResNet-50 | 646.1 | 0.925 0.042 0.883 0.906 0.753 | 0.848 0.074 0.787 0.839 0.569 | 0.848 0.106 0.731 0.773 0.502
PiCANetR (CVPR18) | ResNet-50 | 197.2 | 0.935 0.046 0.867 0.917 0.775 | 0.857 0.076 0.777 0.854 0.598 | 0.856 0.104 0.724 0.790 0.528
CPD (CVPR19) | ResNet-50 | 183.0 | 0.939 0.037 0.898 0.918 0.811 | 0.861 0.071 0.800 0.848 0.639 | 0.860 0.112 0.714 0.767 0.556
PoolNet (CVPR19) | ResNet-50 | 273.3 | 0.944 0.039 0.896 0.921 0.813 | 0.865 0.075 0.798 0.832 0.644 | 0.871 0.102 0.759 0.797 0.606
BASNet (CVPR19) | ResNet-34 | 348.5 | 0.942 0.037 0.904 0.916 0.826 | 0.856 0.076 0.798 0.838 0.660 | 0.851 0.113 0.730 0.769 0.603
U$^2$-Net (Ours) | RSU | 176.3 | 0.951 0.033 0.910 0.928 0.836 | 0.859 0.074 0.797 0.844 0.657 | 0.861 0.108 0.748 0.786 0.613
U$^2$-Net$^\dagger$ (Ours) | RSU | 4.7 | 0.943 0.041 0.885 0.918 0.808 | 0.849 0.086 0.768 0.831 0.627 | 0.841 0.124 0.697 0.759 0.559

Figure 7. Qualitative comparison of the proposed method with seven other SOTA methods: (a) image, (b) GT, (c) Ours, (d) Ours$^\dagger$, (e) BASNet, (f) PoolNet, (g) CPD, (h) PiCANetR, (i) R$^3$Net+, (j) AFNet, (k) DSS+, where ‘+’ indicates the CRF post-processing.

5. Conclusions

In this paper, we proposed a novel deep network, U$^2$-Net, for salient object detection. The main architecture of our U$^2$-Net is a two-level nested U-structure. The nested U-structure with our newly designed RSU blocks enables the network to capture richer local and global information from both shallow and deep layers regardless of the resolutions. Compared with SOD models built upon existing backbones, our U$^2$-Net is built purely on the proposed RSU blocks, which makes it possible to train it from scratch and to configure it to different model sizes according to the constraints of the target environment. We provide a full size U$^2$-Net (176.3 MB, 30 FPS) and a smaller version U$^2$-Net$^\dagger$ (4.7 MB, 40 FPS) in this paper. Experimental results on six public salient object detection datasets demonstrate that both models achieve very competitive performance against 20 other state-of-the-art methods in terms of both qualitative and quantitative measures.

Although our models achieve competitive results against other state-of-the-art methods, faster and smaller models are needed for computation and memory limited devices, such as mobile phones, robots, etc. In the near future, we will explore different techniques and architectures to further improve the speed and decrease the model size. In addition, larger and more diverse salient object datasets are needed to train more accurate and robust models.

Acknowledgments

This work is supported by the Alberta Innovates Graduate Student Scholarship and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants Program, NO.: 2016-06365.

References

[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1597–1604, 2009.
[2] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Trans. Image Processing, 24(12):5706–5722, 2015.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[4] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[6] Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. R3net: Recurrent residual refinement network for saliency detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 684–690. AAAI Press, 2018.
[7] Marc Ehrig and Jérôme Euzenat. Relaxed precision and recall for ontology matching. In Proc. K-Cap 2005 workshop on Integrating ontology, pages 25–32. No commercial editor., 2005.
[8] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4548–4557, 2017.
[9] Mengyang Feng, Huchuan Lu, and Errui Ding. Attentive feedback network for boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1623–1632, 2019.
[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 249–256, 2010.
[11] Robert M Haralick, Stanley R Sternberg, and Xinhua Zhuang. Image analysis using mathematical morphology. IEEE transactions on pattern analysis and machine intelligence, (4):532–550, 1987.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5300–5309, 2017.
[14] Xiaowei Hu, Lei Zhu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. Recurrently aggregating deep features for salient object detection. In AAAI, pages 6943–6950, 2018.
[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[18] Guanbin Li and Yizhou Yu. Visual saliency detection based on multiscale deep cnn features. IEEE Transactions on Image Processing, 25(11):5012–5024, 2016.
[19] Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–287, 2014.
[20] Jie Liang, Jun Zhou, Lei Tong, Xiao Bai, and Bin Wang. Material based salient object detection from hyperspectral images. Pattern Recognition, 76:476–490, 2018.
[21] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
[22] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3917–3926, 2019.
[23] Nian Liu, Junwei Han, and Ming-Hsuan Yang. Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3089–3098, 2018.
[24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[25] Shijian Lu and Joo-Hwee Lim. Saliency modeling from image histograms. In European Conference on Computer Vision, pages 321–332. Springer, 2012.
[26] Shijian Lu, Cheston Tan, and Joo-Hwee Lim. Robust and efficient saliency modeling from image co-occurrence histograms. IEEE transactions on pattern analysis and machine intelligence, 36(1):195–201, 2013.
[27] Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6593–6601, 2017.
[28] Ke Ma, Zhixin Shu, Xue Bai, Jue Wang, and Dimitris Samaras. Docunet: Document image unwarping via a stacked u-net. In CVPR, pages 4700–4709, 2018.
[29] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2014.
[30] Vida Movahedi and James H Elder. Design and perceptual validation of performance measures for salient object segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 49–56. IEEE, 2010.
[31] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Autodiff workshop on NIPS, 2017.
[33] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[37] Zhiqiang Tang, Xi Peng, Shijie Geng, Lingfei Wu, Shaoting Zhang, and Dimitris Metaxas. Quantized densely connected u-nets for efficient landmark localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 339–354, 2018.
[38] Zhiqiang Tang, Xi Peng, Shijie Geng, Yizhe Zhu, and Dimitris N Metaxas. Cu-net: coupled u-nets. arXiv preprint arXiv:1808.06521, 2018.
[39] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
[40] Tiantian Wang, Ali Borji, Lihe Zhang, Pingping Zhang, and Huchuan Lu. A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4039–4048, 2017.
[41] Tiantian Wang, Lihe Zhang, Shuo Wang, Huchuan Lu, Gang Yang, Xiang Ruan, and Ali Borji. Detect globally, refine locally: A novel approach to saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3135, 2018.
[42] Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, and Errui Ding. A mutual learning method for salient object detection with intertwined multi-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8150–8159, 2019.
[43] Zhe Wu, Li Su, and Qingming Huang. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2019.
[44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
[45] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
[46] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1162, 2013.
[47] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
[48] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, and Yizhou Yu. Multi-source weak supervision for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6074–6083, 2019.
[49] Jinxia Zhang, Krista A. Ehinger, Haikun Wei, Kanjian Zhang, and Jingyu Yang. A novel graph-based optimization framework for salient object detection. Pattern Recognition, 64:39–50, 2017.
[50] Lu Zhang, Ju Dai, Huchuan Lu, You He, and Gang Wang. A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1741–1750, 2018.
[51] Lu Zhang, Jianming Zhang, Zhe Lin, Huchuan Lu, and You He. Capsal: Leveraging captioning to boost semantics for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6024–6033, 2019.
[52] Pingping Zhang, Wei Liu, Huchuan Lu, and Chunhua Shen. Salient object detection by lossless feature reflection. In IJCAI, pages 1149–1155, 2018.
[53] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 202–211, 2017.
[54] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Baocai Yin. Learning uncertain convolutional features for accurate saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 212–221, 2017.
[55] Qiang Zhang, Zhen Huo, Yi Liu, Yunhui Pan, Caifeng Shan, and Jungong Han. Salient object detection employing a local tree-structured low-rank representation and foreground consistency. Pattern Recognition, 92:119–134, 2019.
[56] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 714–722, 2018.
[57] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[58] Yunzhi Zhuge, Yu Zeng, and Huchuan Lu. Deep embedding features for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9340–9347, 2019.

Computing Research Repository arXiv (Cornell University)
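The relaxed boundary F-measure described in Section 4.2 can be sketched in plain Python as follows. This is a minimal illustration and not the authors' evaluation code. Several details are assumptions, because the text does not fix them: erosion uses a 4-neighbour cross, pixels outside the image count as background, "within $\rho$ pixels" is measured with the Chebyshev distance, and equation (3) is taken to be the usual $F_\beta = (1+\beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$ with $\beta^2 = 0.3$. The names `boundary`, `fraction_within` and `relax_f` are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]
Pixel = Tuple[int, int]


def boundary(mask: Mask) -> Set[Pixel]:
    """One-pixel-wide boundary: foreground pixels removed by erosion,
    i.e. XOR of the binary mask and its eroded version."""
    h, w = len(mask), len(mask[0])
    edge: Set[Pixel] = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            # Assumed structuring element: 4-neighbour cross.
            # Neighbours outside the image are treated as background.
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            survives = all(
                0 <= y < h and 0 <= x < w and mask[y][x] for y, x in neighbours
            )
            if not survives:
                edge.add((r, c))
    return edge


def fraction_within(src: Set[Pixel], dst: Set[Pixel], rho: int) -> float:
    """Fraction of src pixels within rho pixels of some dst pixel
    (Chebyshev distance, an assumption)."""
    if not src:
        return 0.0
    hits = sum(
        1
        for (r, c) in src
        if any(max(abs(r - y), abs(c - x)) <= rho for (y, x) in dst)
    )
    return hits / len(src)


def relax_f(pred: List[List[float]], gt: Mask, rho: int = 3,
            beta2: float = 0.3, threshold: float = 0.5) -> float:
    """Relaxed boundary F-measure for one saliency map and its ground truth."""
    pred_bw = [[1 if p > threshold else 0 for p in row] for row in pred]
    pred_edge, gt_edge = boundary(pred_bw), boundary(gt)
    precision = fraction_within(pred_edge, gt_edge, rho)  # relaxPrecision^b
    recall = fraction_within(gt_edge, pred_edge, rho)     # relaxRecall^b
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


if __name__ == "__main__":
    # A 4x4 square inside a 6x6 image, predicted perfectly.
    gt = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)]
          for r in range(6)]
    pred = [[float(v) for v in row] for row in gt]
    print(relax_f(pred, gt))
```

The brute-force distance check is quadratic in the number of boundary pixels. For full-size images a distance transform or a dilation of the boundary map by $\rho$ pixels would be the more practical choice.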

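The baseline comparison in Section 4.4.1 can be checked against the Table 2 entries with a few lines of arithmetic. The sketch below suggests that the maxF$_\beta$ figures quoted as percentages correspond to absolute differences (percentage points), while the MAE figures correspond to relative reductions. That reading is an inference from the numbers and is not stated in the text. The function names are hypothetical.

```python
# Table 2 entries as (maxF, MAE) for each dataset.
TABLE2 = {
    "Baseline U-Net": {"DUT-OMRON": (0.725, 0.082), "ECSSD": (0.896, 0.066)},
    "U2-Net (full)": {"DUT-OMRON": (0.823, 0.054), "ECSSD": (0.951, 0.033)},
    "U2-Net (small)": {"DUT-OMRON": (0.813, 0.060), "ECSSD": (0.943, 0.041)},
}


def maxf_gain_points(model: str, reference: str, dataset: str) -> float:
    """Absolute maxF difference, expressed in percentage points."""
    gain = TABLE2[model][dataset][0] - TABLE2[reference][dataset][0]
    return round(gain * 100, 1)


def mae_reduction_percent(model: str, reference: str, dataset: str) -> float:
    """MAE reduction relative to the reference model, in percent."""
    ref_mae = TABLE2[reference][dataset][1]
    return round((ref_mae - TABLE2[model][dataset][1]) / ref_mae * 100, 1)


if __name__ == "__main__":
    for model in ("U2-Net (full)", "U2-Net (small)"):
        for dataset in ("DUT-OMRON", "ECSSD"):
            print(
                model,
                dataset,
                maxf_gain_points(model, "Baseline U-Net", dataset),
                mae_reduction_percent(model, "Baseline U-Net", dataset),
            )
```

Recomputed this way, the full size model gives 9.8 points and 34.1% on DUT-OMRON and 5.5 points and 50.0% on ECSSD, matching the text. The small model gives 26.8% and 37.9% for MAE, slightly below the quoted 27.0% and 38.0%, which may be due to rounding of the table entries.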

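The training-time augmentation from Section 4.3 (random vertical flip, then a random 288×288 crop of the 320×320 image) can be sketched as below. This is an illustration on nested lists rather than the authors' data loader. The flip probability of 0.5 is an assumption, the resize to 320×320 is assumed to have happened already, and the names `flip_vertical`, `random_crop` and `augment` are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def flip_vertical(img: Image) -> Image:
    """Reverse the row order, so the top of the image becomes the bottom."""
    return img[::-1]


def random_crop(img: Image, size: int, rng: random.Random) -> Image:
    """Cut a size x size window at a uniformly random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img: Image, rng: random.Random, crop: int = 288,
            p_flip: float = 0.5) -> Image:
    """Random vertical flip (assumed probability 0.5) followed by a random crop."""
    if rng.random() < p_flip:
        img = flip_vertical(img)
    return random_crop(img, crop, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[float(r * 320 + c) for c in range(320)] for r in range(320)]
    patch = augment(image, rng)
    print(len(patch), len(patch[0]))
```

In a real pipeline the same flip decision and crop window have to be applied to the ground truth mask, for example by drawing the random values once and reusing them for both arrays.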


eISSN
ARCH-3344
DOI
10.1016/j.patcog.2020.107404

Abstract

In this paper, we design a simple yet powerful deep network architecture, U$^2$-Net, for salient object detection (SOD). The architecture of our U$^2$-Net is a two-level nested U-structure. The design has the following advantages: (1) it is able to capture more contextual information from different scales thanks to the mixture of receptive fields of different sizes in our proposed ReSidual U-blocks (RSU), (2) it increases the depth of the whole architecture without significantly increasing the computational cost because of the pooling operations used in these RSU blocks. This architecture enables us to train a deep network from scratch without using backbones from image classification tasks. We instantiate two models of the proposed architecture, U$^2$-Net (176.3 MB, 30 FPS on GTX 1080Ti GPU) and U$^2$-Net$^\dagger$ (4.7 MB, 40 FPS), to facilitate the usage in different environments. Both models achieve competitive performance on six SOD datasets. The code is available: https://github.com/NathanUA/U-2-Net.

arXiv:2005.09007v3 [cs.CV] 8 Mar 2022

Figure 1. Comparison of model size and performance of our U$^2$-Net with other state-of-the-art SOD models. The maxF$_\beta$ measure is computed on dataset ECSSD [46]. The red star denotes our U$^2$-Net (Ours) (176.3 MB) and the blue star denotes our small version U$^2$-Net$^\dagger$ (Ours$^\dagger$) (4.7 MB).

1. Introduction

Salient Object Detection (SOD) aims at segmenting the most visually attractive objects in an image. It is widely used in many fields, such as visual tracking and image segmentation. Recently, with the development of deep convolutional neural networks (CNNs), especially the rise of Fully Convolutional Networks (FCN) [24] in image segmentation, the salient object detection has been improved significantly. It is natural to ask, what is still missing? Let's take a step back and look at the remaining challenges.

There is a common pattern in the design of most SOD networks [18, 27, 41, 6], that is, they focus on making good use of deep features extracted by existing backbones, such as Alexnet [17], VGG [35], ResNet [12], ResNeXt [44], DenseNet [15], etc. However, these backbones are all originally designed for image classification. They extract features that are representative of semantic meaning rather than local details and global contrast information, which are essential to saliency detection. And they need to be pre-trained on ImageNet [5] data which is data-inefficient especially if the target data follows a different distribution than ImageNet.

This leads to our first question: can we design a new network for SOD, that allows training from scratch and achieves comparable or better performance than those based on existing pre-trained backbones?

There are a few more issues on the network architectures for SOD. First, they are often overly complicated [58]. It is partially due to the additional feature aggregation modules that are added to the existing backbones to extract multi-level saliency features from these backbones. Secondly, the existing backbones usually achieve deeper architecture by sacrificing high resolution of feature maps [58]. To run these deep models with affordable memory and computational cost, the feature maps are down scaled to lower resolution at early stages. For instance, at the early layers of both ResNet and DenseNet [15], a convolution with stride of two followed by a maxpooling with stride of two are utilized to reduce the size of the feature maps to one fourth of the input maps. However, high resolution also plays an important role in segmentation besides the deep architecture [21].

Hence, our follow-up question is: can we go deeper while maintaining high resolution feature maps, at a low memory and computation cost?

Our main contribution is a novel and simple network architecture, called U$^2$-Net, that addresses the two questions above. First, U$^2$-Net is a two-level nested U-structure that is designed for SOD without using any pre-trained backbones from image classification. It can be trained from scratch to achieve competitive performance. Second, the novel architecture allows the network to go deeper, attain high resolution, without significantly increasing the memory and computation cost. This is achieved by a nested U-structure: on the bottom level, we design a novel ReSidual U-block (RSU), which is able to extract intra-stage multi-scale features without degrading the feature map resolution; on the top level, there is a U-Net like structure, in which each stage is filled by a RSU block. The two-level configuration results in a nested U-structure (see Fig. 5). Our U$^2$-Net (176.3 MB) achieves competitive performance against the state-of-the-art (SOTA) methods on six public datasets, and runs at real-time (30 FPS, with input size of 320×320×3) on a 1080Ti GPU. To facilitate the usage of our design in computation and memory constrained environments, we provide a small version of our U$^2$-Net, called U$^2$-Net$^\dagger$ (4.7 MB). The U$^2$-Net$^\dagger$ achieves competitive results against most of the SOTA models (see Fig. 1) at 40 FPS.

2. Related Works

In recent years, many deep salient object detection networks [22, 33] have been proposed. Compared with traditional methods [2] based on hand-crafted features like foreground consistency [49], hyperspectral information [20], superpixels' similarity [55], histograms [26, 25] and so on, deep salient object detection networks show more competitive performance.

Multi-level deep feature integration: Recent works [24, 45] have shown that features from multiple deep layers are able to generate better results [50]. Then, many strategies and methods for integrating and aggregating multi-level deep features are developed for SOD. Li et al. (MDF) [18] propose to feed an image patch around a target pixel to a network and then obtain a feature vector for describing the saliency of this pixel. Zhang et al. (Amulet) [53] predict saliency maps by aggregating multi-level features into different resolutions. Zhang et al. (UCF) [54] propose to reduce the checkerboard artifacts of deconvolution operators by introducing a reformulated dropout and a hybrid upsampling module. Luo et al. [27] design a saliency detection network (NLDF+) with a 4×5 grid architecture, in which deeper features are progressively integrated with shallower features. Zhang et al. (LFR) [52] predict saliency maps by extracting features from both original input images and their reflection images with a sibling architecture. Hou et al. (DSS+) [13] propose to integrate multi-level features by introducing short connections from deep layers to shallow layers. Chen et al. (RAS) [4] predict and refine saliency maps by iteratively using the side output saliency of a backbone network as the feature attention guidance. Zhang et al. (BMPM) [50] propose to integrate features from shallow and deep layers by a controlled bi-directional passing strategy. Deng et al. (R$^3$Net+) [6] alternately incorporate shallow and deep layers' features to refine the predicted saliency maps. Hu et al. (RADF+) [14] propose to detect salient objects by recurrently aggregating multi-level deep features. Wu et al. (MLMS) [42] improve the saliency detection accuracy by developing a novel Mutual Learning Module for better leveraging the correlation of boundaries and regions. Wu et al. [43] propose to use Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection. Deep methods in this category take advantage of the multi-level deep features extracted by backbone networks and greatly raise the bar of salient object detection against traditional methods.

Multi-scale feature extraction: As mentioned earlier, saliency detection requires both local and global information. A 3×3 filter is good for extracting local features at each layer. However, it is difficult to extract global information by simply enlarging the filter size because it will increase the number of parameters and computation costs dramatically. Many works pay more attention to extracting global context. Wang et al. (SRM) [40] adapt the pyramid pooling module [57] to capture global context and propose a multi-stage refinement mechanism for saliency maps refinement. Zhang et al. (PAGRN) [56] develop a spatial and a channel-wise attention module to obtain the global information of each layer and propose a progressive attention guidance mechanism to refine the saliency maps. Wang et al. (DGRL) [41] develop an inception-like [36] contextual weighting module to localize salient objects globally and then use a boundary refinement module to refine the saliency map locally. Liu et al. (PiCANet) [23] recurrently capture the local and global pixel-wise contextual attention and predict the saliency map by incorporating it with a U-Net architecture. Zhang et al. (CapSal) [51] design a local and global perception module to extract both local and global information from features extracted by backbone network. Zeng et al. (MSWS) [48] design an attention module to predict the spatial distribution of foreground objects over image regions meanwhile aggregate their features. Feng et al. (AFNet) [9] develop a global perception module and attentive feedback modules to better explore the structure of salient objects. Qin et al. (BASNet) [33] propose a predict-refine model by stacking two differently configured U-Nets

Figure 2.
Illustration of existing convolution blocks and our proposed residual U-block RSU: (a) Plain convolution block PLN, (b) Residual-like block RES, (c) Dense-like block DSE, (d) Inception-like block INC and (e) Our residual U-block RSU.

sequentially and a Hybrid loss for boundary-aware salient object detection. Liu et al. (PoolNet) [22] develop encoder-decoder architecture for salient object detection by introducing a global guidance module for extraction of global localization features and a multi-scale feature aggregation module adapted from pyramid pooling module for fusing global and fine-level features. In these methods, many inspiring modules are proposed to extract multi-scale features from multi-level deep features extracted from existing backbones. Diversified receptive fields and richer multi-scale contextual features introduced by these novel modules significantly improve the performance of salient object detection models.

In summary, multi-level deep feature integration methods mainly focus on developing better multi-level feature aggregation strategies. On the other hand, methods in the category of multi-scale feature extraction aim at designing new modules for extracting both local and global information from features obtained by backbone networks. As we can see, almost all of the aforementioned methods try to make better use of feature maps generated by the existing image classification backbones. Instead of developing and adding more complicated modules and strategies to use these backbones' features, we propose a novel and simple architecture, which directly extracts multi-scale features stage by stage, for salient object detection.

3. Proposed Method

First, we introduce the design of our proposed residual U-block and then describe the details of the nested U-architecture built with this block. The network supervision strategy and the training loss are described at the end of this section.

3.1. Residual U-blocks

Both local and global contextual information are very important for salient object detection and other segmentation tasks. In modern CNN designs, such as VGG, ResNet, DenseNet and so on, small convolutional filters with size of 1×1 or 3×3 are the most frequently used components for feature extraction. They are favored since they require less storage space and are computationally efficient. Figures 2(a)-(c) illustrate typical existing convolution blocks with small receptive fields. The output feature maps of shallow layers only contain local features because the receptive fields of 1×1 or 3×3 filters are too small to capture global information. To achieve more global information at high resolution feature maps from shallow layers, the most direct idea is to enlarge the receptive field. Fig. 2(d) shows an inception like block [50], which tries to extract both local and non-local features by enlarging the receptive fields using dilated convolutions [3]. However, conducting multiple dilated convolutions on the input feature map (especially in the early stage) with original resolution requires too much computation and memory resources. To decrease the computation costs, PoolNet [22] adapts the parallel configuration from pyramid pooling modules (PPM) [57], which uses small kernel filters on the downsampled feature maps rather than the dilated convolutions on the original size feature maps. But fusion of different scale features by direct upsampling and concatenation (or addition) may lead to degradation of high resolution features.

Figure 3. Comparison of the residual block and our RSU.

Figure 4. Computation costs (GFLOPS Giga Floating Point Operations) of different blocks shown in Fig. 2: the computation costs are calculated based on transferring an input feature map with dimension 320×320×3 to a 320×320×64 output feature map. “PLN”, “RES”, “DSE”, “INC” and “RSU” denote plain convolution block, residual block, dense block, inception block and our residual U-block respectively.

Inspired by U-Net [34], we propose a novel ReSidual U-block, RSU, to capture intra-stage multi-scale features. The structure of RSU-$L$($C_{in}$, $M$, $C_{out}$) is shown in Fig. 2(e), where $L$ is the number of layers in the encoder, $C_{in}$, $C_{out}$ denote input and output channels, and $M$ denotes the number of channels in the internal layers of RSU. Hence, our RSU mainly consists of three components:

(i) an input convolution layer, which transforms the input feature map $x$ ($H \times W \times C_{in}$) to an intermediate map $\mathcal{F}_1(x)$ with channel of $C_{out}$. This is a plain convolutional layer for local feature extraction.

(ii) a U-Net like symmetric encoder-decoder structure with height of $L$ which takes the intermediate feature map $\mathcal{F}_1(x)$ as input and learns to extract and encode the multi-scale contextual information $\mathcal{U}(\mathcal{F}_1(x))$. $\mathcal{U}$ represents the U-Net like structure as shown in Fig. 2(e). Larger $L$ leads to deeper residual U-block (RSU), more pooling operations, larger range of receptive fields and richer local and global features. Configuring this parameter enables extraction of multi-scale features from input feature maps with arbitrary spatial resolutions. The multi-scale features are extracted from gradually downsampled feature maps and encoded into high resolution feature maps by progressive upsampling, concatenation and convolution. This process mitigates the loss of fine details caused by direct upsampling with large scales.

(iii) a residual connection which fuses local features and the multi-scale features by the summation: $\mathcal{F}_1(x) + \mathcal{U}(\mathcal{F}_1(x))$.

To better illustrate the intuition behind our design, we compare our residual U-block (RSU) with the original residual block [12] in Fig. 3. The operation in the residual block can be summarized as $\mathcal{H}(x) = \mathcal{F}_2(\mathcal{F}_1(x)) + x$, where $\mathcal{H}(x)$ denotes the desired mapping of the input features $x$; $\mathcal{F}_2$, $\mathcal{F}_1$ stand for the weight layers, which are convolution operations in this setting. The main design difference between RSU and residual block is that RSU replaces the plain, single-stream convolution with a U-Net like structure, and replaces the original feature with the local feature transformed by a weight layer: $\mathcal{H}_{RSU}(x) = \mathcal{U}(\mathcal{F}_1(x)) + \mathcal{F}_1(x)$, where $\mathcal{U}$ represents the multi-layer U-structure illustrated in Fig. 2(e). This design change empowers the network to extract features from multiple scales directly from each residual block. More notably, the computation overhead due to the U-structure is small, since most operations are applied on the downsampled feature maps. This is illustrated in Fig. 4, where we show the computation cost comparison between RSU and other feature extraction modules in Fig. 2(a)-(d). The FLOPs of dense block (DSE), inception block (INC) and RSU all grow quadratically with the number of internal channels $M$. But RSU has a much smaller coefficient on the quadratic term, leading to an improved efficiency. Its computational overhead compared with plain convolution (PLN) and residual (RES) blocks, which are both linear w.r.t. $M$, is not significant.

Figure 5. Illustration of our proposed U$^2$-Net architecture. The main architecture is a U-Net like Encoder-Decoder, where each stage consists of our newly proposed residual U-block (RSU). For example, En_1 is based on our RSU block shown in Fig. 2(e).

3.2. Architecture of U$^2$-Net

Stacking multiple U-Net-like structures for different tasks has been explored for a while, e.g. stacked hourglass network [31], DocUNet [28], CU-Net [38] for pose estimation, etc. These methods usually stack U-Net-like structures sequentially to build cascaded models and can be summarized as "(U×$n$-Net)", where $n$ is the number of repeated U-Net modules. The issue is that the computation and the memory costs get magnified by $n$.
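The cost argument of Section 3.1 (most operations of the U-structure run on downsampled maps, and the cost is quadratic in the internal channel count $M$) can be checked with a back-of-the-envelope count. The sketch below is only an illustration under stated assumptions: it counts multiply-accumulates of 3×3 convolutions with $M$ input and $M$ output channels, assumes every pooling halves height and width, and models only a chain of encoder convolutions. The function names and the layer layout are hypothetical and are not the exact configuration of Fig. 2(e).

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates of one kernel x kernel convolution (stride 1, 'same' padding)."""
    return height * width * c_in * c_out * kernel * kernel


def full_resolution_macs(height, width, channels, depth):
    """Cost of `depth` channels->channels convolutions that all run at full resolution."""
    return depth * conv_macs(height, width, channels, channels)


def pooled_macs(height, width, channels, depth):
    """Cost of `depth` convolutions when a 2x pooling precedes every one after the first."""
    total = 0
    for level in range(depth):
        scale = 2 ** level  # assumption: each pooling halves height and width
        total += conv_macs(height // scale, width // scale, channels, channels)
    return total


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    h, w, m, depth = 320, 320, 32, 6
    flat = full_resolution_macs(h, w, m, depth)
    pooled = pooled_macs(h, w, m, depth)
    print(f"full resolution: {flat:,} MACs")
    print(f"with pooling:    {pooled:,} MACs ({pooled / flat:.1%} of full resolution)")
    # Doubling the channel count multiplies the cost by four (quadratic in M).
    print(pooled_macs(h, w, 2 * m, depth) // pooled)
```

Under these assumptions the pooled chain costs a small fraction of the full-resolution chain at the same depth, which is the intuition behind the small quadratic coefficient reported for RSU in Fig. 4. By contrast, repeating a whole U-Net $n$ times, as in the cascaded designs above, multiplies the total cost by $n$.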
Detailed configuration of the RSU block of each stage in Fig. 5 is given in the last two rows of Table 1.

In this paper, we propose a different formulation, U$^n$-Net, of stacking U-structure for salient object detection. Our exponential notation refers to nested U-structure rather than cascaded stacking. Theoretically, the exponent $n$ can be set as an arbitrary positive integer to achieve single-level or multi-level nested U-structure. But architectures with too many nested levels will be too complicated to be implemented and employed in real applications.

Here, we set $n$ as 2 to build our U$^2$-Net. Our U$^2$-Net is a two-level nested U-structure shown in Fig. 5. Its top level is a big U-structure consisting of 11 stages (cubes in Fig. 5). Each stage is filled by a well configured residual U-block (RSU) (bottom level U-structure). Hence, the nested U-structure enables the extraction of intra-stage multi-scale features and aggregation of inter-stage multi-level features more efficiently.

As illustrated in Fig. 5, the U$^2$-Net mainly consists of three parts: (1) a six stages encoder, (2) a five stages decoder and (3) a saliency map fusion module attached with the decoder stages and the last encoder stage:

(i) In encoder stages En_1, En_2, En_3 and En_4, we use residual U-blocks RSU-7, RSU-6, RSU-5 and RSU-4, respectively. As mentioned before, “7”, “6”, “5” and “4” denote the heights ($L$) of RSU blocks. The $L$ is usually configured according to the spatial resolution of the input feature maps. For feature maps with large height and width, we use greater $L$ to capture more large scale information. The resolution of feature maps in En_5 and En_6 is relatively low, and further downsampling of these feature maps leads to loss of useful context. Hence, in both En_5 and En_6 stages, RSU-4F are used, where “F” means that the RSU is a dilated version, in which we replace the pooling and upsampling operations with dilated convolutions (see Fig. 5). That means all of the intermediate feature maps of RSU-4F have the same resolution as its input feature maps.

(ii) The decoder stages have similar structures to their symmetrical encoder stages with respect to En_6. In De_5, we also use the dilated version residual U-block RSU-4F which is similar to that used in the encoder stages En_5 and En_6. Each decoder stage takes the concatenation of the upsampled feature maps from its previous stage and those from its symmetrical encoder stage as the input, see Fig. 5.

(iii) The last part is the saliency map fusion module which is used to generate saliency probability maps. Similar to HED [45], our U$^2$-Net first generates six side output saliency probability maps $S_{side}^{(6)}$, $S_{side}^{(5)}$, $S_{side}^{(4)}$, $S_{side}^{(3)}$, $S_{side}^{(2)}$, $S_{side}^{(1)}$ from stages En_6, De_5, De_4, De_3, De_2 and De_1 by a 3×3 convolution layer and a sigmoid function. Then, it upsamples the logits (convolution outputs before sigmoid functions) of the side output saliency maps to the input image size and fuses them with a concatenation operation followed by a 1×1 convolution layer and a sigmoid function to generate the final saliency probability map $S_{fuse}$ (see bottom right of Fig. 5).

In summary, the design of our U$^2$-Net allows having deep architecture with rich multi-scale features and relatively low computation and memory costs. In addition, since our U$^2$-Net architecture is only built upon our RSU blocks without using any pre-trained backbones adapted from image classification, it is flexible and easy to be adapted to different working environments with insignificant performance loss. In this paper, we provide two instances of our U$^2$-Net by using different configurations of filter numbers: a normal version U$^2$-Net (176.3 MB) and a relatively smaller version U$^2$-Net$^\dagger$ (4.7 MB). Detailed configurations are presented in the last two rows of Table 1.

Table 1. Detailed configurations of different architectures used in ablation study. “PLN”, “RES”, “DSE”, “INC”, “PPM” and “RSU” denote plain convolution block, residual block, dense block, inception block, Pyramid Pooling Module and our residual U-block respectively. “NIV U$^2$-Net” denotes U-Net with its each stage replaced by a naive U-Net block. “I”, “M” and “O” indicate the number of input channels ($C_{in}$), middle channels and output channels ($C_{out}$) of each block. “En_i” and “De_j” denote the encoder and decoder stages respectively. The number “L” in “NIV-L” and “RSU-L” denotes the height of the naive U-block and our residual U-block.

| Architecture | | En_1 | En_2 | En_3 | En_4 | En_5 | En_6 | De_5 | De_4 | De_3 | De_2 | De_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PLN U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| RES U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| DSE U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| INC U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| PPM U-Net | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 32 | 32 | 64 | 128 | 128 | 128 | 128 | 64 | 32 | 16 | 16 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| NIV U$^2$-Net | Block | NIV-7 | NIV-6 | NIV-5 | NIV-4 | NIV-4F | NIV-4F | NIV-4F | NIV-4 | NIV-5 | NIV-6 | NIV-7 |
| | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 32 | 32 | 64 | 128 | 256 | 256 | 256 | 128 | 64 | 32 | 16 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| U$^2$-Net (Ours) | Block | RSU-7 | RSU-6 | RSU-5 | RSU-4 | RSU-4F | RSU-4F | RSU-4F | RSU-4 | RSU-5 | RSU-6 | RSU-7 |
| | I | 3 | 64 | 128 | 256 | 512 | 512 | 1024 | 1024 | 512 | 256 | 128 |
| | M | 32 | 32 | 64 | 128 | 256 | 256 | 256 | 128 | 64 | 32 | 16 |
| | O | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 256 | 128 | 64 | 64 |
| U$^2$-Net$^\dagger$ (Ours$^\dagger$) | Block | RSU-7 | RSU-6 | RSU-5 | RSU-4 | RSU-4F | RSU-4F | RSU-4F | RSU-4 | RSU-5 | RSU-6 | RSU-7 |
| | I | 3 | 64 | 64 | 64 | 64 | 64 | 128 | 128 | 128 | 128 | 128 |
| | M | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| | O | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |

3.3. Supervision

In the training process, we use deep supervision similar to HED [45]. Its effectiveness has been proven in HED and DSS. Our training loss is defined as:

$$\mathcal{L} = \sum_{m=1}^{M} w_{side}^{(m)} \ell_{side}^{(m)} + w_{fuse}\,\ell_{fuse} \qquad (1)$$

where $\ell_{side}^{(m)}$ ($M = 6$, as the Sup1, Sup2, ..., Sup6 in Fig. 5) is the loss of the side output saliency map $S_{side}^{(m)}$ and $\ell_{fuse}$ (Sup7 in Fig. 5) is the loss of the final fusion output saliency map $S_{fuse}$. $w_{side}^{(m)}$ and $w_{fuse}$ are the weights of each loss term. For each term $\ell$, we use the standard binary cross-entropy to calculate the loss:

$$\ell = -\sum_{(r,c)}^{(H,W)} \left[ P_{G(r,c)} \log P_{S(r,c)} + (1 - P_{G(r,c)}) \log (1 - P_{S(r,c)}) \right] \qquad (2)$$

where $(r, c)$ is the pixel coordinates and $(H, W)$ is image size: height and width. $P_{G(r,c)}$ and $P_{S(r,c)}$ denote the pixel values of the ground truth and the predicted saliency probability map, respectively. The training process tries to minimize the overall loss $\mathcal{L}$ of Eq. (1). In the testing process, we choose the fusion output $S_{fuse}$ as our final saliency map.

4. Experimental Results

4.1. Datasets

Training dataset: We train our network on DUTS-TR, which is a part of DUTS dataset [39]. DUTS-TR contains 10553 images in total. Currently, it is the largest and most frequently used training dataset for salient object detection. We augment this dataset by horizontal flipping to obtain 21106 training images offline.

Evaluation datasets: Six frequently used benchmark datasets are used to evaluate our method including: DUT-OMRON [47], DUTS-TE [39], HKU-IS [18], ECSSD [46], PASCAL-S [19], SOD [30]. DUT-OMRON includes 5168 images, most of which contain one or two structurally complex foreground objects. DUTS dataset consists of two parts: DUTS-TR and DUTS-TE. As mentioned above we use DUTS-TR for training. Hence, DUTS-TE, which contains 5019 images, is selected as one of our evaluation datasets. HKU-IS contains 4447 images with multiple foreground objects. ECSSD contains 1000 structurally complex images and many of them contain large foreground objects. PASCAL-S contains 850 images with complex foreground objects and cluttered background. SOD only contains 300 images, but it is very challenging because it was originally designed for image segmentation and many images are low contrast or contain complex foreground objects overlapping with the image boundary.

4.2. Evaluation Metrics

The outputs of the deep salient object methods are usually probability maps that have the same spatial resolution as the input images. Each pixel of the predicted saliency maps has a value within the range of 0 and 1 (or [0, 255]). The ground truth are usually binary masks, in which each pixel is either 0 or 1 (or 0 and 255) where 0 indicates the background pixels and 1 indicates the foreground salient object pixels.

To comprehensively evaluate the quality of those probability maps against the ground truth, six measures including (1) Precision-Recall (PR) curves, (2) maximal F-measure (maxF$_\beta$) [1], (3) Mean Absolute Error (MAE) [23, 33, 22], (4) weighted F-measure ($F_\beta^w$) [29], (5) structure measure ($S_m$) [8] and (6) relaxed F-measure of boundary (relaxF$_\beta^b$) [33] are used:

(1) PR curve is plotted based on a set of precision-recall pairs. Given a predicted saliency probability map, its precision and recall scores are computed by comparing its thresholded binary mask against the ground truth mask. The precision and recall of a dataset are computed by averaging the precision and recall scores of those saliency maps. By varying the thresholds from 0 to 1, we can obtain a set of average precision-recall pairs of the dataset.

(2) F-measure $F_\beta$ is used to comprehensively evaluate both precision and recall as:

$$F_\beta = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall} \qquad (3)$$

We set the $\beta^2$ to 0.3 and report the maximum $F_\beta$ (maxF$_\beta$) for each dataset similar to previous works [1, 23, 50].

(3) MAE is the Mean Absolute Error which denotes the average per-pixel difference between a predicted saliency map and its ground truth mask. It is defined as:

$$MAE = \frac{1}{H \times W} \sum_{r=1}^{H} \sum_{c=1}^{W} \left| P(r, c) - G(r, c) \right| \qquad (4)$$

where $P$ and $G$ are the probability map of the salient object detection and the corresponding ground truth respectively, $(H, W)$ and $(r, c)$ are the (height, width) and the pixel coordinates.

(4) weighted F-measure ($F_\beta^w$) [29] is utilized as a complementary measure to maxF$_\beta$ for overcoming the possible unfair comparison caused by “interpolation flaw, dependency flaw and equal-importance flaw” [23]. It is defined as:

$$F_\beta^w = (1 + \beta^2) \frac{Precision^w \cdot Recall^w}{\beta^2 \cdot Precision^w + Recall^w} \qquad (5)$$

(5) S-measure ($S_m$) is used to evaluate the structure similarity of the predicted non-binary saliency map and the ground truth. The S-measure is defined as the weighted sum of region-aware $S_r$ and object-aware $S_o$ structural similarity:

$$S = (1 - \alpha) S_r + \alpha S_o \qquad (6)$$

where $\alpha$ is usually set to 0.5.

Table 2. Results of ablation study on different blocks, architectures and backbones.
“PLN”, “RES”, “DSE”, “INC”, “PPM” and “RSU” denote plain convolution block, residual block, dense block, inception block, pyramid pooling module and our residual U-block respectively. “NIV U$^2$-Net” denotes U-Net with its each stage replaced by a naive U-Net block. The “Time (ms)” (ms: milliseconds) costs are computed by averaging the inference time costs of images in ECSSD dataset. Values with bold fonts indicate the best two performance.

| Configuration | DUT-OMRON maxF$_\beta$ | DUT-OMRON MAE | ECSSD maxF$_\beta$ | ECSSD MAE | Time (ms) |
|---|---|---|---|---|---|
| Baseline U-Net | 0.725 | 0.082 | 0.896 | 0.066 | 14 |
| PLN U-Net | 0.782 | 0.062 | 0.928 | 0.043 | 16 |
| RES U-Net | 0.781 | 0.065 | 0.933 | 0.042 | 19 |
| DSE U-Net | 0.790 | 0.067 | 0.927 | 0.046 | 70 |
| INC U-Net | 0.777 | 0.069 | 0.921 | 0.047 | 57 |
| PPM U-Net | 0.792 | 0.062 | 0.928 | 0.049 | 105 |
| Stacked HourglassNet [31] | 0.756 | 0.073 | 0.905 | 0.059 | 103 |
| CU-NET [37] | 0.767 | 0.072 | 0.913 | 0.061 | 50 |
| NIV U$^2$-Net | 0.803 | 0.061 | 0.938 | 0.085 | 30 |
| U$^2$-Net w/ VGG-16 backbone | 0.808 | 0.063 | 0.942 | 0.038 | 23 |
| U$^2$-Net w/ ResNet-50 backbone | 0.813 | 0.058 | 0.937 | 0.041 | 41 |
| (Ours) RSU U$^2$-Net | 0.823 | 0.054 | 0.951 | 0.033 | 33 |
| (Ours$^\dagger$) RSU U$^2$-Net$^\dagger$ | 0.813 | 0.060 | 0.943 | 0.041 | 25 |

(6) relax boundary F-measure relaxF$_\beta^b$ [7] is utilized to quantitatively evaluate boundaries' quality of the predicted saliency maps [33]. Given a saliency probability map $P \in [0, 1]$, its binary mask $P_{bw}$ is obtained by a simple thresholding operation (threshold is set to 0.5). Then, the $XOR(P_{bw}, P_{erd})$ operation is conducted to obtain its one pixel wide boundary, where $P_{erd}$ denotes the eroded binary mask [11] of $P_{bw}$. The boundaries of ground truth mask are obtained in the same way. The computation of relaxed boundary F-measure relaxF$_\beta^b$ is similar to equation (3). The difference is that relaxPrecision$^b$ and relaxRecall$^b$ rather than Precision and Recall are used in equation (3). The definition of relaxed boundary precision (relaxPrecision$^b$) is the fraction of predicted boundary pixels within a range of $\rho$ pixels from ground truth boundary pixels. The relaxed boundary recall (relaxRecall$^b$) is defined as the fraction of ground truth boundary pixels that are within $\rho$ pixels of predicted boundary pixels. The slack parameter $\rho$ is set to 3 as in the previous work [33]. Given a dataset, its average relaxF$_\beta^b$ of all predicted saliency maps is reported in this paper.

4.3. Implementation Details

In the training process, each image is first resized to 320×320 and randomly flipped vertically and cropped to 288×288. We are not using any existing backbones in our network. Hence, we train our network from scratch and all of our convolutional layers are initialized by Xavier [10]. The loss weights $w_{side}^{(m)}$ and $w_{fuse}$ are all set to 1. Adam optimizer [16] is used to train our network and its hyper parameters are set to default (initial learning rate lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight decay=0). We train the network until the loss converges without using validation set which follows the previous methods [22, 23, 50]. After 600k iterations (with a batch size of 12), the training loss converges and the whole training process takes about 120 hours. During testing, the input images ($H \times W$) are resized to 320×320 and fed into the network to obtain the saliency maps. The predicted saliency maps with size of 320×320 are resized back to the original size of the input image ($H \times W$). Bilinear interpolation is used in both resizing processes. Our network is implemented based on Pytorch 0.4.0 [32]. Both training and testing are conducted on an eight-core, 16 threads PC with an AMD Ryzen 1800x 3.5 GHz CPU (32GB RAM) and a GTX 1080ti GPU (11GB memory). We will release our code later.

4.4. Ablation Study

To verify the effectiveness of our U$^2$-Net, ablation studies are conducted on the following three aspects: i) basic blocks, ii) architectures and iii) backbones. All the ablation studies follow the same implementation setup.

4.4.1 Ablation on Blocks

In the blocks ablation, the goal is to validate the effectiveness of our newly designed residual U-blocks (RSUs). Specifically, we fix the outside Encoder-Decoder architecture of our U$^2$-Net and replace its stages with other popular blocks including plain convolution blocks (PLN), residual-like blocks (RES), dense-like blocks (DSE), inception-like blocks (INC) and pyramid pooling module (PPM) rather than RSU block, as shown in Fig. 2(a)-(d). Detailed configurations can be found in Table 1.

Table 2 shows the quantitative results of the ablation study. As we can see, the performance of baseline U-Net is the worst, while PLN U-Net, RES U-Net, DSE U-Net, INC U-Net and PPM U-Net achieve better performance than the baseline U-Net, because they are either deeper or have the capability of extracting multi-scale features. However, their performance is still inferior to both our full size U$^2$-Net and small version U$^2$-Net$^\dagger$. Particularly, our full size U$^2$-Net improves the maxF$_\beta$ about 3.3% and 1.8%, and decreases the MAE over 12.9% and 21.4% against the second best model (in the blocks ablation study) on DUT-OMRON and ECSSD datasets, respectively. Furthermore, our U$^2$-Net and U$^2$-Net$^\dagger$ increase the maxF$_\beta$ by 9.8% and 8.8% and decrease the MAE by 34.1% and 27.0%, which are significant improvements, on DUT-OMRON dataset against the baseline U-Net. On ECSSD dataset, although the maxF$_\beta$ improvements (5.5%, 4.7%) of our U$^2$-Net and U$^2$-Net$^\dagger$ against the baseline U-Net are slightly less significant than those on DUT-OMRON, the improvements of MAE are much greater (50.0%, 38.0%). Therefore, we believe that our newly designed residual U-block RSU is better than others in this salient object detection task. Besides, there is no significant increase in time costs for our residual U-block (RSU) based U$^2$-Net architectures.

4.4.2 Ablation on Architectures

As we mentioned above, previous methods usually use cascaded ways to stack multiple similar structures for building more expressive models. One of the intuitions behind this idea is that multiple similar structures are able to refine the results gradually while reducing overfitting. Stacked HourglassNet [31] and CU-Net [37] are two representative models in this category. Therefore, we adapted the stacked HourglassNet and CU-Net to compare the performance between the cascaded architectures and our nested architectures. As shown in Table 2, both our full size U$^2$-Net and small size model U$^2$-Net$^\dagger$ outperform these two cascaded models. It is worth noting that both stacked HourglassNet and CU-Net utilize improved U-Net-like modules as their stacking sub-models. To further demonstrate the effectiveness of our nested architecture, we also illustrate the performance of a U$^2$-Net based on naive U-blocks (NIV) rather than our newly proposed residual U-blocks. We can see that the NIV U$^2$-Net still achieves better performance than these two cascaded models. In addition, the nested architectures are faster than the cascaded ones. In summary, our nested architecture is able to achieve better performance than the cascaded architecture both in terms of accuracy and speed.

4.4.3 Ablation on Backbones

Different from the previous salient object detection models which use backbones (e.g. VGG, ResNet, etc.) as their encoders, our newly proposed U$^2$-Net architecture is backbone free. To validate the backbone free design, we conduct ablation studies on replacing the encoder part of our full size U$^2$-Net with different backbones: VGG16 and ResNet50.

ods including one AlexNet based model: MDF; 10 VGG based models: UCF, Amulet, NLDF, DSS, RAS, PAGRN, BMPM, PiCANet, MLMS, AFNet; one DenseNet based model: MSWS; one ResNeXt based model: R$^3$Net; and seven ResNet based models: CapSal, SRM, DGRL, PiCANetR, CPD, PoolNet, BASNet. For fair comparison, we mainly use the salient object detection results provided by the authors. For the missing results on certain datasets of some methods, we run their released code with their trained models on their suggested environment settings.

4.5.1 Quantitative Comparison

Fig. 6 illustrates the precision-recall curves of our models (U$^2$-Net, 176.3 MB and U$^2$-Net$^\dagger$, 4.7 MB) and typical state-of-the-art methods on the six datasets. The curves are consistent with Tables 3 and 4, which demonstrate the state-of-the-art performance of our U$^2$-Net on DUT-OMRON, HKU-IS and ECSSD and competitive performance on other datasets. Tables 3 and 4 compare five evaluation metrics (six including the model size) and the model size of our proposed method with others. As we can see, our U$^2$-Net achieves the best performance on datasets DUT-OMRON, HKU-IS and ECSSD in terms of almost all of the five evaluation metrics. On DUTS-TE dataset our U$^2$-Net achieves the second best overall performance, which is slightly inferior to PoolNet. On PASCAL-S, the performance of our U$^2$-Net is slightly inferior to AFNet, CPD and PoolNet. It is worth noting that U$^2$-Net achieves the second best performance in terms of the boundary quality evaluation metric relaxF$_\beta^b$. On SOD dataset, PoolNet performs the best and our U$^2$-Net is the second best in terms of the overall performance.

Our U$^2$-Net$^\dagger$ is only 4.7 MB, which is currently the smallest model in the field of salient object detection. With far fewer parameters than other models, it still achieves surprisingly competitive performance. Although its performance is not as good as our full size U$^2$-Net, its small size will facilitate its applications in many computation and memory constrained environments.
Practically, we adapt the backbones (VGG-16 and ResNet- 50) by adding an extra stage after their last convolutional 4.5.2 Qualitative Comparison: stages to achieve the same receptive fields with our origi- nal U -Net architecture design. As shown in Table 2, the To give an intuitive understanding of the promising perfor- models using backbones and our RSUs as decoders achieve mance of our models, we illustrate the sample results of our better performance than the previous ablations and compa- models and several other state-of-the-art methods in Fig. 7. 2 2 y rable performance against our small size U -Net. However, As we can see, our U -Net and U -Net are able to handle they are still inferior to our full size U -Net. Therefore, we different types of targets and produce accurate salient object believe that our backbones free design is more competitive detection results. than backbones-based design in this salient object detection The 1st and 2nd row of Fig. 7 show the results of small 2 2 task. and large objects. As we can observe, our U -Net and U - Net are able to produce accurate results on both small and 4.5. Comparison with State-of-the-arts large objects. Other models either prone to miss the small We compare our models (full size U -Net, 176.3 MB and target or produce large object with poor accuracy. The 3rd 2 y small size U -Net , 4.7 MB) with 20 state-of-the-art meth- row shows the results of target touching image borders. 
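The precision-recall curves of Fig. 6 and the $maxF_\beta$ and MAE columns of Tables 3 and 4 rest on a few simple per-image quantities. The following is a minimal sketch of those quantities, not the authors' evaluation code. It assumes saliency values in [0, 1] stored as a flat list, a binary ground truth mask of the same length, 256 uniform thresholds and $\beta^2 = 0.3$; the function names `mae`, `pr_curve` and `max_f` are hypothetical.

```python
from typing import List, Sequence, Tuple

BETA_SQ = 0.3  # assumed F-measure weighting (beta squared)


def mae(pred: Sequence[float], gt: Sequence[int]) -> float:
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def pr_curve(pred: Sequence[float], gt: Sequence[int],
             levels: int = 256) -> List[Tuple[float, float]]:
    """Return (precision, recall) for each threshold t / (levels - 1)."""
    positives = sum(gt)
    curve = []
    for t in range(levels):
        thr = t / (levels - 1)
        detected = [p >= thr for p in pred]
        true_pos = sum(1 for d, g in zip(detected, gt) if d and g)
        n_detected = sum(detected)
        # Assumed convention: precision is 1.0 when nothing is detected.
        precision = true_pos / n_detected if n_detected else 1.0
        recall = true_pos / positives if positives else 0.0
        curve.append((precision, recall))
    return curve


def max_f(curve: Sequence[Tuple[float, float]]) -> float:
    """Maximum F-measure over all (precision, recall) pairs of a curve."""
    best = 0.0
    for precision, recall in curve:
        denom = BETA_SQ * precision + recall
        if denom > 0:
            best = max(best, (1 + BETA_SQ) * precision * recall / denom)
    return best


if __name__ == "__main__":
    toy_pred = [0.9, 0.8, 0.3, 0.1]  # toy 4-pixel saliency map
    toy_gt = [1, 1, 0, 0]            # toy 4-pixel ground truth
    print(mae(toy_pred, toy_gt), max_f(pr_curve(toy_pred, toy_gt)))
```

For a whole dataset, precision and recall are usually averaged over all images at each threshold before the maximum is taken; the sketch above only covers a single image.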
Figure 6. Precision-Recall curves of our models and other typical state-of-the-art models on six SOD datasets. [Six panels: DUT-OMRON, DUTS-TE, HKU-IS, ECSSD, PASCAL-S, SOD. Axes: Recall (horizontal) and Precision (vertical), both from 0.5 to 1.0. Legend: Ours, Ours$^\dagger$, BASNet, PoolNet, CPD, PiCANetR, SRM, CapSal, R$^3$Net+, MSWS, AFNet, MLMS, BMPM, DSS+, MDF.]

Our U$^2$-Net correctly segments all the regions. Although U$^2$-Net$^\dagger$ erroneously segments the bottom right hole, it is still much better than other models. The 4th row demonstrates the performance of models in segmenting targets that consist of both large and thin structures. As we can see, most of the other models extract large regions well while missing the cable-wise thin structure, except for AFNet (col (j)). The 5th row shows a tree with a relatively clean background of blue sky. It seems easy, but it is actually challenging for most of the models because of the complicated shape of the target. As we can see, our models segment both the trunk and branches well, while others fail in segmenting the complicated tree branch region. Compared with the 5th row, the bench shown in the 6th row is more complex because of its hollow structure. Our U$^2$-Net produces a near perfect result. Although the bottom right of the prediction map of U$^2$-Net$^\dagger$ is imperfect, its overall performance on this target is much better than other models. Besides, the results of our models are more homogeneous with fewer gray areas than models like PoolNet (col (f)), CPD (col (g)), PiCANetR (col (h)) and AFNet (col (j)). The 7th row shows that our models can produce results even finer than the ground truth. Labeling these small holes in the 7th image is burdensome and time-consuming. Hence, these repeated fine structures are usually ignored in the annotation process. Inferring the correct results from such imperfect labeling is challenging. But our models show promising capability in segmenting these fine structures thanks to the well designed architectures for extracting and integrating high resolution local and low resolution global information. The 8th and 9th rows are illustrated to show the strong ability of our models in detecting targets with cluttered backgrounds and complicated foreground appearance. The 10th row shows that our models are able to segment multiple targets while capturing the details of the detected targets (see the gap region of the two pieces of sail of each sailboat). In summary, both our full size and small size models are able to handle various scenarios and produce high accuracy salient object detection results.

Table 3. Comparison of our method and 20 SOTA methods on DUT-OMRON, DUTS-TE, HKU-IS in terms of model size, $maxF_\beta$ (↑), MAE (↓), weighted $F^w_\beta$ (↑), structure measure $S_m$ (↑) and relax boundary F-measure $relaxF^b_\beta$ (↑). Red, Green, and Blue indicate the best, second best and third best performance. Column groups: DUT-OMRON (5168), DUTS-TE (5019), HKU-IS (4447).

| Method | Backbone | Size (MB) | DUT-OMRON $maxF_\beta$ | DUT-OMRON MAE | DUT-OMRON $F^w_\beta$ | DUT-OMRON $S_m$ | DUT-OMRON $relaxF^b_\beta$ | DUTS-TE $maxF_\beta$ | DUTS-TE MAE | DUTS-TE $F^w_\beta$ | DUTS-TE $S_m$ | DUTS-TE $relaxF^b_\beta$ | HKU-IS $maxF_\beta$ | HKU-IS MAE | HKU-IS $F^w_\beta$ | HKU-IS $S_m$ | HKU-IS $relaxF^b_\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MDF (TIP16) | AlexNet | 112.1 | 0.694 | 0.142 | 0.565 | 0.721 | 0.406 | 0.729 | 0.099 | 0.543 | 0.723 | 0.447 | 0.860 | 0.129 | 0.564 | 0.810 | 0.594 |
| UCF (ICCV17) | VGG-16 | 117.9 | 0.730 | 0.120 | 0.573 | 0.760 | 0.480 | 0.773 | 0.112 | 0.596 | 0.777 | 0.518 | 0.888 | 0.062 | 0.779 | 0.875 | 0.679 |
| Amulet (ICCV17) | VGG-16 | 132.6 | 0.743 | 0.098 | 0.626 | 0.781 | 0.528 | 0.778 | 0.084 | 0.658 | 0.796 | 0.568 | 0.897 | 0.051 | 0.817 | 0.886 | 0.716 |
| NLDF+ (CVPR17) | VGG-16 | 428.0 | 0.753 | 0.080 | 0.634 | 0.770 | 0.514 | 0.813 | 0.065 | 0.710 | 0.805 | 0.591 | 0.902 | 0.048 | 0.838 | 0.879 | 0.694 |
| DSS+ (CVPR17) | VGG-16 | 237.0 | 0.781 | 0.063 | 0.697 | 0.790 | 0.559 | 0.825 | 0.056 | 0.755 | 0.812 | 0.606 | 0.916 | 0.040 | 0.867 | 0.878 | 0.706 |
| RAS (ECCV18) | VGG-16 | 81.0 | 0.786 | 0.062 | 0.695 | 0.814 | 0.615 | 0.831 | 0.059 | 0.740 | 0.828 | 0.656 | 0.913 | 0.045 | 0.843 | 0.887 | 0.748 |
| PAGRN (CVPR18) | VGG-19 | - | 0.771 | 0.071 | 0.622 | 0.775 | 0.582 | 0.854 | 0.055 | 0.724 | 0.825 | 0.692 | 0.918 | 0.048 | 0.820 | 0.887 | 0.762 |
| BMPM (CVPR18) | VGG-16 | - | 0.774 | 0.064 | 0.681 | 0.809 | 0.612 | 0.852 | 0.048 | 0.761 | 0.851 | 0.699 | 0.921 | 0.039 | 0.859 | 0.907 | 0.773 |
| PiCANet (CVPR18) | VGG-16 | 153.3 | 0.794 | 0.068 | 0.691 | 0.826 | 0.643 | 0.851 | 0.054 | 0.747 | 0.851 | 0.704 | 0.921 | 0.042 | 0.847 | 0.906 | 0.784 |
| MLMS (CVPR19) | VGG-16 | 263.0 | 0.774 | 0.064 | 0.681 | 0.809 | 0.612 | 0.852 | 0.048 | 0.761 | 0.851 | 0.699 | 0.921 | 0.039 | 0.859 | 0.907 | 0.773 |
| AFNet (CVPR19) | VGG-16 | 143.0 | 0.797 | 0.057 | 0.717 | 0.826 | 0.635 | 0.862 | 0.046 | 0.785 | 0.855 | 0.714 | 0.923 | 0.036 | 0.869 | 0.905 | 0.772 |
| MSWS (CVPR19) | Dense-169 | 48.6 | 0.718 | 0.109 | 0.527 | 0.756 | 0.362 | 0.767 | 0.908 | 0.586 | 0.749 | 0.376 | 0.856 | 0.084 | 0.685 | 0.818 | 0.438 |
| R$^3$Net+ (IJCAI18) | ResNeXt | 215.0 | 0.795 | 0.063 | 0.728 | 0.817 | 0.599 | 0.828 | 0.058 | 0.763 | 0.817 | 0.601 | 0.915 | 0.036 | 0.877 | 0.895 | 0.740 |
| CapSal (CVPR19) | ResNet-101 | - | 0.699 | 0.101 | 0.482 | 0.674 | 0.396 | 0.823 | 0.072 | 0.691 | 0.808 | 0.605 | 0.882 | 0.062 | 0.782 | 0.850 | 0.654 |
| SRM (ICCV17) | ResNet-50 | 189.0 | 0.769 | 0.069 | 0.658 | 0.798 | 0.523 | 0.826 | 0.058 | 0.722 | 0.824 | 0.592 | 0.906 | 0.046 | 0.835 | 0.887 | 0.680 |
| DGRL (CVPR18) | ResNet-50 | 646.1 | 0.779 | 0.063 | 0.697 | 0.810 | 0.584 | 0.834 | 0.051 | 0.760 | 0.836 | 0.656 | 0.913 | 0.037 | 0.865 | 0.897 | 0.744 |
| PiCANetR (CVPR18) | ResNet-50 | 197.2 | 0.803 | 0.065 | 0.695 | 0.832 | 0.632 | 0.860 | 0.050 | 0.755 | 0.859 | 0.696 | 0.918 | 0.043 | 0.840 | 0.904 | 0.765 |
| CPD (CVPR19) | ResNet-50 | 183.0 | 0.797 | 0.056 | 0.719 | 0.825 | 0.655 | 0.865 | 0.043 | 0.795 | 0.858 | 0.741 | 0.925 | 0.034 | 0.875 | 0.905 | 0.795 |
| PoolNet (CVPR19) | ResNet-50 | 273.3 | 0.808 | 0.056 | 0.729 | 0.836 | 0.675 | 0.880 | 0.040 | 0.807 | 0.871 | 0.765 | 0.932 | 0.033 | 0.881 | 0.917 | 0.811 |
| BASNet (CVPR19) | ResNet-34 | 348.5 | 0.805 | 0.056 | 0.751 | 0.836 | 0.694 | 0.860 | 0.047 | 0.803 | 0.853 | 0.758 | 0.928 | 0.032 | 0.889 | 0.909 | 0.807 |
| U$^2$-Net (Ours) | RSU | 176.3 | 0.823 | 0.054 | 0.757 | 0.847 | 0.702 | 0.873 | 0.044 | 0.804 | 0.861 | 0.765 | 0.935 | 0.031 | 0.890 | 0.916 | 0.812 |
| U$^2$-Net$^\dagger$ (Ours) | RSU | 4.7 | 0.813 | 0.060 | 0.731 | 0.837 | 0.676 | 0.852 | 0.054 | 0.763 | 0.847 | 0.723 | 0.928 | 0.037 | 0.867 | 0.908 | 0.794 |

Table 4. Comparison of our method and 20 SOTA methods on ECSSD, PASCAL-S, SOD in terms of model size, $maxF_\beta$ (↑), MAE (↓), weighted $F^w_\beta$ (↑), structure measure $S_m$ (↑) and relax boundary F-measure $relaxF^b_\beta$ (↑). Red, Green, and Blue indicate the best, second best and third best performance. Column groups: ECSSD (1000), PASCAL-S (850), SOD (300).

| Method | Backbone | Size (MB) | ECSSD $maxF_\beta$ | ECSSD MAE | ECSSD $F^w_\beta$ | ECSSD $S_m$ | ECSSD $relaxF^b_\beta$ | PASCAL-S $maxF_\beta$ | PASCAL-S MAE | PASCAL-S $F^w_\beta$ | PASCAL-S $S_m$ | PASCAL-S $relaxF^b_\beta$ | SOD $maxF_\beta$ | SOD MAE | SOD $F^w_\beta$ | SOD $S_m$ | SOD $relaxF^b_\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MDF (TIP16) | AlexNet | 112.1 | 0.832 | 0.105 | 0.705 | 0.776 | 0.472 | 0.759 | 0.142 | 0.589 | 0.696 | 0.343 | 0.746 | 0.192 | 0.508 | 0.643 | 0.311 |
| UCF (ICCV17) | VGG-16 | 117.9 | 0.903 | 0.069 | 0.806 | 0.884 | 0.669 | 0.814 | 0.115 | 0.694 | 0.805 | 0.493 | 0.808 | 0.148 | 0.675 | 0.762 | 0.471 |
| Amulet (ICCV17) | VGG-16 | 132.6 | 0.915 | 0.059 | 0.840 | 0.894 | 0.711 | 0.828 | 0.100 | 0.734 | 0.818 | 0.541 | 0.798 | 0.144 | 0.677 | 0.753 | 0.454 |
| NLDF+ (CVPR17) | VGG-16 | 428.0 | 0.905 | 0.063 | 0.839 | 0.897 | 0.666 | 0.822 | 0.098 | 0.737 | 0.798 | 0.495 | 0.841 | 0.125 | 0.709 | 0.755 | 0.475 |
| DSS+ (CVPR17) | VGG-16 | 237.0 | 0.921 | 0.052 | 0.872 | 0.882 | 0.696 | 0.831 | 0.093 | 0.759 | 0.798 | 0.499 | 0.846 | 0.124 | 0.710 | 0.743 | 0.444 |
| RAS (ECCV18) | VGG-16 | 81.0 | 0.921 | 0.056 | 0.857 | 0.893 | 0.741 | 0.829 | 0.101 | 0.736 | 0.799 | 0.560 | 0.851 | 0.124 | 0.720 | 0.764 | 0.544 |
| PAGRN (CVPR18) | VGG-19 | - | 0.927 | 0.061 | 0.834 | 0.889 | 0.747 | 0.847 | 0.090 | 0.738 | 0.822 | 0.594 | - | - | - | - | - |
| BMPM (CVPR18) | VGG-16 | - | 0.928 | 0.045 | 0.871 | 0.911 | 0.770 | 0.850 | 0.074 | 0.779 | 0.845 | 0.617 | 0.856 | 0.108 | 0.726 | 0.786 | 0.562 |
| PiCANet (CVPR18) | VGG-16 | 153.3 | 0.931 | 0.046 | 0.865 | 0.914 | 0.784 | 0.856 | 0.078 | 0.772 | 0.848 | 0.612 | 0.854 | 0.103 | 0.722 | 0.789 | 0.572 |
| MLMS (CVPR19) | VGG-16 | 263.0 | 0.928 | 0.045 | 0.871 | 0.911 | 0.770 | 0.855 | 0.074 | 0.779 | 0.844 | 0.620 | 0.856 | 0.108 | 0.726 | 0.786 | 0.562 |
| AFNet (CVPR19) | VGG-16 | 143.0 | 0.935 | 0.042 | 0.887 | 0.914 | 0.776 | 0.863 | 0.070 | 0.798 | 0.849 | 0.626 | 0.856 | 0.111 | 0.723 | 0.774 | - |
| MSWS (CVPR19) | Dense-169 | 48.6 | 0.878 | 0.096 | 0.716 | 0.828 | 0.411 | 0.786 | 0.133 | 0.614 | 0.768 | 0.289 | 0.800 | 0.167 | 0.573 | 0.700 | 0.231 |
| R$^3$Net+ (IJCAI18) | ResNeXt | 215.0 | 0.934 | 0.040 | 0.902 | 0.910 | 0.759 | 0.834 | 0.092 | 0.761 | 0.807 | 0.538 | 0.850 | 0.125 | 0.735 | 0.759 | 0.431 |
| CapSal (CVPR19) | ResNet-101 | - | 0.874 | 0.077 | 0.771 | 0.826 | 0.574 | 0.861 | 0.073 | 0.786 | 0.837 | 0.527 | 0.773 | 0.148 | 0.597 | 0.695 | 0.404 |
| SRM (ICCV17) | ResNet-50 | 189.0 | 0.917 | 0.054 | 0.853 | 0.895 | 0.672 | 0.838 | 0.084 | 0.758 | 0.834 | 0.509 | 0.843 | 0.128 | 0.670 | 0.741 | 0.392 |
| DGRL (CVPR18) | ResNet-50 | 646.1 | 0.925 | 0.042 | 0.883 | 0.906 | 0.753 | 0.848 | 0.074 | 0.787 | 0.839 | 0.569 | 0.848 | 0.106 | 0.731 | 0.773 | 0.502 |
| PiCANetR (CVPR18) | ResNet-50 | 197.2 | 0.935 | 0.046 | 0.867 | 0.917 | 0.775 | 0.857 | 0.076 | 0.777 | 0.854 | 0.598 | 0.856 | 0.104 | 0.724 | 0.790 | 0.528 |
| CPD (CVPR19) | ResNet-50 | 183.0 | 0.939 | 0.037 | 0.898 | 0.918 | 0.811 | 0.861 | 0.071 | 0.800 | 0.848 | 0.639 | 0.860 | 0.112 | 0.714 | 0.767 | 0.556 |
| PoolNet (CVPR19) | ResNet-50 | 273.3 | 0.944 | 0.039 | 0.896 | 0.921 | 0.813 | 0.865 | 0.075 | 0.798 | 0.832 | 0.644 | 0.871 | 0.102 | 0.759 | 0.797 | 0.606 |
| BASNet (CVPR19) | ResNet-34 | 348.5 | 0.942 | 0.037 | 0.904 | 0.916 | 0.826 | 0.856 | 0.076 | 0.798 | 0.838 | 0.660 | 0.851 | 0.113 | 0.730 | 0.769 | 0.603 |
| U$^2$-Net (Ours) | RSU | 176.3 | 0.951 | 0.033 | 0.910 | 0.928 | 0.836 | 0.859 | 0.074 | 0.797 | 0.844 | 0.657 | 0.861 | 0.108 | 0.748 | 0.786 | 0.613 |
| U$^2$-Net$^\dagger$ (Ours) | RSU | 4.7 | 0.943 | 0.041 | 0.885 | 0.918 | 0.808 | 0.849 | 0.086 | 0.768 | 0.831 | 0.627 | 0.841 | 0.124 | 0.697 | 0.759 | 0.559 |

Figure 7. Qualitative comparison of the proposed method with seven other SOTA methods: (a) image, (b) GT, (c) Ours, (d) Ours$^\dagger$, (e) BASNet, (f) PoolNet, (g) CPD, (h) PiCANetR, (i) R$^3$Net+, (j) AFNet, (k) DSS+, where '+' indicates the CRF post-processing.

5. Conclusions

In this paper, we proposed a novel deep network, U$^2$-Net, for salient object detection. The main architecture of our U$^2$-Net is a two-level nested U-structure. The nested U-structure with our newly designed RSU blocks enables the network to capture richer local and global information from both shallow and deep layers regardless of the resolutions. Compared with those SOD models built upon existing backbones, our U$^2$-Net is purely built on the proposed RSU blocks, which makes it possible to train it from scratch and to configure it to have different model sizes according to the target environment constraints. We provide a full size U$^2$-Net (176.3 MB, 30 FPS) and a smaller size version U$^2$-Net$^\dagger$ (4.7 MB, 40 FPS) in this paper. Experimental results on six public salient object detection datasets demonstrate that both models achieve very competitive performance against 20 other state-of-the-art methods in terms of both qualitative and quantitative measures.

Although our models achieve competitive results against other state-of-the-art methods, faster and smaller models are needed for computation and memory limited devices, such as mobile phones, robots, etc. In the near future, we will explore different techniques and architectures to further improve the speed and decrease the model size. In addition, larger diversified salient object datasets are needed to train more accurate and robust models.

Acknowledgments

This work is supported by the Alberta Innovates Graduate Student Scholarship and Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants Program, NO.: 2016-06365.

References

[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1597–1604, 2009.
[2] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Trans. Image Processing, 24(12):5706–5722, 2015.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[4] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[6] Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. R3net: Recurrent residual refinement network for saliency detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 684–690. AAAI Press, 2018.
[7] Marc Ehrig and Jérôme Euzenat. Relaxed precision and recall for ontology matching. In Proc. K-Cap 2005 workshop on Integrating ontology, pages 25–32. No commercial editor., 2005.
[8] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4548–4557, 2017.
[9] Mengyang Feng, Huchuan Lu, and Errui Ding. Attentive feedback network for boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1623–1632, 2019.
[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pages 249–256.
[11] Robert M Haralick, Stanley R Sternberg, and Xinhua Zhuang. Image analysis using mathematical morphology. IEEE transactions on pattern analysis and machine intelligence, (4):532–550, 1987.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5300–5309, 2017.
[14] Xiaowei Hu, Lei Zhu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. Recurrently aggregating deep features for salient object detection. In AAAI, pages 6943–6950, 2018.
[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[18] Guanbin Li and Yizhou Yu. Visual saliency detection based on multiscale deep cnn features. IEEE Transactions on Image Processing, 25(11):5012–5024, 2016.
[19] Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–287, 2014.
[20] Jie Liang, Jun Zhou, Lei Tong, Xiao Bai, and Bin Wang. Material based salient object detection from hyperspectral images. Pattern Recognition, 76:476–490, 2018.
[21] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
[22] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3917–3926, 2019.
[23] Nian Liu, Junwei Han, and Ming-Hsuan Yang. Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3089–3098, 2018.
[24] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[25] Shijian Lu and Joo-Hwee Lim. Saliency modeling from image histograms. In European Conference on Computer Vision, pages 321–332. Springer, 2012.
[26] Shijian Lu, Cheston Tan, and Joo-Hwee Lim. Robust and efficient saliency modeling from image co-occurrence histograms. IEEE transactions on pattern analysis and machine intelligence, 36(1):195–201, 2013.
[27] Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6593–6601, 2017.
[28] Ke Ma, Zhixin Shu, Xue Bai, Jue Wang, and Dimitris Samaras. Docunet: Document image unwarping via a stacked u-net. In CVPR, pages 4700–4709, 2018.
[29] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. How to evaluate foreground maps. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2014.
[30] Vida Movahedi and James H Elder. Design and perceptual validation of performance measures for salient object segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 49–56. IEEE, 2010.
[31] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Autodiff workshop on NIPS, 2017.
[33] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7479–7489, 2019.
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[37] Zhiqiang Tang, Xi Peng, Shijie Geng, Lingfei Wu, Shaoting Zhang, and Dimitris Metaxas. Quantized densely connected u-nets for efficient landmark localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 339–354, 2018.
[38] Zhiqiang Tang, Xi Peng, Shijie Geng, Yizhe Zhu, and Dimitris N Metaxas. Cu-net: coupled u-nets. arXiv preprint arXiv:1808.06521, 2018.
[39] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 136–145, 2017.
[40] Tiantian Wang, Ali Borji, Lihe Zhang, Pingping Zhang, and Huchuan Lu. A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4039–4048.
[41] Tiantian Wang, Lihe Zhang, Shuo Wang, Huchuan Lu, Gang Yang, Xiang Ruan, and Ali Borji. Detect globally, refine locally: A novel approach to saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3135, 2018.
[42] Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, and Errui Ding. A mutual learning method for salient object detection with intertwined multi-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8150–8159, 2019.
[43] Zhe Wu, Li Su, and Qingming Huang. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2019.
[44] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
[45] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
[46] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1162, 2013.
[47] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3166–3173, 2013.
[48] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, and Yizhou Yu. Multi-source weak supervision for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6074–6083, 2019.
[49] Jinxia Zhang, Krista A. Ehinger, Haikun Wei, Kanjian Zhang, and Jingyu Yang. A novel graph-based optimization framework for salient object detection. Pattern Recognition, 64:39–50, 2017.
[50] Lu Zhang, Ju Dai, Huchuan Lu, You He, and Gang Wang. A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1741–1750, 2018.
[51] Lu Zhang, Jianming Zhang, Zhe Lin, Huchuan Lu, and You He. Capsal: Leveraging captioning to boost semantics for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6024–6033, 2019.
[52] Pingping Zhang, Wei Liu, Huchuan Lu, and Chunhua Shen. Salient object detection by lossless feature reflection. In IJCAI, pages 1149–1155, 2018.
[53] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 202–211, 2017.
[54] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Baocai Yin. Learning uncertain convolutional features for accurate saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 212–221, 2017.
[55] Qiang Zhang, Zhen Huo, Yi Liu, Yunhui Pan, Caifeng Shan, and Jungong Han. Salient object detection employing a local tree-structured low-rank representation and foreground consistency. Pattern Recognition, 92:119–134, 2019.
[56] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 714–722, 2018.
[57] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[58] Yunzhi Zhuge, Yu Zeng, and Huchuan Lu. Deep embedding features for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9340–9347, 2019.
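As a supplementary note to the training setup in Section 4.3, the sketch below spells out one Adam update with the stated default hyperparameters (lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight decay=0). It is a plain-Python illustration of the standard Adam rule on a list of scalars, not the PyTorch optimizer the authors used; the names `AdamState` and `adam_step` are hypothetical, and treating weight decay as an L2 term added to the gradient is an assumption.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdamState:
    """Hyperparameters from Section 4.3 plus the running moment estimates."""
    lr: float = 1e-3
    beta1: float = 0.9
    beta2: float = 0.999
    eps: float = 1e-8
    weight_decay: float = 0.0
    step: int = 0
    m: List[float] = field(default_factory=list)  # first moment estimates
    v: List[float] = field(default_factory=list)  # second moment estimates


def adam_step(params: List[float], grads: List[float],
              state: AdamState) -> List[float]:
    """Apply one Adam update and return the new parameter values."""
    if not state.m:
        state.m = [0.0] * len(params)
        state.v = [0.0] * len(params)
    state.step += 1
    bias1 = 1.0 - state.beta1 ** state.step
    bias2 = 1.0 - state.beta2 ** state.step
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + state.weight_decay * p  # assumed L2-style decay; no-op at 0
        state.m[i] = state.beta1 * state.m[i] + (1.0 - state.beta1) * g
        state.v[i] = state.beta2 * state.v[i] + (1.0 - state.beta2) * g * g
        m_hat = state.m[i] / bias1  # bias-corrected first moment
        v_hat = state.v[i] / bias2  # bias-corrected second moment
        updated.append(p - state.lr * m_hat / (math.sqrt(v_hat) + state.eps))
    return updated


if __name__ == "__main__":
    print(adam_step([1.0], [0.5], AdamState()))
```

On the very first step the bias correction makes the update magnitude close to the learning rate regardless of the gradient scale, which is one reason the default lr=1e-3 is a common starting point when training from scratch.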

Journal

Computing Research Repository, arXiv (Cornell University)

Published: May 18, 2020
